XaiJu
Fairlane Raymundo



AO3’s Data Was Scraped For AI: What To Know

My take: I was never into explicit fanfic, but I always thought many of these writers have real talent. Instead of these people being led towards actual writing, their style has been stolen by people who are taking advantage of technology.

———-

News/Updates

Hi all—as you may be aware, there’s been an incident regarding the Archive’s data being used to potentially train generative AI.


It seems that a user by the name of nyuuzyou conducted an unauthorized scrape of the Archive, both artwork and writing (as well as at least seven other websites) and uploaded the dataset to the machine-learning website Huggingface. This only scraped publicly available works—archive-locked works do not appear to be a part of that dataset. The works in the set are from as recent as March of this year, and comprise all publicly available works before then.


AO3 is aware of this, and they have filed a DMCA takedown with Huggingface, where the data has been made temporarily unavailable (aka nobody is currently able to use it for training). In response, the uploader filed a counterclaim to try to get it reinstated, though as Huggingface’s Terms of Service don’t allow uploads of any content the uploader doesn’t own the rights to, it’s unlikely that their counterclaim will succeed. However, the user also uploaded the dataset to two more websites after the Huggingface takedown: modelscope and datafish. These two sites are based in China and Russia respectively, places that do not always respond to DMCA takedowns. That said, the upload to modelscope does appear to have been taken down/deleted as of writing this. (We also cannot link to these websites, as Reddit has them shadowbanned.)


The website Paperdemon has more information about the timelines, other websites affected, and how to submit a DMCA takedown request to Huggingface (which will hopefully not be necessary, but is a good resource in case the counterclaim succeeds).


As scraping like this is unfortunately hard to control, the best option we can recommend as a subreddit is to lock your works to only be available to registered archive users (as they are less likely to be scraped, though this is not foolproof). For readers, if you do not have an account, you will need to make one to be able to view archive-locked works. You can find a link to our most recent invite request thread here, or add your email to the signup waitlist on AO3 to get an invite directly in a few days.

Comments

Also, it's a big coincidence that some of the contract work I've been considering involves AI training. There's a company looking for three different levels of Biology educators to train AIs. I'm not sure how I feel about it... but I suppose it will happen no matter how I feel. My son was also looking at a company hiring for general AI trainers, and they give a bonus if you have someone to refer because they want pairs of people to interact. It's an interesting thing.

TheBioExplorer aka BTS7Plus1Forever aka Snow 777

Oh wow... I totally missed this. I generally don't have a problem with fanfics either. The writers are just doing what all writers do: drawing on people and experiences that they are familiar with, and then using their imagination to make a story, except that they didn't change the names. For instance, take all the BTS members and change it so they instead followed their alternative career paths but somehow meet in 2025. Then what? Any reader knows it is fiction. If someone tried to say it was a true story, passing it off as celebrity news gossip for instance, that would not be good.

TheBioExplorer aka BTS7Plus1Forever aka Snow 777

