Sinclair Kosh


Week 10: 1074/13016

Yeah, we're about 1000 posts on from where we were when I posted in r/skarchive (where I "live" on reddit), and that was 6 hours ago, so doing the math, we get... fuck it, I don't know... lots. I'd say this script is looking at a runtime measured in days rather than hours.
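For anyone who does want the math, here's a rough back-of-envelope sketch. It assumes the "1074/13016" in the title means 1074 posts done out of 13016, and that the "about 1000 posts in 6 hours" rate holds steady, which it almost certainly won't.

```python
# Rough runtime estimate from the figures quoted in this post.
# Assumptions: 1074 of 13016 posts are done, and the rate stays constant.
total_posts = 13016
done_posts = 1074
posts_per_hour = 1000 / 6  # "about 1000 posts" in 6 hours (approximate)

remaining = total_posts - done_posts
hours_left = remaining / posts_per_hour
days_left = hours_left / 24

print(f"{remaining} posts left, ~{hours_left:.0f} hours (~{days_left:.1f} days)")
```

On those assumptions it lands at roughly three days, so "days rather than hours" checks out.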

In part because of that, and the fact that the children will be home from school courtesy of Alfred, stuff has been pushed back a day or so. I'm looking at Tuesday morning Sydney time to kick things off.

As it's running, this script spits out line after line of filepaths for the files that are being downloaded: /app/downloads/blah/blah.png etc. When it's skipping something it's "seen" before, or something that already exists on the hard drive where it was going to download the file, the line is greyish; if it's a file it hasn't seen before and is newly downloading, it's green. It's a vestige of some of the scripts I originally configured/modified to first start doing the archiving way back in, I'm pretty sure it was 2019 or so. Might have been earlier, but 2019 was when it started getting a bit more serious. As things have been updated/re-written etc. I've kept it like that because it works; it's easy to see at a glance what's happening and how things are progressing.
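A minimal sketch of that grey/green idea, using standard ANSI colour codes. This is not the actual script: the function name and the `seen` set are made up for illustration, and a real run would load the tracking data from wherever it is stored.

```python
import os

# Standard ANSI escape codes for terminal colours.
GREEN = "\033[32m"  # newly downloaded file
GREY = "\033[90m"   # skipped: seen before or already on disk
RESET = "\033[0m"

def format_line(filepath, seen):
    """Return the filepath wrapped in grey (skipped) or green (new)."""
    skipped = filepath in seen or os.path.exists(filepath)
    colour = GREY if skipped else GREEN
    return f"{colour}{filepath}{RESET}"

# Hypothetical tracking set; purely illustrative.
seen = {"/app/downloads/blah/blah.png"}
print(format_line("/app/downloads/blah/blah.png", seen))
print(format_line("/app/downloads/blah/definitely_new_file.png", seen))
```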

There's been LOTS of green the last 6 hours, which kind of surprised me. To the point I was a little concerned. I thought maybe I hadn't actually collected as much as I'd thought, or I'd lost it or deleted it. (Let's not talk about the various files I may have lost over the years.) So while I was waiting for things to finish I went in and opened up the drive on the NAS that houses the archive and related stuff.

I started working on one of the folders; this particular one was "authors". Back when I first started this, if people were asking for a story they often knew the name of the story or at the very least the author. So downloading the files by author made sense. It made it easier to find what I was searching for. Also, I hated the {deleted} author on reddit and wanted to ensure we had an author name. I've also talked previously about my almost obsession, at the time, with collecting as much metadata as I could about the stories/files etc.

Back then I wasn't dealing with the volume I am now, so scripts ran a lot quicker. As I began to help people and learned what was useful and what wasn't, I refined my process. So I'd make a change, and often the easiest way to update everything was to reset the tracking and run the script from scratch again and grab everything with the new info I needed/wanted.

I would much rather have multiple copies of things than actually miss stuff, and figured that was a problem to work out later... Sadly, later is NOW.

Sometimes the changes were significant. I mentioned above that I had an authors folder; well, I also had a daily folder. That was a modification to the script that would collect things and store them based on year, month and day, e.g. 2023/12/25/blah/blah.png, because more and more people were asking for stuff that they remembered was posted "around Christmas". When the bullshit with Imgur came along I ran every script I had from scratch and even developed a few new ones, including some that tracked things by reddit's post ids.
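As a rough illustration of the daily layout, here's a sketch that turns a post timestamp into a year/month/day path. The function name and the "downloads" root are made up, and it assumes the date comes from a Unix timestamp read in UTC, so the folder a post lands in could differ by a day from local time.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

def daily_path(root, created_utc, *parts):
    """Build root/YYYY/MM/DD/<parts> from a Unix timestamp (UTC assumed)."""
    d = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    return PurePosixPath(root, f"{d:%Y}", f"{d:%m}", f"{d:%d}", *parts)

# Example timestamp for Christmas Day 2023, built here rather than hard-coded.
christmas = datetime(2023, 12, 25, 9, 30, tzinfo=timezone.utc).timestamp()
print(daily_path("downloads", christmas, "blah", "blah.png"))
```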

I've mentioned previously the duplication in the archive, which currently sits at about 1.5TB in files on my NAS.

Here's a real-life example. (Granted, I did pick somewhere that this issue was BAD :) )

So we have a file.

N:\SinclairKosh\downloads\reddit\authors\AveryAces0828\The Only One For Me, Part 3\SextStories_AveryAces0828_vgu2f1_imgur_2022-06-19_kWHp7Kx_The_Only_One_For_Me_Part_3_004_6H2qGMh.jpg

It might look like a really poor attempt at a secure P4ssw07d! but it actually tells me everything I need to know.

The problem is when you get the following

N:\SinclairKosh\downloads\reddit\authors\AveryAces0828\The Only One For Me, Part 3\SextStories_AveryAces0828_vgu2f1_imgur_2022-06-20_kWHp7Kx_The_Only_One_For_Me_Part_3_004_6H2qGMh.jpg

That's the same image. The easiest way to tell is the last bit of the filename, just before the .jpg: that's the actual filename that the file was stored on imgur with. So what's the difference... and why?

We'll come back to that... because we have to move on to...

N:\SinclairKosh\downloads\reddit\authors\AveryAces0828\The Only One For Me, Part 3\SextStories_vgu2f1_imgur_2022-06-19_kWHp7Kx_The_Only_One_For_Me_Part_3_004_6H2qGMh.jpg

Same image. Let's not forget...

N:\SinclairKosh\downloads\reddit\authors\AveryAces0828\The Only One For Me, Part 3\AveryAcesStorytime_AveryAces0828_vgu21d_imgur_2022-06-20_kWHp7Kx_The_Only_One_For_Me_Part_3_004_6H2qGMh.jpg

and finally

N:\SinclairKosh\downloads\reddit\authors\AveryAces0828\The Only One For Me, Part 3\AveryAcesStorytime_AveryAces0828_vgu21d_imgur_2022-06-19_kWHp7Kx_The_Only_One_For_Me_Part_3_004_6H2qGMh.jpg

They all live in the same folder, the one for Part 3 of Avery's story The Only One for Me and they are all EXACTLY the same image. That chapter also happens to have 29 images in it.
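To show how those five names collapse into one image, here's a minimal sketch that groups files by the last underscore-separated bit before the extension. It assumes that bit is always the imgur filename and never contains an underscore itself; comparing a hash of the file contents would be a safer check than trusting names alone.

```python
from collections import defaultdict
from pathlib import PureWindowsPath

def imgur_id(filename):
    """Return the last underscore-separated token before the extension."""
    return PureWindowsPath(filename).stem.rsplit("_", 1)[-1]

# The five filenames from this post (folder part left off).
filenames = [
    "SextStories_AveryAces0828_vgu2f1_imgur_2022-06-19_kWHp7Kx_The_Only_One_For_Me_Part_3_004_6H2qGMh.jpg",
    "SextStories_AveryAces0828_vgu2f1_imgur_2022-06-20_kWHp7Kx_The_Only_One_For_Me_Part_3_004_6H2qGMh.jpg",
    "SextStories_vgu2f1_imgur_2022-06-19_kWHp7Kx_The_Only_One_For_Me_Part_3_004_6H2qGMh.jpg",
    "AveryAcesStorytime_AveryAces0828_vgu21d_imgur_2022-06-20_kWHp7Kx_The_Only_One_For_Me_Part_3_004_6H2qGMh.jpg",
    "AveryAcesStorytime_AveryAces0828_vgu21d_imgur_2022-06-19_kWHp7Kx_The_Only_One_For_Me_Part_3_004_6H2qGMh.jpg",
]

groups = defaultdict(list)
for name in filenames:
    groups[imgur_id(name)].append(name)

for key, files in groups.items():
    print(key, len(files))
```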

So how did I end up with 5 copies of the same image (x 29) in this folder? Not to mention I can pretty much guarantee there are more copies elsewhere, though I'd guess most of them will be identical (filename included).

In this instance 4 of them are easy to explain. Those who know of Avery will recognise her personal subreddit, where she posts her stuff along with the other main subs. There are two whose filename (the bit after The Only One For Me, Part 3) begins with SextStories and two whose filename begins with AveryAcesStorytime. If you look closely at either pair (the same thing applies to both) you'll notice the only difference is the date: 2022-06-19 and the day after, 2022-06-20. This is how annoying this can get. Originally that date came from imgur and was the date on the image that was downloaded from there.

That was great, no problems, and I'd downloaded huge amounts of data like that. Until one day someone reported to me that a post I'd uploaded on narratophile had some images out of order. I couldn't understand why, as it was something I'd worked on extensively because I knew it was such a problem for authors on imgur.

It took me an embarrassingly long time to work out what was going on, because I was so focussed on there being a problem with the upload process and how it was ordering images. Then I was looking at the ordering of the images in the database, and then the display code and everything else, until I finally realised what it was.

If I showed you the listing of that folder you'd see all the files grouped together, and you'd also notice that the filenames were identical until they reached the 004, the 3-digit zero-padded number near the end of the filename. That was very deliberate: when sorting alphabetically, everything matches until the number, and then the files sort in number order (which is why there are zeros in front of it).
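A quick sketch of why the zeros matter, using made-up short filenames rather than the real ones:

```python
# Hypothetical filenames; only the numbering style differs.
numbers = (1, 2, 10, 29)
unpadded = [f"chapter_{n}.jpg" for n in numbers]
padded = [f"chapter_{n:03d}.jpg" for n in numbers]

# Alphabetical sort puts 10 before 2 without padding.
print(sorted(unpadded))
# → ['chapter_1.jpg', 'chapter_10.jpg', 'chapter_2.jpg', 'chapter_29.jpg']

# With 3-digit zero padding, alphabetical order equals numeric order.
print(sorted(padded))
# → ['chapter_001.jpg', 'chapter_002.jpg', 'chapter_010.jpg', 'chapter_029.jpg']
```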

In the end I think I discovered the error by looking back at the original reddit post. Someone had commented there that either there was an error in one of the images (a name wrong or something) or one of the images was corrupt. The author responded and said he'd fixed it and uploaded a new image to imgur. That was fine when you were viewing stuff on imgur, but it meant that when the images were downloaded, the date for that file was like 3 days later or something. So when the chapter was uploaded to narratophile (I just grabbed entire chapters and dropped them into the uploader) and sorted, that image was at the end, not in the middle.
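Here's a small sketch of that failure. The filename pattern is a shortened, made-up version of the real one, but it keeps the important bit: the date sits before the zero-padded number, so a later date on one image beats the numbering when sorting.

```python
def name(date, index):
    # Hypothetical shortened pattern: the date comes BEFORE the image number.
    return f"Sub_Author_postid_imgur_{date}_Title_{index:03d}.jpg"

# Image 004 was re-uploaded three days later, so it carries a later date.
per_image_dates = [
    name("2022-06-19", 3),
    name("2022-06-22", 4),
    name("2022-06-19", 5),
]
print(sorted(per_image_dates))  # 003, 005, then 004 at the end

# With one shared date for the whole post, the numbering decides the order.
shared_date = [name("2022-06-19", i) for i in (3, 4, 5)]
print(sorted(shared_date))      # 003, 004, 005
```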

This meant I had to fix the script to use the reddit post date and had to redownload EVERYTHING. It's a common issue, especially for older posts. So that takes care of two of the copies; the other two should be obvious, they are just from different subreddits. It's not so much of an issue now, but in the early days, when I didn't know as much, tracking what I'd downloaded and what I hadn't was a lot harder. So I decided early on not to rely on the tracking so much and just download more rather than less: you're better off having two copies differently named than missing something entirely.

I said earlier that sometimes the change was large and sometimes it was small, and that's where the third one comes in. If you look closely, the only difference is that it doesn't have the author's name in the filename. That tells me that version was an earlier download.

My process has changed massively now. It is focussed entirely differently and is also much more reliant upon the database I've created and replicated in the website. The filenames you see above were from a time when you'd share a zip file of a story or, if you were organised, a zip file of a bunch of stories, and they were a direct response to my frustration that authors' names and other important data were so often lost. Storing it in the filename helped mitigate that.

Now I just import all the critical data from over 15,000 posts and make that available on the web.

Fuck that 5x filename shit.

Seems like a plan to me.

Until Next Time,

Kosh

Comments

Appreciate the technical breakdown of your process and the struggles you have been facing in this project. Doing gods work :P

matt dez-jeenz

