Fornax

2022-08 Data loss postmortem

In the last week of July pixeldrain lost about 6% of all files it was storing due to a database crash. Here is what happened:

In the second week of July I went on a three-week vacation. In the week before I left I upgraded my ScyllaDB cluster from version 4.6 to 5.0. Unbeknownst to me, this new version had a bug.

The database is divided into shards; each shard runs on a single CPU core in a single server. Each shard is responsible for a table and a commitlog. Tables hold data which has been saved to permanent storage, and commitlogs are where new write operations are stored before they are added to the tables. Due to the bug the commitlogs were not being integrated into the tables, so they kept growing. Eventually the disks of the servers filled up with commitlogs and the database crashed. During the crash the commitlogs would get corrupted and the data in them was lost. This cycle kept repeating, losing data on a regular basis. Deleting a row in Scylla also counts as a write, which explains why a lot of files which had previously been deleted suddenly reappeared on users' accounts.
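To make the two symptoms concrete, here is a minimal toy model in Python. It is not ScyllaDB code and says nothing about how Scylla is implemented internally; the `ToyShard` class and its method names are made up for illustration. It only assumes what is described above: writes and deletes land in a commitlog first, and a crash throws away whatever has not yet been integrated into the table.

```python
class ToyShard:
    """Toy model of one shard: a durable table plus a commitlog of pending writes."""

    def __init__(self):
        self.table = {}      # rows already saved in permanent storage
        self.commitlog = []  # (op, key, value) writes not yet integrated

    def write(self, key, value):
        self.commitlog.append(("put", key, value))

    def delete(self, key):
        # A delete is just another write recorded in the commitlog.
        self.commitlog.append(("del", key, None))

    def flush(self):
        """Integrate pending writes into the table (the step the bug prevented)."""
        for op, key, value in self.commitlog:
            if op == "put":
                self.table[key] = value
            else:
                self.table.pop(key, None)
        self.commitlog.clear()

    def crash(self):
        """Simulate a crash that corrupts the commitlog: pending writes vanish."""
        self.commitlog.clear()

    def read(self, key):
        # Reads see the table with the pending writes applied on top.
        view = dict(self.table)
        for op, k, value in self.commitlog:
            if op == "put":
                view[k] = value
            else:
                view.pop(k, None)
        return view.get(key)


def demo():
    shard = ToyShard()
    shard.write("old_file", "meta")
    shard.flush()                    # old_file is safely in the table
    shard.delete("old_file")         # user deletes it (pending)
    shard.write("new_file", "meta")  # user uploads a new file (pending)
    before = (shard.read("old_file"), shard.read("new_file"))
    shard.crash()                    # commitlog is lost before any flush
    after = (shard.read("old_file"), shard.read("new_file"))
    return before, after
```

In this sketch `demo()` shows both effects at once: before the crash the deleted file is gone and the new file is visible, and after the crash the deleted file is back while the new file's entry has disappeared.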

The storage servers which actually hold your files periodically check each file to see if it needs to be kept or removed. Due to the data loss a lot of file entries had disappeared from the database, so the storage servers concluded that those files no longer existed and removed them.
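The failure mode can be sketched in a few lines of Python. This is not pixeldrain's actual code; the `sweep` function and its arguments are hypothetical, and the only assumption is the rule described above: a file without a database entry is treated as one that should be removed.

```python
def sweep(stored_files, database_entries):
    """Return (kept, removed) for one pass over the files on a storage server.

    Hypothetical rule: a file with no database entry is treated as deleted.
    """
    kept, removed = [], []
    for file_id in stored_files:
        if file_id in database_entries:
            kept.append(file_id)
        else:
            # The sweep cannot tell "the user deleted this file" apart from
            # "the database lost the row", so both end in removal.
            removed.append(file_id)
    return kept, removed


# Three files are on disk, but the row for "b" was lost in a crash.
kept, removed = sweep(["a", "b", "c"], {"a", "c"})
```

Here `removed` ends up containing `"b"` even though nobody asked for it to be deleted, which is how lost database rows turned into lost files.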

While that was happening I was lying on a beach in Spain.

When I got back from Spain I learned that all of this had been happening. Once I had figured out exactly what the problem was, I filed a bug report. Meanwhile I learned that restarting a database server also flushes the commitlogs, which avoids the data loss. From that point on I restarted the database servers every week.

The issue has now been fixed (https://github.com/scylladb/scylladb/issues/11223) and the database is stable again. This should not happen again.

There are a few things I have changed to make sure files are not lost again in case of a database problem:

I hope you enjoy using pixeldrain as much as I like working on it. I'm still learning a lot about running a large-scale website. The best part is that this project keeps challenging my skills as it keeps growing. I am forced to learn new things on a regular basis, which keeps it interesting.

Greetings,

Wim.

Comments

Ha, never change anything before the weekend and definitely not before vacation!

Matt Schulte

