sondehub

Database outage

Added 2022-04-09 09:10:51 +0000 UTC

We recently had a DB outage and I wanted to explain what went wrong.

First some background. On Friday I upgraded our Elasticsearch 7.9 cluster to OpenSearch 1.2. This went well, although we are still monitoring the performance of this change. I rolled the upgrade out manually to ensure a smooth deployment of the DB.

Today, during a deployment of some new amateur features to SondeHub (there will be some future posts about these), a terraform apply was executed. Terraform is the tool we use to manage our infrastructure configuration as code. This change included updates to many of our resources. Unfortunately, during this apply I didn't notice an update to the OpenSearch cluster. As the cluster had been manually upgraded on Friday and the version in the Terraform configuration hadn't been updated to match, Terraform considered the desired outcome to be a downgrade of the cluster back to Elasticsearch. As there is no way to perform a downgrade in place, Terraform replaces the resource (deleting the cluster and creating a new one).

This was noticed very quickly during the apply - unfortunately there is no way to cancel a deletion. Snapshots are taken every hour and after contacting AWS they were able to restore the DB. This entire process took just under 4 hours.

During this time SondeHub continued to operate, although in a very degraded state. The tracker showed cached locations of all sondes, and live updates were still processing. Behind the scenes, new data was being queued in SQS ready for the DB to return.

In the end no data was lost and the queues were processed once the database was restored. I'm terribly sorry for any issues this caused. I've already implemented a Terraform lifecycle policy on the OpenSearch cluster to prevent this specific issue from recurring.

So a quick recap:

- API / DB was down for ~4 hours
- No data lost
- Site still worked in a degraded form

Things we've already improved:

- Added a lifecycle policy to the DB to prevent destroys

Things we can do better:

- We relied on AWS for the snapshot restore - we could take our own snapshots, however this may be too expensive / complex
- I can check Terraform plans in more detail
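
As a rough illustration of what checking plans in more detail could look like, here is a minimal Python sketch that scans the JSON form of a Terraform plan and flags any resource that would be destroyed or replaced. It is not something we run today. The function name and the sample resource addresses are made up for the example; the only thing it relies on is the `resource_changes` / `change.actions` layout of Terraform's JSON plan output.

```python
def destructive_changes(plan):
    """Return (address, outcome) for each resource the plan would delete or replace."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions:
            # A replacement appears as a delete paired with a create.
            outcome = "replaced" if "create" in actions else "destroyed"
            flagged.append((rc["address"], outcome))
    return flagged


# Hypothetical plan fragment: one in-place update, one replacement.
sample_plan = {
    "resource_changes": [
        {"address": "aws_sqs_queue.example",
         "change": {"actions": ["update"]}},
        {"address": "aws_elasticsearch_domain.example",
         "change": {"actions": ["delete", "create"]}},
    ]
}

for address, outcome in destructive_changes(sample_plan):
    print(f"WARNING: {address} would be {outcome}")
```

In practice the plan dictionary would come from saving a plan with `terraform plan -out=plan.out`, converting it with `terraform show -json plan.out`, and loading the result with `json.load`. A check like this could then block an apply until a person has confirmed each flagged resource.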

Unfortunately, in the end we have to accept a certain amount of risk to keep operating costs low, but we try our hardest to reduce it.

Feel free to ask any questions in the comments.

~ Michaela