sondehub

An update about prediction failures

Added 2022-03-18 08:26:27 +0000 UTC

We were alerted that predictions were old or delayed. On investigating, we found that the function that processes our predictions had been taking longer and longer to run over time. In some edge cases, certain radiosondes were causing Tawhiri (the software we use for predictions) to take over 30 seconds to process.

When several of these events occurred at once, the latency of that function exceeded its 5 minute timeout and predictions would never finish. The cause of this still isn't fully understood. Once that issue was identified, I looked into running the predictions in parallel. This way, if a handful of sondes took longer than 30 seconds, the others would run in parallel and still finish.
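The parallel approach can be sketched roughly as follows. This is a minimal illustration, not the actual SondeHub code: `run_predictions`, `fake_predict` and the `max_workers` value are hypothetical names and numbers chosen for the example, and `fake_predict` stands in for a real request to Tawhiri.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time


def run_predictions(sondes, predict, max_workers=8):
    """Run predict(sonde) for every sonde in a pool of threads.

    Returns (results, errors), both dicts keyed by sonde serial, so one
    slow or failing sonde does not hold up the results for the others.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(predict, sonde): sonde for sonde in sondes}
        # as_completed yields each future as soon as it finishes,
        # regardless of the order in which the work was submitted.
        for future in as_completed(futures):
            sonde = futures[future]
            try:
                results[sonde] = future.result()
            except Exception as exc:  # record the failure, keep going
                errors[sonde] = exc
    return results, errors


def fake_predict(serial):
    """Hypothetical stand-in for a prediction request."""
    time.sleep(0.2 if serial == "SLOW" else 0.01)
    if serial == "BAD":
        raise RuntimeError("Prediction did not complete")
    return f"prediction for {serial}"


if __name__ == "__main__":
    results, errors = run_predictions(["A", "SLOW", "BAD", "B"], fake_predict)
    print(sorted(results), sorted(errors))
```

Collecting failures per sonde rather than raising on the first one means a single bad prediction cannot take down the whole batch.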

This change was successful, but it caused issues with Tawhiri that we didn't anticipate. For some reason, sending many requests to Tawhiri at once started to cause problems. We've never seen this before - even when benchmarking Tawhiri.

After a few requests Tawhiri would stop providing good results - as if the dataset was corrupted.

A working query against the same dataset would return a "Prediction did not complete" error message a few minutes later.

It seems like some of the queries we were running were causing worker timeouts, and the new workers that replaced them were unable to see the dataset.

I tried:
- Restarting the service to make sure the dataset was correct
- Not running predictions for sondes moving at less than 0.8 m/s vh
- Reducing how many parallel requests are made

In the end, what seems to have resolved the issue is switching the gunicorn worker type from sync to gthread, with 1 worker and 20 threads.
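As a rough illustration, that worker setup could be expressed in a gunicorn configuration file like the one below. The file name and where it lives are assumptions; `workers`, `worker_class` and `threads` are standard gunicorn settings, and any other settings in the real deployment are not shown.

```python
# gunicorn.conf.py (hypothetical file name and location)

# A single worker process, so only one process loads the dataset.
workers = 1

# Threaded worker class instead of the default "sync" worker.
worker_class = "gthread"

# Number of threads that the one worker uses to serve requests.
threads = 20
```

The same settings can also be passed on the command line with `--workers 1 --worker-class gthread --threads 20`.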

I'm not sure exactly why this has helped, but we are no longer seeing worker timeouts (good, since there is only one worker) and performance appears to be excellent. Regardless, the error rate dropped away and we started getting good predictions again.

Future Improvements

One of my frustrations with this incident is that we weren't aware of the issue until people told us that predictions weren't working. There are two reasons for this:
- Poor alerting on the predictions service
- Tawhiri lacking a health check

There's currently no alerting when the prediction service takes too long or the error rate is high. I plan on fixing this.
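One possible shape for such a check is sketched below. It is only an illustration under stated assumptions: `RunStats`, `check_alerts` and both thresholds are hypothetical, with the duration limit simply picked to sit below the 5 minute timeout mentioned earlier.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Summary of one run of the prediction function (hypothetical)."""
    duration_seconds: float
    requests: int
    failures: int


def check_alerts(stats, max_duration=240.0, max_error_rate=0.1):
    """Return a list of alert messages; an empty list means all is well.

    Both thresholds are illustrative defaults, not tuned values.
    """
    alerts = []
    if stats.duration_seconds > max_duration:
        alerts.append(
            f"prediction run took {stats.duration_seconds:.0f}s "
            f"(limit {max_duration:.0f}s)"
        )
    # Guard against division by zero when no requests were made.
    error_rate = stats.failures / stats.requests if stats.requests else 0.0
    if error_rate > max_error_rate:
        alerts.append(
            f"prediction error rate {error_rate:.0%} "
            f"(limit {max_error_rate:.0%})"
        )
    return alerts


if __name__ == "__main__":
    print(check_alerts(RunStats(duration_seconds=290, requests=100, failures=50)))
```

How the messages get delivered (email, chat, paging) is left out, since that depends on the monitoring tooling in use.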

Tawhiri in its current configuration is only tested to see if its web service is running, not whether it's providing a good prediction. I intend to address this by having the health check run an actual prediction. This is non-trivial, as the health check needs to take the current date and time into account, but it shouldn't be too hard to implement.
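A health check along these lines might look like the sketch below. Several things here are assumptions: the base URL, the launch site and the flight parameters are made up for illustration, and the query parameter names and the `prediction` / `trajectory` / `error` response keys are believed to match the Tawhiri v1 API but should be verified against the deployed version.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical address of the Tawhiri instance being checked.
TAWHIRI_URL = "http://localhost:8000/api/v1/"


def build_health_check_url(now=None, base_url=TAWHIRI_URL):
    """Build a prediction request that launches at the current UTC time."""
    now = now or datetime.now(timezone.utc)
    params = {
        # Arbitrary example launch site and flight profile (assumptions).
        "launch_latitude": 52.2,
        "launch_longitude": 0.1,
        "launch_altitude": 0,
        "launch_datetime": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "ascent_rate": 5,
        "burst_altitude": 30000,
        "descent_rate": 5,
    }
    return base_url + "?" + urlencode(params)


def prediction_is_healthy(body):
    """True if the response has at least one stage, each with a trajectory."""
    if not isinstance(body, dict) or "error" in body:
        return False
    stages = body.get("prediction") or []
    return len(stages) > 0 and all(stage.get("trajectory") for stage in stages)


def health_check(timeout=10):
    """Run a real prediction; any network or parsing failure is unhealthy."""
    try:
        with urlopen(build_health_check_url(), timeout=timeout) as response:
            return prediction_is_healthy(json.load(response))
    except (OSError, ValueError):
        return False
```

Using the current time for `launch_datetime` is what keeps the check meaningful over time: a fixed date would eventually fall outside whatever forecast data is loaded and the check would fail for the wrong reason.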