XaiJu
sondehub

Various updates

Howdy all,

Just wanted to provide an update on what's been happening in SondeHub land. I've been travelling (and got caught up in a bit of a travel issue), so I haven't had a huge amount of time to dedicate to SondeHub.

First up, Luke has done an amazing job of converting our front end to Leaflet. We've always wanted to get this done, and a huge thanks to Luke for putting in the work to make it happen. We had two reasons for switching: Google Maps costs are significant and we were just creeping over our free tier limits, and Leaflet is much more extensible. One happy side effect is that Leaflet is blazing fast!

Luke was very patient with me and implemented many, many, many new features that I've wanted to see for years. Clicking on a radiosonde now refreshes the telemetry with more detail and opens the correct pane in the list, and the website can now be installed as an app on iOS, Android and Chrome! Very cool stuff.

From a backend perspective not much has changed; however, I suspect I've been able to isolate some of the performance issues that have been occurring on the backend from time to time. Our backend performs a lot of ElasticSearch aggregations over a requested time frame. In some cases a user might have left a tab in the background, put their computer to sleep, or otherwise paused the SondeHub website from refreshing for several days. When they wake the website up again, it tries to query a large time frame.

I thought this wouldn't be a huge concern, as the query would time out or get a too many buckets exception from ElasticSearch. However, a request can fall into a gap where the number of buckets is low enough for ElasticSearch to attempt the query but large enough to exhaust all our memory. When this happens the query takes up huge amounts of memory, and this can lead to some interesting results. Eventually the memory pressure gets too high and the node stops performing correctly. AWS may attempt to replace the node during this time, adding to the load on the system.

What's worse is that search tasks can only be cancelled at certain points in their life cycle, so I suspect that these tasks cause the node replacement to get stuck. While this is all occurring, the client is still trying to make these requests, making the problem worse and worse.

There are a few solutions to this problem that I've already implemented. The first is setting correct limits on the time frames of data that can be pulled from the API. The second is setting timeouts on queries. I'll monitor to make sure these changes stabilise the platform and make it more reliable.
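To make those two fixes concrete, here is a rough sketch of how a request's time frame could be clamped and a timeout attached before the query goes anywhere near ElasticSearch. This is only an illustration, not the actual SondeHub code: the function names, the 24 hour cap, the 10 second timeout and the `datetime` field name are all assumptions. The only thing taken from ElasticSearch itself is that a search request body accepts a `timeout` value.

```python
from datetime import datetime, timedelta, timezone

# Both values are assumptions for illustration, not SondeHub's real limits.
MAX_WINDOW = timedelta(hours=24)
QUERY_TIMEOUT = "10s"


def clamp_window(start, end, max_window=MAX_WINDOW):
    """Limit the requested span to max_window, keeping the newest data."""
    if end < start:
        raise ValueError("end must not be before start")
    if end - start > max_window:
        # A tab that slept for days asks for a huge range; only serve the tail.
        start = end - max_window
    return start, end


def build_query(start, end):
    """Build a search body with a bounded range and a per-query timeout."""
    start, end = clamp_window(start, end)
    return {
        "timeout": QUERY_TIMEOUT,  # search-level timeout in the request body
        "size": 0,
        "query": {
            "range": {
                # "datetime" is a hypothetical field name
                "datetime": {"gte": start.isoformat(), "lte": end.isoformat()}
            }
        },
    }


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Simulate a client that has been asleep for five days.
    print(build_query(now - timedelta(days=5), now))
```

Clamping on the server side means a stale client can keep retrying without ever being able to ask for more than the capped window.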

As I'm writing this I'm still tinkering with the cluster configuration, so expect some downtime.

Moving forward with history

The next big ticket item is dealing with history. We only really need to keep a couple of months' worth of radiosonde data in ElasticSearch to make the site run; however, we are currently storing several years' worth. At the moment we can't delete the historic data, as a few of our API endpoints, websites and users rely on retrieving it.

I've been thinking about the best way to keep providing historic data while also removing it from ElasticSearch. The solution I'm going to try out is a scheduled task that runs every hour or every day and retrieves a list of serial numbers that have received frames in the last 48 hours or so. It would then iterate over these serials, build a history document from all the frames that we have received for each sonde, and update S3 with those details. We'll likely need to check for an existing S3 object and merge with it if it exists, to ensure we don't wipe any data for relaunched or reactivated radiosondes.
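The "check for an existing object and merge" step is the fiddly part, so here is a minimal sketch of how it might look. To keep it runnable without AWS credentials, a plain dict stands in for the S3 bucket. The key layout (`history/<serial>.json`), the function names, and de-duplicating on `datetime` plus `uploader_callsign` are all assumptions for illustration, not a decided design.

```python
import json


def merge_frames(existing, new):
    """Merge two lists of frame dicts, dropping duplicates, sorted by time."""
    merged = {}
    for frame in list(existing) + list(new):
        # Assumed de-duplication key: frame time plus the uploading station.
        key = (frame["datetime"], frame.get("uploader_callsign"))
        merged[key] = frame
    return sorted(merged.values(), key=lambda f: f["datetime"])


def update_history(store, serial, new_frames):
    """Merge new frames into the stored history document for one serial.

    `store` is any dict-like mapping of object key -> JSON string; it is a
    stand-in for an S3 bucket in this sketch.
    """
    key = f"history/{serial}.json"  # hypothetical key layout
    existing = json.loads(store[key]) if key in store else []
    merged = merge_frames(existing, new_frames)
    store[key] = json.dumps(merged)
    return merged


if __name__ == "__main__":
    bucket = {}
    update_history(bucket, "S1234567", [
        {"datetime": "2021-06-01T00:00:02Z", "uploader_callsign": "A", "alt": 120},
    ])
    # A relaunch or a later run overlaps with what is already stored.
    history = update_history(bucket, "S1234567", [
        {"datetime": "2021-06-01T00:00:01Z", "uploader_callsign": "A", "alt": 100},
        {"datetime": "2021-06-01T00:00:02Z", "uploader_callsign": "A", "alt": 120},
    ])
    print(len(history), "frames stored")
```

Because the merge is keyed rather than appended, running the task repeatedly over the same 48 hour window should not duplicate frames, and older frames already in the stored document are kept.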

This should give us a single file per serial number that can be accessed. I'm also tempted to do the same per day (rather than per serial), but I'm not sure how useful that would be. Maybe just a list of serials seen on a particular day would be useful?

Naturally the data will need to be backfilled. This could be done from ES or from S3. Either should work fine.


That's all for today, happy radiosonde hunting!

~ Michaela.

