sondehub

Predictor performance improvements and ElasticSearch/OpenSearch

Added 2021-12-06 00:30:42 +0000 UTC

Predictor Updates

We've been having issues with our home run predictor. When a prediction ran, the server would lock up and several requests would take a long time to come back. Previously we ran the predictor using a wind model that was shared via EFS (basically an AWS-managed NFS host). The predictor itself would memory map that file and read through it to provide a prediction.
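For anyone unfamiliar with memory-mapped reads, here is a minimal sketch of the general idea. It assumes a flat file of little-endian float32 values purely for illustration; the predictor's real dataset layout isn't covered here, and `read_float32` is a made-up helper name.

```python
import mmap
import struct


def read_float32(path, index):
    """Read the index-th little-endian float32 from a file via a memory map.

    Assumption: the file is a flat array of 4-byte floats. This layout is
    illustrative only, not the predictor's actual wind dataset format.
    """
    with open(path, "rb") as f:
        # length=0 maps the whole file; ACCESS_READ keeps the mapping read-only
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only the pages actually touched are fetched from the backing store
            return struct.unpack_from("<f", mm, index * 4)[0]
```

Because pages are pulled in on demand, every read can end up going back to the underlying filesystem, which is why the behaviour of the storage behind the file matters so much.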

I suspect the issue is with the way NFS and the Linux kernel lock the file during memory-mapped reads. To solve this I moved the files onto the ECS task (Docker container group) and we mostly got rid of the EFS component. This led to the next problem.

On startup the ECS task would download the latest model, then run the predictor. While the predictor was running, the downloader would constantly run and pull in updates. Over time we noticed the predictor would stop running correctly, which appeared to be due to wind model data corruption. I suspect this is due to how the wind model is being updated and some caching going on with the memory-mapped file.

To resolve this we now subscribe to the SNS feed of file uploads from NOAA. When we detect the last file from the wind model being uploaded we trigger a new deployment. This new deployment will download the latest wind model. This system seems fairly efficient and means we typically see the new model being utilised in under 10 minutes from upload.
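As a rough sketch of that flow (not our actual code), a handler could pull the object keys out of each SNS message and trigger a deployment when the final file of a run shows up. The `.f192` suffix, the function names and the injected `trigger_deployment` callable are all assumptions for illustration. The sketch also assumes the SNS message body is a standard S3 event notification.

```python
import json

# Hypothetical suffix marking the final file of a model run. The real key
# layout and final forecast hour are assumptions made for this example.
LAST_FILE_SUFFIX = ".f192"


def uploaded_keys(sns_event):
    """Yield S3 object keys from an SNS-wrapped S3 event notification."""
    for record in sns_event.get("Records", []):
        # The SNS message body is itself a JSON-encoded S3 event
        s3_event = json.loads(record["Sns"]["Message"])
        for s3_record in s3_event.get("Records", []):
            yield s3_record["s3"]["object"]["key"]


def handle(sns_event, trigger_deployment):
    """Call trigger_deployment once if the event contains the final file.

    Returns True when a deployment was triggered, otherwise False.
    """
    for key in uploaded_keys(sns_event):
        if key.endswith(LAST_FILE_SUFFIX):
            trigger_deployment(key)
            return True
    return False
```

In a real setup `trigger_deployment` might wrap something like boto3's ECS `update_service` call with `forceNewDeployment=True`, so the replacement tasks fetch the new model on startup.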

Here we are showing the typical switchover between the old wind model and the new wind model. It's performed using an ECS deployment.

OpenSearch / ElasticSearch

A couple of months ago we switched from ElasticSearch to OpenSearch and it was very clear from the switch that there was a significant and unexpected change in performance. We've been working closely with Amazon to try to resolve this.

To help debug the issue we've actually set up both ElasticSearch and OpenSearch at the same time. Both are ingesting the same data, and every search query that's run is replicated on both servers. This took a significant amount of work but should help in debugging the issue.
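To give a feel for the idea, here is a minimal sketch of running one query against two backends and timing each. It is not how our replication is actually wired up; `run_on_both` and the backend callables are hypothetical stand-ins for real search clients.

```python
import time


def run_on_both(query, backends):
    """Run the same query against every backend and record wall-clock time.

    backends maps a label (e.g. "elasticsearch") to a callable that takes
    the query and returns a result. The callables are placeholders here.
    """
    timings = {}
    for name, search in backends.items():
        start = time.perf_counter()
        result = search(query)
        timings[name] = {
            "seconds": time.perf_counter() - start,
            "result": result,
        }
    return timings
```

Comparing the `seconds` values per query makes it easier to spot which query shapes behave differently between the two clusters.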

Hopefully we can get to the bottom of this issue; otherwise we might be stuck on an older version of ElasticSearch for longer than we want.

That's all I have for the moment. Hit me up with any questions you might have about the platform :)