sondehub posts

Station icon updates and ingestion filtering

Station Icons

I haven't been feeling too well lately so not a huge update today but I did want to provide an update with a few things that have changed recently.

As you would have seen we ended up using the circles for the station icons. There are multiple reasons for this, and we really valued your feedback. We tried a few concepts with the icon but the biggest problem we ran into was how it looked when scaled down, or on low resolution displays. Along with this, displaying the station icon actually takes a bit of a performance hit compared to just the plain circle. Once again many thanks to Luke for getting this all sorted on the frontend.

With the icon problem solved the next part was colouring the circles based on Patreon status. We had to build a small API that matches up reward statuses. We use the Patreon API to check the note field for each user to determine which icon can be displayed. We have some flexibility here in that we could bring back pictures for certain Patreon reward levels, but we haven't explored this further.

Ingestion Filter

When we built the V2 API for SondeHub we focused on getting the bare minimum features to switch over to our own backend. One of the things that was left off was payload filtering, which means we've ended up with some bad data coming into our system. Until now we've been relying on the client software to correctly filter out the data. This doesn't work if users change or remove the checks, however.

There's multiple reasons we might get bad data:

  • Incorrectly decoded frames - poor radio reception, interference, protocols with not enough error detection
  • GPS noise / jamming - when the radiosonde reports back the incorrect position due to faulty GPS readings
  • Multiple radiosondes on the same frequency - DFMs on the same frequency don't include their serial number in every frame so the data flip flops between serial numbers

So we now have some basic checks that occur when you upload data:

  • Lat/Lon aren't 0,0 (null island)
  • Altitude isn't over 50km (a weather balloon should never reach this high. This is SondeHub not RocketHub)
  • Altitude isn't exactly 0m (it's very rare that a radiosonde would be at exactly 0)
  • Satellite count isn't below 4 (if we don't have a good fix then we discard it)
  • The date and time is correct, within a few hours
  • The serial number is correct format for the type
  • Outlier detection using z-score
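
As a rough illustration of the checks listed above, here is a minimal Python sketch. It is not the actual SondeHub ingestion code: the field names (`lat`, `lon`, `alt`, `sats`, `datetime`), the 3 hour clock window and the z-score threshold are all assumptions made for the example, and the serial number format check is left out because it depends on the sonde type.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean, pstdev

MAX_ALTITUDE_M = 50_000               # "over 50km" from the list above
MIN_SATELLITES = 4
MAX_CLOCK_SKEW = timedelta(hours=3)   # assumed value for "within a few hours"


def rejection_reasons(frame, now=None):
    """Return the reasons to discard a frame; an empty list means keep it."""
    now = now or datetime.now(timezone.utc)
    reasons = []
    if frame["lat"] == 0 and frame["lon"] == 0:
        reasons.append("position is 0,0")
    if frame["alt"] > MAX_ALTITUDE_M:
        reasons.append("altitude above 50 km")
    if frame["alt"] == 0:
        reasons.append("altitude exactly 0 m")
    # Assumption: a frame with no satellite count is not rejected on this check.
    if frame.get("sats", MIN_SATELLITES) < MIN_SATELLITES:
        reasons.append("fewer than 4 satellites")
    # frame["datetime"] is expected to be a timezone-aware datetime.
    if abs(now - frame["datetime"]) > MAX_CLOCK_SKEW:
        reasons.append("timestamp too far from current time")
    return reasons


def is_outlier(value, history, threshold=3.0):
    """Z-score test of a value against earlier values for the same sonde."""
    if len(history) < 2:
        return False
    sigma = pstdev(history)
    if sigma == 0:
        return False  # cannot score against a constant history
    return abs(value - mean(history)) / sigma > threshold
```

Returning the reasons rather than a plain True/False makes it easier to log why a frame was dropped, which helps when judging how much legitimate data a check throws away.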

With all these checks we sometimes throw out legitimate data, but the amount should be so small compared to the vast amount of data we do capture that it shouldn't impact anything. The benefits are huge though and we should see much higher quality data entering SondeHub.

Station Icons Poll

We've been playing with various ideas around station icons and would love your opinion. (this is a non binding poll) Classic station icon, or simple circles. Let us know why in the comments.

Examples:  


New Patreon rewards! Station icons!

Along with the previous perks y'all already love from Patreon we are adding in supporter exclusive station icons for the $3, $10 and $30 tiers! While the colours aren't confirmed yet they'll look something like this.

I'll be sending out a message to all the supporters on the tier levels that have the station icon perk asking for your uploader callsign. This has to exactly match what you have configured in autorx or your TTGO configuration to work. We should go live within the next week or so, time permitting.

We've also added two new tier levels for sponsorship. These tier levels will get your company's logo and link added to the about page and onto our splash page.

History API improvements

Last weekend I was able to get a significant amount of work done improving the history endpoints. The APIs have been updated and you can see an improvement in the latest version of the SondeHub Python library, along with the sondehub.org/card endpoint. We can see in the screenshot the old sondehub client would take minutes to download all sonde data while the new client took all of 4 seconds. These improvements mean that we are now only keeping a few months of data in ElasticSearch.

To achieve this performance we break the task into two parts. Roughly every 6 hours a scheduled task runs to query ElasticSearch for radiosondes in the last 24 hours and adds them to a queue.

A queue worker then queries ElasticSearch and the S3 archive for all the data available for that particular serial number, merges all the data and re-uploads to S3.  Having a queue made it simple to backfill all the old radiosondes.

This approach is cost effective and bandwidth efficient. The only downside is that there may be several hours of lag before radiosonde data lands in S3, but for most use cases this isn't a problem.
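
To make the merge step of the queue worker a little more concrete, here is a small Python sketch. It only shows the merging of frames that have already been fetched from ElasticSearch and S3. The de-duplication key (timestamp plus uploader callsign) and the field names are assumptions for illustration, not necessarily what the real worker uses.

```python
def merge_frames(archived, recent):
    """Merge two lists of frame dicts, dropping duplicates and sorting by time."""
    merged = {}
    for frame in list(archived) + list(recent):
        # Hypothetical de-duplication key: same timestamp from the same uploader.
        key = (frame["datetime"], frame.get("uploader_callsign"))
        merged[key] = frame
    # ISO 8601 timestamps in a consistent format sort correctly as strings.
    return sorted(merged.values(), key=lambda f: f["datetime"])
```

Because the merge is idempotent, running the same serial number through the queue twice produces the same result, which is what makes backfilling old radiosondes through the same queue safe.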

Bucket

Of course the SDK isn't the only way to consume the data. We also provide the S3 bucket open to anyone that wants to process the data. You can use the bucket explorer to explore it: https://sondehub-history.s3.amazonaws.com/index.html

One of the neat things we've done in the bucket design for this iteration is that the date prefix only includes a small subset of frames: the first frame, the highest frame, and the last frame. This allows us to use S3 like a mini database of radiosonde launch metadata. I look forward to seeing what comes out of that :)
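
Picking those three frames out of a flight is simple enough to sketch in a few lines of Python. The field names `datetime` and `alt` are assumptions, and the function expects at least one frame.

```python
def summary_frames(frames):
    """Return the first, highest and last frame of a flight."""
    ordered = sorted(frames, key=lambda f: f["datetime"])
    return {
        "first": ordered[0],
        "highest": max(ordered, key=lambda f: f["alt"]),
        "last": ordered[-1],
    }
```

Together the three frames give a rough launch site, burst altitude and landing area for a flight without downloading the full per-serial file.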


That's all for the moment, have a great weekend!
~ Michaela.

Video, WebSockets in the frontend, predictions and history update

It's only when I look back at the last post do I realise just how much stuff has changed in the SondeHub ecosystem!

SondeHub video

First up a little treat. Due to the Melbourne lockdown the AWS Melbourne users group was live streamed rather than being the typical in person event. This has the bonus that the talks were recorded and published, one of which was me talking about SondeHub infrastructure.

If you want to check that out head to this YouTube link. 

WebSockets update

Luke has been smashing out updates! Recent frontend changes have brought websockets onto the website, and in such a seamless fashion you might not have even noticed. The bottom right hand corner now displays whether WebSockets is connected and the number of messages you're receiving. This lets your browser get the latest updates as quickly as they come in, no more need to wait 5 seconds for the polling interval.

For this to remain scalable we've had to make some more improvements to our websockets system so keep reading as I'll add some technical details down below.

Predictions

snh has done some great work dockerizing the prediction apps that we use. At the moment we use the publicly hosted Tāwhirimātea API endpoint for our predictions, and this has a few drawbacks. We are likely one of the heavier loads on their system, and while I'm unaware of any performance impact we might be causing it's probably best to run something ourselves. The other concern is that we can't maintain it, so it becomes a risk for us if there are service disruptions like we recently saw.

Since we now have docker images for these components we'll likely be running our own predictions shortly. What's really neat is that NOAA publish the required GFS data into an S3 bucket so downloading these will be cheap and fast from our backend.

History

Admittedly I haven't had a huge amount of time to continue on the history work, however it has progressed. Technical details have been worked through and some code has been written in this space. It certainly looks like a viable option.


And that's the update so far. Stick around if you want to learn some more technical details around websockets.

~ Michaela.


A deeper dive into WebSockets

Previously when we implemented websockets we were just targeting the sondehub CLI users and people integrating with our services. For this we could get away with a single node which could easily handle a couple of hundred users. When switching the frontend over to websockets we quickly realised that this wouldn't be scalable for times when meteorology organisations live stream, or share links to our site.

Luckily mosquitto (our MQTT/websockets broker) allows bridging of servers. This lets you replicate data from one server to another. So a simple plan was devised.

The idea is that we could have a single writer node, and an autoscaling group of reader nodes. The reader nodes would automatically connect to the writer and start replicating the data. We just scale based on the CPU load of the readers and we should be good.

There are a few technical challenges to overcome in this though. First off is having the reader nodes discover how to connect to the writer node. Now ideally you'd just have a single writer node but we also need to account for times where the writer node needs to be replaced. Further adding to the difficulty of this task is that AWS ECS services only allow you to attach a service to either an ALB or an NLB, not both. An NLB would be perfect for this job as we need the readers to connect over the binary endpoint. So instead I used a little hack I learnt a long time ago.

Connecting mosquitto to another server requires lines like this:

connection to_writer
addresses 172.16.1.3:1883 172.16.1.4:1883
topic sondes/# in 0
notifications false
try_private false
bridge_outgoing_retain false
restart_timeout 3
round_robin true

So we need to define where mosquitto is connecting to, and in this case we have two IP addresses listed. One of these will be active and the other is only used when the writer server is being replaced out.

Create the smallest possible subnet you can in AWS. AWS will reserve some IPs for internal use. At time of writing the smallest subnet you can make is a /28. With a /28, after the reserved IP addresses, that leaves 12 IP addresses our writer server could be on. This is no good as it would take mosquitto far too long to check each one of those IP addresses.

Instead what we do is reserve 10 of the IP addresses so they won't be used in this subnet. Under EC2 network interfaces you can create network interfaces which will allocate a private IP address without costing anything. These aren't used for anything and will sit idle.

This provides us with two remaining IP addresses that mosquitto could be running on, and is fast enough for a mosquitto reader instance to find the writer instance on boot.

Now the next challenging part is autoscaling. ALBs by default will round robin requests. This is a fine approach if you're doing short web requests but if you are using an autoscaling group with long running websockets you'll end up with very unbalanced load when a new server is added.

Luckily there is an easy fix for this one. Under the target group configuration you can configure the algorithm to least number of outstanding requests:
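
The same setting can also be changed through the API rather than the console. The sketch below is an illustration and not how SondeHub configured it. It uses the `modify_target_group_attributes` call from the boto3 `elbv2` client, and the target group ARN is whatever your own target group uses.

```python
def use_least_outstanding_requests(elbv2_client, target_group_arn):
    """Switch a target group from round robin to least outstanding requests."""
    return elbv2_client.modify_target_group_attributes(
        TargetGroupArn=target_group_arn,
        Attributes=[
            {
                "Key": "load_balancing.algorithm.type",
                "Value": "least_outstanding_requests",
            }
        ],
    )


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   use_least_outstanding_requests(boto3.client("elbv2"), "<your target group ARN>")
```

With this algorithm a freshly started reader has the fewest in-flight requests, so it should pick up new websocket connections until it catches up with the others.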

Now the last thing I wanted to touch on with the websocket configuration is bandwidth.

When we switched from API calls to websockets we were expecting improved performance and cheaper hosting. When we made the switch however we saw improved performance, and a more expensive bill.

So what happened here? Well it turns out that the mosquitto server doesn't support compressing payload data. This means that due to the JSON nature of our payloads we were wasting a lot of bandwidth. When we were using the API, the API would compress these for us.
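
To get a feel for how much repetitive JSON can shrink, here is a small Python experiment using `zlib` from the standard library. The telemetry frames are made up for the example, and this says nothing about which compression method went into the patched mosquitto.

```python
import json
import zlib

# Hypothetical telemetry frames; the field names are illustrative only.
frames = [
    {
        "serial": "S1234567",
        "type": "RS41",
        "frame": i,
        "lat": -37.8 + i * 0.001,
        "lon": 144.9,
        "alt": 1000 + i * 5,
    }
    for i in range(100)
]

raw = json.dumps(frames).encode("utf-8")
compressed = zlib.compress(raw)
ratio = len(compressed) / len(raw)
print(f"raw={len(raw)} bytes, compressed={len(compressed)} bytes, ratio={ratio:.2f}")
```

The keys and most of the values repeat from frame to frame, which is exactly the kind of input general purpose compressors handle well.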

So the solution here, after exploring a lot of options and yak shaving, ended up being adding compression to mosquitto. We are now running our own patched version and I think this graph speaks for itself.

Can you tell when we switched to our patched version? While the changes we made to make it work aren't suitable to be lodged as a pull request I've opened up an issue on the upstream project with the details on how we added it in.


Thanks for sticking around for the dive into websockets. Happy sonde hunting.

~ Michaela.

Various updates

Howdy all,

Just wanted to provide an update as to what's been happening in SondeHub land. I've been travelling (and got caught up in a bit of a travel issue) so I haven't had a huge amount of time to dedicate to SondeHub.

First up, Luke has done an amazing job of converting our front end to leaflet. We've always wanted to get this done, and a huge thanks to Luke for putting in the work to make it happen. We had two reasons for using it: Google Maps costs are significant and we were just creeping over our free tier limits, and leaflet is much more extendable. One happy side effect is that leaflet is blazing fast!

Luke was very patient with me and implemented many many many new features that I've wanted to see for years. Clicking on a radiosonde now refreshes the telemetry with more detail, clicking on radiosondes now opens the correct pane in the list, and the website can now be installed as an app on iOS, Android and Chrome! Very cool stuff.

From a backend perspective not much has changed, however I suspect I've been able to isolate some of the performance issues that have been occurring on the backend from time to time. Our backend performs a lot of ElasticSearch aggregations over a requested time frame. In some cases a user might have left a tab in the background, slept their computer, or otherwise paused the SondeHub website from refreshing for several days. When they wake up the website again it tries to query a large time frame.

I thought this wouldn't be a huge concern as the query will time out, or get a too many buckets exception from ElasticSearch, however a user can fall into a gap where it's a low enough number of buckets for ElasticSearch to attempt the query but large enough to exhaust all our memory. When this happens the query will take up huge amounts of memory and this can lead to some interesting results. Eventually the memory pressure will get too high, and the node will stop performing correctly. AWS may attempt to replace the node during this time frame, adding to the load on the system.

What's worse is the search tasks can only be cancelled at certain times in their life cycle, so I suspect that these tasks cause the node replacement to get stuck. While this is all occurring the client is still trying to make these requests making the problem worse and worse.

There's a few solutions to this problem that I've already implemented. The first is setting correct limits on the time frames of data that can be pulled from the API. The second is setting timeouts on queries. I'll monitor to make sure these changes stabilise the platform and make it more reliable.
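
The first of those fixes, limiting the time frame a request can cover, can be sketched in Python like this. The 24 hour cap is a made-up number for the example and not the limit the SondeHub API actually enforces.

```python
from datetime import datetime, timedelta, timezone

MAX_WINDOW = timedelta(hours=24)  # hypothetical cap, not SondeHub's real limit


def clamp_window(start, end, max_window=MAX_WINDOW):
    """Return (start, end) with start moved forward so the span never exceeds max_window."""
    if end < start:
        raise ValueError("end must not be before start")
    if end - start > max_window:
        start = end - max_window
    return start, end
```

A browser tab that has been asleep for a week would then only ever trigger a query for the most recent window, keeping the number of aggregation buckets bounded.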

As I'm writing this I'm still tinkering with the cluster configuration so expect some downtime.

Moving forward with history

The next big ticket item is dealing with history. We only really need to keep a couple months worth of radiosonde data in ElasticSearch to make the site run, however at the moment we are storing several years worth. At the moment we can't delete the historic data as a few of our API endpoints, websites and users rely on retrieving historic data.

I've been thinking about the best way to provide historic data while also removing it from ElasticSearch. The solution I'm going to try out is having a scheduled task that runs every hour or every day and retrieves a list of serial numbers that have received frames in the last 48 hours or so. It would then iterate over these serials and build a history document of all the frames that we have received for each sonde, then update S3 with those details. We'll likely need to check for an existing S3 object and merge with it if it exists to ensure we don't wipe any data for any relaunched or reactivated radiosondes.

This should provide us with a single file per serial number that can be accessed. I'm also tempted to do the same per day (rather than per serial) but I'm not sure how useful that would be? Maybe just a list of serials seen on a particular day might be useful?

Naturally the data will need to be backfilled. This could be done from ES or from S3. Either should work fine.


That's all for today, happy radiosonde hunting!

~ Michaela.

WebSockets

Last post I mentioned one of the improvements we've made is reducing the cost of the WebSockets feed. I want to spend some time explaining what it is, what it's for, why it's important and our current implementation details.

In a traditional HTTP setup your client (eg browser) has to request data. This is called polling, as the client has to ask, or poll, for new data. As you need to choose how often to poll, it usually means the client is several seconds out of date. It's also resource intensive and requires making decisions on how much data is sent back to the client.

WebSockets resolve some of these issues by providing a communication channel between the server and the client that's always open. The server can send data to the client without the client requesting it. This means that the client no longer polls and all we need to do is send new data to any WebSocket that happens to be open.

SondeHub is pretty unique in that we collect as much data as possible with a fairly quick upload rate. The SondeHub network is huge, with over 300 receivers. 

A lot of the usefulness of the data comes from having such a large network of receivers (thank you!) however having a large network is only as useful as the applications built on top of that data. If we were to lock down access to just the SondeHub website there would be no innovation, and no way for users to build and develop things using the data they have uploaded to us. This is why providing the data back to the community is so important to us. We want to share the data that people upload.

So how does this fit into WebSockets? Well one of the great things about SondeHub is being able to process live data, as it's being received. Anyone should be able to have the same privileges as we do to be able to process that data. Our WebSocket implementation allows anyone with a compatible MQTT WebSocket client to connect and access that data. As WebSockets work in the browser this means that the data could be processed in the browser, or within a server application, making it quite accessible.

Our implementation 

The original implementation we used was AWS IoT. As payloads would arrive on our endpoint we would unwrap them, parse them and upload them to AWS IoT. To provide access to the WebSocket endpoint we would provide presigned URLs to the WebSocket endpoint. 

While this approach worked, its biggest problem was cost. AWS IoT provides an MQTT endpoint which simplifies running the broker for these messages, and has a bunch of useful features for managing IoT devices. We didn't use any of the special features of AWS IoT however, so a lot of the features went to waste and I can only assume that a large proportion of the cost went into funding these features.

So we've abandoned AWS IoT for our own home grown solution. For this we are using an AWS Fargate container configured as a service. We process hundreds of messages per second, rather than millions per second, so we actually don't need any fancy multi server architecture here and can settle for a single server. If we start processing many more messages we can scale the size of the instance. 

For our MQTT broker we are using Mosquitto, an open source MQTT broker that supports MQTT over WebSockets. We are using the prebuilt docker container that can be found on docker hub. One of the more tricky bits to this is getting a config into the container. There's a couple of ways to do this, such as baking your own container image with the config built in, however I opted for a side car approach using the aws-cli to copy the config files into a volume prior to starting. The side car configuration looks a bit like this:

It's not the most elegant solution, but it certainly works for us, and means we don't need to worry all that much about a proper docker container build chain.

The actual configuration of Mosquitto is fairly basic:

max_qos 0
persistence false
listener 8883 0.0.0.0
protocol mqtt
listener 8080 0.0.0.0
protocol websockets
allow_anonymous true
password_file /mosquitto/config/passwd
acl_file /mosquitto/config/acl
http_dir /mosquitto/config/html

As we aren't doing anything special with Mosquitto we can keep the config short and basic. You'll notice that we haven't configured HTTPS or TLS certificates here. This is because the application load balancer is performing the TLS termination for us, which saves a bunch of time and effort.

And that's about it. If you're curious about building an application using the websockets endpoint you can check out the example pysondehub project: https://github.com/projecthorus/pysondehub
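
As a starting point, a minimal consumer built on pysondehub might look like the Python sketch below. It assumes the `sondehub.Stream(on_message=...)` interface shown in the project's README, that the callback receives each frame as an already decoded dict, and that frames carry `serial` and `alt` fields. Check the repository for the current interface before relying on any of this.

```python
def describe(frame):
    """Format one telemetry frame (a dict) as a single line of text."""
    # Field names here are assumptions; print a raw frame to see what is sent.
    serial = frame.get("serial", "unknown")
    alt = frame.get("alt")
    return f"{serial}: altitude {alt}"


def run():
    """Connect to the live feed and print frames until interrupted."""
    import time

    import sondehub  # pip install sondehub

    # Keep a reference so the stream object stays alive while we wait.
    stream = sondehub.Stream(on_message=lambda frame: print(describe(frame)))
    while True:
        time.sleep(1)
```

Calling `run()` should start printing one line per received frame. Keeping the formatting in its own function means it can be tested without a network connection.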

Backend improvements

Recovery markers

SondeHub Tracker now has a way of marking radiosondes as recovered.  

This required a small number of changes on the backend to accommodate the extra data, but is easily something our ElasticSearch cluster can handle. I look forward to being able to provide further analytics on sonde recovery rates!

All the hard work for this was done by the lovely Mark Jessop (VK5QI).

Websockets

The other major change that's occurred is the websockets / MQTT backend has shifted from the AWS IoT platform to our own managed Mosquitto backend. This is to both improve performance and to lower cost.

We are already seeing the cost savings from that change right now. To test out the performance I made a simple CesiumJS app (like Google Earth but in your browser) to track the radiosondes in 3d.

The demo is very crude, requires a lot of CPU and will eventually run out of memory but it's a nice test to show what you can do with the websocket interface.

I've left the demo up, but note that it may break or not work as only minimal testing has been done.

I'll try to put together a bit more in depth view of the websockets changes soon.


Demo: http://sondehubuitest.s3-website-us-east-1.amazonaws.com
Source code: https://github.com/TheSkorm/sondehub-cesium


Alarms

And finally I've implemented some basic health alarms to notify me of any issues with the SondeHub infrastructure. We haven't really had much issue in the past as all our infrastructure is self healing, but it's good to catch issues early.

More affordable backend

In the release retrospect I talked a bit about reducing cost. As it stands Patreon doesn't cover the cost to run SondeHub and I make up the difference from my own pocket, so there is incentive from me to ensure whatever is implemented is not just maintainable but also affordable. 

I've implemented the changes previously mentioned, which included removing AWS IoT actions from the ingestion pipeline, and batching up SQS messages. It's looking like these changes have saved roughly a third from the bill!

Batching up SQS messages isn't without its own problems though. Since the queue worker may encounter an error processing a single payload it has to decide whether to:

  • fail the entire batch of payloads
  • add a new message to the queue with the failed messages
  • mark that batch as successful

The backend code isn't very robust at the moment for this, and we recently saw a message get stuck. In this case it was a station running an older version of radiosonde autorx that incorrectly decoded the datetime field for a DFM sonde.

Nothing bad happened in this case, it just meant that we kept trying to process it.
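
One way to handle this is to process each payload in the batch on its own and collect the failures, which corresponds to the second option in the list above. The Python sketch below shows the idea. It is not the current backend code, and `handle` stands in for whatever function processes a single payload.

```python
def process_batch(payloads, handle):
    """Run handle() on each payload; return (processed_count, failed_payloads)."""
    processed = 0
    failed = []
    for payload in payloads:
        try:
            handle(payload)
            processed += 1
        except Exception as exc:
            # Broad on purpose: one bad payload must not sink the whole batch.
            failed.append({"payload": payload, "error": str(exc)})
    return processed, failed
```

The caller can then re-queue only the failed payloads. Pairing that with a cap on retries would stop a permanently bad payload, like the falsely decoded datetime above, from being retried forever.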

We also have a similar issue when it comes to accepting payloads on our HTTP endpoint. As we accept many payloads in one request, we need to return something to the client to give it an idea of which payloads were successfully uploaded, and hopefully provide more useful error messages. At the moment we just send back a 200 OK or a 503 on any failure. We'll likely need to fix this sooner rather than later.

Terraform

I've also taken the time to get the real world and terraform a little more aligned. There's still a handful of resources that need to be terraformed, and some resources that need to move from my personal AWS account to the dedicated AWS account, such as the `/card` lambda function and the main CloudFront distribution.

SondeHub Infrastructure monitoring

I've put together a dashboard of our infrastructure. Mostly to help myself monitor what's happening behind the scenes, but I thought it would be useful to share it! Enjoy :)

URL: https://sondehub.org/go/status 

Release retrospect and infrastructure plans

First off, thank you to all that have signed up! Still a long way before all the infrastructure costs are covered, but we are getting closer. Tell your radiosonde chaser friends to sign up!

I wanted to quickly run through how the launch of the v2 interface has gone, and what the next steps are in the project.

Launch of the v2 interface generally went pretty well. There were some use cases and issues that we didn't account for until after launch, however those were fixed up pretty quickly! Most of these were minor things like the usual short URLs (sondehub.org/{serial number}) not quite working.

Running off a cloud with a good CDN really helps as if the backend starts to struggle users only see the issue as increased latency rather than complete failure.

Our infrastructure was able to handle the load correctly which was great, and I'm feeling pretty confident that it'll be ok, even if included in something like a NOAA livestream or the like.  

We've also seen some third party developers building in support for sondehub which is great!

So onto the future plans:


1. Reducing cost.

At the moment we use 3 AWS IoT actions to upload to S3 in different filename formats. By switching this out to a single Lambda function I think we can save a significant amount of cost here.

It's also highly likely that the IoT Action to SNS topic will get replaced, and incorporated into the Lambda function that already exists for pushing the data onto the AWS IoT bus. There's actually a bit of opportunity here to introduce some cost saving by batching up several packets into a larger request which could help on SNS and SQS costs.
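
The batching idea can be sketched in a few lines of Python: instead of publishing one message per packet, several packets are serialised together into a single message body. The batch size of 10 is an arbitrary choice for the example. A real implementation would also need to respect the maximum message size of the service it publishes to.

```python
import json


def batch_messages(packets, batch_size=10):
    """Group packets into JSON-array message bodies of at most batch_size packets each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        json.dumps(packets[i:i + batch_size])
        for i in range(0, len(packets), batch_size)
    ]
```

Since SNS and SQS charge per request, sending 25 packets as 3 messages instead of 25 cuts the request count substantially, at the price of the consumer having to unpack each batch.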

2. Updating Terraform Infrastructure project

Our infrastructure is deployed as code using a tool called terraform. It allows us to automate deployments of the backend and all the cloud services we require. During the launch a bunch of changes were made that haven't been reflected in the terraform configuration that we have on GitHub. That'll need to be updated, and hopefully improved on so we can launch test stacks for development.

3. Backend Refactor

The backend probably didn't get enough time invested into it. A lot of the code isn't really in a maintainable state so I'd like to go through and rewrite that to be much easier to work on. 

The API structure needs some focus as well. The majority of the endpoints that have been built are based around getting the SondeHub Tracker UI to work and not around developer usability.


Nearing v2 Release

Thanks for supporting SondeHub! As we get closer to the v2 release I wanted to share how some of our infrastructure is built.

Our infrastructure runs on Amazon Web Services for several reasons:
- It's what I'm most familiar with
- I'm able to source discounts for AWS
- AWS provides high amounts of availability and scalability, letting us experiment quickly (though this is true for several other platforms as well, like Azure or GCP)

v2 infrastructure overview

SondeHub v1 grew organically in its design, however it did allow us to prove out some of our design decisions. One of the goals I had with the design is to allow third parties to interface easily. The advantage the SondeHub platform provides compared to other services that provide upper air data is that our data is provided live, as it is collected.

To let third parties gain access to this live data we utilise AWS IoT MQTT over WebSockets. A third party can use our API to connect to the websocket stream and start receiving telemetry with less than 30 seconds latency.

ElasticSearch Kibana showing altitude over 100 radiosondes

We also wanted users to be able to query the data interactively, which is why we've invested much time in getting the data into an ELK stack where it can be queried using the popular Kibana dashboards.

11,000 messages received per minute during peak load

In SondeHub v2 we collect 30x more data. Every single frame is uploaded. This jump up in data processing caused us to hit a bottleneck in AWS ElasticSearch.

You have exceeded the number of permissible concurrent requests with unique IAM Identities. Please retry.

Working out instance sizes and ingestion method has taken a significant amount of time.

This is caused by the number of messages being received being too many for a single node to handle in AWS ElasticSearch. This can be worked around by batching up the messages prior to inserting them into ElasticSearch. We've built out that pipeline and over the last couple of days of testing it seems to be handling the work load a lot better.
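
For readers unfamiliar with batched inserts, ElasticSearch accepts many documents in one request through its `_bulk` endpoint, which takes newline-delimited JSON: an action line followed by the document itself. The Python sketch below only builds that request body. The index name is a placeholder, and this is an illustration of the format rather than the pipeline we built.

```python
import json


def build_bulk_body(index_name, documents):
    """Build a newline-delimited JSON body for the ElasticSearch _bulk endpoint."""
    if not documents:
        return ""
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_index": index_name}}))  # action line
        lines.append(json.dumps(doc))                                # document line
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```

The resulting string is sent as the body of a single POST to `/_bulk` with the `Content-Type: application/x-ndjson` header, so a hundred frames cost one request instead of a hundred.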

The other way to access our data is for historic purposes. Our dataset is growing quickly and we've been able to partner with AWS to store all our telemetry in S3. These buckets don't cost SondeHub so we are encouraging everyone to utilise data from here where possible.

And finally we have the SondeHub UI. Historically the way the SondeHub UI worked was proxying requests to and from HabHub, however we have severely outgrown their backend. Mark (VK5QI) has done a huge amount of work forking the HabHub UI and adjusting it for our needs. For now the backend has a bunch of endpoints that provide a HabHub compatible data format to make getting the UI up and running as fast as possible.

Since we have more control over the data formats and the UI we can start introducing new features. For example we can now report the SNR from receivers, a much requested feature.

I hope this gives you some insight into how SondeHub works behind the scenes. If you have any questions just reach out :)

Michaela ~ VK3FUR
