Data management
Added 2019-09-03 14:01:00 +0000 UTCWhen I started beestat I had no real idea how much data storage I would need. I spun up a cheap server, added 50GB of storage and called it a day. I started to get worried when the database hit 30GB. Little did I know that just over a year later I would have over 200GB of data...almost all of it in a single table.

As far as databases are concerned, 200GB isn't really that much. I have a very simple schema that is appropriately indexed. I can easily query this amount of data. Every time you view Recent Activity, Aggregate Runtime, or Home Comparisons you are running fast queries on this data.
So what's the problem? Money, mostly. That 300GB volume costs $30/month. It's covered via Patreon, but if I had less data I could re-appropriate some funds for useful things like backups. On top of that, just managing a database of that size becomes difficult. It takes a long time to move the data anywhere for things like maintenance or backup/restore operations.
The solution
This one is easy: Reduce the amount of stored data. Right now I have 7,000 thermostats, and each thermostat uses 11.8MB of storage per year. On average, I have 1.6 years of history for each thermostat. You can do the math; that's 132GB (plus a non-trivial amount of database overhead which I am ignoring for now).
The first thing I can do is to optimize what data I store and how I store it. By removing unused data, converting strings to pointers, and some other clever tricks like storing decimal values as integers (multiply by 10), I can get the numbers down to 4.3MB per year per thermostat. Now we're down to 48GB.
Next is an easy one. Stop storing high resolution data for so long. I am officially deleting old thermostat data after one year, cutting the database size down to 30GB.
The last bit of improvement can be had by utilizing compression directly on the database. This is sort of like putting all the data in a bunch of zip files before saving them. You lose some speed by doing this, but it cuts down the database to a meager 13GB. With everything together, that's a 90% improvement with very minor trade-offs.
Here what gets lost:
1. High resolution data older than 1 year
2. Ability to regenerate historical temperature profiles
That's it...and it's not as bad as it seems. I will still be keeping 100% of historical data in a summarized format. You get high resolution data (5 minute) for a full year, and after that it's grouped by day. This compresses a table that is gigabytes in size to only hundreds of megabytes.
Changes underway
All of these changes are underway. I spent several weeks testing and researching and can attest that the numbers above are not theoretical. As I implement these changes I am also seeking out other optimizations to improve performance. For example, I cut the number of database queries by 70% on the initial data sync for new users. Subsequent syncs enjoy the same performance boost. Individually these changes are small, but multiplied by thousands of thermostats it relieves a lot of server load.
On top of all this, the reduction in data is enabling me to begin the process of syncing historical sensor data. The exact details are unclear at this point, but I will probably sync the past 1-3 months of high resolution sensor data.
So lots of changes incoming this fall. This will be a pretty significant update behind the scenes and will require some downtime while I migrate to the new structure and do some server reconfiguration. Looking forward to having all this done!
Comments
I just got an Ecobee again after a couple of years in an apartment and I was so happy to see this project still going! I had to kick in some money to see how you were handling all this data and this is truly impressive! Awesome job!
Keegan Carter
2019-12-06 17:49:21 +0000 UTC