Tuesday, 13 December 2016
With the upcoming official release of our time series database as open source, we'd like to reveal a bit more about the process. SiriDB is a time series database that has been in private development for two years now, mainly used for our own IT infrastructure monitoring system. In these two years we refined the database to handle massive datasets, respond blazing fast and stay online at all times. How did we do this?
At the beginning of 2015 the amount of time series data we collected via Oversight (our IT infrastructure monitoring system) started to grow rapidly. To handle this volume we stored the data in Google BigQuery. Although we were quite happy storing data in BigQuery, retrieving it was slow, and we wanted to show data graphs to our users without waiting time. This meant we needed a time series database. Fast.
We tried several TSDBs that were on the market at that time, but none performed the way we needed. This left us no other choice than to write one ourselves. We chose to write our time series database in Python. Python offers a great selection of libraries, and you can often get a lot done with relatively little code, so development goes quickly. The first single-server version was ready to use within a month, and two months later SiriDB could scale data over multiple servers and activate replicas to prevent downtime.
However, SiriDB was not yet operating without problems. Python uses garbage collection, and depending on the amount of data in SiriDB a full collection could take several seconds. While collecting garbage the process is unresponsive, which we think is unacceptable for a database. Luckily Python lets you tweak the garbage collector, which we used as a temporary solution. You can use gc.set_threshold() for this.
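A minimal sketch of this tweak (the threshold value below is an arbitrary example, not the setting SiriDB actually used):

```python
import gc

# Inspect the current thresholds; the defaults are typically (700, 10, 10),
# meaning a generation-0 collection runs once allocations outnumber
# deallocations by 700.
print(gc.get_threshold())

# Raising the generation-0 threshold makes collections rarer, trading some
# extra memory for fewer (and therefore less disruptive) pauses.
gc.set_threshold(50000, 10, 10)
```

This only spreads the pauses out; the collections themselves still stop the process, which is why it could only ever be a temporary fix.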
After this we started working on preventing garbage altogether so we could turn off garbage collection entirely. This was not an easy task; if you ever plan on using Python without garbage collection, be aware of the trouble it will cause you. Besides bugs in our own code, we also found bugs in the re module of Python 3.5 (issue25554) and in the aiohttp library (#579).
What do you need garbage collection for? Garbage collection cleans up unreachable objects in memory, such as reference cycles, which reference counting alone cannot free. See this example:
import gc

l = []
l.append(l)  # the list now references itself: a reference cycle
del l
gc.collect()  # => collects at least 1 unreachable object.
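Preventing such cycles in the first place is what makes it safe to disable the collector. A minimal sketch of one common technique, assuming parent links are the source of cycles (the Node class here is hypothetical, not SiriDB code):

```python
import gc
import weakref

class Node:
    """Tree node whose parent link is weak, so parent<->child never forms
    a reference cycle and plain reference counting suffices."""

    def __init__(self, parent=None):
        self._parent = weakref.ref(parent) if parent is not None else None
        self.children = []

    @property
    def parent(self):
        # Dereference the weak link; returns None once the parent is gone.
        return self._parent() if self._parent is not None else None

gc.disable()                  # safe because the code below creates no cycles
root = Node()
child = Node(parent=root)
root.children.append(child)   # strong reference down, weak reference up

probe = weakref.ref(child)
del root, child               # freed immediately by reference counting
assert probe() is None        # no gc pass was needed
gc.enable()
```

The price is that every "upward" or "sideways" reference in the code base must be audited this way; one overlooked cycle leaks memory forever once the collector is off.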
Another disadvantage of Python is its high memory usage. To keep memory usage down as much as possible we relied heavily on computed properties. These, unfortunately, have the disadvantage of being slower. In SiriDB it is possible to select series, shards and other objects by their properties, but computed properties make these queries relatively slow.
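The trade-off can be sketched like this (illustrative classes, not SiriDB's actual data model):

```python
class SeriesStored:
    # Stores the length next to the points: O(1) reads, but an extra
    # attribute (and a per-instance __dict__) for every series.
    def __init__(self, points):
        self.points = points
        self.length = len(points)

class SeriesComputed:
    # __slots__ plus a computed property keeps the per-instance memory
    # footprint small, at the cost of recomputing on every access.
    __slots__ = ('points',)

    def __init__(self, points):
        self.points = points

    @property
    def length(self):
        return len(self.points)
```

A query that filters many thousands of series on such a property hits the computation once per series, which is where the slowdown adds up.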
We also wanted to provide SiriDB with fast algorithms to aggregate and combine data. In pure Python, simple code often runs faster than a theoretically better algorithm, because the simple version can lean on built-ins implemented in C. You can see this in Python's statistics module: the median is calculated by the following code, which first sorts all the values and then returns the middle one.
def median(values):
    l = len(values)
    s = sorted(values)
    h = l // 2
    return s[h] if l % 2 else (s[h-1] + s[h]) / 2
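To illustrate the point, here is a rough benchmark sketch comparing that sort-based median with a pure-Python quickselect, which is O(n) on average and therefore "better" in theory (the function names and sizes are ours, purely for illustration):

```python
import random
import time

def median_sorted(values):
    # The "simple" version: sort (done in C by sorted()) and pick the middle.
    l = len(values)
    s = sorted(values)
    h = l // 2
    return s[h] if l % 2 else (s[h - 1] + s[h]) / 2

def select(values, k):
    # Quickselect: finds the k-th smallest element in O(n) on average,
    # but every comparison runs as interpreted Python bytecode.
    pivot = random.choice(values)
    lows = [v for v in values if v < pivot]
    pivots = [v for v in values if v == pivot]
    if k < len(lows):
        return select(lows, k)
    if k < len(lows) + len(pivots):
        return pivot
    return select([v for v in values if v > pivot],
                  k - len(lows) - len(pivots))

data = [random.random() for _ in range(100_001)]

t0 = time.perf_counter()
m1 = median_sorted(data)
t1 = time.perf_counter()
m2 = select(data, len(data) // 2)  # odd length: the middle element
t2 = time.perf_counter()

assert m1 == m2
print(f"sorted: {t1 - t0:.4f}s  quickselect: {t2 - t1:.4f}s")
```

On CPython the sort-based version typically wins despite its worse O(n log n) complexity, exactly because the heavy lifting happens inside a C-implemented built-in.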
To make aggregating and merging faster we wrote various C extensions for the previous version of SiriDB. These extensions ensured that queries were answered incredibly fast while the base code remained in Python.
Now that we were writing some code in C anyway, the idea came up to move SiriDB completely from Python to native C. At the beginning of 2016 we started this move. Initially we planned on keeping the new version backwards compatible with the (then) available Python version. However, this also proved to be a nice opportunity to make improvements that were only possible if we broke compatibility. We have written our own serialization protocol named QPack (QPack-js & QPack) that SiriDB uses for communication. Furthermore, the core of SiriDB no longer contains an HTTP server; instead we created a separate project that supports different data formats like JSON, CSV and MsgPack (https://github.com/transceptor-technology/siridb-http). At that moment we were still the only user of SiriDB, so nobody noticed the break in compatibility. This is why we are at SiriDB version 2.x at the moment.
Now that we have moved to C we can keep track of series properties at low memory cost, which makes queries a lot faster. For example, we can ask for the length property of all series in the database using count series length. On our production database this query returns in 0.06 seconds, while the same query took about 40 seconds in the Python version.
We are very happy with the move to native C. For our product, Oversight, this means that the memory use of every SiriDB server (we currently have six: three pools with two servers each) has gone down from 10 GB to 2 GB, and as a bonus inserts and queries are even faster than before.