byteslice

Faster searching

Added 2020-01-17 04:55:55 +0000 UTC

Someone new to Derpibooru may find themselves somewhat impressed by the sheer amount of functionality enabled by our search system. (Also, we apologize if you were ever confused by the minutia of the syntax.) You can search any number of our 450000 tags in any combination and number you please. You can use logical operators like AND, NOT, and OR to filter set membership any way you desire. You can sort by almost any numeric field you want. At the time of writing, I'm not aware of any other website that offers this level of flexibility in terms of searching and sorting.

This level of flexibility is enabled by our fantastic search backend, Elasticsearch. Elasticsearch is a distributed, near-real time search engine written in Java that uses bitmap indices to execute your search quickly[1], and VSM tf-idf scoring to sort your query results by their relevance. In general, this process of searching and scoring results involves loading all possibly relevant results for a query, assigning a "score" all of them based on their relevance, and then performing a distributed heap sort followed by a merge sort on the collector in order to bubble up the top results.

However, we only expose relevance in very limited scenarios. Furthermore, in most cases on Derpibooru, we are only interested in sorting by a "natural field" of the data, like the creation date, or even in the future perhaps sorting faves by the date you faved the image. This generic relevance-based sorting process entails, for possibly millions of results, loading the fields for the documents from a separate index structure called doc_values in Lucene (the library used internally by Elasticsearch to store its data), and then sorting the documents that way. These disk reads incur a lot of latency and use up a substantial amount of memory bandwidth. They also require a large amount of space in memory, sometimes incurring 20-30MB of memory per concurrent sort.

I recently hypothesized and designed a new data structure which I call a covering bitmap tree, which is as far as I'm aware completely novel and is not used in any program or mentioned in any literature, and implemented a set of related algorithms in a program I called "Protodash" written in Rust, derived from the quickselect[2] selection algorithm. To measure its performance relative to Elasticsearch, I loaded in a large copy of data I extracted from the Derpibooru data dumps. Sort memory usage is a tiny fraction Elasticsearch's usage, and the benchmarks, when performing identical operations, pretty much speak for themselves:

In addition, Protodash also handily outperforms Elasticsearch in parsing its input, managing to parse and load the entire Derpibooru dataset from JSON in under a minute. Elasticsearch, in comparison, takes around 5 minutes to load the dataset on my well-clocked CPU and high-performance NVMe storage, and the data have to be fed to the search engine in small batches to avoid overwhelming the server process.

I have decided that Protodash should also reconsider the Lucene decision to use immutable bitmap structures. A problem we face in our Elasticsearch deployment on Derpibooru is frequent index segment merging, which incurs a disk IO penalty. If mutable bitmap structures are used instead, the problem of segment merging is averted entirely, and the disk IO is not wasted performing merging tasks. This will have to have be considered carefully to avoid introducing excessive on-disk fragmentation of the bitmap structures. (In order to continue to take advantage of memory mapped data, this will require changes to the CRoaring library[3].)

Given that Elasticsearch is the largest regular CPU consumer in the Philomena application, this is the main project I intend to work on for the foreseeable future.

---

[1] Vicent Martí has written more on other practical use of bitmap indices here: https://github.blog/2015-09-22-counting-objects/
[2] https://en.wikipedia.org/wiki/Quickselect
[3] https://github.com/RoaringBitmap/CRoaring/