The covering bitmap tree
Added 2020-01-22 05:13:07 +0000 UTCBecause I was asked several times about how the implementation of this data structure is able to search and sort so fast, here is an overview of how it works. By publicizing the data structure and algorithm, the hope is that one day search libraries like Lucene will also adopt this particular method of sorting by natural fields of the data.

The above diagram is a simplified view of a special B+ Tree which contains bitmaps (fundamentally, an abstraction for a set of documents) at its leaves.
This tree is special. Each of the internal nodes contains a bitmap that is the disjunction of all the nodes beneath it. Hence the name, covering bitmap. The major advantage of this is that it makes range queries extremely fast, requiring very few operations on the underlying Roaring bitmaps to construct.
If you are familiar with a B+ tree, here is the simple recursive algorithm to select a bitmap containing all documents greater than some set value X.
1. If the current node is an internal node, first, find the child pointer you would traverse to find the value at the leaves, and recurse into that child. Then, union the bitmap you get from the result of the recursive call with the entire covering bitmaps of every child after the one that was traversed.
2. If the current node is a leaf node, find first key pair that is greater than the selected key, and begin unioning until you reach the end of the leaf.
This works for all unbounded ranges. For a bounded range, you would start the same in the internal nodes, but then recurse into the last child whose key is smaller than the intended key instead.
Here is a worked example on the closed bounded range [3,7]. Notice how 3 leaves of the B+ tree did not have to be visited.

My benchmarks show that this utterly beats the snot out of the BKD tree approach employed by Lucene. (I'm not an expert in Lucene, but as far as I can tell by my reading of the source code and related issues in the tracker, it seems that it needs to visit every possible node that could match in the BKD tree in order to add it to the search.)
In a set with 2236685 distinct ids, Elasticsearch takes 20 milliseconds to compute a pathological range query for ids greater than 1, and return a count of the results. Protodash, using the covering bitmap tree, takes about 180 microseconds to do the same thing. Note that both programs are using Roaring bitmaps.
There is no reason the same principle could not be applied to a trie instead of a tree in order to support very fast string prefix matching. In fact, I intend to do this as well in my implementation.
So what's the connection to partial sorting? Well, recall how Quickselect works:
1. Pick a (random) element as the pivot.
2. Partition the input space into elements smaller than the pivot and larger than the pivot.
3. Recurse into the half of the input that contains the value you were looking for.
Because the covering bitmap tree allows us to perform step (2) using only operations with compressed bitmaps, we can avoid the serious memory bandwidth and locality overheads associated with materializing the result set and sorting the documents using their actual stored field values (doc values). To be clear, it is still necessary to store the values at index time, and load the values at query time, in order to use the favorable pivoting behavior of Quickselect. However, the performance overhead of this is incredibly small compared to loading doc values for every element in the set and sorting all of them.
Here is an example from my code that uses bitmap Quickselect to quickly get the top document of a search by an i64 field.

The modification to the Quickselect algorithm to select the top N values, instead of the Nth value, instead is left as an exercise for the reader. (Yes, I am your math textbook now. Fear me.)
There is a related optimization that will let you page arbitrarily far into the results with barely any extra time consumed, but Elasticsearch cannot use this algorithm because it is a distributed search engine, and cannot possibly apply when the results are distributed. Note that Protodash is, by design, not distributed.
During benchmarking, while changing the branching factor of the B+ tree, the behavior of random pivot selection eventually gave me a fantastically good best-case result of 250 microseconds to search and partially sort over a subset of 269321 set values, with average cases in the 2 millisecond range and worst cases about 5 milliseconds.