byteslice

Random sorting using cryptography

Added 2021-05-23 04:53:04 +0000 UTC

It's been a while since I've provided any updates on Protodash. I admit that it's descended into the role of vaporware, mostly because implementing it in an efficient way has turned out to be extremely complicated. My original idea was to implement a new database backend with the bitmap tree as its native data structure, but I didn't have the energy to bring that to completion, and then school started again and I lost the ability to consistently focus on it.

But I've gotten at least some of my steam back, and some new developments with Elasticsearch have made this more pressing to revisit:

- It changed its license; using nonfree dependencies in an AGPL project is in poor taste at best, even if Philomena does not link Elasticsearch and thus does not fall under the restrictions of section 5
- Its terrible performance has started causing trouble for scaling Philomena down to save cost in practical scenarios
- Its high memory usage causes contention even on servers with plenty to go around

To address this, I have decided to revisit the search engine via bitmap tree implementation idea. But instead of implementing it via a custom database, I will probably take the following shortcuts:

- Still implemented in Rust
- SQLite or LMDB as the storage backend
- Bitmaps are rewritten fully when they are modified, just like in Elasticsearch

This allows making the following simplifications which drag this project into relative feasibility for a minimum viable product:

- All transaction and serialization support is delegated to the storage backend
- Bitmap operations and writing can be delegated to an optimized third-party implementation like CRoaring

Hopefully next time you see me write on this topic, I will have at least some working serialization implemented. Right now with the bitmap tree I am limited to memory-only benchmarks, which while they are very fast, are not necessarily a fair fight against Elasticsearch.

---

Attentive readers, for those who have been reading the previous posts, may have noticed that I was able to solve sorting with the bitmap tree, using minimal memory at query time. However, this sorting only covered sorting over stored fields. Elasticsearch does not only allow sorting over stored fields; in fact, its ability to do this is almost incidental, since it is really designed for sorting by relevance score, which is computed by a function that evaluates similarity to an input query.

For my use, which is much more database-oriented, sorting over stored fields will be the majority usage, relegating relevance sorting to the sidelines. But there is a problem. While in Philomena, direct relevance sorting is not used much (essentially only being used in the "related images" system), it is actually also used indirectly for random sorting and for the random navigation, which are used extensively. The way it is currently implemented in Elasticsearch is to compute a pseudo-random function which takes the image's ID and a random seed as input and generate a random float as output. The results are then sorted using heapsort to collect the top K and merged at the collector.

While this would almost certainly be faster in a Rust implementation due to performance advantages over Java, it would still destroy the main advantage of the stored field sorting, which was the minimal processor and memory usage even over extremely large sets. Elasticsearch can parallelize the processing by distributing it over each shard, but Protodash is not designed as a distributed search engine, so it needs a different approach.

The sort could be kludged by computing a hash ahead of time and storing it, but this doesn't allow seeding the randomness. Another potential approach is to generate random indices into the image ID array at query time and only pick those to load, but doing this introduces the possibility of collisions, and doesn't allow paging deeply into the result set for free the way the stored field sorting using the bitmap tree can.

This problem can be reduced to finding some function F with these properties:

- F is a bijection over integers from [0..n) to [0..n)
   * n is the cardinality of the set we want to get a random subset from
   * it must be a bijection because F has to map every index in the set to a new location
   * to get outputs to return to the user, we simply run F(0), F(1), F(2) ... and select the documents at the index F returns
   * to offset the search results by (e.g.) 25, we can do F(25), F(26), F(27) ... no results which came before F(25) will appear again

- F's output is pseudorandom
- F is seedable and can be influenced to output different permutations
* otherwise, there is no advantage over random sorting used stored fields

As it turns out, functions that match the description of F exist from cryptography. This specific problem happens to map neatly onto the role of format-preserving encryption, which is mostly mentioned for its use in encrypting things like telephone or credit card numbers in a database column that cannot accommodate large AES ciphertexts. Since format-preserving encryption is not really necessary in any sane database design, documentation on using it is relatively sparse. Since I am not using it for anything security sensitive, I decided to implement a format-preserving encryption scheme myself that would be suitable for Protodash.

A key observation for format-preserving encryption is that any existing encryption algorithm, like AES which works over 128-bit blocks, can be made to work to encrypt a number from [0..n) to [0..n) by simply re-encrypting the previous output with the same key until the algorithm generates an output in that range. This technique is called cycle-walking, and it works for any input. However, since the output of AES is from 0 to 2^128, and the largest sets we plan to work with would be about 2^22, each cycle-walk would result in an average of 2^106 chained invocations of AES to compute, which is infeasible.

A smaller cipher is needed. Ideally, we will use a cipher that has the minimum number of bits needed to represent all values [0..n), since the worst-case odds of ciphertexts failing to fit into the domain become extremely small with relatively few iterations. While a few smaller ciphers exist, most of them are fixed size, and so would result in needless time being burned on evaluations of the cipher function to find an evaluation fitting in the range. One cipher, the Hasty pudding cipher, is notable for having no fixed bit length, but is complicated to implement and perhaps provides an unnecessary level of security not needed for this application.

But we don't need to use any of those existing ciphers. There is an existing framework for generating arbitrary block ciphers called a Feistel network. The Feistel network splits its input into two parts and uses several rounds with a keyed pseudorandom function to progressively scramble the output bits (the PRF need not be reversible). Since this application is not security sensitive, we are free to explore any function for this purpose we want. After some testing I decided on the FNV-1a hash function, for its relatively good output uniformity and its extreme speed and simple implementation:

Using the Feistel network as a framework, I can use this FNV function to construct and evaluate arbitrary bit-length ciphers on the fly. I use a 64-bit hash function with 64-bit integers for the inputs here simply because this is far beyond the largest input I'd expect to handle, but there is no reason this could not be extended to use longer integers and a longer hash function if larger sets are being considered. I use four rounds as this is enough to generate a reasonable amount of confusion and diffusion.

Note that this code is an implementation of unbalanced Feistel network, where the left and right bit lengths can differ. For some reason, code implementing an unbalanced cipher was surprisingly hard to track down, so I am providing this code for reference.

Finally, there is the cycle walking algorithm itself which implements the overall format-preserving encryption:

This algorithm is fast. It is in fact so fast that I initially had trouble benchmarking its components because the function was executing more quickly than the timer resolution of Cargo's bench system. But after some tweaking and performing repeated iterations of the results to get an aggregate, the benchmark results are in:

This comes out to 40ns for each invocation of F for a set with 2.5 million elements, which is essentially negligible and far faster than sorting using the bitmap tree.

Elasticsearch doesn't even come close. It is of course using an entirely different strategy and takes about 300ms to evaluate a random function over the search results and push them through the heap. And to add insult to injury, because it uses a heap, it doesn't even allow you to page more deeply into the results the way using format-preserving can. So this is an all-around win.

The source code for this experiment is available here.

Disclaimer

I am not a cryptographer. Using cryptographic techniques to generate pseudorandom permutations is an interesting avenue for amateur exploration. In this specific case, it doesn't possess any risk if the encryption is too weak to provide meaningful security, or has issues such as side-channel attacks. Do not use this as an excuse to implement your own cryptographic primitives; always reuse standard implementations, as they will cover your use case if you need real security, and if they don't, think twice before blindly implementing anything you read on the internet -- you will probably do it insecurely.