Palantir: search with a twist (part one: memory efficiency)

August 13th, 2009 | Ari

magnifying glass

A Palantir cluster seamlessly integrates many pieces of proven technology. One of them is our customized version of the venerable Java search engine, Lucene. Search engine technology tends to be optimized for the common use case of indexing web documents (or similar information architectures) where you have a few search terms in each query and many, many documents as results. We want to leverage the inverted index capabilities of Lucene, but our data access patterns are a bit different than the typical use case: we need things like pervasive range-querying, different types of relevance, and dynamic views of the data based on security constraints. So in building our data platform, we’ve run into some interesting challenges that are pretty unique in the information retrieval realm, specifically:

  1. Raising memory efficiency
  2. Real-time indexing
  3. Preventing information leaks across access boundaries in an efficient manner

I’ll cover (1) in this post and (2) and (3) in a later post, due out in about two weeks.

Hit the link and we’ll delve into this topic.

Raising memory efficiency

We’ve addressed the issue of resource constraints, generally, in our earlier post: Bandwidth isn’t cheap. Disk isn’t cheap. CPU isn’t cheap. In that post, we posited “RAM to the rescue”:

On the other hand, some things in a SCIF are comparatively cheap. We never use boxes with less than 32GB of memory, and, in fact, lots of sites use 128GB of memory. RAM requires negligible power and cooling, and compared to disk, it’s relatively simple to install. It’s also easy to reconfigure the setup to use the additional memory.

While this is true, no matter how much RAM you buy, your users will find a way to use it all — search is no exception. In many of our environments, the search processes share hardware with other processes in the Palantir cluster, so while the OS may have 128 GB of RAM available, the search process’s VM has substantially less available to it. Compare this to a cluster of dedicated search nodes, where each node will have indexes sized to fit specifically into the memory available.

The upshot is that we needed to modify parts of Lucene to deal with tighter memory constraints than it was designed for.

Priority queue results accumulation

Most systems that implement search include some notion of paging through the results. We use a multi-level paging system, with the search server maintaining a server-side page for each query and serving smaller client-facing pages from.

Vanilla Lucene uses the following algorithm for accumulating search results:

  1. Load all matching results.
  2. Sort by some relevance metric(s).
  3. Return the top n results.

The results are cached as a server-side page in case the client wants to load more than the first n results. You can see where this could run into trouble: if the total number of matching documents is high, that’s a lot of wasted RAM while we winnow it down to the size of the server page. So we use the following algorithm:

  1. Construct a priority queue of constrained size with priority computed using the chosen relevance metric
  2. Stream through the results, inserting into the queue
  3. Return the set of results in the priority queue

Now we never need more RAM than the size of a server-side page to serve results. The downside is that if the client wants more than one server-side page, we have to run the search — in its entirety — twice (ouch). To avoid the first set of results, we adjust the priority queue to kick out all results that were in the first page based on relevance metric.

Using bitsets to optimize range queries

A range query can return a result set of very high cardinality – a range is a very compact way of describing a large set of matching terms (even if they are discrete values, like dates). One way to think about a range query of, say, 10 <= age <= 15, is that it expands to age = 10 OR age = 11 OR age = 12 OR age = 13 OR age = 14 OR age = 15. Rather than treat range queries in any special way, Lucene just does this expansion of the range and runs the query like a normal query.

Internally, Lucene stores a list of metadata nodes, ordered by document id, of each document that matches a given term. The algorithm goes something like this:

  1. Open the document id lists for all matching terms
  2. Walk the list pointers for each potential match such that you accumulate all the metadata for a given document.
  3. Pass all this metadata up to the query processor which decides:
    1. Does this document match the overall query? (remember that terms can be inverted)
    2. Use term frequency taken from the metadata to calculate the relevance.

This structure and attendant algorithm has some nice properties:

  • All documents are processed in a set order.
  • Everything is known about a document all at once.
  • It terminates in a single linear scan.

… and has one very nasty property:

  • All of the term value buckets that match the range must be open simultaneously.

This is not a big deal for most English language queries. However, for large ranges and the like, there can be thousands or even millions of terms.

The semantics of range queries have an interesting feature: a document that matches the range twice is not more relevant than one that matches once. (Contrast this with a simple term query: multiple matches do indicate higher relevance). Being able to discard the accounting of how many time we match the range leads to a huge win:

  1. We only need a single bit to represent a match
  2. We can process a single term value bucket at a time instead of holding all buckets open in memory.

Our search engine accumulates range queries into bitset objects, allowing for a very compact representation of results. We need much less memory than we did before since we only load one term value bucket at a time. And the algorithm is simpler: no more walking pointers or O(n) check before figuring out which pointer moves next.

The next episode

Tune in for Palantir: search with a twist (part two) in a few weeks. I’ll cover the following topics:

2 Responses to “Palantir: search with a twist (part one: memory efficiency)”

  1. Mark Harwood Says:

    Some derogatory and misleading information here about Lucene.
     
    >>Vanilla Lucene uses the following algorithm for accumulating search results: Load all matching results.
     
    No it doesn’t and never has. PriorityQueues are used everywhere.
     
     
    >>RangeQuery… has one very nasty property
     
    That is why the Javadocs tell you not to use it and the QueryParser hasn’t supported it as the default for quite some time.
    The alternative approach you describe is implemented in ConstantScoreRangeQuery which is the newer default in the query parser.
     
    Are you working on very old versions of Lucene? If so please target your comments at something vaguely recent or take the trouble to read the current documentation more closely before spreading bad advice.

  2. Katherine Says:

    Mark,
    Sorry if the blog post didn’t make it clear – we think Lucene is some of the best third-party searching software out there, which is why we use it as the base index/search engine, and we’re not trying to discourage people from using it. However, no generic library is going to fit our exact use case, and we’re outlining some places where we’ve run into that. The post also elided over some technical details for the sake of brevity – apologies if you feel that amounted to a misrepresentation of Lucene.

    About loading results: Lucene does use a PriorityQueue for the results, so our description would be substantially different from what actually happens if we didn’t want access to all the matching results. Unfortunately, our use case requires that users be able to see all matching results if they want to – in order to get n results from out-of-the-box Lucene, even if you plan to break them into k pages of size n/k, you have to size the PriorityQueue to hold n results. This means that you’re effectively loading all matching results – supporting paging over chunks of results is not built in.

    About RangeQueries: We initially started using Lucene in 2005, so some of our custom code has been replicated in later versions. Some of it hasn’t – for instance, we use the same range-query logic for both ranges and for broad wildcards, although only numeric ranges were discussed in the post. The latter does not have support that I know of as of Lucene 2.4 (the version we’re currently using, and the most current release as of the time of the original post).
    However, the 2.9 release that came out last week looks like it may have added similar functionality, using index-size-based cutoffs instead of the memory-size-based ones we use (I haven’t traced through all the new source code yet, but that’s what I’ve seen so far).
    And, since Lucene is a generic platform, some custom code has been replicated in a more generic version. For example, Payloads, which were introduced after we started using Lucene, are similar to but not as optimized for our use case as how we do security enforcement.

Leave a Reply


Palantir