Palantir: search with a twist (part two: realtime indexing and security)
October 27th, 2009 |
[A number of weeks ago, we published a post on the search technology used by Palantir. That post covered raising the memory efficiency of a couple of operations. This is part two of that series.]
The most familiar use of search engines is to index documents made available on the Internet via the hypertext transfer protocol. Forgotten names like AltaVista, names not-yet-really-learned like Bing, and, of course, Google come to mind.
This one, massive use case has a couple of properties that I’d like to highlight:
- Asynchronous indexing and querying – web search engines tend to use crawlers and indexers to build up an index of the web. After each crawl is finished, the new index is brought online for use by the query engine.
- Lack of access controls – all the data in the index is available to any query. In fact, most queries are (from the standpoint of the index) completely anonymous.
Palantir: not a web search engine
Search technology is just one part of what makes up a Palantir system. For us, it’s a way to quickly retrieve Palantir objects in a Palantir system, it’s not the whole of the application.
I’d like to highlight a couple of differences from the web search engine case. A Palantir system needs the following properties:
- Realtime indexing and querying – we need information to be available immediately as it changes in the system.
- Leak-proof access controls – we need the search engine to help us make sure that we don’t have information leaking across access control boundaries.
Hit the link to read more about these topics.
Realtime indexing
The Palantir platforms implement realtime indexing: as soon as an analyst changes an object in the system, it needs to be available to query. This could be a change to data in the object or a change to the security tags on the object.
From a programming perspective, this is pretty straightforward: a Palantir transaction will not commit until the search engine is finished indexing the new data.
From a search engine operational perspective, this induces some challenges. Asynchronous indexing allows the search engines to bring online a highly optimized static form of the index. Contrast that with realtime indexing, where every cycle spent optimizing the index is removing cycles from serving other queries and there is likely a human waiting for the optimizing process to finish.
When using the static index, a query only accesses one, optimized index file which then points to the documents containing the results. However, as changes and additions are indexed into the system, there is a lot of overhead to merging them into the master index.
Instead of merging and optimizing on every change, Lucene can keep around a number of smaller indexes that hold all the fresh entries. These are fixed-size append-only segments that are much cheaper to write to than the optimized and merged form of the index. So basically, these ‘dynamic’ indexes are linear lists of single-document indexes. When the search engine goes to run a query it has to follow this simple (yet expensive) algorithm:
- Query the static, merged index, accumulating results. (this part is reasonably fast)
- For each of the dynamic indexes:
- Open the file, incurring IO overhead.
- Query each single-document index and look for additional records or newer records that supersede one of the existing found results.
You can see how the overhead of this can quickly get pretty large as the number of dynamic indexes grows: it grows linearly with number of new indexed records. Compare that with the optimized index, which should be close to constant time for any given query.
To get around this, the indexer will only allow a certain number of these dynamic indexes to accumulate before it kicks off a background merge job. During the merge job, we take a noticeable performance hit, but by batching up the merge run we amortize the overhead away for an overall performance win. This hybrid mode didn’t require us to write any new code, but just to tune Lucene to give us the performance profile we wanted.
Preventing Information Leaks
The Palantir data platform has a fairly sophisticated security model baked in (see The White Videos for a more in-depth look at the security model). One of the features that we have implemented is the ability to show a narrower view of an object based the user’s permissions: the user only sees the slice of the data that they have been granted access to. Part of the complexity in implementing this was that we can’t even hint that the other, hidden data exists at all.
Search engines ranks their results by relevance, showing the matches to the query that it believes to be most relevant first. One common way to make these relevance calculations is by comparing the length of the search term or phrase to the length of the term that it matched. Consider the search term ‘king’: it will match the following phrases:
- “I’m the king of the world!”
- “King salmon are often found in the Pacific Northwest and are also known as Chinook salmon.”
- “Yes, my king.”
Using a length-computed relevance, the phrase, “Yes, my king.” is the most relevant.
Getting back to the Palantir object model: for each distinct set of permissions that an object has, we compute a different object label based on the properties that are visible to that particular slice. These multiple titles all go into the search engine. If we were to compute relevance based on the length of the phrase that matched, and the shortest match on the object is shorter than the match that is actually visible to us, we could return the object with a higher-than-obvious relevance. If we were to do that, we’d be leaking information, namely that there’s data on this object that the user making the query is not privy to. (Note that filtering of objects that aren’t at all visible to the user is done in a higher layer after the results have been accumulated and ranked by the search engine.)
Given this problem, there are two approaches one can take:
- Store all the information needed to decide which labels are visible to the user running the query and then use only the visible labels when calculating the relevance of a match. Note that is a pretty expensive operation.
- Don’t use the length of match to compute relevance. Note that skipping a relevance calculation is, obviously, a very cheap thing do.
Which do we do? Both.
When matching against object labels, the length metric actually lets us discern between better and worse matches. So in that case, we incur the cost of this calculation in order to return higher quality results.
However, when matching against things like document bodies, the ratio of the size of the match to the size of the search term starts to have less meaning but still has the possibility of leaking information in the query results. For fields like this, we turn off the relevance calculations based on length of match. The upshot is the we don’t have to store the permissions information in the index nor incur the cost of the permissions/views calculation for these fields.
A heartfelt thank you
To be clear, this post highlights the ways in which our search code diverges from the main Lucene code base. We’re huge fans of Lucene and have great respect for the developers that built and maintain what is probably the world’s greatest open-source search engine.





