Palantir: like an operating system for data analysis

November 6th, 2009 | Ari

If you’ve taken the time to peruse the Palantir Government analysis blog, you’ve seen numerous examples of Palantir Government as applied to interesting problems; they are recorded screen captures of our analysis desktop client. It’s a showcase of useful, meaningful, and compelling visual and semantic tools being used to do analysis on a wide range of datasets.

What enabled this analysis? Aside from the obvious hard work of our UI and analysis tools teams, it’s the flexibility and power of the Palantir data platform. More than just a scalable datastore, the Palantir data platforms act as robust and clean abstractions on top of data.

One of the early architecture decisions that we made when building both Palantir Government and Palantir Finance was to separate the respective data platforms from the end-user applications used to actually perform analysis. More than just following the client-server model, this separation made the data servers in both products into generic intelligence infrastructure for analytic problems, with our clients acting as analysis applications on top of those platforms.

And so, one way to look at our data platform is as an operating system for analytic applications. In this post we’ll explore the history of operating systems, understand why they’re so important and see how the Palantir data servers deliver the same potential to revolutionize the writing of analysis software that operating systems did to the writing of general programs for computers.

The OS: abstraction that begat a paradigm

In the early days of computing, when a programmer wanted to write a program, they had to understand the inner workings of the machine. Writing a program required understanding things like the bus interface of a specific model of hard drive when all that was needed by the program was the clean abstraction of a filesystem. The upshot of this is that much of the time and effort put into a given task was spent writing code to interface with the “physical” minutiae of the machine rather than implementing the solution to the problem that the programmer was trying to solve with their software.

This pattern was observed by J.C.R. Licklider and noted in his influential paper, Man-Computer Symbiosis (emphasis added):

About 85 per cent of my “thinking” time was spent getting into a position to think, to make a decision, to learn something I needed to know. Much more time went into finding or obtaining information than into digesting it. Hours went into the plotting of graphs, and other hours into instructing an assistant how to plot. When the graphs were finished, the relations were obvious at once, but the plotting had to be done in order to make them so.

Throughout the period I examined, in short, my “thinking” time was devoted mainly to activities that were essentially clerical or mechanical: searching, calculating, plotting, transforming, determining the logical or dynamic consequences of a set of assumptions or hypotheses, preparing the way for a decision or an insight. Moreover, my choices of what to attempt and what not to attempt were determined to an embarrassingly great extent by considerations of clerical feasibility, not intellectual capability.

This description of his time as a researcher was echoed in the work of the early programmers: they spent much of their programming time re-inventing the wheel and writing routines that did essentially clerical or mechanistic work related to the functioning of the hardware rather than the core functions of their programs.

The operating system changed all that: suddenly (and by that I mean: with years of hard work, research, and incremental change) that noisy, inconsistent pile of hardware was transformed into a set of clean abstractions. The programmer was finally freed to spend time and energy on the problem they were really trying to solve.

And so we come to the modern era: dealing with the messy details of hardware has been replaced by the clean and robust abstraction of the operating system.

Three important properties of modern operating systems:

  • Hard boundaries between OS functions and process functions – in modern operating systems, this is usually accomplished with system calls. The process places the inputs to the system call in a known location and then asks the OS to perform some operation, like writing to a file or making a network connection. The OS may or may not perform the function, based on things like permissions, availability of resources, etc.

    The most important feature here is that the process never has direct access to the true resources of the machine; instead, all access to the machine’s resources is brokered by the OS.

  • Extensions of the abstraction in every direction – An OS like Linux is really, at its core, a kernel that does process scheduling and lifecycle, manages memory, and services system calls. Everything else is handled by some sort of driver. A driver might also be called, more generically, a plugin or extension. Drivers exist for everything from block devices (like hard drives), network cards, and filesystems to input devices and displays.
  • Designed as a general purpose framework – the operating system doesn’t actually do any computing; rather, it’s a set of services to facilitate processes using the resources of the computer. To that end, operating systems are not designed with a specific process in mind, but rather to serve a large class of programs, each designed and written to accomplish a different task using a similar set of resources.
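To make the first property concrete, here is a minimal sketch of a process asking the OS to act on its behalf. It uses Python's `os.open`, `os.write`, and `os.read`, which are thin wrappers over the corresponding system calls; the helper names (`write_then_read`, `try_open`, `demo`) are made up for this illustration.

```python
import os
import tempfile


def write_then_read(path: str, payload: bytes) -> bytes:
    """Ask the OS to write a file, then ask it to read the bytes back."""
    # The process never touches the disk itself: it hands its inputs to the
    # OS and gets back a file descriptor, an opaque handle the OS controls.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, payload)
    finally:
        os.close(fd)

    fd = os.open(path, os.O_RDONLY)
    try:
        return os.read(fd, len(payload))
    finally:
        os.close(fd)


def try_open(path: str) -> str:
    """Report whether the OS granted or refused a request to open a file."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        # The OS decided not to perform the operation (missing file,
        # insufficient permissions, exhausted resources, ...).
        return f"refused: {exc.strerror}"
    os.close(fd)
    return "granted"


def demo():
    """Run one granted and one refused request inside a scratch directory."""
    with tempfile.TemporaryDirectory() as scratch:
        existing = os.path.join(scratch, "notes.txt")
        missing = os.path.join(scratch, "does-not-exist.txt")
        data = write_then_read(existing, b"hello")
        return data, try_open(existing), try_open(missing)
```

Every request here can be refused, and the process only finds out from the OS's answer. That is the hard boundary at work.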

Analysis: the modern computing task

The first computer, ENIAC, was conceived to do calculation of ballistics tables for artillery pieces — it was a glorified calculator. Lacking anything even resembling an operating system, it would just run its program. Its compiler? A group of six women who would configure the machine by hand with the program logic. The input for its first test run, a calculation related to the hydrogen bomb project, was approximately one million punch cards.

Times have changed: 40 or so years of the unrelenting march of Moore’s Law have given us something like an eight-order-of-magnitude increase in the amount of computing power available per unit cost. Coupled with similar, more recent gains in storage capacity and network bandwidth, this has produced a world awash in data, crying out for analysis.

So the situation today is that we now expect to bring these considerable computing resources to bear on larger, more complex problems in the world. I’m talking about things like the spread of food-borne illnesses, understanding the connection between genes and protein expression, understanding terrorist networks, finding botnets in network traffic logs, and exploring influence networks in government.

These problems, while spanning widely disparate areas of analysis, share some common traits:

The data is spread out

They are described by multiple data sources. Just to make things more interesting: the data sources don’t agree on their native representations of the real-world data. And finally, the real-world objects that the data are describing are actually described in multiple data sources, with no single source giving a complete and accurate representation.

The data schemas are not human-conceptual

Rather than representing the data in some schema that maps easily into how the experts on a given problem think about said problem, the data stores in question tend to model data in whatever way was convenient for the creators of that particular data store. Put another way: people don’t think in tables, rows, columns, and XML snippets. These first-class data storage elements don’t usually map to real-world objects.

The data is sensitive

Whether it’s patient information, mortgage data, a law enforcement investigation, or sensitive foreign intelligence, there is often the need for foolproof access controls on the data.

Palantir: an operating system-class abstraction for analysis

A Palantir data server provides a class of services similar to those of an operating system, but focused on the specific needs of analytic tasks. Here I’ll focus on the model used by Palantir Government; Palantir Finance delivers these services with a related but significantly different approach.

As you might imagine, however, they both start at a somewhat higher level than punch cards.

It starts with an ontology

The Palantir approach to analysis begins with a task-specific ontology: essentially, a human-conceptual description of the real-world problem that’s being analyzed.

It’s roughly composed of three pieces:

  • A hierarchical type system of the real-world objects that human experts use to think about this problem. We call these PTObjects, short for “Palantir Objects”.
  • A type system of properties that will contain the data describing these PTObjects. PTObjects are essentially typed containers for properties. This is where most of the detail of the ontology lies.
  • A type system of possible relationships between different types of PTObjects.

Within the ontology, there are numerous extension points that allow the customization of how data is imported, retrieved, and displayed (following the principle of extending the abstraction in all directions).

The data server takes the ontology as input and is agnostic to its content. This is where the principle of building a general purpose framework comes into play.
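As a rough illustration of those three pieces, here is a minimal sketch of an ontology as plain data. The class names (`ObjectType`, `PropertyType`, `LinkType`, `Ontology`) and the example types are assumptions made up for this sketch; they are not the actual Palantir API.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class ObjectType:
    """One node in the hierarchical type system of real-world objects."""
    name: str
    parent: Optional["ObjectType"] = None

    def is_a(self, other: "ObjectType") -> bool:
        # Walk up the hierarchy until we find `other` or run out of parents.
        node: Optional[ObjectType] = self
        while node is not None:
            if node == other:
                return True
            node = node.parent
        return False


@dataclass(frozen=True)
class PropertyType:
    """A typed slot for data describing an object, e.g. a name or a date."""
    name: str
    value_type: type


@dataclass(frozen=True)
class LinkType:
    """A relationship allowed between two object types."""
    name: str
    source: ObjectType
    target: ObjectType


@dataclass
class Ontology:
    """The three type systems, bundled as the input a server could consume."""
    object_types: Dict[str, ObjectType] = field(default_factory=dict)
    property_types: Dict[str, PropertyType] = field(default_factory=dict)
    link_types: Dict[str, LinkType] = field(default_factory=dict)

    def allows_link(self, link: str, source: ObjectType, target: ObjectType) -> bool:
        rule = self.link_types[link]
        return source.is_a(rule.source) and target.is_a(rule.target)


def example_ontology() -> Ontology:
    """Build a tiny, hypothetical ontology about people and organizations."""
    entity = ObjectType("Entity")
    person = ObjectType("Person", parent=entity)
    org = ObjectType("Organization", parent=entity)
    return Ontology(
        object_types={t.name: t for t in (entity, person, org)},
        property_types={"name": PropertyType("name", str)},
        link_types={"employee_of": LinkType("employee_of", person, org)},
    )
```

Nothing in `Ontology` knows about people or organizations. A different problem would supply different types to the same machinery, which is the sense in which the server is agnostic to the ontology's content.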

The data sources are mapped into the ontology

This part of the Palantir data server is a pattern that is very similar to an operating system’s notion of block device drivers. The difference? Instead of low-level storage systems like hard drives, we’re dealing with complex databases describing the problem at hand.

In an operating system, every block device can read and write blocks of data. In the Palantir data server, everything becomes a source of PTObjects.

Our data importer plugins, by analogy, fulfill the same role as a block device driver: we build glue code to map the data source’s schema into the ontology, plus the connectors that surface the data itself wrapped up in PTObjects.
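The driver analogy can be sketched as a single interface that every data source implements. This is an illustration under assumptions: the `Importer` interface, the two example importers, and the fields of `PTObject` are hypothetical names invented here, not the real plugin API.

```python
import csv
import io
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class PTObject:
    """Simplified stand-in: an object type plus a bag of named properties."""
    object_type: str
    properties: Dict[str, str] = field(default_factory=dict)


class Importer(ABC):
    """The 'driver' contract: whatever the source looks like, yield PTObjects."""

    @abstractmethod
    def objects(self) -> Iterator[PTObject]:
        ...


class CsvPersonImporter(Importer):
    """Maps a CSV export's column names onto ontology property names."""

    COLUMN_TO_PROPERTY = {"full_name": "name", "tel": "phone"}

    def __init__(self, csv_text: str) -> None:
        self._csv_text = csv_text

    def objects(self) -> Iterator[PTObject]:
        for row in csv.DictReader(io.StringIO(self._csv_text)):
            props = {
                self.COLUMN_TO_PROPERTY[column]: value
                for column, value in row.items()
                if column in self.COLUMN_TO_PROPERTY and value
            }
            yield PTObject("Person", props)


class RecordPersonImporter(Importer):
    """A second source that stores the same kind of data in a different shape."""

    def __init__(self, records: List[Dict[str, str]]) -> None:
        self._records = records

    def objects(self) -> Iterator[PTObject]:
        for record in self._records:
            full_name = f"{record['first']} {record['last']}"
            yield PTObject("Person", {"name": full_name, "email": record["mail"]})


def load_all(importers: List[Importer]) -> List[PTObject]:
    """Everything downstream sees only PTObjects, never the native schemas."""
    return [obj for importer in importers for obj in importer.objects()]
```

The two sources disagree on how to represent a name, but that disagreement stays inside the importers. Code that calls `load_all` never sees it.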

The data are composed into real-world objects

Part of this mapping is composing real-world objects into composite PTObjects by resolving PTObjects together.

The operation of resolving is pretty straightforward: we basically union the properties of the two PTObjects into a new PTObject. The end result is a single PTObject that completely represents all the data about something in the real-world from all the available data sources.

As we do this composition, we keep track of where each property came from, down to the record level, in each of its original sources. (Note that most composed PTObjects will have at least one property that comes from two sources.) Preserving the original identity of every atom of data allows us to later decompose these PTObjects into their constituent parts or, more importantly, censor a client’s view based on what permissions they have for each of the original data sources.
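Here is a minimal sketch of resolving with provenance, assuming each property carries the name of its source and the id of its source record. The `Property` fields and the `resolve`, `decompose`, and `censor` helpers are names invented for this illustration, not the server's real data model.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set, Tuple


@dataclass(frozen=True)
class Property:
    """One atom of data, tagged with exactly where it came from."""
    name: str
    value: str
    source: str     # which data source supplied it
    record_id: str  # which record within that source


@dataclass(frozen=True)
class PTObject:
    properties: FrozenSet[Property]


def resolve(a: PTObject, b: PTObject) -> PTObject:
    """Union the properties of two objects into one composite object."""
    # The same value reported by two sources is kept twice, once per source,
    # because the provenance tags make the two Property instances distinct.
    return PTObject(a.properties | b.properties)


def decompose(obj: PTObject) -> Dict[Tuple[str, str], PTObject]:
    """Split a composite back into one object per original source record."""
    groups: Dict[Tuple[str, str], Set[Property]] = {}
    for prop in obj.properties:
        groups.setdefault((prop.source, prop.record_id), set()).add(prop)
    return {key: PTObject(frozenset(props)) for key, props in groups.items()}


def censor(obj: PTObject, allowed_sources: Set[str]) -> PTObject:
    """Project the object down to the sources a client may see."""
    visible = frozenset(p for p in obj.properties if p.source in allowed_sources)
    return PTObject(visible)
```

Because nothing is overwritten during `resolve`, both `decompose` and `censor` are simple filters over the provenance tags.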

This is a fundamental operation in our system that doesn’t have an exact analog in operating systems. It’s sort of similar to taking multiple filesystems and mounting them inside a virtual filesystem tree, like Unix does. However, if each data source is like a filesystem, what we’re doing is essentially composing individual files from their fragments stored on multiple block devices.

Another analogy: at a level below the block device in the OS, this is also sort of similar to what a RAID0 device does, the difference being that our composition is based on the contents of the data itself rather than some previously applied, content-agnostic, decomposition function. The other difference being motivation: a RAID0 does it for performance, while Palantir is composing data to make it correspond to the real-world objects it represents.

The server exposes Palantir “system calls”

The interface that the Palantir data server exposes can be boiled down to two essential operations:

  • The client can download copies of PTObjects from the server. It may request them by id or perform some sort of search/query to specify a set of PTObjects. This is roughly analogous to the open() and read() system calls on Unix.

    Note that each client only sees the subset of properties for a given PTObject that it is authorized to see. This censorship of full PTObjects into projected slices is something done by the server on every load of PTObjects.

  • The client can send new or updated PTObjects to the data server for storage. This is roughly analogous to the write() system call in Unix. It, of course, entails a check as to whether the given client has permission to write to the given PTObject.

The server’s responsibility is the same as the operating system’s: only let the client do what it has been granted permission to do. In an operating system, the OS uses hardware features like protected mode to keep lower-privileged processes from accessing machine resources. Palantir uses network calls to achieve the same separation, by placing the client and server on different logical machines. The effect is the same: the client basically requests (rather than commands) that certain operations be performed by the server. The server uses its own rules to decide if the access or change is allowed and responds accordingly. And so the principle of hard boundaries is implemented.
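The two operations and their permission checks can be sketched as an in-memory stand-in for the server. Everything here is an assumption for illustration: the `DataServer` class, its `load` and `store` methods, and the way grants are stored (read access per data source, write access per object id) are invented names and choices, and a real deployment would sit behind a network boundary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass(frozen=True)
class Property:
    name: str
    value: str
    source: str  # the data source this property came from


class AccessDenied(Exception):
    """Raised when the server refuses a client's request."""


@dataclass
class DataServer:
    # client name -> data sources that client may read
    read_grants: Dict[str, Set[str]]
    # client name -> object ids that client may write
    write_grants: Dict[str, Set[str]]
    objects: Dict[str, List[Property]] = field(default_factory=dict)

    def load(self, client: str, object_id: str) -> List[Property]:
        """Return only the slice of the object this client may see."""
        allowed = self.read_grants.get(client, set())
        stored = self.objects.get(object_id, [])
        return [prop for prop in stored if prop.source in allowed]

    def store(self, client: str, object_id: str, props: List[Property]) -> None:
        """Accept new properties only if the client may write this object."""
        if object_id not in self.write_grants.get(client, set()):
            raise AccessDenied(f"{client} may not write {object_id}")
        self.objects.setdefault(object_id, []).extend(props)
```

The client can only ask. The filtering in `load` and the check in `store` both run on the server's side of the boundary, using the server's own rules.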

The clients do the analysis

When an operating system yields to a process, that’s the time when the true processing begins. By the same token, in Palantir, it’s not until a client connects and starts searching, visualizing, and manipulating PTObjects that analysis actually starts taking place (even if the server is doing a lot of the heavy lifting).

The wide open future

So why is this exciting? I’m glad you asked!

It’s about taking analysis to the next level.

Let’s say you’re someone who wants to write an analytic task. Let me ask you a series of rhetorical questions:

  • Do you want to start with three disparate sources of data or with the data already mapped into a Palantir data server?
  • Which one is a better use of your time as a programmer?
  • Which one allows you to not repeat mistakes that other programmers have already made and fixed?
  • Which one is more like writing a program than an operating system?

Operating systems took us to a new level of expressiveness when it came to writing computing processes to run on computing hardware. They inverted that 85/15 ratio that Licklider talked about so that programmers spent more time writing the code that did the thing they were trying to create and less time mucking around with hardware.

More programmer time == better analytic tasks.

It’s about making machine learning easier.

Now consider machine learning as a field. Pretty much every machine learning task could benefit from starting with its data in something that looks like a Palantir data server. I’ve taken an informal survey of machine learning researchers and they agree: the 85/15 ratio still holds for machine learning.

Simply put: most of the time and effort in machine learning is spent getting the data into a form that you can actually apply an algorithm to! Now imagine if the starting point for that was a Palantir data server — now the machine learning implementer has a world of expressiveness open to them and time and energy are spent on the task at hand instead of the overhead of messing with the data.

Now, we don’t think that we’re building Skynet. Quite the contrary: we believe that platforms like the one we’ve built will allow machine learning techniques to be put in the hands of experts to augment their ability to look at the world and come to conclusions about complex real-world problems by asking questions of the data we’ve collected. It’s about Intelligence Augmentation, which can use machine learning techniques and algorithms to build better tools, not creating Strong AI.

It’s about creating new markets

Let’s go back to the well of operating systems and look back at the history of MS-DOS: the first “killer” application on MS-DOS was VisiCalc (that screenshot at the top of this post), a text-based spreadsheet. As you know, VisiCalc was not the end of the story but just the introduction. MS-DOS, which evolved into Windows, gave application writers an (arguably) clean abstraction on top of commodity hardware on which to build the applications that users actually wanted. Today, we have things like web browsers, multimedia authoring software, virtual machines, and IDEs built on top of what is, essentially, the same set of abstractions that VisiCalc was built on.

However, the most important thing to note is that VisiCalc is credited with creating the market for commercial operating systems — businesses needed VisiCalc so they paid Microsoft for MS-DOS (and IBM for a PC). Without VisiCalc, there was no market for MS-DOS (most people, unsurprisingly, didn’t want to buy a BASIC interpreter).

We’re in the business of selling software and we agree with our customers: the Palantir approach has tremendous value. We’ve just started tapping the potential of this market. Think about what Oracle looked like in 1979, think what Microsoft looked like in 1980 — that’s Palantir in 2009.

It’s about the start of the analysis age

It can be argued that the operating system is the innovation that ushered in the “information age”. Without the operating system, there is no software explosion, which allows computing technology to actually be used on data in the world.

We think that we’re on the cusp of the analysis age, as imagined by Vernor Vinge in Rainbows End. It was something foreseen by Licklider in 1960, albeit with a timeline that was off by at least a few decades:

“…it seems worthwhile to avoid argument with (other) enthusiasts for artificial intelligence by conceding dominance in the distant future of cerebration to machines alone. There will nevertheless be a fairly long interim during which the main intellectual advances will be made by men and computers working together in intimate association. A multidisciplinary study group, examining future research and development problems of the Air Force, estimated that it would be 1980 before developments in artificial intelligence make it possible for machines alone to do much thinking or problem solving of military significance. That would leave, say, five years to develop man-computer symbiosis and 15 years to use it. The 15 may be 10 or 500, but those years should be intellectually the most creative and exciting in the history of mankind.”

It’s a golden age of analysis and we’re just getting started: we’ve got a lot of work to do, so if this sort of thing excites you, please come and join us.
