Palantir Config Server: lining up the ducks
March 6th, 2009

At Palantir, we build distributed software. When deployed at a customer site, our platform consists of several servers running on, and distributed across, a cluster of machines. When I first joined the company, deploying and managing our platform was tedious and time-consuming. Need to install servers? One by one, log in to the machines where they need to go, lay down their requisite files and manually configure them so that they can work together. Have to bring down a deployment for scheduled maintenance? One by one, and in the correct order, log in to the machines where the servers reside and shut them down. Want to change the private keys and certificates used to secure communication between servers? Well, you get the point.
From a customer perspective, the complexity associated with the administration of distributed software represents a significant challenge. Not providing tools to help reduce that complexity impacted the overall usability of our platform. Furthermore, from a Palantir perspective, a non-trivial portion of our resources was being devoted to deploying and managing instances of our platform, both externally (by Forward Deployed Engineers working directly with our customers) and internally (by development, QA and support staff working to maintain and improve our product). Could we be more efficient? No doubt. Given our intense focus on customer satisfaction and the desire to grow and scale our business, action was necessary.
To see how we solved this problem, read on.
We stepped back a bit, taking time to reflect on our situation and understand the problem. Based upon our experience, what key areas would a solution need to address? We settled on the following:
- Lifecycle management.
  - Ease initial deployment and upgrade.
  - Handle coordinated starting, stopping and restarting.
- Configuration management.
  - Track which servers are installed on what machines.
  - Provide centralized management of server configuration information.
- Automation.
  - Support encoding common management tasks based on best practices.
In addition to those three key areas, we also identified several important requirements. A couple that definitely warrant mention:
- Security.
- Extensibility.
After getting a good sense of what needed to be accomplished, we investigated whether an existing solution would fit the bill. For a variety of reasons (e.g., available feature set and licensing constraints), we never found a good match. We did, however, come across several open source building blocks that could, when composed appropriately, form the foundation of a homegrown solution. The Config Server was born.
Architecture

The Config Server works with remote agents to enable centralized deployment management. The diagram presented above provides an overview of our management infrastructure. Below is a brief discussion of each key component of our architecture.
- Agent – Agents are installed on every machine in a deployment. They are lightweight background processes that sit around waiting to execute commands submitted by the Config Server, interacting directly with the services installed on a given machine. Instead of implementing our own agent solution, we decided to leverage existing technology, the open source peer-to-peer Software Testing Automation Framework (STAF). From its homepage:
The Software Testing Automation Framework (STAF) is an open source, multi-platform, multi-language framework designed around the idea of reusable components, called services (such as process invocation, resource management, logging, and monitoring). STAF removes the tedium of building an automation infrastructure, thus enabling you to focus on building your automation solution. The STAF framework provides the foundation upon which to build higher level solutions, and provides a pluggable approach supported across a large variety of platforms and languages.
We added support for two-way SSL to STAF to enhance the security of our management infrastructure (specifically, to allow us to implement authorization based on self-signed certificates). But beyond that, no modification was necessary. STAF provides us with a robust solution for remote process invocation and file management, both absolutely essential for centralized deployment management.
- Agent Manager – The Agent Manager provides lifecycle and configuration management functionality for the agents in a deployment. It interacts with remote machines through SSH, using the open source Trilead SSH for Java library.
- Config Registry – The Config Registry maintains and provides access to all of the information the Config Server has about a deployment. It consists of the following:
- Agent Registry – The Agent Registry contains information about all of the agents in a deployment.
- Service Registry – The Service Registry keeps track of all of the services in a deployment.
- Config Repository – The Config Repository is a central store for configurations of the agents and services in a deployment.
- Package Repository – The Package Repository holds all of the service packages that can be installed in a deployment.
- Plugin Repository – The Plugin Repository houses all of the plugins that are available for use in the Config Server. Plugins are used by the Security Manager, Service Manager and Task Manager.
- Security Manager – We secure our servers and management infrastructure using public key cryptography. The Security Manager handles the generation and packaging of private keys and certificates. We perform private key and certificate generation using the Bouncy Castle Crypto APIs for Java. Packaging is taken care of by plugins in the Plugin Repository. For example, one plugin packages private keys and certificates into JKS files for use with Java, while another packages them into PEM files for use with OpenSSL.
- Service – Services represent the software installed on the machines in a deployment that drives our platform. They correspond to the servers we’ve built and the third-party offerings on which they depend (e.g., databases and entity extractors).
- Service Manager – The Service Manager interacts with agents to provide lifecycle and configuration management functionality for the services in a deployment. The actual mechanics of lifecycle and configuration management vary from service to service. For example, starting service A might require invoking one script, while starting service B might require invoking another. For each type of service in a deployment, the Plugin Repository contains a corresponding plugin that embeds the necessary management logic. The Service Manager works with those plugins to get its job done.
- Task Manager – Managing a deployment requires performing tasks that go beyond lifecycle and configuration management for its constituent agents and services (e.g., log aggregation and database user creation). Such tasks are implemented as plugins. They make things happen by communicating with agents and / or directly with machines via SSH. The Task Manager interacts with the Plugin Repository to load tasks and coordinate their execution.
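To make the plugin idea concrete, here is a minimal sketch in Java of how a Service Manager might delegate to per-service-type plugins. It is not the Config Server's actual code: the `ServicePlugin` interface, the `ScriptPlugin`, `PluginRepository` and `RecordingAgent` classes, the service types and the script paths are all hypothetical, and the agent only records commands instead of running them remotely through STAF.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plugin contract: one implementation per type of service.
interface ServicePlugin {
    String serviceType();
    List<String> startCommand(String serviceName);
    List<String> stopCommand(String serviceName);
}

// Hypothetical plugin that manages a service through a control script.
class ScriptPlugin implements ServicePlugin {
    private final String type;
    private final String script;

    ScriptPlugin(String type, String script) {
        this.type = type;
        this.script = script;
    }

    public String serviceType() { return type; }

    public List<String> startCommand(String serviceName) {
        return Arrays.asList(script, "start", serviceName);
    }

    public List<String> stopCommand(String serviceName) {
        return Arrays.asList(script, "stop", serviceName);
    }
}

// Stand-in for a remote agent: records commands instead of executing them.
class RecordingAgent {
    final List<String> executed = new ArrayList<>();

    void execute(List<String> command) {
        executed.add(String.join(" ", command));
    }
}

// Stand-in for the Plugin Repository: finds the plugin for a service type.
class PluginRepository {
    private final Map<String, ServicePlugin> plugins = new HashMap<>();

    void register(ServicePlugin plugin) {
        plugins.put(plugin.serviceType(), plugin);
    }

    ServicePlugin lookup(String serviceType) {
        ServicePlugin plugin = plugins.get(serviceType);
        if (plugin == null) {
            throw new IllegalArgumentException("No plugin for service type: " + serviceType);
        }
        return plugin;
    }
}

// The manager knows nothing about individual services; plugins carry that logic.
class ServiceManagerSketch {
    private final PluginRepository repository;
    private final RecordingAgent agent;

    ServiceManagerSketch(PluginRepository repository, RecordingAgent agent) {
        this.repository = repository;
        this.agent = agent;
    }

    void start(String serviceType, String serviceName) {
        agent.execute(repository.lookup(serviceType).startCommand(serviceName));
    }

    void stop(String serviceType, String serviceName) {
        agent.execute(repository.lookup(serviceType).stopCommand(serviceName));
    }

    public static void main(String[] args) {
        PluginRepository repository = new PluginRepository();
        repository.register(new ScriptPlugin("service-a", "bin/a-ctl.sh"));
        repository.register(new ScriptPlugin("service-b", "bin/b-ctl.sh"));
        RecordingAgent agent = new RecordingAgent();
        ServiceManagerSketch manager = new ServiceManagerSketch(repository, agent);
        manager.start("service-a", "a1");
        manager.stop("service-b", "b1");
        System.out.println(agent.executed);
    }
}
```

In a design like this, supporting a new type of service means registering one more plugin; the manager itself does not change.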
Functionality
How did we do with respect to our stated needs?
- Lifecycle management – The Agent Manager and Service Manager provide centralized lifecycle management. Initial deployment and upgrades, as well as starting, stopping and restarting servers, can all be handled directly through the Config Server.
- Configuration management – The Config Repository of the Config Server maintains information about deployments and provides centralized configuration management. The Agent Manager and Service Manager support the remote retrieval and application of agent and service configuration.
- Automation – The Config Server’s functionality is exposed via a clean and consistent Java API. Common management tasks can be automated by writing code against that API.
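As an illustration of the kind of task that can be automated, the sketch below computes a start order in which every service comes up after the services it depends on, and a stop order that is the reverse. The Config Server's real API is not shown in this post, so the `LifecycleOrderSketch` class, its methods and the service names are assumptions made purely for the example; the ordering itself is a standard depth-first topological sort.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: orders services so dependencies start first and stop last.
class LifecycleOrderSketch {
    // Maps each service to the services that must be running before it starts.
    private final Map<String, List<String>> dependencies = new LinkedHashMap<>();

    void addService(String name, String... dependsOn) {
        dependencies.put(name, Arrays.asList(dependsOn));
    }

    // Depth-first topological sort: a service is emitted after all its dependencies.
    List<String> startOrder() {
        List<String> order = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String service : dependencies.keySet()) {
            visit(service, done, inProgress, order);
        }
        return order;
    }

    // Stopping is the mirror image: dependents go down before what they rely on.
    List<String> stopOrder() {
        List<String> order = startOrder();
        Collections.reverse(order);
        return order;
    }

    private void visit(String service, Set<String> done, Set<String> inProgress, List<String> order) {
        if (done.contains(service)) {
            return;
        }
        if (!dependencies.containsKey(service)) {
            throw new IllegalArgumentException("Unknown service: " + service);
        }
        if (!inProgress.add(service)) {
            throw new IllegalStateException("Dependency cycle at: " + service);
        }
        for (String dependency : dependencies.get(service)) {
            visit(dependency, done, inProgress, order);
        }
        inProgress.remove(service);
        done.add(service);
        order.add(service);
    }

    public static void main(String[] args) {
        LifecycleOrderSketch plan = new LifecycleOrderSketch();
        plan.addService("frontend", "backend");
        plan.addService("backend", "database");
        plan.addService("database");
        System.out.println("start: " + plan.startOrder());
        System.out.println("stop:  " + plan.stopOrder());
    }
}
```

A scheduled-maintenance task could walk `stopOrder()` and ask the Service Manager to stop each service in turn, replacing the manual, one-machine-at-a-time shutdown.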
And what about some of our more important requirements?
- Security – All communication in our management infrastructure is secured using two-way SSL. A simple authorization mechanism, implemented using self-signed certificates, ensures that only authorized entities (most notably, the Config Server) can execute commands through agents. Client access to the data maintained, and functionality exposed, by the Config Server requires password-based authorization.
- Extensibility – The Config Server can be extended to support new types of services and perform new tasks by implementing plugins and dropping them in the Plugin Repository.
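The post does not describe how the certificate-based authorization check works internally. One common way to authorize peers that present self-signed certificates, where there is no certificate authority to vouch for them, is to compare each presented certificate against an allowlist of known fingerprints. The sketch below shows that approach under that assumption; the `CertificateAllowlistSketch` class and its methods are hypothetical, and plain byte arrays stand in for encoded certificates.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical allowlist: a peer is authorized only if its certificate is known.
class CertificateAllowlistSketch {
    private final Set<String> trustedFingerprints = new HashSet<>();

    // SHA-256 fingerprint of the encoded certificate, as lowercase hex.
    static String fingerprint(byte[] encodedCertificate) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(encodedCertificate);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    void trust(byte[] encodedCertificate) {
        trustedFingerprints.add(fingerprint(encodedCertificate));
    }

    // In a real two-way SSL handshake the bytes would come from the peer's
    // certificate, e.g. SSLSession.getPeerCertificates()[0].getEncoded().
    boolean isAuthorized(byte[] encodedCertificate) {
        return trustedFingerprints.contains(fingerprint(encodedCertificate));
    }

    public static void main(String[] args) {
        // Placeholder bytes, not real certificates.
        byte[] configServer = "config-server-certificate".getBytes(StandardCharsets.UTF_8);
        byte[] stranger = "unknown-certificate".getBytes(StandardCharsets.UTF_8);

        CertificateAllowlistSketch allowlist = new CertificateAllowlistSketch();
        allowlist.trust(configServer);
        System.out.println("config server authorized: " + allowlist.isAuthorized(configServer));
        System.out.println("stranger authorized: " + allowlist.isAuthorized(stranger));
    }
}
```

An agent using a scheme like this would hold the Config Server's fingerprint and refuse commands from any peer whose certificate does not match.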
Future
In the space of a few months, we built the Config Server to address several key needs and requirements related to the management of our platform. Our work has already begun to pay dividends. Looking ahead, there are several things we would like to do:
- Add support for low-level system management and configuration related to our platform (e.g., user and group management and firewall configuration).
- Implement multi-deployment management with support for features like staging, mirroring and migration.
- Pursue autonomic computing by integrating with our monitoring solution to implement platform self-management.
While we’ve accomplished a fair amount, plenty of work remains. We look forward to enhancing our Config Server and its associated infrastructure as we strive to make our platform one that is not only powerful and a pleasure to use, but also easy to manage and maintain.

Nice… Have you been able to quantify the time or resources saved by this implementation? Looks like a nice technology stack. Any plans for remote deployment, remote upgrade? Watchdog/recovery service? Just curious about plans to increase robustness over time without making it too complicated. Also, how are you storing data in your repository? Thanks…
March 10th, 2009 at 1:48 pm
Hey Todd,
While the folks who work with our software, from a deployment management perspective, have reported a significant reduction in the amount of time they spend doing so (e.g., deploying new servers and upgrading existing servers), we haven’t attempted to actually quantify the amount of time and resources saved.
The Config Server handles remote deployment. Agents are remotely deployed via SSH. Once they’re in place, we remotely deploy servers to managed computers using the functionality provided by STAF.
With respect to your question about a watchdog / recovery service… For our servers, we leverage the Tanuki Java Service Wrapper. Used in combination with our Monitoring Server, we’re able to get a pretty good sense of what the servers in a deployment are doing. We have not yet implemented similar functionality for our agents (i.e., an agent watchdog and a monitoring component for the Config Server Agent Manager service). Long-term, that’s something we’d definitely like to do.
For persistence, the Config Server currently uses XML. Given the amount of data it needs to deal with for now, XML works fine.
Thanks for the questions!
March 10th, 2009 at 3:51 pm