Pipes: using unix pipelines for beautiful answers to quick and dirty questions

February 7th, 2007 | Ari

/loony/bin

As we approach a release at Palantir we usually cut to a stable branch that QA can start testing as a release candidate. Further bug fixing and testing may continue on trunk by the developers, but we code review changes before committing them to the stable branch. As the time to really cut the release gets truly imminent we start asking questions like:

What changes are on trunk that are not in the stable branch?

We’re less concerned with what the changes are and more concerned with who owns the changes. What we really want to know is:

Do the changes on trunk represent pending changes that should be moved to stable or are they further development that shouldn’t be put into the stable branch for this release?

For the most part, the person who can answer that question is the coder who made the changes on trunk. To that end, what we would really love to have is a report of all files in trunk that differ from the stable branch, along with who last touched each file. There isn’t really an svn command that will do this succinctly, so I started thinking about how to accomplish it. I had an inkling that it could all be solved with a single Unix pipeline, and so I set out to craft such a beast. Here’s what I came up with in about ten minutes:

for name in `diff -r --brief --exclude=.svn pgstable/src pgtrunk/src  | awk '{print $4}' | grep pgtrunk `; do
    author=`svn info $name | grep -E "Last Changed Author" | awk '{print $4}'`;
    echo $author    $name;
done | sort | sed 's/pgtrunk\/src\///' > difflist.txt

Which produces output that looks like this:

bclinton com/palantir/baz/Fargle.java
gbush com/palantir/foo/Bar.java

How did I come up with such a beast? I deconstruct this inscrutable wonder after the jump.

The first question that I’ll answer is: how do I know how to do this? I spend the vast majority of my days writing backend Java code for one of our enterprise products but it wasn’t always that way. In my last job before coming to Palantir, I was working as a senior systems administrator and my work email address was root@sourceforge.net. SourceForge.net is a complex site with a lot of Linux automation going on behind the scenes, and during the three years I was responsible for the infrastructure, I wrote a lot of sh scripts (which, of course, on Linux, is technically bash).

For those not familiar with Unix pipes, a quick overview is available here and the Wikipedia entry “Pipeline (Unix)” is also not a bad place to start.

So we start with this snippet:

`diff -r --brief --exclude=.svn pgstable/src pgtrunk/src  | awk '{print $4}' | grep pgtrunk `

First note that this three-command pipeline is enclosed in backticks (that key that’s usually below the escape key on your keyboard that also has a ~ on it). In shell programming, this means, “execute this command in a subshell and substitute the subshell’s output here.”
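A tiny illustration of command substitution on its own, using made-up input rather than our trees. The variable names are only for illustration; the $(...) form shown second is the POSIX equivalent of backticks and is easier to nest.

```shell
# Backticks: run the pipeline and substitute its output in place.
count=`printf 'a\nb\nc\n' | wc -l | tr -d ' '`
echo "lines: $count"

# $(...) does the same thing in any POSIX shell.
count2=$(printf 'a\nb\nc\n' | wc -l | tr -d ' ')
echo "lines: $count2"
```

The tr -d ' ' is only there because some versions of wc pad their output with spaces.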

The first command is diff -r --brief --exclude=.svn pgstable/src pgtrunk/src. This is the command that actually does the diff. (Yes, diff will compute the differences between two directory trees.) It produces output that looks like this:

Files pgstable/src/com/palantir/foo/Bar.java and pgtrunk/src/com/palantir/foo/Bar.java differ
Files pgstable/src/com/palantir/baz/Fargle.java and pgtrunk/src/com/palantir/baz/Fargle.java differ
Only in pgtrunk/src/com/palantir/foo: NewFile.java
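If you want to see this kind of output without a pair of checkouts handy, something like the following should do. It is only a sketch: the stable and trunk directory names and the file contents are invented for the demo.

```shell
# Build two tiny throwaway trees (hypothetical names) and compare them.
tmp=$(mktemp -d)
mkdir -p "$tmp/stable/src" "$tmp/trunk/src"
echo one > "$tmp/stable/src/Bar.java"
echo two > "$tmp/trunk/src/Bar.java"
echo new > "$tmp/trunk/src/NewFile.java"

# diff exits non-zero when the trees differ, so ignore its exit status
# and keep only the text it prints.
out=$(cd "$tmp" && diff -r --brief stable/src trunk/src || true)
printf '%s\n' "$out"

# Clean up the throwaway trees.
rm -rf "$tmp"
```

One line is reported per differing file, plus an "Only in" line for the file that exists in just one tree.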

We then pipe this through awk, asking awk to only print the fourth field on the line, where fields are defined by the default delimiters of whitespace characters.

At this point, we would have output that looks like this:

pgtrunk/src/com/palantir/foo/Bar.java
pgtrunk/src/com/palantir/baz/Fargle.java
NewFile.java
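The awk step can be tried in isolation by feeding it sample text modelled on the diff output above. The sample and fields variable names are just for this sketch.

```shell
# Sample `diff -r --brief` output, modelled on the example above.
sample='Files pgstable/src/com/palantir/foo/Bar.java and pgtrunk/src/com/palantir/foo/Bar.java differ
Files pgstable/src/com/palantir/baz/Fargle.java and pgtrunk/src/com/palantir/baz/Fargle.java differ
Only in pgtrunk/src/com/palantir/foo: NewFile.java'

# awk splits each line on runs of whitespace; $4 is the fourth field.
fields=$(printf '%s\n' "$sample" | awk '{print $4}')
printf '%s\n' "$fields"
```

On the "Only in" line the fourth field happens to be the bare file name, which is why it comes out without a directory.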

We pipe this through grep and keep only the lines that match pgtrunk to filter out the new file case. We’re left with:

pgtrunk/src/com/palantir/foo/Bar.java
pgtrunk/src/com/palantir/baz/Fargle.java

You’ll note a caveat for would-be cut-and-pasters: we’re ignoring the new file case. Any new file in trunk and not in stable is not going to show up here. This is one place where this quick script is not comprehensive, but it was sufficient for our needs at the time so I didn’t jump through the hoops to deal with that case.
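For the curious, here is one way the new file case might be handled. This is a sketch and not what I actually ran: trunk_paths is a hypothetical helper name, it assumes paths contain no spaces, and it would replace the awk and grep stages of the pipeline.

```shell
# Turn each line of `diff -r --brief` output into a path under pgtrunk.
# Handles "Files A and B differ" and "Only in DIR: NAME" lines; entries
# that exist only in pgstable are skipped.
trunk_paths() {
    awk '
        $1 == "Files" { print $4 }
        $1 == "Only" && $3 ~ /^pgtrunk/ {
            dir = $3
            sub(/:$/, "", dir)    # strip the trailing colon
            print dir "/" $4
        }'
}

# Example: a single "Only in" line becomes a full path.
printf '%s\n' 'Only in pgtrunk/src/com/palantir/foo: NewFile.java' | trunk_paths
```

Note that an "Only in" entry may be a whole directory rather than a single file, so the report would list the directory and its last committer in that case.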

So let’s expand our focus a bit to this snippet:

for name in `diff -r --brief --exclude=.svn pgstable/src pgtrunk/src  | awk '{print $4}' | grep pgtrunk `; do
...
done

You can see that the output of that first pipeline is substituted into a looping construct. The for name in wordlist; do … done construct allows you to loop over a list of words that are delimited by whitespace. In this case, it’s the line-oriented output of the first pipeline, but it could also be a typed list of words. The shell will assign each word in wordlist in turn to the shell variable $name and then execute the list of commands between the keywords do and done.
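The word splitting is easy to see with a toy word list. The input here is made up, and the seen variable exists only to collect the words for inspection.

```shell
# The substituted output is split on whitespace (spaces, tabs and
# newlines), and each resulting word is assigned to $name in turn.
seen=""
for name in `printf 'one\ntwo three\n'`; do
    echo "word: $name"
    seen="${seen}[${name}]"
done
echo "$seen"
```

The second input line, "two three", turns into two separate words. That is also why a file name containing a space would trip up this kind of loop.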

The inner portion of the loop looks like this:

author=`svn info $name | grep -E "Last Changed Author" | awk '{print $4}'`;
echo $author    $name;

The first line sets the shell variable $author. The three-command pipeline parses the output of svn info down to a single value, and backtick substitution puts that value into the variable. The output of svn info for a particular path looks like this:

Path: src/com/palantir/foo/Bar.java
URL: svn://svn/Trunk/
Revision: 14860
Last Changed Author: gbush
Last Changed Rev: 14860
Last Changed Date: 2006-10-10 00:39:53 -0700 (Tue, 10 Oct 2006)

So the pipeline is pulling out the username of the last committer on trunk for the path in $name and placing the value into $author.
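You can check the extraction without a Subversion checkout by running the same grep and awk over canned text. The info variable below is a trimmed stand-in for real svn info output, and the single-awk variant is just an alternative, not what the original command uses.

```shell
# Sample `svn info` output, modelled on the listing above.
info='Path: src/com/palantir/foo/Bar.java
Revision: 14860
Last Changed Author: gbush
Last Changed Rev: 14860'

# grep keeps the one matching line; awk prints its fourth field.
author=$(printf '%s\n' "$info" | grep -E "Last Changed Author" | awk '{print $4}')
echo "$author"

# The same extraction in a single awk call, as an alternative.
author2=$(printf '%s\n' "$info" | awk '/^Last Changed Author:/ {print $4}')
echo "$author2"
```

The fourth field is the username because "Last", "Changed" and "Author:" are the first three.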

Finally, we echo out that information on a single line, author first, path second, like this:

gbush pgtrunk/src/com/palantir/foo/Bar.java

And finally, the whole shebang is run through this command:

sort | sed 's/pgtrunk\/src\///'

sort will sort the output. Since we have put the usernames first on the line, this has the upshot of clustering all changes by username, giving each developer an easy-to-consult section in the email that gets sent out. The sed command is doing a regular expression search-and-replace that essentially strips out the leading part of the path, giving us just the raw relative path (to make the report easier to read). (Note that the backslashes in the regular expression replace pattern are there to escape the path elements of /, which are also used as delimiters in the replacement expression; in plain English, the sed 's/pgtrunk\/src\///' expression reads: replace the first occurrence of pgtrunk/src/ on every line with nothing.)
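Here is the final stage run over two sample report lines. The second variant is an alternative I did not use at the time: sed accepts other delimiter characters, so choosing | avoids the escaped slashes entirely.

```shell
# Two sample lines as they would come out of the loop.
lines='gbush pgtrunk/src/com/palantir/foo/Bar.java
bclinton pgtrunk/src/com/palantir/baz/Fargle.java'

# Escaped slashes, as in the command above.
report=$(printf '%s\n' "$lines" | sort | sed 's/pgtrunk\/src\///')

# Same substitution with | as the delimiter, so nothing needs escaping.
report2=$(printf '%s\n' "$lines" | sort | sed 's|pgtrunk/src/||')

printf '%s\n' "$report"
```

Both variants should produce the same report, with bclinton sorted ahead of gbush.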

Finally, > difflist.txt directs all output from the script into a file named difflist.txt.

I then used this to compose an email to the team, and soon stable and trunk were as in sync as they ever were going to be. And thus ends another exciting game of Clusenix.

Dr. Fun Clusenix Comic

4 Responses to “Pipes: using unix pipelines for beautiful answers to quick and dirty questions”

  1. quikchange Says:

    Looping in the shell tends to be at least an order of magnitude slower than it would be in a compiled language, which I find too frustrating for frequent interactive use.

  2. Ari Says:

    It’s true that shell scripts are slow compared to any given compiled language. However, their interactivity, compactness, power, and speed of iteration are hard to beat. This is especially true when integrating data from multiple sources (like the output of diff and svn) into a single compact report.

    I prototyped the above command, literally, in ten minutes start-to-finish and then emailed out the results to my team. Rather than having to fire up an editor and deal with a compile/run/edit loop I was editing and running a command line: hit up arrow, edit line, hit enter. Lather, rinse, repeat until you see what you want.

    And there’s another point to remember: I don’t even know (offhand) how to do this in a compiled language. Writing that utility would require me to understand the API that’s available for each of those tools. Subversion has a public API, and there are countless regex libraries for every language, but I’m not sure how one would do the diff using library code (a cursory Google search didn’t show anything too promising). All of which is not to say that I have a fear of learning a new API, but I already know how to use these tools, as using them is a core part of what I do. Learning the Subversion API and reimplementing diff is not.

    And finally: writing a compiled solution would require a lot more code (even assuming that you wouldn’t have to re-implement diff). Note that the pipeline is only 42 words and 269 characters long. Even if you were to implement the pipeline as nothing more than system() calls so you could leverage the power of grep and diff, you’d still end up with a program at least three times the size.

    So it ends up being a net win in efficiency; the iteration speed and compactness are pretty key. The time for it to run is about 3 seconds. The time to develop a minimal compiled solution is about 30 minutes, which is 20 minutes more than the ten I spent on the pipeline. 20 * 60 / 3 = 400. So I get to run it about 400 times before this solution costs me anything over a compiled solution. And I get to keep the flexibility of editing the command line to tweak what I want to see in the report. (Exercise for the reader: add timestamps to the report lines that indicate the last modified time.)

    So while I agree that the shell is slow at times, it’s often the fastest route to getting the information you seek.

  3. quikchange Says:

    Fair enough. The fact that it takes only 3 seconds is key though. If it took 2 minutes because there was a much larger number of items to loop over then the trade-off may not have been as convenient.

    Interestingly, your argument for using shell instead of C is basically the same one for using Python or Ruby instead of Java, C#, etc.

  4. Ari Says:

    Absolutely. But as with all things, the right tool for the right job has a lot to do with the job. Performance considerations will pull people back towards Java or C/C++. Interfacing directly with hardware will call out a need for C/C++, etc. In this case, shell was the right choice because it was fast, short, and I knew exactly how to accomplish it in that environment.

    For someone not as steeped in shell programming as I am, it might have been easier to accomplish in Perl, Python, or even C or Java. However, when evaluating what is the right tool for the job, we have to assume some high level of knowledge of the available tools. I happen to know all of the above tools and I’m pretty certain that shell was the right way to go here. (I’d love to see a different approach that proved me wrong, however!)
