When working on the command line, it's often useful to take the output of one command and use it as the input to another. All command-line work is really data manipulation (what work isn't?), and commands and programs are the means of manipulation. A particularly quick and practical way to control those means is the pipeline! Geertz, if he'd been feeling particularly digital, might well have commented that we are creatures bobbing in floods of data we ourselves have produced.
Anyway.
While there are a variety of ways to communicate data between processes (IPC, shell variables, writing to files, etc.), in Bash and Bash-like shells a very common and straightforward means of doing this is with pipelines. According to the Bash manual:
"A pipeline is a sequence of one or more commands separated by one of the control operators | or |&."
A little more background is in order. Normally, any time a *nix program starts it has three streams associated with it: an input stream, stdin, an output stream, stdout, and an error or diagnostic stream, stderr. These are commonly referred to as "standard in", "standard out" and "standard error". Usually, when a program is launched from the command line, they are each associated with the terminal, and so all output, error information or input is sent to, or received from, that terminal. In *nix, roughly, everything is a file, and as such stdin, stdout and stderr are respectively assigned the file descriptors 0, 1 and 2.
That's enough background to get going! For much more information, consult the Bash manual (the man page for stdin may also be interesting).
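To make those three descriptors a little more concrete, here is a minimal sketch. The function name emit_both is hypothetical, made up purely for illustration: it writes one line to fd 1 and one line to fd 2, and the redirections then keep one stream at a time.

```shell
# emit_both is a hypothetical helper used only for this illustration.
emit_both() {
    echo "to stdout"        # written to fd 1, standard out
    echo "to stderr" >&2    # written to fd 2, standard error
}

# Discard fd 2, so only standard out is captured.
out=$(emit_both 2>/dev/null)

# Point fd 2 at the capture first, then discard fd 1,
# so only standard error is captured.
err=$(emit_both 2>&1 >/dev/null)

echo "captured from fd 1: $out"
echo "captured from fd 2: $err"
```

The two streams normally land on the same terminal, which is why it takes a redirection like this to tell them apart.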
The Basics:
To begin, consider the cat program. Without any arguments, cat reads from standard in and writes whatever it reads back to standard out:
$ cat
this is a test
this is a test
We wrote to standard input, and cat wrote it back to us on the standard output.
Another way to create input, other than typing in the terminal, is to use echo, which takes a string argument and writes it to standard out:
$ echo "This is a test"
This is a test
Now let's use one of the control operators mentioned above to connect the output of echo to the input of cat:
$ echo "This is a test" | cat
This is a test
What we did with |, the pipe character, is connect the standard output of the command before it, echo, with the standard input of the command after it, cat! Easy and powerful.
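Since cat just hands its input back unchanged, the effect of the pipe is easier to see with a command that alters what it reads. A small sketch, assuming the standard tr utility (which this article comes back to further down); the variable name shouted is arbitrary:

```shell
# Same shape as `echo ... | cat`, but the right-hand command changes its input:
# tr reads echo's output on standard in and uppercases every letter.
shouted=$(echo "This is a test" | tr '[:lower:]' '[:upper:]')
echo "$shouted"
```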
Let's try something a little more real-world. If you run the following, you'll see a big list of all the processes running on your system:
$ ps xau
root 1 0.1 0.0 169548 10404 ? Ss 13:59 0:05 /sbin/init
root 2 0.0 0.0 0 0 ? S 13:59 0:00 [kthreadd]
root 3 0.0 0.0 0 0 ? I< 13:59 0:00 [rcu_gp]
...
Now let's filter that to see only the shells that are running. To do this we can use grep, which takes text as input and tests each line against a regular expression. In its simplest form it will output (on standard out) each line of input that matches its argument. Assuming you're running Bash, you could do this:
$ ps xau | grep bash
miltiades 5725 0.0 0.0 9452 4048 pts/1 S+ 14:14 0:00 man bash
Our first useful pipeline! Piping into grep is very common. It's so common, in fact, that some grep tasks have their own programs. pgrep, for example, does basically what we just did, but returns only the pid(s) of any matching process(es).
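Because real ps output differs from machine to machine, here is a rough sketch of the same filtering step run against fixed input. The sample_ps function and the lines it prints are made up for illustration and only stand in for ps xau:

```shell
# sample_ps prints invented lines standing in for real `ps xau` output.
sample_ps() {
    printf '%s\n' \
        'root 1 /sbin/init' \
        'miltiades 5725 man bash' \
        'miltiades 5730 -bash'
}

# Keep only the lines that contain "bash".
matches=$(sample_ps | grep bash)
echo "$matches"

# grep -c prints the number of matching lines instead of the lines themselves.
count=$(sample_ps | grep -c bash)
echo "$count"
```

Note that grep matches anywhere in the line, so "man bash" is kept alongside the actual shell.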
There's no limit (careful now!) to how long a pipeline you can create. Let's look at a couple of slightly more complex examples.
Example 1: Extracting song titles from their file names
Let's try out this idea of manipulating data by piping it through different commands: we'll remove all of the extraneous text from some typical mp3 file names. This kind of file name manipulation is a pretty common shell activity.
We've got the following files:
$ ls
01_Sgt._Peppers_Lonely_Hearts_Club_Band.mp3
02_With_a_Little_Help_from_My_Friends.mp3
03_Lucy_in_the_Sky_with_Diamonds.mp3
Let's start by removing the first three characters:
$ ls | cut -c4-
Sgt._Peppers_Lonely_Hearts_Club_Band.mp3
With_a_Little_Help_from_My_Friends.mp3
Lucy_in_the_Sky_with_Diamonds.mp3
We piped the list of files into cut, and told it to keep everything from the fourth character onward, which drops the first 3 characters. The -c option selects by character position: 4 is the starting position in each line, and the following - says we want to retain everything from that position on.
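The range syntax is easier to see on a short string. A quick sketch of three forms of -c, using an arbitrary seven-letter input and variable names chosen only for this example:

```shell
# N- keeps character N through the end of the line.
tail_part=$(echo "abcdefg" | cut -c4-)

# N-M keeps characters N through M, inclusive.
head_part=$(echo "abcdefg" | cut -c1-3)
middle_part=$(echo "abcdefg" | cut -c2-5)

echo "$tail_part $head_part $middle_part"
```

Positions are counted from 1, which is why -c4- keeps "defg" and discards exactly three characters.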
Now let's reverse each string, which we can do simply by piping our list into rev:
$ ls | cut -c4- | rev
3pm.dnaB_bulC_straeH_ylenoL_sreppeP_.tgS
3pm.sdneirF_yM_morf_pleH_elttiL_a_htiW
3pm.sdnomaiD_htiw_ykS_eht_ni_ycuL
That looks pretty silly. But now we can run cut just like we did before, albeit starting from one position higher in the string, to get rid of the end of the filename (the reversed .mp3 extension is four characters long):
$ ls | cut -c4- | rev | cut -c5-
dnaB_bulC_straeH_ylenoL_sreppeP_.tgS
sdneirF_yM_morf_pleH_elttiL_a_htiW
sdnomaiD_htiw_ykS_eht_ni_ycuL
Now let's reverse it again:
$ ls | cut -c4- | rev | cut -c5- | rev
Sgt._Peppers_Lonely_Hearts_Club_Band
With_a_Little_Help_from_My_Friends
Lucy_in_the_Sky_with_Diamonds
And finally, let's replace those pesky underscores with spaces. To do this, we can use a program called tr, which "translates or deletes characters" (from its man page) in the data it reads from standard in. All instances of the first argument will be replaced with the second argument:
$ ls | cut -c4- | rev | cut -c5- | rev | tr _ " "
Sgt. Peppers Lonely Hearts Club Band
With a Little Help from My Friends
Lucy in the Sky with Diamonds
Easy!
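If you wanted to reuse this pipeline, one option is to wrap it in a shell function that reads from standard in, so it can sit after any command that produces file names. This is only a sketch: the name clean_titles is made up, and it assumes names shaped like the ones above (a three-character prefix and a four-character extension).

```shell
# clean_titles is a hypothetical wrapper around the pipeline built above.
# It reads file names on standard in and writes cleaned titles to standard out.
clean_titles() {
    cut -c4- | rev | cut -c5- | rev | tr _ ' '
}

title=$(echo "03_Lucy_in_the_Sky_with_Diamonds.mp3" | clean_titles)
echo "$title"
```

With that defined, ls | clean_titles would produce the same list as the full pipeline.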
Example 2: Selectively removing paths from $PATH
This time, let's selectively remove any path from the $PATH variable that contains the string "local".
The $PATH variable contains a colon-separated list of paths that the shell will look through for programs. Like all shell variables, its value is accessed with the $ symbol. We can see what it contains like this:
$ echo $PATH
/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
Let's start by converting those colons into newline characters so that we can run grep on each line like we did with the output of ps. We can do this with tr:
$ echo $PATH | tr \: \\n
/usr/local/bin
/usr/bin
/bin
/usr/local/games
/usr/games
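We needed the extra \ to escape the second \ because the backslash is a special character to the shell. Without it, the shell would strip the backslash and tr would receive a plain n.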
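Quoting is another way to get the same arguments to tr. The sketch below compares the escaped form, the quoted form, and what happens when the backslash is left unprotected; the sample value and variable names are just for this example.

```shell
sample="/usr/bin:/bin"

# Escaped: the shell turns \\n into \n before tr sees it.
escaped=$(echo "$sample" | tr \: \\n)

# Quoted: single quotes pass \n to tr untouched, with no extra backslash.
quoted=$(echo "$sample" | tr ':' '\n')

# Unprotected: the shell strips the backslash, so tr swaps ":" for the letter "n".
unescaped=$(echo "$sample" | tr : \n)

echo "$escaped"
echo "$quoted"
echo "$unescaped"
```

The first two produce identical output, so which to use is mostly a matter of taste.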
Now let's get rid of the lines with the string "local" in them:
$ echo $PATH | tr \: \\n | grep -v local
/usr/bin
/bin
/usr/games
The -v switch inverts the selection, so we've matched all lines that don't contain "local". We can reassemble the individual lines into a single colon-separated string by doing the opposite of our earlier tr:
$ echo $PATH | tr \: \\n | grep -v local | tr \\n \:
/usr/bin:/bin:/usr/games:
However, notice that there is an extra colon at the end, and that the output no longer ends with a newline, so it doesn't print nicely. Let's fix that with a quick sed. The details are beyond the scope of the article, but sed is short for Stream EDitor, and is an extremely powerful stream (i.e. pipe!) manipulator. This basically uses a regular expression to match the colon at the end of the string and replace it with a newline:
$ echo $PATH | tr \: \\n | grep -v local | tr \\n \: | sed 's/:$/\n/'
/usr/bin:/bin:/usr/games
Okay!
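As a variation, the last two steps can be swapped for paste, which joins lines with a delimiter and so never produces the trailing colon in the first place. To be clear, this is a different technique from the tr and sed combination above, and the function name strip_from_path is made up for the sketch.

```shell
# strip_from_path is a hypothetical helper.
#   $1: pattern to drop (grep treats it as a regular expression)
#   $2: colon-separated list to filter; defaults to $PATH
strip_from_path() {
    printf '%s\n' "${2:-$PATH}" | tr ':' '\n' | grep -v -- "$1" | paste -s -d: -
}

cleaned=$(strip_from_path local "/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games")
echo "$cleaned"
```

Nothing here modifies $PATH itself; the function only prints the filtered list.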
In these two examples, we've built useful pipelines using some common *nix programs to perform useful data manipulation!
Tripping over Standard Error:
Now that we're familiar with moving data between programs by using pipes to connect the output of one program to the input of another, let's briefly consider that last stream: the standard error. Because it and standard out are normally both written to the terminal, it's easy to forget that they are, in fact, separate streams.
Consider the following command:
$ ls / /fakefolder
ls: cannot access '/fakefolder': No such file or directory
/:
bin etc initrd.img.old lib64 media proc sbin tmp vmlinuz
boot home lib libx32 mnt root srv usr vmlinuz.old
dev initrd.img lib32 lost+found opt run sys var
ls reports that, surprise surprise, it can't find "fakefolder", and then proceeds to list the contents of the root directory. Further, if we then pipe that output into, say, cat, which will reflect whatever it reads on standard input back to its standard out, we see the exact same thing:
$ ls / /fakefolder | cat
ls: cannot access '/fakefolder': No such file or directory
/:
bin etc initrd.img.old lib64 media proc sbin tmp vmlinuz
boot home lib libx32 mnt root srv usr vmlinuz.old
dev initrd.img lib32 lost+found opt run sys var
However, only part of the original output was piped into cat. What we see is actually the original ls error message, written straight to the terminal on the standard error of ls, together with the directory listing, written by cat on its standard output. Because both streams are attached to the same terminal, the result looks just as it would if both had in fact been piped to cat and then written together, by cat, to standard out.
This becomes apparent when we try to read the error text in the command after the pipe:
$ ls / /fakefolder | grep potato
ls: cannot access '/fakefolder': No such file or directory
Despite the fact that potato clearly doesn't appear anywhere in the original output, we still see a line printed to the terminal: the error message never passed through grep at all. While grep's terminal coloring might give it away, this illustrates the potential for confusion.
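One way to convince yourself that standard error really does go around the pipe is to put a command that visibly changes its input on the right-hand side. In this sketch the noisy function is hypothetical, and the outer 2>&1 exists only so that both streams can be captured and inspected:

```shell
# noisy is a hypothetical command that writes one line to each output stream.
noisy() {
    echo "regular output"
    echo "error output" >&2
}

# Only standard out flows through the pipe into tr, so only that line is
# uppercased. Standard error bypasses tr and arrives unchanged.
seen=$( { noisy | tr '[:lower:]' '[:upper:]'; } 2>&1 )
echo "$seen"
```

The error line stays in lowercase, which shows that tr never saw it. The order of the two lines is not guaranteed, since they travel by different routes.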
To pipe both standard error and standard out together, we can do the following:
$ ls / /fakefolder |& cat
ls: cannot access '/fakefolder': No such file or directory
/:
bin etc initrd.img.old lib64 media proc sbin tmp vmlinuz
boot home lib libx32 mnt root srv usr vmlinuz.old
dev initrd.img lib32 lost+found opt run sys var
And now our test correctly returns no results:
$ ls / /fakefolder |& grep potato
|& is shorthand for the following, which uses the file descriptors we mentioned at the beginning of the article:
$ ls / /fakefolder 2>&1 | grep potato
This second, more verbose version of the same thing is a form of Bash redirection, which takes file descriptor (fd) 2, stderr, and redirects it to wherever fd 1, stdout, is pointing, which here is the pipe into grep.
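Here is a small, self-contained sketch of the difference that redirection makes. The report function and its messages are invented for the example, and the verbose 2>&1 | form is used so the snippet also runs in shells that lack |&:

```shell
# report is a hypothetical command: one data line on stdout, one warning on stderr.
report() {
    echo "data line"
    echo "warning: potato missing" >&2
}

# Only stdout enters the pipe, so grep finds nothing. stderr is discarded to
# keep the demo quiet, and `|| true` absorbs grep's "no match" exit status.
stdout_only=$(report 2>/dev/null | grep potato || true)

# 2>&1 sends stderr into the pipe as well, so grep now sees the warning.
# In Bash, `report |& grep potato` is the equivalent shorthand.
both=$(report 2>&1 | grep potato)

echo "without 2>&1: '$stdout_only'"
echo "with 2>&1:    '$both'"
```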
Some programs play a little bit fast and loose (at least from a naive perspective) with the purposes of standard error and standard out. strace comes to mind.