Will - February 9th '20

(Last Edited August 1st '20)

How To: Pipelines

When working on the command line, it's often useful to take the output of one command and use it as the input to another. All command line work is really data manipulation (what work isn't?), and commands and programs are the means of manipulation. Pipelines are a particularly quick and practical way to control those means! Geertz, if he'd been feeling particularly digital, might well have commented that we are creatures bobbing in floods of data we ourselves have produced.

Anyway.

While there are a variety of ways to communicate data between processes (IPC, shell variables, writing to files, etc.), in Bash and Bash-like shells a very common and straightforward means of doing this is with pipelines. According to the Bash manual:

"A pipeline is a sequence of one or more commands separated by one of the control operators | or |&."

A little more background is in order. Normally, any time a *nix program starts it has three streams associated with it: an input stream, stdin, an output stream, stdout, and an error or diagnostic stream, stderr. These are commonly referred to as "standard in", "standard out" and "standard error". Usually, when a program is launched from the command line, they are each associated with the terminal, and so all output, error information or input is sent to, or received from, that terminal. In *nix, roughly, everything is a file, and as such stdin, stdout and stderr are respectively assigned the file descriptors 0, 1 and 2.
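
To make those descriptor numbers a little more concrete, here is a minimal sketch. The function name report is made up for illustration; it writes one line to each output stream, and the redirections then silence one descriptor at a time so we can see what the other carried:

```shell
# report is a hypothetical helper: one line to stdout (fd 1), one to stderr (fd 2).
report() {
  echo "normal output"
  echo "diagnostic output" >&2
}

# Discard fd 2 and capture what is left on fd 1.
only_out=$(report 2>/dev/null)

# Point fd 2 at the capture first, then discard fd 1: only stderr is captured.
only_err=$(report 2>&1 1>/dev/null)

echo "fd 1 carried: $only_out"
echo "fd 2 carried: $only_err"
```

Redirections like these are read left to right, which is why the order in the second capture matters.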

That's enough background to get going! For much more information, consult the Bash manual (the man page for stdin may also be interesting).

The Basics:

To begin, consider the cat program. Cat, without any arguments, reads from standard in and writes whatever it reads back to standard out:

$ cat
this is a test
this is a test

We wrote to standard input, and cat wrote it back to us on the standard output.

Another way to create input, other than typing in the terminal, is to use echo, which takes input as a string argument and then writes it to standard out:

$ echo "This is a test"
This is a test

Now let's use one of those control characters mentioned above to connect the output of echo to the input of cat:

$ echo "This is a test" | cat
This is a test

What we did with |, the pipe character, is connect the standard output of the command before it, echo, with the standard input of the command after it, cat! Easy and powerful.
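
The same connection works with any program that reads standard in. As a small sketch, wc -w counts the words it receives, so piping echo into it counts the words in our string (the trailing tr -d only strips the padding that some versions of wc add):

```shell
# echo writes the string to its stdout; the pipe feeds it to wc's stdin.
words=$(echo "This is a test" | wc -w | tr -d ' ')
echo "$words"
```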

Let's try something a little more real-world. If you run the following, you'll see a big list of all the processes running on your system:

$ ps xau
root         1  0.1  0.0 169548 10404 ?        Ss   13:59   0:05 /sbin/init
root         2  0.0  0.0      0     0 ?        S    13:59   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        I<   13:59   0:00 [rcu_gp]
...

Now let's filter that, to see only the shells that are running. To do this we can use grep, which takes text as input and tests each line against a regular expression. In its simplest form it will output (on standard out) each line of input that matches its argument. Assuming you're running Bash, you could do this:

$ ps xau | grep bash
miltiades  5725  0.0  0.0   9452  4048 pts/1    S+   14:14   0:00 man bash

Our first useful pipeline! Piping into grep is very common. It's so common in fact that some grep tasks have their own programs. pgrep, for example, does basically what we just did, but returns only the pid(s) of any matching process(es).
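
To get a feel for what "only the pid(s)" means, here is a rough sketch that adds one more stage to our pipeline. It is not how pgrep works internally; it just uses awk (not otherwise covered in this article) to print the second column. So that the result is predictable, the sample_ps function stands in for ps xau: its first two lines are taken from the output above, and the third is made up.

```shell
# sample_ps is a hypothetical stand-in for `ps xau`, using the same column layout.
sample_ps() {
  cat <<'EOF'
root         1  0.1  0.0 169548 10404 ?        Ss   13:59   0:05 /sbin/init
miltiades  5725  0.0  0.0   9452  4048 pts/1    S+   14:14   0:00 man bash
miltiades  5801  0.0  0.0  10020  5120 pts/2    Ss   14:15   0:00 bash
EOF
}

# grep keeps the matching lines; awk prints the second field (the PID) of each.
pids=$(sample_ps | grep bash | awk '{print $2}')
echo "$pids"
```

On a real system you would swap sample_ps for ps xau.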

There's no limit (careful now!) to how long a pipeline you can create. Let's look at a couple of slightly more complex examples.

Example 1: Extracting song titles from their file names

Let's try out this idea of manipulating data by piping it through different commands by removing all of the extraneous text from some typical mp3 file names. This kind of file name manipulation is a pretty common shell activity.

We've got the following files:

$ ls
01_Sgt._Peppers_Lonely_Hearts_Club_Band.mp3
02_With_a_Little_Help_from_My_Friends.mp3
03_Lucy_in_the_Sky_with_Diamonds.mp3

Let's start by removing the first three characters:

$ ls | cut -c4-
Sgt._Peppers_Lonely_Hearts_Club_Band.mp3
With_a_Little_Help_from_My_Friends.mp3
Lucy_in_the_Sky_with_Diamonds.mp3

We piped the list of files into cut, and told it to remove the first 3 characters. The -c option takes the starting position (i.e. position 4 in the string) and the following - says we want to retain everything from that position on.
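
The range syntax is easier to see with both ends filled in. A small sketch, using one of our file names as a fixed string:

```shell
name="01_Sgt._Peppers_Lonely_Hearts_Club_Band.mp3"

# -c4- keeps character 4 through the end of the line.
rest=$(echo "$name" | cut -c4-)

# -c1-2 keeps only characters 1 and 2 (the track number).
track=$(echo "$name" | cut -c1-2)

echo "$track: $rest"
```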

Now let's reverse the string, which we can do simply by piping our list into rev:

$ ls | cut -c4- | rev
3pm.dnaB_bulC_straeH_ylenoL_sreppeP_.tgS
3pm.sdneirF_yM_morf_pleH_elttiL_a_htiW
3pm.sdnomaiD_htiw_ykS_eht_ni_ycuL

That looks pretty silly. But now we can run cut just like we did before, albeit starting from one position higher in the string, to get rid of what used to be the end of the filename:

$ ls | cut -c4- | rev | cut -c5-
dnaB_bulC_straeH_ylenoL_sreppeP_.tgS
sdneirF_yM_morf_pleH_elttiL_a_htiW
sdnomaiD_htiw_ykS_eht_ni_ycuL

Now let's reverse it again:

$ ls | cut -c4- | rev | cut -c5- | rev
Sgt._Peppers_Lonely_Hearts_Club_Band
With_a_Little_Help_from_My_Friends
Lucy_in_the_Sky_with_Diamonds

And finally, let's replace those pesky underscores with spaces. To do this, we can use a program called tr which "translates or deletes characters" (from its man page) in the data it reads from standard in. All instances of the first argument will be replaced with the second argument:

$ ls | cut -c4- | rev | cut -c5- | rev | tr _ " "
Sgt. Peppers Lonely Hearts Club Band
With a Little Help from My Friends
Lucy in the Sky with Diamonds

Easy!
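
If you find yourself reusing a pipeline like this, one option is to wrap it in a shell function. The sketch below is just one way to do that: the name titles is made up, and printf stands in for ls so the example doesn't depend on real files.

```shell
# titles is a hypothetical name; it reads file names on stdin, one per line.
titles() {
  cut -c4- | rev | cut -c5- | rev | tr _ " "
}

# Feed two of the sample file names through the whole pipeline.
result=$(printf '%s\n' \
  "01_Sgt._Peppers_Lonely_Hearts_Club_Band.mp3" \
  "02_With_a_Little_Help_from_My_Friends.mp3" | titles)
echo "$result"
```

Because the function reads standard in, it can sit at the end of any pipeline, including ls | titles.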

Example 2: Selectively removing paths from $PATH

This time, let's selectively remove any path from the $PATH variable that contains the string "local".

The $PATH variable contains a colon-separated list of paths that the shell will look through for programs. Like all shell variables, it's accessed with the $ symbol. We can see what it contains like this:

$ echo $PATH
/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games

Let's start by converting those colons into newline characters so that we can run grep on each line like we did with the output of ps. We can do this with tr:

$ echo $PATH | tr \: \\n
/usr/local/bin
/usr/bin
/bin
/usr/local/games
/usr/games

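We needed the extra \ because a backslash is a special character to the shell: \\n reaches tr as \n, which tr then interprets as a newline. (The colon is not special to the shell, so the \ in front of it is harmless but optional.)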
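
If the backslashes are hard on the eyes, single quotes are a common alternative, since the shell leaves their contents alone. A quick sketch comparing the two spellings on a fixed sample string:

```shell
sample="/usr/bin:/bin"

# Backslash escapes: the shell turns \\n into \n before tr sees it.
escaped=$(echo "$sample" | tr \: \\n)

# Single quotes: the shell passes \n through to tr untouched.
quoted=$(echo "$sample" | tr ':' '\n')

[ "$escaped" = "$quoted" ] && echo "same result"
```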

Now let's get rid of the lines with the string "local" in them:

$ echo $PATH | tr \: \\n | grep -v local
/usr/bin
/bin
/usr/games

The -v switch inverts the selection, so we've matched all lines that don't contain "local". We can reassemble the individual lines into a single colon-separated string by doing the opposite of our earlier tr:

$ echo $PATH | tr \: \\n | grep -v local | tr \\n \:
/usr/bin:/bin:/usr/games:

However, notice that there is an extra colon at the end, and that the output didn't print nicely because it no longer ends with a newline. Let's fix both with a quick sed. The details are beyond the scope of this article, but sed is short for Stream EDitor, and it is an extremely powerful stream (i.e. pipe!) manipulator. This basically uses a regular expression to match the colon at the end of the string and replace it with a newline:

$ echo $PATH | tr \: \\n | grep -v local | tr \\n \: | sed "s/:$/\n/"
/usr/bin:/bin:/usr/games

Okay!
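
As an aside, there is another way to avoid the trailing colon in the first place. This is a different technique from the one above, offered only as a sketch: paste -s joins all of its input lines into one, and -d: tells it to put a colon between them. A fixed sample value stands in for $PATH so the result is predictable.

```shell
# sample_path is a stand-in for $PATH, copied from the echo output earlier.
sample_path="/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games"

# Split on colons, drop the "local" lines, then join the rest with colons.
cleaned=$(echo "$sample_path" | tr ':' '\n' | grep -v local | paste -sd: -)
echo "$cleaned"
```

Since paste only puts the delimiter between lines, there is nothing left for sed to clean up.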

In these two examples, we've built pipelines out of some common *nix programs to perform useful data manipulation!

Tripping over Standard Error:

Now that we're familiar with moving data between programs by using pipes to connect the output of one program to the input of another, let's briefly consider that last stream: the standard error. Because it and standard out are normally both written to the terminal, it's easy to forget that they are, in fact, separate streams.

Consider the following command:

$ ls / /fakefolder
ls: cannot access '/fakefolder': No such file or directory
/:
bin   etc         initrd.img.old  lib64       media  proc  sbin  tmp  vmlinuz
boot  home        lib             libx32      mnt    root  srv   usr  vmlinuz.old
dev   initrd.img  lib32           lost+found  opt    run   sys   var

ls reports that, surprise surprise, it can't find "fakefolder", and then proceeds to list the contents of the root directory. Further, if we then pipe that output into, say, cat, which will reflect whatever it reads on standard input back to its standard out, we see the same exact thing:

$ ls / /fakefolder | cat
ls: cannot access '/fakefolder': No such file or directory
/:
bin   etc         initrd.img.old  lib64       media  proc  sbin  tmp  vmlinuz
boot  home        lib             libx32      mnt    root  srv   usr  vmlinuz.old
dev   initrd.img  lib32           lost+found  opt    run   sys   var

However, only part of the original output was piped into cat. What we see is actually the original ls error message, written straight to the terminal on the standard error of ls, followed by the directory listing, written to the standard output of cat. Because both streams are attached to the same terminal, the result looks just as it would if both had in fact been piped to cat, and then written together, by cat, to standard out.

This becomes apparent when we try to read the error text in the command after the pipe:

$ ls / /fakefolder | grep potato
ls: cannot access '/fakefolder': No such file or directory

Despite the fact that potato clearly doesn't exist anywhere in the original output, we still see a line printed to the terminal. While terminal coloring of grep might give it away, this illustrates the potential for confusion.
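
One way to convince yourself that only standard out travels through the pipe is to count what arrives on the other side. In this sketch, emit is a made-up helper that writes one line to each stream, and grep -c counts the matching lines it receives:

```shell
# emit is a hypothetical helper that writes one line to each output stream.
emit() {
  echo "line on stdout"
  echo "line on stderr" >&2
}

# Only the stdout line reaches grep; the stderr line bypasses the pipe
# and goes straight to the terminal.
piped=$(emit | grep -c "line")
echo "lines that went through the pipe: $piped"
```

Both lines contain the word "line", yet grep only ever counts one of them.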

To pipe both standard error and standard out together, we can do the following:

$ ls / /fakefolder |& cat
ls: cannot access '/fakefolder': No such file or directory
/:
bin   etc         initrd.img.old  lib64       media  proc  sbin  tmp  vmlinuz
boot  home        lib             libx32      mnt    root  srv   usr  vmlinuz.old
dev   initrd.img  lib32           lost+found  opt    run   sys   var

And now our test correctly returns no results:

$ ls / /fakefolder |& grep potato

|& is shorthand for the following, which uses the file descriptors we mentioned at the beginning of the article:

$ ls / /fakefolder 2>&1 | grep potato

This second, more verbose version of the same thing is a form of Bash redirection, which takes file descriptor (fd) 2, stderr, and redirects it into fd 1, stdout, before piping into grep.
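
Here is a small sketch of that redirection at work, using a brace group in place of ls so the output is predictable. The group writes one line to each stream, and grep -c counts how many lines make it through the pipe:

```shell
# The brace group writes one line to stdout and one to stderr.
# 2>&1 points fd 2 at wherever fd 1 goes, which is already the pipe.
merged=$({ echo "line on stdout"; echo "line on stderr" >&2; } 2>&1 | grep -c "line")
echo "lines that went through the pipe: $merged"
```

The 2>&1 form is used here rather than |& because it also works in shells that don't support the shorter spelling.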

Some programs play a little bit fast and loose (at least from a naive perspective) with the purposes of standard error and standard out. strace comes to mind.
