UNIX and UNIX-like systems are the operating system families of astronomy. In the modern era this practically means Mac OS (formerly OS X) and, to a lesser extent, Linux. Even brief attendance at any astronomy conference will bear this out. Look out over rows of astronomers attending some plenary talk and you will be greeted with a wall of aluminum and worn black plastic (in the time-honored tradition of using Linux on the least visually appealing hardware possible).
I use both Mac OS and Linux; however, I certainly use and prefer Linux more than Mac OS. Perhaps the reasoning behind that will form the basis of some future post. Regardless of which of these operating systems you use, these posts should be applicable to you. This post is the first in a multi-part series focused on more powerful usage of the shell and command-line utilities. In this part we will go over some basic operational principles of how the shell works, and in subsequent parts we will discuss individual command-line utilities and how you might use them to speed up your research.
Swimming down Streams
Before diving into any individual command-line utility, it is important that you have at least a rudimentary understanding of how a UNIX shell works. (N.B. I will be writing commands that are tested on the z-shell (zsh). They should all also work on bash. I make no guarantees about other shells such as tcsh or fish, though I would wager everything I say here will work fine on any shell aside from PowerShell.) As an astronomer you have almost certainly spent some time in what is called a terminal-emulator [1]. On Mac OS the bundled terminal-emulator is simply called Terminal.
[1] The reason this is known as a terminal-emulator and not simply a terminal is in large part historical. A true terminal is a directly connected, lightweight keyboard-and-display interface to a mainframe, such as the VT100. A terminal-emulator emulates this behavior. Different terminal-emulators provide different feature sets. Bundled ones are usually good enough, though I find Kitty to be more usable for its emphasis on low-latency typing.
Terminal-emulators are interfaces to shells; they are the applications which shells run within. The terminal-emulator can be thought of as analogous to a web browser, and the shell to the website which that browser visits. Moreover, in the same way that the experience of the web is (nowadays) quite homogeneous across browsers, the experience of using a shell is essentially the same across different terminal-emulators. To extend this metaphor to about where its limit is: in the same way your experience of the web will be different if you use Google vs. DuckDuckGo, your experience of interfacing with and controlling your computer can vary depending on the shell you use. Most modern systems will come with either bash (the Bourne Again SHell) or zsh (the z-shell). While there are certainly differences between them, for the most part they are similar. Both of these shells (and the vast majority of shells on Linux) make heavy use of streams.
Streams can roughly be thought of as lists of characters which a program can take as input and return as output. These lists can originate from keyboard input (standard input), a file (input redirection), or the output of another program (piping), and they can go to the screen (standard output), a file (output redirection), or the input of another program (piping). This very general way of handling input and output makes it very easy to string together many small programs to achieve one more complex outcome.
As a simple example, consider the following Python code in some file called redirect.py:
print("Hello, World!")
We can run this (assuming we are in the same directory as redirect.py) with the command
python redirect.py
Hello, World!
This is as expected. We see what we asked the program to print appear on screen. If we wanted to be a bit more formal, we could say that Python wrote Hello, World! to the standard output of our shell and the terminal-emulator then drew those characters to our screen. If we want to have this written to a file for later use, instead of to the screen, we can use output redirection (more generally, I/O redirection):
python redirect.py > output.txt
cat output.txt
Hello, World!
This example still has Python write Hello, World! to standard output; however, before the terminal-emulator can write that to the screen, the stream is redirected (using the > operator) to some file output.txt (this file will be created if it does not exist, and overwritten if it does). We then look at what is in the file with the cat command and see that yes, in fact, it is the same thing that would have been written to standard output had we not redirected.
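As a small sketch of two closely related operators: >> appends to a file rather than overwriting it, and < is input redirection, which feeds a file to a command's standard input. The file names notes.txt and upper.txt and the scratch directory are assumptions made purely for illustration.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"

# > creates notes.txt (or overwrites it if it already exists).
echo "first run" > notes.txt

# >> appends to the end of the file instead of overwriting it.
echo "second run" >> notes.txt

# < is input redirection: wc reads notes.txt on its standard input
# and counts the lines (two here).
wc -l < notes.txt

# Both directions at once: tr reads from notes.txt and its
# upper-cased output is written to upper.txt.
tr 'a-z' 'A-Z' < notes.txt > upper.txt
cat upper.txt
```

Running the first echo line a second time would reset notes.txt to a single line, which is the overwrite behavior described above.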
If, instead of a file, we want to send the output of one command to another command, we use the | (pipe) character. I’ll hold off on an example here, as that is predominantly what the next section will be focused on; however, for now it is okay to think of the pipe as operating in the same way as the > but for command inputs instead of files [2].
[2] The pipe is actually more equivalent to using both input and output redirection at the same time: storing the output of one command in a temp file and then redirecting the contents of that temp file to the input of another command.
Awks and Greps and Seds oh my
I/O redirection by itself can be very useful; however, the real power of the UNIX shell is the ability to pipe commands together. Later parts of this series will discuss in more detail the different command-line utilities which you can string together with pipes. For now, here is a list of some of the most helpful of these utilities.
grep
Used to search through the contents of one or many files or streams. Think of grep as your search engine when working in a shell. You can search individual files or entire directory trees for the exact line a string (or even a regex pattern) shows up in.
sed
Used to edit the contents of a stream on the fly. sed actually stands for stream editor. This allows you to perform find-and-replace operations on streams.
awk
Used to edit the contents and format of a stream. A good bit more powerful than sed but with a famously less helpful name. In many ways this can be thought of as a simple spreadsheet editor, as awk can handle column-oriented data very cleanly.
head / tail
Used to select just the first or last n lines of a file or stream, respectively. For large files these are important utilities, as opening the full file may either take too long or be impractical (depending on your computer's memory). With head you can view just the first n lines (such as the column definitions in a csv) and with tail just the final n lines, such as the final model in some simulation output file.
column
Used to quickly format a stream into columns. I find that I use this less than some of the other commands here, and in fact awk can mimic its behavior; however, if you quickly need your data organized into columns, the column command can do that.
sort
Used to sort the lines of a stream by some criterion, such as alphabetical or numerical order. The benefits of being able to do this should be self-explanatory.
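As a rough taste of how these utilities chain together with pipes, here is a minimal sketch. The catalog stars.csv, its column names, and every value in it are invented for illustration only, and column is left out to keep the sketch short. The flags used (head -n, tail -n, sort -t and -k, awk -F) are standard ones, though their details are better left to the dedicated posts.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"

# A tiny invented catalog: name, magnitude, spectral type.
printf 'name,mag,type\nalpha,4.2,G\nbeta,1.5,A\ngamma,7.9,M\ndelta,3.1,A\n' > stars.csv

# head: show only the first line (the column definitions).
head -n 1 stars.csv

# tail: show only the last two lines.
tail -n 2 stars.csv

# grep: keep only the lines that end in ",A" (the type A rows).
grep ',A$' stars.csv

# sed: find and replace, turning every comma into a space.
sed 's/,/ /g' stars.csv

# awk: split each line on commas and print the first two columns.
awk -F, '{print $1, $2}' stars.csv

# Everything piped together: drop the header, keep the type A rows,
# sort them numerically by the second column, then print name and
# magnitude into a file.
tail -n +2 stars.csv | grep ',A$' | sort -t, -k2,2n \
    | sed 's/,/ /g' | awk '{print $1, $2}' > bright_a.txt
cat bright_a.txt
```

Each stage reads the previous stage's standard output as its own standard input, so no intermediate files are needed until the final > writes the result.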
Over the next couple of weeks, individual posts on each of these utilities will be published that go into much more detail on their usage. Stay tuned!