In the previous part we introduced some basic concepts surrounding the command line, shells, and utilities. Here we are going to dive more deeply into what I would call “the essential utilities”. There are two main properties I think a tool needs to be essential: it must solve a problem, and it must solve that problem easily.
Some more justification
Later in this post I am going to introduce you to the utilities grep, find, and sort; however, many of the examples may seem trivial to solve with a language you are already familiar with. I want to justify, then, why you should learn these commands, with their relatively esoteric syntax, over simply using languages you already know. When working in the command line with command line utilities, the problems you face are often ones of string parsing, sorting, and searching. These are all things which can be done in almost any programming language. Often the logic behind a particular solution may even be clearer, and the solution itself more general, if you were to dive into Python or Perl to solve your problem (as opposed to using command line utilities); however, the up-front cost of using, say, Python is generally much higher than invoking a pre-built tool. Consider, for example, searching a directory tree recursively for all fits files and then saving that list to a file. In Python this can be achieved using the pathlib module (a module which a future post will address in detail):
from pathlib import Path
files = Path(".").rglob("*.fits")
with open("fitsPaths.txt", "w") as f:
    f.write("\n".join(map(str, files)))
That’s not a particularly long piece of code; however, it does make use of a few more complex constructs such as map and the pathlib module. Additionally, to write it you need to:
- Open and edit a script file
- Write 4 lines of code
- Save the file
- Run the file with the python interpreter
On top of those steps, if there is an error or typo in what you wrote, you have to reopen the file, edit it, and save it again before running it. Consider now the equivalent using the find command:
$ find . -name "*.fits" > fitsPaths.txt
This produces effectively the same output as the Python code above, much more concisely and without the need to edit and invoke a script. It is certainly true that the Python version can be extended more generally; however, most of the time I find I don’t need that power (at least for simple tasks).
Hopefully this example is at least convincing enough that you see some value in learning these tools.
Searching
Searching for and through files is a fundamental task of computing. UNIX-like systems provide a number of ways to do this, but the two most important are find and grep. These perform similar but distinct tasks: find searches based on file name, while grep searches any stream for a pattern (which can be a regex). As a very rough rule of thumb: if you need to look for files whose names match some pattern (like all files with a .fits extension), use find; if you want to search for files whose content contains some pattern (like every fits file with “bias” in the object field of the header), use grep.
find
You have already seen a basic example using find; let’s break down exactly what each part of it is doing before moving on to more complex examples. Recall:
$ find . -name "*.fits" > fitsPaths.txt
- find : invoke the find program
- . : start searching in the current directory (you can put any path here and that is where find will search from)
- -name "*.fits" : only return paths which end in .fits. Here the * means anything is allowed to come before .fits
- > fitsPaths.txt : redirect the output of find into the file fitsPaths.txt
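If you want to try this pattern without touching real data, here is a small self-contained sketch that builds a throwaway directory tree (the file and directory names are invented for the demo) and runs the same find invocation on it:

```shell
# Build a scratch directory with a couple of fake fits files and one non-fits file.
tmp=$(mktemp -d)
mkdir -p "$tmp/night1"
touch "$tmp/night1/flat.fits" "$tmp/night1/notes.txt" "$tmp/bias.fits"

# Same structure as above: find <start-dir> -name <pattern> > <output file>
find "$tmp" -name "*.fits" > "$tmp/fitsPaths.txt"

# Two .fits files were created, so fitsPaths.txt should contain two lines.
wc -l < "$tmp/fitsPaths.txt"

rm -rf "$tmp"
```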
This example covers much of what you might want to do with find, but let’s introduce a few more concepts.
Let’s say you have some directory with the products of a reduction pipeline scattered throughout it (we are going to assume they are all fits files). You’ve run the reduction pipeline multiple times, but the results of each run are jumbled together in the same directory tree. It might be nice to separate the results of the most recent run into their own folder. Finally, take it as a given that you have run the pipeline once in the last day and that this is the most recent run. Assuming all the data files are in a folder called “reduction”, you can use find to select all the fits files output by the reduction pipeline like so:
$ find reduction -mtime -1 -name "*.fits"
This searches the reduction folder for files modified less than 1 day ago that end in .fits. What if you then want to automatically copy those files to a new directory called reduction_clean?
$ find reduction -mtime -1 -name "*.fits" -exec cp --parents {} reduction_clean \;
Let’s break this down:
- -mtime -1 : only return files which have been modified less than one day ago
- -exec : execute the command which follows for each found file; any time {} appears in that command, substitute the current file
- cp --parents {} reduction_clean : copy the current file (denoted by {}), preserving its relative directory structure (--parents), into the folder reduction_clean
- \; : terminate the -exec statement (this is needed whenever you use -exec)
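As an aside, when there are many matches you can batch them into fewer cp invocations. This is a sketch assuming GNU findutils and coreutils (both --parents and -t are GNU cp options, so it will not work with BSD/macOS cp):

```shell
# -t names the target directory up front so {} can come last, and the
# trailing + (instead of \;) batches many files into one cp invocation.
find reduction -mtime -1 -name "*.fits" -exec cp --parents -t reduction_clean {} +
```

Note that with + the {} must be the last thing before it, which is why -t is needed to move the destination earlier in the command.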
I think this example does quite a good job of showing how powerful find can be. There are far more options than I can go over here, but this should give you a taste of its power.
grep
If find is like having a list of webpages and searching by URL, grep is like searching with Google in that it looks at the content. In the grand tradition of UNIX utilities with bad names, grep’s name is famously unintuitive: it stands for “Globally search for a Regular Expression and Print”. Essentially, it takes a regular expression and a stream as input and returns any lines in the stream which contain a match for the regular expression. The stream can be the output of another program, a file, or one of a few built-in streams grep provides (such as recursively streaming in the contents of all files in a directory). As an example, let’s say you wanted to find all ThAr lamps:
$ find 20170328_29 -name "*.fits" -exec fitsheader {} \; | grep -e "fits" -e "OBJECT = 'ThAr'"
You should be able to parse what the first command (find) is doing based on the previous section; in brief, it searches for all fits files within the directory named 20170328_29 and runs each file through a program called fitsheader, which extracts and prints the header. We then pipe (|) the output to grep. Therefore, the stream grep will search is a stream of fits headers.
We will now break down the grep block:
- grep : invoke the grep program
- -e : define a pattern to be matched. This is only needed when you want to define multiple patterns with an OR between them, i.e. a line matches if it contains one or both of the patterns
- "fits" : match lines which contain fits
- "OBJECT = 'ThAr'" : match lines which contain this exact string
This will then print out a line for every file with its path and, if that file is a ThAr lamp, another line saying so. There are ways, using other utilities such as awk and sed, that this output can be cleaned up, but for now we can work with this.
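To see the OR behavior of multiple -e patterns in isolation, here is a minimal sketch run on an inline stream of made-up lines (loosely mimicking fitsheader output):

```shell
# Lines matching either pattern (or both) are printed; the EXPTIME line
# matches neither pattern, so it is dropped.
printf "a.fits\nOBJECT  = 'ThAr'\nEXPTIME = 30\n" | grep -e "fits" -e "ThAr"
```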
Another use case where I find grep very helpful is searching a code base for some variable. For example, if I have code in a directory src and I want to find all files which contain the variable SUMASS, I can use the following grep command:
$ grep -rn "SUMASS" src
The flags here will:
- -r : recursively search the given directory
- -n : print the line number along with each match
For every file containing the variable, this will print the file’s name and the line where the variable shows up. This can be incredibly helpful when debugging, especially when debugging someone else’s code.
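When the directory also contains files you do not care about, you can narrow the recursive search by file name as well. A small sketch, assuming GNU grep (--include is a GNU extension) and an invented src layout:

```shell
# Only search files whose names match *.py; everything else in src is skipped.
grep -rn --include="*.py" "SUMASS" src
```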
The final use case of grep I want to bring up is searching your history for an old command. On most UNIX-like systems there is a command, history, which will print out the last n commands you have issued (n is configurable but is usually quite large). Let’s say you know you had to run some command called mikedb for a reduction pipeline in the past, but you don’t remember exactly what the arguments were, or what command you had to run just before or after it:
$ history | grep -B4 -A4 mikedb
This will show the 4 lines before, the 4 lines after, and the line containing every recorded use of mikedb in your command history. This is super helpful for recreating old work.
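The context flags are easy to demo without touching your real history; this sketch pipes a fake three-line command history into the same grep invocation (with one line of context on each side rather than four, to keep the example tiny):

```shell
# -B1/-A1: print one line of context before and after each matching line,
# so all three lines of this fake history come back.
printf "cd reduction\nmikedb\nls\n" | grep -B1 -A1 mikedb
```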
Moving on
Both find and grep are far more powerful than what I have introduced here, and I would encourage you to use their man pages and to Google around when you need them for more complex tasks; for the majority of what I use them for, however, what is here should suffice. In the next entry in this series we will talk about stream editing using awk and sed, and show some examples of how they can make find and grep even more powerful.