A few months back I was talking with one of my professors from undergrad, and the question came up of what, with the benefit of hindsight, I would most want to add to the physics and astronomy curriculum there. I strongly believe that an introduction to regular expressions belongs somewhere near the top of that list.
Anyone who has played around with web development is likely to have run into regular expressions. I got my introduction to them through vim’s and sed’s wonderfully powerful command sets; others may have run into them elsewhere. As it turns out, regular expressions are everywhere. However, as common as they may be, I also know plenty of people who have either never used them or never even heard of them. Perhaps this makes sense in astronomy where, thanks to the oft-underrated FITS file, we spend very little time (or at least less than we would if FITS were not so ubiquitous) parsing headers for metadata.
For those who may be unaware, regular expressions are a very powerful (read: general) way of describing patterns in a string. The simplest regular expression is the direct match, which finds exact copies of what you ask for (called the pattern). From the direct match, regular expressions can grow arbitrarily complex to match almost any pattern you are likely to need to describe.
Why do we need such a powerful pattern matching tool? A web developer might tell you that regular expressions (which I will hereafter refer to as regexes, the common shorthand) provide a clean way of checking that the content a user enters into a form matches the data that form expects (did the user actually enter a valid email address, for example). I would tend to emphasize how regexes let me find and replace in vim very quickly, and for significantly more general expressions than is typical in non-regex find and replace tools. Here, however, I’m gonna provide an example targeted at observational astronomy.
One problem which many of us can imagine running into is having a list of 2MASS IDs and wanting to get sky coordinates out of them. We know that the 2MASS ID is simply a string built from RA and Dec (in the 2000 epoch), and we can pretty trivially look at any individual ID to get its coordinates. That process gets tedious, however, for more than just a few IDs. How could we use regex to clean up this process?
Imagine the following 2MASS ID
ID = "2MASS J04130560+1514520"
We can see that the RA of this star is 04:13:05.60 and the Declination is +15:14:52.0. With regex we know that we can match patterns in a string, so the problem is one of describing the pattern we want to extract.
Regex Patterns
The fundamental concept behind regex is the pattern: you describe a pattern and regex matches that pattern against a string. Regex can take a pattern as a literal (exactly what you are looking for), but simply asking regex for a literal match fails to make use of the enormous power underpinning this tool. To move past literal matches, special and wildcard patterns can be strung together.
It’s important here to take a quick sidebar to note that regex is not standardized and there are multiple different implementations. As this blog is aimed mainly at people who will be using Python, I am going to present patterns which work with the Python re module (essentially similar to Perl’s regexes), but note that some of the things I use here won’t, for example, work with sed or egrep.
Some of the basic special patterns to know are
- \d : for matching any numeric character
- . : for matching any character (except a newline, by default)
- ? : for matching the preceding character 0 or 1 times (making the previous character optional)
- + : for matching the preceding character 1 or more times
- * : for matching the preceding character 0 or more times
- ^ : force pattern to start at beginning of string
- $ : force pattern to end at conclusion of string
This is nowhere close to an exhaustive list, but for most of what an astronomer might use regex for on a daily basis these will cover 95% of what you need to do. As an example of usage, consider the following pattern
\d+E\d?$
We can use the knowledge we have of regex to translate this into a human readable pattern
- $ : The pattern must come at the end of the line / be the last thing on the line
- \d+ : match 1 or more numeric characters
- E : match the letter E directly
- \d? : match 0 or 1 numeric characters
So this regex will look for anything made up of 1 or more numeric characters followed by an E, which may then be followed by at most 1 more numeric character before the end of the line.
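Some examples of strings that would match this pattern are (matching part noted after each)
- 7659812E : the whole string matches
- 9819E9 : the whole string matches
- 1E : the whole string matches
- 1E1 : the whole string matches
- Hello83724E : only 83724E matches
- World34234E4 : only 34234E4 matches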
Some strings which would not match on the other hand are
- E2 : missing any numeric characters before the E
- 587432EE : two Es
- 8821E7Hello : Pattern does not anchor to the end of the string (Hello is after)
- 3984238 : missing E and character after
Note how, in the examples where the regex matches, it can pull out a substring (the pattern does not match Hello or World but does match the remaining string after those prefixes).
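The match and no-match lists above can be checked directly in Python. This is only a minimal sketch: the variable names are my own, and it assumes re.search (which scans the whole string) rather than re.match (which only tries at the very start of the string).

```python
import re

pattern = re.compile(r"\d+E\d?$")

should_match = ["7659812E", "9819E9", "1E", "1E1", "Hello83724E", "World34234E4"]
should_not_match = ["E2", "587432EE", "8821E7Hello", "3984238"]

# re.search looks for the pattern anywhere in the string; re.match would only
# try at position 0 and so would reject "Hello83724E".
matched = {s: pattern.search(s).group(0) for s in should_match}
rejected = [s for s in should_not_match if pattern.search(s) is None]

for s, m in matched.items():
    print(f"{s!r} -> matched {m!r}")
print("no match:", rejected)
```

For the strings with a Hello or World prefix, group(0) holds only the numeric tail, which is the substring behaviour described above.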
RA and Dec from a 2MASS ID
Perhaps you already see some applications of this to your own work; however, let’s make those applications a little more explicit by returning to the example of extracting coordinates from a 2MASS ID. Recall that we have set the variable ID to be equal to
ID = "2MASS J04130560+1514520"
To extract coordinates from this we need one more concept: the group. Groups in regex, denoted with parentheses, allow us to split matched patterns up into subsections. So to extract RA we can use the pattern
J(\d{2})(\d{2})(\d{2})(\d{2})
Now, I’ve gone ahead and introduced a few new concepts here so let’s parse through this.
- J : Match the character J literally. This lets me select just the part of the string starting with J. Because right now I am interested only in RA, and not Dec, matching a J first excludes the second half of the ID (which starts with a +)
- (\d{2}) : Match a numeric character 2 times and call that a group so we can separate it from the rest of the match. Braces allow you to specify how many times to match a character.
Here I am trying to match four groups of 2 numeric characters each. The entire pattern starts with the letter J. We can apply this to the ID with a small bit of python code
import re # The regular expression module in python
ID = "2MASS J04130560+1514520"
# Put an r before regex strings in python so special characters work properly
pattern = r"J(\d{2})(\d{2})(\d{2})(\d{2})"
# findall will return a list of every time the entire pattern shows up. If the pattern only appears once select the 0th element of that list to get the actual regex match
matchs = re.findall(pattern, ID)
RA = matchs[0]
print(RA)
OUT[1]: ('04', '13', '05', '60')
Note how the J was not returned; that’s because the J was not in a group (not in parentheses). We could then format that into a more conventional style using a bit more python
RA = f"{':'.join(RA[:3])}.{RA[3]}"
print(RA)
OUT[2]: 04:13:05.60
Hurrah! That’s a properly formatted RA. What about Dec? Well, let’s try the same pattern but replacing the J with a +
import re # The regular expression module in python
ID = "2MASS J04130560+1514520"
# Put an r before regex strings in python so special characters work properly
pattern = r"+(\d{2})(\d{2})(\d{2})(\d{2})"
matchs = re.findall(pattern, ID)
RA = matchs[0]
print(RA)
OUT[3]: error: nothing to repeat at position 0
So why did that not work? There are a couple of things to notice. For one, the + character is reserved; recall that it means match the preceding character 1 or more times. That means that if we want to match the literal + character we need to “escape” it, which simply means putting a special character before it that tells regex to treat it not as special but as literal. Most implementations of regex use \ as the escape character. So the pattern would then be
pattern = r"\+(\d{2})(\d{2})(\d{2})(\d{2})"
If we run the code with that pattern, however, we will still get an IndexError. That’s because there was no match: if you print matchs (not RA) you can see that it is an empty list, so asking for the 0th element throws an error. This is because the Dec in this ID is not made of 4 groups of 2 digits but of 3 groups of 2 and one group of 1 (the Dec is one digit shorter than the RA). Okay, that’s an easy fix
pattern = r"\+(\d{2})(\d{2})(\d{2})(\d)"
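The three attempts at a Dec pattern can be run side by side to see each failure mode. This is just a sketch; the variable names (unescaped_ok, too_long, dec) are mine and not part of the original code.

```python
import re

ID = "2MASS J04130560+1514520"

# 1. Unescaped +: the pattern itself is invalid, so re raises re.error
try:
    re.findall(r"+(\d{2})(\d{2})(\d{2})(\d{2})", ID)
    unescaped_ok = True
except re.error as err:
    unescaped_ok = False
    print("unescaped:", err)

# 2. Escaped +, but four 2-digit groups: only 7 digits follow the +,
#    so nothing matches and findall returns an empty list
too_long = re.findall(r"\+(\d{2})(\d{2})(\d{2})(\d{2})", ID)
print("four groups of 2:", too_long)

# 3. Escaped +, with a single digit in the last group
dec = re.findall(r"\+(\d{2})(\d{2})(\d{2})(\d)", ID)
print("last group of 1:", dec)
```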
And yes, indeed, that does seem to give the Dec. However, there are just one or two more things we should address. First of all, the sign on the Declination does matter: we would not want to parse an ID and think it’s in the northern sky when it is really in the southern sky. Placing the \+ in a group returns the sign along with the digits (a southern ID would carry a - instead, which a pattern looking for \+ will not match). Second, some IDs will actually have 2 decimal places for the Dec seconds (and in fact some may have only one decimal for the RA seconds). Therefore, we might look for a more general way of requesting either one or two digits in the last group for both Dec and RA.
We know that {n} will match the preceding character n times; however, this syntax actually extends to ranges. {n,m} (written without a space) will match the previous character between n and m times. Therefore, we can form the new Dec pattern as
pattern = r"(\+)(\d{2})(\d{2})(\d{2})(\d{1,2})"
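One limitation worth checking is that (\+) only ever captures a plus sign. A possible way around this, sketched below, is a character class, [+-], which matches exactly one character from the set inside the brackets. Character classes are standard regex syntax but were not introduced above, and the southern ID here is made up purely for illustration.

```python
import re

# (\+) can only capture a plus sign.
plus_only = r"(\+)(\d{2})(\d{2})(\d{2})(\d{1,2})"

# [+-] matches exactly one character, either + or -.
# Inside square brackets the + does not need escaping.
either_sign = r"([+-])(\d{2})(\d{2})(\d{2})(\d{1,2})"

northern = "2MASS J04130560+1514520"
southern = "2MASS J04130560-1514520"  # hypothetical ID, for illustration only

print(re.findall(plus_only, southern))    # no match at all
north_dec = re.findall(either_sign, northern)[0]
south_dec = re.findall(either_sign, southern)[0]
print(north_dec)
print(south_dec)
```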
We now have one regular expression which will extract the RA from a 2MASS ID and another which will extract the Dec. We can put these together using a bit of python.
import re # The regular expression module in python
ID = "2MASS J04130560+1514520"
# Put an r before regex strings in python so special characters work properly
RA_pattern = r"J(\d{2})(\d{2})(\d{2})(\d{1,2})"
Dec_pattern = r"(\+)(\d{2})(\d{2})(\d{2})(\d{1,2})"
# rf marks the string as both raw (r), so backslashes reach the regex engine
# untouched, and formatted (f), so python substitutes the value of anything
# in {}: the variable RA_pattern replaces {RA_pattern}. Note that this gets
# tricky if you want regex's {} range notation in the same f string: literal
# braces must be doubled there, e.g. \d{{2}}.
full_pattern = rf"({RA_pattern})({Dec_pattern})"
match = re.findall(full_pattern, ID)
RA = f"{':'.join(match[0][1:4])}.{match[0][4]}"
Dec = f"{match[0][6]}{':'.join(match[0][7:10])}.{match[0][10]}"
print(f"RA is: {RA}")
print(f"Dec is: {Dec}")
OUT[4]: RA is: 04:13:05.60
Dec is: +15:14:52.0
Note here that I had to adjust the final formatting a bit from before. This is because both RA and Dec were placed in a group (they are in parentheses when I define full_pattern). This means that in addition to the individual elements, the regular expression also returns both the full RA string and the full Dec string.
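To see where those shifted indices come from, it can help to print every captured group alongside its position. A small sketch (the name groups is mine): groups are numbered by the order of their opening parentheses, so each outer group comes just before the groups nested inside it.

```python
import re

ID = "2MASS J04130560+1514520"
RA_pattern = r"J(\d{2})(\d{2})(\d{2})(\d{1,2})"
Dec_pattern = r"(\+)(\d{2})(\d{2})(\d{2})(\d{1,2})"
full_pattern = rf"({RA_pattern})({Dec_pattern})"

# findall returns one tuple per match, holding every group in order
groups = re.findall(full_pattern, ID)[0]
for index, value in enumerate(groups):
    print(index, value)
```

Index 0 is the whole RA string (including the J) and index 5 is the whole Dec string, which is why the RA pieces sit at indices 1 to 4 and the Dec pieces at 6 to 10.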
There are ways in which this regex could be cleaned up; however, I think they would only serve to obscure the point that more complex regexes can be built out of quite simple patterns and used to parse data easily and quickly.
This has been a very high level overview of regexes (using what is perhaps a somewhat silly example) and certainly does not cover all or even most of what can be accomplished using them. However, I hope that by seeing this simple example you can start to incorporate regexes into your problem solving toolkit.
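For the curious, here is one possible cleanup, offered as a sketch rather than the definitive version. It uses named groups, (?P<name>...), so results can be looked up by name instead of by index, and a [+-] character class so either sign is accepted. The function name parse_2mass_id, the group names, and the second and third test strings are all hypothetical choices of mine.

```python
import re

# Named groups let us refer to each piece by name instead of by position.
TWOMASS = re.compile(
    r"J(?P<ra_h>\d{2})(?P<ra_m>\d{2})(?P<ra_s>\d{2})(?P<ra_frac>\d{1,2})"
    r"(?P<sign>[+-])(?P<dec_d>\d{2})(?P<dec_m>\d{2})(?P<dec_s>\d{2})(?P<dec_frac>\d{1,2})"
)

def parse_2mass_id(identifier):
    """Return (RA, Dec) as sexagesimal strings, or None if nothing matches."""
    m = TWOMASS.search(identifier)
    if m is None:
        return None
    ra = f"{m['ra_h']}:{m['ra_m']}:{m['ra_s']}.{m['ra_frac']}"
    dec = f"{m['sign']}{m['dec_d']}:{m['dec_m']}:{m['dec_s']}.{m['dec_frac']}"
    return ra, dec

# The second ID is made up to exercise the minus sign; the third is not an ID
ids = ["2MASS J04130560+1514520", "2MASS J04130560-1514520", "not an ID"]
coords = [parse_2mass_id(i) for i in ids]
for identifier, c in zip(ids, coords):
    print(identifier, "->", c)
```

Returning None for strings that do not match means a long list of IDs can be processed without an IndexError stopping the loop partway through.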