Sunday, February 13, 2011

Regular Expression

A regular expression describes a pattern to match in text and, optionally, lets you extract data from whatever matched. It is like the 'grep' or 'findstr' command, only better.

For example, a station name might be an 'A', followed by numbers, followed by 'S'. Below are examples of station names.

A50S
A234S
A3S
A5425S


You might want to know whether a line of text contains a station name. Optionally, you might want to pull, say, '50' out of 'A50S'. A regular expression, or regex, is the tool for this. Below are the regexes I use most often. For a more complete discussion, read this.

Use the symbol below    To match
\d                      any decimal digit
\D                      any non-digit
\s                      any whitespace character, such as space, tab or newline
\S                      any non-whitespace character
[a-zA-Z]                any letter of the alphabet
[abc]                   either 'a' or 'b' or 'c'
.                       any character except newline
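A quick way to check what each symbol matches is to try it with `re.findall`, which returns every match as a list. The sketch below uses made-up sample strings. Note that inside square brackets a comma is an ordinary character, so `[a-z,A-Z]` matches a comma as well as letters, while `[a-zA-Z]` matches letters only.

```python
import re

sample = 'A50S x,y'  # made-up sample text, not from a real data set

digits = re.findall(r'\d', sample)             # ['5', '0']
spaces = re.findall(r'\s', sample)             # [' ']
letters = re.findall(r'[a-zA-Z]', sample)      # ['A', 'S', 'x', 'y']
with_comma = re.findall(r'[a-z,A-Z]', sample)  # ['A', 'S', 'x', ',', 'y']
either = re.findall(r'[abc]', 'cab!')          # ['c', 'a', 'b']
any_char = re.findall(r'.', 'a\nb')            # ['a', 'b'], the newline is skipped

print(digits, spaces, letters, with_comma, either, any_char)
```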

Each pattern above matches only one character. For example, \d matches exactly 1 digit. To match 2 digits, you write \d\d. If there could be 2, 3, 4 or more digits and you don't know in advance how many, this approach does not work. Instead, use the following 'special characters'.

Use this special character    To match, with respect to the previous symbol (see above)
*                             zero or more instances of the same thing
+                             one or more instances of the same thing

So, if you say '\d*', it would match even if there is no digit. '\d+' would match if there is at least one digit.
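The difference between the two can be seen on a string with no digits at all. This is a minimal sketch with made-up strings; `group()` with no argument returns the text that matched.

```python
import re

no_digits = 'abc'       # made-up sample strings
some_digits = 'abc123'

star = re.search(r'\d*', no_digits)   # succeeds, but matches the empty string
plus = re.search(r'\d+', no_digits)   # None, because at least one digit is required
run = re.search(r'\d+', some_digits)  # matches the whole run of digits

print(repr(star.group()))  # ''
print(plus)                # None
print(run.group())         # 123
```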

Now, the regex formed above would match anywhere in the line. There are times when you want a match only if it is at the beginning or end. To accomplish this, use the following 'anchors' at the beginning (use ^) or end (use $) of your regex

^regex  - if you want a match only at the very beginning of the line
regex$  - if you want a match only at the very end of the line
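As a small sketch of the two anchors, using made-up sentences and the `\d+` pattern from above:

```python
import re

# ^ ties the match to the start of the string
begins = re.search(r'^\d+', '2011 was the year')

# $ ties the match to the end of the string
ends = re.search(r'\d+$', 'the year was 2011')

# The digits sit in the middle here, so neither anchored regex matches
not_at_start = re.search(r'^\d+', 'in 2011 it rained')
not_at_end = re.search(r'\d+$', 'in 2011 it rained')

print(begins.group(), ends.group(), not_at_start, not_at_end)
```

Without the anchors, `\d+` would find '2011' in all three sentences.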

The example below should make it clear.

You want to find a match for a station name, which could be any of:


A50S
A234S
A3S
A5425S

The pattern for a station name is: 'A', followed by one or more digits, followed by 'S'. The regex maps onto that description piece by piece:

'A\d+S'
 A      the letter 'A'
 \d+    one or more digits
 S      the letter 'S'

Recall that '\d' is one decimal digit, so '\d+' (one or more of it) matches the first digit and any digits that follow. Had you used '\d*' instead, it would also match when there is no digit at all, as in 'AS', because zero digits are allowed. Having formed the regex, you can use it as follows.

>>> import re
>>> pattern = re.compile(r'A\d+S')
>>> s = 'A50S'
>>> if pattern.search(s):
...     print('got a match')
... else:
...     print('no match')
...
got a match

First you import the regex module (re), then compile the regex into a pattern, then call the pattern's 'search' method with the string to be searched as its argument. 'search' returns a match object if it finds a match and None otherwise, so it can be used directly in an 'if' test.

>>> s = 'I went inside A698S and shouted'
>>> if pattern.search(s):
...     print('got a match')
... else:
...     print('no match')
...
got a match

It still got a match even though the station name is in the middle of the string.

>>> s = 'AB453S'
>>> if pattern.search(s):
...     print('got a match')
... else:
...     print('no match')
...
no match


It did not match this time because after 'A', you got 'B' which is not a digit.
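The three checks above can also be run in one go. The sketch below wraps the test in a small helper; the name `has_station` is made up for this illustration.

```python
import re

pattern = re.compile(r'A\d+S')

def has_station(text):
    """Return True if a station name appears anywhere in text (hypothetical helper)."""
    # search() gives back a match object on success and None on failure
    return pattern.search(text) is not None

samples = ['A50S', 'I went inside A698S and shouted', 'AB453S']
results = [has_station(s) for s in samples]

for s, found in zip(samples, results):
    print(s, '->', 'got a match' if found else 'no match')
```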


Suppose you have a file 'test.txt' which contains the following:


A432S is the first station - a station
Next to that is A53L - not a station
Third is A432222S which is large! - a station
Fourth is A5422CS - not a station


To print only the lines that contain a station name, you could use the script reg.py below.


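import re

# Raw string, so the backslash in \d reaches the regex engine untouched
pattern = re.compile(r'A\d+S')

# 'with' closes the file automatically when the block ends
with open('test.txt', 'r') as fin:
    for line in fin:
        # Keep only lines that have a station name anywhere in them
        if pattern.search(line):
            print(line)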


Now, let me run this script...

[postgres@postgres-desktop:~]$ python reg.py                      (02-07 00:09)
A432S is the first station - a station

Third is A432222S which is large! - a station
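The blank line between the matches in the output above appears because each line read from the file already ends with a newline and print adds another. A minimal sketch of one way to avoid that is to strip the trailing newline before printing. The list below stands in for the contents of test.txt so the example runs without the file.

```python
import re

pattern = re.compile(r'A\d+S')

# Stand-in for the lines of test.txt (an assumption made so this runs anywhere)
lines = [
    'A432S is the first station - a station\n',
    'Next to that is A53L - not a station\n',
    'Third is A432222S which is large! - a station\n',
    'Fourth is A5422CS - not a station\n',
]

# Keep matching lines and drop the newline each one carries
matches = [line.rstrip('\n') for line in lines if pattern.search(line)]

for line in matches:
    print(line)
```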

If you want a match only if station name is at the beginning of the line, your regex would be:

'^A\d+S'

You need to 'compile' this new regex and assign it to a pattern variable before doing a 'search'. Here is another example of a regex.

To match:

xa aDfd455ddD
xa XdG4442sDt
xa cCC123Xt

The pattern here is 'xa', followed by a space, followed by a group of letters, followed by digits, followed by letters. The regex is

'xa [a-zA-Z]+\d+[a-zA-Z]+'

Again, keep in mind that the '+' sign means 'one or more instance of the same thing' and not concatenation of regular expression.
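A quick check of this pattern against the three strings, written here with the letters-only class `[a-zA-Z]`. The strings in `non_samples` are made up to show what the regex rejects.

```python
import re

pattern = re.compile(r'xa [a-zA-Z]+\d+[a-zA-Z]+')

samples = ['xa aDfd455ddD', 'xa XdG4442sDt', 'xa cCC123Xt']

# Made-up strings that each break one part of the pattern
non_samples = [
    'xa 455ddD',      # no letters before the digits
    'xa aDfd455',     # no letters after the digits
    'xb aDfd455ddD',  # does not start with 'xa'
]

matched = [pattern.search(s) is not None for s in samples]
rejected = [pattern.search(s) is None for s in non_samples]

print(matched)
print(rejected)
```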

----------------------------------------------------------------------
EXTRACTING DATA USING REGEX

Going back to the station example from the beginning: suppose that for a station name such as A42S, you want to extract the numeric part in the middle ('42' here). How do you do it? Enclose the part of the regex that you want to extract in a pair of parentheses and use the 'group' method of the match.

The regex for station name is: 'A\d+S'

You want the digit portion, which is '\d+'. So, enclose it in parentheses and assign the result to a new pattern:

>>> pattern1 = re.compile(r'A(\d+)S')

Then, do a search followed by 'group'

>>> s = 'I am going to A4326S today'
>>> m = pattern1.search(s)
>>> m.group(1)
'4326'
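If the string has no station name, 'search' returns None and calling 'group' on it raises an error, so it is worth checking first. The sketch below does that and also converts the digits to a number; the helper name `station_number` is made up for this illustration.

```python
import re

pattern1 = re.compile(r'A(\d+)S')

def station_number(text):
    """Return the station's number as an int, or None if there is no station
    name in text (hypothetical helper)."""
    m = pattern1.search(text)
    if m is None:
        return None
    # group(1) is the first parenthesised part; group(0) would be all of 'A4326S'
    return int(m.group(1))

found = station_number('I am going to A4326S today')
missing = station_number('no station here')

print(found)    # 4326
print(missing)  # None
```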

Now, what if parentheses are part of the string to be searched? All you need to do is 'escape' them using '\'.

For example, you might be trying to match 'origin(23 45)'. Your regex would be:

'origin\(\d+ \d+\)'

If '23' and '45' are the x and y coordinates, respectively, and you want to extract them, proceed as follows. Enclose the digits that you want to extract in parentheses, do a search, then call group:

>>> pattern2 = re.compile(r'origin\((\d+) (\d+)\)')
>>> s = 'display picture origin(23 45)'
>>> m = pattern2.search(s)
>>> x, y = m.group(1,2)
>>> x
'23'
>>> y
'45'
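When a string holds more than one origin, `findall` can pull out every pair at once: with two groups in the regex it returns a list of tuples. A minimal sketch, using a made-up input string and converting the digits to integers:

```python
import re

pattern2 = re.compile(r'origin\((\d+) (\d+)\)')

text = 'origin(23 45) then origin(7 120)'  # made-up input with two origins

# Each tuple holds the two parenthesised groups of one match, as strings
pairs = pattern2.findall(text)

# Convert to integers so the coordinates can be used in arithmetic
points = [(int(x), int(y)) for x, y in pairs]

print(pairs)   # [('23', '45'), ('7', '120')]
print(points)  # [(23, 45), (7, 120)]
```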
