Friday, December 31, 2010

Counting the Number of Occurrences of Words in a File

Suppose you want to count the number of times each word appears in a text file. How would you do it? The best approach would be to use dictionaries. You use the word to be counted as the key, and the dict content would be the word count. For example, d['and'] = 4 means the word 'and' appeared 4 times.

If you are quick, you might now be asking: I don't even know what words exist in the file, so how could I know what keys to use? All you need to do is check whether a word is already a key in the dict. If it is not, create a dict entry with that key; otherwise, simply increment that dict entry. Below is an example:


We use humpy.txt whose contents are:

Humpty Dumpty sat on a wall,
Humpty Dumpty had a great fall.
All the king's horses and all the king's men
Couldn't put Humpty together again.

The script that will count the number of occurrences of a word is count.py with listing below:

fin = open('humpy.txt','r')
d = {}
for line in fin:
    # break down the line into individual words
    # strip() would remove leading/trailing spaces and newlines
    # words below will be an array whose elements are the
    # individual words
    words = line.strip().split(' ')
    for word in words:
        if word in d: # does this word have an entry in d?
            d[word] += 1  # then increment it
        else:  # no entry yet. So, make an entry
            d[word] = 1
fin.close()
# ok, all words have been processed. Print the statistics
for key, value in d.iteritems():
    print '%s appeared %d times' % (key,value)

If you run the command below in the command line,

python count.py

You will get this result:

a appeared 2 times
on appeared 1 times
great appeared 1 times
again. appeared 1 times
Humpty appeared 3 times
all appeared 1 times
Dumpty appeared 2 times
men appeared 1 times
had appeared 1 times
wall, appeared 1 times
together appeared 1 times
king's appeared 2 times
horses appeared 1 times
All appeared 1 times
fall. appeared 1 times
Couldn't appeared 1 times
put appeared 1 times
and appeared 1 times
the appeared 2 times
sat appeared 1 times

You will notice that 'all' is different from 'All'. If you want them to be considered the same, you could convert all words into lower case first by changing this

words = line.strip().split(' ')

into this

words = line.strip().lower().split(' ')
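Lower-casing still leaves 'wall,' and 'fall.' with their punctuation attached. Below is a minimal sketch of one way to clean that up, assuming it is enough to trim a few common punctuation marks from both ends of each word. The helper name normalize and the list rhyme (standing in for the lines of the file) are made up for this illustration.

```python
# The four lines of the rhyme, standing in for the lines read from the file.
rhyme = [
    "Humpty Dumpty sat on a wall,",
    "Humpty Dumpty had a great fall.",
    "All the king's horses and all the king's men",
    "Couldn't put Humpty together again.",
]

def normalize(word):
    """Lower-case a word and trim common punctuation from both ends."""
    return word.strip('.,;:!?').lower()

d = {}
for line in rhyme:
    # split() with no argument splits on any run of whitespace
    for word in line.split():
        word = normalize(word)
        if not word:          # skip tokens that were only punctuation
            continue
        if word in d:
            d[word] += 1
        else:
            d[word] = 1
```

With this change 'wall,' is counted as 'wall', and 'All' and 'all' share one entry. The apostrophe in "king's" is kept because strip() only trims the ends of the string.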


The other thing that may be unusual is the print syntax. It is actually similar to that of C where you specify the formatting string: %s for string, %d for integer, and %f for float, and then you place the corresponding variables to be printed inside the parentheses in the same order as the formatting string.

Another alternative for the print statement would be:

print key + ' appeared ' + str(value) + ' times'

Or better yet, the more efficient version

print ' '.join([key,'appeared',str(value),'times'])

This might look like a trivial example, but I used this idea to create a report of how many alarms were generated for a particular alarm category and even for the alarm message itself. I parsed the alarm log files and created a dict d[cat] to count the number of alarms for a particular category. I also created a dict for the alarm messages and printed the top 20 alarm messages. This helps us easily identify nuisance alarms or perhaps legitimate alarms that need immediate attention.
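The "top 20" part could be sketched roughly as below. This is only an illustration and not the actual report script: the sample messages and the function names count_items and top_n are invented, and it assumes the messages were already parsed into a list. It uses print(...) with a single argument so that it runs on Python 2.6 and on Python 3.

```python
# Hypothetical sample data standing in for messages parsed from a log file.
alarm_messages = [
    'BREAKER OPEN', 'LOW VOLTAGE', 'BREAKER OPEN', 'COMM FAIL',
    'BREAKER OPEN', 'LOW VOLTAGE',
]

def count_items(items):
    """Return a dict mapping each item to the number of times it appears."""
    counts = {}
    for item in items:
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
    return counts

def top_n(counts, n):
    """Return the n (item, count) pairs with the highest counts."""
    # Sort by count, highest first; ties are ordered alphabetically by item.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:n]

for message, count in top_n(count_items(alarm_messages), 2):
    print('%s appeared %d times' % (message, count))
```

Changing the 2 to 20 would give a top 20 list.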

Monday, December 27, 2010

Python Tutorial

This tutorial is intended for those who already have programming experience, most notably in C and VB, to help them transition easily to python. For each topic, VB or C code is shown first and then translated to python for easier comparison. I have minimal VB experience and hence, the VB version might be lame and have lots of room for improvement.


Please don't skip and cherry pick as you might miss many important points.


-------------------------------------------------------------------------------------------------------

INSTALLATION

As a start, download the python installer here:

http://www.activestate.com/activepython/downloads

If you are using windows, download the windows installer (msi) for version 2.6, and then accept all default settings during installation.

-------------------------------------------------------------------------------------------------------
RUNNING THE PYTHON EDITOR/INTERPRETER

You should now see "ActiveState ActivePython 2.6" under All Programs. Select the "PythonWin Editor" and you should then see the >>> prompt in the "Interactive Window". This is where you test your code.

The normal workflow in writing a python script is that first, you have an idea but are not sure if it will work or not. So, you try it first in the >>> prompt and see the results immediately. You then just copy and paste the working snippet to your editor, fully confident that that portion of the code is already correct.

For the earlier part of this tutorial, everything was just copied and pasted from the "Interactive Window".  I would recommend that you try it out yourself and do more experimentation to gain more confidence in the language.

-------------------------------------------------------------------------------------------------------
DECLARATION OF VARIABLES

In python, you normally don't declare variables. Those who do are black-belt pythonistas, but beginners like us are unlikely to be in a situation where we need to. Variable names are CASE SENSITIVE! Below are samples:

>>> a = 2
>>> print a
2
>>> a
2

As could be seen above, typing the variable name actually prints the contents. From now on, instead of typing 'print a', I will just be typing "a" to print its contents.

>>> a = 1.25
>>> a
1.25
>>> a = 'This is good'
>>> a
'This is good'

As you could see, I assigned "a" to an int and then float, and then string and it did not complain. I did not even bother declaring what the contents would be beforehand.

The usual convention in python is to use lower case for the first letter of a variable or function name, and upper case for the first letter of a class name. Use an underscore as the first character of a variable in a class if you want others to treat it as private. This is just a convention, because you could still access that variable from outside the class if you really want to; it is just a way to tell the users of the class not to mess with it.

-------------------------------------------------------------------------------------------------------
OPERATIONS

Operations are mostly similar to those in C.

>>> a = 3
>>> b = 4
>>> c = 6
>>> 2*a
6
>>> a**2
9
>>> c/a
2
>>> b % a
1

Exponentiation is ** and modulo is %. Python also allows string operations where they make sense. In the example below, subtraction does not make sense and throws an exception.

>>> a = 'noel'
>>> b = 'quintos'
>>> a + b
'noelquintos'
>>> 3 * a
'noelnoelnoel'
>>> '-' * 50
'--------------------------------------------------'
>>> a - b
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for -: 'str' and 'str'

Bit-wise operation is similar to C.

>>> a = 4
>>> b = 2
>>> a | b
6
>>> c = 6
>>> a & c
4

-------------------------------------------------------------------------------------------------------
CONDITIONS

The comparison operators are similar to C (==, !=, >, <, >=, <=) but the logical operators are similar to VB (and, or, not).

>>> a = 4
>>> b = 10
>>> a > 6
False
>>> b >= 10
True
>>> b =>10
  File "<stdin>", line 1
    b =>10
       ^
SyntaxError: invalid syntax
>>> a < 5 and 6 < b
True
>>> a != 1
True
>>> a == 4 and not b > 200
True
>>> a < 8 < b
True

Take note of the last example!

In python, anything that evaluates to True or False may be used in the condition clause.

>>> a = 1
>>> b = 30
>>> a < 4 < b
True
>>> if a < 4 < b:
...     print '4 is between a and b'
... else:
...     print '4 is not between a and b'
...
4 is between a and b

Just like in C, anything that evaluates to 0 or None is False; otherwise, it is True. More on this in the last section under "Conditions Part 2".

>>> a = 0
>>> if a:
...     print 'true'
... else:
...     print 'false'
...
false
>>> a = 2
>>> if a:
...     print 'true'
... else:
...     print 'false'
...
true
>>> a = -3
>>> if a:
...     print 'true'
... else:
...     print 'false'
...
true

-------------------------------------------------------------------------------------------------------
VARIABLE SCOPE

Variables assigned at the top level of a script (the "main" part) are global in scope, while variables created inside functions and classes are known only inside those functions or classes. If a function uses a name that also exists globally, reading it (using it on the right hand side of an assignment) gives the global value, while assigning to it (using it on the left hand side) creates a separate local variable and leaves the global value unchanged. In the example below, I created a function called 'demo'. You use the 'def' keyword to indicate that it is a function.

>>> def demo():
...   print a
...
>>> a = 4
>>> demo()
4
>>> def demo2():
...     c = a
...     print c
...
>>>
>>> demo2()
4
>>> def demo3():
...   a = 5
...   print a
...
>>> a = 1
>>> demo3()
5
>>> a
1

Notice that 'a' was assigned before the function call and was not passed as an argument, and yet the function knows about it. In demo2, 'a' was on the right hand side of the assignment and the global value was used. In demo3, 'a' was on the left hand side and was assigned a value that is kept only inside the function; after the function call, the previous value of 'a' is still there.

In python, functions and variables must be defined before they are used, or an exception is thrown.

>>> def demo3():
...     print notYetDefinedVariable
...
>>> demo3()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in demo3
NameError: global name 'notYetDefinedVariable' is not defined
>>> notYetDefinedVariable = 'it works now'
>>> demo3()
it works now

-------------------------------------------------------------------------------------------------------
CODE BLOCKS

Python uses indentation to designate a block. This sounds naive but when you think about it, you will indent anyway for the readability of your code; hence braces or other forms of delimiters are superfluous. Yeah, they are a waste of typing! One word of caution: tabs are not the same as spaces, so do not mix them! Stick to spaces and indent consistently, like 4 spaces for every indent.

Indentation to designate a block of code will give you a feeling that you missed something but you will get over it fast!

-------------------------------------------------------------------------------------------------------
COMMENTS

Comments are denoted by #. Everything to its right is treated as a comment, like this:

# this is a comment

x = 9999   # assign x a very large value to denote EOF


=======================================
CONTROL STRUCTURES

-------------------------------------------------------------------------------------------------------
IF STATEMENT

VB:

If average>50 Then
    text = "Pass"
Else
     text = "Fail"
End If


python:


if average > 50:
    text = 'Pass'
else:
    text = 'Fail'

Take note of the colon at the 'if' and 'else'.

You might be wondering if there is an equivalent to C's statement below:

x = a?u:v

Yes, there is and below is how you translate this statement to python

x = u if a else v

-------------------------------------------------------------------------------------------------------
NESTED IF - THEN

VB:

If average > 75 Then
text = "A"
ElseIf average > 65 Then
text = "B"
ElseIf average > 55 Then
text = "C"
ElseIf average > 45 Then
text = "S"
Else
text = "F"
End If


python:


if average > 75:
     text = "A"
elif average > 65:
     text = "B"
elif average > 55:
     text = "C"
elif average > 45:
     text = "S"
else:
     text = "F"




There is no 'switch'/'case' statement in python like in C. A chain of if-elif-else is normally how you translate a 'case' statement.
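When each 'case' simply maps one value to another, a dictionary lookup (see the DICTIONARIES section) is another common stand-in. The sketch below is only an illustration; the function name day_type and the day names are made up for this example.

```python
def day_type(day):
    """Map a day name to 'weekend' or 'weekday', like a small 'switch'."""
    # Each key plays the role of a 'case' label.
    table = {
        'sat': 'weekend',
        'sun': 'weekend',
    }
    if day in table:
        return table[day]
    return 'weekday'      # plays the role of 'default:'
```

For example, day_type('sun') gives 'weekend' and day_type('mon') gives 'weekday'.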

-------------------------------------------------------------------------------------------------------
DO - WHILE STATEMENT

There is no do-while statement in python
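The usual workaround is a 'while True' loop with the exit test at the bottom, so that the body always runs at least once. Below is a minimal sketch of that pattern; the function name digits_of is made up for this illustration.

```python
def digits_of(n):
    """Collect the decimal digits of a non-negative integer, last digit first."""
    digits = []
    while True:                # plays the role of 'do'
        digits.append(n % 10)  # the body always runs at least once
        n = n // 10
        if n == 0:             # plays the role of 'while (n != 0)'
            break
    return digits
```

Here digits_of(472) gives [2, 7, 4], and digits_of(0) gives [0] because the body runs once before the test is made.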

-------------------------------------------------------------------------------------------------------
WHILE STATEMENT

VB:

number = 1
While number <=100
number = number + 1
Wend


python:

number = 1
while number <= 100:
    number += 1


Note: the last statement may also be written as: number = number + 1

Just like in other languages, you could use 'break' to break out of the loop and 'continue' to skip the rest of the loop statement and start over again from the top part.

-------------------------------------------------------------------------------------------------------
FOR - NEXT LOOP:


VB:

For x = 1 To 50
Print x
Next

python:

for x in xrange(1, 51):
    print x

VB:

For x = 1 To 50 Step 2
Print x
Next


python:

for x in xrange(1, 51, 2):
    print x

Just like in other languages, you could use 'break' to break out of the loop and 'continue' to skip the rest of the loop statement and start over again from the top part.
-------------------------------------------------------------------------------------------------------
WHILE - ELSE LOOP:

This exists in python but not in VB or C. The else statement is executed if the while loop completes without executing a 'break'.


>>> count = 10
>>> while count > 1:
...     if count == 4:
...         break
...     count -= 1
... else:
...     print 'you did not pass a break'
...





In this example, count reached 4 on its way down from 10, and therefore 'break' was issued. Because of this, the 'else' clause was not executed.

>>> count = 10
>>> while count > 1:
...     if count == 100:
...         break
...     count -= 1
... else:
...     print 'you did not pass a break'
...
you did not pass a break

In this example you are counting down from 10 to 1 and you will have a break only if count is 100 (will not happen). Hence, break was not issued and 'else' clause was executed.

When is this useful? When you are comparing the elements of an array and issue a 'break' to get out of the loop once you have found a match, you can have a special handler for when nothing matches.

You can also have an 'else' in a 'for' loop.
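Below is a minimal sketch of the 'for' version, applied to the search case just described. The function name find_first_negative is made up for this illustration.

```python
def find_first_negative(numbers):
    """Describe the first negative number in a list, or say there is none."""
    for n in numbers:
        if n < 0:
            result = 'found %d' % n
            break
    else:
        # Runs only when the loop finished without hitting 'break'
        result = 'no negative numbers'
    return result
```

With [3, -2, 5, -7] the loop breaks at -2, so the 'else' clause is skipped. With [1, 2] or an empty list, the loop runs to completion and the 'else' clause supplies the result.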

=======================================
DATA TYPES:

There is nothing special on integer or int and float; behavior is similar with other languages, hence, I will skip these.

-------------------------------------------------------------------------------------------------------
STRING:
The following are all valid string assignments:

>>> s = 'abc'
>>> s
'abc'
>>> s = "abc"
>>> s
'abc'
>>> s = "abc's"
>>> s
"abc's"
>>> s = '''this could span
... multiple lines and
... will be printed
... as is '''
>>> s
'this could span\nmultiple lines and\nwill be printed\nas is '

Just like in C, the backslash character '\' has a special meaning and has to be 'escaped' using another backslash for it to be taken literally by the interpreter. This often happens when specifying a file location in windows. So, if your file abc.txt is in C:\documents\abc.txt and you want to assign it to a string or pass it to a function as an argument, you have to replace every backslash with a double backslash like this.

>>> filename = 'c:\\documents\\abc.txt'
>>> filename
'c:\\documents\\abc.txt'

If you want to avoid this, you could prefix the string with "r" like this and have same effect.

>>> s = r'C:\documents\abc.txt'
>>> s
'C:\\documents\\abc.txt'

Remotely shared files on a windows machine start with a double backslash, so they would look like this:

>>> s1 = '\\\\server1\\docs\\abc.txt'
>>> s1
'\\\\server1\\docs\\abc.txt'





Or, with an "r" prefix,

>>> s2 = r'\\server1\docs\abc.txt'
>>> s2
'\\\\server1\\docs\\abc.txt'
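The doubled backslashes echoed by the interpreter can be confusing, so here is a small sketch to check what is really stored. The variable names escaped and raw are made up for this illustration.

```python
escaped = '\\\\server1\\docs\\abc.txt'   # every backslash typed twice
raw = r'\\server1\docs\abc.txt'          # raw string, typed as-is

same = (escaped == raw)          # both spellings build the same string
length = len(raw)                # counts the real characters in the string
backslashes = raw.count('\\')    # number of real backslash characters
```

Here same is True, length is 22 and backslashes is 4. The interpreter only doubles the backslashes when it echoes the string back; the string itself holds single backslashes.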



Just like in C, the elements of a string are indexed:

>>> s
'this could span\nmultiple lines and\nwill be printed\nas is '
>>> s[0]
't'
>>> s[1]
'h'
>>> s[2]
'i'



However, strings are immutable - i.e. - you cannot change them.

>>> s
'this could span\nmultiple lines and\nwill be printed\nas is '
>>> s[0] = 'x'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment

String comparison is also allowed:

>>> s1 = 'abcd'
>>> s2 = 'abcd'
>>> s3 = 'abc'

>>> s1 == s2
True
>>> s1 == s3
False

-------------------------------------------------------------------------------------------------------
ARRAYS / LISTS


Arrays in python are called 'lists'. Tuples are also arrays but I never came across a situation where I actually used them. So, I will skip tuples and discuss lists only.

This is how you assign a list. The first one is how you assign an empty list.

>>> a = []
>>> u = [2, 'abc', 4.5, [23, 4], 3]
>>> u
[2, 'abc', 4.5, [23, 4], 3]

As you could see, you could put anything in a list, even another list!

Indexing starts at 0.

>>> u
[2, 'abc', 4.5, [23, 4], 3]
>>> u[0]
2
>>> u[1]
'abc'
>>> u[3]
[23, 4]
>>> u[3][0]
23
>>> u[3][1]
4
>>> u[1][0]
'a'

If you want to reference an array from the end instead of from the start, you start with -1 and then -2, and so on.

>>> u
[2, 'abc', 4.5, [23, 4], 3]
>>> u[-1]
3
>>> u[-2]
[23, 4]
>>>

This makes it so easy to access, say, the last element of an array. The index is always -1. Of course, if you want the hard way, the last index is len(u)-1 as you would do in C.
>>> u
[2, 'abc', 4.5, [23, 4], 3]

>>> u[len(u)-1]
3

As you might have guessed, 'len' would give you the number of elements in the array.

>>> u
[2, 'abc', 4.5, [23, 4], 3]
>>> len(u)
5

In the above example, the 4th element which is another list is counted as one even though it has 2 elements inside.

Although you don't need to declare variables in python, you should somehow tell it that you are operating on a list or else, you get an exception.

>>> uu[9] = 4
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'uu' is not defined

You could make an index assignment only if that element already exists.

>>> uu = []
>>> uu[0] = 4
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list assignment index out of range



You have an error here because you have an empty list and yet you want to replace the first element, which doesn't exist.

>>> uu.append(5)
>>> uu
[5]
>>> uu[0] = 4
>>> uu
[4]
>>> uu.append(7)
>>> uu
[4, 7]
>>> uu[1] = 100
>>> uu
[4, 100]
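If you know the size in advance, you could also create the list with placeholder values first so that every index already exists. Below is a minimal sketch of both approaches; the names readings and squares are made up for this illustration.

```python
n = 5

# Approach 1: pre-fill the list, then assign by index.
readings = [0] * n       # five zeros, so indices 0 to 4 exist
readings[3] = 42         # fine, because index 3 already exists

# Approach 2: start empty and grow the list with append().
squares = []
for i in range(n):
    squares.append(i * i)
```

After this, readings is [0, 0, 0, 42, 0] and squares is [0, 1, 4, 9, 16].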

-------------------------------------------------------------------------------------------------------
LIST CONDITIONS:

Lists may be compared and are considered equal if the elements match in value at each position.

>>> l1 = [5,4,3,2,1]
>>> l2 = [5,4,3,2,1]
>>> l3 = [1,2,3,4,5]
>>> l1 == l2
True
>>> l1 == l3
False

You could also easily check if an element exists in a list:

>>> lx = [10, 20, 30, 40]
>>> 30 in lx
True
>>> 100 in lx
False


-------------------------------------------------------------------------------------------------------
EXTRACTING ELEMENTS FROM A LIST:

Of course this is obvious:

>>> a = ['abc', 200, 4.5]
>>> x = a[0]
>>> x
'abc'
>>> b = a[1]
>>> b
200
>>> c = a[2]
>>> c
4.5

However, python has an idiom to make this simpler:

>>> a
['abc', 200, 4.5]
>>> x, b, c = a
>>> x
'abc'
>>> b
200
>>> c
4.5

Perhaps, due to this idiom, swapping of variable contents could be made as follows:

>>> a = 10
>>> b = 20
>>> a, b = b, a
>>> a
20
>>> b
10

What happened is that the first value on the right is assigned to the first variable on the left, and the second value on the right is assigned to the second variable on the left. This allows you to avoid the usual use of a temporary variable shown below:

>>> a = 10
>>> b = 20
>>> T = a
>>> a = b
>>> b = T
>>> a
20
>>> b
10

Shifting of variable contents is also made simpler:

>>> first = 11
>>> second = 22
>>> third = 33
>>> new = 44
>>> first, second, third = second, third, new
>>> first
22
>>> second
33
>>> third
44

-------------------------------------------------------------------------------------------------------
ITERATING THROUGH A LIST:

You could iterate a list in this manner:

>>> a = [10, 20, 30, 40, 50]
>>> N = len(a)
>>> N
5
>>> for i in xrange(N):
...     print a[i]
...
10
20
30
40
50

It is also possible to replace N with len(a) as follows:

>>> for i in xrange(len(a)):
...     print a[i]
...
10
20
30
40
50

The examples above are not the best way. The preferred way is:

>>> for x in a:
...     print x
...
10
20
30
40
50

Not only will you type less; it is also clearer and is guaranteed not to have those "out of index" exceptions.

What if you have more than 1 list to be processed simultaneously as below?

>>> a
[10, 20, 30, 40, 50]
>>> b = [1000, 2000, 3000, 4000, 5000]

>>> for i in xrange(len(a)):
...     print a[i] + b[i]
...
1010
2020
3030
4040
5050


A better way is to use zip(), which you could think of as a kind of zipper that binds things together. This is how zip() works:

>>> zip(a,b)
[(10, 1000), (20, 2000), (30, 3000), (40, 4000), (50, 5000)]

As you could see, it pairs the ith elements of list a and list b together and it becomes the ith element of the new list. So, the best way to process more than 1 list together is as follows:

>>> for x, y in zip(a,b):
...     print x + y
...
1010
2020
3030
4040
5050

What happened is that in the first pass, x, y is assigned the value of (10, 1000), so x gets 10 and y gets 1000 (if you don't get it, you probably skipped one of my topics above entitled EXTRACTING ELEMENTS FROM A LIST. Read that again). On the second pass, x, y gets (20, 2000), and so on. As a result, you typed less code and avoided indices.
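zip() is not limited to two lists, and it stops at the end of the shortest one. Below is a minimal sketch of both points; the list c and the names totals and short are made up for this illustration. The list(...) call around zip is only there so that the same code also runs on Python 3.

```python
a = [10, 20, 30]
b = [1000, 2000, 3000]
c = [1, 2, 3]

# Three lists processed together, still without indices.
totals = []
for x, y, z in zip(a, b, c):
    totals.append(x + y + z)

# When the lists differ in length, the extra elements are dropped.
short = list(zip(a, [7, 8]))
```

Here totals is [1011, 2022, 3033] and short is [(10, 7), (20, 8)]; the 30 has no partner and is left out.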

-------------------------------------------------------------------------------------------------------
SLICING

Suppose you want just a portion of an array or list what would you do? Resist the temptation of writing a loop to accomplish this because python offers a far easier way - slicing!

The syntax is: listname[startIndex:endIndex:increment]. If you leave the startIndex blank, it implies index 0. If you leave the endIndex blank, that means all the way to the end. The increment is optional; if not provided, the increment is 1. Take note that the element at endIndex is not included. Examples below:

>>> a
[0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
>>> a[2:7]
[20, 30, 40, 50, 60]
>>> a[:7]
[0, 10, 20, 30, 40, 50, 60]
>>> a[7:]
[70, 80, 90, 100]
>>> a[2:7:2]
[20, 40, 60]

To understand the last example, go back to a[2:7] and then get every other element because increment is 2.
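Slices also accept the negative indices described under ARRAYS / LISTS, and the increment may be negative. Below is a short sketch using the same list; the variable names are made up for this illustration.

```python
a = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

last_three = a[-3:]      # from the third-to-last element to the end
all_but_last = a[:-1]    # everything except the last element
every_third = a[::3]     # whole list, taking every third element
backwards = a[::-1]      # whole list, stepping backwards
```

Here last_three is [80, 90, 100], every_third is [0, 30, 60, 90], and backwards is the list in reverse order.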

Below is how you copy a list:

>>> a
[1, 2, 3, 4]
>>> b = a[:]
>>> b
[1, 2, 3, 4]
>>>

Now, you might ask, how is b = a[:] different from b = a? In the former, b is a separate copy, whereas in the latter, b and a are the same list, so changes made through one name show up in the other. See below:

>>> a
[1, 2, 3, 4]
>>> b = a[:]
>>> b
[1, 2, 3, 4]
>>> b[0] = 10
>>> b
[10, 2, 3, 4]
>>> a
[1, 2, 3, 4]


Notice that 'a' was unaffected by the changes in 'b'

>>> c = a
>>> c
[1, 2, 3, 4]
>>> c[0] = 0
>>> c
[0, 2, 3, 4]
>>> a
[0, 2, 3, 4]



Notice that when you changed 'c', 'a' was changed as well! So, be very careful especially when passing list to a function, what you want most probably, is just to pass a copy.
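Below is a minimal sketch of the function case. The function name zero_first and the variable names are made up for this illustration.

```python
def zero_first(values):
    """Overwrite the first element in place and return the list."""
    values[0] = 0
    return values

original = [1, 2, 3, 4]
result = zero_first(original[:])   # pass a copy: 'original' is untouched

shared = [1, 2, 3, 4]
zero_first(shared)                 # pass the list itself: the caller sees the change
```

After this, original is still [1, 2, 3, 4] while shared has become [0, 2, 3, 4]. One caveat: a[:] copies only the outer list, so a list nested inside it is still shared between the copy and the original.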

-------------------------------------------------------------------------------------------------------
BUILT-IN LIST FUNCTIONS

Suppose you want to get the sum of the elements in a list. How would you do it? Perhaps, you write this code:

>>> a = [1, 2, 3, 4]
>>> total = 0
>>> for x in a:
...     total += x
...
>>> total
10

Well, python is no different from other languages if you still need to do it this way. But it is different! You are re-inventing the wheel! The right way is:

>>> a = [1, 2, 3, 4]
>>> sum(a)
10

There is also maximum and minimum:

>>> max(a)
4
>>> min(a)
1

---------------------------------------------------------------------------------------------------------
STRINGS - SPLIT AND JOIN

Often times, you have a string that is comma delimited, most likely, a csv file, and you want to process the contents. The split function is your friend. You supply it with the delimiter character and it returns a list containing the individual elements.

>>> s = '402S, cb, 701, opcl'
>>> s
'402S, cb, 701, opcl'
>>> v = s.split(',')
>>> v
['402S', ' cb', ' 701', ' opcl']

Notice that since there was a space between comma and next element, the space was retained.
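If you don't want those spaces, you could strip each element after splitting. Below is a minimal sketch; the name fields is made up for this illustration.

```python
s = '402S, cb, 701, opcl'

fields = []
for piece in s.split(','):
    # strip() removes leading and trailing whitespace from each piece
    fields.append(piece.strip())
```

Here fields ends up as ['402S', 'cb', '701', 'opcl'].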

The example below uses '$' as the delimiter instead of a comma:

>>> ss = '402S$CB$701$OPCL'
>>> ss
'402S$CB$701$OPCL'
>>> vv = ss.split('$')
>>> vv
['402S', 'CB', '701', 'OPCL']





You could also use a word as a delimiter!

>>> r = 'redandblueandgray'
>>> r.split('and')
['red', 'blue', 'gray']

The opposite of split is join. You supply join with an array of strings plus the join character and the result is a single string.

>>> vv
['402S', 'CB', '701', 'OPCL']
>>> '_'.join(vv)
'402S_CB_701_OPCL'



I want to stress these 2 points:
1.  join will process only a list. So if you don't have a list, create one
2.  The list should have only string elements.

>>> ''.join('one','two','three','four')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: join() takes exactly one argument (4 given)

Item 1 above was violated and so, you got an error. To correct:

>>> ''.join(['one','two','three','four'])
'onetwothreefour'
>>> ' '.join(['one','two','three','four'])
'one two three four'



>>> ' '.join(['I','have',2,'hands'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sequence item 2: expected string, int found

Item 2 above was violated. To correct:


>>> ' '.join(['I','have',str(2),'hands'])
'I have 2 hands'

The str() function converted the non-string value into a string.
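When the list has many non-string values, you could convert all of them in a loop before joining. Below is a minimal sketch; the sample values and the names parts and line are made up for this illustration.

```python
values = [402, 'CB', 701, 4.5]

parts = []
for v in values:
    parts.append(str(v))    # str() leaves strings alone and converts numbers

line = ','.join(parts)
```

Here line is '402,CB,701,4.5', which is handy when writing a row of a csv file.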

-------------------------------------------------------------------------------------------------------
BUILT-IN HELP - dir()

Suppose you want to know what operations are readily available for a string, or any object. Use the dir() function:
>>> s = 'abc'
>>> dir(s)
['__add__', '__class__', '__contains__', '__delattr__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__getslice__', '__gt__', '__hash__', '__init__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_formatter_field_name_split', '_formatter_parser', 'capitalize', 'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']

As you could see, most of them are self-explanatory, and a lot could be done without having to write additional code. When you write in python, keep in mind that most of the routine things that you need to do have already been done for you. So, when you find yourself writing trivial code that you wish the language handled by itself, you are most likely re-inventing the wheel!

>>> s = 'Abc'
>>> s.upper()
'ABC'
>>> s
'Abc'
>>> s.swapcase()
'aBC'
>>> s
'Abc'
>>> s.endswith('bc')
True
>>> s.isdigit()
False

What if you want more help for these built-in functions? You import help from pydoc and proceed as follows:

>>> s
'Abc'
>>> from pydoc import help
>>> help(s.rsplit)


The last command would give you:


Help on built-in function rsplit:

rsplit(...)
    S.rsplit([sep [,maxsplit]]) -> list of strings
   
    Return a list of the words in the string S, using sep as the
    delimiter string, starting at the end of the string and working
    to the front.  If maxsplit is given, at most maxsplit splits are
    done. If sep is not specified or is None, any whitespace string
    is a separator.
(END)

If you don't want to import pydoc, you could proceed as follows. It gives you the same information, though in an uglier form.


>>> s.rsplit.__doc__
'S.rsplit([sep [,maxsplit]]) -> list of strings\n\nReturn a list of the words in the string S, using sep as the\ndelimiter string, starting at the end of the string and working\nto the front.  If maxsplit is given, at most maxsplit splits are\ndone. If sep is not specified or is None, any whitespace string\nis a separator.'

As another example for lists:

>>> a = []
>>> dir(a)
['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__delslice__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getslice__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__setslice__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']
>>> a.pop.__doc__
'L.pop([index]) -> item -- remove and return item at index (default last).\nRaises IndexError if list is empty or index is out of range.'

-------------------------------------------------------------------------------------------------------
SETS

If you have 2 lists and you want to know which elements are in the first list but not in the second, and vice-versa, then sets would be more appropriate. To illustrate:

>>> l1 = ['abc','def','ghi','klm']
>>> l2 = ['xxx','yyy','abc','klm','uuu']
>>> s1 = set(l1)
>>> s2 = set(l2)
>>> inList1ButNotInList2 = s1 - s2
>>> inList1ButNotInList2
set(['ghi', 'def'])
>>> inList2ButNotInList1 = s2 - s1
>>> inList2ButNotInList1
set(['uuu', 'xxx', 'yyy'])
>>> for x in inList1ButNotInList2:
...     print x
...
ghi
def

Sets have no index as shown below. You have to use the "for" loop to extract each element.
>>> inList1ButNotInList2[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'set' object does not support indexing

With respect to Areva, this is the best approach to determine which points were added and which ones were deleted in the latest database release.
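A comparison of two releases could be sketched roughly as below. This is only an illustration: the point names are invented sample data, and it assumes the point names of each release were already collected into lists. The & operator gives the elements common to both sets.

```python
# Hypothetical point names from the old and the new database release.
old_points = ['102S.CB.701', '102S.CB.702', '74S.CB.101']
new_points = ['102S.CB.701', '74S.CB.101', '74S.CB.102']

old_set = set(old_points)
new_set = set(new_points)

# sorted() turns each result into a list in a predictable order.
added = sorted(new_set - old_set)       # in the new release only
deleted = sorted(old_set - new_set)     # in the old release only
unchanged = sorted(old_set & new_set)   # present in both releases
```

Here added is ['74S.CB.102'], deleted is ['102S.CB.702'], and unchanged holds the two points found in both releases.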

-------------------------------------------------------------------------------------------------------
DICTIONARIES

Dictionaries are like lists, except that strings are used instead of integers as the index. The strings which are used as indices are known as 'keys'. Below are examples of how you create one and access values:

>>> d = {} # initialize a dictionary
>>> d['fiftyfive'] = 55
>>> d['sixtyeight'] = 68
>>> d['iv'] = 4
>>> d['x'] = 10
>>> d
{'x': 10, 'fiftyfive': 55, 'sixtyeight': 68, 'iv': 4}
>>> d['x']
10
>>> d['sixtyeight']
68
 




If you forget to initialize the dictionary before making an assignment, you will receive an exception:

>>> unInitializedDict['Richard'] = 'Nixon'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'unInitializedDict' is not defined


To correct this,

>>> unInitializedDict = {}
>>> unInitializedDict['Richard'] = 'Nixon'

I use dictionaries a lot, especially for mapping one string to another or for storing information for convenient retrieval later on. You don't have to remember an integer, since the keys are now strings, which are more meaningful.


>>> opposites = {'black':'white', 'big':'small', 'tall':'short'}
>>> opposites['tall']
'short'

Dictionaries can be used to store any information. In Areva, I normally extract information from netmom for later use in conjunction with scadamom. Below is an example where I stored the 'from' and 'to' substations for line '704L', as well as the voltage level.

>>> tl = {}
>>> tl['704L'] = ['102S','74S','138']

So, I can easily check whether a line terminates at a given substation in this manner. The slice [:2] takes only the first two elements (the substations) and leaves out the voltage level.

>>> '100S' in tl['704L'][:2]
False
>>> '74S' in tl['704L'][:2]
True

You could also use a composite key as the index and store the voltage level as the value. In netmom, you can easily parse the composite key to reconstruct the corresponding scadamom key.

>>> cb = {}
>>> cb['102S.CB.701'] = 138

Now, how do you iterate through a dictionary (dict)?

>>> d = {'aaa':'first','bbb':'second'}
>>> for key, value in d.iteritems():
...     print key, value
...
aaa first
bbb second
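Note that iteritems() exists only in Python 2. A minimal sketch that should run on both Python 2 and Python 3 uses items() instead. Wrapping it in sorted() is my own addition here, assuming you want the keys in alphabetical order, since a dict does not promise any particular order in older Python versions.

```python
d = {'bbb': 'second', 'aaa': 'first'}

# items() is available in Python 2 and Python 3;
# iteritems() was removed in Python 3
pairs = []
for key, value in sorted(d.items()):
    pairs.append((key, value))
    print('%s %s' % (key, value))
```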


-------------------------------------------------------------------------------------------------------

DICTIONARY CONDITIONS

To check if a key exists in a dictionary:

>>> d = {'aaa':'first','bbb':'second'}
>>> 'bbb' in d
True
>>> 'ccc' in d
False


To check if a certain value exists in a dictionary:

>>> 'second' in d.values()
True
>>> 'xxx' in d.values()
False

To extract the key given a value, I use a list comprehension (to be covered in detail later). The result is a list because more than one key may hold the same value.

>>> [key for key in d.keys() if d[key] == 'second']
['bbb']
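If you need this reverse lookup many times, one option is to build a reversed dictionary once and query that instead. This is only a sketch, and it assumes the values are unique; if two keys share a value, only one of them survives in the reversed dict. The name 'reverse' is mine.

```python
d = {'aaa': 'first', 'bbb': 'second'}

# Build a value -> key dictionary once (assumes values are unique and hashable)
reverse = {}
for key, value in d.items():
    reverse[value] = key

print(reverse['second'])             # bbb
print(reverse.get('third', None))    # None, since no key maps to 'third'
```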

One of the most irritating behaviours of dictionaries is that looking up a key that does not exist throws an exception. There are times when you wish it would just return a default value instead. Good news! Dictionaries have a method "get" that does exactly this. General usage is:

dictName.get(key, defaultValue)

To illustrate, say you have a dict which stores the dollars contributed by members of an organization shown below: John has contributed $32, Ann $35, and Mario $55.

>>> contribution = {'john':32, 'ann':35, 'mario':55}



Now, Pete has not made any contribution yet; hence, his name is not in the dict. Obviously, his contribution is zero. However, if you try to look up his contribution in this dict, Python raises an error:


>>> contribution['pete']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'pete'
 


This is reasonable because Python does not attempt to guess what it should return if the key does not exist. One workaround is to check whether the key exists first and return 0 if it does not.

>>> if 'pete' in contribution:
...     contribution['pete']
... else:
...     0
...
0

However, using the "get" method is more straightforward and elegant.


>>> contribution.get('ann',0)
35
>>> contribution.get('pete',0)
0

It returns the value for the given key if the key exists, and returns the default (zero in this case) if it does not.
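The same method can shorten the word counting loop from the start of this post. Here is a minimal sketch that counts words in a string rather than a file, just to keep it short; the variable names are mine.

```python
text = "Humpty Dumpty sat on a wall, Humpty Dumpty had a great fall."

counts = {}
for word in text.split():
    # get() returns 0 for a word not seen yet, so no if/else is needed
    counts[word] = counts.get(word, 0) + 1

print(counts['Humpty'])         # 2
print(counts.get('king', 0))    # 0
```

The if/else version does the same thing; get() simply folds the "no entry yet" case into a single line.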

-------------------------------------------------------------------------------------------------------
MIXING AND MATCHING THE DIFFERENT DATA TYPES

In Python, a list can contain another list, a dictionary, or anything else, like this:

>>> u = [11, 22, [300, 400, 500], {'VI':'six', 'X':'ten', 'L':'fifty'}]
>>> u[2]
[300, 400, 500]
>>> u[2][2]
500
>>> u[3]
{'VI': 'six', 'L': 'fifty', 'X': 'ten'}
>>> u[3]['L']
'fifty'
>>> u[0]
11

Similarly, a dictionary can hold any kind of content. Be careful when creating nested ones, though: each inner dictionary or list must be created before you store anything in it, like this:

>>> pupils = {}
>>> pupils['boys'] = {}
>>> pupils['boys']['grade1'] = ['Bryan','James','Rick','Michael']
>>> pupils['boys']['grade1'].append('Mario')
>>> pupils['boys']['grade1']
['Bryan', 'James', 'Rick', 'Michael', 'Mario']

>>> pupils['girls'] = {}
>>> pupils['girls']['grade1'] = []
>>> pupils['girls']['grade1'].append('Laura')
>>> pupils['girls']['grade1'].append('Susan')
>>> pupils['girls']['grade1']
['Laura', 'Susan']
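If creating every level by hand gets tedious, the dict method setdefault() can create a missing level on the fly. The helper below, add_pupil, is a hypothetical function written only for this sketch.

```python
pupils = {}

def add_pupil(pupils, group, grade, name):
    # Hypothetical helper for illustration.
    # setdefault() returns the existing value for the key, or stores
    # the given default under that key and returns it.
    grades = pupils.setdefault(group, {})
    grades.setdefault(grade, []).append(name)

add_pupil(pupils, 'boys', 'grade1', 'Bryan')
add_pupil(pupils, 'boys', 'grade1', 'James')
add_pupil(pupils, 'girls', 'grade1', 'Laura')

print(pupils['boys']['grade1'])
print(pupils['girls']['grade1'])
```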

-------------------------------------------------------------------------------------------------------
CONDITIONS PART 2

The following evaluate to true: a non-zero number, a non-empty string, a non-empty list, a non-empty dictionary, or anything else that is non-empty. The snippet below illustrates this point. I define a function that reports whether a value evaluates to true or false.

>>> def trueorFalse(x):
...     if x:
...         return 'True'
...     else:
...         return 'False'
...
>>> trueorFalse(25)
'True'
>>> trueorFalse(-12)
'True'
>>> trueorFalse(2.5)
'True'
>>> trueorFalse('whatever')
'True'
>>> trueorFalse([3, 4])
'True'
>>> trueorFalse({'name':'noel'})
'True'
>>> trueorFalse(trueorFalse)
'True'

The last example above passes the function itself as an argument. Passing a function to another function is valid (and normal) in Python.

The reverse, of course, evaluates to false: a zero value and anything empty.

>>> trueorFalse(0)
'False'
>>> trueorFalse(0.0)
'False'
>>> trueorFalse('')
'False'
>>> trueorFalse([])
'False'
>>> trueorFalse({})
'False'

Be forewarned, though, that passing an empty function (one whose body contains only 'pass') still evaluates to true, because a function object itself always counts as true.
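A short sketch of that last point, together with the usual way these rules are put to work when checking for an empty list. The names do_nothing, names and message are made up for this example.

```python
def do_nothing():
    pass

# The function object is true even though its body is empty...
function_is_true = bool(do_nothing)
# ...but calling it returns None, which is false
result_is_true = bool(do_nothing())

names = []
if not names:    # common test for an empty list
    message = 'no names yet'
else:
    message = '%d names' % len(names)

print('function: %s' % function_is_true)
print('result of call: %s' % result_is_true)
print(message)
```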


-------------------------------------------------------------------------------------------------------
FILE OPERATIONS

To open a text file for reading:

fileInputHandle = open('/path/to/file.txt', 'r')

To open a text file for writing:

fileOutputHandle = open('/path/to/file.txt', 'w')

Note: In Windows, the path separator is a backslash '\', which must be escaped inside a string by placing another backslash before it. For example, if the file is

C:\My Documents\work\humpy.txt

you would write:

fileInputHandle = open('C:\\My Documents\\work\\humpy.txt','r')

Alternatively, precede the string with r to make it a raw string, in which backslashes are taken literally:

fileInputHandle = open(r'C:\My Documents\work\humpy.txt','r')
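Another option, when the path is built up from pieces, is to let the os.path module insert the separator for you. This is only a sketch; the folder names are hypothetical.

```python
import os

# Hypothetical folder and file names, for illustration only
folder = os.path.join('work', 'data')
path = os.path.join(folder, 'humpy.txt')

# os.sep is '\\' on Windows and '/' on Linux and Mac,
# so the same code yields a valid path on each of them
print(path)
print(os.path.basename(path))    # humpy.txt
```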

Here is an example. We have humpy.txt in the current working directory with this content:

Humpty Dumpty sat on a wall,
Humpty Dumpty had a great fall.
All the king's horses and all the king's men
Couldn't put Humpty together again.

>>> fin = open('humpy.txt','r')
>>> r = fin.readlines() # read everything and store in list
>>> r
['Humpty Dumpty sat on a wall,\n', 'Humpty Dumpty had a great fall.\n', "All the king's horses and all the king's men\n", "Couldn't put Humpty together again.\n"]
>>> r.sort()
>>> r
["All the king's horses and all the king's men\n", "Couldn't put Humpty together again.\n", 'Humpty Dumpty had a great fall.\n', 'Humpty Dumpty sat on a wall,\n']
>>> fin.close()

In the above example, the whole file is stored in list 'r', one line per element, with the first line being the first element. Having done that, all the built-in list methods may be applied, like sort(), which rearranges the lines in alphabetical order. You could also iterate through the list using the idiom discussed before. However, unless you need to do sorting, this is not the best approach, as it loads the whole file into memory. The better approach is to process one line at a time.

>>> fin = open('humpy.txt','r')
>>> for r in fin:
...     u = r.replace('Humpty Dumpty','Mr Egg') # do something with line
...     print u
...
Mr Egg sat on a wall,

Mr Egg had a great fall.

All the king's horses and all the king's men

Couldn't put Humpty together again.

>>> fin.close()
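The blank lines in the output above appear because each line read from the file still ends with a newline, and print adds one more. A minimal sketch that strips the trailing newline before printing follows. So that it can run on its own, it first writes a two-line sample file into a temporary folder; that part, and the list 'shown', are only scaffolding for the example.

```python
import os
import tempfile

# Create a small sample file so the example runs on its own
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'humpy.txt')
fout = open(path, 'w')
fout.write('Humpty Dumpty sat on a wall,\nHumpty Dumpty had a great fall.\n')
fout.close()

shown = []
fin = open(path, 'r')
for r in fin:
    # rstrip('\n') removes the newline that came with the line
    u = r.rstrip('\n').replace('Humpty Dumpty', 'Mr Egg')
    shown.append(u)
    print(u)
fin.close()
```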

You may be wondering why you can iterate through a file handle as if it were a list. In C this would be an error, and I know it may not seem to make sense! The reason is that 'fin' is an instance of a file object, which implements the iterator interface. This allows you to iterate through the object line by line as if it were a list. We will discuss iterators later on. In the meantime, just use this idiom. It is easy to remember and less code!

Now, if you want to write the modified version of Humpty Dumpty (where Humpty Dumpty was replaced with Mr Egg) to egg.txt, here is how you do it.

>>> fin = open('humpy.txt','r')
>>> fout = open('egg.txt','w')
>>> for r in fin:
...     u = r.replace('Humpty Dumpty','Mr Egg') # do something with line
...     fout.write(u)
...
>>> fout.close()
>>> fin.close()

Here is the content of egg.txt:

Mr Egg sat on a wall,
Mr Egg had a great fall.
All the king's horses and all the king's men
Couldn't put Humpty together again.
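If your Python is version 2.6 or later, the "with" statement is an alternative that closes the files for you, even when an error occurs halfway through. The sketch below is not the listing used above. It writes its own two-line sample file into a temporary folder so that it can run by itself, and the names src, dst and result are mine.

```python
import os
import tempfile

# Scaffolding: a temporary folder with a small sample input file
folder = tempfile.mkdtemp()
src = os.path.join(folder, 'humpy.txt')
dst = os.path.join(folder, 'egg.txt')

with open(src, 'w') as f:
    f.write('Humpty Dumpty sat on a wall,\nHumpty Dumpty had a great fall.\n')

# Both files are closed automatically when their with blocks end
with open(src, 'r') as fin:
    with open(dst, 'w') as fout:
        for r in fin:
            fout.write(r.replace('Humpty Dumpty', 'Mr Egg'))

with open(dst, 'r') as f:
    result = f.read()
print(result)
```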