Functions must be defined (that is, appear earlier in the script) before they can be called; otherwise you get an error as shown below:
[postgres@postgres-desktop:~]$ python myfunction.py (01-18 00:06)
Traceback (most recent call last):
File "myfunction.py", line 1, in <module>
a = square(10)
NameError: name 'square' is not defined
Below is a listing of myfunction.py
a = square(10)
print a
def square(x):
    return x*x
You use the keyword "def" to define a function. Just like in C, the function arguments go inside the parentheses and are assigned in the same order they are passed. You terminate the "def" line with a colon and then indent the body of the function. A function may or may not return a value, much like in C.
The reason an exception was thrown when the code was run is that the function was defined after it was called, so the interpreter was not yet aware of its existence when the call was made. Remember, Python reads the .py file in sequence, starting at the beginning, interpreting each line and then executing it.
To correct this, place the function definition ahead of its call like this:
def square(x):
    return x*x
a = square(10)
print a
When you run it, the result is as expected:
[postgres@postgres-desktop:~]$ python myfunction.py (01-18 00:18)
100
Python knows the function definition has ended when the indentation ends. One thing about Python is that you can mix function definitions with the main body and it still works fine. In C, you can't do this. However, it can easily confuse you, so avoid it. Below is an example:
# my function.py
y = 10
def square(x):
    return x*x
a = square(y)
print a
Below is the result when you run it:
[postgres@postgres-desktop:~]$ python myfunction.py (01-18 00:23)
100
As you can see, it still runs without error.
You could define an empty function definition like this:
def square(x):
    pass
which is equivalent to this C function
float square(float x)
{
}
DEFAULT VALUES
It is very easy to set default values in function definition. To illustrate:
>>> def whoAmI(name, age, sex='male'):
...     print 'I am ' + name + ', ' + str(age) + ' yrs old, ' + sex
...
>>> whoAmI('John', 33)
I am John, 33 yrs old, male
>>> whoAmI('Ann', 24, 'female')
I am Ann, 24 yrs old, female
In the first function call, only the name and age were passed. Since nothing was supplied for sex (i.e., only 2 arguments were supplied), it used the default value of sex, which is "male".
In the second call, all 3 arguments were supplied and the supplied value for sex was used.
Remember, the order in which the arguments are passed is the order in which they are assigned to the parameters of the receiving function.
Another thing to remember about default values is that they must come at the end; a default argument cannot be followed by a non-default argument, as shown below:
>>> def whoAmI(name, sex='male', age):
...     print 'I am ' + name + ', ' + str(age) + ' yrs old, ' + sex
...
File "<stdin>", line 1
SyntaxError: non-default argument follows default argument
NAMED ARGUMENTS
Use of named arguments is illustrated below:
>>> def whoAmI(name, age, sex):
...     print 'I am ' + name + ', ' + str(age) + ' yrs old, ' + sex
...
>>> whoAmI(age=20, sex='female', name='Mary')
I am Mary, 20 yrs old, female
As you can see, using named arguments when you make a function call enables you to pass the arguments in any order you want and still get the desired result. The added advantages are: 1. It makes your code clearer because it shows which value goes to which parameter right at the call site, and 2. It is easier to handle default values, as illustrated below:
>>> def whoAmI(name, sex='male', age=25):
...     print 'I am ' + name + ', ' + str(age) + ' yrs old, ' + sex
...
>>> whoAmI('Kim', age=44)
I am Kim, 44 yrs old, male
>>> whoAmI(sex='female', name='Mary')
I am Mary, 25 yrs old, female
VARIABLE NUMBER OF ARGUMENTS
Remember the printf function in C? You can pass it any number of arguments and it works fine. Here is how you do it in Python:
>>> def f(*v):
...     for u in v:
...         print u
...
>>> f(10, 20, 30)
10
20
30
>>> f(1234)
1234
>>> f('are','you','human',12.5)
are
you
human
12.5
Notice the asterisk (*) before the variable 'v' in the function argument. It says that this variable will be a tuple holding all the positional arguments that were passed. You access the function arguments by iterating over this tuple.
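To convince yourself that the collected arguments really arrive as a tuple, you can check with type(). A quick sketch (print is written with parentheses here, which works the same way for a single argument):

```python
def f(*v):
    # v is a plain tuple holding every positional argument passed in
    print(type(v) is tuple)
    for u in v:
        print(u)

f(10, 20, 30)
# prints: True, then 10, 20, 30 on separate lines
```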
You could actually combine the positional arguments with variable arguments like this:
>>> def g(positional, *variable):
...     print 'positional argument is: ' + str(positional)
...     print 'variable arguments are: ' + str(variable)
...
>>> g(10, 20, 30, 40)
positional argument is: 10
variable arguments are: (20, 30, 40)
Here is another example:
>>> def h(a, b, c, *d):
...     print 'a=' + str(a)
...     print 'b=' + str(b)
...     print 'c=' + str(c)
...     print 'variable argument d=' + str(d)
...
>>> h(11, 22, 33, 44, 55, 66, 77, 88)
a=11
b=22
c=33
variable argument d=(44, 55, 66, 77, 88)
Positional arguments must come before variable arguments, or there will be an exception.
KEYWORD ARGUMENTS
Here is how you handle a variable number of keyword arguments: you place a double asterisk before the variable name in the function definition, and the keyword arguments are collected into a dictionary. The argument names become the dictionary's string keys.
>>> def f(**k):
...     print k
...
>>> f(a='apple', b='banana')
{'a': 'apple', 'b': 'banana'}
>>> f(x=1, y='yarn',z=2)
{'y': 'yarn', 'x': 1, 'z': 2}
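Since the keyword arguments land in an ordinary dictionary, you can iterate over it like any other dict. A small sketch (sorted() is used only to make the output order predictable):

```python
def f(**k):
    # k is an ordinary dictionary; walk its keys like any other dict
    for name in sorted(k):
        print('%s = %s' % (name, k[name]))

f(a='apple', b='banana')
# prints:
# a = apple
# b = banana
```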
GENERAL FORMAT OF FUNCTION ARGUMENTS
The general format is that you have the positional arguments first, followed by variable arguments, then finally, the variable keyword arguments. Here is an example:
>>> def g(a,b, *args, **kwargs):
...     print 'positional arguments are: a=' + str(a) + ', b=' + str(b)
...     print 'variable arguments args=' + str(args)
...     print 'keyword arguments kwargs=' + str(kwargs)
...
>>> g(10, 20, 30, 40, 50, 60, m='model', c='cat', x=100)
positional arguments are: a=10, b=20
variable arguments args=(30, 40, 50, 60)
keyword arguments kwargs={'x': 100, 'c': 'cat', 'm': 'model'}
FUNCTIONS AS FIRST CLASS OBJECTS IN PYTHON
In Python, functions are treated just like strings or integers in the sense that you can pass them around as if they were data. The examples below illustrate the important points:
>>> def add(x,y):
...     print 'inside add function'
...     return x+y
...
>>> def sub(x,y):
...     print 'inside sub function'
...     return x-y
...
>>> def multiply(x,y):
...     print 'inside multiply function'
...     return x*y
...
>>> z = add # you could assign the function to a variable
>>> z(3,4) # and then invoke it through that variable
inside add function
7
>>> def doThis(func,x,y): # invokes the function passed to this function
...     print 'inside doThis function, about to call another function'
...     return func(x,y)
...
>>> doThis(sub,7,5)
inside doThis function, about to call another function
inside sub function
2
>>> doThis(multiply,4,5)
inside doThis function, about to call another function
inside multiply function
20
You can also create a dictionary of functions and invoke them through it.
>>> switch = {'add':add, 'sub':sub, 'mul':multiply}
>>> switch['add'](4,3)
inside add function
7
>>> x = 'mul'
>>> switch[x](3,5)
inside multiply function
15
So, if you miss C's or VB's 'switch' statement, the above example demonstrates how to handle it elegantly in Python.
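One thing C's switch has that the bare dictionary lookup lacks is a 'default' branch. dict.get covers that too; a sketch reusing the add/sub functions from above, with a made-up fallback function standing in for the default case:

```python
def add(x, y):
    return x + y

def sub(x, y):
    return x - y

def unknown(x, y):
    # stand-in for C's 'default:' branch
    print('no such operation')

switch = {'add': add, 'sub': sub}

# dict.get returns the fallback when the key is missing
print(switch.get('add', unknown)(4, 3))   # 7
switch.get('div', unknown)(4, 3)          # falls through to unknown
```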
Wednesday, January 19, 2011
Monday, January 10, 2011
Using Python to Access OSI Soft's PI Data
If you are not using PI historian (or don't know what it is), move on. This post is not for you.
First of all, for this to work, you will need the PI SDK installed. You probably already have it if you can access PI data from Excel. You then call it via COM. Code snippet below:
import time
from datetime import datetime, timedelta
from win32com.client import Dispatch
tdMax = timedelta(minutes=15)
pisdk = Dispatch('PISDK.PISDK')
myServer = pisdk.Servers('PI_ServerName')
con =Dispatch('PISDKDlg.Connections')
con.Login(myServer,'username','password',1,0)
piTimeStart = Dispatch('PITimeServer.PITimeFormat')
piTimeEnd = Dispatch('PITimeServer.PITimeFormat')
piTimeStart.InputString = '08-25-2009 00:00:00'
piTimeEnd.InputString = '08-25-2009 23:59:59'
sampleAsynchStatus = Dispatch('PISDKCommon.PIAsynchStatus')
samplePoint = myServer.PIPoints['tagname']
sampleValues = samplePoint.Data.Summaries2(piTimeStart,piTimeEnd,'1h',5,0,sampleAsynchStatus)
t0 = datetime.now()
while True:
    try:
        valsum = sampleValues("Average").Value # retrieve the value
        break # get out of the loop when there is no error
    except: # it failed because the server needs more time
        td = datetime.now() - t0
        if td > tdMax:
            print "It is taking too long to get PI data! Exiting now ...."
            exit(1)
        time.sleep(3)
i = 1
while i < valsum.Count+1:
    print valsum(i).Value, valsum(i).ValueAttributes('Percentgood').Value
    i += 1
...
61.4011819388 100.0
61.3148685875 100.0
61.4948477524 100.0
61.2000007629 100.0
61.2000007629 100.0
61.2000007629 100.0
63.8442935978 100.0
64.2453963685 100.0
66.205138361 100.0
71.4747024728 100.0
63.8786754349 100.0
64.3719732268 100.0
64.6296242787 100.0
61.8953031561 100.0
57.0552419993 100.0
57.2999992371 100.0
58.2445955483 100.0
57.5679091366 100.0
59.6997264523 100.0
60.9096459478 100.0
59.7756373596 100.0
55.1742569927 100.0
59.7151307987 100.0
For my officemates: that is just too much work, right? Actually, if you write the application in VB, your code will look just like that. For this reason, I wrote a convenience class to hide all those warts and simplify everything, as follows:
import sys
sys.path.append(r'M:\EMS\pyhomebrew')
from osipi import osipi
# with time stamp
b = osipi('username','password') # username, then password, and optionally, PI server
for x, y in b.getAverage('pi-tag here','04-01-2010 00:00','04-01-2010 08:00','1h',True):
    print x, y # x is time, y is value
# without time stamp
b = osipi('username','password')
for x in b.getAverage('pi-tag here','04-01-2010 00:00','04-01-2010 08:00','1h'):
    print x
# to get the compressed values
for x, y in b.getCompressed('pi-tag here','04-01-2010 00:00','04-01-2010 00:01',True):
    print x, y # x is time, y is value
Why not use Excel? Use Excel when it is appropriate, but there are instances when it is not. For one, Excel limits the number of rows, so if you go back too far in time, Excel can choke. I was told that the latest version of Excel has virtually no row limit - not true! I dumped the OAG database via rio and it could not read the whole thing!
Thursday, January 6, 2011
File Handling
Here are common problems and how they are handled in Python. When I have a custom-made module that makes a task easier, I label it "Using my custom module". This custom module is only available to my office mates.
DELETING ALL FILES AND SUBFOLDERS IN A DIRECTORY named directoryName
import shutil
shutil.rmtree('directoryName')
DELETING A SINGLE FILE
import os
if os.path.exists('nameOfFileToDeleteIncludingCompletePath'):
    os.remove('nameOfFileToDeleteIncludingCompletePath')
You need to check first that the file exists before deleting it, or os.remove may throw an exception. Alternatively, you could do it like below and accomplish the same thing.
import os
try:
    os.remove('nameOfFileToDeleteIncludingCompletePath')
except:
    pass
COPYING FILES
Though this could also be done indirectly through a system call (see last topic of this post), python has a module called shutil that handles this.
import shutil
shutil.copy('sourcefileName','destination')
'destination' could either be another filename or a directory.
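Both destination forms can be demonstrated in a self-contained way with tempfile, so nothing on your disk is touched. This sketch copies a throwaway source file first into a directory, then to an explicit new filename:

```python
import os
import shutil
import tempfile

srcdir = tempfile.mkdtemp()
dstdir = tempfile.mkdtemp()
src = os.path.join(srcdir, 'source.txt')
open(src, 'w').write('hello')

# destination is a directory: the copy keeps its original name inside it
shutil.copy(src, dstdir)
print(os.path.exists(os.path.join(dstdir, 'source.txt')))   # True

# destination is a filename: the copy gets the new name
shutil.copy(src, os.path.join(dstdir, 'renamed.txt'))
print(os.path.exists(os.path.join(dstdir, 'renamed.txt')))  # True
```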
GETTING THE FILES WHOSE NAMES FOLLOW A PATTERN
You can use the same pattern-matching rules as when you list a directory on the command line. For example, when you want the files that start with "abc", followed by anything, plus a .txt extension, you would do "dir abc*.txt". To get the list of these files in Python:
import glob
fileNameList = glob.glob('abc*.txt')
That is, you supply "glob" with the same patterns you would supply to the "dir" or "ls" command. If the files are not in the current working directory, you do:
import glob
fileNameList = glob.glob('\\path\\to\\file\\abc*.txt')
GETTING THE LATEST FILE IN A DIRECTORY
You most probably would want to do this in conjunction with the file matching capability above. First of all, some background. You can get the time of most recent modification of a file by doing this:
import os
fileAge = os.stat('filename')[8]
os.stat actually gives you the file attributes in a tuple, and element 8 (counting from 0) is the time of most recent modification in seconds since the epoch. The larger this number, the more recent it is.
However, oftentimes you have a group of files and you want to use the latest. For example, your files may look like alarm_235.txt, alarm_111.txt, alarm_241.txt, and so on. I would do it this way:
import os
import glob
candidateFiles = glob.glob('\\path\\to\\alarm_*.txt')
fileModificationDate = [os.stat(f)[8] for f in candidateFiles]
latestFile = candidateFiles[fileModificationDate.index(max(fileModificationDate))]
"fileModificationDate" is the list containing the "modification date" of the files and in same order as the corresponding files in candidateFiles - i.e. - the first element in fileModificationDate the modification date of the first element in candidateFiles. Therefore, if the largest number (a.k.a. most recent) in fileModificationDate is the 3rd element, then, the corresponding file is the 3rd element in candidateFiles.
Now, max(fileModificationDate) gives you the largest number in the fileModificationDate list, and fileModificationDate.index(max(fileModificationDate)) gives you the index of this largest number. This index, used in candidateFiles, therefore gives you the name of the latest file.
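The same index/max dance can be collapsed into a single call: max() takes a key function, and os.path.getmtime returns exactly the modification time used above. A sketch with the same result:

```python
import glob
import os

candidateFiles = glob.glob('*.txt')
if candidateFiles:
    # max() with a key function picks the file with the largest mtime directly
    latestFile = max(candidateFiles, key=os.path.getmtime)
    print(latestFile)
```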
For my officemates: I use this quite a lot, so I created a module for it. It gives you a list with the 1st element being the latest file, the 2nd element the next latest, and so on. Here is how you use it:
import sys
sys.path.append('M:\\EMS\\pyhomebrew')
from qFileUtils2 import getLatestFiles
latestFile = getLatestFiles('filePattern')[0]
filePattern is the same argument that you supply to glob.
If you want the top 3 latest files:
import sys
sys.path.append('M:\\EMS\\pyhomebrew')
from qFileUtils2 import getLatestFiles
latestFiles = getLatestFiles('filePattern', 3)
The latest file would be latestFiles[0], followed by latestFiles[1], then latestFiles[2]. Notice that getLatestFiles may take 1 or 2 arguments. If you supply only 1 argument, it assumes you want the top 2. In other words, these two are identical:
getLatestFiles('filePattern')
getLatestFiles('filePattern', 2)
CHECKING IF A FILE IS UPDATED
Oftentimes you run a script that uses a data file, and you forget to update the data file. So it used the old data, and all along you thought you were done. It would be nice if the script could check the modification time, compare it to the current time, and warn you that the data file has not been updated. For example, you might have a new database release every 2 weeks. If the time difference between now and when the data file was last modified is longer than 10 hours, the data file is probably old: I would typically finish preparing the data file in the morning and run the script in the afternoon, an interval of certainly less than 10 hours.
This is how you do it:
import os
import time
ageInHours = 10 # let us use 10 hrs for now
t1 = os.path.getmtime(fileName)
t2 = time.time()
ageInSeconds = 3600 * ageInHours
if t2-t1 > ageInSeconds: # file is older than 10 hrs
    print 'data file is not updated!'
else:
    print 'data file is updated!'
For my officemates: I have a module that does this - the function isUpdated.
import sys
sys.path.append('M:\\EMS\\pyhomebrew')
from qFileUtils2 import isUpdated
if not isUpdated('dataFile'):
    print 'Your data file is not updated. Exiting now...'
    exit(1)
# continue with your script
isUpdated may take 2 arguments like this:
isUpdated('filename',48)
where 48 hours is used as basis for saying that the file is not updated; 10 hours is the default if you don't supply the 2nd argument.
RUNNING SYSTEM COMMANDS
If you want to run system command from python:
import os
os.system('notepad test.txt') # runs notepad and open test.txt
So, whatever you would type on the command line, you simply place inside the parentheses. The only downside is that the command's output is not returned to your script; only the return status is (0 if no errors were encountered, nonzero otherwise). For example, "dir *.*" will not return the names of the files in the current working directory. It will just tell you, via the return value of zero, that the command executed and terminated normally.
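If you do need the command's output back in your script, os.popen runs the command and hands you its standard output as a file-like object. A minimal sketch using echo, which exists on both Windows and Unix shells:

```python
import os

# os.system only gives you the return status; os.popen gives you the text
output = os.popen('echo hello').read()
print(output.strip())   # hello
```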
How do you pass an environment variable to the system command? Like this:
import os
os.environ['PYTHONPATH'] = r'D:\path\to\other\modules'
os.system('python yourScript.py') # this invokes another python script; PYTHONPATH is passed
Take note that changing the environment variable this way is not permanent; it is unset again after your script terminates.
Tuesday, January 4, 2011
Exception Handling
If something happens and Python cannot go on, it throws an exception - for instance, when you try to open a non-existent file. You can either do extra processing when the exception happens or ignore it. I won't go into too much detail here and will discuss only what I normally do. Below is how you handle an exception: you place the code inside a try-except block.
try:
    # statements that could potentially throw an exception
except:
    # Now, you have an exception. What do you want to do?
There are instances where you don't want the script to quit when it encounters an exception, but just move on. In this case, place a 'pass' statement in the except clause, like this:
try:
    # statements that could potentially throw an exception
except:
    pass
If there are specific exceptions you wish to catch, name them in the except statement as shown below. Without a qualifier, the except clause catches everything.
>>> s = 'abc'
>>> b = float(s)
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
ValueError: invalid literal for float(): abc
In the example above, we want to convert a string value to a floating point number. Since string s does not contain a number, it throws a ValueError exception. You can handle this exception as shown below. As you can see, the ValueError exception was caught and the code inside the except clause was executed. Since the exception was caught in your code, the script ran to completion and no 'Traceback' was shown.
>>> try:
...     b = float(s)
... except ValueError:
...     print 's is not a floating point string'
...
s is not a floating point string
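You can also bind the caught exception to a name with 'as' and look at the interpreter's own description of what went wrong. A small sketch:

```python
s = 'abc'
try:
    b = float(s)
except ValueError as e:
    # e carries the interpreter's own error message
    print('conversion failed: %s' % e)
```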
If I have a script that runs via a scheduler, I place statements in the except clause to send me an e-mail telling me what went wrong. That way, I am notified in real time and need not pore over lengthy logs. Below is how I did it.
import smtplib
def emailError(errMsg):
    host = 'mailServerName'
    s = smtplib.SMTP(host)
    recepients = ['emailAddress1','emailAddress2','emailAddres3']
    msg = "Subject: Application1 Had some Problems !!!\n\n" + errMsg
    s.sendmail('dummyReplyEmailAddress',recepients,msg)
    exit(1)
try:
    fin = file("D:\\somewhere\\sysgenid","r")
except:
    emailError("Can't open sysgenid file!\nWas it deleted?\n")
Another area where I find exception handling useful is when you have a backup and want to shift to it when the primary resource is down. Say you have an Oracle server Ora1 which is your primary source of data, a backup server Ora2, and a third backup Ora3. This is how I use exception handling to take advantage of this redundancy:
import cx_Oracle
try:
    connection = cx_Oracle.connect('uid/pwd@Ora1')
except:
    try:
        connection = cx_Oracle.connect('uid/pwd@Ora2')
    except:
        try:
            connection = cx_Oracle.connect('uid/pwd@Ora3')
        except:
            print 'All Oracle servers are down!!! Exiting now ...'
            exit(1)
cur = connection.cursor()
cur.execute("Select * from userTable")
Now, if at least one Oracle server is up, the script continues without any problems. Better yet, use e-mail to notify yourself when any of the Oracle servers is down, as in the previous example.
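The nested try blocks grow quickly if you add a fourth server. The same failover can be written as a loop over the connection strings. This sketch uses a stand-in connect function, since cx_Oracle needs a real server; connect_first_available and fake_connect are my own names:

```python
def connect_first_available(connect, targets):
    # Try each target in order; return the first successful connection
    for target in targets:
        try:
            return connect(target)
        except Exception:
            continue  # this server is down, try the next one
    raise RuntimeError('All servers are down!')

# Stand-in for cx_Oracle.connect: only 'Ora2' is "up" in this demo
def fake_connect(dsn):
    if 'Ora2' not in dsn:
        raise IOError('server down')
    return 'connected to ' + dsn

conn = connect_first_available(fake_connect,
                               ['uid/pwd@Ora1', 'uid/pwd@Ora2', 'uid/pwd@Ora3'])
print(conn)  # connected to uid/pwd@Ora2
```

The loop version keeps the error handling in one place, no matter how many backups you add.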
Modules and PYTHONPATH
What makes python so useful is the wealth of modules written to simplify the tasks at hand. You just 'import' these modules and you are ready to go. Normally, the import statements are at the start of your script. The previous post showed you a module which allowed you to access Windows COM objects.
Normally, if I need to do something, I just google it on the internet. Let me give you a concrete example. There was a time when I wanted to access an Oracle database using python. I typed "python Oracle" into the google search engine and this is the result:
-----------------------------------------
Python - Oracle Wiki
2 posts - 1 author - Last post: 3 Feb 2010 Getting Started: #!/usr/bin/python import cx_Oracle connstr='scott/tiger' conn = cx_Oracle.connect(connstr) curs = conn.cursor() ... wiki.oracle.com/page/Python
cx_Oracle
cx_Oracle is a Python extension module that allows access to Oracle databases and conforms to the ... Windows amd64 Installer (Oracle 10g, Python 2.5) ... cx-oracle.sourceforge.net/
Using Python With Oracle Database 11g
With the rise of Frameworks, Python is also becoming common for Web application development. If you want to use Python and an Oracle database, this tutorial ... www.oracle.com/technetwork/articles/dsl/python-091105.html |
Python - Oracle FAQ
5 Nov 2010 ... cx_Oracle is a Python extension module that allows access to Oracle databases and conforms to the Python database API specification. ... www.orafaq.com/wiki/Python
-------------------------------------------------------------------
#!/usr/bin/python
import cx_Oracle
connstr='scott/tiger'
conn = cx_Oracle.connect(connstr)
curs = conn.cursor()
curs.arraysize=50
curs.execute('select 2+2 "aaa" ,3*3 from dual')
print curs.description
print curs.fetchone()
conn.close()
Resources:
- Python Programming Language - Official Website
- cx_Oracle module for accessing Oracle Database from Python (DB API 2.0 conformant, updated for 11g)
- Pydev, a superb IDE for Python built on top of Eclipse
- Mod_python, Apache/Python Integration, introducing Python Server Pages technology
- Python Package Index - official repository for 3rd party modules
- comp.lang.python - active, helpful community of Python experts
- Essbase.py - port of Essbase perl module to
-----------------------------------------------------------------------------
This tells me that I need the cx_Oracle module (because of the 'import' statement at the start). So I checked via the python interpreter whether I have this module.
>>> import cx_Oracle
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named cx_Oracle
Ok, I don't have it; I need to get this module and install it. Again, the first site that I clicked has a link to cx_Oracle under "Resources" (please see above). So, I clicked on it and saw several Windows installers for different versions of Oracle. I selected the one compatible with our Oracle server and installed it. Now, importing that module from the python interpreter no longer issues any error. Next, I need more documentation.
Going back to the google result, I selected "Python - Oracle FAQ" and saw these code snippets:
---------------------------------------------------------------------------
python
>>> import cx_Oracle
>>> connection = cx_Oracle.connect('scott/tiger@orcl')
>>> # do some stuff here
>>> connection.close()

connection = cx_Oracle.connect("uid/pwd@database")
cursor = connection.cursor()
cursor.execute("SELECT COUNT(*) FROM User_Tables")
count = cursor.fetchall()[0][0]
cursor.close()
connection.close()

connection = cx_Oracle.connect("uid/pwd@database")
cursor = connection.cursor()
cursor.arraysize = 50
cursor.execute("""
        select Col1, Col2, Col3
        from SomeTable
        where Col4 = :arg_1
        and Col5 between :arg_2 and :arg_3""",
        arg_1 = "VALUE",
        arg_2 = 5,
        arg_3 = 15)
for column_1, column_2, column_3 in cursor.fetchall():
    print "Values from DB:", column_1, column_2, column_3
cursor.close()
connection.close()
--------------------------------------------------
Well, there are a lot more but you get the idea. If you don't know SQL, these code snippets may not be enough for you. Fortunately, in my case, I know SQL and these snippets are enough to keep me going.
Now, if you don't have administrative rights, you will not be able to install the module at the default location where python expects to find it. To solve this problem, you have to include the installation folder in your PYTHONPATH so that python can find it. To change your PYTHONPATH, you could either convince your administrator to append the installation folder to this environment variable (good luck with that!) or dynamically change the search path in your script as follows:
import sys
sys.path.append(r'/path/to/cx_Oracle')
import cx_Oracle
sys.path is actually a list containing the folders where python searches for modules.
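The effect of appending to sys.path can be verified directly: drop a module file into a temporary folder, append that folder to the path, and the import succeeds. A self-contained sketch (mymodule and its contents are invented for the demo):

```python
import os
import sys
import tempfile

# Create a throwaway folder containing a tiny module
folder = tempfile.mkdtemp()
with open(os.path.join(folder, 'mymodule.py'), 'w') as f:
    f.write('answer = 42\n')

# Without the append, python would not find it; with it, the import works
sys.path.append(folder)
import mymodule
print(mymodule.answer)  # 42
```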
Sunday, January 2, 2011
Manipulating Excel Sheets using Python
There are 2 ways to handle excel sheets using python:
1. Using the xlrd and xlwt
2. Using windows COM object
The first method is very fast and does not need the Excel application to be installed on the target machine. However, the drawback is that you can only read an existing spreadsheet or write a NEW spreadsheet; you can't modify an existing one. If you really want to modify one, open it for reading, write your modifications to a temporary file, then overwrite the original file with the temporary file. I don't intend to cover this but here is a tutorial on how to use this module for those who are interested.
The second method is slower and you need to have the Excel application installed. But then, you have much more flexibility, and modifying an existing spreadsheet is possible. I will discuss only this method.
Below is a code snippet on how you do it.
# import the module that you need for handling COM objects
from win32com.client import Dispatch
# Open a workbook for a given excel file
wb = Dispatch('Excel.Application').Workbooks.Open(filename)
# Open a worksheet from that workbook
ws = wb.Sheets(sheetName)
# read a cell value
cellValue = ws.Cells(row,col).Text
# write to a cell
ws.Cells(row,col).Value = newValue
The succeeding discussions are for my officemates only. However, you are welcome to read it to get some idea and perhaps roll out your own convenience class.
As you can see, the syntax is identical to that in VB. Actually, all spreadsheet manipulation that you can do in VB you can also do in python, and the syntax is exactly the same! To me, it looks ugly, so I decided to create my own convenience class that hides the ugliness. It works like this:
# import the convenience class. It should be in PYTHONPATH
import sys
sys.path.append(r'M:\EMS\pyhomebrew')
from dxls import dxls
# open a workbook
wb = dxls(filename)
# open a worksheet from that workbook
ws = wb.ws(sheetName)
# read a cell value
cellValue = ws.read(row,col)
# write to a cell
ws.write(row,col,newValue)
# close the workbook and save changes
wb.close()
I wrote this class even before I learned enough to know that there is a convention to capitalize the first letter of the class name. Now, I have used this a lot in my other scripts and correcting it would break them.
I also added some convenience feature like instead of:
cellValue = ws.read(row,10)
you could just say:
cellValue = ws.read(row,'j') # cell column is 'J' which is the 10th col
Or if you have a table and table heading for this 10th column is 'Address'
cellValue = ws.read(row,'Address')
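Under the hood, translating a column letter to a column number is simple arithmetic on the character code. This is my own sketch of the idea, not the actual dxls code, and it handles single letters only (multi-letter columns like 'AA' would need a loop):

```python
def col_number(col):
    # Pass an int through unchanged; map a single letter 'a'..'z' to 1..26
    if isinstance(col, int):
        return col
    return ord(col.lower()) - ord('a') + 1

print(col_number('j'))  # 10, so 'j' and 10 address the same column
print(col_number('A'))  # 1
print(col_number(7))    # 7
```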
If you have row headings, you can use those as well. All of this is done automatically for you with no extra coding on your part. The only requirement is that the table on your worksheet is well behaved, which means:
1. Table has only 2 lines of title.
2. After table column headings, values immediately follow - i.e., no blank line
3. The next 4 lines after the table column headings have no blanks
It also detects the table boundaries: ws.minRow, ws.maxRow, ws.minCol and ws.maxCol
To iterate through the entire table,
for row in xrange(ws.minRow,ws.maxRow):
for col in xrange(ws.minCol, ws.maxCol):
# do something like multiply the cell contents by 2
# and then write the result on the same row but 100
# columns away from current cell position
ws.write(row,col+100,ws.read(row,col)*2)
As you could see, there is no need to continually check for an empty cell which marks the end of the table - the convenience class has already done it for you.
I mentioned 2 new concepts here: module and PYTHONPATH. That will be the next topic.
Saturday, January 1, 2011
List Comprehension
Suppose you have a list of integers and you want to create another list whose elements are twice those of the first list. How would you do it? Here are typical ways of doing it.
>>> a = [10, 20, 30, 40]
>>> b = []
>>> for i in a:
... b.append(2*i)
...
>>> b
[20, 40, 60, 80]
Those new to python may not easily understand what is going on in the above example; perhaps they would approach it like below:
>>> c = []
>>> for i in range(len(a)):
... c[i] = 2*a[i]
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
IndexError: list assignment index out of range
The reason you get an error above is that list c has no elements, yet you are trying to assign to its ith element (there is no ith element). For this approach to work, create a list c with the same number of elements as list a. The easiest way is to copy list a into list c using the idiom below:
>>> c = a[:]
And then, proceed as usual.
>>> for i in range(len(a)):
... c[i] = 2*a[i]
...
>>> c
[20, 40, 60, 80]
If you used
c = a
instead of
c = a[:]
c and a would be referring to the same list, and changing 'c' changes 'a' as well. On the other hand, if you use a slice, you get a copy. Recall that leaving the left side of ':' blank means all the way from the start, and leaving the right side blank means all the way to the end.
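The aliasing behaviour described above is easy to demonstrate: mutate the list through one name and watch the other. A short sketch:

```python
a = [10, 20, 30, 40]
c = a          # c is just another name for the same list
c[0] = 999
print(a[0])    # 999 - changing c changed a as well

a = [10, 20, 30, 40]
c = a[:]       # a slice copies the whole list
c[0] = 999
print(a[0])    # 10 - the original is untouched
```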
Now, if you used list comprehension, the code would be:
>>> a = [10, 20, 30, 40]
>>> b = [2*x for x in a]
>>> b
[20, 40, 60, 80]
The way you read this is: pick the first element in list 'a', assign it to x, multiply it by 2, and that becomes the first element of the new list. Then pick the second element in 'a', assign it to x, multiply it by 2, and that becomes the second element, and so on. It is pretty much like:
for x in a:
    2*x
except that the result of each iteration automatically becomes an element of a new list. Not only is a list comprehension very compact - it is also fast compared to the ordinary 'for' loop. In the 'for' loop, each line is interpreted as many times as there are elements in the list; in a list comprehension, the expression is interpreted only once regardless of the list size.
If you want a new array whose elements are the squares of the elements of another array, it will be like this:
>>> a
[10, 20, 30, 40]
>>> squared = [x*x for x in a]
>>> squared
[100, 400, 900, 1600]
I guess you now know what is going on. Instead of 2*x or x*x, you can also use a function, like this:
>>> def square(u):
... return u*u
...
>>> a
[10, 20, 30, 40]
>>> c = [square(x) for x in a]
>>> c
[100, 400, 900, 1600]
You could also use conditions. For example, you might want the square of each element only if the element is greater than 20.
>>> a
[10, 20, 30, 40]
>>> d = [square(x) for x in a if x > 20]
>>> d
[900, 1600]
Only 30 and 40 are greater than 20, so you only have 2 elements in the new array.
The elements of the new array need not be a function of the elements of the old array at all. For example, I could do something like this:
>>> a
[10, 20, 30, 40]
>>> divisibleBy20 = [1 for x in a if x % 20 == 0]
>>> divisibleBy20
[1, 1]
The above code puts a 1 in the new array for each element of the old array that is exactly divisible by 20 (i.e., the modulo is zero). Only 2 elements are exactly divisible by 20, which is why the new array has only 2 elements.
Suppose you want to know if a string has 'abc' or 'xyz' in it - what would you do?
>>> s = 'Do you know your abcd?'
>>> if 'abc' in s or 'xyz' in s:
... print 'It is there!'
...
It is there!
Now, suppose you are given a list of strings to search instead of just 2? Maybe you would do it this way:
>>> s
'Do you know your abcd?'
>>> searchThis = ['abc','xyz','uuu']
>>> found = False
>>> for x in searchThis:
...     if x in s:
...         found = True
...         break
...
>>> if found:
...     print 'found it!'
... else:
...     print 'It is not there'
found it!
Well, there is a better way - use for-else:
>>> s
'Do you know your abcd?'
>>> for x in searchThis:
...     if x in s:
...         print 'found it!'
...         break
... else:
...     print 'It is not there'
...
found it!
Please note that the 'else' clause is NOT for the 'if' statement. It is for the 'for' statement, and it executes only if the loop exhausts the list without hitting a break statement.
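The two outcomes of the for-else can be seen side by side by wrapping the loop in a small function (search is my own name, and it returns the message instead of printing so both runs are easy to compare):

```python
def search(items, target):
    # The else runs only when the loop finishes without a break
    for x in items:
        if x == target:
            result = 'found it!'
            break
    else:
        result = 'It is not there'
    return result

print(search([1, 2, 3], 2))   # found it!
print(search([1, 2, 3], 9))   # It is not there
```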
I think I have found a much better way using list comprehension:
>>> if sum([1 for x in searchThis if x in s]) > 0:
...     print 'found it!'
... else:
...     print 'It is not there'
...
found it!
To understand what is going on, let us see what we have inside the sum function.
>>> [1 for x in searchThis if x in s]
[1]
It assigns each value from searchThis to x and, whenever that substring is found in 's', places a 1 in the new array. So, if none of the elements of searchThis matches, the array will be empty. Now, the sum function adds all the elements of a list: for an empty list the sum is zero, and it is greater than zero otherwise (there is a match).
I just found out that there is an even better way of doing the above using "any". It is not really a list comprehension, but the syntax is very similar.
>>> s = 'Do you know your abcd?'
>>> searchThis = ['abc','xyz','uuu']
>>> any(x in s for x in searchThis)
True
"any" tests whether at least one element of a list satisfies a condition, and bails out (i.e., does not continue testing the rest of the list) once it finds one. Here is another example of "any". It checks if any element of the list has a square greater than 78.
>>> lx = [3, 4, 5]
>>> any(x*x > 78 for x in lx)
False
>>> lx = [3, 4, 10]
>>> any(x*x > 78 for x in lx)
True
Another useful construct is "all" - it checks whether all elements of a list satisfy the condition. The example below checks if all elements have a square larger than 78.
>>> lx = [10, 11, 12]
>>> all(x*x > 78 for x in lx)
True
>>> lx = [3, 4, 10]
>>> all(x*x > 78 for x in lx)
False
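That "any" really bails out early can be shown by counting how many elements it actually inspects; with a generator expression, evaluation stops at the first hit. The checked list below is my own instrumentation:

```python
checked = []

def big_square(x):
    # Record every element we actually test
    checked.append(x)
    return x * x > 78

lx = [3, 10, 4, 5]
print(any(big_square(x) for x in lx))  # True
print(checked)  # [3, 10] - the elements after the first hit were never tested
```

Note that this short-circuiting happens because we passed a generator expression; a list comprehension inside any() would evaluate every element first.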
Another common use that I find for list comprehensions is trimming spaces from the elements of a list. If you use the Areva ODBC driver to extract data from the database, the returned values are padded with spaces to fill the exact size allocated for the data, like this.
>>> substn = ['74S ','102S ','286S ']
>>> mySubstn = [x.strip() for x in substn]
>>> mySubstn
['74S', '102S', '286S']
In my later posts, I will discuss how to use python to extract data from within Habitat.