Saturday, January 1, 2011

List Comprehension

Suppose you have a list of integers and you want to create another list whose elements are twice those of the first list. How would you do it? Here is a typical way of doing it.


>>> a = [10, 20, 30, 40]
>>> b = []
>>> for i in a:
...     b.append(2*i)
...
>>> b
[20, 40, 60, 80]



Those new to Python may not find the example above obvious; perhaps they would approach it like this instead:

>>> c = []
>>> for i in range(len(a)):
...     c[i] = 2*a[i]
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
IndexError: list assignment index out of range



The error occurs because list c has no elements, yet the code tries to assign a new value to its ith element, which does not exist.

For this approach to work, list c must have the same number of elements as list a. The easiest way is to copy list a into list c using the idiom below:

>>> c = a[:]


And then, proceed as usual.

>>> for i in range(len(a)):
...     c[i] = 2*a[i]
...
>>> c
[20, 40, 60, 80]
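
If copying 'a' feels like a roundabout way to get a list of the right length, another common option is to pre-fill the list with a placeholder value. A minimal sketch, where 0 is an arbitrarily chosen placeholder and not something the approach depends on:

```python
a = [10, 20, 30, 40]

# Build a list with as many slots as a; 0 is only a placeholder value.
c = [0] * len(a)

for i in range(len(a)):
    c[i] = 2 * a[i]   # every slot already exists, so the assignment works

print(c)   # [20, 40, 60, 80]
```

Either way, the point is the same: the slot has to exist before you can assign to it by index.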



If you used


c = a


instead of

c = a[:]


c and a would refer to the same list, so changing 'c' changes 'a' as well. With the slice, on the other hand, you get a copy. Recall that leaving the left side of ':' blank means all the way from the start, and leaving the right side blank means all the way to the end.
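
The difference between the two assignments can be seen in a short sketch. The names 'alias' and 'copied' are made up for this illustration:

```python
a = [10, 20, 30, 40]

alias = a       # a second name for the same list object
copied = a[:]   # a new list holding the same elements

alias[0] = 99   # this change is visible through a as well
copied[1] = -1  # this change leaves a untouched

print(a)             # [99, 20, 30, 40]
print(copied)        # [10, -1, 30, 40]
print(alias is a)    # True
print(copied is a)   # False
```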

Now, if you used a list comprehension, the code would be:

>>> a = [10, 20, 30, 40]
>>> b = [2*x for x in a]
>>> b
[20, 40, 60, 80]

The way to read this is: pick the first element of list 'a', assign it to x, multiply it by 2, and the result is the first element of the new list. Then pick the second element of 'a', assign it to x, multiply it by 2, and the result is the second element, and so on. It is pretty much like:

for x in a:
    2*x

except that the result of each iteration automatically becomes an element of a new list. Not only is a list comprehension very compact, it is usually also faster than the equivalent 'for' loop. In CPython, the loop has to look up and call b.append on every pass, whereas the comprehension appends each result directly as part of building the list.
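
One way to check the speed difference on your own machine is the standard timeit module. This is only a rough sketch: the function names with_loop and with_comprehension and the list size are made up for illustration, and the timings will differ between machines and runs.

```python
import timeit

a = list(range(1000))

def with_loop():
    # Ordinary for loop with an explicit append.
    b = []
    for x in a:
        b.append(2 * x)
    return b

def with_comprehension():
    # Same result built with a list comprehension.
    return [2 * x for x in a]

# Both approaches build the same list.
print(with_loop() == with_comprehension())   # True

# Total seconds for 1000 calls of each function.
print(timeit.timeit(with_loop, number=1000))
print(timeit.timeit(with_comprehension, number=1000))
```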

If you want a new list whose elements are the squares of the elements of another list, it will look like this:

>>> a
[10, 20, 30, 40]
>>> squared = [x*x for x in a]
>>> squared
[100, 400, 900, 1600]

I guess you now know what is going on. Instead of 2*x or x*x, you can also call a function, like this:

>>> def square(u):
...     return u*u
...

>>> a
[10, 20, 30, 40]
>>> c = [square(x) for x in a]
>>> c
[100, 400, 900, 1600]

You could also use conditions. For example, you might want the square of each element only if the element is greater than 20.

>>> a
[10, 20, 30, 40]
>>> d = [square(x) for x in a if x > 20]
>>> d
[900, 1600]

Only 30 and 40 are greater than 20, so you only have 2 elements in the new list.

The elements of the new list need not be a function of the elements of the old list at all. For example, I could do something like this:
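
For comparison, the filtered comprehension behaves like a 'for' loop with an 'if' inside it. A small sketch, where d_loop is a name made up for the loop version:

```python
def square(u):
    return u * u

a = [10, 20, 30, 40]

# Loop version: append only when the condition holds.
d_loop = []
for x in a:
    if x > 20:
        d_loop.append(square(x))

# Comprehension version: the 'if' at the end does the filtering.
d = [square(x) for x in a if x > 20]

print(d)             # [900, 1600]
print(d == d_loop)   # True
```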

>>> a
[10, 20, 30, 40]
>>> divisibleBy20 = [1 for x in a if x % 20 == 0]
>>> divisibleBy20
[1, 1]

The above code puts a 1 in the new list for each element of the old list that is exactly divisible by 20 (i.e., the remainder is zero). Only 2 elements are exactly divisible by 20, which is why the new list has only 2 elements, each equal to 1.

Suppose you want to know if a string has 'abc' or 'xyz' in it. What would you do?
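
A list of 1s like this is handy for counting, because adding them up gives the number of matching elements. A short sketch, where 'count' and 'same_count' are names made up for the illustration:

```python
a = [10, 20, 30, 40]

divisibleBy20 = [1 for x in a if x % 20 == 0]

# Each match contributed a 1, so the sum is the number of matches.
count = sum(divisibleBy20)
print(count)        # 2

# Taking the length of the filtered list gives the same number.
same_count = len([x for x in a if x % 20 == 0])
print(same_count)   # 2
```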

>>> s = 'Do you know your abcd?'
>>> if 'abc' in s or 'xyz' in s:
...     print('It is there!')
...
It is there!

Now, suppose you are given a whole list of strings to search for instead of just 2. Maybe you would do it this way:

>>> s
'Do you know your abcd?'
>>> searchThis = ['abc','xyz','uuu']
>>> found = False
>>> for x in searchThis:
...     if x in s:
...         found = True
...         break
...
>>> if found:
...     print('found it!')
... else:
...     print('It is not there')
...
found it!

Well, there is a better way, which is to use for-else:

>>> s
'Do you know your abcd?'
>>> for x in searchThis:
...     if x in s:
...         print('found it!')
...         break
... else:
...     print('It is not there')
...
found it!

Please note that the 'else' clause does NOT belong to the 'if' statement. It belongs to the 'for' statement and is executed only if the loop exhausts the list without hitting a break statement.
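
To see the 'else' branch actually run, the same loop can be wrapped in a function and tried with lists that do not match. This is just a sketch, and 'report' is a hypothetical helper name:

```python
def report(s, searchThis):
    for x in searchThis:
        if x in s:
            result = 'found it!'
            break
    else:
        # Reached only when the loop ended without a break.
        result = 'It is not there'
    return result

s = 'Do you know your abcd?'
print(report(s, ['abc', 'xyz', 'uuu']))   # found it!
print(report(s, ['xyz', 'uuu']))          # It is not there
print(report(s, []))                      # It is not there
```

Note the last call: with an empty list the loop body never runs, so there is no break and the 'else' clause is executed.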

I think I have found a much better way using list comprehension:

>>> if sum([1 for x in searchThis if x in s]) > 0:
...     print('found it!')
... else:
...     print('It is not there')
...
found it!

To understand what is going on, let us see what we have inside the sum function.

>>> [1 for x in searchThis if x in s]
[1]

It assigns each value from searchThis to x, and whenever that value is found as a substring of 's', it places a 1 in the new list. So, if none of the elements of searchThis match, the list is empty. The sum function adds up all the elements of a list: for an empty list the sum is zero, and if there is at least one match it is greater than zero.

I just found out that there is an even better way of doing the above, using "any". It is not really a list comprehension (the argument is a generator expression), but the syntax is very similar.

>>> s = 'Do you know your abcd?'
>>> searchThis = ['abc','xyz','uuu']
>>> any(x in s for x in searchThis)
True

"any" tests whether at least one element of a list satisfies a condition and bails out (i.e., does not continue testing the rest of the list) once it finds one. Here is another example of "any". It checks if any element of the list has a square greater than 78.

>>> lx = [3, 4, 5]
>>> any(x*x > 78 for x in lx)
False
>>> lx = [3, 4, 10]
>>> any(x*x > 78 for x in lx)
True
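
The bail-out behaviour can be made visible by recording which elements actually get tested. In this sketch, 'is_big' and 'checked' are names made up for the illustration:

```python
checked = []   # records every element that actually gets tested

def is_big(x):
    checked.append(x)
    return x * x > 78

lx = [3, 10, 4, 5]

# Generator expression: any() stops at the first True result.
found_lazy = any(is_big(x) for x in lx)
tested_lazy = list(checked)
print(found_lazy)     # True
print(tested_lazy)    # [3, 10]

# List comprehension: the whole list is built before any() sees it.
del checked[:]
found_eager = any([is_big(x) for x in lx])
tested_eager = list(checked)
print(found_eager)    # True
print(tested_eager)   # [3, 10, 4, 5]
```

So the early exit comes from passing the bare expression to "any"; wrapping it in square brackets first tests every element.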

Another useful construct is "all", which checks whether all elements of a list satisfy the condition. The example below checks if all elements have a square larger than 78.

>>> lx = [10, 11, 12]
>>> all(x*x > 78 for x in lx)
True
>>> lx = [3, 4, 10]
>>> all(x*x > 78 for x in lx)
False
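
One corner case worth knowing is the empty list, where the two functions give opposite answers. A short sketch, with variable names made up for the illustration:

```python
empty = []

# No element fails the test, so all() is True.
all_empty = all(x * x > 78 for x in empty)
print(all_empty)   # True

# No element passes the test, so any() is False.
any_empty = any(x * x > 78 for x in empty)
print(any_empty)   # False
```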


Another common use that I find for list comprehension is trimming spaces from the elements of a list. If you use the Areva ODBC driver to extract data from the database, the returned values are padded with spaces to fill the exact size allocated for the data, like this:

>>> substn = ['74S     ','102S    ','286S    ']
>>> mySubstn = [x.strip() for x in substn]
>>> mySubstn
['74S', '102S', '286S']
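
Two small variations on the same idea are sketched below. The all-blank entry in the sample data is made up to show the filtering, and the names 'trailing_only' and 'non_blank' are hypothetical:

```python
# Sample data with one all-blank entry added for illustration.
substn = ['74S     ', '102S    ', '        ', '286S    ']

# rstrip() removes only the trailing padding.
trailing_only = [x.rstrip() for x in substn]
print(trailing_only)   # ['74S', '102S', '', '286S']

# Adding a condition also drops entries that are nothing but spaces.
non_blank = [x.strip() for x in substn if x.strip()]
print(non_blank)       # ['74S', '102S', '286S']
```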

In later posts, I will discuss how to use Python to extract data from within habitat.
