Friday, December 31, 2010

Counting the Number of Occurrences of Words in a File

Suppose you want to count the number of times each word appears in a text file. How would you do it? A good approach is to use a dictionary: the word being counted is the key, and the value stored under that key is the word count. For example, d['and'] = 4 means the word 'and' appeared 4 times.

If you are quick, you might now be asking: I don't even know what words exist in the file, so how could I know what keys to use? All you need to do is check whether a word is already a key in the dict. If it is not, create an entry with that key and a count of 1; otherwise, simply increment the existing entry. Below is an example:


We use humpy.txt, whose contents are:

Humpty Dumpty sat on a wall,
Humpty Dumpty had a great fall.
All the king's horses and all the king's men
Couldn't put Humpty together again.

The script that counts the number of occurrences of each word is count.py, listed below:

fin = open('humpy.txt','r')
d = {}
for line in fin:
    # break down the line into individual words
    # strip() removes leading/trailing whitespace, including the newline
    # words below will be a list whose elements are the
    # individual words
    words = line.strip().split(' ')
    for word in words:
        if word in d: # does this word have an entry in d?
            d[word] += 1  # then increment it
        else:  # no entry yet. So, make an entry
            d[word] = 1
fin.close()
# ok, all words have been processed. Print the statistics
for key, value in d.iteritems():
    print '%s appeared %d times' % (key,value)
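
The if/else bookkeeping in count.py can also be written with dict.get, which returns a default value when the key is missing. Below is a minimal sketch of that variant. The name count_words is a hypothetical helper, not part of count.py; it accepts any iterable of lines, so an open file would work as well as the list used here. It also uses split() with no argument, an assumption on my part that skipping blank lines and repeated spaces is what you want. The print call is written so that it runs under both the Python 2 used in this post and Python 3.

```python
def count_words(lines):
    """Return a dict mapping each word to the number of times it appears."""
    counts = {}
    for line in lines:
        # split() with no argument splits on any run of whitespace,
        # so blank lines simply produce no words
        for word in line.split():
            # get() returns 0 when the word has no entry yet
            counts[word] = counts.get(word, 0) + 1
    return counts


# two lines of the rhyme, standing in for the open file
sample = [
    "Humpty Dumpty sat on a wall,\n",
    "Humpty Dumpty had a great fall.\n",
]
counts = count_words(sample)
print('%s appeared %d times' % ('Humpty', counts['Humpty']))
```

The standard library's collections.Counter (Python 2.7 and later) packages the same idea if you would rather not write the loop yourself.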

If you run the command below at the command line,

python count.py

you will get a result like this (a dict has no guaranteed order, so the lines may come out in a different order):

a appeared 2 times
on appeared 1 times
great appeared 1 times
again. appeared 1 times
Humpty appeared 3 times
all appeared 1 times
Dumpty appeared 2 times
men appeared 1 times
had appeared 1 times
wall, appeared 1 times
together appeared 1 times
king's appeared 2 times
horses appeared 1 times
All appeared 1 times
fall. appeared 1 times
Couldn't appeared 1 times
put appeared 1 times
and appeared 1 times
the appeared 2 times
sat appeared 1 times

You will notice that 'all' is different from 'All'. If you want them to be considered the same, you could convert all words into lower case first by changing this

words = line.strip().split(' ')

into this

words = line.strip().lower().split(' ')
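
To see the effect of lower(), here is a small sketch that counts one line of the rhyme with and without it. The helper count_line and its ignore_case parameter are hypothetical names made up for this illustration; the counting logic is the same as in count.py.

```python
def count_line(line, ignore_case=False):
    """Count the words in one line, optionally lowercasing it first."""
    text = line.strip()
    if ignore_case:
        text = text.lower()
    counts = {}
    for word in text.split(' '):
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    return counts


line = "All the king's horses and all the king's men\n"
case_sensitive = count_line(line)                 # 'All' and 'all' are separate keys
case_folded = count_line(line, ignore_case=True)  # both are counted under 'all'
print('%d versus %d' % (case_sensitive['all'], case_folded['all']))
```

Without lower(), 'all' is counted once and 'All' once; with it, there is a single 'all' entry with a count of 2.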


The other thing that may look unusual is the print syntax. It is similar to C's printf: you write a format string with placeholders (%s for a string, %d for an integer, %f for a float), then list the corresponding values inside the parentheses in the same order as the placeholders.
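
As a quick illustration of the three placeholders, the sketch below formats a word, its count, and its share of the 26 words in the rhyme. The variable names are made up for the example, and the share calculation is only there to give %f something to print.

```python
word = 'Humpty'
count = 3
share = count / 26.0  # the rhyme has 26 words in total

# the values go in a tuple after the % operator, in placeholder order
message = '%s appeared %d times (%.2f of all words)' % (word, count, share)
print(message)

# a bare %f prints six digits after the decimal point
default_float = '%f' % share
print(default_float)
```

Writing %.2f instead of %f limits the output to two digits after the decimal point, just as in C.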

An alternative to that print statement would be:

print key + ' appeared ' + str(value) + ' times'

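Or, using join, which builds the string in one pass instead of creating intermediate strings (for a line this short the difference is negligible):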

print ' '.join([key,'appeared',str(value),'times'])
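
All three forms produce exactly the same text. The short check below builds the string each way and compares the results; the key and value are sample values taken from the output above.

```python
key, value = 'Humpty', 3

formatted = '%s appeared %d times' % (key, value)
concatenated = key + ' appeared ' + str(value) + ' times'
joined = ' '.join([key, 'appeared', str(value), 'times'])

# the three approaches differ only in how the string is built
assert formatted == concatenated == joined
print(joined)
```

Note that str(value) is required in the last two forms because + and join only accept strings, whereas %d converts the integer for you.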

This might look like a trivial example, but I used this idea to create a report of how many alarms were generated for each alarm category, and even for each alarm message. I parsed the alarm log files and used a dict, d[cat], to count the number of alarms in each category. I also created a dict keyed by alarm message and printed the top 20 alarm messages. This helps us easily identify nuisance alarms, or legitimate alarms that need immediate attention.
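
A rough sketch of that kind of report is shown below. It is not the actual script: the log format (timestamp|category|message), the sample lines, and the helper names count_alarms and top_n are all assumptions made up for illustration, since real alarm logs will look different.

```python
def count_alarms(lines):
    """Count alarms per category and per message.

    ASSUMPTION: each line looks like 'timestamp|category|message'.
    This format is invented for the example.
    """
    by_category = {}
    by_message = {}
    for line in lines:
        fields = line.strip().split('|')
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        _, cat, msg = fields
        by_category[cat] = by_category.get(cat, 0) + 1
        by_message[msg] = by_message.get(msg, 0) + 1
    return by_category, by_message


def top_n(counts, n=20):
    """Return the n (key, count) pairs with the highest counts."""
    # sort by count, highest first; ties are broken alphabetically by key
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# made-up sample lines standing in for a real alarm log file
sample_log = [
    "2010-12-31 10:00:01|PROCESS|Tank level high\n",
    "2010-12-31 10:00:05|COMMS|Link down\n",
    "2010-12-31 10:01:12|PROCESS|Tank level high\n",
    "2010-12-31 10:02:30|PROCESS|Pump trip\n",
]
by_category, by_message = count_alarms(sample_log)
for msg, count in top_n(by_message, 20):
    print('%s appeared %d times' % (msg, count))
```

Sorting the dict items by count is what turns the raw tallies into a "top 20" list; the most frequent messages are the first candidates to check for nuisance alarms.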
