I have been playing with Apache Hadoop over the last couple of days. One of the code examples is a simple word count routine, and I thought it would be fun to find a large book and do a word count on it.

I chose Ulysses by James Joyce and downloaded the Project Gutenberg version of the text. After removing the extra text related to licensing etc., I uploaded it to my Hadoop distributed file system and ran the word counter against it. The initial results were obviously not very accurate, so I tweaked the word count MapReduce job to remove extra punctuation and convert all the words to lowercase. Much better.
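For anyone curious what that tweak amounts to, here is a minimal sketch in plain Python rather than the actual Hadoop job. The names `map_words`, `reduce_counts`, and `word_count` are made up for illustration, and the rule for what counts as a word (runs of ASCII letters, with apostrophes allowed inside) is my assumption, not necessarily what the job did.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters, optionally joined by
# straight or curly apostrophes (so "don't" stays one word).
# Accented letters are not handled in this sketch.
WORD_RE = re.compile(r"[a-z]+(?:['’][a-z]+)*")


def map_words(line):
    """Map step: lowercase the line and emit (word, 1) for each word found."""
    for word in WORD_RE.findall(line.lower()):
        yield word, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each word."""
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return totals


def word_count(lines):
    """Run the map step over every line, then reduce the results."""
    return reduce_counts(pair for line in lines for pair in map_words(line))
```

Without the lowercasing and punctuation handling, "Yes," and "yes" and "Yes!" would each be counted as a different word, which is roughly why the first run looked so off.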

It turned out that the most used word was “the,” which appears 14,932 times. However, according to one of my Twitter contacts, “a,” “and,” and “the” are not supposed to be counted as words. That would make “of” the most used word (if it can be counted), at 8,138 occurrences. She was curious about the number of times “yes” is used in the text: It turns up only 360 times.

Here are the 25 most used words in Ulysses, from least to most frequent:
1134 she
1187 as
1208 said
1269 by
1288 at
1320 all
1435 is
1524 him
1786 her
1889 you
1936 for
2109 on
2133 was
2357 it
2514 with
2606 that
2698 i
3333 his
4031 he
4924 in
4962 to
6500 a
7214 and
8138 of
14932 the

If you are curious about the word counts in Ulysses, you can download this file and see for yourself. The download is a zip file containing two text files, one sorted alphabetically by word and the other sorted by the number of occurrences of each word. If you come up with any interesting visualizations or other analysis of the files, please leave a comment. I would like to see what you do with the data.
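If you want to produce the same two orderings from your own counts, something like the following would do it. This is only a sketch: the function names are hypothetical, and the "count word" line format simply mirrors the list above, which may not match the files in the download exactly.

```python
def sort_alphabetically(counts):
    """Return (word, count) pairs ordered by word."""
    return sorted(counts.items())


def sort_by_count(counts):
    """Return (word, count) pairs ordered by count, lowest first.

    Ties are broken alphabetically (an assumption for this sketch).
    """
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


def format_lines(pairs):
    """Render pairs as 'count word' lines, like the list in this post."""
    return [f"{n} {word}" for word, n in pairs]


# A few of the counts quoted in this post, used as sample input.
sample = {"the": 14932, "of": 8138, "and": 7214, "yes": 360}
print("\n".join(format_lines(sort_by_count(sample))))
```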