Word Count

I have been playing with Apache’s Hadoop over the last couple of days. One of the code examples is a simple word count routine, and I thought it would be fun to find a large book and do a word count on it.

I chose Ulysses by James Joyce and downloaded the Project Gutenberg version of the text. After removing the extra licensing text, I uploaded it to my Hadoop distributed file system and ran the word counter against it. The initial results were obviously not very accurate, because words with different capitalization or attached punctuation were counted separately, so I tweaked the word count MapReduce job to strip punctuation and convert all the words to lowercase. Much better.
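
For anyone who wants to try the same cleanup without a cluster, here is a minimal sketch of the idea in plain Python. It is not the Hadoop job described above; the function names and the token pattern are my own assumptions for illustration. The pattern only matches unaccented letters a to z, so accented words would need a wider pattern.

```python
import re
from collections import Counter

# Assumed token pattern (hypothetical): runs of a-z letters, allowing
# internal apostrophes so that "don't" stays one word.
WORD_RE = re.compile(r"[a-z]+(?:['’][a-z]+)*")


def map_words(line):
    """Map step: lowercase the line and emit (word, 1) for each word found."""
    for word in WORD_RE.findall(line.lower()):
        yield word, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals


def word_count(lines):
    """Run the map step over every line, then reduce the emitted pairs."""
    pairs = (pair for line in lines for pair in map_words(line))
    return reduce_counts(pairs)


# Made-up sample lines, not text from the book.
sample = ["Yes, I said: yes!", "The end -- the END."]
counts = word_count(sample)
print(counts["yes"], counts["the"], counts["end"])
```

Without the lowercasing and punctuation stripping, "Yes," and "yes!" would be counted as two different words, which is the kind of inaccuracy the tweak is meant to remove.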

It turned out that the most used word was “the”, at 14,932 times. However, according to one of my Twitter contacts, “a,” “and,” and “the” are not supposed to be counted as words. That would make “of” the most used word (if it can be counted), at 8,138. She was also curious about the number of times “yes” is used in the text: it turns up only 360 times.
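
Skipping those three words is easy to express in code. The sketch below uses counts quoted in this post; the exclusion set and the `most_used` helper are hypothetical names chosen for illustration.

```python
# Counts taken from the results reported in this post.
counts = {"the": 14932, "of": 8138, "and": 7214, "a": 6500, "yes": 360}

# Hypothetical exclusion list: the three words my contact said not to count.
EXCLUDED = {"a", "and", "the"}


def most_used(counts, excluded=frozenset()):
    """Return the (word, count) pair with the highest count, skipping excluded words."""
    candidates = {word: n for word, n in counts.items() if word not in excluded}
    return max(candidates.items(), key=lambda item: item[1])


print(most_used(counts))            # ('the', 14932)
print(most_used(counts, EXCLUDED))  # ('of', 8138)
```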

Here are the 25 most used words in Ulysses, listed from least to most frequent:
1134 she
1187 as
1208 said
1269 by
1288 at
1320 all
1435 is
1524 him
1786 her
1889 you
1936 for
2109 on
2133 was
2357 it
2514 with
2606 that
2698 i
3333 his
4031 he
4924 in
4962 to
6500 a
7214 and
8138 of
14932 the

If you are curious about the word counts in Ulysses, you can download this file and see for yourself. The download is a zip file containing two text files, one sorted alphabetically by word and the other sorted by the number of occurrences of each word. If you come up with any interesting visualizations or other analysis of the files, please leave a comment. I would like to see what you do with the data.
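
If you would rather produce the two orderings yourself, a rough Python sketch follows. The helper names are assumptions of mine, and the sample counts are a handful of the figures from this post rather than the full data set.

```python
def sort_alphabetically(counts):
    """Return (word, count) pairs ordered by word."""
    return sorted(counts.items())


def sort_by_count(counts):
    """Return (word, count) pairs in ascending order of count, ties broken by word."""
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


def top_n(counts, n):
    """Return the n most used words, still in ascending order of count."""
    return sort_by_count(counts)[-n:]


def format_lines(pairs):
    """Format pairs as 'count word' lines, the layout used in the list above."""
    return [f"{count} {word}" for word, count in pairs]


# A few counts quoted in this post.
counts = {"the": 14932, "of": 8138, "and": 7214, "a": 6500, "yes": 360}

for line in format_lines(top_n(counts, 3)):
    print(line)
```

For these five words, the loop prints the three highest counts in the same "count word" layout as the list above, ending with "14932 the".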

Old Media, Social Media, and Developers

I recently attended the Twitter developer meetup, and one of the subjects that came up quite often was the question of how to get “old” media companies to understand social media. A couple of days after the meeting, I read this article from 1995, where the author predicts that no online database can ever replace your daily newspaper, electronic books will never catch on, and e-commerce is a joke. The article actually made me laugh out loud, but the confluence of the article and the questions raised at the developer meeting made me think.

True, it was 1995, and the Internet was nothing like what we have today. But what the author of that article lacked was vision. What else happened in 1995? Sun launched Java. The Yahoo! domain was created. RealAudio gave us audio streaming over our dial-up connections. People who were willing to look for the potential of this new technology found it, and many of them made a lot of money along the way.

Fast forward to 2010. Everybody has a web site. Most companies understand that commerce over the Internet is a viable business model. But social media is something new. Many companies, including media companies, are still struggling to understand how to fit social networking into their business. They lack the vision to see how this new technology, this new avenue of interacting with customers, fits into their old ways of thinking.

Part of the problem is that social media is about community, more than marketing, and this is a change of thinking that is difficult for old media to embrace. Community implies an openness that many companies are not comfortable with. But this is a shift that will need to be made to make social media work for a company. Social media companies can make a big difference here by providing education, helping management and marketing teams understand how to use social media to create communities around their brands.

Another problem is that developer resources are required, and these resources are often stretched thin. Social media companies can make a big difference here by building tools and providing documentation to assist in rapid integration of social network features into a company’s existing workflow and systems. This will also create an environment where companies that specialize in social media integration can thrive.

It will be interesting to see how social media companies tackle this challenge, and it is something that we, as developers, should be thinking about.