To us, here and now, it appears thus

Rants, Ramblings and other things that make life worth living

Archive for May, 2007

Is language huffman coded?

with 5 comments

Huffman coding, for the uninitiated is a sort of compression scheme in computer science that assignes short binary representations to frequently used characters and longer binary representations to the less frequently used. The idea is that what you lose by encoding less frequently occuring characters with longer bitstrings, you gain by encoding more frequently used characters with shorter bitstrings.

Having said that, something I noticed only lately is that human language seems like its encoded using a similar scheme. The more frequent a word is used, the shorter it is. For examples, most of the closed class words like prepositons and determiners and sometimes even commonly used verbs are short, while the ones which are longer are usually rarely used words. Since I’ve been bragging lately about all this computing power that I have, Its about time to flaunt it.

I used the enron corpus which contains about 250,000 unique email messages totalling to approximately 1.5 gigs of text, small but not too small. I plotted the word frequency vs length of the word and the results seem as expected (click the image to see a better picture).

There is an initial peak when the length of characters is about 3 or so and then the frequency rapidly declines to almost nothing when the length increases to around 13 or so. A much clearer picture of the encoding that goes on here can be obtained if we normalize the frequency counts by the number of words of a given length. There are very few one letter words (namely the determiner ‘a’) but more two letter words (’an’,'to’,'in’,etc..) and so on.

normalized-graph.jpg

At first look the above graph looks negative exponential to me (meaning, good job on the compression). Infact this is what one would expect if someone were to do a good job on the compression.

Its really suprising to realise that what usually requires sophisticated critical thinking (schemes like huffman encoding) can also be easily reproduced by random, unplanned phenomena like linguistic evolution.

Signing Off,
Vishnu Vyas.

Written by admin

May 31st, 2007 at 9:51 pm

Big Numbers 2 - The Hadoop Version.

without comments

Well, I’ve talked about big numbers before, but never before have I thought that having a 1.5 gigabytes of text corpora to be actually small. Well, guess what - it is small. Tiny even. This is what happens when you have the power to process multiples of terabytes and all you have is puny 1.5 gigabytes of data. That’s right, I’m now setup on two clusters one of 3 nodes and one of 19 nodes all running Hadoop - an open source version of the Google File System and the Google Map Reduce program.

So, what is all the fuss about hadoop and map reduce? Haven’t people been doing such stuff for a long long time? - Well, yes and no. The idea of distributing your computation and then combining the results has been around for long, but what hadoop does is that instead of moving your data to the place of computation, it moves computation to the location of data. And thus allowing to run multiple independent jobs called ‘maps’ which work on each chunk of data independently and then one can use the output to do a ‘reduce’ which then combines all the output of a map step into the final result that one desires. Programming in this model is fun, powerful and furthermore really really simple. 

I will probably put up some ajaxy demos of some results that I’ve with the my new found computing power quite soon, so till then, stay tuned.

Signing Off,
Vishnu Vyas

Written by admin

May 24th, 2007 at 8:29 am

Tusckon Kavidai - Collection 1.

with 2 comments

These Tusckon kavidai’s are things that come up in late friday night conversation when everyone is either high on alcohol or on nostalgia… here is 6 of the best ones that I’ve collected (or made up) .

  1. What a pity, What a pity, Aaaya kaila otta chatti.
  2. Kadhal soriya soriya sugama irukum, sorinji mudicha peragu ranama irukum.
  3. Naiku naalu kal irunthalum athula STD call’o ISD call’o podamudiyuma?
  4. Punctuality is the main quality, gumudipoondi is a municipality, aunty wear the nighty, uncle putting 90, kazhatti potta jatti, ts all dirty, I want one malayala kutti.
  5. You think I don’t know malayalam? - I know thirichur, Cochin, Ernakulam, Trivandram, Nendaram Pazham, Shakila Padam
  6. Digiree kaapi, digiree kapi nu sollureengaley, adhu yendha college-la degree vanguchu?

By Special Request of Hari,

Signing Off,
Vishnu Vyas.

Written by admin

May 22nd, 2007 at 6:25 pm

Posted in Geeky Stuff