Bangla Word Frequency

I was curious which Bangla words come up most frequently, so I wrote a program to analyze articles from the Anandabazar Patrika newspaper. The current results are based on all of the articles from 2012. One thing that came up right away is that in Bangla, words get modified in various ways. The word বই might appear as বইটি. A verb like করা may show up in conjugated form as করেছেন. I thought it would be a lot more informative if these forms were counted as one word for the purpose of determining frequency. So my program tries to figure out whether a word is a modified form of some other "base" word.
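To make the idea of a "base" word concrete, here is a minimal sketch of one way such a mapping could work. It is not the actual program: the lexicon `KNOWN_BASES`, the suffix list `SUFFIXES`, the lookup table `VERB_FORMS`, and the function `base_form` are all hypothetical names and data made up for illustration.

```python
# Hypothetical data for illustration only; none of it comes from the
# actual program described in this article.
KNOWN_BASES = {"বই", "করা"}        # tiny lexicon of known base words
SUFFIXES = ["টি", "টা"]            # a couple of common noun suffixes
VERB_FORMS = {"করেছেন": "করা"}     # conjugated form -> base verb


def base_form(word):
    """Return the base word for `word`, or `word` itself if none is found."""
    # Conjugated verbs change too much for suffix stripping, so this
    # sketch looks them up directly.
    if word in VERB_FORMS:
        return VERB_FORMS[word]
    # Strip a suffix only when what remains is a known base word.
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in KNOWN_BASES:
                return stem
    return word


print(base_form("বইটি"))    # stripped to its base word
print(base_form("করেছেন"))  # resolved through the verb table
print(base_form("মাটি"))    # left alone: the stem is not a known word
```

The lexicon check matters: a word like মাটি happens to end in টি, but stripping it would leave something that is not the base of that word, so the sketch returns it unchanged.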

Here is a screenshot of the word frequency list:

[Image: Bangla Frequency Screenshot]

At the top, "Total words" counts how many instances of words the program scooped up from the articles. If a word like হয়ে appears 26751 times, then it gets counted that many times. Below that is "Unique Variants", in which those 26751 instances of হয়ে are counted only once. Below that, "Unique words" is computed by consolidating variants like হওয়া, হয়ে and হবে into a single category and counting them all as a single unique word, in this case হওয়া.
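The three counts can be sketched in a few lines. This is only an illustration under assumptions: the function `summarize`, the `base_of` mapping, and the sample tokens are hypothetical and not taken from the actual program.

```python
from collections import Counter

# Each word is defined once and reused so the strings compare equal.
HOWA, HOYE, HOBE, BOI = "হওয়া", "হয়ে", "হবে", "বই"

# Hypothetical variant -> base-word table.
BASE_OF = {HOYE: HOWA, HOBE: HOWA}


def summarize(tokens, base_of):
    """Compute the three header counts from a list of word tokens."""
    variant_counts = Counter(tokens)
    base_counts = Counter()
    for variant, n in variant_counts.items():
        # A variant with no entry in the table is its own base word.
        base_counts[base_of.get(variant, variant)] += n
    return {
        "total_words": sum(variant_counts.values()),  # every instance
        "unique_variants": len(variant_counts),       # distinct surface forms
        "unique_words": len(base_counts),             # distinct base words
    }


# Made-up sample input: five tokens, four surface forms, two base words.
stats = summarize([HOYE, HOYE, HOBE, HOWA, BOI], BASE_OF)
print(stats)
```

Here হয়ে contributes 2 to the total, 1 to the unique variants, and shares a single unique word with হবে and হওয়া.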

Below those three counts is a header that shows how the entries in the list are formatted. The rank simply numbers the words in order of decreasing frequency. The word হওয়া is ranked number one because no other word was counted more than 280449 times. The percentile is cumulative: it indicates what percentage of all the words seen were হওয়া and its variants, plus the words that came before it in the ranking. Thus the percentile continually increases until it reaches 100% at the end of the list. Indented beneath each main entry are all of the variants that were found, sorted in order of decreasing frequency, along with the count for each.
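The rank and the running percentile can be computed as in the sketch below. The function name `rank_with_percentile` and the counts are made up for illustration; they are not the real 2012 figures.

```python
def rank_with_percentile(base_counts):
    """Return (rank, word, count, cumulative percent) rows, most frequent first."""
    total = sum(base_counts.values())
    ordered = sorted(base_counts.items(), key=lambda item: item[1], reverse=True)
    rows = []
    running = 0
    for rank, (word, count) in enumerate(ordered, start=1):
        # Add this word's count to everything ranked above it.
        running += count
        rows.append((rank, word, count, 100.0 * running / total))
    return rows


# Made-up counts, keyed by base word.
rows = rank_with_percentile({"বই": 2, "হওয়া": 5, "করা": 3})
for rank, word, count, percentile in rows:
    print(rank, word, count, f"{percentile:.1f}%")
```

With these counts the percentile column reads 50.0%, 80.0%, 100.0%: each row includes its own share plus the share of every row above it, which is why the last row always lands on 100%.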

Notice that the word হওয়া is highlighted and, since the mouse is hovering over it, the definition "to become, to happen" is being displayed. Try hovering your mouse over this হওয়া and see if it works for you. You may have to hold the mouse still for a second or two.

I've truncated the (rather huge) list according to several criteria below, with each list longer than the one before it:

A few notes about the system:

Compute the word frequency from your own unicode file: