Using command line tools to detect the most frequent words in a file

Antonio Cangiano wrote a post about “[Using Python to detect the most frequent words in a file](”. It’s a nice summary of how to do it in Python, but (nearly) the same thing can be accomplished by stringing together a few standard command line tools.

I’m no command line ninja, but I’d like to think I have basic command of most of the standard filters. Here’s my solution:

cat test.txt | tr -s ‘[:space:]’ ‘\n’ | tr ‘[:upper:]’ ‘[:lower:]’ | sort | uniq -c | sort -n | tail -10

I’ll explain it blow-by-blow:

cat test.txt

If you don’t know what this does you’ve got a lot to learn. “cat” simply reads files and prints them to standard output (concatenates), for use by subsequent filters.

tr -s ‘[:space:]’ ‘\n’

“tr” is a handy tool that simply translates matching characters from the first set to the corresponding character of the second set. The first instance turns all whitespace characters (spaces, tabs, newlines) into newlines (“\n”) so that each word is on a separate line (the -s option “squeezes” multiple runs of newlines into a single newline).

tr ‘[:upper:]’ ‘[:lower:]’

The second instance translates all uppercase characters into lowercase (note: the two “tr”s are separate for clarity, but they could be combined into a single one).

sort | uniq -c

“sort” and “uniq” do exactly as their names imply, but “uniq” only removes adjacent duplicates, so you often want to sort the input first. The “-c” option for “uniq” prepends each line with the number of occurrences.

sort -n

We sort the result of “uniq”, this time by numerical order (“-n”) to get the list of words in order of the number of occurrences.

tail -10

Finally, we get the 10 most frequently occurring words by using “tail” to take only the last 10 lines (since the “sort -n” puts the list in ascending order)

It’s not perfect, especially since punctuation is included in the words, but the “tr” commands can be tweaked as needed.