Calculating Term Frequency in Ruby

I’ve recently been doing an online course in Data Science, part of which involves practical exercises in data mining and sentiment evaluation. The course required me to do them in Python, which was a useful learning experience, but since Ruby is my preferred hacking language I’ll redo some of them in Ruby here over the course of a few posts. This one is about calculating Term Frequency (TF), a common requirement for text mining and the first step in a TF-IDF analysis.

1. read your text into an array of words

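using a simple regex here to split on runs of non-word characters, so whitespace and punctuation act as separators (note that \W still treats digits and underscores as part of a word):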

words = File.read("my_doc.txt").split(/\W+/)

(for big documents you’ll want to avoid reading the whole file into memory at once, e.g. by processing it line by line with each_line)
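For what it’s worth, here’s a rough sketch of what that line-by-line version might look like. The method name count_words_by_line is my own invention for illustration, and it assumes it is handed anything that responds to each_line (an open File, a StringIO, etc.). It also does the counting from step 2 as it goes, since there’s no longer one big words array to count afterwards:

```ruby
require "stringio"

# Count words one line at a time so the whole document never has to
# sit in memory at once. `io` is anything that responds to each_line.
# Returns the frequency hash and the total number of words seen.
def count_words_by_line(io)
  word_freq = Hash.new(0)
  total = 0
  io.each_line do |line|
    line.split(/\W+/).each do |word|
      next if word.empty? # split leaves "" when a line starts with punctuation
      word_freq[word] += 1
      total += 1
    end
  end
  [word_freq, total]
end

# A StringIO stands in for a real file here.
sample = StringIO.new("the cat sat\non the mat\n")
word_freq, total = count_words_by_line(sample)
p word_freq
p total
```

With a real document you’d pass an open file instead, something like File.open("my_doc.txt") { |f| count_words_by_line(f) }.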

2. make a hash of word frequency

word_freq = Hash.new(0)
words.each { |word| word_freq[word] += 1 }

as always with Ruby there’s more than one way to do it so here are a couple of alternatives; first with reduce (inject):

word_freq = words.reduce(Hash.new(0)) { |freq, word| freq[word] += 1; freq }

and here’s another way using Enumerable’s each_with_object method:

word_freq = words.each_with_object(Hash.new(0)) { |word, freq| freq[word] += 1 }

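(note that the two methods order their block params differently: reduce passes the accumulator hash first and the element second, while each_with_object passes the element first and the hash second. reduce also needs the block to return the hash, which each_with_object takes care of for you).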

I’ve not yet tested which of these is fastest against large documents, but from a purely stylistic point of view I find each_with_object the clearest.
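If you want to try the comparison yourself, a rough timing harness might look like the sketch below. The method names and the made-up input are just for illustration, and a single wall-clock run like this is only a crude indication, not a proper benchmark:

```ruby
# Three ways of building the same word-frequency hash, wrapped in
# methods so they can be compared and timed.
def count_with_each(words)
  word_freq = Hash.new(0)
  words.each { |word| word_freq[word] += 1 }
  word_freq
end

def count_with_reduce(words)
  words.reduce(Hash.new(0)) { |freq, word| freq[word] += 1; freq }
end

def count_with_each_with_object(words)
  words.each_with_object(Hash.new(0)) { |word, freq| freq[word] += 1 }
end

# Wall-clock seconds taken by the block, using the monotonic clock.
def seconds
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# Made-up input: a tiny vocabulary repeated many times.
words = %w[the cat sat on the mat] * 50_000

%i[count_with_each count_with_reduce count_with_each_with_object].each do |name|
  elapsed = seconds { send(name, words) }
  puts format("%-30s %.4fs", name, elapsed)
end
```

The timings will vary by machine and Ruby version, so treat any difference as a hint rather than a verdict.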

3. calculate the TF by dividing word frequency by total words in the text

term_freq = Hash[word_freq.map { |k,v| [k, (v.to_f / words.length).round(3) ] } ]

or if you’re on Ruby 2.1 or later, the rather tidier to_h method:

term_freq = word_freq.map { |k,v| [k, (v.to_f / words.length).round(3) ] }.to_h
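To see the numbers on a tiny input, here’s a sketch that rolls steps 1 to 3 into one method. The name term_frequency is hypothetical, and I’m assuming Ruby 2.4 or later for Hash#transform_values, which is just another way of writing the map/to_h line above:

```ruby
# Steps 1 to 3 in one hypothetical helper: text in, term frequencies out.
def term_frequency(text)
  # drop the "" that split leaves when the text starts with punctuation
  words = text.split(/\W+/).reject(&:empty?)
  word_freq = words.each_with_object(Hash.new(0)) { |word, freq| freq[word] += 1 }
  total = words.length.to_f
  # transform_values (Ruby 2.4+) keeps the keys and maps only the values
  word_freq.transform_values { |count| (count / total).round(3) }
end

tf = term_frequency("the cat sat on the mat")
p tf["the"] # 2 of 6 words => 0.333
p tf["cat"] # 1 of 6 words => 0.167
```

Because each value is rounded to three places, the term frequencies won’t necessarily add up to exactly 1.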

4. sort, reverse, enjoy!

Hash[term_freq.sort_by { |k,v| v}.reverse]

(or again, in 2.1 or later)

term_freq.sort_by { |k,v| -v }.to_h
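Usually you only care about the top few terms, so here’s a small sketch for pulling those out. The top_terms helper and the sample values are made up for illustration:

```ruby
# Hypothetical helper: the n highest-scoring terms as [term, tf] pairs,
# highest first.
def top_terms(term_freq, n)
  term_freq.sort_by { |_term, tf| -tf }.first(n)
end

term_freq = { "cat" => 0.167, "the" => 0.333, "sat" => 0.25 } # made-up values

top_terms(term_freq, 2).each do |term, tf|
  puts format("%-5s %.3f", term, tf)
end
```

Converting the sorted pairs back with to_h works because Ruby hashes keep their insertion order, so the result stays sorted when you iterate over it.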

That’s about it, nothing too taxing. Next up I’ll do Inverse Document Frequency to complete the TF-IDF algorithm.
