
calculating Term Frequency in Ruby

I’ve recently been doing an online course in Data Science, part of which involves practical exercises in data mining and sentiment evaluation. The course required me to do them in Python, which was a useful learning experience, but since Ruby is my preferred hacking language I’ll redo some of them in Ruby here over the course of a few posts. This one is about calculating Term Frequency, a common requirement for text mining and the first step in a TF-IDF analysis.

1. read your text into an array of words

using a simple regex here to split on any run of non-word characters (whitespace, punctuation and so on):

words = File.read("my_doc.txt").split(/\W+/)

(for big documents you’ll want to avoid reading the whole file into memory at once, e.g. by iterating over it with each_line or File.foreach)
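A minimal sketch of that line-by-line approach might look like the following. The helper name count_words is hypothetical, and a StringIO stands in for a real file so the example runs on its own; with a file on disk you could pass File.foreach("my_doc.txt") instead.

```ruby
require 'stringio'

# Hypothetical helper: counts words from any enumerable of lines,
# so only one line needs to be held in memory at a time.
def count_words(lines)
  counts = Hash.new(0)
  lines.each do |line|
    line.split(/\W+/).each do |word|
      next if word.empty? # a line starting with punctuation yields a leading ""
      counts[word] += 1
    end
  end
  counts
end

# Sample text standing in for a large document.
sample = StringIO.new("the cat sat\n\"on\" the mat\n")
word_freq = count_words(sample.each_line)
```

This folds steps 1 and 2 together, which is the price of not building the full words array; you would need to keep a running total of words separately for the TF calculation in step 3.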

2. make a hash of word frequency

word_freq = Hash.new(0)
words.each { |word| word_freq[word] += 1 }

as always with Ruby there’s more than one way to do it so here are a couple of alternatives; first with reduce (inject):

word_freq = words.reduce(Hash.new(0)) { |freq, word| freq[word] += 1; freq }

and here’s another way using Enumerable’s each_with_object method:

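word_freq = words.each_with_object(Hash.new(0)) { |word, freq| freq[word] += 1 }

(note that the block parameters swap order: reduce passes the accumulator hash first and the element second, each_with_object passes the element first and the hash second).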

I’ve not yet tested which of these is fastest on large documents, but from a purely stylistic point of view I find each_with_object the clearest.
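If you want to compare them yourself, here is a rough sketch using Benchmark from the standard library. The sample data is made up (200,000 words from a five-word vocabulary), and timings will vary by machine and Ruby version, so treat any numbers as indicative only.

```ruby
require 'benchmark'

# Made-up sample data: 200,000 words drawn from a tiny vocabulary.
vocabulary = %w[the cat sat on mat]
words = Array.new(200_000) { |i| vocabulary[i % vocabulary.size] }

approaches = {
  'each'             => -> { h = Hash.new(0); words.each { |w| h[w] += 1 }; h },
  'reduce'           => -> { words.reduce(Hash.new(0)) { |h, w| h[w] += 1; h } },
  'each_with_object' => -> { words.each_with_object(Hash.new(0)) { |w, h| h[w] += 1 } }
}

# Sanity check: all three approaches should build the same hash.
results = approaches.transform_values(&:call)
all_equal = results.values.all? { |h| h == results.values.first }

# Print timings for each approach (label column padded to 18 characters).
Benchmark.bm(18) do |bm|
  approaches.each { |label, code| bm.report(label) { code.call } }
end
```

On Ruby 2.7 or later, words.tally builds the same counts in a single call, though the hash it returns has no default value of 0.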

3. calculate the TF by dividing word frequency by total words in the text

term_freq = Hash[word_freq.map { |k,v| [k, (v.to_f / words.length).round(3) ] } ]

or if you’re on Ruby 2.1 or later, the rather tidier .to_h method:

term_freq = word_freq.map { |k,v| [k, (v.to_f / words.length).round(3) ] }.to_h

4. sort, reverse, enjoy!

Hash[term_freq.sort_by { |k,v| v}.reverse]

(or again, in Ruby 2.1 or later, negating the value to sort in descending order)

term_freq.sort_by { |k,v| -v }.to_h
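Putting the four steps together, a small end-to-end sketch on a made-up sentence might look like this. The sample text is an assumption for illustration, and transform_values needs Ruby 2.4 or later; on older versions the map/to_h form above does the same job.

```ruby
# Made-up sample text standing in for a document.
text = "the cat sat on the mat"

# 1. split into words
words = text.split(/\W+/)

# 2. count occurrences of each word
word_freq = words.each_with_object(Hash.new(0)) { |word, freq| freq[word] += 1 }

# 3. divide each count by the total number of words
term_freq = word_freq.transform_values { |count| (count.to_f / words.length).round(3) }

# 4. sort by descending term frequency
ranked = term_freq.sort_by { |_word, tf| -tf }.to_h

ranked.each { |word, tf| puts "#{word}: #{tf}" }
```

Here "the" appears twice in six words, so it comes out on top with a TF of 0.333; every other word scores 0.167.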

That’s about it, nothing too taxing. Next up I’ll do Inverse Document Frequency to complete the TF-IDF algorithm.

querying RDF in Ruby with RDF.rb

I recently had to make a tool for work that allowed me to see the linked data graphs that BBC journalists are starting to create as they annotate news content. Ruby is my hacking language of choice so this blog post describes how I used @gkellogg’s RDF.rb library to:

  • fetch RDF graphs from the BBC’s linked data platform’s HTTPS API (via Restclient)
  • parse the data with RDF::Turtle::Reader
  • query it with RDF::Query and process the resulting Solutions

Disclaimer – I’m an amateur programmer so some of this may look horribly hacky to a Ruby or RDF expert; in my defence all I can say is that it works 🙂

getting data from the API

The BBC’s linked data platform sits behind a REST API that uses HTTPS and requires RSA cert authentication (the guys working on it plan a public API sometime soon, but for now its use is internal only). Using the restclient gem makes getting data from this kind of API pretty straightforward:

require 'restclient'

SSL = {
  :ssl_client_cert => OpenSSL::X509::Certificate.new(File.read("/path/to/my/client.crt")),
  :ssl_client_key => OpenSSL::PKey::RSA.new(File.read("/path/to/my/client.key")),
  }

def getThingGraph(guid)
  url = "https://api.live.bbc.co.uk/ldp-writer/thing-graphs?guid=" + guid
  data = RestClient::Resource.new(url, SSL).get({:accept => "application/rdf+turtle"})
end

so now I have a String object that contains some RDF/turtle graphs. For the sake of completeness here’s an example of what the API response looks like:

<http://www.bbc.co.uk/things/ffc9b446-97b0-4cec-9f4f-dbd5d8238dad#id>
      a       <http://www.bbc.co.uk/ontologies/cms/ManagedThing> , <http://www.bbc.co.uk/ontologies/news/Person> ;
      <http://www.w3.org/2000/01/rdf-schema#seeAlso>
              <http://www.chucknorris.com/> ;
      <http://www.bbc.co.uk/ontologies/coreconcepts/disambiguationHint>
              "Carlos Ray 'Chuck' Norris (born March 10, 1940) is an American martial artist and actor. After serving in the United States Air Force, he began his rise to fame as a martial artist, and has since founded his own school, Chun Kuk Do." ;
      <http://www.bbc.co.uk/ontologies/coreconcepts/preferredLabel>
              "Chuck Norris" ;
      <http://www.bbc.co.uk/ontologies/coreconcepts/sameAs>
              <http://dbpedia.org/resource/Chuck_Norris> .

<http://www.bbc.co.uk/contexts/85390773-6985-49c9-aef1-ec3763f258ab#id>
      a       <http://www.bbc.co.uk/ontologies/provenance/ThingGraph> ;
      <http://www.bbc.co.uk/ontologies/provenance/provided>
              "2013-11-07T17:20:39+00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
      <http://www.bbc.co.uk/ontologies/provenance/provider>
              <mailto:jeremy.tarling@bbc.co.uk> .

The next step was to read the response into an RDF graph that can be queried – “get me all the objects (values) of triples with the predicate <core:sameAs>” sort of thing.

reading the API response into an RDF graph

This is the bit that I got a bit stuck on. There are some great examples linked to from the RDF.rb page but none of them seemed to do exactly what I wanted, namely to work with the in-memory String object that restclient had made for me.

I ended up with a two step process: first to read the string using RDF::Turtle::Reader

rdf_doc = RDF::Turtle::Reader.new(data)

and then to append the resulting RDF data to a RDF::Graph.new object so it could be queried with RDF::Query

graph = RDF::Graph.new << rdf_doc

The getThingGraph method now looks like this:

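require 'rdf'
require 'rdf/turtle' # from the rdf-turtle gem

def getThingGraph(guid)
  url = "https://api.live.bbc.co.uk/ldp-writer/thing-graphs?guid=" + guid
  data = RestClient::Resource.new(url, SSL).get({:accept => "application/rdf+turtle"})
  # parse the Turtle string, then load the statements into a queryable graph
  rdf_doc = RDF::Turtle::Reader.new(data)
  graph = RDF::Graph.new << rdf_doc
end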

which results in an object that can now be queried.

querying the graph and processing the results

The RDF::Query class allows you to define a query pattern. In my example I’m going to define a simple query that looks for any triples that have the predicate rdf:type – a useful thing to get an idea of the sort of data you are dealing with:

@thingType = RDF::Query.execute(graph, {
  :thing => {RDF::URI("http://www.w3.org/1999/02/22-rdf-syntax-ns#type") => :type}
})

Executing the query gets you an RDF::Query::Solutions object which has some nice methods for examining graph datasets. Note that that’s ‘Solutions’, not ‘Solution’ – in other words it’s a collection so you can iterate over each solution that matched your query. In my case I’m presenting the results in a Sinatra app so they surface via an erb template:

<% @thingType.each do |thing| %>
  <%= thing[:type] %>
<% end %>

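And there you have it – for my example graphs above the result lists three types: cms:ManagedThing and news:Person for Chuck himself, plus provenance:ThingGraph for the context node that records who provided the data. Because :thing is left unbound in the query, it matches every subject that has an rdf:type, not just Chuck.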