Author Archives: jeremytarling

About jeremytarling

Data Architect for BBC News online

calculating Term Frequency in Ruby

I’ve recently been doing an online course in Data Science, part of which involves practical exercises in data mining and sentiment evaluation. The course required me to do them in Python which was a useful learning experience, but since Ruby is my preferred hacking language I’ll re-do some of them in Ruby here over the course of a few posts. This one is about calculating Term Frequency, a common requirement for text mining and the first step in a TF-IDF analysis.

1. read your text into an array of words

using a simple regex here to split on runs of non-word characters (anything other than letters, digits and underscores):

words = File.read("my_doc.txt").split(/\W+/)

(for big documents you’ll want to avoid reading the whole file in one go, e.g. by processing it line by line with each_line)
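As a rough sketch of that line-by-line approach, the counting can happen as each line is read, so the whole document never sits in memory as one big array. The count_words helper and the sample text are illustrations only, not part of the original exercise, and StringIO stands in for a file here; with a real document you could pass File.open("my_doc.txt") instead:

```ruby
require "stringio"

# Hypothetical helper: counts words from anything that responds to each_line
# (a File, a StringIO, ...), one line at a time.
def count_words(io)
  counts = Hash.new(0)
  io.each_line do |line|
    line.split(/\W+/).each do |word|
      next if word.empty? # a line that starts with punctuation yields a leading ""
      counts[word] += 1
    end
  end
  counts
end

# Made-up sample text standing in for a large document
sample = StringIO.new("the cat sat\non the mat\n")
stream_counts = count_words(sample)
```

The result is the same kind of word => count hash that the next step builds from the words array.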

2. make a hash of word frequency

word_freq = Hash.new(0)
words.each { |word| word_freq[word] += 1 }

as always with Ruby there’s more than one way to do it so here are a couple of alternatives; first with reduce (inject):

word_freq = words.reduce(Hash.new(0)) { |freq, word| freq[word] += 1; freq }

and here’s another way using Enumerable’s each_with_object method:

word_freq = words.each_with_object(Hash.new(0)) { |word, freq| freq[word] += 1 }

(note the switched order of the block params: reduce passes the accumulating hash first, each_with_object passes it last).

I’ve not tested which of these is most performant against large documents yet, but from a purely stylistic point of view I find each_with_object the clearest.
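If you want to check the performance question for yourself, Ruby’s standard Benchmark module is one way to do it. This is a sketch only: the word list is made up, and the timings will vary by machine and Ruby version, so no results are quoted here.

```ruby
require "benchmark"

# Made-up sample data: 80,000 words with plenty of repeats
words = %w[the cat sat on the mat the end] * 10_000

# The three counting styles, wrapped in lambdas so they can be timed
by_each = lambda do
  freq = Hash.new(0)
  words.each { |word| freq[word] += 1 }
  freq
end
by_reduce = -> { words.reduce(Hash.new(0)) { |freq, word| freq[word] += 1; freq } }
by_each_with_object = -> { words.each_with_object(Hash.new(0)) { |word, freq| freq[word] += 1 } }

# Prints user, system and real time for each approach
Benchmark.bm(18) do |x|
  x.report("each")             { by_each.call }
  x.report("reduce")           { by_reduce.call }
  x.report("each_with_object") { by_each_with_object.call }
end
```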

3. calculate the TF by dividing word frequency by total words in the text

term_freq = Hash[word_freq.map { |k,v| [k, (v.to_f / words.length).round(3) ] } ]

or if you’re on Ruby 2.1 or later, the rather tidier .to_h method:

term_freq = word_freq.map { |k,v| [k, (v.to_f / words.length).round(3) ] }.to_h

4. sort, reverse, enjoy!

Hash[term_freq.sort_by { |k,v| v}.reverse]

(or again, in Ruby 2.1 or later)

term_freq.sort_by { |k,v| -v }.to_h
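To see the four steps working together, here is a small worked example. The sample sentence is made up, and the snippet assumes Ruby 2.1 or later for to_h:

```ruby
# Made-up sample text in place of a file
text = "the cat sat on the mat"

# 1. split into words
words = text.split(/\W+/)

# 2. count how often each word appears
word_freq = words.each_with_object(Hash.new(0)) { |word, freq| freq[word] += 1 }

# 3. divide each count by the total number of words
term_freq = word_freq.map { |k, v| [k, (v.to_f / words.length).round(3)] }.to_h

# 4. sort by descending term frequency
ranked = term_freq.sort_by { |_k, v| -v }.to_h

ranked.each { |word, tf| puts "#{word}: #{tf}" }
```

Here "the" appears twice in six words, so it comes out on top with a TF of 0.333; every other word appears once and scores 0.167.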

That’s about it, nothing too taxing. Next up I’ll do Inverse Document Frequency to complete the TF-IDF algorithm.

Events…

About a year ago I worked on a public data model for representing news stories as linked data. The model is simple and can be summed up by example in the following RDF statements:

<Storyline1>  :hasSlot  <StorylineSlotA>
<Storyline1>  :hasSlot  <StorylineSlotB>
 
<StorylineSlotA>  :contains   <Event1>
<StorylineSlotB>  :contains   <Event2>
 
<StorylineSlotB>  :follows  <StorylineSlotA> 

<Asset1>  :about  <Event1>
<Asset2>  :about  <Event1>
<Asset3>  :about  <Event1>
 
<Asset4>  :about  <Event2>
<Asset5>  :about  <Event2>
<Asset6>  :about  <Event2>
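
To make the shape of the model concrete, here is a minimal sketch in plain Ruby that treats each statement as a three-element array and walks from a storyline to its assets. It is an illustration only: the helper names are hypothetical, and a real implementation would use a proper RDF store and query language rather than arrays.

```ruby
# The statements above as [subject, predicate, object] arrays
triples = [
  ["Storyline1", :hasSlot, "StorylineSlotA"],
  ["Storyline1", :hasSlot, "StorylineSlotB"],
  ["StorylineSlotA", :contains, "Event1"],
  ["StorylineSlotB", :contains, "Event2"],
  ["StorylineSlotB", :follows, "StorylineSlotA"],
  ["Asset1", :about, "Event1"],
  ["Asset2", :about, "Event1"],
  ["Asset3", :about, "Event1"],
  ["Asset4", :about, "Event2"],
  ["Asset5", :about, "Event2"],
  ["Asset6", :about, "Event2"]
]

# Hypothetical helper: objects of all triples with this subject and predicate
def objects_of(triples, subject, predicate)
  triples.select { |s, pred, _o| s == subject && pred == predicate }.map { |_s, _pred, o| o }
end

# Hypothetical helper: subjects of all triples with this predicate and object
def subjects_of(triples, predicate, object)
  triples.select { |_s, pred, o| pred == predicate && o == object }.map { |s, _pred, _o| s }
end

# Hypothetical helper: storyline -> slots -> events -> assets about those events
def assets_for(triples, storyline)
  objects_of(triples, storyline, :hasSlot)
    .flat_map { |slot| objects_of(triples, slot, :contains) }
    .flat_map { |event| subjects_of(triples, :about, event) }
end

storyline_assets = assets_for(triples, "Storyline1")
```

Note that no asset points at the storyline directly: the assets are about events, and the storyline reaches them through its slots.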

In order to implement that in BBC News I took a strategic decision to allow Storyline instances to be the object (rdfs:range) of our :about predicates, effectively simplifying the model to enable a journalist to say:

<Asset7>  :about  <Storyline1>

We ran a pilot with a local newsroom in winter 2013/14 and this approach worked fine: content could be aggregated into collections (typically chronological streams of updates), with each asset annotated as being about that Storyline. This can be used to drive a user experience similar to http://www.itv.com/news.

In December 2013 I was fortunate to have Paul Rissen join me in News – Paul had been one of the original collaborators on the Storyline data model, and was the author of the Stories ontology which it was derived from.  Over the past few months Paul has helped me realise that while allowing Storyline instances to be used as tags may have been useful to promote its adoption, it is semantically wrong.  A Storyline is a particular telling of a story – a version of events unique to that journalist or newsroom:

<Asset1>  :about  <Journalist A's version of events>

Doesn’t sound right, does it? News assets are usually about events, and (as Yves pointed out long ago) events involve people and organisations, take place at locations, and can involve other factors. Storyline is the editorial layer on top of that basic annotation – a curation if you like. It is the decision process that goes into the selection of assets that describe that event or series of events.

Over the coming months Paul and I will be looking at how we can implement this distinction into the (now well established) newsroom tagging workflow, to make sure that the semantic annotations we are making are as accurate and useful as possible.

 

querying RDF in Ruby with RDF.rb

I recently had to make a tool for work that allowed me to see the linked data graphs that BBC journalists are starting to create as they annotate news content. Ruby is my hacking language of choice so this blog post describes how I used @gkellogg’s RDF.rb library to:

  • fetch RDF graphs from the BBC’s linked data platform’s HTTPS API (via Restclient)
  • parse the data with RDF::Turtle::Reader
  • query it with RDF::Query and process the resulting Solutions

Disclaimer – I’m an amateur programmer so some of this may look horribly hacky to a Ruby or RDF expert; in my defence all I can say is that it works 🙂

getting data from the API

The BBC’s linked data platform sits behind a REST API that uses HTTPS and requires RSA cert authentication (the guys working on it plan a public API sometime soon, but for now its use is internal only). Using the restclient gem makes getting data from this kind of API pretty straightforward:

require 'restclient'

SSL = {
  :ssl_client_cert => OpenSSL::X509::Certificate.new(File.read("/path/to/my/client.crt")),
  :ssl_client_key => OpenSSL::PKey::RSA.new(File.read("/path/to/my/client.key")),
  }

def getThingGraph(guid)
  url = "https://api.live.bbc.co.uk/ldp-writer/thing-graphs?guid=" + guid
  data = RestClient::Resource.new(url, SSL).get({:accept => "application/rdf+turtle"})
end
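
If you would rather not add a gem, the same kind of request can be sketched with Ruby’s standard library. This is an alternative to the restclient code above, not what I actually used: the thing_graph_request and fetch_thing_graph names are hypothetical, the certificate paths are passed in by the caller, and only the request-building half can be tried without access to the internal API.

```ruby
require "net/http"
require "openssl"
require "uri"

# Hypothetical helper: builds the GET request without sending it
def thing_graph_request(guid)
  uri = URI("https://api.live.bbc.co.uk/ldp-writer/thing-graphs")
  uri.query = URI.encode_www_form(guid: guid) # escapes the guid safely
  request = Net::HTTP::Get.new(uri)
  request["Accept"] = "application/rdf+turtle"
  request
end

# Hypothetical helper: sends the request using a client certificate and key
def fetch_thing_graph(guid, cert_path, key_path)
  request = thing_graph_request(guid)
  http = Net::HTTP.new(request.uri.host, request.uri.port)
  http.use_ssl = true
  http.cert = OpenSSL::X509::Certificate.new(File.read(cert_path))
  http.key = OpenSSL::PKey::RSA.new(File.read(key_path))
  http.request(request).body
end

# Building the request does not touch the network
request = thing_graph_request("ffc9b446-97b0-4cec-9f4f-dbd5d8238dad")
```

Building the query string with URI.encode_www_form rather than string concatenation means an unexpected character in the guid cannot break the URL.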

So now I have a String object that contains some RDF/Turtle graphs. For the sake of completeness here’s an example of what the API response looks like:

<http://www.bbc.co.uk/things/ffc9b446-97b0-4cec-9f4f-dbd5d8238dad#id>
      a       <http://www.bbc.co.uk/ontologies/cms/ManagedThing> , <http://www.bbc.co.uk/ontologies/news/Person> ;
      <http://www.w3.org/2000/01/rdf-schema#seeAlso>
              <http://www.chucknorris.com/> ;
      <http://www.bbc.co.uk/ontologies/coreconcepts/disambiguationHint>
              "Carlos Ray 'Chuck' Norris (born March 10, 1940) is an American martial artist and actor. After serving in the United States Air Force, he began his rise to fame as a martial artist, and has since founded his own school, Chun Kuk Do." ;
      <http://www.bbc.co.uk/ontologies/coreconcepts/preferredLabel>
              "Chuck Norris" ;
      <http://www.bbc.co.uk/ontologies/coreconcepts/sameAs>
              <http://dbpedia.org/resource/Chuck_Norris> .

<http://www.bbc.co.uk/contexts/85390773-6985-49c9-aef1-ec3763f258ab#id>
      a       <http://www.bbc.co.uk/ontologies/provenance/ThingGraph> ;
      <http://www.bbc.co.uk/ontologies/provenance/provided>
              "2013-11-07T17:20:39+00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
      <http://www.bbc.co.uk/ontologies/provenance/provider>
              <mailto:jeremy.tarling@bbc.co.uk> .

The next step was to read the response into an RDF graph that can be queried – “get me all the objects (values) of triples with the predicate <core:sameAs>” sort of thing.

reading the API response into an RDF graph

This is the bit that I got a bit stuck on. There are some great examples linked to from the RDF.rb page but none of them seemed to do exactly what I wanted, namely to work with the in-memory String object that restclient had made for me.

I ended up with a two step process: first to read the string using RDF::Turtle::Reader

rdf_doc = RDF::Turtle::Reader.new(data)

and then to append the resulting RDF data to a RDF::Graph.new object so it could be queried with RDF::Query

graph = RDF::Graph.new << rdf_doc

The getThingGraph method now looks like this:

require 'rdf'
require 'rdf/turtle'

def getThingGraph(guid)
  url = "https://api.live.bbc.co.uk/ldp-writer/thing-graphs?guid=" + guid
  data = RestClient::Resource.new(url, SSL).get({:accept => "application/rdf+turtle"})
  # parse the Turtle response, then load its statements into a queryable graph
  rdf_doc = RDF::Turtle::Reader.new(data)
  graph = RDF::Graph.new << rdf_doc
end

which results in an object that can now be queried.

querying the graph and processing the results

The RDF::Query class allows you to define a query pattern. In my example I’m going to define a simple query that looks for any triples that have the predicate rdf:type – a useful thing to get an idea of the sort of data you are dealing with:

@thingType = RDF::Query.execute(graph, {
  :thing => {RDF::URI("http://www.w3.org/1999/02/22-rdf-syntax-ns#type") => :type}
})

Executing the query gets you an RDF::Query::Solutions object which has some nice methods for examining graph datasets. Note that that’s ‘Solutions’, not ‘Solution’ – in other words it’s a collection so you can iterate over each solution that matched your query. In my case I’m presenting the results in a Sinatra app so they surface via an erb template:

<% @thingType.each do |thing| %>
  <%= thing[:type] %>
<% end %>

And there you have it – in my example graph above the result tells me that the graph contains three types: cms:ManagedThing and news:Person for Chuck, plus provenance:ThingGraph for the context resource that records where the data came from.