calculating Term Frequency in Ruby

I’ve recently been doing an online course in Data Science, part of which involves practical exercises in data mining and sentiment evaluation. The course required me to do them in Python which was a useful learning experience, but since Ruby is my preferred hacking language I’ll re-do some of them in Ruby here over the course of a few posts. This one is about calculating Term Frequency, a common requirement for text mining and the first step in a TF-IDF analysis.

1. read your text into an array of words

using a simple regex here to split on anything that isn’t a letter:

words = File.read("my_doc.txt").split(/[^A-Za-z]/)

(for big documents you’ll want to split this e.g. using each_line)
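A streaming version might look something like this (a sketch; the `words_from` method name is my own, not from any library):

```ruby
# Stream a file line by line instead of slurping it all into memory.
def words_from(path)
  words = []
  File.foreach(path) do |line|
    words.concat(line.split(/[^A-Za-z]/).reject(&:empty?))
  end
  words
end
```

The `reject(&:empty?)` drops the empty strings that `split` produces for runs of consecutive non-letter characters.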

2. make a hash of word frequency

word_freq = Hash.new(0)
words.each { |word| word_freq[word] += 1 }

as always with Ruby there’s more than one way to do it so here are a couple of alternatives; first with reduce (inject):

word_freq = words.reduce(Hash.new(0)) { |freq, word| freq[word] += 1; freq }

and here’s another way using Enumerable’s each_with_object method:

word_freq = words.each_with_object(Hash.new(0)) { |word, freq| freq[word] += 1 }

(note that the block parameters switch order: reduce yields the accumulator first, while each_with_object yields the element first).

I’ve not yet tested which of these performs best on large documents, but from a purely stylistic point of view I find each_with_object the clearest.

3. calculate the TF by dividing word frequency by total words in the text

term_freq = Hash[word_freq.map { |k,v| [k, (v.to_f / words.length).round(3) ] } ]

or, if you’re on Ruby 2.1 or later, there’s the rather tidier Array#to_h method:

term_freq = word_freq.map { |k,v| [k, (v.to_f / words.length).round(3) ] }.to_h

4. sort, reverse, enjoy!

Hash[term_freq.sort_by { |k, v| v }.reverse]

(or again, in 2.1)

term_freq.sort_by { |k,v| -v }.to_h
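Putting the four steps together, here’s a minimal end-to-end sketch (using an in-memory string in place of the file, so it runs as-is):

```ruby
# 1. split the text into words
text = "the cat sat on the mat the cat"
words = text.split(/[^A-Za-z]/).reject(&:empty?)

# 2. count word frequency
word_freq = words.each_with_object(Hash.new(0)) { |word, freq| freq[word] += 1 }

# 3. divide each count by the total number of words
term_freq = word_freq.map { |k, v| [k, (v.to_f / words.length).round(3)] }.to_h

# 4. sort descending by frequency
sorted = term_freq.sort_by { |_k, v| -v }.to_h
# sorted["the"] => 0.375 (the highest); ties among the 0.125 terms
# may come out in any order, since sort_by isn't guaranteed stable
```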

That’s about it, nothing too taxing. Next up I’ll do Inverse Document Frequency to complete the TF-IDF algorithm.

Events…

About a year ago I worked on a public data model for representing news stories as linked data. The model is simple and can be summed up by example in the following RDF statements:

<Storyline1>  :hasSlot  <StorylineSlotA>
<Storyline1>  :hasSlot  <StorylineSlotB>
 
<StorylineSlotA>  :contains   <Event1>
<StorylineSlotB>  :contains   <Event2>
 
<StorylineSlotB>  :follows  <StorylineSlotA> 

<Asset1>  :about  <Event1>
<Asset2>  :about  <Event1>
<Asset3>  :about  <Event1>
 
<Asset4>  :about  <Event2>
<Asset5>  :about  <Event2>
<Asset6>  :about  <Event2>

In order to implement that in BBC News I took a strategic decision to allow Storyline instances to be the object (rdfs:range) of our :about predicates, effectively simplifying the model to enable a journalist to say:

<Asset7>  :about  <Storyline1>

We ran a pilot with a local newsroom in winter 2013/14 and this approach worked fine: content could be aggregated into collections (typically chronological streams of updates), with each asset annotated as being about that Storyline. This can be used to drive a user experience similar to http://www.itv.com/news.

In December 2013 I was fortunate to have Paul Rissen join me in News – Paul had been one of the original collaborators on the Storyline data model, and was the author of the Stories ontology from which it was derived. Over the past few months Paul has helped me realise that while allowing Storyline instances to be used as tags may have been useful to promote its adoption, it is semantically wrong. A Storyline is a particular telling of a story – a version of events unique to that journalist or newsroom:

<Asset1>  :about  <Journalist A's version of events>

Doesn’t sound right, does it? News assets are usually about events, and (as Yves pointed out long ago) events involve people and organisations, take place at locations, and can involve other factors. Storyline is the editorial layer on top of that basic annotation – a curation, if you like. It is the decision process that goes into the selection of assets that describe that event or series of events.
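With that distinction in place, the annotation layer can be sketched in the same informal style as the statements above (the :involves and :atLocation predicates here are illustrative only, not taken from any published ontology):

```turtle
<Asset1>  :about  <Event1>

<Event1>  :involves    <PersonX>
<Event1>  :atLocation  <PlaceY>

<StorylineSlotA>  :contains  <Event1>
<Storyline1>      :hasSlot   <StorylineSlotA>
```

The asset is tagged with the event it reports; the Storyline sits above as the editorial curation of those events.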

Over the coming months Paul and I will be looking at how we can implement this distinction into the (now well established) newsroom tagging workflow, to make sure that the semantic annotations we are making are as accurate and useful as possible.

 

querying RDF in Ruby with RDF.rb

I recently had to make a tool for work that allowed me to see the linked data graphs that BBC journalists are starting to create as they annotate news content. Ruby is my hacking language of choice so this blog post describes how I used @gkellog’s RDF.rb library to:

  • fetch RDF graphs from the BBC’s linked data platform’s HTTPS API (via Restclient)
  • parse the data with RDF::Turtle::Reader
  • query it with RDF::Query and process the resulting Solutions

Disclaimer – I’m an amateur programmer so some of this may look horribly hacky to a Ruby or RDF expert; in my defence all I can say is that it works :-)

getting data from the API

The BBC’s linked data platform sits behind a REST API that uses HTTPS and requires RSA cert authentication (the guys working on it plan a public API sometime soon, but for now its use is internal only). Using the restclient gem makes getting data from this kind of API pretty straightforward:

require 'restclient'
require 'openssl'

SSL = {
  :ssl_client_cert => OpenSSL::X509::Certificate.new(File.read("/path/to/my/client.crt")),
  :ssl_client_key => OpenSSL::PKey::RSA.new(File.read("/path/to/my/client.key")),
  }

def getThingGraph(guid)
  url = "https://api.live.bbc.co.uk/ldp-writer/thing-graphs?guid=" + guid
  data = RestClient::Resource.new(url, SSL).get({:accept => "application/rdf+turtle"})
end

so now I have a String object that contains some RDF/turtle graphs. For the sake of completeness here’s an example of what the API response looks like:

<http://www.bbc.co.uk/things/ffc9b446-97b0-4cec-9f4f-dbd5d8238dad#id>
      a       <http://www.bbc.co.uk/ontologies/cms/ManagedThing> , <http://www.bbc.co.uk/ontologies/news/Person> ;
      <http://www.w3.org/2000/01/rdf-schema#seeAlso>
              <http://www.chucknorris.com/> ;
      <http://www.bbc.co.uk/ontologies/coreconcepts/disambiguationHint>
              "Carlos Ray 'Chuck' Norris (born March 10, 1940) is an American martial artist and actor. After serving in the United States Air Force, he began his rise to fame as a martial artist, and has since founded his own school, Chun Kuk Do." ;
      <http://www.bbc.co.uk/ontologies/coreconcepts/preferredLabel>
              "Chuck Norris" ;
      <http://www.bbc.co.uk/ontologies/coreconcepts/sameAs>
              <http://dbpedia.org/resource/Chuck_Norris> .

<http://www.bbc.co.uk/contexts/85390773-6985-49c9-aef1-ec3763f258ab#id>
      a       <http://www.bbc.co.uk/ontologies/provenance/ThingGraph> ;
      <http://www.bbc.co.uk/ontologies/provenance/provided>
              "2013-11-07T17:20:39+00:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
      <http://www.bbc.co.uk/ontologies/provenance/provider>
              <mailto:jeremy.tarling@bbc.co.uk> .

The next step was to read the response into an RDF graph that can be queried – “get me all the objects (values) of triples with the predicate <core:sameAs>” sort of thing.

reading the API response into an RDF graph

This is the part I got a bit stuck on. There are some great examples linked from the RDF.rb page, but none of them seemed to do exactly what I wanted, namely to work with the in-memory String object that restclient had made for me.

I ended up with a two step process: first to read the string using RDF::Turtle::Reader

rdf_doc = RDF::Turtle::Reader.new(data)

and then to append the resulting RDF data to a RDF::Graph.new object so it could be queried with RDF::Query

graph = RDF::Graph.new << rdf_doc

The getThingGraph method now looks like this:

def getThingGraph(guid)
  url = "https://api.live.bbc.co.uk/ldp-writer/thing-graphs?guid=" + guid
  data = RestClient::Resource.new(url, SSL).get({:accept => "application/rdf+turtle"})
  rdf_doc = RDF::Turtle::Reader.new(data)
  graph = RDF::Graph.new << rdf_doc
end

which results in an object that can now be queried.

querying the graph and processing the results

The RDF::Query class allows you to define a query pattern. In my example I’m going to define a simple query that looks for any triples that have the predicate rdf:type – a useful thing to get an idea of the sort of data you are dealing with:

@thingType = RDF::Query.execute(graph, {
  :thing => {RDF::URI("http://www.w3.org/1999/02/22-rdf-syntax-ns#type") => :type}
})

Executing the query gets you an RDF::Query::Solutions object which has some nice methods for examining graph datasets. Note that that’s ‘Solutions’, not ‘Solution’ – in other words it’s a collection so you can iterate over each solution that matched your query. In my case I’m presenting the results in a Sinatra app so they surface via an erb template:

<% @thingType.each do |thing| %>
  <%= thing[:type] %>
<% end %>

And there you have it – in my example graph above the query matches three types: Chuck is a cms:ManagedThing and a news:Person, and the context resource is a provenance:ThingGraph.

Storylines vs object oriented news

I’ve recently been collaborating with some like-minded colleagues in the BBC and other media organisations on a model for story-telling in News. Building on the work that the BBC has been doing to utilise Linked Data driven content aggregations, we wanted to look at how we might model the relationship between events as told by journalists. As Michael Smethurst has pointed out, the who/what/where/when aspect of reporting events gets you so far but leaves out the more interesting elements of ‘why’ and ‘because’:

“The more interesting part (for me) is the dependencies and correlations that exist between events because why is always the most interesting question and because the most interesting answer. Getting the Daily Mail and The Guardian to agree that austerity is happening is relatively easy, getting them to agree on why, and on that basis what should happen next, much more difficult.”

I’m often reminded by colleagues at work that the BBC has a reputation for quality factual reporting and impartiality, as if to suggest that the editorialisation of news is something that only goes on in newspapers. I don’t think the BBC is just a glorified wire service aggregator and publisher; while it’s true that a good deal of BBC content does come direct from the wires, there’s an inevitable process of editorial selection: what to leave in or out, the order in which to present reported events, and the links to make between events. Also, a lot of content produced by BBC journalists doesn’t fit into a neat event model: features, analysis, even republishing the odd bit of celebrity gossip:

[image: gossip]

(As Jonathan Stray has pointed out this is not a new trend in journalism but has been a developing theme over the past century.)

From a data architecture point of view I’m particularly interested in modelling news stories as data. For the past decade the BBC News website has been a flat, page-based website where a page is equal to an article. Not a story, mind you, but an article about a story. You might find the odd article that’s a one-off, but the great majority of articles on the BBC News website will be multiple accounts of the same story, retold as new developments occur. There are three problems with this approach:

  • duplication of content – because each article stands alone it has to re-tell the events that make up the story so far
  • duplication in search engines – search engines will index each article separately, so when someone searches for details about a story they may get the BBC’s latest account or they may not be so lucky – most likely they’ll see multiple articles about the same storyline
  • link curation scaling – links between articles that are about the same story have to be manually created and curated and immediately decay from the moment an article is published

The BBC is in the process of migrating its News website from static page publishing to a dynamic publishing platform based on a typical three-tier architecture: presentation – service – data. This was done for the BBC Sport website last year, and it’s particularly exciting as the data tier consists of both a content store (for articles) and a triple store that holds semantic annotations about the articles in the content store. The opportunity for a BBC News website running on this platform is to move from a page-based model of multiple articles about the same story to a story-driven model where journalists publish updates about storylines to the same (persistent) URL for that story. This was one of the motivations for us to collaborate on the Storyline Ontology.

storyline data model

So what is an update in this story-driven approach? From a web perspective I see an update as a fragment of a story, a development if you like.  Physically it’s an asset: some text, an image, an audio clip, a video clip, a social media status update, etc. Updates might be represented in a URL structure as bbc.co.uk/news/storylineID#updateID, which could be a useful pattern for a few reasons:

  • users of the website could share individual updates via social media
  • updates could be presented in context of the wider narrative – an item in a timeline for example
  • search engines should ignore the fragment identifier (the hash and everything after it) thereby only indexing the story page and removing the duplication that I mentioned above.

But coming back to Michael’s point at the start of this post, it’s not just the updates about reported events that are interesting in a storyline, it’s the selection that drives the narrative thread and points to things like causality – the ‘why’ rather than the ‘what’.

There’s been a fair bit of buzz lately about how some news outlets are paring back news to its bare bones – a ‘just the facts’ approach – and how these ‘facts’ can be treated like objects and instanced into news accounts. Object-oriented news is not a new idea, and I can see the attraction in a short-form, social-media-status-update driven world. But I think there’s a risk in this approach: if we overemphasise these fact-objects out of the context of a narrative thread, they take on a life of their own.

Building facts into a storyline involves an editorial process that (should) ensure provenance, attribution and maybe one day even openness about the editorial process that the journalist went through. I was workshopping up in Birmingham last week with the England editorial crew and Eileen Murphy used this phrase that has stuck in my mind: ‘a window on the newsroom’. Anything that increases the transparency of our journalism can only be a good thing.

BBC core news data model

I blogged previously my early thinking about how we needed a core news model to describe basic real-world concepts (people, places, organisations) in the context of a news event, and that those events could be organised into stories. A couple of months on and we have got that corenews model installed in the BBC’s linked data platform, and journalists are now able to annotate news content with these concepts.

The URI for the model is http://bbc.co.uk/ontologies/news/ and you can get HTML or RDF/turtle from that address. The updated ontology diagram looks like this:

core news ontology diagram

You might notice that the stories class has been pulled out of here; there’s a lot of interest at the BBC and other news organisations in how linked data and story-telling can work together. We kicked off a project to collaborate with The Guardian and PA on an open model for this, which I will blog more about soon.