Getting data from the Semantic Web (Ruby)

From semanticweb.org

This tutorial is for programmers used to building software on top of non-Semantic-Web data sources: using screen scraping techniques, or using APIs that return XML, JSON, CSV etc.

Getting data from Semantic Web sources is typically done in one of two ways: either directly getting data in an RDF serialization over HTTP or by using a SPARQL endpoint.

In this tutorial, we shall get some data from DBpedia, the Semantic Web version of Wikipedia.

The original tutorial—Getting data from the Semantic Web—described how to use Python's rdflib. This version is ported to use Ruby's RDF.rb.


Getting RDF data directly

Some websites produce RDF data that is available in one of the many RDF serializations. The two most common RDF serializations that you need to worry about at the moment are RDF/XML and RDFa. RDF/XML is an XML format that contains RDF data while RDFa allows the developer to include RDF statements inside a web page.

When you first see RDF/XML, you may find it especially hard to understand compared to 'normal' XML: often it is machine-produced and contains some unfamiliar constructs. Do not be seduced into thinking that, just because it is XML, you can parse RDF/XML by hand using XML tools like DOM or SAX parsers. The same RDF graph can be encoded in a number of different ways in RDF/XML, so a hand-written parser that works on one document may break on another. The same is true for RDFa: just as you should avoid screen scraping where possible because it is fragile, you should avoid trying to parse RDFa by hand.

Fortunately, there are a variety of tools you can use to get at RDF data. For this tutorial, we'll use the Ruby library RDF.rb.


If you are using a Linux or UNIX-based machine (including Mac OS X), you can install RDF.rb by running:

sudo gem install linkeddata

This will install the basic rdf gem as well as a variety of extension modules including rdf-rdfa, rdf-rdfxml, rdf-n3 etc.

Now start an interactive Ruby shell by running 'irb' from the shell.

We shall parse some RDF/XML from DBpedia about a number of people. To parse RDF with RDF.rb, you create a Graph, which starts out as an empty holder for data. Imagine it as a big container: you can throw as much data as you like into it, then filter out just the bits you want.

First we should import the relevant libraries:


require 'rdf'
require 'linkeddata'

Next, we'll create a graph object and parse some data from the web. The RDF::Graph class has a method called load, which takes an HTTP URI, tries to load data from that source and returns a graph containing it. So, let's load in data about Elvis Presley from DBpedia.


graph = RDF::Graph.load("http://dbpedia.org/resource/Elvis_Presley")


This will pause for a second or so to load the data from the web.

We can see that we've loaded some data by seeing how many statements are in the graph object:

graph.size

At the time of writing, graph.size returned 1,097. This number will change as both the parsers that DBpedia uses and the page on Wikipedia change.


RDF as a data format is basically a graph that is built up of 'triple' statements. These are made up of subjects, predicates and objects, like simple sentences in a natural language like English. The graph having 1,097 statements is a bit like it having 1,097 individual sentences, but not necessarily about the same thing. Those sentences are of the form:


  • Elvis Presley is a rock-and-roll singer.
  • Elvis Presley was born in the United States.
  • Elvis Presley was born on 8 January 1935.

An RDF graph doesn't have to be all on the same topic. It could freely have 'sentences' about Elvis Presley, Bondi Beach, Barack Obama, the Moon, Camembert, your pet cat, a news article on the trial of a Nazi war criminal, triangles, some particular species of whale, a television programme, and anything else that is a "thing".

RDF has sentences like this translated into a machine-readable structure. The subjects – 'objects' in an object-oriented sense – are URIs, as are the predicates (like 'is a', 'was born in' etc.) and the 'objects' of the sentence are either URIs of other resources or they are 'literals', blobs of data.
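To make the shape of a triple concrete, here is a minimal sketch in plain Ruby, without RDF.rb. The Triple struct, the DBR and DBO prefixes, the TRIPLES list and the objects_for helper are all names made up for this illustration, and the values are typed in by hand rather than fetched from DBpedia.

```ruby
# A triple is one "sentence": subject, predicate, object.
Triple = Struct.new(:subject, :predicate, :object)

# Hypothetical shorthand for the two DBpedia URI prefixes used below.
DBR = "http://dbpedia.org/resource/"
DBO = "http://dbpedia.org/ontology/"

# A tiny hand-written graph; illustrative values, not real query output.
TRIPLES = [
  Triple.new(DBR + "Elvis_Presley",   DBO + "birthPlace", DBR + "United_States"),
  Triple.new(DBR + "Elvis_Presley",   DBO + "birthDate",  "1935-01-08"),
  Triple.new(DBR + "Albert_Einstein", DBO + "birthDate",  "1879-03-14"),
]

# Return the objects of every triple matching a subject and a predicate.
def objects_for(triples, subject, predicate)
  triples.select { |t| t.subject == subject && t.predicate == predicate }
         .map(&:object)
end

puts objects_for(TRIPLES, DBR + "Elvis_Presley", DBO + "birthDate")  # prints 1935-01-08
```

Note that the first triple has a URI as its object while the other two have literals, which is the distinction described above.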

In this tutorial, we're going to retrieve some literals: RDF literals are basically strings. Other datatypes exist but are implemented as a type restriction on a string. So, for instance, integers or floats or dates are just strings with a little tag on them saying "by the way, this is an integer (or a float or a date or whatever)". If you know about XML, the datatypes used in RDF literals come from XML Schema (don't worry: RDF doesn't worry about the rest of the stuff in the XML Schema spec!).
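The idea of "a string with a little tag on it" can be sketched in plain Ruby as follows. This is only an illustration of the concept, not how RDF.rb implements literals: TypedLiteral, XSD_NS and to_ruby are hypothetical names, and only three XML Schema datatypes are handled.

```ruby
require 'date'

# Namespace of the XML Schema datatypes.
XSD_NS = "http://www.w3.org/2001/XMLSchema#"

# A literal is a string (the lexical form) plus an optional datatype URI.
TypedLiteral = Struct.new(:lexical, :datatype) do
  # Convert the lexical form to a native Ruby value based on the datatype tag.
  def to_ruby
    case datatype
    when XSD_NS + "integer" then Integer(lexical, 10)
    when XSD_NS + "double"  then Float(lexical)
    when XSD_NS + "date"    then Date.iso8601(lexical)
    else lexical  # untyped or unknown: keep the plain string
    end
  end
end

puts TypedLiteral.new("1935-01-08", XSD_NS + "date").to_ruby.year   # prints 1935
puts(TypedLiteral.new("42", XSD_NS + "integer").to_ruby + 1)        # prints 43
```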

So let's retrieve the birth and death dates from the graph. The first thing we need to know is the URIs of the properties. On DBpedia, the URIs used for this are:

  • http://dbpedia.org/ontology/birthDate
  • http://dbpedia.org/ontology/deathDate

To retrieve these, we can use a query syntax in RDF.rb called Basic Graph Pattern (BGP). This is pretty simple to use.

First we make a query object:

query = RDF::Query.new({
  :person => {
    RDF::URI("http://dbpedia.org/ontology/birthDate") => :birthDate,
    RDF::URI("http://dbpedia.org/ontology/deathDate") => :deathDate
  }
})

Now we apply that to our graph:

results = query.execute(graph)

This will give us an RDF::Query::Solutions object. You can treat this like an array of solutions.

You can get the birth date by calling:

results.first[:birthDate].to_s

It should return the string "1935-01-08".

You can now use your favourite date parsing library to turn that into a date object, if you so desire.
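As one possibility, Ruby's standard date library can parse the string directly. The sketch below is only an illustration: age_on is a hypothetical helper, and the death date is typed in by hand rather than taken from the query.

```ruby
require 'date'

# Whole years between two dates, counting a year only once the anniversary has passed.
def age_on(birth, day)
  age = day.year - birth.year
  age -= 1 if ([day.month, day.day] <=> [birth.month, birth.day]) < 0
  age
end

born = Date.iso8601("1935-01-08")  # the string returned by the query
died = Date.iso8601("1977-08-16")  # assumed value, typed in by hand

puts born.strftime("%d %B %Y")  # prints "08 January 1935"
puts age_on(born, died)         # prints 42
```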

As was noted earlier, this graph structure is a big bucket where you can throw as much data as you like (within the limits of your computer's memory, of course). So let's test our birth and death date query with a few more people.


graph.load("http://dbpedia.org/resource/Tim_Berners-Lee")
graph.load("http://dbpedia.org/resource/Albert_Einstein")
graph.load("http://dbpedia.org/resource/Margaret_Thatcher")

We can re-run the query:

query.execute(graph)

Our query only matches people who have both a birth date and a death date. At the time of writing, Tim Berners-Lee and Margaret Thatcher are still alive while Elvis Presley and Albert Einstein are dead, so we should get two results. We can iterate through them:

query.execute(graph).each {|i| puts "#{i[:person]} -- born: #{i[:birthDate]}" }

And we should get this in response:

http://dbpedia.org/resource/Elvis_Presley -- born: 1935-01-08
http://dbpedia.org/resource/Albert_Einstein -- born: 1879-03-14

Hopefully, you can now do some basic parsing of data from RDF you've gotten from the web such that you can use this data for building mashups and other applications. You have to remember that while this seems rather strange compared to parsing XML or JSON, it is because RDF is built around a different data model - that of a graph structure rather than a document tree (plain XML) or nested key-value/array (JSON).

If you want to use a language other than Ruby, there are tools of similar simplicity to RDF.rb, such as Python's rdflib, which the original version of this tutorial used.

Support

More information on the parsing libraries can be found on their websites and on their project pages. For general assistance with RDF and other Semantic Web technologies, you can use:

Further resources
