Getting data from the Semantic Web


This tutorial is for programmers used to building software on top of non-Semantic-Web data sources: using screen scraping techniques, or using APIs that return XML, JSON, CSV etc.

Getting data from Semantic Web sources is typically done in one of two ways: either by fetching data in an RDF serialization directly over HTTP, or by querying a SPARQL endpoint.

In this tutorial, we shall get some data from DBpedia, the Semantic Web version of Wikipedia.

Getting RDF data directly

Some websites produce RDF data that is available in one of the many RDF serializations. The two most common RDF serializations that you need to worry about at the moment are RDF/XML and RDFa. RDF/XML is an XML format that contains RDF data, while RDFa allows the developer to embed RDF statements inside a web page.

When you first see RDF/XML, you may find it especially hard to understand compared to 'normal' XML: it is often machine-produced and contains some unfamiliar constructs. Just because it is XML, do not be seduced into thinking you can parse RDF/XML by hand using XML tools like DOM or SAX parsers. The same data inside an RDF graph can be encoded in a number of different ways in RDF/XML, so an attempt to parse out the data with XML tools may break. The same is true for RDFa: for the same reason you should avoid screen scraping where possible (it can break), you should avoid trying to parse RDFa by hand.

Fortunately, there are a variety of tools you can use to get at RDF data. For this tutorial, we'll use the Python library RDFLib.

On most systems, you can install RDFLib by running:

pip install "rdflib>=3.0.0"

Now start an interactive Python shell (running 'python3' from the shell should do it).

We shall parse some RDF/XML from DBpedia about a number of people. To parse RDF with rdflib, you create a Graph, which is a sort of empty holder for data. Imagine it as a big container: you can throw in as much data as you like, then just filter out the bits you want.

First we should import the Graph class from the rdflib package and create a Graph instance.

from rdflib import Graph, URIRef

g = Graph()

The 'g' variable now holds an empty graph.

Now we should load some data from the web. The graph object has a method called 'parse', which takes a file name on your local system or an HTTP URI, plus an optional format, and tries to load data from that source. We'll load in data about Elvis Presley (DBpedia serves RDF/XML for a resource from its /data/ URL):

g.parse("http://dbpedia.org/data/Elvis_Presley.rdf")

This will pause for a second or so to load the data from the web.

We can see that we've loaded some data by checking how many statements are in the graph object:

len(g)

At the time of writing, len(g) returned 1272. This number will change as both the parsers DBpedia uses and the underlying Wikipedia page change.

RDF as a data format is basically a graph built up of 'triples': statements made up of a subject, a predicate and an object, like simple sentences in a natural language such as English. The graph having 1272 statements is a bit like it containing 1272 individual sentences, though not necessarily all about the same thing. Those sentences are of the form:

  • Elvis Presley is a rock-and-roll singer.
  • Elvis Presley was born in the United States.
  • Elvis Presley was born on 8 January 1935.

An RDF graph doesn't have to be all about the same topic. It could freely have 'sentences' about Elvis Presley, Bondi Beach, Barack Obama, the Moon, Camembert, your pet cat, a news article on the trial of a Nazi war criminal, triangles, some particular species of whale, a television programme, and anything else that is a "thing".

RDF translates sentences like these into a machine-readable structure. The subjects ('objects' in the object-oriented sense) are URIs, as are the predicates (like 'is a', 'was born in' etc.); the objects of a sentence are either URIs of other resources, or 'literals', blobs of data.

In this tutorial, we're going to retrieve some literals: RDF literals are basically strings. Other datatypes exist but are implemented as a type restriction on a string. So, for instance, integers or floats or dates are just strings with a little tag on them saying "by the way, this is an integer (or a float or a date or whatever)". If you know about XML, the datatypes used in RDF literals come from XML Schema (don't worry: RDF doesn't worry about the rest of the stuff in the XML Schema spec!).

So let's retrieve the birth and death dates from the graph. The first thing we need to know are the URIs of the properties. On DBpedia, the URIs used for these are:

  • http://dbpedia.org/ontology/birthDate
  • http://dbpedia.org/ontology/deathDate

To retrieve the birth date, we use a method called "subject_objects" on the graph object, which takes a predicate URIRef (an object that wraps a URI) as an argument and returns the (subject, object) pairs of all statements with that predicate, as a Python generator (for those unfamiliar with Python, a generator is like an iterator in Java etc.). You can then use a for-loop to iterate over the results:

for stmt in g.subject_objects(URIRef("http://dbpedia.org/ontology/birthDate")):
    print(stmt)

The response will be as follows:

(rdflib.term.URIRef('http://dbpedia.org/resource/Elvis_Presley'), rdflib.term.Literal('1935-01-08', datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#date')))

This is a Python tuple object. You can access the data inside it as you would a tuple, and you can call str() on the URIRef and Literal objects to return the string representation (Java users: this is basically Python's equivalent of the toString() method).

for stmt in g.subject_objects(URIRef("http://dbpedia.org/ontology/birthDate")):
    print("the person represented by", str(stmt[0]), "was born on", str(stmt[1]))

Here is another example using spouse:

for stmt in g.subject_objects(URIRef("http://dbpedia.org/ontology/spouse")):
    print("the person represented by", str(stmt[0]), "was married to", str(stmt[1]))

DBpedia lists two spouse statements for Elvis:

the person represented by was married to
the person represented by was married to

As was noted earlier, this graph structure is a big bucket into which you can throw as much data as you like (within the limits of your computer's memory, of course). So let's test our birth date query with a few more people:

g.parse("http://dbpedia.org/data/Tim_Berners-Lee.rdf")
g.parse("http://dbpedia.org/data/Margaret_Thatcher.rdf")
g.parse("http://dbpedia.org/data/Albert_Einstein.rdf")

We can now run our birth date call on the lot of them:

for stmt in g.subject_objects(URIRef("http://dbpedia.org/ontology/birthDate")):
    print("the person represented by", str(stmt[0]), "was born on", str(stmt[1]))

And we should get this in response:

the person represented by http://dbpedia.org/resource/Tim_Berners-Lee was born on 1955-06-08
the person represented by http://dbpedia.org/resource/Margaret_Thatcher was born on 1925-10-13
the person represented by http://dbpedia.org/resource/Elvis_Presley was born on 1935-01-08
the person represented by http://dbpedia.org/resource/Albert_Einstein was born on 1879-03-14

Hopefully, you can now do some basic parsing of RDF data you've fetched from the web, and use it to build mashups and other applications. Remember that while this seems rather strange compared to parsing XML or JSON, that is because RDF is built around a different data model: a graph structure, rather than a document tree (plain XML) or nested key-values and arrays (JSON).

If you want to use a language other than Python, there are tools of similar simplicity to RDFLib for most languages, for example Apache Jena (Java) and EasyRdf (PHP).

Querying the Semantic Web with SPARQL

SPARQL is a SQL-like query language used with RDF. Some sites provide a SPARQL endpoint, which allows you to submit a query to the site and get back data in a variety of formats. The format returned is determined by the sort of query you send:

  • SELECT returns SPARQL Results Format XML (or JSON or a few other formats)
  • ASK returns true or false in an XML or JSON wrapper.
  • DESCRIBE returns RDF
  • CONSTRUCT returns RDF

For simple queries, you should probably start with SELECT. SELECT lets you query the whole database, and returns rows and columns.

The columns match up to all or some of the variables used in the query.
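For instance, here is a sketch of a SELECT with two variables (the dbo prefix for DBpedia's ontology namespace and the use of LIMIT are assumptions for illustration): each match in the data becomes one row, with a ?person column and a ?birth column.

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?person ?birth
WHERE {
  ?person dbo:birthDate ?birth .
}
LIMIT 10
```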

Here is a simple query:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?label
    WHERE {
      <http://dbpedia.org/resource/Asturias> rdfs:label ?label .
    }
This selects the label property of the DBpedia resource 'Asturias'. It could return numerous rows: any number of RDF statements in DBpedia may satisfy it.

SPARQL is a relatively complicated query language (although much simpler than SQL!). Teaching its syntax is outside the scope of this tutorial, so it may be useful to familiarise yourself with the query language by reading Leigh Dodds' SPARQL tutorial.

Once you've formulated a query, you then need to send it off to the server and get a response. SPARQL uses HTTP, so it should be relatively easy to get started with any language that can use HTTP. There are wrapper libraries which add a layer of convenience. The process of using these is detailed below.
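To see how little machinery is involved, here is a sketch of building a SPARQL protocol GET request URL by hand with nothing but the Python standard library (the endpoint URL is DBpedia's public one; actually sending the request is left out so the example stays offline):

```python
from urllib.parse import urlencode

endpoint = "http://dbpedia.org/sparql"
query = (
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
    "SELECT ?label "
    "WHERE { <http://dbpedia.org/resource/Asturias> rdfs:label ?label }"
)

# The SPARQL protocol passes the query text as the 'query' URL parameter;
# urlencode() percent-escapes it for us.
url = endpoint + "?" + urlencode({"query": query})

print(url)
```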

Sending the query (Python)

First, we need a library that supports remote SPARQL querying. On most systems, you can install SPARQLWrapper by running:

pip install SPARQLWrapper

Now start an interactive Python shell (running 'python3' from the shell should do it), and then use it to query DBpedia:

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?label
    WHERE {
      <http://dbpedia.org/resource/Asturias> rdfs:label ?label .
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for result in results["results"]["bindings"]:
    print(result["label"]["value"])

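The JSON that comes back follows the standard SPARQL results shape: a "bindings" list of rows, each mapping variable names to values. A self-contained sketch, with a hand-written sample response standing in for the live DBpedia answer:

```python
import json

# A hand-written sample in the W3C SPARQL JSON results shape
# (an assumption standing in for a live endpoint response).
sample = json.loads("""
{
  "head": { "vars": ["label"] },
  "results": {
    "bindings": [
      { "label": { "type": "literal", "xml:lang": "es", "value": "Asturias" } },
      { "label": { "type": "literal", "xml:lang": "en", "value": "Asturias" } }
    ]
  }
}
""")

# Each binding is one result row; each variable is a key in that row.
labels = [row["label"]["value"] for row in sample["results"]["bindings"]]
assert labels == ["Asturias", "Asturias"]
```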

More information on the libraries used here can be found on their project pages. For general assistance with RDF and other Semantic Web technologies, you can ask on the W3C semantic-web mailing list.

Further resources