DBpedia, as its home page tells us, "is a community effort to extract structured information from Wikipedia and to make this information available on the Web." That's "available" in the sense of available as data to programs that read and process it, because the data was already available to eyeballs on Wikipedia. This availability is a big deal to the semantic web community because it's a huge amount of valuable (and often, fun) information that the public can now query with SPARQL, the W3C standard query language that is one of the pillars of the semantic web.
Although I'd dabbled in SPARQL and seen several sample SPARQL queries against DBpedia in action, I had a little trouble working out how to create my own SPARQL queries against DBpedia data. I finally managed to do it, so I thought I'd describe here how I successfully implemented my first use case. Instead of a "Hello World" example, I went with more of an "I will not publish the principal's credit report" example: a list of things written by Bart on the school blackboard at the beginning of a collection of Simpsons episodes.
For an example of the structured information available in Wikipedia, see the Infobox data on the right of the Wikipedia page for the 2001 Simpsons episode Tennis the Menace and the Categories links at the bottom of the same page. The DBpedia page for that episode shows the Infobox information with the property names that you would use in SPARQL queries; semantic web fans will recognize some of the property and namespace prefixes. I'm going to repeat this because it took a while for it to sink into my own head, and once it did it made everything much easier: most Wikipedia pages with fielded information have corresponding DBpedia pages, and those corresponding pages are where you find the names of the "fields" that you'll use in your queries.
Once I knew the following three things, I could create the SPARQL query:
The Simpsons episode Wikipedia pages are the identified "things" that we would consider as the subjects of our RDF triples (or, put another way, as the objects in the {object, attribute name, attribute value} triplets that contain our data).
The bottom of the Wikipedia page for the "Tennis the Menace" episode tells us that it is a member of the Wikipedia category "The Simpsons episodes, season 12".
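The episode's DBpedia page tells us that p:blackboard is the property name for the Wikipedia infobox "Chalkboard" field. (The query below writes this property as dbpedia2:blackboard; as far as I can tell, dbpedia2: is simply the prefix that the SNORQL interface declares for the same property namespace.)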
Knowing this, I created the following SPARQL query to list everything that Bart wrote on the blackboard in season 12:
SELECT ?episode ?chalkboard_gag WHERE {
  ?episode skos:subject
    <http://dbpedia.org/resource/Category:The_Simpsons_episodes%2C_season_12> .
  ?episode dbpedia2:blackboard ?chalkboard_gag
}
You can paste this into the form in DBpedia's SNORQL interface (which adds the namespace declarations that I've omitted above) to run it, or you can just click here to execute the URL version of the query as created by the SNORQL interface.
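If you would rather run the query from a script than from the SNORQL form, here is a minimal Python sketch of the same idea. It rests on several assumptions that do not come from the text above: that the public endpoint lives at http://dbpedia.org/sparql, that skos: and dbpedia2: map to the namespace URIs shown, and that the endpoint accepts query and format parameters and can return JSON results. The names build_url and run_query are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint and namespace URIs; compare them with what SNORQL declares.
ENDPOINT = "http://dbpedia.org/sparql"
PREFIXES = """PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dbpedia2: <http://dbpedia.org/property/>
"""

# The season 12 query, with the namespace declarations spelled out.
QUERY = PREFIXES + """SELECT ?episode ?chalkboard_gag WHERE {
  ?episode skos:subject
    <http://dbpedia.org/resource/Category:The_Simpsons_episodes%2C_season_12> .
  ?episode dbpedia2:blackboard ?chalkboard_gag
}"""


def build_url(query, endpoint=ENDPOINT):
    """Return a GET URL that carries the query as a URL-encoded parameter."""
    # The "format" parameter is an assumption about this particular endpoint.
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urllib.parse.urlencode(params)


def run_query(query, endpoint=ENDPOINT):
    """Send the query and return (episode, gag) pairs from the JSON results."""
    with urllib.request.urlopen(build_url(query, endpoint)) as response:
        data = json.load(response)
    return [
        (row["episode"]["value"], row["chalkboard_gag"]["value"])
        for row in data["results"]["bindings"]
    ]
```

Calling run_query(QUERY) needs network access and should return one pair per matching episode; build_url(QUERY) on its own just shows the kind of URL that the SNORQL interface generates behind the scenes.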
Of course this is just scratching the surface. More extensive use of SPARQL features (see Leigh Dodds' excellent tutorial on XML.com) and more study of the available classes of DBpedia data will turn up huge new possibilities. It's like having a new toy to play with, and I know I'm going to have fun.
For my next step, I was hoping to list what Bart wrote in all the episodes, not just season 12. The bottom of the Wikipedia page for season 12 tells us that this category is part of the category The Simpsons episodes, but I haven't found a variation on the query above that makes the connection. I know that getting a handle on this category/subcategory relationship is going to open up what I can do with SPARQL and DBpedia in a lot more ways than just listing everything that Bart ever wrote on that blackboard.
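One direction that might be worth trying is sketched below in Python, purely as a guess. It assumes that DBpedia links each season category to its parent category with skos:broader, which I have not confirmed, and that skos: and dbpedia2: map to the namespace URIs shown. The helper name all_seasons_query is hypothetical.

```python
# Assumed namespace URIs; compare them with what SNORQL declares.
PREFIXES = """PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dbpedia2: <http://dbpedia.org/property/>
"""


def all_seasons_query(parent_category_uri):
    """Build a query for the gags in every subcategory of one parent category.

    ASSUMPTION: subcategories point at their parent with skos:broader.
    If DBpedia models the hierarchy differently, this returns nothing.
    """
    return PREFIXES + f"""SELECT ?episode ?chalkboard_gag WHERE {{
  ?season skos:broader <{parent_category_uri}> .
  ?episode skos:subject ?season .
  ?episode dbpedia2:blackboard ?chalkboard_gag
}}"""


query = all_seasons_query(
    "http://dbpedia.org/resource/Category:The_Simpsons_episodes"
)
print(query)
```

The only change from the season 12 query is the extra ?season variable, which stands in for the hard-coded season category and is tied to the parent category by the first triple pattern. This follows just one level of subcategories, so anything nested more deeply would be missed.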