Semantic web development and publishing

The Uriverse Experiment is Over

Uriverse has been archived
http://www.flickr.com/photos/dolescum/3567689465/

You’ve probably landed on this page because you were redirected from uriverse.com. I’m sorry to say that the site no longer exists. I suggest clicking the back button and selecting another search result. If you want to know a bit more about the site that was, read on…

Uriverse was an experiment which lasted for a couple of years. It demonstrated that it is possible to run a very large, richly interlinked instance of Drupal 6, the content management system. All you need is plenty of RAM for your well-placed MySQL indexes and Solr search!

The data for Uriverse came from an import of DBpedia, a semantic version of Wikipedia. There were about 13M articles in all, based off around 3M English articles. The site was run off two 3 GB, 32-bit Linode servers: one for Drupal and MySQL, the other for Solr. In the end I couldn’t justify the cost of running these servers for the small amount of traffic they were serving.

Why was Uriverse set up? Well, I wanted to scratch an itch really – could Drupal handle it? Yes – mostly. Does Solr work with big repositories? Big yes. I also was wondering whether such a project could generate enough traffic through organic search and Adsense to cover server costs and labour. The answer to this question was no :( Google is much better these days at weeding out duplicate content. It’s not 2001 any more :) I wasn’t prepared to do black hat things to make it look less like a Wikipedia/DBpedia page.

The site was live for a long while before I added any Adsense to it. Most of the content was in the supplemental index and I was only getting a handful of visits a day. When I did put Adsense on it, traffic rose to around 500 pages a day. Google then complained about an inappropriate image on the Masturbation page. Ha! I removed the image but apparently there were others which were inappropriate as well, leading me to be banned from Adsense. Who knew that Wikipedia had such sensitive content? Once Adsense was gone traffic rose a bit more to 800 pages a day but has started to tail off recently.

SEO morals of the story?

  • Well written, unique, targeted content is king.
  • So are lots of backlinks.
  • Adding Adsense seems to improve traffic from Google.
  • The long tail is still in operation.

The thing I was most proud of was getting all that data into MySQL as fast as I could. I learnt a few things about big datasets. The project also got me into Drupal, which I have been working with full time over the last couple of years.

I’ve included the About, Technical and FAQ pages below which give you a better idea of what I did with the site.

Goodbye Uriverse. It was fun while it lasted.


About

Uriverse is a site which allows you to explore a constellation of information. We do this by aggregating data about a wide range of subjects and making it available on a single page. We currently have around four million subjects in our database in 80 languages. These subjects then form the foundation for gathering extra information from around the web.

Each subject is based on a short description from Wikipedia. The core dataset has been derived from DBpedia, a semantic version of Wikipedia. DBpedia has extracted semantic information from Wikipedia such as dates, geo locations, names and relationships between subjects. Read the Uriverse Technical Overview (below) for more details. This allows Uriverse to query the data to show you interesting lists of things such as the richest people, the biggest companies or things in the vicinity of a certain location. We aim to bring a new interface to the data through interactive timelines, maps and ordered lists of things.

This basic information will be augmented with relevant, topical information from around the web. We have gathered (or will gather) photos, videos, news, weather and product suggestions for selected subjects in our database. Info will be gathered from blogs, feeds, creative commons content sites and various other third parties. We are only in the early stages at the moment, but the aim is to have compelling pages which become more than the sum of their parts. Expect to see pages incrementally extended over time.

Uriverse also allows users to rate content. All you need to do is sign up and get rating. Somewhere down the track we will use the user contributed ratings to form suggestions for other content which the user may find interesting.

Thanks for checking out Uriverse. If you have any comments or suggestions then we would love to hear them.

Murray Woodman
Cruncht


Technical Overview

Uriverse is a site which attempts to do cool things with freely available linked data. There is now a wide range of data sets which can be merged and combined to become more than the sum of the parts. Furthermore, these datasets are generally semantically rich, allowing applications to operate on the data in a way which isn’t possible with plain text. Uriverse aims to bring some of this data to life by combining it with other services and providing convenient ways of searching, filtering, browsing and visualizing the data.

The traditional way of handling and browsing RDF data would be to keep the data in an RDF store and then provide a generic user interface which lists out the various properties of a resource. More advanced approaches could include customised templates for certain resource types. These interfaces tend to be a bit technical for the average user. Uriverse has taken a different approach and has massaged the data so that it fits into the Drupal content management system. This involves looking at the structure of the content and then decomposing it into a shape which fits the CMS. During the transform process we need to consider such questions as: What content type should this be? Do these categories or types fit into a taxonomy? Should this category be a term or a node? What CCK field can these properties be mapped to? What modules can hook into this data?
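
The kind of decision described above can be illustrated with a small sketch. This is not the Uriverse import code (which was SQL driven); it is a minimal Python illustration, and the mapping tables, field names and the `resource_to_node` helper are all hypothetical. It assumes each resource arrives as a list of (predicate, object) pairs.

```python
# Hypothetical mapping tables: illustrative only, not taken from Uriverse.
TYPE_TO_CONTENT_TYPE = {
    "dbo:Person": "person",
    "dbo:Organisation": "organisation",
    "dbo:Place": "place",
}
PROPERTY_TO_FIELD = {
    "dbo:birthDate": "field_birthdate",
    "dbo:country": "field_country",
    "foaf:homepage": "field_homepage",
}


def resource_to_node(uri, triples):
    """Fold the (predicate, object) pairs of one resource into a node-like dict."""
    node = {"uri": uri, "type": "thing", "title": None, "fields": {}, "terms": []}
    for predicate, obj in triples:
        if predicate == "rdf:type" and obj in TYPE_TO_CONTENT_TYPE:
            # "What content type should this be?"
            node["type"] = TYPE_TO_CONTENT_TYPE[obj]
        elif predicate == "rdfs:label":
            node["title"] = obj
        elif predicate == "dcterms:subject":
            # "Should this category be a term or a node?" Here: a term.
            node["terms"].append(obj)
        elif predicate in PROPERTY_TO_FIELD:
            # "What CCK field can these properties be mapped to?"
            node["fields"].setdefault(PROPERTY_TO_FIELD[predicate], []).append(obj)
        # Anything unmapped is silently dropped in this sketch.
    return node


example = resource_to_node(
    "dbr:Ada_Lovelace",
    [
        ("rdf:type", "dbo:Person"),
        ("rdfs:label", "Ada Lovelace"),
        ("dbo:birthDate", "1815-12-10"),
        ("dcterms:subject", "Category:Mathematicians"),
        ("dbo:unmappedProperty", "ignored"),
    ],
)
print(example["type"], example["fields"])
```

The lookup tables carry all of the modelling decisions, so changing how a type or property is treated means editing data rather than code.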

Most of the data from DBpedia converted across without too much loss, which is a testament to the flexibility within Drupal 6. However, there were a few cases where pragmatic decisions had to be made to enable the smooth operation of the site. The main compromise which needed to be made was to import only a limited number of properties exposed in the infoboxes.

All of the features of Drupal can then be applied to the data. Straight out of the box you have support for features such as the following:

  • faceted searching through Solr and Lucene
  • content rating and recommendation through the FiveStar module
  • taxonomy
  • pretty themes
  • multiple languages
  • integration with Google Maps through Geo module
  • timelines through the Simile module
  • commenting and moderation

Down the track we will be looking towards implementing

  • RDFa on all CCK fields
  • RDF queries to external data sources
  • content recommendation

By far the most impressive feature Drupal provides is the faceted search it offers with Solr. Solr is able to index taxonomy terms along with the content which provides a very handy way to filter results based on type. Instead of entering specific search terms you are able to start broad and then filter the results down. This is the main value added by Uriverse to the datasets it uses. At this point in time it is not possible to search Wikipedia using faceted search. Solr makes this possible.
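
To make the "start broad, then filter down" idea concrete, here is a rough sketch of how a faceted Solr request can be put together. It uses Solr's standard `q`, `fq`, `facet` and `facet.field` parameters, but the field names (`type`, `language`) and the `faceted_query` helper are assumptions for illustration, not the request the Drupal Solr integration actually sent.

```python
from urllib.parse import urlencode


def faceted_query(text, filters=None, facet_fields=("type", "language")):
    """Build the path and query string for a faceted Solr select request."""
    params = [("q", text or "*:*"), ("wt", "json"), ("facet", "true")]
    for field in facet_fields:
        # Ask Solr to return value counts for each facet field.
        params.append(("facet.field", field))
    for field, value in (filters or {}).items():
        # Each chosen facet value becomes a filter query that narrows the results.
        params.append(("fq", f'{field}:"{value}"'))
    return "/select?" + urlencode(params)


# Start broad: everything, with counts per type and language.
print(faceted_query(""))
# Then drill down: only people, only English.
print(faceted_query("", {"type": "person", "language": "en"}))
```

Each click on a facet simply adds another `fq` clause, which is why the user never needs to type a more specific search term.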

Another feature of Drupal is the ability to support multiple languages. The DBpedia dataset really gives that feature a great workout. Language is also included as a facet in the search interface in Solr. It is therefore possible to search across all languages if you so desire.

The site has utilised a rating module which allows users to score a node on a 1-5 scale for certain nodetypes: people, organisations and artistic works such as books, songs and movies. If you wish to use this feature please sign up and get rating. In the future we will hopefully be able to install a recommendation engine to make suggestions based on your ratings.

Uriverse is a large site. Version 3.4 of the DBpedia dataset alone has around 3 000 000 unique objects, 10 000 000 when all 90-odd languages are considered. There are a further 300 000 000 relationships/properties. Shaping and importing this data has been somewhat of a challenge. I don’t ever want to see another ‘copy to tmp table’ in my ‘show processlist’ again :)

Finally, the site is a work in progress. Everything is always under construction, right? The initial aim was to get DBpedia into the DB and have it indexed by Solr. This has now been done. There’s lots more which can be done though. New features which leverage the semantics of the system should be added over time.


FAQ

Features

What are the main features of Uriverse?

Uriverse has implemented the following features thus far.

  • Partial import of DBpedia,
  • Import of some Flickr photos based on geo coordinates,
  • Faceted search using Solr,
  • Timelines using Simile widget,
  • Item lookups using unique IDs from the data,
  • Sorted lists using scalar values from the data,
  • Ratings of people, organizations, places, artistic works and music genres,
  • Location search for items based on geo coordinates.
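
As an illustration of the last feature, a location search over items with geo coordinates can be sketched with a great-circle distance filter. This is only one common way of doing it and is not necessarily how the site implemented it; the `haversine_km` and `nearby` helpers and the sample items are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearby(items, lat, lon, radius_km):
    """Return (distance_km, title) pairs within radius_km, nearest first."""
    hits = [(haversine_km(lat, lon, i["lat"], i["lon"]), i["title"]) for i in items]
    return sorted(h for h in hits if h[0] <= radius_km)


items = [
    {"title": "Sydney Opera House", "lat": -33.8568, "lon": 151.2153},
    {"title": "Sydney Harbour Bridge", "lat": -33.8523, "lon": 151.2108},
    {"title": "Melbourne", "lat": -37.8136, "lon": 144.9631},
]
for distance, title in nearby(items, -33.8568, 151.2153, radius_km=5):
    print(f"{title}: {distance:.2f} km")
```

On a dataset of millions of rows a linear scan like this would be too slow, so in practice the candidates would first be narrowed with an indexed bounding-box query before computing exact distances.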

What are the main features of DBpedia?

DBpedia is “a community effort to extract structured information from Wikipedia and to make this information available on the Web.” The DBpedia dataset is quite extensive and covers a wide range of things in a number of languages. According to the project:

The DBpedia knowledge base currently describes more than 2.9 million things, including at least 282,000 persons, 339,000 places (including 241,000 populated places), 88,000 music albums, 44,000 films, 15,000 video games, 119,000 organizations (including 20,000 companies and 29,000 educational institutions), 130,000 species and 4400 diseases. The DBpedia knowledge base features labels and abstracts for these things in 91 different languages; 807,000 links to images and 3,840,000 links to external web pages; 4,878,100 external links into other RDF datasets, 415,000 Wikipedia categories, and 75,000 YAGO categories. The knowledge base consists of 479 million pieces of information (RDF triples) out of which 190 million were extracted from the English edition of Wikipedia and 289 million were extracted from other language editions.

DBpedia Integration

Why bother importing DBpedia into Drupal? Wouldn’t it be better to keep it external?

Good question. During the course of the development this question did pop up more than once. DBpedia is a very large and varied dataset which did present a good number of issues during the import. The thorniest issues were based on mapping the DBpedia ontology to Drupal content types and then deciding how to map the infobox properties. In the end we had to make a number of pragmatic decisions to ease the process a bit. The end result is that we are able to leverage the various features of Drupal to display and augment the data. Having the DBpedia resources as first class ‘objects’ in the system provides a solid base for the future.

Has all of DBpedia been imported?

We have attempted to import as much of DBpedia as possible. This includes:

  • all of the languages bar one or two,
  • all redirects and disambiguations,
  • geo coords, names, internal pagelinks, Wikipedia links and homepages.

We have not included:

  • same as links,
  • instance info for other ontologies,
  • external links (importing 300M links is too painful).

The main concession we had to make concerns the importation of infobox properties. Those familiar with Drupal will be aware of the use of the Content Construction Kit to store property data. The implementation of the backend means that some data structures will be inefficient in terms of storage and query. This meant that we had to select properties which provided the most bang for our buck in terms of information density and number of queries issued on the database. In practice this meant we tended to choose properties which were:

  • applicable to the root of content types, e.g. to people in general rather than astronauts,
  • common across a number of content types, e.g. owner, length, country, language,
  • unique to a single content type, e.g. person.birthdate,
  • inverse functional properties suitable for lookup and future data integration, e.g. work.isbn,
  • scalars suitable for ordering lists, e.g. person.networth, company.employees, event.date.

In short it was a tradeoff. In the end we imported about 150 properties out of 1000 or so.

What version of DBpedia is loaded?

The dataset is from DBpedia 3.4 published from a Wikipedia dump from September 2009. This version was a marked improvement over 3.3 as it contained validated properties in the infobox fields.

Ratings and Reviews

What can be rated and reviewed?

Currently the following content types can be rated and reviewed: people, places, artistic works, music genres and organizations. Anonymous users can leave a review without logging in. However, if you wish to leave a rating you must join up to Uriverse first.

Are reviews moderated?

Reviews are moderated by Uriverse. We attempt to ensure that reviews are of a suitable standard and quality for publication. Firstly, a basic standard must be reached before acceptance – the review should not be racist, homophobic, sexist or vilify any group. Secondly, the review must have a sufficient degree of substance. A “me too” comment won’t suffice; however, the bar is set relatively low so reviews which make at least one point should be OK. Uriverse reserves the right to remove comments at any time.

The future

Interested in collaboration?

Definitely.

If you have suggestions for ways Uriverse can be integrated with other sources then we would be happy to hear from you. For example, if you publish or know of widgets which can be easily hooked into the system then these would be an easy win for us. Over time we would like to integrate the site more seriously with other sources and would be interested in discussions with any other sem web or Drupal hackers out there.

    3 Comments

    1. Goonbroab
      Posted November 9, 2011 at 8:56 am | Permalink

      What is this

    2. Posted February 15, 2012 at 6:40 am | Permalink

      Hi,

      I am interested in learning more about how you imported DBPedia into Drupal. I am attempting something similar for a small subset of Wikipedia – but using Drupal 7.

    3. Posted February 15, 2012 at 7:38 am | Permalink

      Hi Alex.

      I wrote a bit about how I imported it over this way: http://cruncht.com/361/uriverse-dbpedia-drupal-case-study/

      It was a labour of love which took a lot of SQL hacking to get to work. I could not afford to do node_loads and node_saves for 13M nodes so I opted to do the DB inserts directly. That way I could get 1000s of inserts a second rather than 10. I learnt a lot about MySQL, indexes, efficient queries and how to import stuff as quickly as possible. Also, how to recover from where you left off when a long running process died. All up it took a couple of months of tooling around with code and SQL. Once all of it was in it took another couple of months for Solr to index it all :)
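
The two ideas in this comment, batching direct inserts and being able to pick up where a dead process left off, can be sketched roughly as follows. This is not the original import code: it uses Python with an in-memory SQLite database as a stand-in for MySQL, and the `node` and `import_checkpoint` tables and the `bulk_import` helper are hypothetical. It assumes rows arrive sorted by a numeric id.

```python
import sqlite3


def _flush(conn, batch):
    """Write one batch and its checkpoint in a single transaction."""
    with conn:  # commits on success, rolls back if anything raises
        conn.executemany("INSERT INTO node (id, title) VALUES (?, ?)", batch)
        conn.execute("DELETE FROM import_checkpoint")
        conn.execute("INSERT INTO import_checkpoint (last_id) VALUES (?)", (batch[-1][0],))
    return batch[-1][0]


def bulk_import(conn, rows, batch_size=1000):
    """Insert (id, title) rows in batches, skipping ids already checkpointed."""
    conn.execute("CREATE TABLE IF NOT EXISTS node (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS import_checkpoint (last_id INTEGER)")
    last_id = conn.execute("SELECT MAX(last_id) FROM import_checkpoint").fetchone()[0] or 0
    batch = []
    for row in rows:
        if row[0] <= last_id:
            continue  # already imported by an earlier run
        batch.append(row)
        if len(batch) >= batch_size:
            last_id = _flush(conn, batch)
            batch = []
    if batch:
        last_id = _flush(conn, batch)
    return last_id


conn = sqlite3.connect(":memory:")
rows = [(i, f"Article {i}") for i in range(1, 2501)]
bulk_import(conn, rows[:1200])  # pretend the process died after 1200 rows
bulk_import(conn, rows)         # a rerun resumes after the checkpoint
print(conn.execute("SELECT COUNT(*) FROM node").fetchone()[0])
```

Because the data and the checkpoint commit in the same transaction, a crash can never leave the checkpoint ahead of the rows actually written.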

      I wouldn’t recommend this approach as it is a one hit import which takes a lot of effort.

      If you have a smaller dataset then you have a few options.

      Firstly, the migrate module could be a good way to go if you have complex relationships and IDs to maintain. I really want to get into this big time one of these days. Imports will still be slow but you have the most flexibility.

      Secondly, you have the feeds module. Good if you have a flat data structure and one to one mappings. You might be able to get away with crafting up a CSV and importing that way. I believe that Lin Clark has some good screencasts with Feeds and SPARQL (IIRC). There’s a lot of flexibility in Feeds too so this might be an easier option for you if you like wiring up config instead of writing code.

      Thirdly, you can go old school and just write your own PHP script and do it with node_saves. Fire it up with “drush scr”. I like this approach because it feels natural to me. However, Migrate gives you a lot of nice goodies such as rollback, tracking IDs and making stub objects which will save you from pulling your hair out.

      Finally, you have SQL inserts if you are handling a very big dataset. This will be more tedious now because of the way fields are handled, i.e. a lot more inserts over more tables. You’ll get to know Drupal’s schema well though.

      Oh yeah – another option is to just keep DBpedia in a triple store and then provide a view onto that. That way you have more chance of keeping up with updates from DBpedia. It isn’t updated that often though.

      You may also want to take a look at Freebase. They have a nice API, up to date data and links to DBpedia as well. You might consider importing from there using the search, image, MQL etc. APIs. I’ve been doing a bit of that lately and it is quite pleasant.

      All the best with it.
