Uriverse, a Drupal-based website, was released in January 2010. Much of its data comes from an import of DBpedia, a semantic version of Wikipedia. Uriverse contains over 13 million nodes in 90 languages, covering around 3 million primary subjects. This article is a case study of how the import was done and the challenges faced.
Drupal proved to be a flexible system which can handle large amounts of data, so long as some bottlenecks are worked around and the hardware, particularly RAM, is sufficient to hold the database indexes. The large data set was tedious to load, but in the end the value added by Drupal made it worth it.
In the project's own words: “DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data. We hope this will make it easier for the amazing amount of information in Wikipedia to be used in new and interesting ways, and that it might inspire new mechanisms for navigating, linking and improving the encyclopaedia itself.”
Over the years I have had an interest in the semantic web, and linked data in particular. The DBpedia project had always impressed me because it was an effort which allowed Wikipedia to be used as a hub of subjects for the semantic web. Wikipedia represents most of the commonly discussed subjects of our times, created and edited from the ground up by real people. It therefore forms a practical basis from which to build out relationships to other data sets. If you are looking for subjects to represent the things you want to talk about, DBpedia is a good place to start. There is an increasing momentum around it as the linked data meme starts to spread.
Not only has the DBpedia project formalized these subjects, it has extracted a large amount of information which was codified in the somewhat unstructured wikitext syntax used by Wikipedia. The knowledge within Wikipedia has been made explicit in a way that can be used (browsed, queried and linked) easily by other systems. The DBpedia data set therefore provides a convenient store for import into a CMS such as Drupal.
Choosing to import DBpedia into a content management system is not necessarily a natural thing to do. The RDF data model is very simple and flexible and allows for all kinds of data structures which may not fit well within a system such as Drupal. It may seem like I was attempting to get the worms back into the can after it had been opened. At times it did feel that way. However, there was a good deal of regularity and structure within DBpedia. The provision of an ontology and a “strict” infobox mapping in version 3.4 made the importation process possible. Whilst not everything could be imported, most of DBpedia made it in. Concessions had to be made along the way, and the places where Drupal proved to be too inflexible are indicative of areas which could be improved in Drupal. More on that later.
I chose Drupal as the target platform because I had been impressed with the flexibility I had seen from the system. Unlike other popular blogging platforms/CMSs, Drupal has a decent way of specifying a schema through the Content Construction Kit (CCK). I believed that it was possible to mimic the basic subject-predicate-object structure of RDF. Drupal supports different content types with custom properties (strings, dates, integers, floats) and relationships between objects (node references, node referrer). It also has a relatively strong category system baked in which can be used for filtering. Drupal also offers a lot of other attractive features apart from the data modeling: users, permissions, themes, maps, ratings, timelines, data views, friendly URLs, SEO, etc. Utilizing these features was the carrot to get the data into the system.
The DBpedia data set
- 2.9 million things
- 282,000 persons
- 339,000 places
- 88,000 music albums
- 44,000 films
- 15,000 video games
- 119,000 organizations
- 130,000 species
- 4,400 diseases
- 91 different languages
- 807,000 links to images
- 3,840,000 links to external web pages
- 415,000 Wikipedia categories
The data set is provided in two main formats: N-Triples and CSV. The N-Triples format is suitable for loading into RDF stores with RDF software such as ARC. The CSV format was handy for quickly loading data into a relational database such as MySQL. In a data set as big as DBpedia it is essential that you do things the quickest way, or you will spend weeks importing the data. Here are some very rough comparisons of import speed. NB: these are from memory and are very rough estimates.
| Method | Approx. rows/sec |
| --- | --- |
| N-Triples into ARC | 100 |
| MySQL inserts from a select on another table | 1,000 |
| CSV, LOAD DATA INFILE (same disk) | 10,000 |
| CSV, LOAD DATA INFILE (different disk) | 100,000 |
Obviously, if you are importing hundreds of millions of triples, these comparisons are important. (I can’t quite believe the last result but that is what I remember it to be.) Put your dumps and DB on different disks when doing imports! Mind you, adding the indexes after the data has been imported is still a very time-consuming process requiring data to be copied to disk, so the super-fast results are a bit misleading.
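As a sketch of the fast path above, this is roughly what a bulk CSV load looks like in MySQL. The table and file names are hypothetical; the important points are that the table has no indexes during the load, and that the indexes are added afterwards:

```sql
-- Hypothetical staging table; deliberately no indexes yet so the load stays fast.
CREATE TABLE dbpedia_abstracts (
  uri      VARCHAR(255) NOT NULL,
  language CHAR(2)      NOT NULL,
  abstract TEXT
) ENGINE=MyISAM;

-- Bulk load from a CSV dump, ideally sitting on a different disk to the DB.
LOAD DATA INFILE '/dumps/long_abstracts_en.csv'
INTO TABLE dbpedia_abstracts
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';

-- Only now add the indexes needed for later lookups.
ALTER TABLE dbpedia_abstracts ADD INDEX idx_uri (uri);
```

Adding the index at the end still copies the table to disk, which is the hidden cost behind the headline load speeds.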
Mapping to Drupal structures
The DBpedia data set can be divided into two parts: infobox and non-infobox. Infobox data includes all of the varied properties which are applied to each class as defined in the ontology. As such, it presents a substantial data modeling exercise to fit it into Drupal. Much of my time was spent analyzing the class hierarchy and properties to work out the sweet spot as to how content types and properties would be allocated. The non-infobox data is much more predictable and easier to import in standard ways.
| DBpedia | Drupal | Notes |
| --- | --- | --- |
| Resources | Nodes | Natural mapping: a Wikipedia article maps to a DBpedia subject, which maps to a Drupal node. |
| Classes | Taxonomy | The class hierarchy in the ontology maps cleanly to a taxonomy. |
| Classes | Content types | Each DBpedia class belongs to a base super class, which forms the content type, e.g. actors and politicians are both Persons. Person is the base class and therefore becomes the natural candidate for the content type in Drupal. |
| Categories | Nodes | Wikipedia categories are messy (irregular, loops) and better suited to being nodes. |
| Language labels | Translations | A base English node with translations handles the subject plus different language labels from DBpedia well, if not perfectly. |
| URI | URL alias | The URL in Wikipedia maps to DBpedia, which maps to a Drupal URL alias. |
| Article labels | Node title | In various languages. |
| Long abstract | Node content | In various languages. |
| Short abstract | Node teaser | In various languages. |
| Wikipage | CCK Link | Applied to all translations. |
| Homepage | CCK Link | Applied to all translations. |
| Image | CCK Link | Applied to English version only. |
| Instance type | | Applied to English version only. |
| Redirect | Node with CCK Node Ref | Different names/spellings redirect to the English version. |
| Disambiguation | Node with CCK Node Ref | Points to various English versions. |
| Page links | CCK Node Ref | Untyped links between nodes. |
| External links | Ignored | Too many to import. |
| Geo | CCK Location | Applied to English version only. |
| Person data (names) | CCK strings | Applied to people and organizations. |
| Infobox (strict) | Various CCK fields | The DBpedia ontology specifies which classes have which properties. |
Content Types, CCK and Importation Concessions
There were a number of areas where there was not a clear mapping between DBpedia and the features offered by CCK. The notes below refer to Drupal 6, and do not consider the Fields in Core initiative in Drupal 7.
Drupal handles object properties through a system known as the Content Construction Kit (CCK). CCK offers a handy interface for defining a schema for various content types. Each object (node) is an instance of a single class (content type). Each node therefore has the properties of its content type.
Those of you familiar with the inner workings of CCK in Drupal 6 will understand the idiosyncrasies of the way CCK works behind the scenes. On the backend things work as expected up to a certain level and then they get a bit complicated. Properties for a content type are grouped together in a single database table, as you would expect. There are two exceptions to this rule. Firstly, if the property can have multiple values, it is stored in a separate table. This too is natural enough. Secondly, if two content types share a property then it is split out into its own table. This is a bit strange and can catch you unaware if you aren’t expecting it. However, it is sensible enough as it allows easy queries across content types on a single property.
Things become tricky when you have (i) lots of “multi” properties or (ii) lots of “shared” properties. Drupal needs to issue a new query on a different table to get the data back for each such property. This is alright for most sites but has the potential to be a worry in the case of DBpedia, where there is a massive number of different relationships and, in some cases, relatively few instances of those relationships, i.e. the data is not very dense. We are potentially talking about hundreds of properties which would need to be retrieved.
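To illustrate the layout, here is a sketch of the three table shapes CCK produces in Drupal 6. The table and field names are hypothetical, but the structure mirrors what lands in the database: single-value unshared fields sit in the per-type table, while every shared or multi-value field gets its own table, each costing an extra query on node load:

```sql
-- Per-content-type table: single-value fields unique to Person.
CREATE TABLE content_type_person (
  nid                   INT UNSIGNED NOT NULL,
  vid                   INT UNSIGNED NOT NULL,
  field_birthdate_value DATETIME,
  PRIMARY KEY (vid)
);

-- A field shared between, say, Person and Organisation is split out on its own.
CREATE TABLE content_field_homepage (
  nid                INT UNSIGNED NOT NULL,
  vid                INT UNSIGNED NOT NULL,
  field_homepage_url VARCHAR(255),
  PRIMARY KEY (vid)
);

-- A multi-value field also gets its own table, with a delta for ordering.
CREATE TABLE content_field_alias (
  nid               INT UNSIGNED NOT NULL,
  vid               INT UNSIGNED NOT NULL,
  delta             INT NOT NULL,
  field_alias_value VARCHAR(255),
  PRIMARY KEY (vid, delta)
);
```

With hundreds of sparse DBpedia properties, that third and second shape multiply quickly, which is the table “explosion” discussed below.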
Unfortunately, in these cases it makes sense to pick only the properties where you get the most bang for your buck. Which “shared” properties are shared amongst the most types? Which “multi” properties have the most instances? Which “single” properties have the most density down the rows? Along the way we had to be pragmatic and pick and choose the properties we would support. At the end of the day this wasn’t a limitation of CCK; rather, sensible decisions were made to stop the database exploding out into thousands of tables with very little data in them.
I don’t see an easy way to solve the table “explosion” problem short of moving to an RDF datastore such as that implemented by ARC. For data sets with demands similar to DBpedia it makes sense to have something such as an RDF store in the backend. This conclusion is incredibly ironic given the lengths I have gone to to get the data out of DBpedia and into Drupal. All was not lost, however, as the majority of the data made it in and is usable within Drupal using standard techniques.
Interestingly this very issue seems to have plagued the backend design process for Drupal 7. According to the DADS Final Report (4th Feb 2009), “CCK 2 for Drupal will drop its variable schema and use the multi-value schema style for all fields.” Hmmm. I haven’t checked out Drupal 7 in this much detail but if this is true then the table explosion problem is going to be worse in Drupal 7. I’m not abreast of current developments here so I can’t comment further.
No subclassing of content types
Drupal doesn’t allow for subclassing content types. Each content type exists as its own base class. This means that we have to define base content types with all of the properties of the contained subclasses. The ontology in DBpedia can be up to four levels deep: Eurovision Song Contest Entry is the deepest, preceded by Song, Musical Work, Work and finally Thing (a catch-all at the root). This of course leads to base content types with many properties which will be null for the instance in question. The database tables become very wide with a relatively low density of information.
The Taxonomy system does allow us to partially work around the sub classing problem. A class hierarchy maps nicely to a hierarchy of Terms in a Taxonomy. Further, multiple Terms can be applied to a Node making it possible to specify different classes for a node. It doesn’t cover the difficulty of property storage however.
No multiple inheritance
When it comes to data modeling there are different ways to handle typing. The most simplistic and limited way is to allow instances to have a single class. This is the way Drupal currently works with content types and CCK. Each node belongs to a single content type and has the properties defined by CCK for that node. A more flexible way of modeling data allows for multiple inheritance where an instance can have more than one class.
Where an instance did straddle two base classes it was impossible to carry data for both types. A query over the data shows all “people” who are “works” as well. I think these cases can be put down to DBpedia being a bit too promiscuous in the infoboxes it processes for each article. This isn’t a strong argument for multiple inheritance because the data is probably erroneous; however, it does demonstrate an area where modeling could be more flexible.
In some cases a compound type was required. For example, image data had three components: thumbnail link, depiction link and copyright info link. All three of these should have been considered one unit; however, this is not possible through the standard CCK interface, which handles atomistic primitives. Because these image properties applied to multiple content types, the end outcome was that they were represented by different database tables. It was very frustrating to know three queries were being issued when one (or none) would suffice. It is possible to define your own custom datatypes through a module, but this is a fairly high barrier to jump.
Interfaces: A possible solution
Freebase is a collaboratively edited system of open data which is similar to DBpedia in many respects. The main difference is that Freebase allows users to create and edit content according to certain schemas. One of the very impressive aspects of the Freebase system is its ability to support multiple types, or co-types, for an object. From the Freebase data modeling guide we have:
A novel aspect of the Metaweb system is that instances may have multiple types. A single topic such as “Kevin Bacon” may have multiple types such as Person, Film Actor, TV Actor and Musical Artist, and others. Since no single type could encapsulate such diversity, multiple types are required to hold all the properties needed to fully describe Kevin Bacon and his life.
This approach has the advantage of grouping properties in a table allowing for fast retrieval and querying, as well as allowing for flexibility and sensible design. I believe that Drupal could benefit from the Freebase approach. The current content type + CCK model could remain in place AND be augmented by an interface system which allowed for grouping of properties for various types. To take the Kevin Bacon example, “Kevin Bacon” would be a Person content type with the base properties of Birthday and Deathday. “Kevin Bacon” would then have the FilmActor, TVActor and MusicalArtist interfaces which could be represented by separate tables on the backend. I believe that this offers good flexibility for those desiring a powerful system whilst maintaining simplicity for those who just need base types. It also solves a lot of the hand wringing which goes with the way some CCK tables are formed.
Import into Drupal
As a newcomer to Drupal, by far the most disappointing aspect of the system was the lack of a clear and easy-to-understand API. I assumed that there would be a nice object abstraction which I could use to populate the system. I gradually came to understand that most development was done by examining the excretions of print_r() to determine what data was available at that particular moment. Where was the interface to content types, CCK and nodes? How could I create these programmatically? There were times where I stopped and paused and considered a framework which was cleaner and more lightweight. The rich functionality was the thing that kept me, though.
The large size of the data set pretty much dictated that it needed to be imported in the most efficient way possible. If I accepted a 1 second overhead for a save/update I would be waiting at least four months for the data to load. So, notwithstanding the state of the API in Drupal 6, a straight database import was the order of the day. After a bit of reverse engineering and mucking around I had a few PHP/SQL scripts which could insert content pretty quickly, with insertion rates of around 1000 rows a second.
The import process followed these simplified steps for the various pieces of DBpedia.
- Import DBpedia data from CSV format into MySQL.
- Create a staging_node table with columns: title, language, teaser, content, url_alias, nid.
- Populate staging_node with DBpedia data and clean it.
- Copy staging_node data into the Drupal database.
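The final copy step can be sketched as a couple of INSERT … SELECT statements against Drupal's core tables. This is a simplified sketch: the 'person' content type is just an example, the column lists are abbreviated, and the real scripts also populated node_revisions and the CCK tables:

```sql
-- Copy staged rows straight into the node table in one pass.
INSERT INTO node (nid, type, language, title, status, created, changed)
SELECT nid, 'person', language, title, 1,
       UNIX_TIMESTAMP(), UNIX_TIMESTAMP()
FROM staging_node
WHERE language = 'en';

-- Friendly URLs go straight into url_alias.
INSERT INTO url_alias (src, dst)
SELECT CONCAT('node/', nid), url_alias
FROM staging_node
WHERE language = 'en';
```

Working at this level sidesteps the per-node API overhead entirely, which is what made 1000 rows a second possible.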
The process was therefore semi-automated with a number of scripts. It still took a few weeks to run through from start to finish. Importing the 75M page links was the final time-consuming step. I could only get a throughput of under 200 rows a second, as a url_alias-to-nid lookup was required for each row. This part of the process took around 7 days. Not something I want to repeat.
During the import I realized a few things about MySQL techniques which may come in handy for other people.
- Loading data into MySQL direct from file, with no index is very fast.
- It’s even faster if the source file is on a different disk to the DB.
- The overhead from very heavy/large select queries can be minimized by using mysql_unbuffered_query. The connection can be fragile though so be prepared with an internal counter so you know where the process got up to before it died. You can’t write to the table you are reading from and it is locked for other operations.
- Sometimes pure SQL is a good way to go: INSERT INTO … SELECT.
- Sometimes joining is no good and running a SELECT for each row is best (if caching and keys are working).
- SHOW FULL PROCESSLIST is your friend.
- Sorting is your enemy.
- Dumping temp tables to disk must be avoided. Try tweaking your conf. You want to hear the cooling fans humming rather than the disk ticking away.
The richness of the DBpedia data set provided a good foundation for using the following Drupal features:
- Image thumbnails pointing to Wikipedia display in the content, expanding to the full depiction via Thickbox.
- Geo coordinates for nodes displayed in a Google Map via the Geo module.
- Geo coordinates used with the Flickr API to pull back Creative Commons images of places from Flickr.
- FiveStar ratings used on People, Places, Organizations and Music Genres.
- Views provided top rated lists of the most popular things.
- Simile Timeline widget used to display chronology of events.
- Solr used as search engine and filters on language and class.
- Solr recommends “More like this” based on Node content.
- Filtered Views provide lookups for titles, first name, last name and geo coordinates.
- Various node properties allow for sorted list Views of richest people etc.
OK. So the data is in Drupal – were there any problems when it came to running the site? Yes, I ran into a few challenges along the way. Some were fixed, others worked around and others still remain a thorn in our side.
The database is the most pressing concern when it comes to performance on a big site. If simple queries run slowly then the site will not function acceptably even for low traffic. The most important thing is to ensure that the indexes for key tables have been loaded into the key buffer. Let's look at a couple of simple selects with and without a primed key buffer.
| Query | No key buffer | Key buffer |
| --- | --- | --- |
| select sql_no_cache title from node where nid=1000000; | 0.05s | 0.00s |
| select sql_no_cache dst from url_alias where src='node/1000000'; | 0.02s | 0.00s |
If these indexes aren’t in RAM then the most basic of lookups will take a long time, i.e. 0.07s. If you are on a page with a view with 50 nodes to look up, then just getting the title and path out will take (0.07s * 50) 3.5 seconds. Note that this doesn’t include all the other processing Drupal must do. This is completely unacceptable, and so it is mandatory to get these indexes into RAM. I recommend putting the following SQL into a file and running it every time MySQL is started up, using the init-file variable in my.cnf.
LOAD INDEX INTO CACHE node;
LOAD INDEX INTO CACHE node_revisions;
LOAD INDEX INTO CACHE url_alias;
LOAD INDEX INTO CACHE term_data;
LOAD INDEX INTO CACHE term_node;
On massive sites you probably won’t be able to get all of the node indexes into the key buffer, even if you have been generous in its allocation (up to 50% of RAM). In this case I resorted to running a query which seems to get the node nids into the buffer whilst leaving out all the other indexes which aren’t used as much. It takes a while to run but does the trick.
select count(n.nid), count(n2.nid) from node n inner join node n2 on n.nid=n2.nid;
In the best case scenario we would all have RAM (money) to burn but unfortunately that’s generally not the case.
Core and Contributed modules
There are a few areas in Drupal which bogged down when handling huge data sets. As I developed the site I took note of problematic areas. Most of these areas are probably well known so we’ll just mention them briefly.
- In general a node load is very heavy. Lazy loading of CCK properties would make the system much faster where the CCK data isn’t needed. It would also mean the API could be used when speed is an issue, i.e. when processing millions of nodes at once. During import and update, the solution is to work directly with the database. Just for a laugh I tried setting the status to 0 for a large node, but the 404 handler still tried to load the whole node and then died.
- Editing a node with many multi properties is all but impossible. The edit page is massive and RAM/CPU are hammered. The workaround is not to edit! The real solution is to page multi properties in node edit.
- Viewing a page with many multi properties requires that the properties be paged. Looking up all those nodes gets slow quickly, even with a fast database. Solution is to use a module such as CCK Pager.
- Viewing the content list is not possible, as the SQL query relies on a join to the user table. This query kills the database. Solution is to make a small hack to core to stop this join; looking users up with separate queries would be a better solution.
- Search is impossible from many angles. Queries for indexing are very slow. Database not designed to handle such large amounts of data. Solution is to let Solr handle search.
- Solr search design is generally good with data held in apachesolr_search_node. However, Solr indexing can be a drain if you exclude node types from search. The preparatory query to return the nodes will inner join to node and lead to a very slow query. It returns after a while (70s for me) so you can live with it. Definitely not something you want to be doing regularly on production server as CPU goes to 100%. Solution is (i) to replicate node type data in apachesolr_search_node or (ii) get the excluded nodes anyway and ignore them. First option is best.
- Taxonomy pages fail when there are many nodes in a category. SQL query is very slow. Solution is to let Solr show Taxonomy pages.
- Strangely, the Search module was still running even when Solr was handling search. Core had to be hacked to turn it off. There must be a better way but I couldn’t see it. Make sure the “search” tables aren’t populated when using Solr.
- Views queries can be slow with no extra indexes on content tables.
- Views queries use left joins which can be slow. Inner joins exclude a lot of rows you don’t need. Solution is to rewrite the Views SQL with a hook.
- Displaying nodes or teasers in Views can be very RAM intensive if the displayed nodes are very heavy with CCK fields. Solution is to use fields.
- The Voting API module stores data in an inefficient way leading to some hairy indexes on the table and some equally hairy queries in Views when trying to display ratings. Be careful.
- CCK node referrers datatype has a mandatory sort order specified. This kills query performance for large result sets.
- XML Sitemap has some queries which return massive data sets which need to be dumped to disk. I know this module is in the process of being reworked.
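As a sketch of the first Solr option above, replicating the node type into apachesolr_search_node lets the indexing query filter excluded types without the slow inner join to node. The added type column and the excluded type names are hypothetical; the core column names (nid, status, changed) follow the Drupal 6 apachesolr module:

```sql
-- One-off: add a type column to the Solr queue table and backfill it.
ALTER TABLE apachesolr_search_node
  ADD COLUMN type VARCHAR(32) NOT NULL DEFAULT '';

UPDATE apachesolr_search_node asn
  JOIN node n ON n.nid = asn.nid
  SET asn.type = n.type;

ALTER TABLE apachesolr_search_node ADD INDEX idx_type_changed (type, changed);

-- The indexing batch can now skip excluded types without joining to node.
SELECT nid
FROM apachesolr_search_node
WHERE status = 1
  AND type NOT IN ('redirect', 'disambiguation')
ORDER BY changed ASC
LIMIT 8000;
```

The type column must also be kept in sync on node insert/update, which is the cost of the denormalization.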
Solr: Some stats
Uriverse uses Solr as its search engine, and it is a component which has performed remarkably well. The filtering capabilities of Solr are excellent and the speed at which results come back is very impressive, even for large corpora. Since it runs as a service over HTTP it is possible to deploy it to a second server to reduce load on the web server/DB box.
It takes a while to index 10M articles (categories, redirects, disambiguations, photos and pages were excluded from the 13M), and it is a task not suited to the standard Drupal cron. A custom script was written to build the index so that it could hammer away almost constantly without upsetting the other tasks cron performs. Initially the script was written as a long-running process which would potentially run for months. However, a memory leak in Drupal meant that this was not possible: after 10 000 nodes RAM usage became too much for PHP and the script died. The solution was to limit the number of nodes processed each time. The script now processes around 8 000 nodes every 30 minutes. It therefore takes around a month to build the index.
On the web there are quite a few articles regarding memory usage with Solr. The JVM needs to be given enough room when started. These reports had me concerned because I am running 32 bit machines with only 3G at my disposal. Would an index of 10M articles run in a JVM limited to around 2G? What size would the index be on disk? These are the numbers for the 7883304 articles currently in the index:
| | Total | Per article |
| --- | --- | --- |
| Disk | 47.5 GB | 6.0 KB |
| RAM | 1.5 GB | 198 B |
Obviously these numbers are dependent on the average size of the title, body, fields and taxonomy. RAM usage is also affected by the number of facets, sorting and server load. I have therefore been very conservative in what I have indexed and have turned off sorting in the interface. It looks like the current RAM allocation will be sufficient.
Most Drupal performance best practices (opcode cache, aggregation, compression, Boost, Expires, database indexes, views and block caches) have been followed to get the most out of the site. A CDN has not been deployed because Uriverse doesn’t serve many images from its own server. The images that are served have an expires header and all other images come from Wikipedia and Flickr which probably have their own CDN solutions in place. Further optimizations would include going with MPM Worker + fcgid or Nginx. This will be required if traffic picks up and MaxClients is reached.
Two problematic areas remain. The first is the amount of RAM available for the database indexes. It would be nice to be able to increase that one day. Ordinary node page build times are respectable all things considered, so this is not such a big issue. The second problematic area is some of the queries in Views. A bit more research is required here, but it is likely that some Views will have to be dumped if they are hitting the disk with temp tables. Sometimes it's easiest to forgo some functionality to maintain the health of the server.
All up the project has taken longer than expected – more than a few months. Most of the time was spent wrangling the data rather than fighting Drupal, although there were quite a few issues to work through, this being my first serious Drupal project. Had I known of the pain I would suffer from having to massage and prepare the data, as well as the patience required to babysit the import process over days and weeks, then I probably wouldn’t have commenced the project. That said, I am pleased with the outcome now that it is all in. I am able to leverage the power of many contributed modules to bring the data to life. There is a great sense of satisfaction in seeing Solr return results from 10M articles in 90 languages, as well as the pretty theme, Google Maps, Thickbox, Simile timelines and FiveStar ratings. I am humbled by the efforts of all Drupal contributors over the years. What I am left with now is a good platform which will form the basis of future data aggregation efforts.

A sampling of the DBpedia ontology classes imported:
- Populated Place
- Body Of Water
- Historic Place
- Protected Area
- Lunar Crater
- World Heritage Site
- Site Of Special Scientific Interest
- Wine Region
- Ski Area
- Military Person
- Office Holder
- College Coach
- Playboy Playmate
- British Royalty
- Fictional Character
- Football Manager
- Ethnic Group
- Mean Of Transportation
- Olympic Result
- Chemical Compound
- Supreme Court Of The United States Case
- Music Genre