<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Cruncht</title>
	<atom:link href="http://cruncht.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://cruncht.com</link>
	<description>Semantic web development and publishing</description>
	<lastBuildDate>Tue, 16 Feb 2010 11:20:29 +0000</lastBuildDate>
	
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<xhtml:meta xmlns:xhtml="http://www.w3.org/1999/xhtml" name="robots" content="noindex" />
		<item>
		<title>Uriverse publishes DBpedia in Drupal: A case study</title>
		<link>http://cruncht.com/361/uriverse-dbpedia-drupal-case-study</link>
		<comments>http://cruncht.com/361/uriverse-dbpedia-drupal-case-study#comments</comments>
		<pubDate>Thu, 11 Feb 2010 12:05:18 +0000</pubDate>
		<dc:creator>Murray Woodman</dc:creator>
				<category><![CDATA[Drupal Planet]]></category>
		<category><![CDATA[Tech]]></category>

		<guid isPermaLink="false">http://cruncht.com/?p=361</guid>
		<description><![CDATA[Uriverse, a Drupal based website, was released in January 2010. Much of the data in Uriverse is based upon a data import from DBpedia, a semantic version of Wikipedia. Uriverse contains over 13M nodes and contains 90 languages covering around 3M primary subjects. This article is a case study of how the import was done [...]]]></description>
			<content:encoded><![CDATA[<a href="http://cruncht.com/361/uriverse-dbpedia-drupal-case-study" title="Uriverse publishes DBpedia in Drupal: A case study"><img src="http://cruncht.com/wp-content/uploads/2010/02/kevin-bacon-150x150.png" alt="" class="feed-image" title="Uriverse screenshot with Kevin Bacon" /></a><p><a href="http://uriverse.com/">Uriverse</a>, a <a href="http://drupal.org/">Drupal</a>-based website, was released in January 2010. Much of the data in Uriverse is based upon a data import from <a href="http://dbpedia.org/">DBpedia</a>, a semantic version of <a href="http://wikipedia.org">Wikipedia</a>. Uriverse contains over 13M nodes in 90 languages, covering around 3M primary subjects. This article is a case study of how the import was done and the challenges faced.</p>
<p><span id="more-361"></span></p>
<p>Drupal proves itself to be a flexible system which can handle large amounts of data so long as some bottlenecks are worked around and the hardware, particularly RAM, is sufficient to handle database indexes. The large data set was tedious to load but in the end the value added by Drupal made it worth it.</p>
<h2 id="motivation">Motivation</h2>
<blockquote><p><a href="http://dbpedia.org/About">DBpedia</a> is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data. We hope this will make it easier for the amazing amount of information in Wikipedia to be used in new and interesting ways, and that it might inspire new mechanisms for navigating, linking and improving the encyclopaedia itself.</p></blockquote>
<p>Over the years I have had an <a href="http://topicmap.com/">interest</a> in the semantic web and <a href="http://linkeddata.org/">linked data</a> in particular. The DBpedia project had always impressed me because it was an effort which allowed Wikipedia to be used as a hub of subjects for use in the semantic web. Wikipedia represents most of the commonly discussed subjects of our times, created and edited from the ground up by real people. It therefore forms a practical basis from which to build out relationships to other data sets. If you are looking for subjects to represent things you want to talk about, then DBpedia is a good place to start. There is an increasing momentum around it as the <a href="http://linkeddata.org/">linked data</a> meme starts to spread.</p>
<div id="attachment_374" class="wp-caption alignright" style="width: 309px"><a title="Linked datasets." href="http://www4.wiwiss.fu-berlin.de/bizer/pub/lod-datasets_2009-03-05.html"><img class="size-medium wp-image-374 " title="Linked Datasets" src="http://cruncht.com/wp-content/uploads/2010/02/lod-datasets_2009-03-05-299x228.png" alt="" width="299" height="228" /></a><p class="wp-caption-text">The Linked Open Data Cloud</p></div>
<p>Not only has the DBpedia project formalized these subjects, it has extracted a large amount of information which was codified in the somewhat unstructured wikitext syntax used by Wikipedia. The knowledge within Wikipedia has been made explicit in a way that can be used (browsed, queried and linked) easily by other systems. The DBpedia data set therefore provides a convenient store for import into a CMS such as Drupal.</p>
<p>Choosing to import DBpedia into a content management system is not necessarily a natural thing to do. The RDF data model is very simple and flexible and allows for all kinds of data structures which may not fit well within a system such as Drupal. It may seem like I was attempting to get the worms back into the can after it had been opened. At times it did <img src='http://cruncht.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  However, there was a good deal of regularity and structure within DBpedia. The provision of an ontology and a &#8220;strict&#8221; infobox mapping in version 3.4 made the importation process possible. Whilst not everything could be imported, most of DBpedia made it in. Concessions had to be made along the way, and the places where Drupal proved to be too inflexible are indicative of areas which could be improved in Drupal. More on that later.</p>
<p>I chose Drupal as the target platform because I had been impressed with the flexibility I had seen from the system. Unlike other popular blogging platforms/CMSs, Drupal has a decent way of specifying a schema through the <a href="http://drupal.org/project/cck">Content Construction Kit</a> (CCK). I believed that it was possible to mimic the basic <a href="http://en.wikipedia.org/wiki/Resource_Description_Framework">subject-predicate-object structure of RDF</a>. Drupal supports different content types with custom properties (strings, dates, integers, floats) and relationships between objects (node references, node referrer). It also has a relatively strong category system baked in which can be used for filtering. Drupal also offers a lot of other attractive features apart from the data modeling: users, permissions, themes, maps, ratings, timelines, data views, friendly URLs, SEO, etc. Utilizing these features was the carrot to get the data into the system.</p>
<h2 id="dbpedia">The DBpedia data set</h2>
<p><a href="http://downloads.dbpedia.org/3.4/">Version 3.4</a> of DBpedia is based upon a September 2009 dump from Wikipedia. Here&#8217;s a quick list of some of the <a href="http://wiki.dbpedia.org/Datasets">data set</a>&#8217;s statistics:</p>
<ul>
<li>2.9 million things</li>
<li>282,000 persons</li>
<li>339,000 places</li>
<li>88,000 music albums</li>
<li>44,000 films</li>
<li>15,000 video games</li>
<li>119,000 organizations</li>
<li>130,000 species</li>
<li>4,400 diseases</li>
<li>91 different languages</li>
<li>807,000 links to images</li>
<li>3,840,000 links to external web pages</li>
<li>415,000 Wikipedia categories</li>
</ul>
<p>The data set is provided in two main formats: <a href="http://www.w3.org/2001/sw/RDFCore/ntriples/">N-Triples</a> and CSV. The N-Triples format is suitable for loading into RDF stores with RDF software such as <a href="http://arc.semsol.org/">ARC</a>. The CSV format was handy for quickly loading data into a relational database such as MySQL. In a data set as big as DBpedia it is essential that you do things the quickest way, or you will spend weeks importing the data. Here are some comparisons of import speed. NB: These are from memory and are very rough estimates.</p>
<table style="height: 110px;" width="531">
<tbody>
<tr>
<th style="text-align: left;">Method</th>
<th style="text-align: left;">Throughput (triples/sec)</th>
</tr>
<tr>
<td>N-Triples into ARC</td>
<td>100</td>
</tr>
<tr>
<td>MySQL inserts from a select on another table</td>
<td>1 000</td>
</tr>
<tr>
<td>CSV, LOAD DATA INFILE (same disk)</td>
<td>10 000</td>
</tr>
<tr>
<td>CSV, LOAD DATA INFILE (different disk)</td>
<td>100 000</td>
</tr>
</tbody>
</table>
<p>Obviously if you are importing 100s of millions of triples these comparisons are important. (I can&#8217;t quite believe the last result but that is what I remember it to be.) Put your dumps and DB on different disks when doing imports! Mind you, adding the indexes after the data has been imported is still a very time-consuming process, requiring data to be copied to disk, so the super-fast results are a bit misleading.</p>
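<p>To put those rates in perspective, here is a minimal Python sketch that converts each rough throughput figure into wall-clock time. The size of 100 million triples is an illustrative assumption rather than a measured count, and the rates are the from-memory estimates in the table above.</p>

```python
# Rough throughput figures (triples/sec) taken from the table above.
RATES = {
    "N-Triples into ARC": 100,
    "MySQL INSERT ... SELECT": 1_000,
    "LOAD DATA INFILE (same disk)": 10_000,
    "LOAD DATA INFILE (different disk)": 100_000,
}

# Assumed data set size, for illustration only.
TRIPLES = 100_000_000


def import_hours(triples, rate_per_sec):
    """Return the hours needed to load `triples` rows at a steady rate."""
    return triples / rate_per_sec / 3600


if __name__ == "__main__":
    for method, rate in RATES.items():
        hours = import_hours(TRIPLES, rate)
        print(f"{method}: {hours:.1f} hours ({hours / 24:.1f} days)")
```

<p>Under these assumptions the slowest method needs more than eleven days for the raw load, while the fastest finishes in under twenty minutes, before any indexes are built.</p>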
<h2 id="mapping">Mapping to Drupal structures</h2>
<p>The DBpedia data set can be divided into two parts: infobox and non-infobox. Infobox data includes all of the varied properties which are applied to each class as defined in the ontology. As such, it presents a substantial data modeling exercise to fit it into Drupal. Much of my time was spent analyzing the class hierarchy and properties to work out the sweet spot as to how content types and properties would be allocated. The non-infobox data is much more predictable and easier to import in standard ways.</p>
<table>
<tbody>
<tr>
<th style="text-align: left;">DBpedia</th>
<th style="text-align: left;">Drupal</th>
<th style="text-align: left;">Comment</th>
</tr>
<tr>
<td>Resources</td>
<td>Nodes</td>
<td>Natural mapping. Wikipedia article maps to a DBpedia subject maps to a Drupal node.</td>
</tr>
<tr>
<td>Classes</td>
<td>Taxonomy</td>
<td><a href="#class-hierarchy">Class hierarchy</a> in ontology maps cleanly to taxonomy.</td>
</tr>
<tr>
<td>Classes</td>
<td>Content Types</td>
<td>Each DBpedia class belonged to a base super class which forms the Content type. eg. actors and politicians are both Persons. Person is the base class and therefore becomes the natural candidate for the Content Type in Drupal.</td>
</tr>
<tr>
<td>Categories</td>
<td>Nodes</td>
<td>Wikipedia categories are messy (irregular, loops) and better suited to being a Node.</td>
</tr>
<tr>
<td>Language Labels</td>
<td>Translations</td>
<td>A base English node with translations handles the subject + different language labels for DBpedia well, if not perfectly.</td>
</tr>
<tr>
<td>URI</td>
<td>URL Alias</td>
<td>The URL in Wikipedia maps to DBpedia maps to Drupal URL Alias.</td>
</tr>
<tr>
<td>Article Labels</td>
<td>Node title</td>
<td>In various languages.</td>
</tr>
<tr>
<td>Long Abstract</td>
<td>Node content</td>
<td>In various languages.</td>
</tr>
<tr>
<td>Short Abstract</td>
<td>Node teaser</td>
<td>In various languages.</td>
</tr>
<tr>
<td>Wikipage</td>
<td>CCK Link</td>
<td>Applied to all translations.</td>
</tr>
<tr>
<td>Homepage</td>
<td>CCK Link</td>
<td>Applied to all translations.</td>
</tr>
<tr>
<td>Image</td>
<td>CCK Link</td>
<td>Applied to English version only.</td>
</tr>
<tr>
<td>Instance Type</td>
<td></td>
<td>Applied to English version only.</td>
</tr>
<tr>
<td>Redirect</td>
<td>Node with CCK Node Ref</td>
<td>Different names/spellings redirect to English version.</td>
</tr>
<tr>
<td>Disambiguation</td>
<td>Node with CCK Node Ref</td>
<td>Points to various English versions.</td>
</tr>
<tr>
<td>Page Links</td>
<td>CCK Node ref</td>
<td>Untyped links between Nodes</td>
</tr>
<tr>
<td>External Links</td>
<td>Ignored</td>
<td>Too many to import.</td>
</tr>
<tr>
<td>Geo</td>
<td>CCK Location</td>
<td>Applied to English version only.</td>
</tr>
<tr>
<td>Person data &#8211; Names</td>
<td>CCK Strings</td>
<td>Applied to people and organizations.</td>
</tr>
<tr>
<td>Infobox &#8211; strict</td>
<td>Various CCK fields</td>
<td>DBpedia ontology specifies which classes have which properties.</td>
</tr>
</tbody>
</table>
<h2 id="cck">Content Types, CCK and Importation Concessions</h2>
<p>There were a number of areas where there was not a clear mapping between DBpedia and the features offered by CCK. The notes below refer to Drupal 6, and do not consider the Fields in Core initiative in Drupal 7.</p>
<h3 id="propert-explosion">Property Explosion</h3>
<p>Drupal handles object properties through a system known as the <a href="http://drupal.org/project/cck">Content Construction Kit</a> (CCK). CCK offers a handy interface for defining a schema for various content types. Each object (node) is an instance of a single class (content type). Each node therefore has the properties of its content type.</p>
<p>Those of you familiar with the inner workings of CCK in Drupal 6 will understand the idiosyncrasies of the way CCK works behind the scenes. On the backend things work as expected up to a certain level and then they get a bit complicated. Properties for a content type are grouped together in a single database table, as you would expect. There are two exceptions to this rule. Firstly, if the property can have multiple values, it is stored in a separate table. This too is natural enough. Secondly, if two content types share a property then it is split out into its own table. This is a bit strange and can catch you unaware if you aren&#8217;t expecting it. However, it is sensible enough as it allows easy queries across content types on a single property.</p>
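<p>The storage rules just described can be sketched as a small Python function. This is only an illustration: the table naming is an assumption based on my understanding of how CCK names tables in Drupal 6, and the field and content type names are hypothetical.</p>

```python
from collections import namedtuple

# A CCK field: `multiple` means it can hold several values per node,
# `used_by` is the set of content types the field is attached to.
Field = namedtuple("Field", ["name", "multiple", "used_by"])


def storage_table(field, content_type):
    """Return the table holding `field` for a node of `content_type`.

    Table naming (content_type_* / content_field_*) is an assumption
    based on how CCK names tables in Drupal 6.
    """
    if field.multiple or len(field.used_by) > 1:
        # Multi-value or shared fields are split into their own table.
        return "content_" + field.name
    # Single-value, unshared fields live in the per-type table.
    return "content_type_" + content_type


def tables_for_node(fields, content_type):
    """Return the sorted set of tables read to load one node's fields."""
    return sorted({storage_table(f, content_type) for f in fields})


if __name__ == "__main__":
    # Hypothetical fields on a hypothetical "person" content type.
    fields = [
        Field("field_birth_date", False, {"person"}),
        Field("field_death_date", False, {"person"}),
        Field("field_occupation", True, {"person"}),
        Field("field_homepage", False, {"person", "organisation"}),
    ]
    for table in tables_for_node(fields, "person"):
        print(table)
```

<p>The two single-value date fields share one per-type table, while the multi-value and shared fields each add a table of their own, and therefore a query of their own.</p>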
<p>Things become tricky when you have (i) lots of &#8220;multi&#8221; properties or (ii) lots of &#8220;shared&#8221; properties. Drupal needs to issue a new query on a different table to get the data back for each such property. This is alright for most sites but has the potential to be a worry in the case of DBpedia, where there are a massive number of different relationships and in some cases relatively few instances of those relationships, i.e. the data is not very dense. We are potentially talking about 100s of properties which would need to be retrieved.</p>
<p>Unfortunately, in these cases it makes sense to pick only the properties where you get the most bang for your buck. Which &#8220;shared&#8221; properties are shared amongst the most types? Which &#8220;multi&#8221; properties have the most instances? Which &#8220;single&#8221; properties have the most density down the rows? Along the way we had to be pragmatic and pick and choose the properties we would support. At the end of the day this wasn&#8217;t a limitation of CCK; rather, sensible decisions were made to stop the database exploding out into thousands of tables with very little data in them.</p>
<p>I don&#8217;t see an easy way to solve the table &#8220;explosion&#8221; problem short of moving to an RDF datastore such as that implemented by <a href="http://arc.semsol.org/">ARC</a>. For data sets which have demands similar to DBpedia it makes sense to have something such as an RDF store in the backend. This conclusion is incredibly ironic given the lengths I have gone to to get the data into Drupal. All was not lost, however, as the majority of data made it in and is usable within Drupal using standard techniques.</p>
<p>Interestingly this very issue seems to have plagued the backend design process for Drupal 7. According to the <a href="http://groups.drupal.org/files/DADS Final Report.pdf">DADS Final Report</a> (4th Feb 2009), &#8220;CCK 2 for Drupal will drop its variable schema and use the multi-value schema style for all fields.&#8221; Hmmm. I haven&#8217;t checked out Drupal 7 in this much detail but if this is true then the table explosion problem is going to be worse in Drupal 7. I&#8217;m not abreast of current developments here so I can&#8217;t comment further.</p>
<h3 id="no-subclasses">No sub classing properties</h3>
<p>Drupal doesn&#8217;t allow for sub classing content types. Each content type exists as its own base class. This means that we have to define base content types with all of the properties of the contained subclasses. The ontology in DBpedia can be up to four levels deep: <a href="http://uriverse.com/dbp/class/EurovisionSongContestEntry">Eurovision Song Contest Entry</a> is the deepest, preceded by <a href="http://uriverse.com/dbp/class/Song">Song</a>, <a href="http://uriverse.com/dbp/class/MusicalWork">Musical Work</a>, <a href="http://uriverse.com/dbp/class/Work">Work</a> and finally Thing (a catch all at the root). This of course leads to base content types with many properties which will be null for the instance in question. The database tables would become very wide and have relatively low density of information.</p>
<p>The <a href="http://drupal.org/handbook/modules/taxonomy">Taxonomy system</a> does allow us to partially work around the sub classing problem. A class hierarchy maps nicely to a hierarchy of Terms in a Taxonomy. Further, multiple Terms can be applied to a Node making it possible to specify different classes for a node. It doesn&#8217;t cover the difficulty of property storage however.</p>
<h3 id="no-multiple-inheritance">No Multiple inheritance</h3>
<p>When it comes to data modeling there are different ways to handle typing. The most simplistic and limited way is to allow instances to have a single class. This is the way Drupal currently works with content types and CCK. Each node belongs to a single content type and has the properties defined by CCK for that node. A more flexible way of modeling data allows for multiple inheritance where an instance can have more than one class.</p>
<p>Where an instance did straddle two base classes it was impossible to carry data for both types. This <a href="http://uriverse.com/search/apachesolr_search?filters=tid%3A5141%20type%3Adwrk&amp;solrsort=created%20desc&amp;retain-filters=1">query</a> shows all &#8220;people&#8221; who are &#8220;works&#8221; as well. I think these cases can be put down to DBpedia being a bit too promiscuous in the infoboxes it processes for each article. This isn&#8217;t a strong argument for multiple inheritance because the data is probably erroneous, however, it does demonstrate an area where modeling could be more flexible.</p>
<h3 id="compound-types">Compound Types</h3>
<p>In some cases a compound type was required. For example, image data had three components: thumbnail link, depiction link and copyright info link. All three of these should have been considered one unit, however this is not possible through the standard CCK interface which handles atomistic primitives. Because these image properties applied to multiple content types, the end outcome was that they were represented by different database tables. It was very frustrating to know three queries were being issued when one (or none) would suffice. It is possible to <a href="http://www.poplarware.com/cckfieldmodule.html">define your own custom datatypes</a> through a module but this is a fairly high barrier to jump.</p>
<h3 id="interfaces">Interfaces: A possible solution</h3>
<p><a href="http://www.freebase.com/">Freebase</a> is a collaboratively edited system of open data which is similar to DBpedia in many respects. The main difference is that Freebase allows users to create and edit content according to certain schemas. One of the very impressive aspects of the Freebase system is its ability to support multiple types, or <a href="http://www.freebase.com/view/en/data_modeling_guide">co-types</a>, for an object. From the Freebase data modeling guide we have:</p>
<blockquote><p>A novel aspect of the Metaweb system is that instances may have multiple types. A single topic such as &#8220;<a href="http://www.freebase.com/view/en/kevin_bacon">Kevin Bacon</a>&#8221; may have multiple types such as a Person, Film Actor, TV Actor and Musical Artist. and others. Since no single type could encapsulate such diversity, multiple types are required to hold all properties to fully describe Kevin Bacon and his life.</p></blockquote>
<p>This approach has the advantage of grouping properties in a table allowing for fast retrieval and querying, as well as allowing for flexibility and sensible design. I believe that Drupal could benefit from the Freebase approach. The current content type + CCK model could remain in place AND be augmented by an interface system which allowed for grouping of properties for various types. To take the Kevin Bacon example, &#8220;Kevin Bacon&#8221; would be a Person content type with the base properties of Birthday and Deathday. &#8220;Kevin Bacon&#8221; would then have the FilmActor, TVActor and MusicalArtist interfaces which could be represented by separate tables on the backend. I believe that this offers good flexibility for those desiring a powerful system whilst maintaining simplicity for those who just need base types. It also solves a lot of the hand wringing which goes with the way some CCK tables are formed.</p>
<h2 id="import">Import into Drupal</h2>
<p>As a newcomer to Drupal, by far the most disappointing aspect of the system was the lack of a clear and easy to understand API. I assumed that there would be a nice object abstraction which I could use to populate the system. I gradually came to understand that most development was done by examining the excretions of print_r() to determine what data was available at that particular moment. Where was the interface to content types, CCK and nodes? How could I create these programmatically? There were times when I paused and considered a framework which was cleaner and more lightweight. The rich functionality was the thing that kept me though.</p>
<p>The large size of the data set pretty much dictated that it needed to be imported in the most efficient way possible. If I accepted a 1 second overhead for a save/update I would be waiting four months at least for the data to load. So, notwithstanding the state of the API in Drupal 6, a straight database import was the order of the day. After a bit of reverse engineering mucking around I had a few PHP/SQL scripts which could insert content pretty quickly, with insertion rates of around 1000 rows a second.</p>
<p>The import process followed these simplified steps for the various pieces of DBpedia.</p>
<ul>
<li>Import DBpedia data from CSV format into MySQL.</li>
<li>Create a staging_node table with columns: title, language, teaser, content, url_alias, nid.</li>
<li>DBpedia data populated into staging_node and cleaned.</li>
<li>staging_node data copied into Drupal database.</li>
</ul>
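<p>The final staging-to-Drupal copy step can be sketched as follows. This is a hedged illustration rather than the scripts actually used: SQLite stands in for MySQL so the example is self-contained, the staging columns follow the list above, and the target tables are heavily simplified, hypothetical versions of the Drupal 6 tables, which have many more columns.</p>

```python
import sqlite3

# SQLite stands in for MySQL here; the schemas are heavily simplified
# and hypothetical (the real Drupal 6 tables have many more columns).
SCHEMA = """
CREATE TABLE staging_node (
    nid INTEGER PRIMARY KEY, title TEXT, language TEXT,
    teaser TEXT, content TEXT, url_alias TEXT
);
CREATE TABLE node (nid INTEGER PRIMARY KEY, title TEXT, language TEXT);
CREATE TABLE node_revisions (nid INTEGER, body TEXT, teaser TEXT);
CREATE TABLE url_alias (src TEXT, dst TEXT);
"""


def copy_staging(conn):
    """Copy cleaned staging rows into the target tables with set-based SQL."""
    conn.execute(
        "INSERT INTO node (nid, title, language) "
        "SELECT nid, title, language FROM staging_node"
    )
    conn.execute(
        "INSERT INTO node_revisions (nid, body, teaser) "
        "SELECT nid, content, teaser FROM staging_node"
    )
    # Drupal maps the internal path node/NID to the friendly alias.
    conn.execute(
        "INSERT INTO url_alias (src, dst) "
        "SELECT 'node/' || nid, url_alias FROM staging_node"
    )
    conn.commit()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO staging_node VALUES (?, ?, ?, ?, ?, ?)",
        (1, "Kevin Bacon", "en", "Short abstract", "Long abstract",
         "Kevin_Bacon"),
    )
    copy_staging(conn)
    return conn.execute("SELECT src, dst FROM url_alias").fetchall()


if __name__ == "__main__":
    print(demo())
```

<p>Keeping each copy as a single INSERT INTO ... SELECT statement lets the database do the work in bulk instead of paying a per-row round trip from PHP.</p>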
<p>The process was therefore semi-automated with a number of scripts. It still took a few weeks to run through from start to finish. Importing 75M page links was the final time-consuming process. I could only get throughput of fewer than 200 rows a second as a url_alias to id lookup was required. This part of the process took around 7 days. Not something I want to repeat.</p>
<p>During the import I realized a few things about MySQL techniques which may come in handy for other people.</p>
<ul>
<li>Loading data into MySQL direct from file, with no index is very fast.</li>
<li>It&#8217;s even faster if the source file is on a different disk to the DB.</li>
<li>The overhead from very heavy/large select queries can be minimized by using mysql_unbuffered_query. The connection can be fragile though so be prepared with an internal counter so you know where the process got up to before it died. You can&#8217;t write to the table you are reading from and it is locked for other operations.</li>
<li>Sometimes pure SQL is a good way to go: INSERT INTO ... SELECT.</li>
<li>Sometimes joining is no good and running a SELECT for each row is best (if caching and keys are working).</li>
<li>SHOW FULL PROCESSLIST is your friend.</li>
<li>Sorting is your enemy.</li>
<li>Dumping temp tables to disk must be avoided. Try tweaking your conf. You want to hear the cooling fans humming rather than the disk ticking away.</li>
</ul>
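<p>The &#8220;internal counter&#8221; tip from the list above can be sketched like this. It is an illustration only: SQLite stands in for MySQL, the lazily iterated cursor plays the role of an unbuffered query, and the <code>fail_at</code> parameter is a hypothetical device for simulating a dropped connection. In a real script the returned counter would be written somewhere durable, such as a file, after each batch.</p>

```python
import sqlite3


def process_from(conn, last_nid, handle, fail_at=None):
    """Stream rows with nid greater than `last_nid`, in nid order.

    Returns the nid of the last row handled, so a caller can persist it
    and resume after a crash. `fail_at` simulates a dropped connection.
    """
    cursor = conn.execute(
        "SELECT nid, title FROM node WHERE nid > ? ORDER BY nid", (last_nid,)
    )
    for nid, title in cursor:  # rows are fetched lazily, not all at once
        if nid == fail_at:
            break  # pretend the connection died here
        handle(nid, title)
        last_nid = nid  # the internal counter: where we got up to
    return last_nid


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE node (nid INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO node VALUES (?, ?)",
        [(n, "title %d" % n) for n in range(1, 7)],
    )
    seen = []
    handle = lambda nid, title: seen.append(nid)
    checkpoint = process_from(conn, 0, handle, fail_at=4)  # dies at nid 4
    checkpoint = process_from(conn, checkpoint, handle)    # resume
    return checkpoint, seen


if __name__ == "__main__":
    print(demo())
```

<p>Ordering by the primary key keeps the resume point unambiguous and lets the database walk the existing index rather than sort the result.</p>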
<h2 id='features'>Drupal Features</h2>
<p>Characteristics of the rich DBpedia data set provided a good foundation for using the following Drupal features:</p>
<ul>
<li>Image thumbnails pointing to Wikipedia display in the contents expanding to full depiction via thickbox.</li>
<li>Geo coordinates for Nodes displayed in Google Map via Geo module.</li>
<li>Geo coordinates used with Flickr API to pull back Creative Commons images for places from Flickr.</li>
<li>FiveStar ratings used on People, Places, Organizations and Music Genres.</li>
<li>Views provided <a href='http://uriverse.com/top'>top rated lists</a> of the most popular things.</li>
<li>Simile Timeline widget used to display <a href='http://uriverse.com/timeline'>chronology of events</a>.</li>
<li>Solr used as <a href='http://uriverse.com/search/apachesolr_search'>search</a> engine and filters on language and class.</li>
<li>Solr recommends &#8220;More like this&#8221; based on Node content.</li>
<li>Filtered Views provide <a href='http://uriverse.com/lookup'>lookups</a> for titles, firstname, lastname and geo coordinates</li>
<li>Various node properties allow for <a href='http://uriverse.com/list'>sorted list Views</a> of <a href='http://uriverse.com/list/richest-people'>richest people</a> etc.</li>
</ul>
<h2 id="performance-shortcomings">Performance Shortcomings</h2>
<p>OK. So the data is in Drupal &#8211; were there any problems when it came to running the site? Yes, I ran into a few challenges along the way. Some were fixed, others worked around and others still remain a thorn in our side.</p>
<h3 id="indexes">Database Indexes</h3>
<p>The database is the most pressing concern when it comes to performance on a big site. If simple queries run slowly then the site will not function acceptably even for low traffic. The most important step is to ensure that indexes for key tables have been loaded into the key buffer. Let&#8217;s look at a couple of simple selects with and without a primed key buffer.</p>
<table style="height: 70px;" width="579">
<tbody>
<tr>
<th style="text-align: left;">Query</th>
<th style="text-align: left;">No key buffer</th>
<th style="text-align: left;">Key buffer</th>
</tr>
<tr>
<td>select sql_no_cache title from node where nid=1000000;</td>
<td>0.05s</td>
<td>0.00s</td>
</tr>
<tr>
<td>select sql_no_cache dst from url_alias where src='node/1000000';</td>
<td>0.02s</td>
<td>0.00s</td>
</tr>
</tbody>
</table>
<p>If these indexes aren&#8217;t in RAM then the most basic of lookups will take a long time, i.e. 0.07s for the two queries combined. If you are on a page with a view with 50 nodes to look up then just getting the title and path out will take 0.07s * 50 = 3.5 seconds. Note that this doesn&#8217;t include all the other processing Drupal must do. This is completely unacceptable and so it is mandatory to get these indexes into RAM. I recommend putting the following SQL into a file and running it every time MySQL is started up using the init-file variable in my.cnf.</p>
<p><code><br />
USE drupal6;<br />
LOAD INDEX INTO CACHE node;<br />
LOAD INDEX INTO CACHE node_revisions;<br />
LOAD INDEX INTO CACHE url_alias;<br />
LOAD INDEX INTO CACHE term_data;<br />
LOAD INDEX INTO CACHE term_node;<br />
</code></p>
<p>On massive sites you probably won&#8217;t be able to get all of the node indexes into the key buffer, even if you have been generous in its allocation (up to 50% of RAM). In this case I resorted to running a query which seems to get the node nids into the buffer whilst leaving out all the other indexes which aren&#8217;t used as much. It takes a while to run but does the trick.</p>
<p><code><br />
select count(n.nid), count(n2.nid) from node n inner join node n2 on n.nid=n2.nid;<br />
</code></p>
<p>In the best case scenario we would all have RAM (money) to burn but unfortunately that&#8217;s generally not the case.</p>
<h3 id="drupal">Core and Contributed modules</h3>
<p>There are a few areas in Drupal which bogged down when handling huge data sets. As I developed the site I took note of problematic areas. Most of these areas are probably well known so we&#8217;ll just mention them briefly.</p>
<ul>
<li>In general a <strong>node load is very heavy</strong>. Lazy loading of CCK properties would make the system much faster if CCK didn&#8217;t have to be loaded. It would also mean the API could be used when speed is an issue. ie. when processing millions of nodes at once. During import and update, the solution is to work directly with the database. Just for a laugh I tried setting the status to 0 for a large node but the 404 still tried to load the whole node and then died.</li>
<li><strong>Editing a node with many multi properties</strong> is all but impossible. The edit page size is massive and RAM/CPU is hammered. Solution is not to edit! Real solution is to page multi properties in node edit.</li>
<li><strong>Viewing a page with many multi properties</strong> requires that the properties be paged. Looking up all those nodes gets slow quickly even with fast database. Solution is to use a module such as CCK Pager.</li>
<li><strong>Viewing list of content</strong> is not possible as SQL query relies on a join to user table. This query kills the database. Solution is to make a small hack to core to stop this join. If users were looked up with separate queries then this would be a better solution.</li>
<li><strong>Search is impossible</strong> from many angles. Queries for indexing are very slow. Database not designed to handle such large amounts of data. Solution is to let Solr handle search.</li>
<li>Solr search design is generally good with data held in apachesolr_search_node. However, <strong>Solr indexing can be a drain</strong> if you exclude node types from search. The preparatory query to return the nodes will inner join to node and lead to a very slow query. It returns after a while (70s for me) so you can live with it. Definitely not something you want to be doing regularly on production server as CPU goes to 100%. Solution is (i) to replicate node type data in apachesolr_search_node or (ii) get the excluded nodes anyway and ignore them. First option is best.</li>
<li>Taxonomy pages fail when there are <strong>many nodes in a category</strong>. SQL query is very slow. Solution is to let Solr show Taxonomy pages.</li>
<li>Strangely, the <strong>Search module still was running</strong> even when Solr was handling search. Core had to be hacked to turn it off. There must be a better way but I couldn&#8217;t see it. Make sure the &#8220;search&#8221; tables aren&#8217;t populated when using Solr.</li>
<li>Views queries can be slow with no <strong>extra indexes on content tables</strong>.</li>
<li><strong>Views queries use left joins</strong> which can be slow. Inner joins exclude a lot of rows you don&#8217;t need. Solution is to rewrite the Views SQL with a hook.</li>
<li><strong>Displaying nodes or teasers in Views</strong> can be very RAM intensive if the displayed nodes are very heavy with CCK fields. Solution is to use fields.</li>
<li>The <strong>Voting API module stores data in an inefficient way</strong> leading to some hairy indexes on the table and some equally hairy queries in Views when trying to display ratings. Be careful.</li>
<li><strong>CCK node referrers</strong> datatype has a mandatory sort order specified. This kills query performance for large result sets.</li>
<li><strong>XML Sitemap</strong> has some queries which return massive data sets which need to be dumped to disk. I know this module is in the process of being reworked.</li>
</ul>
<h2 id="solr">Solr: Some stats</h2>
<p>Uriverse uses <a href="http://drupal.org/project/apachesolr">Solr</a> as its search engine and it is a component which has performed remarkably well. The filtering capabilities of Solr are excellent and the speed at which results come back is very impressive, even for large corpuses. Since it runs as a service over HTTP it is possible to deploy it to a second server to reduce load on the web server/DB box.</p>
<p>It takes a while to index 10M articles (categories, redirects, disambiguations, photos and pages were excluded from the 13M) and it is a task not suited for the standard Drupal cron. A custom script was written to build the index so that it could hammer away almost constantly without upsetting the other tasks cron performs. Initially the script was written to be a long running process which would potentially run for months. However, a <a href="http://drupal.org/node/656370">memory leak in Drupal</a> meant that this was not possible. After 10 000 nodes RAM became to much for PHP and the script died. The solution was to limit the number of nodes processed each time. The script now processes around 8 000 nodes every 30 minutes. It therefore takes around a month to build the index.</p>
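<p>As a rough check on that estimate, the arithmetic can be sketched as follows. The node counts and interval come from the paragraph above; the assumption that every run completes on schedule with no downtime is mine, which is why the result is a lower bound on the month quoted.</p>

```python
# Figures from the text; an uninterrupted schedule is assumed.
total_nodes = 10000000      # articles to index
nodes_per_run = 8000        # nodes processed by each script run
minutes_between_runs = 30   # one run every half hour

# Ceiling division: a final partial batch still needs a full run.
runs_needed = -(-total_nodes // nodes_per_run)
total_days = runs_needed * minutes_between_runs / (60 * 24)

print("runs needed:", runs_needed)
print("days to build the index: %.1f" % total_days)
```

<p>That comes to about 26 days of continuous running, consistent with the figure of around a month once failed or skipped runs are allowed for.</p>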
<p>On the web there are quite a few articles regarding <a href="http://stackoverflow.com/questions/1546898/how-to-reduce-solr-memory-usage">memory usage with Solr</a>. The JVM needs to be given enough room when started. These reports had me concerned because I am running 32 bit machines with only 3G at my disposal. Would an index of 10M articles run in a JVM limited to around 2G? What size would the index be on disk? These are the numbers for the 7883304 articles currently in the index:</p>
<table style="height: 70px;" width="395">
<tbody>
<tr>
<th style="text-align: left;">Resource</th>
<th style="text-align: left;">Total</th>
<th style="text-align: left;">Per Article</th>
</tr>
<tr>
<td>Disk</td>
<td>47.5 GB</td>
<td>6.0 KB</td>
</tr>
<tr>
<td>RAM</td>
<td>1.5 GB</td>
<td>198 B</td>
</tr>
</tbody>
</table>
<p>Obviously these numbers are dependent on the average size of the title, body, fields and taxonomy. RAM is also affected by the number of facets, sorting and server load. I have therefore been very conservative in what I have indexed and have turned off sorting in the interface. It looks like the current RAM allocation will be sufficient.</p>
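<p>For a back-of-the-envelope projection to the full 10M articles, the per-article figures in the table can simply be scaled up. This is only a sketch: it assumes disk and RAM usage grow linearly with the number of articles and reads KB and GB as decimal units, neither of which is guaranteed for a real Solr index.</p>

```python
# Per-article figures from the table above; linear growth is an assumption.
disk_bytes_per_article = 6.0 * 1000   # 6.0 KB, read as decimal kilobytes
ram_bytes_per_article = 198
target_articles = 10000000

projected_disk_gb = target_articles * disk_bytes_per_article / 1e9
projected_ram_gb = target_articles * ram_bytes_per_article / 1e9

print("projected disk: %.1f GB" % projected_disk_gb)
print("projected RAM:  %.2f GB" % projected_ram_gb)
```

<p>Under those assumptions the full index would need roughly 60 GB of disk and just under 2 GB of RAM, which is close to the JVM limit discussed above and another reason to stay conservative with facets and sorting.</p>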
<h2 id="performance">Performance</h2>
<p>Most <a href="http://cruncht.com/75/drupal-performance-scalability">Drupal performance best practices</a> (opcode cache, aggregation, compression, Boost, Expires, database indexes, views and block caches) have been followed to get the most out of the site. A CDN has not been deployed because Uriverse doesn&#8217;t serve many images from its own server. The images that are served have an expires header and all other images come from Wikipedia and Flickr which probably have their own CDN solutions in place. Further optimizations would include going with MPM Worker + fcgid or Nginx. This will be required if traffic picks up and MaxClients is reached.</p>
<p>Two problematic areas remain. The first is the amount of RAM available for the database indexes. It would be nice to be able to increase that one day. Ordinary node page build times are respectable considering the constraint, so this is not such a big issue. The second problematic area is some of the queries in Views. A bit more research is required here but it is likely that some Views will have to be dumped if they are hitting the disk with temp tables. Sometimes it&#8217;s easiest to forgo some functionality to maintain the health of the server.</p>
<h2 id="conclusion">Conclusion</h2>
<p>All up the project has taken longer than expected &#8211; more than a few months. Most of the time was spent wrangling the data rather than fighting Drupal, although there were quite a few issues to work through, this being my first serious Drupal project. If I had known of the pain I would suffer from having to massage and prepare the data, as well as the patience required to babysit the import process over days and weeks, then I probably wouldn&#8217;t have commenced the project. That said, I am pleased with the outcome now that it is all in. I am able to leverage the power of many contributed modules to bring the data to life. There is a great sense of satisfaction seeing Solr return results from 10M articles in 90 languages, as well as the pretty theme, Google Maps, Thickbox, Simile timelines and Five Star ratings. I am humbled by the efforts of all Drupal contributors over the years. What I am left with now is a good platform which will form the basis of future data aggregation efforts.</p>
<h2 id="appendices">Appendices</h2>
<h3 id="class-hierarchy">Class hierarchy</h3>
<ul>
<li><a href="http://uriverse.com/dbp/class/Place">Place</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/PopulatedPlace">Populated Place</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/Country">Country</a></li>
<li><a href="http://uriverse.com/dbp/class/Area">Area</a></li>
<li><a href="http://uriverse.com/dbp/class/Municipality">Municipality</a></li>
<li><a href="http://uriverse.com/dbp/class/City">City</a></li>
<li><a href="http://uriverse.com/dbp/class/Continent">Continent</a></li>
</ul>
</li>
<li><a href="http://uriverse.com/dbp/class/BodyOfWater">Body Of Water</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/River">River</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/Canal">Canal</a></li>
</ul>
</li>
<li><a href="http://uriverse.com/dbp/class/Lake">Lake</a></li>
</ul>
</li>
<li><a href="http://uriverse.com/dbp/class/HistoricPlace">Historic Place</a></li>
<li><a href="http://uriverse.com/dbp/class/Mountain">Mountain</a></li>
<li><a href="http://uriverse.com/dbp/class/Building">Building</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/Airport">Airport</a></li>
<li><a href="http://uriverse.com/dbp/class/Station">Station</a></li>
<li><a href="http://uriverse.com/dbp/class/Skyscraper">Skyscraper</a></li>
<li><a href="http://uriverse.com/dbp/class/Bridge">Bridge</a></li>
<li><a href="http://uriverse.com/dbp/class/Stadium">Stadium</a></li>
<li><a href="http://uriverse.com/dbp/class/ShoppingMall">Shopping Mall</a></li>
<li><a href="http://uriverse.com/dbp/class/Lighthouse">Lighthouse</a></li>
<li><a href="http://uriverse.com/dbp/class/Hospital">Hospital</a></li>
<li><a href="http://uriverse.com/dbp/class/LaunchPad">Launch Pad</a></li>
<li><a href="http://uriverse.com/dbp/class/HistoricBuilding">Historic Building</a></li>
</ul>
</li>
<li><a href="http://uriverse.com/dbp/class/ProtectedArea">Protected Area</a></li>
<li><a href="http://uriverse.com/dbp/class/LunarCrater">Lunar Crater</a></li>
<li><a href="http://uriverse.com/dbp/class/World Heritage Site">World Heritage Site</a></li>
<li><a href="http://uriverse.com/dbp/class/Park">Park</a></li>
<li><a href="http://uriverse.com/dbp/class/Island">Island</a></li>
<li><a href="http://uriverse.com/dbp/class/SiteOfSpecialScientificInterest">Site Of Special Scientific Interest</a></li>
<li><a href="http://uriverse.com/dbp/class/WineRegion">Wine Region</a></li>
<li><a href="http://uriverse.com/dbp/class/SkiArea">Ski Area</a></li>
<li><a href="http://uriverse.com/dbp/class/Cave">Cave</a></li>
</ul>
</li>
<li><a href="http://uriverse.com/dbp/class/Infrastructure">Infrastructure</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/Road">Road</a></li>
</ul>
</li>
<li><a href="http://uriverse.com/dbp/class/Planet">Planet</a></li>
<li><a href="http://uriverse.com/dbp/class/Person">Person</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/MilitaryPerson">Military Person</a></li>
<li><a href="http://uriverse.com/dbp/class/Athlete">Athlete</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/FormulaOneRacer">Formula One Racer</a></li>
<li><a href="http://uriverse.com/dbp/class/IceHockeyPlayer">Ice Hockey Player</a></li>
<li><a href="http://uriverse.com/dbp/class/BaseballPlayer">Baseball Player</a></li>
<li><a href="http://uriverse.com/dbp/class/Wrestler">Wrestler</a></li>
<li><a href="http://uriverse.com/dbp/class/BasketballPlayer">Basketball Player</a></li>
<li><a href="http://uriverse.com/dbp/class/RugbyPlayer">Rugby Player</a></li>
<li><a href="http://uriverse.com/dbp/class/Cyclist">Cyclist</a></li>
<li><a href="http://uriverse.com/dbp/class/FigureSkater">Figure Skater</a></li>
<li><a href="http://uriverse.com/dbp/class/GaelicGamesPlayer">Gaelic Games Player</a></li>
<li><a href="http://uriverse.com/dbp/class/Boxer">Boxer</a></li>
<li><a href="http://uriverse.com/dbp/class/TennisPlayer">Tennis Player</a></li>
<li><a href="http://uriverse.com/dbp/class/NationalCollegiateAthleticAssociationAthlete">National Collegiate Athletic Association Athlete</a></li>
<li><a href="http://uriverse.com/dbp/class/NascarDriver">Nascar Driver</a></li>
<li><a href="http://uriverse.com/dbp/class/PokerPlayer">Poker Player</a></li>
<li><a href="http://uriverse.com/dbp/class/BadmintonPlayer">Badminton Player</a></li>
<li><a href="http://uriverse.com/dbp/class/Cricketer">Cricketer</a></li>
<li><a href="http://uriverse.com/dbp/class/FootballPlayer">Football Player</a></li>
<li><a href="http://uriverse.com/dbp/class/SoccerPlayer">Soccer Player</a></li>
</ul>
</li>
<li><a href="http://uriverse.com/dbp/class/OfficeHolder">Office Holder</a></li>
<li><a href="http://uriverse.com/dbp/class/Scientist">Scientist</a></li>
<li><a href="http://uriverse.com/dbp/class/CollegeCoach">College Coach</a></li>
<li><a href="http://uriverse.com/dbp/class/Monarch">Monarch</a></li>
<li><a href="http://uriverse.com/dbp/class/Artist">Artist</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/MusicalArtist">Musical Artist</a></li>
<li><a href="http://uriverse.com/dbp/class/Actor">Actor</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/AdultActor">Adult Actor</a></li>
<li><a href="http://uriverse.com/dbp/class/VoiceActor">Voice Actor</a></li>
</ul>
</li>
<li><a href="http://uriverse.com/dbp/class/Writer">Writer</a></li>
<li><a href="http://uriverse.com/dbp/class/ComicsCreator">Comics Creator</a></li>
<li><a href="http://uriverse.com/dbp/class/Comedian">Comedian</a></li>
</ul>
</li>
<li><a href="http://uriverse.com/dbp/class/PlayboyPlaymate">Playboy Playmate</a></li>
<li><a href="http://uriverse.com/dbp/class/Philosopher">Philosopher</a></li>
<li><a href="http://uriverse.com/dbp/class/Astronaut">Astronaut</a></li>
<li><a href="http://uriverse.com/dbp/class/Politician">Politician</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/Senator">Senator</a></li>
<li><a href="http://uriverse.com/dbp/class/President">President</a></li>
<li><a href="http://uriverse.com/dbp/class/Governor">Governor</a></li>
<li><a href="http://uriverse.com/dbp/class/Congressman">Congressman</a></li>
<li><a href="http://uriverse.com/dbp/class/PrimeMinister">Prime Minister</a></li>
<li><a href="http://uriverse.com/dbp/class/MemberOfParliament">Member Of Parliament</a></li>
<li><a href="http://uriverse.com/dbp/class/Chancellor">Chancellor</a></li>
<li><a href="http://uriverse.com/dbp/class/Mayor">Mayor</a></li>
</ul>
</li>
<li><a href="http://uriverse.com/dbp/class/Model">Model</a></li>
<li><a href="http://uriverse.com/dbp/class/Journalist">Journalist</a></li>
<li><a href="http://uriverse.com/dbp/class/Criminal">Criminal</a></li>
<li><a href="http://uriverse.com/dbp/class/BritishRoyalty">British Royalty</a></li>
<li><a href="http://uriverse.com/dbp/class/Architect">Architect</a></li>
<li><a href="http://uriverse.com/dbp/class/FictionalCharacter">Fictional Character</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/ComicsCharacter">Comics Character</a></li>
</ul>
</li>
<li><a href="http://uriverse.com/dbp/class/Celebrity">Celebrity</a></li>
<li><a href="http://uriverse.com/dbp/class/Cleric">Cleric</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/Saint">Saint</a></li>
<li><a href="http://uriverse.com/dbp/class/ChristianBishop">Christian Bishop</a></li>
<li><a href="http://uriverse.com/dbp/class/Cardinal">Cardinal</a></li>
<li><a href="http://uriverse.com/dbp/class/Pope">Pope</a></li>
</ul>
</li>
<li><a href="http://uriverse.com/dbp/class/Judge">Judge</a></li>
<li><a href="http://uriverse.com/dbp/class/FootballManager">Football Manager</a></li>
</ul>
</li>
<li><a href="http://uriverse.com/dbp/class/Organisation">Organisation</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/Band">Band</a></li>
<li><a href="http://uriverse.com/dbp/class/Company">Company</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/RecordLabel">Record Label</a></li>
<li><a href="http://uriverse.com/dbp/class/Airline">Airline</a></li>
</ul>
</li>
<li><a href="http://uriverse.com/dbp/class/RadioStation">Radio Station</a></li>
<li><a href="http://uriverse.com/dbp/class/MilitaryUnit">Military Unit</a></li>
<li><a href="http://uriverse.com/dbp/class/SportsTeam">Sports Team</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/SoccerClub">Soccer Club</a></li>
<li><a href="http://uriverse.com/dbp/class/FootballTeam">Football Team</a></li>
<li><a href="http://uriverse.com/dbp/class/HockeyTeam">Hockey Team</a></li>
</ul>
</li>
<li><a href="http://uriverse.com/dbp/class/Non-ProfitOrganisation">Non-Profit Organisation</a></li>
<li><a href="http://uriverse.com/dbp/class/Legislature">Legislature</a></li>
<li><a href="http://uriverse.com/dbp/class/Broadcast">Broadcast</a></li>
<li><a href="http://uriverse.com/dbp/class/TradeUnion">Trade Union</a></li>
<li><a href="http://uriverse.com/dbp/class/EducationalInstitution">Educational Institution</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/School">School</a></li>
<li><a href="http://uriverse.com/dbp/class/University">University</a></li>
<li><a href="http://uriverse.com/dbp/class/College">College</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="http://uriverse.com/dbp/class/EthnicGroup">Ethnic Group</a></li>
<li><a href="http://uriverse.com/dbp/class/Work">Work</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/MusicalWork">Musical Work</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/Album">Album</a></li>
<li><a href="http://uriverse.com/dbp/class/Single">Single</a></li>
<li><a href="http://uriverse.com/dbp/class/Song">Song</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/EurovisionSongContestEntry">Eurovision Song Contest Entry</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="http://uriverse.com/dbp/class/Film">Film</a></li>
<li><a href="http://uriverse.com/dbp/class/Book">Book</a></li>
<li><a href="http://uriverse.com/dbp/class/VideoGame">Video Game</a></li>
<li><a href="http://uriverse.com/dbp/class/TelevisionShow">Television Show</a></li>
<li><a href="http://uriverse.com/dbp/class/TelevisionEpisode">Television Episode</a></li>
<li><a href="http://uriverse.com/dbp/class/Software">Software</a></li>
<li><a href="http://uriverse.com/dbp/class/Newspaper">Newspaper</a></li>
<li><a href="http://uriverse.com/dbp/class/Magazine">Magazine</a></li>
<li><a href="http://uriverse.com/dbp/class/Musical">Musical</a></li>
</ul>
</li>
<li><a href="http://uriverse.com/dbp/class/Device">Device</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/AutomobileEngine">Automobile Engine</a></li>
<li><a href="http://uriverse.com/dbp/class/Weapon">Weapon</a></li>
</ul>
</li>
<li><a href="http://uriverse.com/dbp/class/MeanOfTransportation">Mean Of Transportation</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/Ship">Ship</a></li>
<li><a href="http://uriverse.com/dbp/class/Aircraft">Aircraft</a></li>
<li><a href="http://uriverse.com/dbp/class/Automobile">Automobile</a></li>
<li><a href="http://uriverse.com/dbp/class/AutomobilePlatform">Automobile Platform</a></li>
<li><a href="http://uriverse.com/dbp/class/Rocket">Rocket</a></li>
<li><a href="http://uriverse.com/dbp/class/SpaceShuttle">Space Shuttle</a></li>
<li><a href="http://uriverse.com/dbp/class/Spacecraft">Spacecraft</a></li>
<li><a href="http://uriverse.com/dbp/class/SpaceStation">Space Station</a></li>
</ul>
</li>
<li><a href="http://uriverse.com/dbp/class/AnatomicalStructure">Anatomical Structure</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/Bone">Bone</a></li>
<li><a href="http://uriverse.com/dbp/class/Artery">Artery</a></li>
<li><a href="http://uriverse.com/dbp/class/Vein">Vein</a></li>
<li><a href="http://uriverse.com/dbp/class/Lymph">Lymph</a></li>
<li><a href="http://uriverse.com/dbp/class/Nerve">Nerve</a></li>
<li><a href="http://uriverse.com/dbp/class/Brain">Brain</a></li>
<li><a href="http://uriverse.com/dbp/class/Muscle">Muscle</a></li>
<li><a href="http://uriverse.com/dbp/class/Embryology">Embryology</a></li>
</ul>
</li>
<li><a href="http://uriverse.com/dbp/class/OlympicResult">Olympic Result</a></li>
<li><a href="http://uriverse.com/dbp/class/Event">Event</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/MilitaryConflict">Military Conflict</a></li>
<li><a href="http://uriverse.com/dbp/class/MusicFestival">Music Festival</a></li>
<li><a href="http://uriverse.com/dbp/class/Convention">Convention</a></li>
<li><a href="http://uriverse.com/dbp/class/YearInSpaceflight">Year In Spaceflight</a></li>
<li><a href="http://uriverse.com/dbp/class/SpaceMission">Space Mission</a></li>
<li><a href="http://uriverse.com/dbp/class/FilmFestival">Film Festival</a></li>
<li><a href="http://uriverse.com/dbp/class/SportsEvent">Sports Event</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/GrandPrix">Grand Prix</a></li>
<li><a href="http://uriverse.com/dbp/class/WrestlingEvent">Wrestling Event</a></li>
<li><a href="http://uriverse.com/dbp/class/MixedMartialArtsEvent">Mixed Martial Arts Event</a></li>
<li><a href="http://uriverse.com/dbp/class/Race">Race</a></li>
<li><a href="http://uriverse.com/dbp/class/Olympics">Olympics</a></li>
<li><a href="http://uriverse.com/dbp/class/WomensTennisAssociationTournament">Womens Tennis Association Tournament</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="http://uriverse.com/dbp/class/Language">Language</a></li>
<li><a href="http://uriverse.com/dbp/class/ChemicalCompound">Chemical Compound</a></li>
<li><a href="http://uriverse.com/dbp/class/Species">Species</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/Eukaryote">Eukaryote</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/Plant">Plant</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/GreenAlga">Green Alga</a></li>
<li><a href="http://uriverse.com/dbp/class/Moss">Moss</a></li>
<li><a href="http://uriverse.com/dbp/class/ClubMoss">Club Moss</a></li>
<li><a href="http://uriverse.com/dbp/class/Fern">Fern</a></li>
<li><a href="http://uriverse.com/dbp/class/Cycad">Cycad</a></li>
<li><a href="http://uriverse.com/dbp/class/Ginkgo">Ginkgo</a></li>
<li><a href="http://uriverse.com/dbp/class/Conifer">Conifer</a></li>
<li><a href="http://uriverse.com/dbp/class/Gnetophytes">Gnetophytes</a></li>
<li><a href="http://uriverse.com/dbp/class/FloweringPlant">Flowering Plant</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/Grape">Grape</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="http://uriverse.com/dbp/class/Animal">Animal</a>
<ul>
<li><a href="http://uriverse.com/dbp/class/Fish">Fish</a></li>
<li><a href="http://uriverse.com/dbp/class/Amphibian">Amphibian</a></li>
<li><a href="http://uriverse.com/dbp/class/Reptile">Reptile</a></li>
<li><a href="http://uriverse.com/dbp/class/Bird">Bird</a></li>
<li><a href="http://uriverse.com/dbp/class/Mammal">Mammal</a></li>
<li><a href="http://uriverse.com/dbp/class/Insect">Insect</a></li>
<li><a href="http://uriverse.com/dbp/class/Arachnid">Arachnid</a></li>
<li><a href="http://uriverse.com/dbp/class/Crustacean">Crustacean</a></li>
<li><a href="http://uriverse.com/dbp/class/Mollusca">Mollusca</a></li>
</ul>
</li>
<li><a href="http://uriverse.com/dbp/class/Fungus">Fungus</a></li>
</ul>
</li>
<li><a href="http://uriverse.com/dbp/class/Archaea">Archaea</a></li>
<li><a href="http://uriverse.com/dbp/class/Bacteria">Bacteria</a></li>
</ul>
</li>
<li><a href="http://uriverse.com/dbp/class/Protein">Protein</a></li>
<li><a href="http://uriverse.com/dbp/class/Disease">Disease</a></li>
<li><a href="http://uriverse.com/dbp/class/Drug">Drug</a></li>
<li><a href="http://uriverse.com/dbp/class/SupremeCourtOfTheUnitedStatesCase">Supreme Court Of The United States Case</a></li>
<li><a href="http://uriverse.com/dbp/class/Website">Website</a></li>
<li><a href="http://uriverse.com/dbp/class/Music Genre">Music Genre</a></li>
<li><a href="http://uriverse.com/dbp/class/Currency">Currency</a></li>
<li><a href="http://uriverse.com/dbp/class/Beverage">Beverage</a></li>
<li><a href="http://uriverse.com/dbp/class/Award">Award</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://cruncht.com/361/uriverse-dbpedia-drupal-case-study/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Drupal Performance Quick Reference (part 13)</title>
		<link>http://cruncht.com/103/drupal-performance-quick-reference</link>
		<comments>http://cruncht.com/103/drupal-performance-quick-reference#comments</comments>
		<pubDate>Mon, 08 Feb 2010 10:05:56 +0000</pubDate>
		<dc:creator>Murray Woodman</dc:creator>
				<category><![CDATA[Drupal Planet]]></category>
		<category><![CDATA[Tech]]></category>
		<category><![CDATA[drupal]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://cruncht.com/?p=103</guid>
		<description><![CDATA[Time to revisit the different types of Drupal sites to see where gains can be made. What type of site do you have? This quick reference recaps the previous articles and lists the areas where different types of Drupal sites can improve performance.

All Sites

Get the best server for your budget and requirements.
Enable CSS and JS [...]]]></description>
			<content:encoded><![CDATA[<a href="http://cruncht.com/103/drupal-performance-quick-reference" title="Drupal Performance Quick Reference (part 13)"><img src="http://cruncht.com/wp-content/uploads/2010/02/to_do_list-150x150.jpg" alt="To do list" class="feed-image" title="Drupal Performance checklist" /></a><p>Time to revisit the <a href='/83/drupal-server-profile'>different types of Drupal sites</a> to see where gains can be made. What type of site do you have? This quick reference recaps the previous articles and lists the areas where different types of Drupal sites can improve performance.</p>
<p><span id="more-103"></span></p>
<h2>All Sites</h2>
<ul>
<li>Get the best server for your budget and requirements.</li>
<li>Enable CSS and JS optimization in Drupal</li>
<li>Enable compression in Drupal</li>
<li>Enable Drupal page cache and consider Boost</li>
<li>Install APC if available</li>
<li>Ensure no slow queries from rogue modules</li>
<li>Tune MySQL for decent query cache and key buffer</li>
<li>Optimize file size where possible</li>
</ul>
<h2>Server: Low resources</h2>
<ul>
<li>Boost stops PHP load and Bootstrap</li>
<li>Sensible module selection</li>
<li>Avoid node load in views lists</li>
<li>Smaller JVMs possibly if running Solr</li>
<li>Nginx smaller than Apache</li>
<li>mod_fcgid has smaller footprint over mod_php</li>
</ul>
<h2>Server: Farm</h2>
<ul>
<li>Split off Solr</li>
<li>Split off DB server, watch the latency</li>
<li>With Cache Router select Memcache over APC for shared pools</li>
<li>Master + slaves for DB</li>
<li>Load balancing across web servers</li>
</ul>
<h2>Size: Many Nodes</h2>
<ul>
<li>Buy more RAM for database indexes</li>
<li>Index columns, especially for views</li>
<li>Thoroughly check slow queries</li>
<li>Warm up database</li>
<li>Swap in Solr for search</li>
<li>Solr to handle taxonomy pages</li>
</ul>
<h2>Activity: Many requests</h2>
<ul>
<li>Boost or</li>
<li>Pressflow and Varnish</li>
<li>Nginx over Apache</li>
<li>InnoDB on cache tables</li>
</ul>
<h2>Users: Mainly logged in</h2>
<ul>
<li>View/Block caching</li>
<li>CacheRouter (APC or Memcache)</li>
</ul>
<h2>Contention: Many Writes</h2>
<ul>
<li>InnoDB</li>
<li>Watchdog to file</li>
</ul>
<h2>Content: Heavy</h2>
<ul>
<li>Optimized files</li>
<li>Well positioned server</li>
<li>CDN</li>
</ul>
<h2>Functionality: Rich</h2>
<ul>
<li>Well behaved modules</li>
<li>Not too many modules</li>
<li>View/Block caching</li>
</ul>
<h2>Page browsing: Dispersed</h2>
<ul>
<li>Boost over Varnish if RAM is tight</li>
</ul>
<h2>Audience: Dispersed</h2>
<ul>
<li>CDN</li>
</ul>
<hr />
<p>This article forms part of a series on Drupal performance and scalability. The first article in the series is <a href="http://cruncht.com/75/drupal-performance-scalability">Squeezing the last drop from Drupal: Performance and Scalability</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://cruncht.com/103/drupal-performance-quick-reference/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Drupal Page Rendering (part 12)</title>
		<link>http://cruncht.com/101/drupal-page-rendering</link>
		<comments>http://cruncht.com/101/drupal-page-rendering#comments</comments>
		<pubDate>Sun, 07 Feb 2010 10:04:30 +0000</pubDate>
		<dc:creator>Murray Woodman</dc:creator>
				<category><![CDATA[Drupal Planet]]></category>
		<category><![CDATA[Tech]]></category>
		<category><![CDATA[drupal]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://cruncht.com/?p=101</guid>
		<description><![CDATA[The time for a page to render in a user&#8217;s browser is comprised of two factors. The first is the time it takes to build a page on the server. The second is the time it takes to send and render the page with all the contained components. This guide has mainly been concerned with [...]]]></description>
			<content:encoded><![CDATA[<a href="http://cruncht.com/101/drupal-page-rendering" title="Drupal Page Rendering (part 12)"><img src="http://cruncht.com/wp-content/uploads/2010/02/slow-150x150.jpg" alt="Slow" class="feed-image" title="Drupal page rendering optimization" /></a><p>The time for a page to render in a user&#8217;s browser is made up of two factors. The first is the time it takes to build a page on the server. The second is the time it takes to send and render the page with all the contained components. This guide has mainly been concerned with the former &#8211; how to get the most from your server. However, it is estimated that 80% to 90% of the total time is taken up by the latter, the sending and rendering phase.</p>
<p><span id="more-101"></span></p>
<p>It&#8217;s no good serving a cached page in the blink of an eye if there are countless included files which need to be requested and many large images which need to be transported across the globe. Optimizing page rendering time can make a noticeable difference to the user and is the icing on the cake of a well optimized site. It is therefore important to consider and optimize this final leg of the journey.</p>
<dl class='more'>
<dt><a href='http://wimleers.com/article/improving-drupals-page-loading-performance'>Improving Drupal&#8217;s page loading performance</a></dt>
<dd>Wim Leers covers all the bases on how to improve loading performance.</dd>
<dt><a href='http://www.amazon.com/High-Performance-Web-Sites-Essential/dp/0596529309'>High Performance Web Sites: Essential Knowledge for Front-End Engineers</a></dt>
<dd>Steve Souders, Chief Performance Yahoo! and author of the YSlow extension, covers the Yahoo recommendations in this book.</dd>
<dt><a href='http://www.amazon.com/Even-Faster-Web-Sites-Performance/dp/0596522304'>Even Faster Web Sites: Performance Best Practices for Web Developers</a></dt>
<dd>Another Steve Souders book covering JavaScript (AJAX), network (image compression, chunked encoding) and browser (CSS selectors, etc).</dd>
</dl>
<p>It is worthwhile reviewing <a href='http://developer.yahoo.com/performance/rules.html'>Yahoo&#8217;s YSlow recommendations</a> to see all of the optimizations which are possible. We cover selected areas where the default Drupal install can be improved upon.</p>
<h2><a id='network-requests'>Minimize HTTP Requests</a></h2>
<h3>Combined Files</h3>
<p>The <a href='/87/drupal-performance-out-of-the-box'>Out of The Box</a> section covered the inbuilt CSS and JS aggregation and file compression. The use of &#8220;combined files&#8221; is a significant factor in Drupal&#8217;s relatively good score in the YSlow tests. Make sure you have this enabled.</p>
<p class='summary'>All sites: Enable CSS and JS aggregation.</p>
<h3>CSS Sprites</h3>
<p>CSS Image Sprites are another method of cutting down the number of requests. This approach combines a number of smaller images into one large one which is then selectively displayed to the user through the use of a background offset in CSS. It is a useful approach for things such as small icons, which can have a relatively large amount of HTTP overhead for each request. Something for the theme designers to consider.</p>
<p class='summary'>Custom designs: Use CSS sprites if appropriate.</p>
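<p>To make the background offset idea concrete, here is a minimal sketch that generates the CSS rules for a sprite. It assumes the icons are all the same height and stacked vertically in a single image; the class prefix, icon names and sprite.png file name are hypothetical.</p>

```python
def sprite_rules(icon_names, icon_height, sprite_url):
    """Return one CSS rule per icon for a vertically stacked sprite."""
    rules = []
    for index, name in enumerate(icon_names):
        # Shift the shared image up so the wanted icon shows through.
        offset = -index * icon_height
        rules.append(
            ".icon-%s { background: url(%s) no-repeat 0 %dpx; }"
            % (name, sprite_url, offset)
        )
    return rules

rules = sprite_rules(["rss", "print", "email"], 16, "sprite.png")
for rule in rules:
    print(rule)
```

<p>Each element would also need a fixed width and height matching one icon so that only its slice of the combined image is visible. The three icons then cost one HTTP request instead of three.</p>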
<dl class='more'>
<dt><a href='http://www.alistapart.com/articles/sprites'>CSS Sprites: Image Slicing’s Kiss of Death</a></dt>
<dd>Overview of how CSS sprites work and how they can be used.</dd>
<dt><a href='http://www.advomatic.com/blogs/jack-haas/lesson-usefulness-css-sprite-generators'>A lesson in the usefulness of CSS sprite generators</a></dt>
<dd>Covers commonly used sprite generators.</dd>
</dl>
<h2><a id='cdn'>Use a Content Delivery Network (CDN)</a></h2>
<p>This is the number two recommended best practice.</p>
<blockquote><p>A content delivery network (CDN) is a collection of web servers distributed across multiple locations to deliver content more efficiently to users. The server selected for delivering content to a specific user is typically based on a measure of network proximity. For example, the server with the fewest network hops or the server with the quickest response time is chosen.<br /><a href='http://developer.yahoo.com/performance/rules.html#cdn'>http://developer.yahoo.com/performance/rules.html#cdn</a></p></blockquote>
<p>Of all the CDN web services <a href='http://www.simplecdn.com/'>SimpleCDN</a> seems to be getting positive press amongst Drupal folks as it is simple and cheap. It offers the &#8220;origin pull&#8221; Mirror Buckets service which will serve content from 3.9 cents to 1.9 cents per GB. At this price you will probably be saving money on your bandwidth costs as well as serving content faster.</p>
<p>The <a href='http://drupal.org/project/cdn'>CDN integration module</a> is the recommended module to use for integration with content delivery networks as it supports &#8220;origin pull&#8221; as well as push methods. It supports content delivery for all CSS, JS, and image files (including ImageCache).</p>
<p class='summary'>High traffic, geographically dispersed: use CDN</p>
<dl class='more'>
<dt><a href='http://drupal.org/project/cdn'>CDN integration module</a></dt>
<dd>Wim Leers&#8217; fully featured module which integrates with a wide range of CDN servers.</dd>
<dt><a href='http://drupal.org/project/simplecdn'>SimpleCDN module</a></dt>
<dd>Simple CDN re-writes the URL of certain website elements (which can be extended using plugins) for use with a CDN Mirror service.</dd>
<dt><a href='http://wimleers.com/talk/drupalcon-dc-2009'>Drupal CDN integration: easier, more flexible and faster!</a></dt>
<dd>Slides covering advantages of CDNs and possible implementations.</dd>
<dt><a href='http://www.voxel.net/labs/mod_cdn'>mod_cdn</a></dt>
<dd>Apache2 module which shows some promise but not much info available for it with regards to Drupal.</dd>
<dt><a href='http://groups.drupal.org/node/47258'>Best Drupal CDN module?</a></dt>
<dd>Drupal Groups discussion.</dd>
</dl>
<p>On a related note, many sites can benefit from judicious placement of the server if traffic tends to come from one place and no CDN is being used. Sites based outside the US may find the proximity of a server hosted in their area worth the extra cost of hosting.</p>
<h2><a id='expires-headers'>Add Expires Headers</a></h2>
<p>When a file is served by a web server an &#8220;Expires&#8221; header can be sent back to the client telling it that the content being sent will expire at a certain date in the future and that the content may be cached until that time. This speeds up page rendering because the client doesn&#8217;t have to send a GET request to see if the file has been modified.</p>
<p>By default the .htaccess file in the root of Drupal contains rules which set a two week expiry for all files (CSS, JS, PNG, JPG, GIF) except for HTML, which is considered to be dynamic and therefore not cacheable.</p>
<p><code><br />
# Requires mod_expires to be enabled.<br />
&lt;IfModule mod_expires.c&gt;<br />
  # Enable expirations.<br />
  ExpiresActive On<br />
  # Cache all files for 2 weeks after access (A).<br />
  ExpiresDefault A1209600<br />
  # Do not cache dynamically generated pages.<br />
  ExpiresByType text/html A1<br />
&lt;/IfModule&gt;<br />
</code></p>
<p>The Expires header will not be generated unless you have mod_expires enabled in Apache. To make sure it is enabled in Apache2 run the following as admin.</p>
<p><code><br />
# a2enmod expires<br />
# /etc/init.d/apache2 restart<br />
</code></p>
<p>Ensuring this is enabled will raise your YSlow score by about 10 points.</p>
<p class='summary'>All sites: Configure Apache correctly for fewer requests.</p>
<h2><a id='gzip'>Gzip components</a></h2>
<p>You can Gzip pages by enabling page compression in the performance area of admin. Alternatively, you can configure Apache to compress responses itself.</p>
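<p>As a hedged illustration of the Apache route, a minimal sketch using mod_deflate might look like the following. It assumes Apache 2 with mod_deflate enabled (on Debian or Ubuntu, <code>a2enmod deflate</code>). The list of MIME types is an assumption made for this example rather than anything shipped with Drupal, and should be adjusted to the content your site serves.</p>
<p><code><br />
# Requires mod_deflate to be enabled.<br />
&lt;IfModule mod_deflate.c&gt;<br />
  # Compress text-based responses only (HTML, CSS, JS, XML).<br />
  AddOutputFilterByType DEFLATE text/html text/plain text/css text/xml application/javascript application/x-javascript<br />
&lt;/IfModule&gt;<br />
</code></p>
<p>Only text formats are listed because images and other binary files are usually compressed already and gain little from a second pass.</p>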
<p class='summary'>All Sites: Enable Gzip compression</p>
<h2><a id='optimize-images'>Optimize Images</a></h2>
<p>Binary files do not shrink significantly after Gzip compression. Gains can be made by ensuring that rich media such as images, audio and video are (i) targeted for the correct display resolution and (ii) have an appropriate amount of lossy compression applied. Since these files will generally only be downloaded once they do not benefit from caching in the client and so care must be taken to ensure that they are as small as reasonably possible.</p>
<p class='summary'>All Sites: Compress binary files</p>
<dl class='more'>
<dt><a href='http://pmt.sourceforge.net/pngcrush/'>Pngcrush</a></dt>
<dd>Pngcrush is an optimizer for PNG (Portable Network Graphics) files. It can be run from a commandline in an MSDOS window, or from a UNIX or LINUX commandline.</dd>
</dl>
<hr />
<p>This article forms part of a series on Drupal performance and scalability. The first article in the series is <a href="http://cruncht.com/75/drupal-performance-scalability">Squeezing the last drop from Drupal: Performance and Scalability</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://cruncht.com/101/drupal-page-rendering/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Drupal Benchmarking (part 11)</title>
		<link>http://cruncht.com/99/drupal-benchmarking</link>
		<comments>http://cruncht.com/99/drupal-benchmarking#comments</comments>
		<pubDate>Sat, 06 Feb 2010 10:03:15 +0000</pubDate>
		<dc:creator>Murray Woodman</dc:creator>
				<category><![CDATA[Drupal Planet]]></category>
		<category><![CDATA[Tech]]></category>
		<category><![CDATA[drupal]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://cruncht.com/?p=99</guid>
		<description><![CDATA[Benchmarking a system is a reliable way to compare one setup with another and is particularly helpful when comparing different server configurations. We cover a few simple ways to benchmark a Drupal website.

A performant system is not just one which is fast for a single request. You also need to consider how the system performs [...]]]></description>
			<content:encoded><![CDATA[<a href="http://cruncht.com/99/drupal-benchmarking" title="Drupal Benchmarking (part 11)"><img src="http://cruncht.com/wp-content/uploads/2010/02/tape-150x150.jpg" alt="Blue Tape" class="feed-image" title="Drupal Performance benchmarking" /></a><p>Benchmarking a system is a reliable way to compare one setup with another and is particularly helpful when comparing different server configurations. We cover a few simple ways to benchmark a Drupal website.</p>
<p><span id="more-99"></span></p>
<p>A performant system is not just one which is fast for a single request. You also need to consider how the system performs under stress (many requests) and how stable the system is (memory). Benchmarking with tools such as ab allows you to stress the server with many concurrent requests to replicate traffic when a site is being slashdotted. With a more customised setup these tools can also be used in more sophisticated ways to mimic traffic across a whole site.</p>
<dl class="more">
<dt><a href="http://drupal.org/node/79237">Benchmarking and profiling Drupal</a></dt>
<dd>Documentation which covers tools of the trade including Apache Bench (ab) and SIEGE.</dd>
</dl>
<h2><a id="apache-bench">Apache Bench (ab)</a></h2>
<p><a href="http://httpd.apache.org/docs/trunk/programs/ab.html">ab</a> is the most commonly used benchmarking tool in the community. It shows you how many requests per second your site is capable of serving. Concurrency can be set to 1 to get end to end speed results or increased to get a more realistic load for your site. Look to the &#8220;failed requests&#8221; and &#8220;requests per second&#8221; results.</p>
<p>In order to test the speed of a single page, turn off page caching and run ab with concurrency of one to get a baseline.</p>
<p><code>ab -n 1000 -c 1 http://drupal6/node/1</code></p>
<p>To check scalability turn on the page cache and ramp up concurrent connections (10 to 50) to see how much the server can handle. You should also make sure keep-alives are turned on (-k) as this leads to a more realistic result for a typical web browser. At higher concurrency levels making new connections can be a bottleneck. Also, set compression headers (-H) as most clients will support this feature.</p>
<p><code>ab -n 1000 -c 10 -k -H 'Accept-Encoding: gzip,deflate' http://drupal6/node/1</code></p>
<dl class="more">
<dt><a href="http://drupal.org/node/282862">Drupal Performance Measurement &amp; Benchmarking</a></dt>
<dd>Testing with ab and simple changes you can make within Drupal.</dd>
<dt><a href="http://drupaleasy.com/blogs/ryanprice/2009/04/drupal-performance-testing-apache-benchmark">On Drupal Performance: Testing with Apache Benchmark</a></dt>
<dd>Covers server side tools and walks through ab options and use.</dd>
<dt><a href="http://ezra-g.com/blog/20080229/benchmarking-authenticated-drupal-users-with-apachebench">Benchmarking Authenticated Drupal Users with ApacheBench</a></dt>
<dd>Demonstrates how to pull out current session id and how to pass that to ab so that authenticated users can be tested.</dd>
<dt><a href="http://groups.drupal.org/node/26485">Has anyone tried nginx caching with Drupal?</a></dt>
<dd>Illustrative discussion where different Drupal setups are benchmarked with ab.</dd>
</dl>
<h2><a id="jmeter">JMeter</a></h2>
<p><a href="http://jakarta.apache.org/jmeter/">JMeter</a> is a Java desktop app designed to test function and performance. It is the preferred testing tool of many administrators.</p>
<dl class="more">
<dt><a href="http://github.com/jacobSingh/Drupal-Performance-Testing-Suite">Drupal-Performance-Testing-Suite</a></dt>
<dd>Perl script which runs a JMeter test on Drupal and provides graphs.</dd>
</dl>
<h2><a id="benchmarking-thoughts">Thoughts on benchmarking</a></h2>
<p>Benchmarking is essential if you wish to have an objective comparison between different setups. However, it is not the final measurement with regards to performance. Remember that <a href="#page-rendering">page rendering</a> times are what are important for users and that too needs to be optimized. Also, benchmarks tend to be artificial in the sense that they often measure unrealistic situations. Will all of your requests be for one anonymous page only? Maybe in the Slashdot situation but there are other considerations obviously. Finally, it is easy to focus intently on the number, especially when it comes to caching scores, and forget that minor differences may not make so much of a difference to real life scenarios. Don&#8217;t forget the logged in user.</p>
<hr />
<p>This article forms part of a series on Drupal performance and scalability. The first article in the series is <a href="http://cruncht.com/75/drupal-performance-scalability">Squeezing the last drop from Drupal: Performance and Scalability</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://cruncht.com/99/drupal-benchmarking/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Custom Drupal Distributions (part 10)</title>
		<link>http://cruncht.com/97/custom-drupal-distributions</link>
		<comments>http://cruncht.com/97/custom-drupal-distributions#comments</comments>
		<pubDate>Fri, 05 Feb 2010 10:02:43 +0000</pubDate>
		<dc:creator>Murray Woodman</dc:creator>
				<category><![CDATA[Drupal Planet]]></category>
		<category><![CDATA[Tech]]></category>
		<category><![CDATA[drupal]]></category>
		<category><![CDATA[mercury]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[pressflow]]></category>

		<guid isPermaLink="false">http://cruncht.com/?p=97</guid>
		<description><![CDATA[There are a couple of projects which have made it easy to achieve performance gains by making slight amendments to core or packaging up the code in a helpful manner. Pressflow makes it possible to use a reverse proxy such as Varnish, amongst other things. Mercury packages Pressflow up as an Amazon EC2 image. The [...]]]></description>
			<content:encoded><![CDATA[<a href="http://cruncht.com/97/custom-drupal-distributions" title="Custom Drupal Distributions (part 10)"><img src="http://cruncht.com/wp-content/uploads/2010/02/sharing-150x150.jpg" alt="Sharing" class="feed-image" title="Custom Drupal Distributions" /></a><p>There are a couple of projects which have made it easy to achieve performance gains by making slight amendments to core or packaging up the code in a helpful manner. Pressflow makes it possible to use a reverse proxy such as Varnish, amongst other things. Mercury packages Pressflow up as an Amazon EC2 image. The development of both these projects is the sign of a maturing ecosystem where serious deployments can easily be rolled out.</p>
<p><span id="more-97"></span></p>
<h2><a id='pressflow'>Pressflow</a></h2>
<p><a href='https://launchpad.net/pressflow'>Pressflow</a> is a distribution which attempts to bring many of the improvements discussed above (SQL improvements, Varnish) into a single package. Pressflow is a standard Drupal install which has had its core modified to fix bottlenecks and facilitate the use of advanced caching features. Four Kitchens doesn&#8217;t regard Pressflow as a fork since many of the initiatives found in Pressflow are contributed back into the development of the head of Drupal.</p>
<p>So long as you haven&#8217;t hacked core yourself, using Pressflow is a simple matter of swapping out Drupal core and replacing it with Pressflow.</p>
<p>In a nutshell Pressflow allows the following:</p>
<ul>
<li>Support for database replication</li>
<li>Support for Squid and Varnish reverse proxy caching</li>
<li>Optimization for MySQL</li>
<li>Optimization for PHP 5</li>
</ul>
<p class='summary'>High Traffic, Varnish required: Easy setup.</p>
<dl class='more'>
<dt><a href='http://fourkitchens.com/pressflow-makes-drupal-scale'>Pressflow makes Drupal scale</a></dt>
<dd>Announcement covering the advantages and features of Pressflow.</dd>
</dl>
<h2><a id='project-mercury'>Project Mercury</a></h2>
<p><a href='http://www.chapterthree.com/blog/josh_koenig/project_mercury_preconfigured_drupalvarnish_ec2_ami'>Project Mercury</a> is an innovative project from <a href='http://www.chapterthree.com/'>Chapter Three</a> which wraps up a tricked out Pressflow installation into a preconfigured Amazon Machine Image (AMI) for use on Amazon EC2 instances.</p>
<dl class='more'>
<dt><a href='http://www.chapterthree.com/blog/zack_rosen/pantheon_project_blazes_ahead'>The Pantheon Project Blazes Ahead</a></dt>
<dd>Hot off the press: Mercury will also be available for deployment on other servers, not just EC2. Further, there will be a Mercury On Demand service at Rackspace.</dd>
</dl>
<blockquote><p>The goal of this project is to make Drupal as fast as possible for as many people as possible. To that end, we are developing a pre-built Amazon Machine Image (AMI) which will allow anyone with an Amazon Web Services account to spin up an EC2 instance and see how all this works in real-time. The ultimate goal is a production-ready release that can be used for deploying real websites.</p></blockquote>
<p>If you want to get started using the image all you need to do is sign up to Amazon Web Services and then start up an instance of your choosing. You are now in control of a fully configured, scalable server. This sounds easy. However, if you are considering going down this path there are a few considerations:</p>
<ul>
<li>Amazon is not the cheapest provider of bandwidth, RAM and storage. Other virtual servers have better deals. You are paying for the ability to spawn servers on the fly.</li>
<li>Persistent storage is an issue which needs to be overcome and managed if scaling out your web server.</li>
<li>There is a bit of a learning curve with some of the tricks of the trade when managing the servers.</li>
</ul>
<p>Project Mercury and EC2 are a worthy combination if you really need the ability to serve massive amounts of traffic and also have the ability to temporarily scale out during peak times.</p>
<dl class='more'>
<dt><a href='http://groups.drupal.org/node/25617'>Project Mercury Benchmarks: 2000+ Requests Per Second!</a></dt>
<dd>Drupal is fast with APC + Page Cache. It is very fast with PressFlow and Varnish. NB. It would have been interesting to see how Boost went against Varnish for this test.</dd>
</dl>
<p>The configuration chosen by the project is interesting because it shows how other sites might go about setting up a scalable server. In brief the setup is as follows:</p>
<ul>
<li>Ubuntu 32 or 64 bit</li>
<li>Pressflow</li>
<li>APC for opcode cache</li>
<li>CacheRouter and APC/Memcached for non-SQL caches</li>
<li>Varnish as reverse proxy</li>
</ul>
<p class='summary'>High Traffic, Varnish required, EC2 required: Easy setup considering.</p>
<dl class='more'>
<dt><a href='http://groups.drupal.org/node/25425'>Step-by-step: Setting up Varnish, Apache, APC and Solr Project Mercury Style</a></dt>
<dd>Step by step instructions for setting up Project Mercury on an Ubuntu server. Very helpful for admins wishing to install manually on their own server.</dd>
</dl>
<hr />
<p>This article forms part of a series on Drupal performance and scalability. The first article in the series is <a href="http://cruncht.com/75/drupal-performance-scalability">Squeezing the last drop from Drupal: Performance and Scalability</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://cruncht.com/97/custom-drupal-distributions/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Drupal Caching (part 9)</title>
		<link>http://cruncht.com/95/drupal-caching</link>
		<comments>http://cruncht.com/95/drupal-caching#comments</comments>
		<pubDate>Thu, 04 Feb 2010 10:02:08 +0000</pubDate>
		<dc:creator>Murray Woodman</dc:creator>
				<category><![CDATA[Drupal Planet]]></category>
		<category><![CDATA[Tech]]></category>
		<category><![CDATA[caching]]></category>
		<category><![CDATA[drupal]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://cruncht.com/?p=95</guid>
		<description><![CDATA[Several advanced options exist to take caching to the next level in Drupal. With advanced caching Drupal is able to scale to levels required by the most demanding of sites.

We have already discussed (i) page  and block caching and (ii) other caches which come baked into the core of Drupal in the Out of [...]]]></description>
			<content:encoded><![CDATA[<a href="http://cruncht.com/95/drupal-caching" title="Drupal Caching (part 9)"><img src="http://cruncht.com/wp-content/uploads/2010/02/squirrel-150x150.jpg" alt="Red Squirrel" class="feed-image" title="Drupal caching" /></a><p>Several advanced options exist to take caching to the next level in Drupal. With advanced caching Drupal is able to scale to levels required by the most demanding of sites.</p>
<p><span id="more-95"></span></p>
<p>We have already discussed (i) page and block caching and (ii) other caches which come baked into the core of Drupal in the <a href="/87/drupal-performance-out-of-the-box">Out of the Box</a> section. Whilst big gains can be made through enabling simple config options it is possible to make several more improvements to the way Drupal caches and serves data, making Drupal a system which can scale to serve thousands of requests per second if the need arises.</p>
<h2><a id="page-caching">Page Caching</a></h2>
<h3><a id="drupal-bootstrap">The Bootstrap</a></h3>
<p>Every time Drupal code needs to be run, either through a web page or script, Drupal must undertake the bootstrap process which has a certain amount of overhead. Generally this process cannot be avoided &#8211; you need to run code after all. However, there are a couple of cases where it is possible. Firstly, Normal Page caching avoids much of the bootstrap save for hook_boot() and hook_exit() hooks which are used by the statistics and throttle modules. Secondly, Aggressive Page Caching allows for all of the bootstrap to be avoided when serving content. Finally, contributed modules such as Boost and Varnish allow for the avoidance of both PHP and the bootstrap since they operate at a stage before PHP is invoked. This is the main reason Boost and Varnish are so attractive as a page cache.</p>
<h3><a id="boost">Boost</a></h3>
<p><a href="http://drupal.org/project/boost">Boost</a> is a page cache which works in a similar way to aggressive page caching in that it attempts to serve cached data without running the bootstrap. However, it is able to go one step further and avoid PHP being run as the redirect takes place in rewrite rules located in .htaccess. If the static file exists then it is served as the response. The end result is super fast response without running PHP or the bootstrap. This frees up the server to handle more requests from logged in users.</p>
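<p>To make the mechanism concrete, here is a simplified sketch of the kind of rewrite rules involved. It is not the set of rules Boost generates: the <code>cache/</code> directory layout and the use of a <code>DRUPAL_UID</code> cookie to spot logged in users are assumptions made for illustration, and only mod_rewrite itself is taken as given.</p>
<p><code><br />
# Simplified sketch only, not the rules generated by Boost.<br />
# The cache/ layout and the DRUPAL_UID cookie check are assumptions.<br />
&lt;IfModule mod_rewrite.c&gt;<br />
  RewriteEngine On<br />
  # Only GET requests without a query string are candidates for the static cache.<br />
  RewriteCond %{REQUEST_METHOD} ^GET$<br />
  RewriteCond %{QUERY_STRING} ^$<br />
  # Skip logged in users, identified here by a cookie.<br />
  RewriteCond %{HTTP_COOKIE} !DRUPAL_UID<br />
  # Serve the static file if it exists and is not empty (-s).<br />
  RewriteCond %{DOCUMENT_ROOT}/cache/%{HTTP_HOST}%{REQUEST_URI}_.html -s<br />
  RewriteRule .* cache/%{HTTP_HOST}%{REQUEST_URI}_.html [L,T=text/html]<br />
&lt;/IfModule&gt;<br />
</code></p>
<p>Rules of this kind need to sit before Drupal&#8217;s own catch-all rewrite to index.php in .htaccess. When no static file matches, the request falls through to Drupal as normal.</p>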
<p>Boost is an attractive module because it is easy to install and has a lot of configuration options which give good control over cache building and invalidation. It is highly recommended.</p>
<p>One aspect of Boost which may be forgotten is that it isn&#8217;t entirely a file based solution. If the operating system file cache is working well then there is a good chance that the &#8220;file&#8221; will come out of RAM rather than off disk. This makes Boost a competitive option when compared to other more complicated reverse proxy setups such as Squid or Varnish.</p>
<p>Because Boost is file based it is able to potentially cache a lot more data than a RAM based solution. If you have thousands or millions of pages to cache putting them all in RAM is probably not optimal. Best to put them on disk and save the RAM for DB indexes or possibly a CacheRouter cache.</p>
<p class="summary">All sites, especially big, infrequently changing, high traffic: Big gains. Easy setup.</p>
<h3><a id="varnish">Varnish</a></h3>
<p>The <a href="http://drupal.org/project/varnish">Varnish HTTP Accelerator Integration</a> module integrates Drupal with <a href="http://varnish-cache.org/">Varnish</a>, a reverse proxy which sits in front of Apache, PHP and Drupal. Varnish stores cached content in RAM and avoids the overhead of Apache and the Drupal bootstrap. As such it offers very high performance for anonymous users on cached pages and is the preferred option for many sites where scaling is paramount.</p>
<p>Varnish requires one of the following: a patch to core to add HTTP headers, Pressflow, or Drupal 7. Most people are therefore running Varnish in conjunction with Pressflow.</p>
<p class="summary">High Traffic: Big gains when performance critical.</p>
<h3><a id="squid">Squid</a></h3>
<p>Squid is a reverse proxy similar to Varnish and is in use on drupal.org. It doesn&#8217;t seem to be as commonly deployed, probably because Pressflow has been altered to work with Varnish, which reportedly performs better.</p>
<p class="summary">Varnish preferred over Squid.</p>
<h3><a id="varnish-vs-boost">Varnish vs Boost</a></h3>
<p>A popular thread in the <a href="http://groups.drupal.org/high-performance">High Performance</a> Drupal group asks for the <a href="http://groups.drupal.org/node/46042">perfect recipe for page caching</a>: whether to use Varnish or Boost. Varnish has the edge in speed over Apache+Boost as well as Nginx+Boost. Look at <a href="http://groups.drupal.org/node/45514#comment-119868">results</a> published by <a href="http://groups.drupal.org/user/16022">brianmercer</a>.</p>
<table style="height: 88px;" width="417">
<tbody>
<tr>
<th style="text-align: left;">Setup</th>
<th style="text-align: left;">Approx requests/s</th>
</tr>
<tr>
<td>Boost with Apache-prefork</td>
<td>500</td>
</tr>
<tr>
<td>Boost with Nginx</td>
<td>2000</td>
</tr>
<tr>
<td>Varnish</td>
<td>2400</td>
</tr>
</tbody>
</table>
<p>Varnish may have the edge in speed but it is more complex to install and requires a patched core. Some people may not want to run a non standard Drupal installation. Varnish also requires RAM to store the cached material &#8211; something which may be better spent elsewhere (database or max clients). Boost offers similar performance, is easy to install and has good control of cache invalidation and warmup. Boost is also able to serve files from RAM if the OS has cached them.</p>
<p>It is certainly up to you to decide which avenue to take. This guide is attracted to the relative simplicity of Drupal + Nginx + Boost over Pressflow + Varnish + Apache/Nginx. Nginx brings better performance to the web server as a whole, i.e. for pages not in the cache, and RAM savings if it is needed elsewhere. It must be stressed that each site has a different profile and the ultimate decision is up to you.</p>
<dl class="more">
<dt><a href="http://groups.drupal.org/node/26485">Has anyone tried nginx caching with Drupal?</a></dt>
<dd>Drupal Groups thread with lots of benchmarking. Final conclusion seems to be that Varnish vs Boost+Nginx is a pretty close thing.</dd>
<dt><a href="http://www.chapterthree.com/blog/josh_koenig/project_mercury_preconfigured_drupalvarnish_ec2_ami">Project Mercury: A pre-configured Drupal+Varnish EC2 AMI</a></dt>
<dd>Josh Koenig of Chapter Three claims that Varnish is faster than Boost but gives no numbers.</dd>
<dt><a href="http://www.metaltoad.com/blog/quick-drupal-cacherouter-and-boost-benchmarks">Quick Drupal Cacherouter and Boost benchmarks</a></dt>
<dd>Dylan Tack likes Boost: &#8220;Response times are all close enough that it doesn&#8217;t really matter what caching backend you choose&#8230; The only factor that&#8217;s really relevant is how good your system&#8217;s cache expiration and regeneration logic is&#8230; it seems like Boost is the clear winner here as well.&#8221;</dd>
<dt><a href="http://groups.drupal.org/node/46042">What&#8217; recipe should i choose for best performance?</a></dt>
<dd>Discussion with participants split between Varnish and Boost depending on circumstances. Nginx+Boost seems pretty equal with Varnish.</dd>
<dt><a href="http://groups.drupal.org/node/21897">Caching: Modules that make Drupal scale</a></dt>
<dd>Table of modules, performance gains and features.</dd>
</dl>
<h2><a id="cache-router">Cache Router</a></h2>
<p><a href="http://drupal.org/project/cacherouter">Cache Router</a> is a module which enables you to cache the Drupal cache tables (including views, blocks, menus, variables, and filters) in RAM. Drupal no longer has to hit the database to pull out this content &#8211; a big win for logged in users who might not be able to enjoy the advantages of a page cache. Cache Router therefore fills an important niche in your caching strategy.</p>
<p>Cache Router is able to do this via a number of backends including APC or Memcache. APC should be considered if you are running Drupal on a single node and only need the single local store. <a href="http://www.chapterthree.com/blog/josh_koenig/project_mercury_preconfigured_drupalvarnish_ec2_ami">According to Josh Koenig</a>, APC &#8220;is less error-prone, more secure, and allegedly as fast (if not faster) than running memcached according to the folks at Facebook.&#8221; Memcache can be used to share cached data across multiple servers.</p>
<p class="summary">High Traffic, Logged in: Massive gains when page cache not hit.</p>
<hr />
<p>This article forms part of a series on Drupal performance and scalability. The first article in the series is <a href="http://cruncht.com/75/drupal-performance-scalability">Squeezing the last drop from Drupal: Performance and Scalability</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://cruncht.com/95/drupal-caching/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Drupal Troubleshooting (part 8)</title>
		<link>http://cruncht.com/93/drupal-troubleshooting</link>
		<comments>http://cruncht.com/93/drupal-troubleshooting#comments</comments>
		<pubDate>Wed, 03 Feb 2010 10:01:41 +0000</pubDate>
		<dc:creator>Murray Woodman</dc:creator>
				<category><![CDATA[Drupal Planet]]></category>
		<category><![CDATA[Tech]]></category>
		<category><![CDATA[drupal]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://cruncht.com/?p=93</guid>
		<description><![CDATA[Sometimes your installation will be slow for no apparent reason. Time to grab your toolkit and get started on finding and eliminating the problem. Here are some of the common problems faced by developers.

Database Queries
The database is generally the first place you look when trying to identify problems on a page. It is possible to [...]]]></description>
			<content:encoded><![CDATA[<a href="http://cruncht.com/93/drupal-troubleshooting" title="Drupal Troubleshooting (part 8)"><img src="http://cruncht.com/wp-content/uploads/2010/02/flat_tire-150x150.jpg" alt="Flat Tire" class="feed-image" title="Debugging Drupal" /></a><p>Sometimes your installation will be slow for no apparent reason. Time to grab your toolkit and get started on finding and eliminating the problem. Here are some of the common problems faced by developers.</p>
<p><span id="more-93"></span></p>
<h2><a id='database-queries'>Database Queries</a></h2>
<p>The database is generally the first place you look when trying to identify problems on a page. It is possible to identify problems through MySQL&#8217;s slow query log or through the query log of the Devel module.</p>
<dl class='more'>
<dt><a href='http://www.amazon.com/High-Performance-MySQL-Optimization-Replication/dp/0596101716/ref=sr_1_1?ie=UTF8&#038;s=books&#038;qid=1264569847&#038;sr=1-1'>High Performance MySQL: Optimization, Backups, Replication, and More</a></dt>
<dd>Must have book for anyone serious about getting the most from MySQL.</dd>
</dl>
<h3><a id='database-indexes'>Views and indexes</a></h3>
<p>The Views module makes building queries easy, and sometimes a crucial part of the query will rely on a CCK column with no index. This tends to happen with sorting or filtering. Queries can run slowly in these cases. The solution is to place an index on the column in question.</p>
<p class='summary'>Large Sites: CCK columns need indexes</p>
<h3><a id='database-joinst'>Views and Left Joins</a></h3>
<p>The Views module will often design queries with LEFT JOIN rather than INNER JOIN, especially when joining from the node table to a content type CCK table. In many cases you might only want an inner join, especially when the node table is very big. In these cases it is possible to rewrite the query by hacking the query in hook_views_pre_execute.</p>
<p class='summary'>Large Sites: Some View SQL inefficient</p>
<dl class='more'>
<dt><a href='http://drupal.org/node/372994'>Ability to INNER JOIN to node for a specific field</a></dt>
<dd>Discussion of Views SQL for joining between node and content type tables.</dd>
</dl>
<h3><a id='database-schema'>Inefficient schema</a></h3>
<p>The database design of some modules could possibly be improved. It is up to you the developer/administrator to ensure that you are happy with the internal design of contributed modules. If you find an inefficiency then submit an issue to the module, provide a patch or remove the module from your site.</p>
<h3><a id='database-composite-index'>Composite indexes</a></h3>
<p>MySQL is limited to using one index per table when sorting and filtering. This can make it tricky when you wish to use multiple AND clauses or involve a sort and a filter at the same time. In these cases adding a composite index can get you out of trouble.</p>
<h3><a id='problems-core'>Problems in Core</a></h3>
<p>In a number of cases there are problems in core of Drupal where queries are very slow on very big Drupal installations with millions of nodes. The author has experienced problems with:</p>
<ul>
<li>Inability to browse content in Admin section  due to join from member to user table</li>
<li>Inability to edit nodes with many CCK fields. Massive RAM usage when loading nodes.</li>
<li>Taxonomy pages take a very long time to display</li>
<li>Search system unable to index content</li>
</ul>
<p>These problems can be avoided partly by swapping Solr in for search and having it override the taxonomy pages. The other problems you just have to live with <img src='http://cruncht.com/wp-includes/images/smilies/icon_neutral.gif' alt=':-|' class='wp-smiley' /></p>
<dl class='more'>
<dt><a href='http://wtanaka.com/drupal/million-nodes-6'>Drupal with millions of nodes</a></dt>
<dd>Some good research from Wesley Tanaka into problems with many nodes. Attitude from some here seems a little dismissive of the issues.</dd>
</dl>
<h2><a id='slow-modules'>Slow Modules</a></h2>
<p>Modules can also be badly written, leading to poor performance in certain circumstances. Generally this happens when the module&#8217;s creator did not test the module against (i) large installations with many nodes or (ii) complex installations with heavy nodes or detailed taxonomies. You generally will only run into these problems if your site is large. In some cases you can fix the module by creating more indexes in the DB. In others you just have to remove the module from your installation&#8230; or submit a patch.</p>
<p>Some known offenders for large sites include:</p>
<ul>
<li>XML Sitemap</li>
<li>Node access</li>
<li>Taxonomy Browser</li>
<li>Fivestar</li>
</ul>
<dl class='more'>
<dt><a href='http://2bits.com/articles/how-drupals-nodeaccess-table-can-negatively-impact-site-performance.html'>How Drupal&#8217;s node_access table can negatively impact site performance</a></dt>
<dd></dd>
<dt><a href='http://2bits.com/articles/scalability-taxonomy-browser-module-restricting-number-terms.html'>Scalability of the Taxonomy Browser module: Restricting number of terms</a></dt>
<dd>&#8220;Query from hell&#8221; with many joins leading to queries which never finish.</dd>
<dt><a href='http://2bits.com/articles/xml-sitemap-6x-2x-how-drupal-modules-can-overload-site-during-cron-solutions.html'>XML Sitemap 6.x-2.x: How Drupal modules can overload a site during cron, with solutions</a></dt>
<dd>XML Sitemap module needs to be configured correctly</dd>
</dl>
<hr />
<p>This article forms part of a series on Drupal performance and scalability. The first article in the series is <a href="http://cruncht.com/75/drupal-performance-scalability">Squeezing the last drop from Drupal: Performance and Scalability</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://cruncht.com/93/drupal-troubleshooting/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Drupal Implementation Decisions (part 7)</title>
		<link>http://cruncht.com/91/drupal-implementation-decisions</link>
		<comments>http://cruncht.com/91/drupal-implementation-decisions#comments</comments>
		<pubDate>Tue, 02 Feb 2010 10:01:08 +0000</pubDate>
		<dc:creator>Murray Woodman</dc:creator>
				<category><![CDATA[Drupal Planet]]></category>
		<category><![CDATA[Tech]]></category>
		<category><![CDATA[drupal]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://cruncht.com/?p=91</guid>
		<description><![CDATA[Bad performance can be down to decisions you have made during development because you haven&#8217;t been aware of natural limitations of the system or the way Drupal works. The more you poke about the more you will understand. Here&#8217;s a grab bag of things which may bite you.

Search
The search which comes inbuilt with Drupal has [...]]]></description>
			<content:encoded><![CDATA[<a href="http://cruncht.com/91/drupal-implementation-decisions" title="Drupal Implementation Decisions (part 7)"><img src="http://cruncht.com/wp-content/uploads/2010/02/bail-150x150.jpg" alt="Bail" class="feed-image" title="Drupal implementation mistakes" /></a><p>Bad performance can be down to decisions you have made during development because you haven&#8217;t been aware of natural limitations of the system or the way Drupal works. The more you poke about the more you will understand. Here&#8217;s a grab bag of things which may bite you.</p>
<p><span id="more-91"></span></p>
<h2><a id='search'>Search</a></h2>
<p>The search which comes inbuilt with Drupal has long been regarded as unsatisfactory for larger sites:</p>
<ul>
<li>failure to index large sites efficiently</li>
<li>slow at returning results</li>
<li>relatively limited feature set compared to dedicated search solutions</li>
</ul>
<p>Part of the problem lies with the fact that standard relational databases such as MySQL and interpreted languages such as PHP are not well suited to handle the large indexes and filtering required for big datasets. Search can also be an intensive process if the corpus is large and the traffic high. Being able to move search off the main box will give more resources to Drupal to do other things. Both of the following solutions can help solve these issues.</p>
<h3 id='solr'>Solr</h3>
<p>Enter the Apache Lucene project and Apache Solr.</p>
<dl class='more'>
<dt><a href='http://lucene.apache.org/'>Welcome to Lucene!</a></dt>
<dd>Lucene &#8220;provides Java-based indexing and search technology, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities.&#8221;</dd>
<dt><a href='http://lucene.apache.org/solr/'>Welcome to Solr</a></dt>
<dd>&#8220;Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world&#8217;s largest internet sites&#8221;</dd>
</dl>
<p>The <a href='http://drupal.org/project/apachesolr'>Apache Solr Search Integration</a> module integrates Solr with Drupal. Installation requires a setup of a JVM and a container such as Tomcat or Jetty &#8211; something which is a little outside the usual LAMP configuration. Installation is not too difficult and good documentation is available. If this is beyond you then it may be worth checking out the <a href='http://acquia.com/products-services/acquia-search'>Acquia Search</a> hosted solution.</p>
<p>The beauty of Solr is that it is able to index massive datasets, millions if not 100s of millions of nodes, and return results surprisingly quickly. The faceted search (taxonomy, language, CCK, Author, Content Type) is its main selling point which will be sure to impress users. It really is a product which will take your site to the next level if search is important to you. More importantly Solr solves crippling scaling problems with core search as well as viewing taxonomy terms with many nodes.</p>
<p>Check out the following search for <a href="http://uriverse.com/search/apachesolr_search/madonna?filters=language%3Aen%20type%3Adwrk">&#8220;Madonna&#8221; filtered by &#8220;Artistic Works&#8221; type and &#8220;English&#8221; language</a>. Alternatively, here is <a href="http://uriverse.com/search/apachesolr_search/madonna?filters=type%3Adpsn%20language%3Ait">Madonna filtered by &#8220;Person&#8221; type and &#8220;Italian&#8221; language</a>. <a href="http://www.darkbrownbuckets.com/media/madonna.jpg">Italians do it better</a>.</p>
<p>Solr is a natural fit for installation on another server because it runs as a web service over HTTP. This is a very easy way to take load off your main server, especially if search is a big part of your site. The disk requirements of the index and Java memory requirements for Solr can be significant on big sites so moving it off the main server may well be a necessity. My experience on <a href="http://uriverse.com">Uriverse</a> (a large number of smallish nodes) would suggest that each node takes around 200 bytes of RAM in the JVM. Your requirements could vary dramatically from this but this may be helpful as a rough guide for those wanting to know about RAM consumption.</p>
<p class='summary'>Large sites: Core search fails. Solr shines.</p>
<dl class='more'>
<dt><a href='http://wiki.apache.org/solr/SolrPerformanceFactors'>SolrPerformanceFactors</a></dt>
<dd>Solr documentation including optimization.</dd>
<dt><a href='http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr'>Scaling Lucene and Solr</a></dt>
<dd>Some very good notes on scaling Solr. File system cache is important as well as JVM and need 2+GB on top of JVM for big index.</dd>
<dt><a href='http://stackoverflow.com/questions/1546898/how-to-reduce-solr-memory-usage'>how to reduce solr memory usage?</a></dt>
<dd>Smaller documents, fewer facets, no sorting.</dd>
<dt><a href='http://old.nabble.com/How-much-disk-space-does-optimize-really-take-to25790344.html#a25792748'>How much disk space does optimize really take</a></dt>
<dd>Be careful of optimize &#8211; required disk space can double.</dd>
</dl>
<h3 id='google-custom-search'>Google Custom Search</h3>
<p>Another option for those users looking to switch away from the in-built search is <a href="http://www.google.com/cse">Google Custom Search</a> &#8211; a search service offered by Google where Google indexes your data and stores the results on their servers. You are able to search the data via a simple form on your site. Google Custom Search is free for individuals, who don&#8217;t mind ads with their results, with a business version starting at $100 pa. The <a href="http://drupal.org/project/google_cse">Google Custom Search Module</a> integrates it into Drupal by providing a block with the form.</p>
<h2><a id='module-bloat'>Module bloat</a></h2>
<p>Every module you add to a Drupal site leads to the consumption of more RAM, reducing the number of simultaneous clients you can serve. Modules also consume extra CPU. As a site designer you need to be aware that all modules added will have some cost on the performance of your site. Is the extra functionality worth the performance cost?</p>
<dl class='more'>
<dt><a href='http://2bits.com/articles/server-indigestion-the-drupal-contributed-modules-open-buffet-binge-syndrome.html'>Server indigestion: The Drupal contributed modules &#8220;open buffet binge&#8221; syndrome</a></dt>
<dd>Minimal install of Drupal has Apache process around 17MB-33MB. A bloated system has 93MB bloating to 100MB after a few requests. This means fewer requests can be handled due to RAM consumption.</dd>
</dl>
<p class='summary'>Feature rich sites: RAM wasted. Low max clients.</p>
<h2><a id='noad-load'>Node load</a></h2>
<p>The loading of a node in Drupal can be a very quick/light or very slow/heavy process depending on the circumstances. It all comes down to (i) how much data is in the node and (ii) how fast that data comes back out of the database. If you have nodes with many CCK fields sitting in a database which hasn&#8217;t been able to cache all of the necessary indexes, a node load is something which you really want to avoid. Instead of lazily loading data when required a node load loads it all in at once! In extreme cases a node could take seconds to load.</p>
<p>This becomes a problem when you want to handle many nodes at once, such as when you want to display node teasers in a View. If you notice your page slowing down when displaying 50 teasers on a view then it is a good bet that the DB is getting hammered trying to load in all that data just to display the title, description and url_alias! In these cases skip the teaser and just do it with fields. You should notice a big improvement.</p>
<p>Similarly, performing massive (millions of nodes) imports for new/updated nodes can be prohibitive as well. In these rare cases you need to skip the API and execute SQL on the DB directly.</p>
<p class='summary'>Many Nodes, Big Nodes: Speed and RAM suffer</p>
<h2><a id='cck-design'>CCK design</a></h2>
<p>It&#8217;s important to know how to work within the confines of a system to get the most out of it so it is well worth investigating how data is stored in the backend for nodes and CCK fields. Smart, sensible design of CCK fields should ensure that database access is kept to a sensible level. Whilst this guide recommends designing a data model first and then worrying about data access times second, it is worth knowing the consequences of your decisions.</p>
<p>In a nutshell, CCK will create another table in the backend for multi fields and fields which are shared between content types. There&#8217;s no avoiding the first reason but in the case of shared fields, a separate DB query will need to be run for each shared property. This can be an issue on very large sites on nodes with a lot of properties when you want to keep DB queries to a minimum.</p>
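<p>To see how CCK has laid out its storage on an existing site, listing the tables is a quick check. This sketch assumes the CCK for Drupal 6 naming convention of <code>content_type_*</code> for per-content-type tables and <code>content_field_*</code> for multi-value and shared fields, a database called <code>drupal6</code>, and no table prefix.</p>
<p><code><br />
        USE drupal6;<br />
        -- One table per content type, holding its unshared single-value fields.<br />
        SHOW TABLES LIKE 'content_type_%';<br />
        -- One table per multi-value or shared field; each costs an extra query or join.<br />
        SHOW TABLES LIKE 'content_field_%';<br />
        </code></p>
<p>A long second list for a content type that is loaded often is a hint that the field design may be worth revisiting.</p>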
<p>A couple of rules of thumb would be:</p>
<ul>
<li>Try to keep the properties of a content type encapsulated to that particular content type. eg. you might be tempted to share book.author and film.screenwriter in a single CCK field. Unless you are going to be doing queries across both then it makes sense to store the properties separately with their content types.</li>
<li>If you are going to share properties between content types, then the more sharing that can be done the better. eg. a &#8216;geo-location&#8217;, &#8216;country&#8217;, &#8216;intended-audience&#8217; are all possible candidates for stretching across several content types. This design is views friendly, aiding efficient queries across multiple content types.</li>
<li>It is probably better to lean towards designing content types which use taxonomy types and have some null properties instead of creating too many &#8216;sub-class&#8217; content types to do the job. eg. let&#8217;s say you have two possible types, &#8216;it-employee&#8217; and &#8216;accounts-employee&#8217;, which share some properties (birthday, address) but not others (preferred-os). It probably would be better to add an employee-type taxonomy to do the sub-classing and have the unshared preferred-os property as optional. This ensures easy filtering using taxonomy and allows for fast employee retrieval from a single table.</li>
</ul>
<p class='summary'>Complex Data Model: Inefficient data</p>
<h2><a id='module-develpment'>Module development</a></h2>
<p>When designing your own modules you should have an eye to efficient design and use caching the Drupal way. Using the Drupal cache mechanism means that the custom caches are available to Cache Router to store in RAM (if appropriate).</p>
<p>It is also possible to use the PHP &#8220;static&#8221; keyword to cache a value in a variable for the duration of that particular script run.</p>
<dl class='more'>
<dt><a href='http://www.lullabot.com/articles/a_beginners_guide_to_caching_data'>A beginner&#8217;s guide to caching data</a></dt>
<dd>Steps module developers can take to cache data the PHP/Drupal way.</dd>
</dl>
<p class='summary'>Custom Modules: cache data the Drupal way</p>
<hr />
<p>This article forms part of a series on Drupal performance and scalability. The first article in the series is <a href="http://cruncht.com/75/drupal-performance-scalability">Squeezing the last drop from Drupal: Performance and Scalability</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://cruncht.com/91/drupal-implementation-decisions/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Drupal LAMP Server Tuning (part 6)</title>
		<link>http://cruncht.com/89/drupal-lamp-server-tuning</link>
		<comments>http://cruncht.com/89/drupal-lamp-server-tuning#comments</comments>
		<pubDate>Mon, 01 Feb 2010 09:59:41 +0000</pubDate>
		<dc:creator>Murray Woodman</dc:creator>
				<category><![CDATA[Drupal Planet]]></category>
		<category><![CDATA[Tech]]></category>
		<category><![CDATA[drupal]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://cruncht.com/?p=89</guid>
		<description><![CDATA[Getting the most from your Drupal site means getting the most from your server &#8211; optimizing the various layers of the LAMP stack. This includes the filesystem, database, web server, PHP, RAM and CPU. Tuning the LAMP stack is a major subject requiring a lot of study and practice to become proficient. It&#8217;s something [...]]]></description>
			<content:encoded><![CDATA[<a href="http://cruncht.com/89/drupal-lamp-server-tuning" title="Drupal LAMP Server Tuning (part 6)"><img src="http://cruncht.com/wp-content/uploads/2010/02/tune-150x150.jpg" alt="Tune" class="feed-image" title="Tune your Drupal LAMP stack" /></a><p>Getting the most from your Drupal site means getting the most from your server &#8211; optimizing the various layers of the LAMP stack. This includes the filesystem, database, web server, PHP, RAM and CPU. Tuning the LAMP stack is a major subject requiring a lot of study and practice to become proficient. It&#8217;s something you will probably never completely master <img src='http://cruncht.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  Try Googling <a href='http://www.google.com.au/search?q=lamp+performance+tune'>lamp performance tune</a> for a few articles to whet your appetite. For now, we&#8217;ll cover a few of the major considerations for Drupal, although most of this advice would apply to any PHP web app running on Linux.</p>
<p><span id="more-89"></span></p>
<dl class='more'>
<dt><a href='http://drupal.org/node/2601'>Server tuning considerations</a></dt>
<dd>Drupal documentation covering the basics.</dd>
<dt><a href='http://www.ibm.com/developerworks/linux/library/l-tune-lamp-1.html'>Tuning LAMP systems, Part 1: Understanding the LAMP architecture</a></dt>
<dd>Intermediate article covering LAMP.</dd>
<dt><a href='http://www.ibm.com/developerworks/linux/library/l-tune-lamp-2.html'>Tuning LAMP systems, Part 2: Optimizing Apache and PHP</a></dt>
<dd>Intermediate article covering Apache and PHP.</dd>
<dt><a href='http://www.ibm.com/developerworks/linux/library/l-tune-lamp-3.html'>Tuning LAMP systems, Part 3: Tuning your MySQL server</a></dt>
<dd>Intermediate article covering MySQL.</dd>
</dl>
<h2><a id='php-opcode-cache'>Opcode cache</a></h2>
<p>Opcode caches cache the compiled form of a PHP script in shared memory to avoid the overhead of parsing and compiling the code every time the script runs. This saves RAM and reduces script execution time.</p>
<p>Quite a bit of benchmarking has been done in the Drupal and PHP communities between <a href="http://php.net/manual/en/book.apc.php">APC</a>, <a href="http://eaccelerator.net/">eAccelerator</a> and <a href="http://xcache.lighttpd.net/">XCache</a>. eAccelerator may have the edge in raw performance, but it appears that APC is the preferred opcode cache in the Drupal community because it is well maintained and less buggy.</p>
<p class='summary'>All sites: faster and less RAM. Moderate install.</p>
<dl class='more'>
<dt><a href='http://buytaert.net/drupal-webserver-configurations-compared'>Drupal web server configurations compared</a></dt>
<dd>APC gives 2x to 4x increase in throughput under load. PHP5 is around 10% slower.</dd>
<dt><a href='http://2bits.com/articles/php-op-code-caches-accelerators-a-must-for-a-large-site.html'>PHP op-code caches / accelerators: Drupal large site case study</a></dt>
<dd>Op-code caches are a must for large sites serving many pages.</dd>
<dt><a href='http://2bits.com/articles/benchmarking-apc-vs-eaccelerator-using-drupal.html'>Benchmarking APC vs. eAccelerator using Drupal</a></dt>
<dd>eAccelerator is faster and smaller than APC. Both offer around 6x &#8211; 7x speedup over PHP.</dd>
<dt><a href='http://2bits.com/articles/high-php-execution-times-drupal-and-tuning-apc-includeonce-performance.html'>High PHP execution times for Drupal, and tuning APC for include_once() performance</a></dt>
<dd>Make sure apc.shm_size can fit the whole page else there will be no caching.</dd>
</dl>
<h2><a id='database'>Database</a></h2>
<p>There are a number of choices to be made when tuning your MySQL database server. The MySQLTuner script can be helpful for identifying outstanding issues you may be unaware of. It can be run on a functioning production server to see how your database is performing in the wild.</p>
<dl class='more'>
<dt><a href='http://blog.mysqltuner.com/'>MySQLTuner</a></dt>
<dd> Perl script which is able to report on the operation of your MySQL installation and offer suggestions as to what can be fixed.</dd>
<dt><a href='http://www.howtoforge.com/tuning-mysql-performance-with-mysqltuner'>Tuning MySQL Performance with MySQLTuner</a></dt>
<dd>Helpful tutorial.</dd>
</dl>
<h3><a id='myisam-innodb'>Storage Engine: InnoDB vs MyISAM</a></h3>
<p>A default install of Drupal 6 installs the DB tables as MyISAM. This will change in Drupal 7 with the default set to InnoDB. A Drupal 6 installation may well have some InnoDB tables as modules may create new tables in the InnoDB engine. Your installation may therefore be a mix between the two engines.</p>
<p>In many places on the web you will read statements such as &#8220;All high performance Drupal sites run InnoDB&#8221;. This is not necessarily so: there are some cases where MyISAM may still be preferred, although with recent changes to Drupal core the pendulum has swung to InnoDB as a sensible default.</p>
<p>The main differences between the engines are as follows:</p>
<ul>
<li>InnoDB is transactional (better integrity), MyISAM isn&#8217;t</li>
<li>InnoDB is more reliable (better recovery), MyISAM can be repaired</li>
<li>InnoDB has row level locking (better concurrency), MyISAM locks tables</li>
<li>InnoDB uses clustered indexes (faster access to data), MyISAM indexes just the keys</li>
<li>InnoDB has a bigger memory footprint</li>
</ul>
<p>In general, you would consider sticking with MyISAM if</p>
<ul>
<li>Memory footprint was an issue. If you have very big indexes which might only just fit into the key buffer then MyISAM could offer faster lookups.</li>
<li>Most activity is read only.</li>
</ul>
<p>InnoDB tables definitely should be used for all of the Drupal cache tables since this is where most contention is likely to occur.</p>
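<p>Converting the cache tables could be sketched as follows. This assumes a database called <code>drupal6</code>, no table prefix and the cache tables of a default Drupal 6 core install; contributed modules add their own, so it is worth checking the list with <code>SHOW TABLES LIKE 'cache%'</code> first. Each <code>ALTER TABLE</code> rebuilds the table, so take a backup and run it at a quiet time.</p>
<p><code><br />
        USE drupal6;<br />
        -- Assumed table list: default Drupal 6 core cache tables.<br />
        ALTER TABLE cache ENGINE = InnoDB;<br />
        ALTER TABLE cache_block ENGINE = InnoDB;<br />
        ALTER TABLE cache_filter ENGINE = InnoDB;<br />
        ALTER TABLE cache_form ENGINE = InnoDB;<br />
        ALTER TABLE cache_menu ENGINE = InnoDB;<br />
        ALTER TABLE cache_page ENGINE = InnoDB;<br />
        </code></p>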
<p>Finally, it must be noted that Drupal was written based on the MyISAM engine and as such many queries were not optimized for InnoDB. An unqualified SELECT COUNT(*) is particularly slow in InnoDB because it must scan the rows to calculate the count, whereas MyISAM keeps a stored row count. Many of these shortcomings have been removed in the Pressflow distribution and have since made their way back into core.</p>
<p class='summary'>All sites: InnoDB for less contention on cache<br />Most sites: InnoDB for everything else<br />Big unchanging sites: MyISAM faster reads less RAM</p>
<dl class='more'>
<dt><a href='http://tag1consulting.com/MySQL_Engines_MyISAM_vs_InnoDB'>MySQL Engines: MyISAM vs. InnoDB</a></dt>
<dd>InnoDB is a good fit for many cases and &#8220;in most cases, InnoDB is the correct choice for a Drupal site&#8221;. Very good comparison between the two engines.</dd>
<dt><a href='http://2bits.com/articles/mysql-innodb-performance-gains-as-well-as-some-pitfalls.html'>MySQL InnoDB: performance gains as well as some pitfalls</a></dt>
<dd>InnoDB does row level locking but lookup is slower for some slow queries. NB. Pressflow distribution fixes some slow InnoDB queries.</dd>
<dt><a href='http://www.mysqlperformanceblog.com/2007/01/08/innodb-vs-myisam-vs-falcon-benchmarks-part-1/'>InnoDB vs MyISAM vs Falcon benchmarks – part 1</a></dt>
<dd>Myth that MyISAM is faster than InnoDB in all cases.</dd>
<dt><a href='http://groups.drupal.org/node/35188'>Which Tables can be converted to InnoDB</a></dt>
<dd>High Performance discussion emphasizing that InnoDB should definitely be used for cache tables and complex joins in CCK if memory allows.</dd>
</dl>
<h3 id='mysql-configuration'>MySQL Configuration</h3>
<p>There are a number of MySQL config variables which must be tweaked to suit your data. It is impossible to specify one set of options to suit all sites. A few rules of thumb are offered below.</p>
<dl class='more'>
<dt><a href='http://www.databasejournal.com/features/mysql/article.php/3367871/Optimizing-the-mysqld-variables.htm'>Optimizing the mysqld variables</a></dt>
<dd>Clear article with some good rules of thumb for MySQL variables.</dd>
</dl>
<h4><a id='mysql-key-buffer'>Key buffer</a></h4>
<p>If you are running MyISAM tables then the key buffer is a very important variable to set. The key buffer stores table indexes in memory, allowing for fast lookups and joins. For large node, node_revisions and url_alias tables it is a must to have enough room to fit the indexes of these tables into memory, otherwise your site will be very slow on the most basic of operations: looking up nodes, titles and paths.</p>
<p>One rule of thumb is to set this buffer to somewhere between 25% and 50% of the memory on the server. To determine the best value up front sum the size of all the .MYI files.</p>
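<p>If you would rather not add up the .MYI files by hand, roughly the same figure can be read from <code>information_schema</code>. This is a sketch that assumes the Drupal database is called <code>drupal6</code>; the second and third statements show the current setting and the counters from which a key cache miss ratio (<code>Key_reads / Key_read_requests</code>) can be worked out.</p>
<p><code><br />
        -- Total size of MyISAM indexes in MB (database name is an assumption).<br />
        SELECT ROUND(SUM(index_length) / 1024 / 1024) AS myisam_index_mb<br />
        FROM information_schema.TABLES<br />
        WHERE table_schema = 'drupal6' AND engine = 'MyISAM';<br />
        -- Current key buffer setting, in bytes.<br />
        SHOW VARIABLES LIKE 'key_buffer_size';<br />
        -- Key_reads (from disk) versus Key_read_requests (all lookups).<br />
        SHOW GLOBAL STATUS LIKE 'Key_read%';<br />
        </code></p>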
<p class='summary'>MyISAM sites: most queries faster. Essential.</p>
<dl class='more'>
<dt><a href='http://dev.mysql.com/doc/refman/5.1/en/server-system-variables.html#sysvar_key_buffer_size'>key_buffer_size</a></dt>
<dd>Documentation on the use of key_buffer_size.</dd>
</dl>
<h4><a id='mysql-query-cache'>Query cache</a></h4>
<p>MySQL has a query cache which stores results up to a certain size in memory. The cache is very handy for quickly returning commonly accessed data when all other forms of caching (reverse proxies, page cache, Drupal caches) have not been invoked. Queries which may take some time return almost instantly.</p>
<dl class='more'>
<dt><a href='http://www.databasejournal.com/features/mysql/article.php/3110171/MySQLs-Query-Cache.htm'>MySQL&#8217;s Query Cache</a></dt>
<dd>Covers config and operation of the query cache.</dd>
</dl>
<p>During the development and testing of a site the query cache can catch developers out since a query may appear to be performing quite well the second and subsequent times through. To really test a query you need to fire up mysql client (or phpmyadmin) and add the SQL_NO_CACHE option to the query to see the real time it takes. Don&#8217;t be fooled!</p>
<dl class='more'>
<dt><a href='http://dev.mysql.com/doc/refman/5.1/en/query-cache-in-select.html'>Query Cache SELECT Options</a></dt>
<dd>Documentation on the use of SQL_NO_CACHE.</dd>
</dl>
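<p>As a rough illustration, a query timed without the help of the query cache might look like this. The query itself is only an example against the Drupal 6 <code>node</code> table; use whichever query you are actually testing.</p>
<p><code><br />
        USE drupal6;<br />
        -- SQL_NO_CACHE goes straight after SELECT and bypasses the query cache.<br />
        SELECT SQL_NO_CACHE n.nid, n.title<br />
        FROM node n<br />
        WHERE n.status = 1<br />
        ORDER BY n.created DESC<br />
        LIMIT 10;<br />
        </code></p>
<p>Bear in mind that the key buffer and the operating system file cache may still be warm, so the first run after a restart can be slower again.</p>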
<p>Cached results for a table are invalidated whenever any row in that table is changed, so the cache cannot be relied upon if tables are changing frequently. The cache shines when there are big tables which don&#8217;t change that often. Unless your site has such characteristics it is best to limit it so that it fits small unchanging tables and then some for the most popular queries. Examination of cache hit rates will show you if it needs to be extended or reduced.</p>
<dl class='more'>
<dt><a href='http://dev.mysql.com/doc/refman/5.1/en/server-system-variables.html#sysvar_query_cache_size'>query_cache_size</a></dt>
<dd>Documentation on the use of query_cache_size.</dd>
</dl>
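<p>The hit rate can be estimated from the server status counters. A common approximation, offered here as a rule of thumb rather than an exact measure, is <code>Qcache_hits / (Qcache_hits + Com_select)</code>, since SELECTs answered from the cache are not counted in <code>Com_select</code>.</p>
<p><code><br />
        -- Query cache counters: hits, inserts, free memory, low memory prunes.<br />
        SHOW GLOBAL STATUS LIKE 'Qcache%';<br />
        -- SELECT statements which were actually executed (cache misses).<br />
        SHOW GLOBAL STATUS LIKE 'Com_select';<br />
        </code></p>
<p>A steadily climbing <code>Qcache_lowmem_prunes</code> suggests the cache is too small for what is being asked of it, while a low hit rate on a busy site suggests the tables change too often for the cache to help much.</p>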
<p class='summary'>All sites: common queries faster</p>
<h4><a id='innodb-buffer-pool'>InnoDB Buffer Pool Size</a></h4>
<p>If you are running InnoDB tables then it is essential to optimize the InnoDB Buffer Pool Size, increasing the memory to reduce query time. InnoDB is more memory intensive and so the pool will be larger than that used for MyISAM tables. MySQL documentation suggests that on a dedicated database server the size can be upped to 80% of physical memory. Any more could lead to swap issues.</p>
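<p>To get a feel for how big the pool needs to be, the approximate size of the InnoDB data and indexes can be compared with the current setting. This is a sketch that assumes a database called <code>drupal6</code>; the sizes reported for InnoDB are estimates.</p>
<p><code><br />
        -- Approximate size of InnoDB data plus indexes in MB.<br />
        SELECT ROUND(SUM(data_length + index_length) / 1024 / 1024) AS innodb_mb<br />
        FROM information_schema.TABLES<br />
        WHERE table_schema = 'drupal6' AND engine = 'InnoDB';<br />
        -- Current pool size, in bytes.<br />
        SHOW VARIABLES LIKE 'innodb_buffer_pool_size';<br />
        -- Innodb_buffer_pool_reads (from disk) versus Innodb_buffer_pool_read_requests.<br />
        SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';<br />
        </code></p>
<p>If disk reads are a large share of read requests on a warmed-up server, the pool is probably too small. In MySQL 5.1 the value is set in my.cnf and takes effect after a restart.</p>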
<p class='summary'>InnoDB sites: most queries faster. Essential.</p>
<dl class='more'>
<dt><a href='http://dev.mysql.com/doc/refman/5.1/en/innodb-parameters.html#sysvar_innodb_buffer_pool_size'>innodb_buffer_pool_size</a></dt>
<dd>Documentation on the use of innodb_buffer_pool_size.</dd>
</dl>
<h4>Other variables</h4>
<p>Other variables worth tweaking include the following. See <a href='http://www.databasejournal.com/features/mysql/article.php/3367871/Optimizing-the-mysqld-variables.htm'>Optimizing the mysqld variables</a> for more.</p>
<ul>
<li>table cache</li>
<li>sort buffer</li>
<li>read_rnd_buffer_size</li>
<li>tmp_table_size</li>
</ul>
<h3><a id='database-warmup'>Database Warmup</a></h3>
<p>A warm database will perform much better than a recently started one because its caches and buffers will be primed with keys and data. It therefore makes sense to warm up a DB every time the database is restarted. The best way to do this is to load in the indexes of commonly used tables. This guide recommends loading in node, node_revisions and url_alias. The taxonomy tables could be good candidates as well. Note that <code>LOAD INDEX INTO CACHE</code> applies to MyISAM tables only.</p>
<p><code><br />
        USE drupal6;<br />
        LOAD INDEX INTO CACHE node;<br />
        LOAD INDEX INTO CACHE node_revisions;<br />
        LOAD INDEX INTO CACHE url_alias;<br />
        LOAD INDEX INTO CACHE term_data;<br />
        LOAD INDEX INTO CACHE term_node;<br />
        </code></p>
<p>This SQL code can then be put in a script and run when MySQL restarts. It is possible to configure the <code>init_file</code> variable in my.cnf to tell mysql where to find the startup SQL.</p>
<p><code>init-file = /etc/mysql/init-file.sql</code></p>
<p class='summary'>Many nodes: Most queries where index relied upon.</p>
<dl class='more'>
<dt><a href='http://dev.mysql.com/doc/refman/5.0/en/index-preloading.html'>Index Preloading</a></dt>
<dd>How to use <code>LOAD INDEX INTO CACHE t1</code>.</dd>
<dt><a href='http://dev.mysql.com/doc/refman/5.1/en/server-system-variables.html#sysvar_init_file'>init_file</a></dt>
<dd>How to use <code>init_file</code> variable to specify startup SQL.</dd>
</dl>
<h2><a id='webserver'>Web server</a></h2>
<p>Apache + MPM Prefork + mod_php is the default web server configuration in the LAMP stack. This combination does consume large amounts of RAM which can be a problem for handling many requests. It can also be quite heavy and slow for serving static content. Many administrators have looked to replace it with other combinations including multithreaded processes (MPM Worker) and external PHP (mod_fcgid) as well as swapping it out completely for another server such as Nginx. This guide has adopted the position that Apache problems can be ameliorated somewhat by removing unneeded modules, running fcgid to connect with PHP and using MPM Worker to enable multithreading per process. However, in some cases this won&#8217;t be enough and Nginx is a must.</p>
<h3>Apache vs Nginx</h3>
<p>Other Drupal users have replaced Apache with faster, more lightweight (RAM and CPU) web servers such as <a href='http://nginx.org/en/'>Nginx</a> and <a href='http://www.lighttpd.net/'>Lighttpd</a>. Nginx is generally preferred over Lighttpd because of memory leaks in the latter. It is currently possible to run Nginx without losing any functionality in Drupal. Boost, a module based on .htaccess rules, now supports Nginx so it is feasible to run Nginx as the main web server. If you are constrained by CPU or have high loads then this certainly is an option worth considering.</p>
<p>Setting up Nginx is not trivial but it is reasonably straightforward if you are comfortable with compiling and patching. There are some good tutorials on the Web for users who want to do this.</p>
<p class='summary'>Low resources, High Traffic, Many logged in: Possible to get more for less with Nginx.</p>
<dl class='more'>
<dt><a href='http://www.joeandmotorboat.com/2008/02/28/apache-vs-nginx-web-server-performance-deathmatch/'>Apache vs Nginx : Web Server Performance Deathmatch</a></dt>
<dd>&#8220;Nginx seems to compete pretty well with Apache and there doesn’t seem like there is a good reason not to use it especially in CPU usage constrained situations (ie. huge traffic, slow machines and etc).&#8221;</dd>
<dt><a href='http://groups.drupal.org/node/20813'>In reply to kbahey: apache vs nginx </a></dt>
<dd>Discussion and results over the pros and cons of Nginx vs various Apache setups.</dd>
<dt><a href='http://www.allthepages.org/archives/2009/02/how-get-drupal-working-nginx'>How to get Drupal working with Nginx</a></dt>
<dd>Simple guide for installing and configuring Nginx on a server with only 256MB RAM. Uses FastCGI which may not be preferred method.</dd>
<dt><a href='http://interfacelab.com/nginx-php-fpm-apc-awesome/'>NGINX + PHP-FPM + APC = Awesome</a></dt>
<dd>&#8220;The following guide will walk you through setting up possibly the fastest way to serve PHP known to man&#8230;In this article, we’ll be installing nginx http server, PHP with the PHP-FPM patches, as well as APC.&#8221;</dd>
<dt><a href='http://php-fpm.org/'>PHP-FPM &#8211; A simple and robust FastCGI Process Manager for PHP</a></dt>
<dd>Preferred way of connecting Nginx with PHP. Currently in PHP core for 5.3.2+ but not yet released. Requires patch to PHP 5.2.</dd>
</dl>
<h3><a id='apache-unneeded-modules'>Apache: Unneeded modules</a></h3>
<p>It is possible to turn off unneeded modules in Apache to reduce memory footprint. This depends very much on your setup.</p>
<p class='summary'>All sites: Good savings in RAM</p>
<dl class='more'>
<dt><a href='http://groups.drupal.org/node/41320'>What Apache2 modules can be disabled?</a></dt>
<dd>Lists of modules which should be enabled in Apache2.</dd>
</dl>
<h3><a id='apache-threading'>Apache threading: MPM Worker (multi threaded) vs MPM Prefork</a></h3>
<p>The use of <a href='http://httpd.apache.org/docs/2.0/mod/worker.html'>MPM Worker</a> allows for the handling of more requests due to multithreading in each process. It has a smaller memory footprint than Prefork and is faster. According to docs, Apache must be compiled with the <code>--with-mpm</code> argument in order to install Worker as &#8220;prefork&#8221; is the default on Unix systems.</p>
<p class='summary'>RAM limited: Worker preferable to Prefork.</p>
<dl class='more'>
<dt><a href='http://httpd.apache.org/docs/2.0/misc/perf-tuning.html#compiletime'>Compile-Time Configuration Issues</a></dt>
<dd>&#8220;Choosing an MPM&#8221; section covers differences between the two models.</dd>
<dt><a href='http://httpd.apache.org/docs/2.0/mpm.html'>Multi-Processing Modules (MPMs)</a></dt>
<dd>Apache documentation on installation.</dd>
<dt><a href='http://ivan.gudangbaca.com/installing_apache2_and_php5_using_mod_fcgid'>Installing Apache2 and PHP5 using mod_fcgid</a></dt>
<dd>Hey, you don&#8217;t have to recompile Apache. Tutorial on how to install MPM Worker using apt-get with Apache2 on Ubuntu. Just the ticket.</dd>
<dt><a href='http://www.complich8.net/archives/404'>mpm-worker versus mpm-prefork, and mod_php versus fastcgi</a></dt>
<dd>PreFork and FastCGI is still a win if you find that Worker is unstable due to long downloads as this person did.</dd>
</dl>
<h3><a id='webserver-fastcgi'>Connectors: mod_php, FastCGI, mod_fcgid</a></h3>
<p>The use of mod_php with Apache is the most common setup for calling PHP. mod_php works by embedding PHP into every Apache process. This has the disadvantage of a large memory footprint for each Apache process. FastCGI and mod_fcgid overcome this problem and reduce resource utilization, although with no gains in performance.</p>
<p><a href='http://groups.drupal.org/user/327'>kbahey</a> <a href='http://groups.drupal.org/node/27174#comment-94376'>lists the disadvantages of mod_php</a> as follows:</p>
<ul>
<li>All PHP loaded into the process</li>
<li>Heavy process even if flat file</li>
<li>Many processes will hog RAM</li>
</ul>
<p class='summary'>Use mod_fcgid to lower memory use and the number of DB/network connections.</p>
<dl class='more'>
<dt><a href='http://buytaert.net/drupal-webserver-configurations-compared'>Drupal webserver configurations compared</a></dt>
<dd>The most common, Apache+mod_php is the slowest. Tests conducted with FastCGI which is faster. NB: FastCGI has subsequently suffered from stability issues.</dd>
<dt><a href='http://2bits.com/articles/apache-fcgid-acceptable-performance-and-better-resource-utilization.html'>Apache with fcgid: acceptable performance and better resource utilization</a></dt>
<dd>Informative article which comes out in favor of mod_fcgid over FastCGI and mod_php. This is the must read article if you wish to attempt fcgid.</dd>
<dt><a href='http://groups.drupal.org/node/44938'>Configure Apache for high performance on drupal 6</a></dt>
<dd>Some solid comments from kbahey from 2bits regarding stable setup: Apache, MPM Worker, fcgid, APC (code cache), memcache No SQL.</dd>
</dl>
<h3><a id='apache-maxclients'>Apache MaxClients</a></h3>
<p>The MaxClients parameter controls how many simultaneous clients Apache is able to serve. If it is set too high, RAM will be chewed up and the machine will go into swap. If it is set too low, your site will be unnecessarily limited in the number of clients it can serve. The value should be set after considering (i) how much spare RAM is available on the server and (ii) how much RAM each Apache process consumes. Obviously you will want to maximize available RAM through frugal allocation of RAM to MySQL, the JVM, etc. and minimize the size of each Apache process through the techniques described above.</p>
<p><a href='http://2bits.com/'>2bits</a> <a href='http://2bits.com/articles/tuning-the-apache-maxclients-parameter.html'>provide</a> the following formula:</p>
<p>        <code>MaxClients = (Total Memory - Operating System Memory - MySQL memory) / Size Per Apache process.</code></p>
<p>The only addition this guide would make is that it is important to leave some RAM free for the OS file buffer to allow efficient operation of the OS.</p>
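<p>As a rough illustration, the formula can be worked through in a few lines of Python. This is only a sketch: the function name, the <code>file_cache_mb</code> reserve and all of the memory figures are assumptions made for the example, not measurements from a real server.</p>

```python
def max_clients(total_mb, os_mb, mysql_mb, apache_process_mb, file_cache_mb=0):
    """Apply the 2bits formula, optionally reserving RAM for the OS file buffer."""
    if apache_process_mb <= 0:
        raise ValueError("apache_process_mb must be positive")
    available_mb = total_mb - os_mb - mysql_mb - file_cache_mb
    # Round down: a fraction of an Apache process is of no use.
    return max(available_mb // apache_process_mb, 0)


# Hypothetical 2 GB box: 256 MB for the OS, 512 MB for MySQL,
# 25 MB per Apache process (all figures are illustrative).
print(max_clients(2048, 256, 512, 25))        # 1280 // 25 = 51
# Same box, but keeping 256 MB back for the file system cache.
print(max_clients(2048, 256, 512, 25, 256))   # 1024 // 25 = 40
```

<p>The size of an Apache process varies with the modules loaded, so measure it on your own server before plugging numbers in.</p>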
<dl class='more'>
<dt><a href='http://2bits.com/articles/tuning-the-apache-maxclients-parameter.html'>Tuning the Apache MaxClients parameter</a></dt>
<dd>How to set MaxClients param.</dd>
</dl>
<h3 id='htaccess'>.htaccess</h3>
<p>If you are running Apache then it is possible to use either .htaccess or the Apache conf file to specify directives such as rewrite rules, etc. If you use .htaccess then Apache must look for .htaccess files in the directory hierarchy on every request. This can take some time even if no rules are found. You may consider putting the rules in httpd.conf/apache2.conf if you are looking to eke out the most performance from your site.</p>
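<p>To get a feel for the cost, the sketch below lists the .htaccess files that would be checked for a single request. It is a simplified model, not Apache code: it assumes overrides are allowed from the document root downwards, and the document root and request path are made up for the example.</p>

```python
from pathlib import PurePosixPath


def htaccess_candidates(document_root, request_path):
    """List the .htaccess files checked for one request.

    Simplified model: assumes overrides are enabled from the document
    root down, so every directory on the way to the file is checked.
    """
    root = PurePosixPath(document_root)
    target_dir = (root / request_path.lstrip("/")).parent
    candidates = [root / ".htaccess"]
    current = root
    for part in target_dir.relative_to(root).parts:
        current = current / part
        candidates.append(current / ".htaccess")
    return [str(path) for path in candidates]


# Hypothetical document root and request path.
for path in htaccess_candidates("/var/www", "/sites/default/files/logo.png"):
    print(path)
```

<p>In this model a single image request triggers four file system lookups before the image itself is read. Moving the rules into the main config and disabling overrides removes those lookups.</p>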
<p class='summary'>Performance critical: .htaccess can slow down a site, so move the rules into the Apache conf file.</p>
<dl class='more'>
<dt><a href='http://www.fubra.com/blog/2008/01/07/htaccess-vs-httpdconf/'>.htaccess vs httpd.conf</a></dt>
<dd>Evidence that .htaccess can slow a site by 6.6%.</dd>
</dl>
<h2><a id='ram'>RAM: A precious resource</a></h2>
<p>Given the above, serious thought should be given to how the RAM on your box is to be divided up. In a nutshell we have the following apps competing for their fair share:</p>
<ul>
<li>The JVM if you are running Solr</li>
<li>MySQL query cache and key buffers</li>
<li>Apache processes for client requests</li>
<li>PHP if it runs outside Apache</li>
<li>Memcached for holding Drupal caches</li>
<li>The file system cache</li>
</ul>
<p>Consider the following when deciding how to divide up your box:</p>
<ul>
<li>The JVM needs a certain amount or else Solr will crash.</li>
<li>MySQL really should have its indexes buffered for both MyISAM and InnoDB; use MySQLTuner to check. If they can&#8217;t fit then buy more RAM or (i) reduce MaxClients and (ii) forget about CacheRouter.</li>
<li>Apache MaxClients should be set to consume available RAM.</li>
<li>The file system cache needs to be big enough to allow smooth running of the system.</li>
</ul>
<hr />
<p>This article forms part of a series on Drupal performance and scalability. The first article in the series is <a href="http://cruncht.com/75/drupal-performance-scalability">Squeezing the last drop from Drupal: Performance and Scalability</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://cruncht.com/89/drupal-lamp-server-tuning/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Drupal Performance Out of the Box (part 5)</title>
		<link>http://cruncht.com/87/drupal-performance-out-of-the-box</link>
		<comments>http://cruncht.com/87/drupal-performance-out-of-the-box#comments</comments>
		<pubDate>Sun, 31 Jan 2010 09:57:34 +0000</pubDate>
		<dc:creator>Murray Woodman</dc:creator>
				<category><![CDATA[Drupal Planet]]></category>
		<category><![CDATA[Tech]]></category>
		<category><![CDATA[drupal]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://cruncht.com/?p=87</guid>
		<description><![CDATA[Drupal has a number of features which allow for performance to be improved without the addition of external modules or complicated configuration by administrators. A slow system can easily be turned into one which performs well under heavy loads.

Firstly, Drupal shortcuts the execution of unneeded code by using an internal caching system which stores [...]]]></description>
			<content:encoded><![CDATA[<a href="http://cruncht.com/87/drupal-performance-out-of-the-box" title="Drupal Performance Out of the Box (part 5)"><img src="http://cruncht.com/wp-content/uploads/2010/02/lunch_box-150x150.jpg" alt="Lunch Box" class="feed-image" title="Drupal performance out of the box" /></a><p>Drupal has a number of features which allow for performance to be improved without the addition of external modules or complicated configuration by administrators. A slow system can easily be turned into one which performs well under heavy loads.</p>
<p><span id="more-87"></span></p>
<p>Firstly, Drupal shortcuts the execution of unneeded code by using an internal caching system which stores the results of expensive routines. Secondly, Drupal improves page rendering time by allowing for the aggregation of static JS and CSS files. Thirdly, page output can be compressed, saving download time. Fourthly, Drupal allows for manual configuration of the caching of Blocks and Pages, which can improve performance significantly.</p>
<h2 id='core-caching'>Drupal Core caching system</h2>
<p>Drupal comes with a number of in-built caches which store the results of expensive calculations (strings) in the database so that they can be retrieved quickly later on. There are six caches enabled by default: cache, cache_block, cache_menu, cache_filter, cache_form and cache_page. Contributed modules are able to create their own caches for storing data, which is handy for module designers. This default cache system provides improved performance across the whole app.</p>
<p>The caching system is pluggable and allows for custom storage engines to be substituted in for the default database implementation. Later in this guide you will see how the Cache Router module is able to swap in memory based storage in place of the database, making the retrieval of cached content that much faster.</p>
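<p>The idea of a pluggable cache can be sketched in a few lines of Python. This is not Drupal&#8217;s API: the class and method names below are invented for illustration, and the dictionary backend merely stands in for whichever storage engine (database or memory) is plugged in.</p>

```python
class DictBackend:
    """Stand-in storage engine. A database-backed or memory-backed
    class exposing the same two methods could be swapped in."""

    def __init__(self):
        self._items = {}

    def get(self, key):
        return self._items.get(key)

    def set(self, key, value):
        self._items[key] = value


class Cache:
    """Return a stored result if present, otherwise build and store it."""

    def __init__(self, backend):
        self.backend = backend
        self.misses = 0

    def fetch(self, key, build):
        value = self.backend.get(key)
        if value is None:
            # Cache miss: run the expensive routine once and keep the result.
            self.misses += 1
            value = build()
            self.backend.set(key, value)
        return value


def render_menu():
    # Placeholder for an expensive routine that produces a string.
    return "<ul><li>Home</li></ul>"


cache = Cache(DictBackend())
first = cache.fetch("menu:primary", render_menu)
second = cache.fetch("menu:primary", render_menu)
print(first == second, cache.misses)  # True 1
```

<p>Because the calling code only talks to <code>fetch</code>, the storage engine can be replaced without touching the code that uses the cache.</p>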
<p class='summary'>All sites: Improved performance across the app. No config.</p>
<h3 id='aggregate'>Aggregate and compress JS and CSS</h3>
<p>Drupal&#8217;s modular system means that pages can have a large number of CSS and JS includes, which results in a lot of client-server communication and slows down the page draw time. The problem can be alleviated by merging the files and then compressing them. This results in fewer includes and faster downloads. Up to 90% of download time can be attributed to downloading CSS, JS and images so it makes sense to aggregate and compress if possible.</p>
<p>During development it is advisable to keep this option turned off so that CSS and JS errors can be troubleshot.</p>
<p class='summary'>All users: Lower render times. One click config.</p>
<dl class='more'>
<dt><a href='http://developer.yahoo.com/performance/rules.html#num_http'>Minimize HTTP Requests</a></dt>
<dd>Yahoo place this at the top of their list of ways to reduce download time. 40%-60% of users are first time users so client side caching is no help to them. Fewer HTTP requests are.</dd>
</dl>
<h3 id='page-cache'>Page Cache for anonymous users</h3>
<p>Pages for anonymous users can be cached, meaning that a full build of the page isn&#8217;t necessary for each new request which comes in, providing you with savings in CPU and DB load as well as giving the user much faster response times. This is a massive win for your website, especially if the majority of your page requests are from anonymous users. Basically it can help you survive a Slashdotting. Drupal offers &#8220;aggressive&#8221; and &#8220;normal&#8221; options &#8211; normal page caching is recommended for most websites. Other options for Page Caching are discussed below.</p>
<p>During development it is advisable to keep this option turned off so that any changes to logic or design can be troubleshot.</p>
<p class='summary'>Anonymous users: Big wins in speed and CPU. One click config.</p>
<dl class='more'>
<dt><a href='http://buytaert.net/drupal-vs-joomla-performance'>Drupal vs Joomla: performance</a></dt>
<dd>An older article comparing Joomla and Drupal. Joomla faster on non-cached pages but caching makes Drupal win.</dd>
</dl>
<h3 id='block-cache'>Block Cache</h3>
<p>Enabling the block cache allows finer grained control over cached content. Caching blocks which don&#8217;t change frequently will enable speedups for logged in users who need a dynamic page built for them each request.</p>
<p class='summary'>Logged in users: Moderate wins. Easy to implement.</p>
<dl class='more'>
<dt><a href='http://lists.drupal.org/pipermail/documentation/2008-March/005949.html'>Drupal guide to caching</a></dt>
<dd>Covers the various database tables which store data for Drupal&#8217;s caching system.</dd>
</dl>
<h3 id='page-compression'>Page Compression</h3>
<p>It is also possible to enable page compression for the pages sent. This will reduce the page size by 50% or more depending on the page. Users requesting big pages on slower connections will love this. </p>
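<p>The effect is easy to try out with Python&#8217;s standard <code>gzip</code> module. The sample page below is invented for the example and is far more repetitive than a real page, so it flatters the result. Treat it as a demonstration of the technique rather than a benchmark.</p>

```python
import gzip


def compression_saving(html):
    """Return the fraction of bytes saved by gzipping the page."""
    raw = html.encode("utf-8")
    packed = gzip.compress(raw)
    return 1 - len(packed) / len(raw)


# Hypothetical page: 200 near-identical teasers, as on a long listing page.
teaser = "<div class='node'><h2>Title</h2><p>Some teaser text for a node.</p></div>"
page = "<html><body>" + teaser * 200 + "</body></html>"

print(f"{compression_saving(page):.0%} smaller")
```

<p>Real pages contain less repetition, so measure the saving on your own output.</p>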
<p class='summary'>Slow connections: Massive win. Easy to implement.</p>
<dl class='more'>
<dt><a href='http://www.mostlygeek.com/tech/how-to-make-drupal-run-85x-faster-in-5-minutes/'>How to make Drupal run 8.5x faster in 5 minutes…</a></dt>
<dd>Page cache provides a 3x speedup.</dd>
<dt><a href='http://blamcast.net/articles/speed-up-drupal'>How I Survived a 2300% Traffic Increase With Drupal</a></dt>
<dd>Demonstrates that these out of the box techniques can be very effective even for sites on shared hosting.</dd>
</dl>
<hr />
<p>This article forms part of a series on Drupal performance and scalability. The first article in the series is <a href="http://cruncht.com/75/drupal-performance-scalability">Squeezing the last drop from Drupal: Performance and Scalability</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://cruncht.com/87/drupal-performance-out-of-the-box/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
