Semantic web development and publishing

Drupal Implementation Decisions (part 7)

Poor performance often comes down to decisions made during development, before you were aware of the natural limitations of the system or the way Drupal works. The more you poke about, the more you will understand. Here’s a grab bag of things which may bite you.

Search

The search which comes built into Drupal has long been regarded as unsatisfactory for larger sites:

  • failure to index large sites efficiently
  • slow at returning results
  • relatively limited feature set compared to dedicated search solutions

Part of the problem lies with the fact that standard relational databases such as MySQL and interpreted languages such as PHP are not well suited to handle the large indexes and filtering required for big datasets. Search can also be an intensive process if the corpus is large and the traffic high. Being able to move search off the main box will give more resources to Drupal to do other things. Both of the following solutions can help solve these issues.

Solr

Enter the Apache Lucene project and Apache Solr.

Welcome to Lucene!
Lucene “provides Java-based indexing and search technology, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities.”
Welcome to Solr
“Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world’s largest internet sites”

The Apache Solr Search Integration module integrates Solr with Drupal. Installation requires a JVM and a servlet container such as Tomcat or Jetty, which is a little outside the usual LAMP configuration. Installation is not difficult, and good documentation is available. If this is beyond you then it may be worth checking out the Acquia Search hosted solution.

The beauty of Solr is that it is able to index massive datasets, millions if not hundreds of millions of nodes, and return results surprisingly quickly. Faceted search (taxonomy, language, CCK, author, content type) is its main selling point and is sure to impress users. It really is a product which will take your site to the next level if search is important to you. More importantly, Solr solves the crippling scaling problems of core search, as well as those of viewing taxonomy terms with many nodes.

Check out the following search for “Madonna” filtered by “Artistic Works” type and “English” language. Alternatively, here is Madonna filtered by “Person” type and “Italian” language. Italians do it better.
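A filtered search like the ones above boils down to a single HTTP request to Solr. As a rough sketch (not the module's actual code), the snippet below builds such a request using Solr's standard q, fq, facet and facet.field parameters. The server address and the field names "type" and "language" are assumptions for illustration; the real names depend on your Solr schema.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def solr_select_url(base_url, text, filters):
    """Build a Solr /select URL with one filter query (fq) per facet value."""
    params = [("q", text), ("wt", "json"), ("facet", "true")]
    for field, value in filters.items():
        # Quote the value so a multi-word facet value stays a single phrase.
        params.append(("fq", f'{field}:"{value}"'))
        # Ask Solr to return facet counts for the same field.
        params.append(("facet.field", field))
    return f"{base_url}/select?{urlencode(params)}"


url = solr_select_url(
    "http://localhost:8983/solr",  # assumed Solr location
    "Madonna",
    {"type": "Artistic Works", "language": "English"},  # hypothetical field names
)
print(url)
```

Each filter is sent as its own fq parameter, which lets Solr cache the filters independently of the main query.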

Solr is a natural fit for installation on another server because it runs as a web service over HTTP. This is a very easy way to take load off your main server, especially if search is a big part of your site. The disk requirements of the index and the Java memory requirements for Solr can be significant on big sites, so moving it off the main server may well be a necessity. My experience on Uriverse (a large number of smallish nodes) would suggest that each node takes around 200 bytes of RAM in the JVM. Your requirements could vary dramatically from this, but it may be helpful as a rough guide for those wanting to know about RAM consumption.
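To turn that rule of thumb into numbers, here is a back-of-the-envelope sketch. It assumes the 200 bytes per node figure observed on Uriverse holds for your site, which it may well not, and it covers JVM memory only, not the file system cache or disk space. The function name is made up for this example.

```python
BYTES_PER_NODE = 200  # rough figure observed on Uriverse; not a general constant


def estimate_solr_jvm_mb(node_count, bytes_per_node=BYTES_PER_NODE):
    """Estimate JVM RAM in MB for a given number of indexed nodes."""
    return node_count * bytes_per_node / (1024 * 1024)


for nodes in (1_000_000, 10_000_000, 100_000_000):
    print(f"{nodes:>11,} nodes: roughly {estimate_solr_jvm_mb(nodes):,.0f} MB")
```

Treat the output as a starting point for sizing only; measure your own index before committing to hardware.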

Large sites: Core search fails. Solr shines.

SolrPerformanceFactors
Solr documentation including optimization.
Scaling Lucene and Solr
Some very good notes on scaling Solr. The file system cache is as important as the JVM heap: a big index needs 2+GB of RAM on top of what the JVM uses.
how to reduce solr memory usage?
Smaller documents, fewer facets, no sorting.
How much disk space does optimize really take
Be careful of optimize – required disk space can double.

Another option for those looking to switch away from the built-in search is Google Custom Search, a service where Google indexes your data and stores the results on its servers. You are able to search the data via a simple form on your site. Google Custom Search is free for individuals who don’t mind ads with their results, with a business version starting at $100 per year. The Google Custom Search Module integrates it into Drupal by providing a block with the form.

Module bloat

Every module you add to a Drupal site consumes more RAM, reducing the number of simultaneous clients you can serve. Modules also consume extra CPU. As a site designer you need to be aware that every module added has some cost to the performance of your site. Is the extra functionality worth the performance cost?

Server indigestion: The Drupal contributed modules “open buffet binge” syndrome
A minimal install of Drupal has an Apache process of around 17MB to 33MB. A bloated system has one of 93MB, growing to 100MB after a few requests. This means fewer requests can be handled due to RAM consumption.
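The effect on concurrency is simple division. The sketch below assumes 2048MB of RAM is left over for Apache after MySQL and the operating system have taken their share. That figure is an assumption for illustration; the process sizes are the ones quoted above.

```python
RAM_FOR_APACHE_MB = 2048  # assumed RAM left for Apache; adjust for your server


def max_clients(ram_for_apache_mb, process_size_mb):
    """How many Apache processes of a given size fit in the available RAM."""
    return ram_for_apache_mb // process_size_mb


for label, size_mb in (("minimal (low)", 17), ("minimal (high)", 33), ("bloated", 100)):
    print(f"{label:>14}: {size_mb:>3}MB per process, "
          f"about {max_clients(RAM_FOR_APACHE_MB, size_mb)} clients")
```

A number like this is a reasonable upper bound for Apache's MaxClients setting; going above it risks pushing the server into swap.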

Feature rich sites: RAM wasted. Low max clients.

Node load

The loading of a node in Drupal can be a very quick/light or very slow/heavy process depending on the circumstances. It all comes down to (i) how much data is in the node and (ii) how fast that data comes back out of the database. If you have nodes with many CCK fields sitting in a database which hasn’t been able to cache all of the necessary indexes, a node load is something which you really want to avoid. Instead of lazily loading data when required a node load loads it all in at once! In extreme cases a node could take seconds to load.

This becomes a problem when you want to handle many nodes at once, such as when you display node teasers in a View. If you notice your page slowing down when displaying 50 teasers in a View, it is a good bet that the DB is getting hammered trying to load all that data just to display the title, description and url_alias! In these cases skip the teaser and build the View from fields instead. You should notice a big improvement.
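The difference can be seen in a toy model. This is not Drupal's real schema or node_load code, just an in-memory SQLite database with a node table and two hypothetical field tables. It counts the queries needed to list 50 titles via full loads versus one targeted query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (nid INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE field_body (nid INTEGER, value TEXT);
    CREATE TABLE field_tags (nid INTEGER, value TEXT);
""")
for nid in range(1, 51):
    conn.execute("INSERT INTO node VALUES (?, ?)", (nid, f"Node {nid}"))
    conn.execute("INSERT INTO field_body VALUES (?, ?)", (nid, "long body text"))
    conn.execute("INSERT INTO field_tags VALUES (?, ?)", (nid, "tag"))

counter = {"queries": 0}


def run(sql, params=()):
    """Execute a query and count it."""
    counter["queries"] += 1
    return conn.execute(sql, params).fetchall()


def full_node_load(nid):
    """Load everything about a node: one query per table, needed or not."""
    node = {"title": run("SELECT title FROM node WHERE nid = ?", (nid,))[0][0]}
    for table in ("field_body", "field_tags"):
        rows = run(f"SELECT value FROM {table} WHERE nid = ?", (nid,))
        node[table] = [row[0] for row in rows]
    return node


def teaser_listing(limit):
    """Teaser style: find the nids, then fully load each node."""
    nids = [row[0] for row in run("SELECT nid FROM node ORDER BY nid LIMIT ?", (limit,))]
    return [full_node_load(nid)["title"] for nid in nids]


def fields_listing(limit):
    """Fields style: fetch only the column that will be displayed."""
    return [row[0] for row in run("SELECT title FROM node ORDER BY nid LIMIT ?", (limit,))]


counter["queries"] = 0
teasers = teaser_listing(50)
teaser_queries = counter["queries"]

counter["queries"] = 0
titles = fields_listing(50)
field_queries = counter["queries"]

print(f"teaser listing: {teaser_queries} queries, fields listing: {field_queries} query")
```

Both listings return the same titles; only the amount of database work differs, and the gap grows with every extra field table on the node.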

Similarly, performing massive (millions of nodes) imports for new/updated nodes can be prohibitive as well. In these rare cases you need to skip the API and execute SQL on the DB directly.

Many Nodes, Big Nodes: Speed and RAM suffer

CCK design

It’s important to know how to work within the confines of a system to get the most out of it so it is well worth investigating how data is stored in the backend for nodes and CCK fields. Smart, sensible design of CCK fields should ensure that database access is kept to a sensible level. Whilst this guide recommends designing a data model first and then worrying about data access times second, it is worth knowing the consequences of your decisions.

In a nutshell, CCK will create another table in the backend for multi fields and fields which are shared between content types. There’s no avoiding the first reason but in the case of shared fields, a separate DB query will need to be run for each shared property. This can be an issue on very large sites on nodes with a lot of properties when you want to keep DB queries to a minimum.
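The storage rule can be written down as a small function. This is a sketch of the rule as described above for Drupal 6 CCK, not CCK's own code. The content_type_ and content_field_ table prefixes follow the D6 naming convention as far as I know, and the book fields are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    multiple: bool = False  # multi-value field
    shared: bool = False    # used by more than one content type


def storage_tables(content_type, fields):
    """List the tables holding a content type's field data.

    Single-value, unshared fields live together in the content type's own
    table; every multi-value or shared field gets a table of its own.
    """
    tables = []
    if any(not (f.multiple or f.shared) for f in fields):
        tables.append(f"content_type_{content_type}")
    tables.extend(f"content_field_{f.name}" for f in fields if f.multiple or f.shared)
    return tables


book_fields = [
    Field("isbn"),
    Field("author"),
    Field("country", shared=True),  # shared with other content types
    Field("tags", multiple=True),
]
for table in storage_tables("book", book_fields):
    print(table)
```

Every table in the list is another query (or join) when the node's data is fetched, which is why hundreds of shared properties on one content type get expensive.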

A couple of rules of thumb would be:

  • Try to keep the properties of a content type encapsulated to that particular content type. eg. you might be tempted to share book.author and film.screenwriter in a single CCK field. Unless you are going to be doing queries across both then it makes sense to store the properties separately with their content types.
  • If you are going to share properties between content types, then the more sharing that can be done the better. eg. a ‘geo-location’, ‘country’, ‘intended-audience’ are all possible candidates for stretching across several content types. This design is views friendly, aiding efficient queries across multiple content types.
  • It is probably better to lean towards designing content types which use taxonomy terms and have some null properties instead of creating too many ‘sub-class’ content types to do the job. eg. let’s say you have two possible types, ‘it-employee’ and ‘accounts-employee’, which share some properties (birthday, address) but not others (preferred-os). It would probably be better to add an employee-type taxonomy to do the sub-classing and make the unshared preferred-os property optional. This ensures easy filtering using taxonomy and allows for fast employee retrieval from a single table.

Complex Data Model: Inefficient data

Module development

When designing your own modules you should have an eye to efficient design and use caching the Drupal way. Using the Drupal cache mechanism means that the custom caches are available to Cache Router to store in RAM (if appropriate).

It is also possible to use the “static” variable keyword to cache the variable for that particular script.
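The two levels of caching work together. The sketch below shows the pattern in Python rather than PHP, purely to illustrate the flow. A dictionary stands in for Drupal's persistent cache, another for a PHP static variable, and build_expensive_data is a hypothetical slow computation. In a real module the persistent level would go through Drupal's cache_get() and cache_set().

```python
persistent_cache = {}  # stand-in for Drupal's cache tables (or a RAM backend)
request_cache = {}     # stand-in for a PHP static: lives only for one request
build_calls = {"count": 0}


def build_expensive_data(key):
    """Hypothetical slow computation, e.g. a heavy aggregate query."""
    build_calls["count"] += 1
    return f"data for {key}"


def get_data(key):
    # 1. Cheapest: already fetched during this request.
    if key in request_cache:
        return request_cache[key]
    # 2. Next: the persistent cache shared between requests.
    if key not in persistent_cache:
        # 3. Last resort: rebuild the data and store it for later requests.
        persistent_cache[key] = build_expensive_data(key)
    request_cache[key] = persistent_cache[key]
    return request_cache[key]


def end_request():
    """Statics vanish when the PHP script ends; the persistent cache survives."""
    request_cache.clear()


get_data("menu")
get_data("menu")  # served from the per-request cache
end_request()
get_data("menu")  # new request: served from the persistent cache
print(f"expensive builds: {build_calls['count']}")
```

The per-request level saves repeated cache lookups within a single page load, while the persistent level saves the rebuild across page loads.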

A beginner’s guide to caching data
Steps module developers can take to cache data the PHP/Drupal way.

Custom Modules: cache data the Drupal way

Keep up to date!

From time to time point releases are made to Drupal core. Keeping up to date with these releases will not only help you stay secure, it may also speed up your site. If you take a look at the SQL patches, many of them are ALTER statements which add indexes to the database, which could speed up slow queries on big sites. Further, code inefficiencies may be removed through better coding techniques.

Drupal Core updates: Better performance (especially for big sites) and stability.


This article forms part of a series on Drupal performance and scalability. The first article in the series is Squeezing the last drop from Drupal: Performance and Scalability.

2 Comments

  1. Posted April 13, 2011 at 8:01 am | Permalink

    Does it really make a difference if I share a CCK field across multiple content types? CCK creates a field_{name} table if it is shared or not, so there will be no improvement in efficiency by keeping them separate? And if you’re sharing fields then it’s because they contain the same type of content – so it’s fairly likely you’d use a view to pull that data regardless across multiple content types.

  2. Posted April 13, 2011 at 7:17 pm | Permalink

    In Drupal 6 a new table will be made if the field is a multi or is shared. In Drupal 7 a field table is made for fields no matter what. Personally, I was quite happy with the D6 design because it was closest to standard relational DB practices. It did lead to shifting schemas when field definitions changed and this was the cause of problems. However, that is worth living with IMO.

    The comments I made above which you have picked up on related to content types with a lot of properties (potentially hundreds). “This can be an issue on very large sites on nodes with a lot of properties when you want to keep DB queries to a minimum.” You really really, really want to be able to get that data back for a node without (i) doing joins or (ii) looking that data up again with a separate query.

    I ran into this problem when I was importing DBpedia (semantic version of Wikipedia) into Drupal 6. DBpedia has hundreds of property types for the various classes. During the data conversion I had to make prudent decisions about just what property types were worth importing. In this case it would be silly to have a property (shared field) used by 99% of instances in one class, but only 1% of instances in another. There is little point in slowing down access times to pick up one piece of data which will be missing for the majority of cases. In the end I opted for shared fields where there was good data density across classes and single fields where there was good data density within a class.

    All I was trying to say above was that there is no sense in smushing fields which are semantically similar into a single multifield. If there is a difference then keep it. If they are the same, then make a multi.

    Yes. This is an esoteric example and perhaps I made too big a deal of it in the article. Generally, I say model your data the best way you can. Drupal has very good tools for building schemas and you shouldn’t worry about stuff like this for the most part.

    Mind you, for D7 I have misgivings about the way fields have gone. With the current design it would be impossible to import (most of) DBpedia into D7 as I have done for D6 because of all the lookups or joins in Views. You would have hundreds (thousands?) of tables all supporting a different property! In this case, the fragmentation of data in MySQL does make a difference to efficiency. So, the scalability of MySQL has been reduced under D7 for large datasets using fields.

    HOWEVER, I think when you are pushing millions of rows and hundreds of properties it would be a sensible decision to switch to MongoDB or similar. You then get the benefit of fast lookup and multi indexed querying. I have yet to investigate that. I’m only a novice with this stuff but I’d say that a Mongo backend for nodes could be a fruitful avenue for the future for all medium to large sites given the way things have developed in MySQL.
