flattening my world

Since moving to Hong Kong and thus becoming self-unemployed, i have had lots of time to think about how i have approached building content management systems. For the last 2 years, i’ve been working with my own CMS (Content First) on my own projects and with ExpressionEngine while working with Grist. Both of these systems are different in how they store and manage the content, but ultimately, they are both very similar. They are both built upon an RDBS (MySQL) and both include some normalization (3NF). The db structure attempts to make the data accessible via built-in searches, but as we found at Grist, once the data grows large enough, simple queries are no longer quick and everything slows down – a true bottleneck.

While working at Grist, I was introduced to Apache Solr. I had heard of the Indexing system, but i had never used it. It was very easy to use (although i never got to do any deep work on it). Solr is what runs the main search for the site, but much of the ExpressionEngine CMS uses other queries to get available data. So when trying to join across the content and user tables, we quickly found the site a little too slow to deal with. This forced us (as it does every other large site) to spend effort caching the rendered content, using some simple system like Memcache. Memcache is amazing and I’ll have more to write about that later. But what i found most interesting is how a CMS like ExpressionEngine, once the data grows large enough, the simple RDBS doesn’t scale properly. I do understand that relational databases can handle quite large traffic and that is what they were meant to do. My issue though, is with the nature of the CMS itself.

Most CMSes that i have worked with are built to hold documents. Because they are versatile systems, the type of content can change, but ultimately, they content is all relatively flat. It’s a series of documents linked together by various indexes. So after seeing what solr could offer, i began to realize that much of the access to the documents could and perhaps should be handled by a better indexing system. After all, the tables that hold the content, include fields for date, author, category. If these key indexes were stored in solr, loading the most recent documents would be quick and painless. And ultimately, the accessing of the data itself – the document would be quick and simple.

This idea it would seem is not new and is actively being used by large organizations. I’ve just finished watching this video by Mat Wall – lead software architect for the Guardian, describing why they are using mongoDB rather than a pure RDBMS. But for me, the cool part is the idea of a solr playing the main role of indexing all your documents. The way he phrases it is that the Guardian is using a solr API. To access any data within the system, you are essentially making a solr query. So now, all data being rendered in a web page is all a series of quick solr queries, rather than complex joins across rigid relational database models. What a cool idea and one that will soon dominate most content management systems, given that they are often document driven, content flexible management systems.

Be sure to watch this video:

http://www.infoq.com/presentations/Why-I-Chose-MongoDB-for-Guardian