holy crap, my world is flat!

This realization has crept up on me. Even though I’ve talked about the flatness of information structures before, I’m seeing a strong pattern emerge which is, well, kind of blowing my mind. Keep in mind, I am a serious structuralist. I spent endless hours as a kid constructing vast cities out of pillows and furniture. I’ve always been drawn to three-dimensional representations of how we think, knowing full well that I could never be satisfied with a simple hierarchy of knowledge. Later in life, when I heard about neural networks, I was very turned on. Everything could connect to anything. Everything does connect to everything.

Point clouds

The flattening of my world finally hit an interesting point as I began to work with Google Refine (soon to be relaunched as OpenRefine). I was trying to rework an Excel spreadsheet for Ying Chan’s graduate seminar on Food Security. We had found a very interesting dataset that maps food issues based on news reports. The spreadsheet was in Chinese, but I wasn’t so worried. I was probably more taken aback by the fact that I didn’t know all the names of China’s provinces in English, so what’s the big deal if I don’t know the Chinese? Besides, it’s all just a pattern. Anyway, the data was weird. I could see it immediately. The column that listed the location from each report was a multi-valued, comma-separated set of the regions in which the food issue occurred. My goal with this data was to produce a map using Google Fusion Tables. So before I did anything, I needed to sort out this weird column.

Now, I’ve always hated working with spreadsheets. This may have something to do with my lack of financial acumen, or might be a product of my early introduction to relational tables, but I’m pretty sure it is grounded in my deep need to create a structure – a multi-dimensional structure – from the information. Simply put, I wanted this data in a database. I wanted a relational database in at least second normal form, so that regions could be linked to incidents. But because of how I was planning on working (using Fusion Tables), I needed to keep this data all flat. I resigned myself to this substandard position and began to work with Google Refine.

This problem was rather trivial, once I could get past the non-trivial part: my own built-in world-view. I first had to fill all the empty fields, so that they wouldn’t be affected later. Then I simply had to split the multi-valued cells and fill down the empty cells that the split left behind. I now had a flat (and redundant, I kept muttering) structure. But it was ready to be worked with.
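For anyone who wants to see those three steps outside of Refine, here is a minimal sketch in plain Python. It is not what Refine does internally, just my approximation of the same sequence. The function name, the column names, the "(blank)" placeholder and the sample rows are all invented for illustration.

```python
PLACEHOLDER = "(blank)"  # hypothetical marker for cells that were empty in the source


def flatten(rows, multi_col, sep=","):
    """Return one row per value in multi_col, with the other columns repeated."""
    # Step 1: mark genuinely empty cells so the later fill-down cannot overwrite them.
    marked = [
        {k: (v if v.strip() else PLACEHOLDER) for k, v in row.items()}
        for row in rows
    ]

    # Step 2: split the multi-valued cell. Only the first new row keeps the other
    # columns; the rest are left empty, much like a spreadsheet after a split.
    split = []
    for row in marked:
        values = [v.strip() for v in row[multi_col].split(sep) if v.strip()]
        for i, value in enumerate(values):
            if i == 0:
                split.append({**row, multi_col: value})
            else:
                split.append({**{k: "" for k in row}, multi_col: value})

    # Step 3: fill down, copying the last seen value into each empty cell.
    last = {}
    for row in split:
        for key, value in row.items():
            if value == "":
                row[key] = last[key]
            else:
                last[key] = value
    return split


# Invented sample data: two incidents, the first spanning two regions.
rows = [
    {"incident": "incident A", "year": "2011", "regions": "Hebei, Beijing"},
    {"incident": "incident B", "year": "", "regions": "Zhejiang"},
]
flat = flatten(rows, "regions")
```

The result has one row per region, with "incident A" repeated on both of its rows. The empty year on "incident B" stays marked as blank instead of inheriting 2011 from the row above, which is the whole reason for filling the empty fields first.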

This whole time I was amazed at the power of Google Refine, as well as what is possible with both Drive and Fusion Tables. Because they all link together so neatly, they make manipulating and moving the data easy. And even though I was still struggling with the idea of working with spreadsheets, I began to understand how important this simple structure is.

Although I believe all humans like structure, and some, like me, like to form a deep understanding of the multitude of relationships that can exist, when it comes to simply doing, we love taking the straightforward path. That’s where we line up everything we know and we duplicate our data, so we can audit it and see it for what it is. We don’t want to have to juggle relationships in our minds when it comes to the work in front of us. Sure, there is a deep satisfaction in being able to handle such complex relationships, but not all of us are able to deal with that, day to day.

This flattening of our world – the one I write (and think) about with respect to content – is no different. But for some reason, I’ve been hanging on to the notion that it didn’t apply to data, to purer datasets. The reality is, it is the same.

flattening my world

Since moving to Hong Kong and thus becoming self-unemployed, I have had lots of time to think about how I have approached building content management systems. For the last 2 years, I’ve been working with my own CMS (Content First) on my own projects and with ExpressionEngine while working with Grist. These systems differ in how they store and manage content, but ultimately they are very similar. Both are built upon an RDBMS (MySQL) and both include some normalization (3NF). The db structure attempts to make the data accessible via built-in searches, but as we found at Grist, once the data grows large enough, simple queries are no longer quick and everything slows down – a true bottleneck.

While working at Grist, I was introduced to Apache Solr. I had heard of the indexing system, but I had never used it. It was very easy to use (although I never got to do any deep work on it). Solr is what runs the main search for the site, but much of the ExpressionEngine CMS uses other queries to get available data. So when trying to join across the content and user tables, we quickly found the site a little too slow to deal with. This forced us (as it does every other large site) to spend effort caching the rendered content, using some simple system like Memcached. Memcached is amazing and I’ll have more to write about that later. But what I found most interesting is that in a CMS like ExpressionEngine, once the data grows large enough, the simple RDBMS underneath doesn’t scale properly. I do understand that relational databases can handle quite large traffic and that is what they were meant to do. My issue, though, is with the nature of the CMS itself.
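To make the caching idea concrete, here is a minimal sketch of the usual pattern: look in the cache first, and only build the page when it is missing. This is not Grist’s setup and not a Memcached client. TTLCache, render_page and slow_build are hypothetical names, and the cache is just an in-process dictionary standing in for a real cache server.

```python
import time


class TTLCache:
    """A tiny in-process stand-in for a cache server such as Memcached."""

    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expiry timestamp, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: treat it as a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)


def render_page(page_id, cache, build):
    """Return cached HTML if present; otherwise build it once and cache it."""
    key = f"page:{page_id}"
    html = cache.get(key)
    if html is None:
        html = build(page_id)  # the slow path: joins, templates, and so on
        cache.set(key, html)
    return html


calls = []  # records how often the slow path actually runs


def slow_build(page_id):
    calls.append(page_id)
    return f"<h1>Page {page_id}</h1>"


cache = TTLCache(ttl_seconds=60)
first = render_page(7, cache, slow_build)
second = render_page(7, cache, slow_build)  # served from the cache
```

The second request never touches slow_build, which is the point: the expensive joins run once per expiry window instead of once per visitor.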

Most CMSes that I have worked with are built to hold documents. Because they are versatile systems, the type of content can change, but ultimately the content is all relatively flat. It’s a series of documents linked together by various indexes. So after seeing what Solr could offer, I began to realize that much of the access to the documents could, and perhaps should, be handled by a better indexing system. After all, the tables that hold the content include fields for date, author and category. If these key indexes were stored in Solr, loading the most recent documents would be quick and painless. And ultimately, accessing the data itself (the document) would be quick and simple.
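Here is a rough sketch of what I mean by indexing those key fields, written as a toy in-memory index in Python. It is only an illustration of the idea, not how Solr works inside. The DocumentIndex class and its methods are hypothetical, and I am assuming dates are stored as ISO strings so that they sort correctly as text.

```python
import bisect
from collections import defaultdict


class DocumentIndex:
    """Toy index over the date, author and category fields of flat documents."""

    def __init__(self):
        self.docs = {}                    # doc_id -> full document
        self.by_field = defaultdict(set)  # (field, value) -> set of doc_ids
        self.by_date = []                 # (date, doc_id) pairs, kept sorted

    def add(self, doc_id, doc):
        self.docs[doc_id] = doc
        for field in ("author", "category"):
            self.by_field[(field, doc[field])].add(doc_id)
        bisect.insort(self.by_date, (doc["date"], doc_id))

    def recent(self, limit=10, **filters):
        """Return up to `limit` doc_ids, newest first, matching every filter."""
        wanted = [self.by_field.get((f, v), set()) for f, v in filters.items()]
        hits = []
        for _, doc_id in reversed(self.by_date):
            if len(hits) >= limit:
                break
            if all(doc_id in ids for ids in wanted):
                hits.append(doc_id)
        return hits


index = DocumentIndex()
index.add(1, {"date": "2012-09-01", "author": "a", "category": "food"})
index.add(2, {"date": "2012-10-01", "author": "b", "category": "food"})
index.add(3, {"date": "2012-11-01", "author": "a", "category": "tech"})
newest_food = index.recent(limit=10, category="food")
```

Nothing here needs a join. The index answers "which documents?" and the document store answers "what is in them?", which is the split I keep coming back to.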

This idea, it would seem, is not new and is actively being used by large organizations. I’ve just finished watching this video by Mat Wall, lead software architect for the Guardian, describing why they are using MongoDB rather than a pure RDBMS. But for me, the cool part is the idea of Solr playing the main role of indexing all your documents. The way he phrases it is that the Guardian is using a Solr API. To access any data within the system, you are essentially making a Solr query. So now, the data rendered in a web page comes from a series of quick Solr queries, rather than complex joins across rigid relational database models. What a cool idea, and one that I expect will soon dominate most content management systems, given that they are so often document-driven, content-flexible systems.
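To give a feel for what "everything is a Solr query" might look like, here is a small sketch that only builds the query URL; it does not contact a server. The q, fq, sort, rows and wt parameters are standard Solr query parameters. Everything else is my assumption: the local host, the "articles" core name, and the date, author and category field names. It has nothing to do with the Guardian’s actual API.

```python
from urllib.parse import urlencode

# Hypothetical Solr core on a local install; adjust to your own setup.
SOLR_SELECT = "http://localhost:8983/solr/articles/select"


def recent_documents_url(category=None, author=None, rows=10):
    """Build a Solr select URL for the newest documents, optionally filtered."""
    params = [
        ("q", "*:*"),           # match every document
        ("sort", "date desc"),  # newest first (assumes a sortable date field)
        ("rows", str(rows)),
        ("wt", "json"),
    ]
    # Each fq narrows the result set without affecting relevance scoring.
    # Naive quoting: values containing double quotes would need escaping.
    if category:
        params.append(("fq", f'category:"{category}"'))
    if author:
        params.append(("fq", f'author:"{author}"'))
    return f"{SOLR_SELECT}?{urlencode(params)}"


url = recent_documents_url(category="food", rows=5)
```

A front page then becomes a handful of these URLs, one per block of content, each answered from the index.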

Be sure to watch this video:

http://www.infoq.com/presentations/Why-I-Chose-MongoDB-for-Guardian

mediaSpace

mediaSpace came together today.  it’s basically a new framework based upon gallerySpace.  i’ve copied a lot of ideas from other systems.  at least in form.  i love the tagging system from delicious.  i want to make sure all the media (images and files) are able to be tagged and served up with that tagging system.

what this gives me is a clear framework to develop any kind of content and tag it and offer it up based upon tags, collections (formal orderings) or directly.  and what media is, at this point, i don’t care.  please, don’t be offended.  i like content. really.  i just don’t seem to be producing much lately.  i’m too caught up in systems design and that causes me to ignore content.

anyway, the system is coming together.  alan and i are getting together tomorrow to bang on this.  we should be able to tie in some of the backend.  at least, i’m hoping we can decide upon the interface to it.