
Unpacking

28/10/2011

Last week I gave a presentation at the Access 2011 conference, and I have to admit: it wasn’t my best.  The conference was excellent, among the best I’ve ever been to, but I’ve remained frustrated by the feeling that I didn’t convey what I wanted to.  Given that, and a number of comments in the past few days asking for clarification, I feel compelled to sit down and at least clear my head of the details bouncing around.

The title of the talk was “Big Data in Libraries: Has Open Source’s Time Arrived?” (slides) — part of the problem is that this entails a lot of different, albeit related, ideas.  Let’s go through them one at a time, in the order that I gave the slides.

“Big data” is a buzzword.  So is “cloud”.

The point of my introduction was to remind people that a few years ago, “Web 2.0” was The Next Big Thing™.  Software vendors in particular LOVE to use these terms for marketing, and “big data” is no exception.  The same way that “Web 2.0” really meant “the social/interactive web”, “big data” really means: data that is large enough to be difficult to work with using familiar tools (for the purposes of my talk, I mention relational databases. More on that shortly). I wanted to make people aware that vendors are going to try to dazzle you with these buzzwords to get your money. Don’t let them.

Big data really isn’t about size.

The next thing I tried to do was give some context of how much space in bytes we’re talking about.  I only used MARC as a touchstone because it’s something every systems librarian is familiar with.  MARC was created in 1966, and the average MARC record today is the same size it was then: about 4KB.  But the cost and physical size of storing library data have both dropped astronomically, thanks to Moore’s Law.

In 1980 (sorry, I don’t have numbers for 1966), storing a million 4KB MARC records (3.8GB) cost about $170,000.  In 2011, it costs about $18 and fits on a thumbnail-sized microSD card.  A large library catalogue of about 2.5 million records (9.5GB) would easily fit on an iPhone.  WorldCat, the world’s largest union catalogue, contains 1.5 billion records: at about 5.6 TB, that’s less than $500 in storage.
(Yeah, yeah, more for infrastructure, redundancy, etc. but you get the point.)
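
The sizes above are easy to check. Here is a minimal sketch of the arithmetic, assuming "4KB" means 4,096 bytes and that GB and TB are the binary units (2^30 and 2^40 bytes); the function name is only for illustration:

```python
RECORD_BYTES = 4 * 1024  # assumed average MARC record size: 4 KB

GB = 1024 ** 3  # assumption: binary gigabyte
TB = 1024 ** 4  # assumption: binary terabyte

def storage_size(records, unit_bytes):
    """Space needed for `records` average-sized records, in the given unit."""
    return records * RECORD_BYTES / unit_bytes

print(f"1 million records:   {storage_size(1_000_000, GB):.1f} GB")
print(f"2.5 million records: {storage_size(2_500_000, GB):.1f} GB")
print(f"1.5 billion records: {storage_size(1_500_000_000, TB):.1f} TB")
```

The dollar figures depend on storage prices at the time, so they are not reproduced here.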

Big data really isn’t about the number of records.

At least, not in libraries.  The kinds of organizations that are dealing with really big data, like Facebook and Google, are processing tens or hundreds of billions of records, most of which are much larger than 4KB: we’re talking petabytes of data, often in near real time.  Libraries, by contrast, are dealing with collections of fewer than 3 or 4 million records at most.  Let me reiterate: our biggest data sets are ten thousand times smaller than those of the “big data” crowd.  Keep in mind too that, unlike at Google or Facebook, library data is read-only 99% of the time.

Big data isn’t really complex.

Again, not for libraries anyway.  But this is where we’ve fooled ourselves into thinking we have a problem.  If it’s not complex, then why is it cumbersome for a library catalogue to manage a million records?  The answer lies in how we represent a “record” in the tools with which we are familiar.

In 1970, Edgar Codd proposed the relational model of data storage.  By 1974, Codd and Raymond Boyce had refined the concept of normalization (which Ian Heath apparently described in 1971), and Boyce and Donald Chamberlin had introduced the SEQUEL (now SQL) language.  For the past 35 years, the relational model has been the paragon and de facto standard for data management.  But the relational model basically assumes all data are tabular: rows of values arranged under named columns.  MARC records, devised in the 1960s, have never been tabular data, whatever their other shortcomings.  They are hierarchical documents.
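
To make the mismatch concrete, here is a rough sketch of what happens when one nested record is forced into tabular form. The record is a simplified, MARC-like structure with made-up values, and the three "tables" are a hypothetical schema, not the one any particular ILS uses:

```python
# A simplified, MARC-like record: fields contain repeatable subfields.
# Tags follow MARC conventions, but the values are invented for illustration.
record = {
    "id": 1,
    "fields": [
        {"tag": "100", "subfields": [{"code": "a", "value": "Example, Author"}]},
        {"tag": "245", "subfields": [
            {"code": "a", "value": "An example title"},
            {"code": "b", "value": "with a subtitle"},
        ]},
        {"tag": "650", "subfields": [{"code": "a", "value": "Examples"}]},
    ],
}

def flatten(rec):
    """Split one nested record into rows for three hypothetical tables."""
    record_rows = [(rec["id"],)]
    field_rows, subfield_rows = [], []
    for field_pos, field in enumerate(rec["fields"]):
        field_rows.append((rec["id"], field_pos, field["tag"]))
        for sub_pos, sub in enumerate(field["subfields"]):
            subfield_rows.append(
                (rec["id"], field_pos, sub_pos, sub["code"], sub["value"])
            )
    return record_rows, field_rows, subfield_rows

record_rows, field_rows, subfield_rows = flatten(record)
print(len(record_rows), "record row,", len(field_rows), "field rows,",
      len(subfield_rows), "subfield rows")
```

Even this tiny record becomes eight rows across three tables, and the position columns exist only to remember an ordering that the document form carries for free.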

Similarly, links between library records are not normalized relations, but rather directed graphs.  This means that, in order to represent MARC records, authority records, and their relationships using relational databases, we have to transform them (using tools like object-relational mapping, for example).  I’m willing to wager that your ILS is actually using a relational database under the covers, if not something more proprietary and esoteric.
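
As an illustration of the "directed graph" point, here is a small sketch in which links between records are stored as an adjacency list and followed breadth-first. The record identifiers and links are hypothetical:

```python
from collections import deque

# Hypothetical links: each key points to the records it refers to.
links = {
    "bib:1": ["auth:smith", "subj:databases"],
    "bib:2": ["auth:smith"],
    "auth:smith": ["auth:smith-earlier-name"],
    "subj:databases": [],
    "auth:smith-earlier-name": [],
}

def reachable(start, graph):
    """Return every record reachable from `start` by following links, breadth-first."""
    seen = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in graph.get(node, []):
            if target not in seen:
                seen.append(target)
                queue.append(target)
    return seen[1:]  # drop the starting record itself

print(reachable("bib:1", links))
print(reachable("auth:smith", links))
```

The links only run one way: starting from the bibliographic record you reach its authority records, but starting from the authority record you do not get back to the bibliographic record unless a reverse link is stored too.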

In short, for the past 35 years, we have been pushing a square peg into a round hole.  To be fair, it has worked remarkably well, but historically, managing tens or hundreds of thousands of non-tabular records has been overwrought and cumbersome, not to mention slow (more on this shortly).

New tools can change how we work.

To date, we have been modifying our data to fit our tools.  The alternative is to modify our tools.  In the past ten years or so, there has been a rise in so-called NoSQL tools.  These are essentially data management tools that don’t implement the relational model and thus are not subject to the same limitations.  As it turns out, these NoSQL tools are the same ones that Google and Facebook are using, in large part because they are the ones who developed them.

Of course, they are really effective at dealing with large numbers of hierarchical documents and directed graph relations.  In particular, in the past year or two we have tools for object-document mapping (ODM) that replace the traditional object-relational mapping (ORM) approach.

This part of the presentation uses a couple of illustrations to show how the ICA-AtoM software scales when managing a set of 3.5 million archival descriptions.  The main point to take away is that, everything else being equal, an ODM scales much better (i.e. performs faster) than an ORM: up to 10x faster, with only a tenth of the memory footprint.  This is significant because it means we can potentially improve the ability of our systems to handle substantially more data (several hundred times more) by only changing the way the data is stored.

Use the right tool for the job.

But why does an ODM scale so much better?  My hypothesis (more of an educated guess) is that a NoSQL document database much more closely fits the shape of the hierarchical, graph-like data that libraries are dealing with.  We don’t have to transform our data every time we modify it or, more importantly, every time we read or search it. (99%, remember?)
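
The difference in the read path can be sketched in a few lines. This is only an illustration of the idea, not a benchmark and not how any NoSQL product is implemented: it uses Python's built-in SQLite module as a stand-in for both approaches, with a made-up record and a hypothetical schema.

```python
import json
import sqlite3

record = {"id": 1, "fields": [
    {"tag": "245", "subfields": [{"code": "a", "value": "An example title"}]},
    {"tag": "650", "subfields": [{"code": "a", "value": "Examples"}]},
]}

db = sqlite3.connect(":memory:")

# Document style: the whole record is stored in its original shape.
db.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")
db.execute("INSERT INTO documents VALUES (?, ?)",
           (record["id"], json.dumps(record)))

# Relational style: the record is split into rows (hypothetical schema).
db.execute("CREATE TABLE fields (record_id INTEGER, pos INTEGER, tag TEXT)")
db.execute("CREATE TABLE subfields (record_id INTEGER, field_pos INTEGER, "
           "pos INTEGER, code TEXT, value TEXT)")
for fpos, field in enumerate(record["fields"]):
    db.execute("INSERT INTO fields VALUES (?, ?, ?)",
               (record["id"], fpos, field["tag"]))
    for spos, sub in enumerate(field["subfields"]):
        db.execute("INSERT INTO subfields VALUES (?, ?, ?, ?, ?)",
                   (record["id"], fpos, spos, sub["code"], sub["value"]))

def read_document(record_id):
    """One lookup; the stored shape is already the shape we want."""
    (body,) = db.execute("SELECT body FROM documents WHERE id = ?",
                         (record_id,)).fetchone()
    return json.loads(body)

def read_relational(record_id):
    """Join the tables, then rebuild the hierarchy row by row."""
    rows = db.execute(
        "SELECT f.pos, f.tag, s.code, s.value FROM fields f "
        "JOIN subfields s ON s.record_id = f.record_id AND s.field_pos = f.pos "
        "WHERE f.record_id = ? ORDER BY f.pos, s.pos", (record_id,))
    fields = {}
    for fpos, tag, code, value in rows:
        field = fields.setdefault(fpos, {"tag": tag, "subfields": []})
        field["subfields"].append({"code": code, "value": value})
    return {"id": record_id, "fields": [fields[p] for p in sorted(fields)]}

# Both paths return the same record; only the work needed to get there differs.
print(read_document(1) == read_relational(1) == record)
```

Both functions return the same record, but the relational path has to join and reassemble on every read, which is the transformation step the hypothesis above is about.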

The contemporary approach is to use a relational database for storing (writing) data, and then a “shadow” index like Solr for searching (reading) data, because that is what each is good at.  But if you have a single tool that does both well, why wouldn’t you just use that?  Simplicity is good.

Here’s my soapbox moment: I think we have been so blinded by the dogma of SQL and the relational model, probably because it was the most widely-used solution, that we haven’t been actively pursuing other tools.  In fact, I’d contend that we haven’t even been open to considering other tools, even when they are: a) growing in popularity, b) relatively easy to use, and c) free.

It’s this last point that leads me to my conclusion, and, for those who were at Access, picks up on Peter’s more philosophical talk prior to mine.

Big data is no harder than small data; a.k.a. The Cloud Is A Lie

Like the tools we’re already using (Apache, MySQL, Solr, PHP, etc.), many of the best NoSQL tools are open source and/or free.  If you’re already developing for MySQL, then MongoDB is no harder to work with.  If you’re already deploying Solr, then if anything, ElasticSearch is much easier.  Yes, it means learning new things, and learning new things is difficult when you’re busy doing old things, but the payoff is tremendous, and the risk is minimal.

As I said from the start, vendors are going to continue to use these new ideas to convince you that they are able to do something you’re not.  They are going to claim that their “horizontally-distributed, web-scale cloud platform” can magically outperform and out-scale your current application.  And I’m sure it can.  But here’s the part they probably don’t want you to know: I’ll wager that, under the covers, they are using the same open source NoSQL tools that you can implement yourself.

This isn’t to say that software-as-a-service (SaaS) isn’t a viable business offering — Artefactual offers application hosting for organizations who want it, and I happen to think we do a good, competitive job at it.  But “the cloud” is not the same as SaaS.  You don’t have to put your library data into that Black Box in the Sky™ just because it runs slowly in MySQL.  You don’t have to buy a top-of-the-line $5000 HyperScale PowerEdge server because Solr is running out of memory.

Think different.  Scale out.  Build your own damn cloud.
