Over the years, I have had the privilege of working with some pretty damn good people. One of them has a PhD in database research and used to be a professor, but he was apparently so passionate and so good at teaching that he eventually quit academia to join the industry.
He did his PhD on XML databases, and even though XML databases have turned out to be completely useless nowadays, that doesn't mean a PhD in XML databases couldn't teach you anything (in fact, a good PhD could teach you quite a few useful things). One interesting thing I learned from him was the evolution of database technology, a story that traces back to an essay by Michael Stonebraker called What Goes Around Comes Around.
Michael Stonebraker is a big name who has been around in database research for a good 35 years, so you could easily expect him to be highly opinionated about a variety of things. The first few lines of the essay read like this:
In addition, we present the lessons learned from the exploration of the proposals in each era. Most current researchers were not around for many of the previous eras, and have limited (if any) understanding of what was previously learned. There is an old adage that he who does not understand history is condemned to repeat it. By presenting “ancient history”, we hope to allow future researchers to avoid replaying history.
Unfortunately, the main proposal in the current XML era bears a striking resemblance to the CODASYL proposal from the early 1970’s, which failed because of its complexity. Hence, the current era is replaying history, and “what goes around comes around”. Hopefully the next era will be smarter.
His work includes, among other things, PostgreSQL. Hence, after reading the essay, it became obvious to me why there are so many highly opinionated design decisions implemented in Postgres.
The essay is a very good read. You get to see how database technology evolved from naive data models to unnecessarily complicated ones, and then, thanks to a good mathematician named Edgar F. Codd, became far more beautiful and theoretically grounded. After a few alternatives came and went, a new wave of XML databases arrived, bringing back a lot of the old complications. (Along the way, you also get to see how Michael Stonebraker managed to sell several startups, but that isn't the main story – or is it?)
There are many interesting lessons learned. The two most interesting to me are:
- XML databases didn't take off because XML is exceedingly complicated, and there is no way to store and index it efficiently using our best data structures, like B-trees and the like.
When I studied XML databases, I was told that they failed because XML lacks a theoretical foundation like the relational model's. In retrospect, I think that isn't the main issue. The problem with XML is that it allows too much flexibility in the language, so implementing a good query optimizer for it is extremely difficult (a small illustration follows this list).
A bit more ironically, this is how Michael Stonebraker puts it:
We close with one final cynical note. A couple of years ago OLE-DB was being pushed hard by Microsoft; now it is “X stuff”. OLE-DB was pushed by Microsoft, in large part, because it did not control ODBC and perceived a competitive advantage in OLE-DB. Now Microsoft perceives a big threat from Java and its various cross platform extensions, such as J2EE. Hence, it is pushing hard on the XML and Soap front to try to blunt the success of Java
It sounds very reasonable to me. At some point around 2000–2010, I remember hearing things like "XML is everywhere in Windows." It would take someone like Microsoft pushing that hard to make it such a phenomenon. Once Microsoft started the .NET effort to compete directly with Java, the XML database movement faded away.
One thing Michael Stonebraker got wrong, though: in the essay, he said XML (and SOAP) was going to be the data exchange format of the future, but XML turned out to be overly complicated for this purpose too, and people ended up with JSON and REST instead.
- The whole competitive advantage of PostgreSQL was its UDTs and UDFs, something of a generalization of stored procedures. Stored procedures soon went out of fashion, though, because people realized it is difficult to maintain code in multiple places: partly in the application and partly in stored procedures inside the DBMS. However, the idea of bringing code close to data (instead of bringing data to code) is a good one, and it had a big influence on the Big Data movement (a sketch of the idea follows below).
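To make the first lesson concrete, here is a tiny illustration of my own (not from the essay), using only Python's standard library: the same fact can be encoded in completely different XML shapes, so any query, and by extension any optimizer or index, has to anticipate all of them.

```python
import xml.etree.ElementTree as ET

# The same fact ("this order costs 9.99") in two perfectly valid shapes:
# once as a nested element, once as an attribute.
doc_a = ET.fromstring("<order><item><price>9.99</price></item></order>")
doc_b = ET.fromstring('<order price="9.99"/>')

# A descendant query finds the value in doc_a but comes back empty for
# doc_b, where the fact hides in an attribute instead of an element.
print([e.text for e in doc_a.findall(".//price")])  # ['9.99']
print([e.text for e in doc_b.findall(".//price")])  # []
print(doc_b.get("price"))                           # '9.99'
```

A relational schema forces one shape up front, which is exactly what lets a B-tree index a column and an optimizer reason about a plan.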
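And for the second lesson, a minimal sketch of the UDF idea, using Python's built-in sqlite3 as a stand-in for Postgres (where you would write CREATE FUNCTION instead): the function is registered with the engine and runs next to the data during the scan, rather than every row being shipped out to application code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, celsius REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [("a", 21.5), ("b", 30.0), ("a", 19.0)])

# Register a user-defined function with the engine itself...
conn.create_function("fahrenheit", 1, lambda c: c * 9 / 5 + 32)

# ...so the computation happens inside the query, row by row, as the
# engine scans the table: the code is brought to the data.
for row in conn.execute("SELECT sensor, fahrenheit(celsius) FROM readings"):
    print(row)
```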
Speaking of Big Data, Stonebraker must have something to say about it. Anyone who is in Big Data should really see this talk if they haven't:
The talk presents a highly organized view of the various aspects of Big Data and how people have solved them (and of course mentions a few startups founded by our very own Michael Stonebraker).
He mentions Spark at some point. If you look at Spark nowadays, it's nothing more than an in-memory distributed SQL engine (for traditional business intelligence queries), along with a pretty good machine learning library (for advanced analytics). From a database point of view, Spark looks like a toy: you can't share tables, tables don't have indices, etc. But the key idea is still there: you bring computation to the data.
Of course, I don't think Spark wants to eventually become a database, so I am not sure whether it plans to fix those issues at all, but adding a catalog (i.e., data schemas) and supporting a reasonably full-fledged SQL engine were pretty wise decisions (a minimal sketch follows).
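Here is a minimal PySpark sketch of both decisions, assuming a local Spark installation: registering a table gives it an entry in the catalog (a name plus a schema), and the SQL engine then runs the query on the executors that hold the data's partitions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sketch").getOrCreate()

# A tiny DataFrame standing in for data spread across a cluster.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 36), ("carol", 29)],
    ["name", "age"],
)

# Registering the DataFrame creates a catalog entry: a table name
# plus a schema that the SQL engine can resolve.
df.createOrReplaceTempView("people")

# A traditional BI-style query; Spark ships this computation to the
# executors holding the partitions instead of moving the data.
spark.sql("SELECT name, age FROM people WHERE age > 30").show()

spark.stop()
```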
There are several other good insights about the Big Data ecosystem as well: why MapReduce sucks, what other approaches there are to the Big Volume problem (besides Spark), how to solve the Big Velocity problem with streaming, SQL, NoSQL, and NewSQL, why the hardest problem in Big Data is Variety, etc. I should have written a better summary of those, but you can simply check out the talk yourself.