Why Hadoop is the future of the database

Jeff Hammerbacher says that Facebook tried them all. And none of them did what the web giant needed them to do.

Hammerbacher is the Harvard-trained mathematician Facebook hired in 2006. His job was to harness all the digital data generated by Mark Zuckerberg’s social network — to make sense of what people were doing on the service and find new ways of improving the thing. But as the service expanded to tens of millions of people, Hammerbacher remembers, it was generating more data than the company could possibly analyze with the software at hand: a good old-fashioned Oracle database.

At the time, a long line of startups were offering a new breed of database designed to store and analyze much larger amounts of data. Greenplum. Vertica. Netezza. Hammerbacher and Facebook tested them all. But they weren’t suited to the task either.

In the end, Facebook turned to a little-known open source software platform that had only just gotten off the ground at Yahoo. It was called Hadoop, and it was built to harness the power of thousands of ordinary computer servers. Unlike the Greenplums and the Verticas, Hammerbacher says, Hadoop could store and process the ever-expanding sea of data generated by what was quickly becoming the world’s most popular social network.

Over the next few years, Hadoop reinvented data analysis not only at Facebook and Yahoo but at many other web services. And then an army of commercial software vendors started selling the thing to the rest of the world. Soon, even the likes of Oracle and Greenplum were hawking Hadoop. These companies still treated Hadoop as an adjunct to the traditional database, a tool suited only to certain types of data analysis. But now, that’s changing too.

On Monday, Greenplum — now owned by tech giant EMC — revealed that it has spent the last two years building a new Hadoop platform that it believes will leave the traditional database behind. Known as Pivotal HD, this tool can store the massive amounts of information Hadoop was created to store, but it’s designed to ask questions of this data significantly faster than you can with the existing open source platform.

“We think we’re on the verge of a major shift where businesses are looking at a set of canonical applications that can’t be easily run on existing data fabrics and relational databases,” says Paul Maritz, the former Microsoft exec who now oversees Greenplum. Businesses need a new data fabric, Maritz says, and the starting point for that fabric is Hadoop.