How and why LinkedIn is becoming an engineering powerhouse

Most LinkedIn users know “People You May Know” as one of that site’s flagship features — an omnipresent reminder of other LinkedIn users with whom you probably want to connect. Keeping it up to date and accurate requires some heady data science and impressive engineering, with data constantly flowing between LinkedIn’s various applications. When Jay Kreps started there five years ago, this wasn’t exactly the case.

“I was here essentially before we had any infrastructure,” Kreps, now principal staff engineer, told me during a recent visit to LinkedIn’s Mountain View, Calif., campus. He actually came to LinkedIn to do data science, thinking the company would have some of the best data around, but it turned out LinkedIn had an infrastructure problem that needed his attention instead.

How big? The version of People You May Know in place then was running on a single Oracle database instance — a few scripts and heuristics provided intelligence — and it took six weeks to update (longer if the update job crashed and had to restart). And that’s only if it worked. At one point, Kreps said, the system wasn’t working for six months.

When the scale of data began to overload the server, the answer wasn’t to add more nodes but to cut out some of the matching heuristics that required too much compute power.

So, instead of writing algorithms to make People You May Know more accurate, he worked on getting LinkedIn’s Hadoop infrastructure in place and built a distributed key-value database called Voldemort.

Since then, he’s built Azkaban, an open source scheduler for batch processes such as Hadoop jobs, and Kafka, another open source tool that Kreps called “the big data equivalent of a message broker.” At a high level, Kafka manages the company’s real-time data feeds, delivering hundreds of them to the applications that subscribe with minimal latency.
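
To make that publish-subscribe idea concrete, here is a minimal sketch using Kafka’s publicly documented Java client: one application publishes events to a named feed (a topic), and any number of other applications subscribe to it independently. The topic name “profile-views,” the consumer group “pymk-updater,” and the broker address are hypothetical stand-ins for illustration, not LinkedIn’s actual configuration.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FeedExample {
    public static void main(String[] args) {
        // Producer side: publish an event to a topic (a named feed).
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            // "profile-views" is a hypothetical topic name.
            producer.send(new ProducerRecord<>("profile-views", "member-123", "viewed member-456"));
        }

        // Consumer side: a subscribing application reads the same feed on its own schedule.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "pymk-updater"); // hypothetical consumer group
        consumerProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(Collections.singletonList("profile-views"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("%s -> %s%n", record.key(), record.value());
            }
        }
    }
}
```

The key property of this model is that producers and consumers never talk to each other directly: a new application can start consuming an existing feed without any change to the systems that publish it.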