Hadoop, a Free Software Program, Finds Uses Beyond Search
BURLINGAME, Calif. — In the span of just a couple of years, Hadoop, a free software program named after a toy elephant, has taken over some of the world’s biggest Web sites. It controls the top search engines and determines the ads displayed next to the results. It decides what people see on Yahoo’s homepage and finds long-lost friends on Facebook.
It has achieved this by making it easier and cheaper than ever to analyze and access the unprecedented volumes of data churned out by the Internet. By mapping information spread across thousands of cheap computers and by creating an easier means for writing analytical queries, engineers no longer have to solve a grand computer science challenge every time they want to dig into data. Instead, they simply ask a question.
“It’s a breakthrough,” said Mark Seager, head of advanced computing at the Lawrence Livermore National Laboratory. “I think this type of technology will solve a whole new class of problems and open new services.”
Three top engineers from Google, Yahoo and Facebook, along with a former executive from Oracle, are betting it will. They announced a start-up Monday called Cloudera, based in Burlingame, Calif., that will try to bring Hadoop’s capabilities to industries as far afield as genomics, retailing and finance.
The core concepts behind the software were nurtured at Google.
By 2003, Google found it increasingly difficult to ingest and index the entire Internet on a regular basis. Adding to these woes, Google lacked a relatively easy-to-use means of analyzing its vast stores of information to figure out the quality of search results and how people behaved across its numerous online services.
To address those issues, a pair of Google engineers invented a technology called MapReduce that, when paired with the intricate file management technology the company uses to index and catalog the Web, solved the problem.
The MapReduce technology makes it possible to break large sets of data into little chunks, spread that information across thousands of computers, ask the computers questions and receive cohesive answers. Google rewrote its entire search index system to take advantage of MapReduce’s ability to analyze all of this information and its ability to keep complex jobs working even when lots of computers die.
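The split, ask and combine pattern described above can be illustrated with a minimal single-machine sketch. This is not Google's or Hadoop's actual code: the function names (`split_into_chunks`, `map_chunk`, `shuffle`, `reduce_counts`) are hypothetical, and the loop over chunks stands in for work that a real system would hand to many separate computers.

```python
from collections import defaultdict
from itertools import chain


def split_into_chunks(records, chunk_size):
    """Break a large list of records into small chunks (one per 'computer')."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk):
    """Map step: each chunk independently emits (word, 1) pairs."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Group the emitted values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(mapped_chunks):
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: combine the values for each key into one answer."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(records, chunk_size=2):
    chunks = split_into_chunks(records, chunk_size)
    # In a real cluster each chunk would be processed on a different machine;
    # here a plain loop simulates that.
    mapped = [map_chunk(chunk) for chunk in chunks]
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    queries = ["bmw dealer", "BMW price", "used cars"]
    print(word_count(queries))
```

Because each chunk is mapped without reference to any other chunk, a failed chunk could in principle be rerun on another machine without restarting the whole job, which is the property the article credits for keeping jobs alive when computers die.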
MapReduce represented a couple of breakthroughs. The technology has allowed Google’s search software to run faster on cheaper, less-reliable computers, which means lower capital costs. In addition, it makes manipulating the data Google collects so much easier that more engineers can hunt for secrets about how people use the company’s technology instead of worrying about keeping computers up and running.
“It’s a really big hammer,” said Christophe Bisciglia, 28, a former Google engineer and a founder of Cloudera. “When you have a really big hammer, everything becomes a nail.”
The technology opened the possibility of asking a question about Google’s data — like what did all the people search for before they searched for BMW — and it began ascertaining more and more about the relationships between groups of Web sites, pictures and documents. In short, Google got smarter.
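A question like the BMW example might be expressed roughly as follows. This is only a sketch under stated assumptions: the log format (a dictionary mapping a user ID to that user's queries in time order) and the names `map_user_log` and `searches_before` are invented for illustration and do not reflect how Google stores or queries its data.

```python
from collections import Counter


def map_user_log(queries, target):
    """Map step: for one user's ordered queries, emit the query that came
    immediately before each search for the target term."""
    target = target.lower()
    pairs = []
    for previous, current in zip(queries, queries[1:]):
        if current.lower() == target and previous.lower() != target:
            pairs.append((previous.lower(), 1))
    return pairs


def searches_before(logs, target):
    """Reduce step: total the emitted pairs across all users."""
    counts = Counter()
    for queries in logs.values():
        for key, value in map_user_log(queries, target):
            counts[key] += value
    return counts


if __name__ == "__main__":
    # Hypothetical sample data: user ID -> queries in the order they were made.
    logs = {
        "u1": ["luxury cars", "BMW"],
        "u2": ["audi", "bmw", "luxury cars", "bmw"],
        "u3": ["weather"],
    }
    print(searches_before(logs, "BMW").most_common())
```

Each user's log is handled independently in the map step, so the work could be spread over many machines and only the small per-query tallies would need to be brought together at the end.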
The MapReduce technology helps do grunt work, too. For example, it grabs huge quantities of images — like satellite photos — from many sources and assembles that information into one picture. The result is improved versions of products like Google Maps and Google Earth.
Google has kept the inner workings of MapReduce and related file management software a secret, but it did publish papers on some of the underlying techniques. That bit of information was enough for Doug Cutting, who had been working as a software consultant, to create his own version of the technology, called Hadoop. (The name came from his son’s plush toy elephant, which has since been banished to a sock drawer.)
People at Yahoo had read the same papers as Mr. Cutting, and thought they needed to even the playing field with their search and advertising competitor. So Yahoo hired Mr. Cutting and set to work.
“The thinking was if we had a big team of guys, we could really make this rock,” Mr. Cutting said. “Within six months, Hadoop was a critical part of Yahoo and within a year or two it became supercritical.”
A Hadoop-powered analysis also determines what 300 million people a month see. Yahoo tracks people’s behavior to gauge what types of stories and other content they like and tries to alter its homepage accordingly. Similar software tries to match ads with certain types of stories. And the better the ad, the more Yahoo can charge for it.
Yahoo is estimated to have spent tens of millions of dollars developing Hadoop, which remains open-source software that anyone can use or modify.
It then began to spread through Silicon Valley and tech companies beyond.
Microsoft became a Hadoop fan when it bought a start-up called Powerset to improve its search system. Historically hostile to open-source software, Microsoft nevertheless altered internal policies to let members of the Powerset team continue developing Hadoop.
“We are realizing that we have real problems to solve that affect businesses, and business intelligence and data analytics is a huge part of that,” said Sam Ramji, the senior director of platform strategy at Microsoft.
Facebook uses it to manage the 40 billion photos it stores. “It’s how Facebook figures out how closely you are linked to every other person,” said Jeff Hammerbacher, a former Facebook engineer and a co-founder of Cloudera.
Eyealike, a start-up, relies on Hadoop for performing facial recognition on photos while Fox Interactive Media mines data with it. Google and I.B.M. have financed a program to teach Hadoop to university students.
Autodesk, a maker of design software, used it to create an online catalog of products like sinks, gutters and toilets to help builders plan projects. The company looks to make money by tapping Hadoop for analysis on how popular certain items are and selling that detailed information to manufacturers.
These types of applications drew the Cloudera founders toward starting a business around Hadoop.
“What if Google decided to sell the ability to do amazing things with data instead of selling advertising?” Mr. Hammerbacher asked.
Mr. Hammerbacher and Mr. Bisciglia were joined by Amr Awadallah, a former Yahoo engineer, and Michael Olson, the company’s chief executive, who sold an open-source software company to Oracle in 2006.
The company has just released its own version of Hadoop. The software remains free, but Cloudera hopes to make money selling support and consulting services for the software. It has only a few customers, but it wants to attract biotech, oil and gas, retail and insurance customers to the idea of making more out of their information for less. The executives point out that things like copies of the human genome, data on oil reservoirs and sales data require immense storage systems.