Monday, December 12, 2011
Jeremy Howard is not a data scientist. Except that, well, he is.
At the University of Melbourne, he studied philosophy. Then he tackled the metaphysics of business operations, spending the better part of a decade with management consulting outfits AT Kearney and McKinsey & Company. And then he founded, built, and sold off two startups, including one that hosted e-mail services. He didn’t realize he was a data scientist until he stumbled onto Kaggle.
Kaggle bills itself as an online marketplace for brains. Over 23,000 data scientists are registered with the site, including Ph.D.s spanning 100 countries, 200 universities, and every discipline from computer science, math, and econometrics to physics and biomedical engineering. Companies, governments, and other organizations come to the site with data problems — problems that require the analysis of large amounts of information — and the scientists compete to solve them. Sometimes they compete for prize money, sometimes for pride, and sometimes just for the thrill. “We’re making data science a sport,” reads the site’s tagline.
After selling his two startups, Jeremy Howard needed a way to pass the time, so he signed up with Kaggle and went head-to-head with all those Ph.D.s from the likes of Harvard and MIT. “I was looking for an intellectual challenge,” he tells Wired.com. “I thought I should give it a go and I try to see if I could not come last.” Surprising even himself, he not only held his own, he rose to the top of the heap, taking first prize in multiple competitions.
“He is not a data scientist per se. He’s sort of self-taught. But he is probably one of the top minds in data science in the world,” says Momchil Georgiev, a data analyst with the National Oceanic and Atmospheric Administration who competes on Kaggle in his spare time.
Howard no longer vies for prize money at Kaggle. In February, he joined the company as president and chief scientist. “They don’t let me win,” he jokes on his LinkedIn profile. “Apparently, the fact I can look up the answers is considered potential cheating.” But his story is indicative of the way Kaggle democratizes data science, bringing the world’s top data minds to one place — regardless of their nationality, their field of study, or even their credentials.
As so many Silicon Valley startups and big-name IT outfits urge businesses to adopt Hadoop and other software platforms meant to analyze massive amounts of data, Kaggle is simply crowd-sourcing the problem. And Howard questions why you would do it any other way. “I find the Hadoop fascination curious,” he says. “For me, solving these problems is about great creativity, great open-mindedness, prototyping, many iterations. Hadoop doesn’t do any of that.”
Kaggle Plays Nostradamus
Kaggle is a way of predicting the future. In launching a competition on the site, the average business is looking to anticipate certain outcomes based on an existing collection of data. Data scientists call it “predictive modeling.” Carvana, a Phoenix, Arizona-based outfit, recently launched a competition that seeks to determine whether a used car can be refurbished for re-sale on the web.
“We have a fair amount of data about the cars we have purchased in the past and then the ultimate outcome of whether we were able to get it through the production process or not,” says William Adams, the company’s head of analytics. “We want analytics models that can tell us what cars are going to require the least amount of expenses when we repair them.”
In similar fashion, the Allstate insurance company ran a competition to predict injury liability after a car accident, and a British outfit called Dunnhumby asked scientists to tell them when shoppers were likely to return to the supermarket and how much they’re likely to spend. But other competitions take a slightly different bent. Earlier this year, the British Royal Astronomical Society, NASA, and the European Space Agency sponsored a competition that sought to build better algorithms for mapping dark matter, that mysterious substance that may account for as much as a quarter of our universe.
Scientists were given slightly blurred images of more than 100,000 galaxies (dark matter distorts images of space by bending the light that hits it), and they were asked to recreate the shapes of these star systems.
That may seem like a rather specialized task, but like so many Kaggle competitions, it’s about the data, not the field of study. David Kirkby — a professor at the University of California, Irvine who ended up winning the competition, together with Daniel Margala, a graduate student at the university — calls the dark matter contest a “general problem.” Kirkby isn’t an astronomer. He’s a particle physicist. “I work at the opposite end of the spectrum: really small microscopic stuff,” he tells Wired. “This was an opportunity to work on a problem involving very big stuff.”
In the earliest days of the competition, it was a glaciologist, someone who studies ice, who turned the study of dark matter on its head. After only a week, Martin O’Leary, a glaciology Ph.D. student at Cambridge, proposed an algorithm that outperformed those commonly used to map dark matter, according to Jason Rhodes, an astrophysicist at NASA’s Jet Propulsion Laboratory. “Chalk another one up for the power of crowd-sourcing,” Rhodes said in a blog post at the time.
Hadoop and other “Big Data” software platforms promise to reinvent the modern business by crunching vast amounts of data. But according to a recent study from McKinsey & Company — Jeremy Howard’s old firm — such platforms are only as powerful as the minds who actually put them to use. “One of the key restraints is having the types of talent — the people — who are able to drive insight from large amounts of data,” McKinsey’s Michael Chui tells Wired. “When we talk to companies that use Big Data analytics, they talk about how difficult it is to find that talent.”
Howard is all too happy to paint Kaggle as a solution to this problem. The site pools data minds that wouldn’t ordinarily come together. “There aren’t too many opportunities that bring together people that have expertise in working with large datasets. We tend to all be pigeonholed into particular research sets,” says David Kirkby. “Kaggle does a good job of cleaning up the problems to the point where, if you understand data, you can really contribute.”
One Laptop Per Genius
The added irony is that Kaggle’s data scientists don’t even use Hadoop. Hadoop is an open source platform that runs across clusters of thousands of servers, but for the most part, Kaggle’s scientists solve their problems using a single machine. Momchil Georgiev uses his home desktop, with help from the SQL Server database and R, the open source data analytics language. Jeremy Howard works much the same way.
In part, this is because Kaggle works to limit the size of the datasets used in its competitions. But both Georgiev and Howard argue that with even the largest data problems, you don’t need an entire dataset to find a solution. “As a general rule, if more data is available, you will have a better prediction, but you don’t need the whole data set for this,” Georgiev says. “In fact, what’s been proven with Kaggle is that sometimes the entire dataset is either not necessary or even a hindrance. What’s required is a little bit of imagination and the ability to look into the dataset and deduce what the relationships are between the various data points.”
What’s more, Kaggle is a relatively cheap way to solve your problems. Adams and Carvana put up $10,000 in prize money for their used-car challenge. For the dark matter contest, NASA put up none. It offered an iPad and a free trip to the California Institute of Technology, where the winners could formally present their solutions to NASA. And then there are added perks. “The glaciologist has become quite well known because of this,” says Howard.
Many scientists compete just for fun. “The prizes are relatively small. You’re doing it for the challenge. And the glory,” Kirkby says, with a bit of a wink. The competitions also foster a certain camaraderie. “You get a community of people working together,” he says. “You’re just enjoying learning from each other and what everyone brings from their own background.” But with Kaggle keeping a leaderboard for each competition as contestants submit answers, they also spark good, old-fashioned rivalry.
“I get that certain feeling when someone takes over on the leaderboard,” says Georgiev. “I’m thinking: ‘What do they know that I don’t?’ And I push harder.”
It is indeed a sport. But in pushing harder, Georgiev adds, scientists can only improve the solution to the problem at hand. Hadoop has its place. But pride isn’t something you’ll find in a server. At least not yet.