Whenever climate scientists want to analyze data, they need to request it in its messy original format and clean it up before they can analyze it. That sort of work takes up valuable time, and so it makes sense that the federal government has started funding efforts to simplify the process.
Speaking at the 2013 Hadoop Summit in San Jose on Wednesday, NASA software developer Glenn Tamkin explained how he and one of his colleagues have been building a 34-node Hadoop cluster for NASA’s Center for Climate Simulation that can analyze slices of the data in response to end users’ queries. The new architecture could be handy for seeing how the data stacks up against other data sets used in the U.S. and in other countries.
Tamkin’s team has an 80 TB data set on its hands covering all kinds of information about climate and atmosphere: winds, clouds, humidity, air and water temperature and so on for the past three decades. The data includes observational information mostly collected from satellites as well as simulation data for filling in gaps. But it’s not continually streaming in; rather, it gets fed in once a year, Tamkin said. The data is already publicly available.
The developers have brought this data into the Hadoop Distributed File System and rely on all the scaled-out nodes to quickly compute sums, counts, averages, standard deviation and other measurements in MapReduce.
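As a rough illustration of how such statistics can be split across nodes, here is a minimal Python sketch of the map-and-reduce pattern. It is not NASA's code: the function names (`map_record`, `merge`, `run_job`) and the `(variable, value)` record layout are hypothetical, and a real Hadoop job would typically express the same steps as Mapper, Combiner and Reducer classes. The sketch assumes population (not sample) standard deviation and uses a well-known pairwise merge of count, mean and sum of squared deviations, so partial results from different nodes can be combined in any order.

```python
import math
from collections import defaultdict
from functools import reduce


def map_record(record):
    """Map step: turn one (variable, value) reading into a partial summary.

    The summary is (count, mean, m2), where m2 is the sum of squared
    deviations from the mean. For a single reading that is (1, value, 0.0).
    """
    variable, value = record
    return variable, (1, float(value), 0.0)


def merge(a, b):
    """Combine/reduce step: merge two partial summaries into one."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (n, mean, m2)


def run_job(records):
    """Simulate map, shuffle and reduce in a single process."""
    grouped = defaultdict(list)
    for record in records:
        key, partial = map_record(record)
        grouped[key].append(partial)  # "shuffle": group partials by key

    results = {}
    for key, partials in grouped.items():
        n, mean, m2 = reduce(merge, partials)
        results[key] = {
            "count": n,
            "sum": mean * n,
            "mean": mean,
            "std": math.sqrt(m2 / n),  # population standard deviation
        }
    return results


if __name__ == "__main__":
    # Made-up readings, purely for demonstration.
    readings = [("humidity", 40.0), ("humidity", 60.0), ("wind", 3.5)]
    for name, stats in run_job(readings).items():
        print(name, stats)
```

Because `merge` does not care how the partial summaries are grouped, each node can summarize its own slice of the data and only the small summaries need to travel over the network.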
While the MapReduce jobs don’t run as fast as he would like (one query Tamkin ran recently took two minutes to answer), the new Hadoop setup sounds like it would be a lot less trouble for scientists looking for basic information across many years.
NASA is now using the Cloudera Distribution for Hadoop for this work, although Tamkin said he’s not using every part of it; he would like to add more components for managing the cluster to further optimize the system. He also wants to develop a method for caching queries so they can run faster.
The project will end up serving data out of Hadoop through an API to scientists across government agencies and private organizations later this year, Tamkin said. And like the data itself, the API will become available to the general public, perhaps as soon as February 2014.