Friday, September 21, 2012

A brush with machine learning

by Kim Cobb
For the last two days I have been at the 2012 Climate Informatics workshop at the National Center for Atmospheric Research (affectionately referred to as NCAR). It was a wonderful collection of fellow paleoclimate geeks, including Cobb lab alum Julien Emile-Geay, mixed in with 20+ computer science and statistics nerds.

The goal of getting this diverse group of experts together was to probe the intersection of these fields, where novel or not-yet-developed techniques may lead to breakthroughs in climate science. A major motivation for this kind of venture is the crop of new federal initiatives aimed at tackling science and technology problems by leveraging "Big Data" (see NSF example here). Such initiatives recognize that the explosion of increasingly large and complex datasets presents unique challenges and opportunities. They call for fundamental advances in the computational and information sciences, advances that will enable the use of Big Data to solve some seriously stubborn problems, like those involved in climate change.
View of NCAR, nestled into the foothills of the Rockies.

Al Kellie, the Director of the Computational and Information Systems Laboratory at NCAR, put the Big Data problem this way: if you were to store the latest collection of climate model simulations run in support of the new IPCC assessment on 32GB iPads, you’d need a 6-mile-high wall of them stretching from Atlanta to Alaska! Our existing bag of data analysis tricks simply won’t work anymore.


So what’s a climate scientist to do? Adapt some whiz-bang tools from the computational and statistical folks to our questions. This basically involves writing algorithms to “teach” a computer how to extract the information you want from a huge archive of data. “Machine learning” is the term, and in this game, speed and accuracy are both of paramount importance. Whole papers have been written about a faster or more accurate algorithm for recognizing hand-written digits, something my 3-yr-old son Isaac can do in less than 1 second (if he so desires). I mean no offense to Tony Jebara here, who gave a shockingly lucid presentation on various techniques in machine learning (and bought me lunch!). His talk, along with the others, can be viewed here.
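
For the curious, here is a minimal sketch of what that looks like in practice: a few lines of Python (my own illustration, not anything Tony showed) that train a simple nearest-neighbor classifier on the small hand-written digits dataset bundled with the scikit-learn library, then check how often it gets the right answer.

```python
# Minimal digit-recognition sketch: "teach" a computer to read hand-written digits.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()  # ~1800 images, each an 8x8 grid of pixel intensities, with labels 0-9
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3)  # classify a digit by its closest labeled examples
clf.fit(X_train, y_train)                  # the "learning" step, from the training set
print(f"digit recognition accuracy: {clf.score(X_test, y_test):.3f}")
```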

One example from climate/comp sci hybrid Amy McGovern concerns the detection of tornadoes in an insanely high-resolution model (500m x 500m) of the atmosphere over Oklahoma. By amassing rules and relationships about tornadoes from her dozens of simulations of historical data (her “training set”), she can build a hierarchical decision “tree” (yes, this is a technical term) that the computer moves through to assess the risk of a future tornado given real-time atmospheric data inputs. Her computer has “learned” how to predict tornadoes with Big Data. Thankfully for Amy, Oklahoma has a wonderfully rich network of meteorological observing stations that informs her super-high-resolution tornado model. Lucky her. She probably has more data in that Oklahoma network than the entire NOAA NCDC paleoclimate repository, which covers the whole globe through all time!
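
To make the decision-“tree” idea concrete, here is a toy sketch in Python using scikit-learn's decision tree classifier. The predictor names and training data below are made up purely for illustration; Amy's actual training set comes from her high-resolution simulations and observations, not random numbers.

```python
# Toy decision-tree sketch for "tornado risk" prediction (all data are fabricated placeholders).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 1000
shear = rng.uniform(0, 80, n)     # hypothetical low-level wind shear (arbitrary units)
cape = rng.uniform(0, 5000, n)    # hypothetical convective available potential energy (J/kg)
helicity = rng.uniform(0, 1, n)   # hypothetical storm-relative helicity, scaled 0-1
X = np.column_stack([shear, cape, helicity])

# Fake "tornado occurred" labels: favor high shear plus high CAPE, with a little noise.
y = ((shear > 40) & (cape > 2500) | (rng.random(n) < 0.05)).astype(int)

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)  # build the decision "tree" from the training set
# Feed in new (made-up) atmospheric conditions; the tree walks its learned rules to a risk estimate.
print(tree.predict_proba([[55.0, 3200.0, 0.7]]))
```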

In my talk, I tried to convince the computational and statistics crowd that helping us out with our paleoclimate problems i) would provide much-needed constraints on key climate change uncertainties, and ii) would benefit enormously from the types of Big Data techniques that they are developing. You might not think of paleoclimate as a Big Data problem, and you’d be right: we still have relatively few reconstructions of past climate. But the latest IPCC model data archive contains upwards of 40 simulations of past climate variability, for the first time ever! The challenge for us is to develop efficient, objective, and well-suited analysis tools to perform paleodata-model intercomparisons on the road to better estimates of climate change uncertainties.
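
As a flavor of what such a tool might look like at its most basic, here is a bare-bones Python sketch that scores each member of a hypothetical 40-simulation ensemble against a single proxy record using correlation and root-mean-square error. Everything here is a random placeholder; a real intercomparison would also have to grapple with proxy uncertainties, dating errors, and forward-modeling of the proxy itself.

```python
# Bare-bones paleodata-model intercomparison sketch (synthetic placeholder data throughout).
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(850, 1850)                      # a common annual time axis (1000 years)
proxy = rng.standard_normal(years.size)           # stand-in for a reconstructed climate index
ensemble = rng.standard_normal((40, years.size))  # stand-in for 40 model simulations of that index

# Score each simulation against the proxy record: correlation and root-mean-square error.
corr = np.array([np.corrcoef(sim, proxy)[0, 1] for sim in ensemble])
rmse = np.sqrt(((ensemble - proxy) ** 2).mean(axis=1))

best = corr.argmax()
print(f"best-correlated simulation: #{best} (r = {corr[best]:.2f}, rmse = {rmse[best]:.2f})")
```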

Perhaps the most thought-provoking talk was by Steve Easterbrook, who looked at the evolution of several IPCC climate models over the last decades as growing lines of computer code, as well as the apportionment of code to each component of the model (atmosphere, ocean, land, etc.). He stressed that no single model can be everything to everybody, advocating instead an approach he called “Fitness for Purpose”. Some models were developed for weather prediction, while others were developed for long paleoclimate simulations. Some are easy to update and improve, while others are modified only rarely. The people-management structures of the modeling centers provide another distinguishing descriptor, one that likely projects onto the characteristics of the models themselves. It was a neat meta-analysis of the models, a rare glimpse at the models themselves rather than their collective output plotted out to 2100.

The NCAR supercomputer, "Bluefire". Wow.
I left feeling hopeful and inspired that there may be a path “from data to wisdom” (Ackoff, 1989, as quoted by Steve in his talk), as we risk drowning in bytes. We will clearly need students who are trained in climate science as well as computer science - a tall order. My graduate students struggle to learn Matlab in their first year, but I hope they recognize that it's for their own good!