Friday, September 21, 2012

A brush with machine learning

by Kim Cobb
For the last two days I have been at the 2012 Climate Informatics workshop at the National Center for Atmospheric Research (affectionately referred to as NCAR). It was a wonderful collection of fellow paleoclimate geeks, including Cobb lab alum Julien Emile-Geay, as well as a mixture of 20+ computer and statistics nerds.

The goal of getting this diverse group of experts together was to probe the intersection of these fields, where novel or not-yet-developed techniques may lead to breakthroughs in climate science. A major motivation for this kind of venture is the set of new federal initiatives aimed at tackling science and technology problems that leverage "Big Data" (see NSF example here). Such initiatives recognize that the explosion of increasingly large and complex datasets presents unique challenges and opportunities. They call for fundamental advances in computational and informational sciences – advances that will enable the use of Big Data to solve some seriously stubborn problems, like those involved in climate change.
View of NCAR, nestled into the foothills of the Rockies.

Al Kellie, the Director of the Computational and Information Systems Laboratory at NCAR, put the Big Data problem this way:  if you were to store the latest collection of climate model simulations run in support of the new IPCC assessment on 32GB iPads, you’d need a 6-mile-high wall of them stretching from Atlanta to Alaska! Our existing bag of data analysis tricks simply won’t work anymore.

So what’s a climate scientist to do? Adapt some whiz-bang tools from the computational and statistical folks to our questions. This basically involves writing algorithms to “teach” a computer how to extract the information you want from a huge archive of data. “Machine learning” is the term, and in this game, speed and accuracy are both of paramount importance. Whole papers have been written concerning a faster or more accurate algorithm for the detection of hand-written numbers – something my 3-yr-old son Isaac can do in less than 1 second (if he so desires). I mean no offense here to Tony Jebara, who gave a shockingly lucid presentation on various techniques in machine learning (and bought me lunch!). His talk, along with the others, can be viewed here.

One example from climate/comp sci hybrid Amy McGovern concerns the detection of tornadoes in an insanely high-resolution model (500m x 500m) of the atmosphere over Oklahoma. By amassing some rules and relationships about tornadoes from her dozens of simulations of historical data (her “training set”), she can build a hierarchical decision “tree” (yes, this is a technical term) that the computer will move through in order to assess the risk of a future tornado given real-time atmospheric data inputs. Her computer has “learned” how to predict tornadoes with Big Data. Thankfully for Amy, Oklahoma has a wonderfully rich network of meteorological observing stations that inform her super-high-resolution tornado model. Lucky her. She probably has more data in that Oklahoma network than the entire NOAA NCDC paleoclimate repository that covers the whole globe through all time!
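For the curious, here is roughly what "learning a decision tree from a training set" can look like in a few lines of Python. This is only a toy sketch, not Amy's actual method: the two features (a wind shear value and an instability value), the numbers, and the labels are all made up for illustration, and the function names are my own. The code picks, at each branch, the single threshold that best separates "tornado" cases from "none" cases, then walks new inputs down the resulting tree.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means all one class)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Find the (feature index, threshold) that most reduces impurity, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must put something on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively grow the tree; leaves hold the majority label."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left],
                           depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right],
                            depth + 1, max_depth),
    }

def predict(tree, row):
    """Move through the tree, one yes/no question at a time, until a leaf."""
    while isinstance(tree, dict):
        go_left = row[tree["feature"]] <= tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree

# Entirely hypothetical training set: (wind shear, instability) per case.
TRAIN_ROWS = [
    (5.0, 500.0), (25.0, 400.0), (8.0, 3000.0), (10.0, 2500.0),
    (20.0, 2200.0), (30.0, 3500.0), (22.0, 2800.0), (18.0, 1000.0),
]
TRAIN_LABELS = [
    "none", "none", "none", "none",
    "tornado", "tornado", "tornado", "none",
]

tree = build_tree(TRAIN_ROWS, TRAIN_LABELS)
print(predict(tree, (28.0, 3000.0)))  # strong shear and high instability
print(predict(tree, (6.0, 3200.0)))   # high instability but weak shear
```

The "learning" is nothing more than the search inside best_split; with real data the hard parts are choosing meaningful features and having enough cases to train on.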

In my talk, I tried to convince the computational and statistics crowd that helping us out with our paleoclimate problems i) would provide much-needed constraints on key climate change uncertainties, and ii) would benefit enormously from the types of Big Data techniques that they are developing. You might not think of paleoclimate as a Big Data problem, and you’d be right. We still have relatively few reconstructions of past climate. But, the latest IPCC model data archive contains upwards of 40 simulations of past climate variability, for the first time ever! The challenge for us is to develop efficient, objective, and well-suited analysis tools to perform paleodata-model data intercomparisons on the road to better estimates of climate change uncertainties.  

Perhaps the most thought-provoking talk was by Steve Easterbrook, who looked at the evolution of several IPCC climate models over the last decades as growing lines of computer code, as well as the apportionment of code to each component of the model (e.g. atmosphere, ocean, land, etc). He stressed that no single model can be everything to everybody, advocating an approach he called “Fitness for Purpose”. Some models were developed for weather prediction, while others were developed for long paleoclimate simulations. Some are easier to update and improve, while some are modified only rarely. The way each modeling center organizes and manages its people provides another distinguishing descriptor, one that likely projects onto the characteristics of the model itself. A neat meta-analysis of models, and a rare glimpse at the models themselves, rather than their collective output plotted to 2100.

The NCAR supercomputer, "Bluefire". Wow.
I left feeling hopeful and inspired that there may be a path “from data to wisdom” (Ackoff et al., 1989, as quoted by Steve in his talk), as we risk drowning in bytes. We will clearly need students who are trained in climate science as well as computer science - a tall order. My graduate students struggle to learn Matlab their first year, but I hope they recognize that it's for their own good!


  1. Hi Kim: This was very interesting! Did the idea of top-down modelling ever come up in this session? Usually, when we build a model, we want to gain insight by building it "bottom-up": we populate the model with whatever data we have, interpolate between our data points in a great deal of detail, and iteratively tweak the model to history match the data. Only after all this effort can we query the model to see what may happen in the future under varying conditions. The problem with this approach is that it requires a huge amount of CPU, introduces a lot of false data into the model, and takes a long time to run. You are also limited by the biases that you introduced into your single model during the interpolation. The history matching rarely yields a unique solution, and whatever solution it does yield is constrained by those same biases. This can lead to a false sense of security (we can confuse precision with accuracy) and to conflict with people who have a different set of biases. The top-down approach is the opposite. Instead of building a single complex model, you build many simple models, introducing multiple interpretations of the data (i.e. knowingly and explicitly introducing bias!), and then run them multiple times to attempt history matching. The runs are very fast and can be tweaked many, many times to converge on a match. Outlier biases can be eliminated and, through multiple realisations, you converge on a set of assumptions which tend to work most of the time. Those models which work can be easily updated and maintained as new data arrives. It's a cool idea. We use this all the time on complex fields with many business partners, who often have very different concepts of a field's geology. Here is an abstract describing it:

  2. Hi Caro,
    Great to hear from you! I think the approach you describe is akin to the suite of models we call "models of intermediate complexity". These models, unlike the fully coupled high-resolution IPCC class models, usually make a variety of simplifications to the physics of the ocean-atmosphere-land system. What you might lose in detail you gain in speed - they run much faster, enabling the generation of many ensembles. Tweaks can more easily be made to these models, although in reality that is rarely done. More often than not, the models will be used to run in "paleoclimate" mode for many millennia, or consider a range of different climate forcings and their uncertainties, or to simply amass a large collection of simulations to get a handle on the signal:noise ratio in the model. I wish I could read the full abstract that you sent the link for, but it's behind a pay wall. If you have a hard copy, could you send it along? Thanks!