Gavin Schmidt and Eric Steig
The last couple of weeks saw a number of interesting articles about archiving code – particularly for climate science applications. The Zeeya Merali news piece in Nature set the stage, and the commentary from Nick Barnes (of ClearClimateCode fame) proposed an ‘everything and the kitchen sink’ approach. Responses from Anthony Vejes and Stoat also made useful points concerning the need for better documentation and proper archiving. However, while everyone is in favor of openness, transparency, motherhood and apple pie, there are some serious issues that need consideration before the open code revolution really gets going.
It would help to start by being clear about what is meant by ‘code’. Punditry about the need for release of ‘all supporting data, codes and programmes’ is not very helpful because it wraps very simple things, like a few lines of Matlab script used to do simple linear regressions, along with very complex things, like climate model code, which is far more sophisticated. The issues involved in each are quite different, for reasons both scientific and professional, as well as organizational.
First, the practical scientific issues. Consider, for example, the production of key observational climate data sets. While replicability is a vital component of the enterprise, this is not the same thing as simple repetition. It is independent replication that counts far more towards acceptance of a result than merely demonstrating that given the same assumptions, the same input, and the same code, somebody can get the same result. It is far better to have two independent ice core isotope records from Summit in Greenland than it is to see the code used in the mass spectrometer for one of them. Similarly, it is better to have two (or three or four) independent analyses of the surface temperature station data showing essentially the same global trends than it is to see the code for one of them. Better that an ocean sediment core corroborates a cave record than that we get to look at the code that produced the age model. Our point is not that the code is not useful, but that this level of replication is not particularly relevant to the observational sciences. In general, it is the observations themselves – not the particular manner in which they are processed – that are the source of the greatest uncertainty. Given that fundamental outlook, arguments for completely open code are not going to be seen as priorities in this area.
By contrast, when it comes to developers of climate models, the code is the number one issue, and debugging, testing and applying it to interesting problems is what they spend all their time on. Yet even there, it is very rare that the code itself (much of which has been freely available for some time) is an issue for replication — it is much more important whether multiple independent models show the same result (and even then, you still don’t know for sure that it necessarily applies to the real world).
The second set of issues are professional. Different scientists, and different sciences, have very different paths to career success. Mathematicians progress through providing step by step, line by line documentation of every proof. But data-gathering paleo-climatologists thrive based on their skill in finding interesting locations for records and applying careful, highly technical analyses to the samples. In neither case is ‘code’ a particularly important piece of their science.
However, there are many scientists who work on analysis or synthesis that makes heavy use of increasingly complex code, applied to increasingly complex data, and this is (rightly) where most of the ‘action’ has been in the open code debate so far. But this is where the conflicts between scientific productivity at the individual level and at the community level are most stark. Much of the raw input data for climate analysis is freely available (reanalysis output, GCM output, paleo-records, weather stations, ocean records, satellite retrievals etc), and so the skill of the analyst lies in how they choose to analyse that data and the conclusions they are able to draw. Very often, novel methodologies applied to one set of data to gain insight can be applied to others as well. And so an individual scientist with such a methodology might understandably feel that providing all the details to make duplication of their type of analysis ‘too simple’ (that is, providing the code rather than carefully describing the mathematical algorithm) will undercut their own ability to get future funding to do similar work. There is certainly no shortage of people happy to use someone else’s ideas to analyse data or model output (and in truth, there is no shortage of analyses that need to be done). But to assume there is no perception of conflict between open code and what may be thought necessary for career success – and for the advancement of science, which benefits from a bit of competition for ideas — would be naïve.
The process of making code available is clearly made easier if it is established at the start of a project that any code developed will be open source, but taking an existing non-trivial code base and turning it into open source is not simple, even if all participants are willing. In a recent climate model source code discussion, for instance, lawyers for the various institutions involved were very concerned that code that had been historically incorporated into the project might have come from outside parties who could assert copyright infringement related to their bits of code if it were now to be freely redistributed (which is what the developers wanted). Given that a climate model project might have been in existence for 30 years or more, and involved hundreds of scientists and programmers from government, universities and the private sector, even sorting out who would need to be asked was unclear. And that didn’t even get into what happens if some code that was innocently used for a standard mathematical function (say a matrix inversion) came from a commercial copyrighted source (see here for why that’s a problem).
Yet the need for more code archiving is clear. Analyses of the AR4 climate models done by hundreds of scientists not affiliated with the climate model groups are almost impossible to replicate on a routine and scalable basis by the groups developing the next generation of models, and so improvements in those metrics will not be priorities. When it comes to AR5 (for which model simulations are currently underway), archiving of code will certainly make replication of the analyses across all the models, and all the model configurations, much less hit-or-miss. Yet recently, it was only recommended, not mandated, that the code be archived, and no mechanisms (AFAIK) have been set up yet to make even that easy. In these cases, it makes far more sense to argue for better code archiving on the basis of operational need than it does on the basis of science replication.
This brings us to the third, and most important, issue, which is organizational. The currently emerging system of archiving by ‘paper’ does not serve the operational needs of ongoing research very well at all (and see here for related problems in other fields). Most papers for which code is archived demonstrate the application of a particular method (or methods) to a particular data set. This can be broken down into generic code that applies the method (the function), and paper-specific code that applies that method to the data set at hand (the application). Many papers use a similar method but in varied applications, and with the current system of archiving by ‘paper’, the code that gets archived conflates the two aspects, making it harder than necessary to disentangle the functionality when it is needed in a new application. This leads to the archiving of multiple versions of essentially the same functional code, causing unnecessary confusion and poor version control.
It would be much better if there existed a stable master archive of code, organised ‘by function’ (not ‘by paper’), that was referenced by specific applications in individual papers. Any new method would first be uploaded to the master archive, and then only the meta script for the application referencing the specific code version used would need to be archived with an individual paper. It would then be much easier to build on a previous set of studies, it would be clear where further development (either by the original authors or others) could be archived, and it would be easy to test whether the results of older papers were robust to methodological improvements. Forward citation (keeping track of links to papers that used any particular function) could be used to gauge impact and apportion necessary credit.
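As a minimal sketch of what this separation might look like in practice – the smoothing function, the version label and the toy data below are all invented purely for illustration, and the example is in Python rather than anything any particular group actually uses:

import numpy as np

# ---- generic method: would live in the (versioned) master archive ----
def running_mean(x: np.ndarray, window: int) -> np.ndarray:
    """Centered running mean, truncated at the edges."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="same")

METHOD_CITATION = "running_mean, archive version 1.2"  # what a paper would cite

# ---- paper-specific application: the only part archived with the paper ----
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    years = np.arange(1880, 2011)
    # toy 'anomaly' series: a weak trend plus noise, standing in for real data
    anomalies = 0.007 * (years - 1880) + rng.normal(0.0, 0.15, years.size)
    smoothed = running_mean(anomalies, window=11)
    print(METHOD_CITATION)
    print("smoothed values near the ends:", round(smoothed[5], 3), round(smoothed[-6], 3))

In this scheme, only the application block would be archived with a given paper; the function above it would live in the master archive and be cited by name and version.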
One could envision this system being used profitably for climate model/reanalysis output analysis, paleo-reconstructions, model-data comparisons, surface station analyses, and even for age-model construction for paleo-climate records, but of course this is not specific to climate science. Perhaps Nick Barnes’ Open Climate Code project has this in mind, in which case, good luck to them. Either way, the time is clearly ripe for a meta-project for code archiving by function.
The Ville says
Good point about the independent results thing being more important.
A lot of these people wanting code to check what groups of scientists have done should be writing their own code and producing results to check against the results produced by the code they want to see.
Of course one could assume that they aren’t interested in the results or producing any, which might explain why they don’t produce their own models and want to see the code.
If they can do better, all will benefit. But will they publish it as an open source model?
Mitch Lyle says
The whole business of code archival is actually a form of publication that used to be referred to as a methods paper. Ironically, even as the internet supposedly opened up vast new amounts of information space, it has been getting harder and harder to actually publish methods in detail. The general reaction in the last decade has been that methods are boring and should be minimized in papers. Why can’t we be a bit more creative with publication formats so that detailed methods can be included?
[Response: I totally agree. The issue is all about the publication formats. There is a really interesting article on some of the pitfalls of publishing ‘everything’ in the Journal of Neuroscience, which we linked to above (here)–eric]
David Jones says
Thank you for the links and this excellent discussion of the science behind source code release. In terms of science I think we are in broad agreement: it is difficult, certainly in the current publishing and promotion regime, to use science as a justification for releasing your code. Science will be better the more code you release, but it’s not an easy argument to make.
However, the real reason we started Climate Code Foundation is public perception. Science is being used to justify public policy, and the public’s perception of that justification (and hence, the policy) is clouded by not having access to source code. Public policy will be advanced by releasing source code.
[Response: David, thanks for commenting. I think we should be clear that the public perception is being clouded by what pundits are saying about the importance of access to source code, not by any actual lack of its availability. But don’t get me wrong — your work is important and will indeed serve science.-eric]
Damien says
As a software engineer writing math-ish code in an unrelated field, I can say from the outset that a clear description of the mathematics is far, far, far more important than the implementation itself.
Consider a case where you’re in a code base trying to work out what’s done. Even something like reconstructing the symbolic version of a large-ish Jacobian from some badly written, cryptic code is the stuff of nightmares. I’d rather get the equations and re-write the code.
Only having one or the other (code or description) is, to me, a false dichotomy. Time and time again, I come across papers that have slightly incomplete descriptions of an algorithm (for a trivial example, something has to be implemented with numerical integration… what method did they use?). Having the code makes these sorts of assumptions clearer and sometimes fills in the gaps in a paper.
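To make that trivial example concrete – the integrand, the grid and the numbers below are made up for illustration, not taken from any real paper:

import numpy as np
from scipy.integrate import trapezoid, simpson

x = np.linspace(0.0, np.pi, 21)   # a coarse grid, as measured data often is
y = np.sin(x)                     # toy integrand; the exact integral is 2

# Two perfectly defensible readings of "we integrated numerically":
print("trapezoid rule:", trapezoid(y, x=x))   # a little under 2
print("Simpson's rule:", simpson(y, x=x))     # essentially 2

Which of the two the authors actually used is exactly the kind of gap the archived code fills in one line.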
[Response: Very well stated points. Not to mention typos in the mathematics, that can lead to endless frustration until you figure them out! I recall a glaciology paper by a mentor of mine with a missing ^2 that took me forever to find. Of course, if I had had the code, I would have just used that, and never learned as much as I had to when working it out myself, and then I never would have known the equations were wrong, and perhaps would have propagated the error in future publications. None of these solutions are without pitfalls!–eric]
vboring says
The point here is that in some fields, the code is the method. Not making code available feels a lot like when the cold fusion folks chose to not reveal the details of their methods. If several labs had independently claimed to have made cold fusion work, would that have meant that revealing the methods was unnecessary or that their claims were more believable?
For the claims that are based on code to be tested, the code itself has to be checked. Something as simple as a minor typo can lead to a problem that creates a false result even if the basic analysis is correct. Checking through back-casting can’t always find these problems because future conditions put in the model are not the same as past conditions used for checking.
I’m sure that IP problems are real, but for code-based scientific papers to be independently verified, the code itself has to be open to review. Maybe some kind of limited review could be implemented until the IP problems are resolved. Until something is worked out, we’re just repeating the cold fusion debacle, with skeptics trying to guess at what the code does and talking about how it is the wrong way to do things.
sambo says
From the previous comments (and my own experience as well), it seems the reason for having code open along with the publications is to clarify the assumptions and the methods that lead to the results. While I could replicate the results of a paper with a cryptic description of the method and the results, it would take a lot longer for me to do so (and I’d assume everyone would have this experience). You do raise some important objections to having everything open source immediately (especially for projects that weren’t started as open source). While on the whole I think open source is the way to go, as you say, there are pitfalls.
Thanks for a very good and informative discussion.
Bryan Lawrence says
Thanks for saving me from having to write this. Seriously, I’ve had a few hundred words lying around for a few weeks now, waiting for a wee bit of polish. But everything it said, in essence, you’ve said here, although I would have spent a bit more time on the difference between replication and repetition (probably without adding much signal).
So, just asking for “code publication” is far from enough. As with “data publication”, communities are going to have to work out “what is the publication unit?” and “how does one get credit for publishing (code)?”
(Here I’m not talking about “putting code snippets on the internet” – that can happen or not, and won’t make much difference to much. I’m talking about the publication of significant pieces of code, such as the piece that the Climate Code Foundation started with.)
adrian smits says
This isn’t rocket science folks! Think about Linux and the willingness of its developers to work together to create something really special! If climate science is as important as we say it is, there should be no hesitation to stand on each other’s work and build something that our children can be proud of, rather than doubting us because of secrecy and the need for personal aggrandizement!
[Response: Huh? Did you read what we wrote, or are you just responding to what you imagine we wrote?–eric]
Bob (Sphaerica) says
The starting point in any computer project is always a very, very clear statement of the real world goals. Without that, people always seem to wander off in some direction or other, usually getting distracted by very intense technical details, feverishly solving complex or laborious problems, only to find out that no one really needed 90% of what was done and vast amounts of effort were spent on ultimately irrelevant details.
And stating the real world goals is not always as obvious as it seems.
So… what is the actual real world goal of this?
There is a vague, implied purpose of improving the science… but that’s not specific enough. How? It’s already been stated that part of the scientific process is to reproduce individual efforts separately, without tainting the reproduction by using the exact same original (and potentially flawed) methods. This is a sensible approach. Sharing code will defeat this. It will be too easy for people to say “I don’t have the time (or a grad student to spare), so I’ll just re-use his code.”
I will say that it seems to me as if the primary goal is to make the software available so that self-styled amateur auditors can comb through it looking for mistakes. If this is in fact the case, then publishing the software will not improve the science. It could, in theory, help to identify errors that might otherwise go unnoticed, but experience to date has shown that it will instead be used by the unprincipled to mount an attack on the science by trying to discredit valid work by exaggerating the impact of minor problems.
I’m not saying that makes it wrong to share the code, or that it’s a reason to “hide” the code. I’m just saying go in with your eyes open, and a clear, positive goal in mind.
Alternately, a specific scenario like the sharing of model components (such as is described for AR5) is a more narrow and specific focus, and is liable to have very, very different requirements and implementation techniques than a problem defined with a broader scope. It’s a very different problem with a different solution.
The goal is everything.
I am not saying I’m against this at all. I’m a big fan of open source with everything. But I do think everyone needs to go in with their eyes wide open, and with very, very clear real world goals established and stated at the outset. That one step has a tremendous impact on what you actually do, how you do it, and how you address the pitfalls you encounter along the way to a solution.
sambo says
I read the linked announcement (J of Neuroscience re Supplemental Material). It seemed to me that the main criticism was that too often the supplemental material tempted researchers to put key components of the methods into the supplemental material when they should be in the article. Is this what everyone else understood?
It would seem that ensuring the methods are clearer in the article would be a good thing instead of burying it in the code.
Bob (Sphaerica) says
8 (adrian smits),
This demonstrates a clear lack of appreciation of the problem. It’s equivalent to saying that it’s easy to drive from New York to San Francisco, so it should be easy to drive from San Francisco to Tokyo.
The two scenarios are wildly different. In fact, the sharing of code in science is very far removed from any other scenario of sharing computer code.
To give one example, Linux involved a single base of code (with offshoots that included commercial ventures), with one specific purpose, that of offering very limited and clearly pre-defined functions (as defined by myriad operating systems, and the definition of an operating system, and the industry standard for a worthy operating system, and the commercially available competitors such as Windows and Mac OS). A huge system developed around this which included project plans, teams, assignments, documentation methods, and so on. All of this involved people who were focused on Linux (or any other open source project) as the primary product of the effort.
We are instead talking here about sharing thousands and thousands of individual run-once programs, written in a variety of computer languages, by people who are generating the code in support of other goals and as just one small part of their projects. The code is supplementary, and not even secondary in most cases… it’s much further down in the pecking order.
No one in the world should be sitting back and saying “heck, it’s easy, what’s taking so long?”
Paul Wowk says
I actually work pretty heavily in software and have an interest in documenting code. I recently posted on my blog about that. (www.cleancodeco.com/blog). I have even done small scientific models on my own.
My thoughts would be that the software needs to be documented in a step-by-step manner. I think it should fall back on the idea of a scientific lab report. In other words, another scientist should be able to duplicate the results of the experiment/analysis. If the documentation is not specific enough, another scientist would not be able to duplicate the results.
Spending a lot of time looking at other people’s source code, I have learned a very hard fact. Reading and understanding someone else’s software/code is extremely difficult and time consuming. My experience is it takes about an hour to understand 50 lines of code. With some climate models getting into the thousands or millions (?) of lines of code, it may be a lot of hours to understand every line of code.
All climate model/analysis code follows one pattern: input goes into a black box (the analysis code), which then produces output. I believe the black box code needs to be explained at a high level, covering how it gets from input to output. The assumptions used should be in there. Writing that explanation should be expected to take as long as writing the software itself.
A document explaining what happened would be much more valuable than open sourced code.
Bob (Sphaerica) says
In 5, vboring said:
This sort of approach is exactly what I’m afraid of.
First of all, for example, a main part of the validation of the GCMs comes from how closely their short term predictions match the real world, and how complex, detailed modeling of micro-behaviors results in visible, realistic macro-behavior. It doesn’t require that the code be read.
Another part of the validation comes from the fact that multiple teams, working in isolation, create different models which independently come to similar conclusions. [They’re never going to share the same missing semi-colon.] Sharing code will actually begin to defeat this sort of validation, not improve it.
One is not going to validate such models by going through the code, line by line, looking for a misplaced semi-colon, especially as the models get more and more complex.
But people will try. And those who are politically or financially motivated to do so will get traction by making irrefutable claims of supposed errors (irrefutable, because to most people computer code looks like magic spells, following Arthur C. Clarke’s maxim that “any sufficiently advanced technology is indistinguishable from magic”). Look at what people did with a tiny snippet of “Harry’s” code in the UEA hack.
No. The results need to be checked, not the code. No one ever, ever checks the code*. You check the results, and revise the code when invalid results are identified. [* The one exception is a project where code needs to be ported. In that case, the code is often read and translated, and in the process old bugs may be recognized by an insightful coder. But this is a side effect of the process, not an intended goal.]
Something as simple as a minor typo can lead to a problem that creates a false result…
This is a gross oversimplification of the realities involved in validating computer code.
The reality is that industry expends a huge amount of effort performing QA on all software it creates for commercial purposes (or rather, they should, although foolish companies often run out of funds and short change that end of the project). It is never, ever, ever done by paying someone to go through the code, line by line, looking for mistakes. It might be done, in the scenario presented, if a lot of people downloaded the code, ran it in a variety of applications, and used very clever methods to look for inconsistencies or unexpected results in the output. But the complexity of that sort of effort (if the stated goal is to find mistakes in the code) should not be downplayed.
This particular problem (i.e. validating code written for scientific endeavors) is nowhere near as easy as is implied, or likely to be a major result of any effort to share scientific code.
John P. Reisman (OSS Foundation) says
I very much like the point on independent replication. In one of Scott Mandia’s recent articles
http://profmandia.wordpress.com/2010/10/15/shooting-the-messenger-with-blanks/
He made an acute point. And though he was referring to the Hockey Stick, it applies here as well:
It is through independent replication, especially across multiple lines of evidence, that the implied, the possible, the potential and the hypothetical transition into the likely, the probable, and the robust – the confidence that results from varied analysis.
Corroborating results from so many sources, from so many different lines of evidence, from so many different source code constructs… what are the odds that they are all wrong AND that they largely agree in the results?
I find it much more interesting that different models, from different scientists, that were not produced from open code, showed the same general conclusions.
The funding issue is, in form, the same as the stove-piping issue that allowed increased risk prior to 9/11. Nor, in that case, has the problem been effectively solved. Without getting into the political philosophy argument, I do think that competition fosters innovation and that should not be ignored.
It’s ironic that most denialists, who would likely argue that competition is good, would miss the fallacy of their argument when it is applied to things they prefer to whine about. Between the rights argument and the what-is-best argument, it remains complex, as is well illustrated in the above article.
Open source code is a wonderful thing and presents its own challenges. Its value is now proven (though there are still denialists in the form of commercial companies that for some reason don’t like open source, and have a tendency to spread disinformation about how good it is… hmmm… this pattern sounds vaguely familiar???).
The Ville says
vboring: “Not making code available feels a lot like when the cold fusion folks chose to not reveal the details of their methods. If several labs had independently claimed to have made cold fusion work, would that have meant that revealing the methods was unnecessary or that their claims were more believable?”
You can buy books from your local bookstore which will tell you the methods used in all climate models starting from the 1950s to the present day!
If you doubt the model, buy a book about climate models, learn the science and make your own model to see if you get the same results.
The more the merrier!
David Jones says
@Bob (Sphaerica): you say “It is never, ever, ever done by paying someone to go through the code, line by line, looking for mistakes”. I and my colleagues have been paid to do just that. Entire books have been written about Software Inspection, and Tom Gilb’s is one of the best. We have worked on projects that produced very reliable code (millions of hours of running between failures).
Ryan Cunningham says
Bioinformatics researchers like me face a lot of the same issues about transparency in an interdisciplinary field. As a computer scientist, I think you should be a little less cavalier about the importance of sharing our work. Independent measurements are important, but independent assessments of the analysis are equally important. Especially when most of this work ends up buried in the supplement.
Code can have mistakes, sometimes very subtle mistakes. Having the code released in a readable way provides the opportunity for another level of peer review. Having one open source analysis tool developed by 10 researchers is much better than ten independent perl scripts thrown together on the fly. When we can check each other’s work, the quality is a lot better.
I strongly disagree with this statement “it is very rare that the code itself (many of which have been freely available for some time) is an issue for replication”. Sometimes, looking at the source code is the only way to determine what the methods were. Sure, having two models based on independent measurements agree on predictions is nice. But there also might be systematic biases in the approaches taken. For example, picking apart models for predicting gene regulation helped determine that a significant number of independent approaches were grossly over-fitting some parameters.
Of course it’s not trivial, but pushing for transparency and coming up with better practices for handling source code is critical.
vboring says
@The Ville:
“If you doubt the model, buy a book about climate models, learn the science and make your own model to see if you get the same results.”
Ok. Let’s say I just made a model and it says the exact opposite of what the existing models say about the future, but does back-casting just as accurately as they do. I can state my assumptions and general method, but won’t show the details of my work.
Let’s say ten other working groups then also produce similar results.
Then what happens?
Everyone can attack funding sources, intentions, and other irrelevant factors, but nobody can check anyone else’s actual methods or the details of their assumptions, so no science-based debate can actually result.
This black box method is plainly anti-scientific.
[Response: This is a strawman. In any paper you would need to demonstrate why your results were different – a different assumption, input or something, and that would have to be well enough described to allow replication – whether there was code or not. How useful is it to archive 100,000 lines of code without pointing out what the key change was? – gavin]
caerbannog says
Of course, another big advantage of the “open source” approach is that it allows all those global-warming skeptics out there to help improve the science by scrutinizing the model code, submitting bug-reports and patches, assisting with q/a, unit-testing, documentation, etc…
(**ducks and runs….**)
Mitch Golden says
Regarding the last two paragraphs: Climate science would do well to learn from the Free/Open Source Software (FOSS) community’s approach to software development. The problem of organizing code for distributed teams of workers has largely been solved, especially through the use of modern distributed code repository tools, such as git.
If a group of researchers are working on a paper, they could simply establish a code repository on a place such as github. The authors collectively work against that repository, and when they are ready to publish a paper, they simply tag the final “commit” that was used for the paper. The paper would identify the tag that was used.
Subsequent development proceeds from there, and there is no need to re-archive everything for each further paper that might be published – just make changes and when the next paper is published, tag that subsequent commit.
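As a sketch of how light-weight that step is – this drives git from Python via subprocess, assumes it is run inside the repository in question, and the tag name and paper are invented for illustration:

import subprocess

def tag_paper_release(tag: str, message: str) -> str:
    """Create an annotated git tag marking the code state used for a paper,
    and return the commit hash that the tag points to."""
    subprocess.run(["git", "tag", "-a", tag, "-m", message], check=True)
    result = subprocess.run(
        ["git", "rev-parse", f"{tag}^{{commit}}"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    sha = tag_paper_release(
        "paper-2011-temptrends",
        "Code as used for the 2011 temperature-trends paper",
    )
    print(f"Cite tag 'paper-2011-temptrends' (commit {sha}) in the paper.")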
If it is desired to have several repositories for separate toolsets (for example, one for some analysis of some set of observational data, and another for a GCM), this is also possible. In fact, many FOSS projects proceed in just this way.
There are also structures to allow various outsiders to participate. Different people can be given different levels of access to the repository, such as read-only or the right to make commits. If a climate team wanted to develop a true Free Software GCM, they could make the code available to the public, license it under the GPL, and allow members of the public to submit patches (to the project owners, generally not directly to the repository).
Bob (Sphaerica) says
16 (David Jones),
It’s done… about 1/1,000,000,000,000th of the time. And even then, you select segments of code to review, not the entire thing.
Realistically, the effort involved is untenable.
What would you estimate is the percent of code written world wide that undergoes actual code review?
What sort of projects have that sort of money to spend on the effort?
What would you estimate is the percent of the code that was reviewed on the projects on which you performed software inspection?
Kate says
I really enjoyed this post. It shows that there is a lot more to open source science than is implied by those who angrily proclaim, “Release everything, even the emails you wrote to your colleagues about it!”
I recently had the pleasure of interviewing Dr. Ben Santer, which touched on many of these topics about reproducibility and data sharing. You can read my article about it here. Lots of angry comments that necessitated deletion, though – why are deniers so hostile to this guy?? I didn’t think anybody actually took Frederick Seitz seriously, but apparently I was wrong…
Mitch Golden says
One further point:
I am fairly certain that the developers use some form of code management already. However, the older systems such as CVS or subversion do not offer anywhere near the level of sophistication of git, and are nowhere near as good at the kind of distributed development done in a FOSS project. And, more importantly here, they don’t offer an automatic way to make the codebase public, as publishing on github does.
Please feel free to contact me at the embedded e-mail address if you’d like help setting it up!
Didactylos says
Bob (Sphaerica):
I do of course agree with your major thesis. However, you seem to go a little far in claiming that code isn’t checked, and doesn’t need to be checked. All code benefits from careful debugging, and line-by-line code review is a traditional part of software development. In particular, code review may identify degenerate cases that need special testing.
There is nothing wrong at all with finding and correcting bugs no matter how small. I think the point you were trying to make is that the effect of any such bug must be negligible, since the code produces correct (“realistic”) results on a larger scale. I think that point got a little lost.
The question that I keep asking is why, after all this time, have the self-appointed climate auditors not found dozens of minor bugs in the vast oceans of already available climate code? The bugs are there, they just aren’t looking hard enough. Some of the bugs may even have important implications for some fine-scale processes or little-understood emergent phenomena that are being studied using a particular model.
Hank Roberts says
David Jones says: (millions of hours of running between failures)
How long between failures? One million hours is more than a century.
vboring says
@Gavin @18
The scientifically irrefutable contrary black box is only a straw man if it can’t be done by making small changes to existing models that are within the generally accepted estimates. Considering the number and size of uncertainties, my assumption is that this is possible.
And whether the contrary model is a straw man or not doesn’t change the primary point, which is that the details of the method are only revealed in the code. The cold fusion folks claimed that tiny changes in the method would prevent the experiment from working (how the platinum had been stored, etc), yet refused to reveal the exacting specifications for their methods. Maybe 10 working groups used ten bad fudge factors in order for their model to properly back-cast. Until the details of the method are revealed, skeptics are free to make any kind of wild unsupportable claims they want. And scientifically minded folks get to simply hope that uninspected work was done perfectly.
100,000 lines of code sounds like a lot to archive. If each character is a byte and each line has 50 characters, that is 5 million bytes. Where will anyone find the space to archive 5 megabytes of uncompressed text? IP rights may be an issue, documentation may be an issue, archival space is obviously not. Speaking of straw men…
[Response: Huh? The issue is not the size of the archive but the time needed to go through someone else’s code. If you are given 100,000 lines of code and told to find the key assumptions, you’ll be there a year (or more). Complex code absent understanding and some documentation is basically useless. Any science paper needs to have enough information so that others can replicate the results – your ‘black box’ paper simply does not (and cannot) exist. What cold fusion hucksters announce at a press conference is as far from a scientific paper as you can get. There have clearly been cases where replication hit roadblocks that would have been dealt with given code archiving – the MSU affair for instance, or more recently Scafetta. But it is also true that the independent replication attempts in both cases (which might not have happened had the original code been available) were useful in and of themselves. There are no ‘one size fits all’ rules here. – gavin]
JBL says
vboring wrote: “If several labs had independently claimed to have made cold fusion work, would that have meant that revealing the methods was unnecessary or that their claims were more believable?”
This is a very strange hypothetical, but the answer is “yes.” If several labs were independently able to make cold fusion work, this would in fact be incredibly strong evidence that cold fusion is possible. This is true of any scientific result: the evidence that it’s correct gets stronger as more people are able to replicate it separately. If you and I both set out to solve the same problem via different methods and come to the same conclusion, this is good evidence that we are both correct. Of course, in the case of cold fusion, the non-replicability is exactly why we know there was something wrong.
Bob (Sphaerica) says
24 (Didactylos),
Sort of.
I’m not claiming it doesn’t need to be checked. QA is hugely important. But it is very rarely done by just reading the programs.
My concern, however, is that very, very few people have any inkling as to what computer software involves, either in writing, checking, documenting or maintaining it. All of those people out there who think that computer code is like a poem that you can read for grammatical errors and poor rhythm and weak rhymes are on the completely wrong track.
The point of making the code public should not be so that auditors can read it looking for mistakes. Some can try, and every contribution is a contribution, no matter how it was attained… but nobody should think that that is either an objective or a likely outcome of such a project. There are other values to this, but that’s not one of them.
The way that someone will find a mistake is by trying to implement the code themselves, provide their own inputs, make their own enhancements, and in so doing learn to understand how the code works and then, possibly, realize where a mistake has been made.
Of course, if that is the case, the world might have been better served for them to put that same effort into starting from scratch, writing their own programs, and publishing those results if they contradict the original (or not, which becomes an affirmation instead of a refutation of the original). You know, kind of the way science has always worked. Sharing code makes it easier for people, and in so doing it makes people more likely to repeat previous errors without even realizing it.
Which gets back to my original question: What is the actual goal of any concerted, systematic effort to share scientific code?
Ray Ladbury says
vboring, Well, if other groups are independently confirming your results, then you would have to look in detail at the models. That does not necessarily mean looking at the code. Presumably the “10 other groups” did not, so the issue is not there. It would have to be in the physics.
I usually learn much more looking at physics than I do looking at code. Try it.
Leonard Evens says
Let me comment about Adrien Smits point about Linux.
Linux system and application code is written by teams, often consisting of one person. It is in principle open and available, but for anything at all complicated, you have to in essence become part of the team in order to decipher the code. For example, on various occasions, I’ve tried to figure out exactly what the program gimp was doing, but just studying the code was of little use. I had to ask questions in forums and directly of the authors. And then I was only partly successful. To do better than that I would have to in effect become part of the gimp development team, and I don’t want to spend my time working to be able to do that.
I can’t imagine trying to evaluate climate model code without becoming a climate modeller myself, and that would require spending all my time for at least three years first doing the equivalent of a graduate program in the subject, under the supervision of people already doing that work.
Finally note that there is no group of skeptics, who aren’t experts in Linux, trying to show that Linux is a fraud. One of the reasons it works as well as it does is that Linux users trust the developers. So we don’t have to spend a lot of time debating the virtue of the system.
[Response: Actually, Microsoft was sponsoring very similar attacks on Linux not too long ago, and oddly enough, it was spearheaded by the same ‘think tanks’ that also attack climate science. – gavin]
SecularAnimist says
Leonard Evens wrote: “Finally note that there is no group of skeptics, who aren’t experts in Linux, trying to show that Linux is a fraud.”
Actually, at least for a while there were some folks at Microsoft who tried to make that argument.
kyn says
I guess having the code is not a requirement. Some papers do get published in my field that simply don’t include enough details to replicate the research, which can be infuriating. Having the code here would help, but arguably peer review should have caught these and told them to spend more time on their methods.
More code would save people a lot of dreary implementation work that might be useful once, but not 20 times. All that wasted time might have been good for the career of one scientist, but not for those of the 20 others, and not for science as a whole.
Hank Roberts says
(P.S. — I’m not a programmer, it’s a serious question: can you count “millions of hours … between failures” if hundreds of copies average a year — 8760 clock hours if uninterrupted — between failures?
I do know the story from Techweb 9 April 2001:
“… The University of North Carolina has finally found a network server that, although missing for four years, hasn’t missed a packet in all that time….. by meticulously following cable until they literally ran into a wall. The server had been mistakenly sealed behind drywall ….”
I assume you’re using an industry standard for time between failures, but–how?)
Didactylos says
Yes, I find all these people worshipping at the altar of open source very strange indeed.
Open source is terrible. And, like democracy, it is the worst possible solution – except for all the others that have been tried.
We have to do the best with what we have, imperfect as it is.
Roger Caiazza says
If the point is that you need to document somewhere exactly how you did whatever you did, then, in my experience, it is a heck of a lot easier to document the code with many, many comment statements. If not, you have to go through the whole documentation process twice. Moreover, that kind of documentation makes it easier for the next person to build on your results or, dare I suggest others are afflicted with my problem, to answer questions about what you did months later. Open code to me would just simplify documentation.
vboring says
@Gavin @26 – this is a very inefficient way to comment on anything. Comment systems with the ability to reply are much more effective.
“If you are given 100,000 lines of code and told to find the key assumptions, you’ll be there a year (or more). Complex code absent understanding and some documentation is basically useless. Any science paper needs to have enough information so that others can replicate the results – your ‘black box’ paper simply does not (and cannot) exist.”
So, give me 100,000 lines of code plus the key assumptions. The key assumptions are already published, so all I need is the code. People hack code that they have no documentation – or even source – for all the time. Given both the key assumptions and the code, determining the precise methods would be a fairly straightforward (though time consuming) task. In any case, it is a weird argument to say that the code shouldn’t be published because people might find it difficult to use. Philosophical arguments are often difficult to follow. Does that mean philosophers should just publish key assumptions and results and skip the details of their reasoning altogether?
Essentially, you’re claiming that all of the necessary information is already contained in published sources and I’m saying that I don’t care what you think I do or don’t need to know. If the output from your work is being used to justify alterations to my life (and I honestly think they should be), then I should have the right to see the actual method used. Not the overview, not the key assumptions, not the basic physics. The actual method. The code and data.
Doing otherwise creates an image of secrecy which makes the science seem more controversial than it is, which reduces the chances of climate policy being implemented effectively.
[Response: Did you even read the post? Then how can you think I am arguing for secrecy? Nonetheless, 99% of anything that is archived will be completely ignored in terms of public policy (as it is now), and since archiving is never going to be 100%, someone will always be able to claim something isn’t archived (even if it is – few people bother to check). Since most scientists (believe it or not) are not obsessed with the public policy outcomes of their work, this is not a sufficient incentive – it has to be useful for science too, otherwise it isn’t going to happen. This is just an observation, not an aspiration. – gavin]
vboring says
@Ray Ladbury @29
I can only know the details of how the physics were implemented by looking at the code.
When you’re talking about a 2 degree C signal in hundreds of years of data with several inputs that have order-of-magnitude error bars, the details matter.
If one is looking for a way to agree with something, the overview argument is enough to convince them that the speaker is making a serious claim that they’ve really thought about. If one is looking for a way to disprove something, every detail is needed because it is thought that any one of them could be the undoing of the argument. This is why people who agree with AGW claims think that enough information has already been given and those who don’t agree think that every detail of the methods must be made public.
Edward Greisch says
Pet Peeve: Commercial software that is a year old, so they won’t support it and it won’t run on your new machine, but it is still patented and copyrighted. The law favors the big corporation to an extreme extent.
Another problem: I have enough trouble figuring out how to use data that are in NASA and NOAA web sites. I’m pretty sure the pundits wouldn’t recognize code if they saw it.
Robert says
I think that perhaps a good in-between would be simply to provide the skeleton of the coding. Methods sections in papers are often insufficient for understanding how things are implemented, but providing, say, a skeleton would let those reading it understand what was done, while still having to write the code themselves to do it.
I agree that just providing the code is not a good thing, but step-by-step instructions on how something was done could be important, because people seeing that can write their own code and perhaps streamline or improve things.
SecularAnimist says
Hank Roberts quoted TechWeb 2001: “The University of North Carolina has finally found a network server that, although missing for four years, hasn’t missed a packet in all that time … by meticulously following cable until they literally ran into a wall. The server had been mistakenly sealed behind drywall”
Betcha it was a NetWare server.
Dan H. says
vboring,
Good point. Any input with large enough error bars could cause the entire model to unravel. That is why some who include uncertainties in their calculations arrive at 21st century warming between 1 and 6C.
HappySkeptic says
The point that at the end of the day climate science’s veracity rests on the physical evidence, and not code in models, is very important. If different groups, funded by different governments on different continents using different measuring devices and writing their own modelling/analysis code are all coming to the same conclusions it shows that climate change isn’t an artefact of a coding error in some computer simulation.
However in many cases in the public debate this is too nuanced a point to be useful – it doesn’t fit into a soundbite. Therefore the code really does have to be completely made open, even stuff that isn’t really useful on its own, so that the answer to bullsh** questions like ‘Aren’t climate scientists hiding the code they use to produce their results?’ is a simple ‘No, all such code is freely available to anyone on the internet’.
vboring says
@gavin @36
“Did you even read the post? Then how can you think I am arguing for secrecy? Nonetheless, 99% of anything that is archived will be completely ignored in terms of public policy (as it is now), and since archiving is never going to be 100%, someone will always be able to claim something isn’t archived (even if it is – few people bother to check). Since most scientists (believe it or not) are not obsessed with the public policy outcomes of their work, this is not a sufficient incentive – it has to be useful for science too otherwise it isn’t going happen. This is just an observation, not an aspiration. – gavin”
I am a bit of a hard liner when it comes to openness is science. As I see it, the work of any public employee who is paid to generate knowledge belongs to the public (with exceptions for national security, etc).
Separately, for a scientific paper to be published, I think every detail should be present to allow for identical replication of the original result. Finding problems in the published approach would take a small fraction of the time compared to starting from zero trying to replicate results based on a described method, so the value to science is obvious.
From this perspective, arguing for anything short of complete openness is arguing for secrecy, and excuses for why it doesn’t matter or why some scientists don’t want to do it are uninteresting.
[Response: ‘identical replication’? Under all circumstances? Not possible. Different compilers, different math libraries mean that any reasonably complex code will generate bitwise different results (and far more than bitwise in a chaotic system) if run on a different computer, or even the same computer at a later time. Some codes are specifically written only for the computer they were designed for (the Earth Simulator for instance). This is not – in practice – identically reproducible. Or are we supposed to archive the entire OS and functions as well? No absolutist approach such as you advocate is possible, and in demanding it, you actually set back the work on getting anything done. – gavin]
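A toy illustration of the chaotic-sensitivity point in the response above (a sketch only – the logistic map is a stand-in for any chaotic calculation, not a climate model): a perturbation of roughly one part in 10^15, about the size of a rounding difference between math libraries or compilers, eventually leaves the two runs with nothing in common.

def logistic_map(x0: float, steps: int, r: float = 3.9) -> float:
    """Iterate the chaotic logistic map x -> r*x*(1-x), starting from x0."""
    x = x0
    for _ in range(steps):
        x = r * x * (1.0 - x)
    return x

run_a = logistic_map(0.3, 100)                    # 'reference' run
run_b = logistic_map(0.3 * (1.0 + 1e-15), 100)    # rounding-sized perturbation
print(f"run A: {run_a:.6f}  run B: {run_b:.6f}  |difference|: {abs(run_a - run_b):.6f}")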
klee12 says
Hello,
I’ve ported many codes. The only documentation I read carefully when I have to go through a program line by line is (1) the purpose of the subroutine, (2) the list of parameters and their assumed data structures (integer, floating point, array), and (3) the list of outputs and their data structures. The problem with detailed documentation is that frequently the documentation is erroneous … the programmer may have written a program, perhaps modified the program, and then forgotten to change the documentation to reflect the change. With just the purpose of the subroutines and the calling sequences for the input and output, I can write a standalone program to call the subroutine and single-step through it to see what it does.
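A sketch, in Python, of the kind of standalone driver I mean – the routine and the probe input are made up for illustration; the idea is just to call the routine in isolation and feed it an input whose correct answer is known in advance, rather than trusting the written documentation:

import numpy as np

def detrend(series: np.ndarray) -> np.ndarray:
    """The routine under inspection: remove a least-squares linear trend."""
    t = np.arange(series.size)
    slope, intercept = np.polyfit(t, series, 1)
    return series - (slope * t + intercept)

if __name__ == "__main__":
    # Probe input with a known answer: a trend plus a cycle should come back
    # with an (essentially) zero residual trend after detrending.
    t = np.arange(100)
    probe = 0.02 * t + np.sin(2.0 * np.pi * t / 12.0)
    residual_slope = np.polyfit(t, detrend(probe), 1)[0]
    print("residual trend after detrending:", residual_slope)   # should be ~0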
Sometimes I want to understand the code. It helps then if the subroutines are short and perform only one function. In my own coding I try to keep the length of a subroutine to two screenfuls of source code. If it is much longer, I try to break it into two subroutines.
Being able to reproduce results is the bedrock of science and, IMHO, the code and data are necessary for reproduction. Suppose A presents, in a paper, results that depend on a computer run. A makes available the data and a detailed description of the algorithm implemented in the code. B tries to reproduce the code and, using the same data, cannot get the same result. It seems there are three possibilities: (a) A has a bug in his/her program, (b) B has a bug in his/her program, or (c) there was a misunderstanding in the specification of the program.
klee12
The Ville says
vboring:
“Essentially, you’re claiming that all of the necessary information is already contained in published sources and I’m saying that I don’t care what you think I do or don’t need to know.”
Or rather, you aren’t interested in listening; you’re on a pre-programmed course and no one is going to change your mind. I think others can assess your attitude and make their own judgement.
vboring:
“If the output from your work is being used to justify alterations to my life (and I honestly think they should be), then I should have the right to see the actual method used.”
Well I hope you apply that logic to all public policy!
Do you apply it to policy based on ‘morals’ or religion or anecdotal evidence?
I’m guessing you accept public policy when you personally feel happy about it, which has little to do with the reasons for introducing the policy.
vboring:
“Not the overview, not the key assumptions, not the basic physics. The actual method. The code and data. “
Septic Matthew says
Yet the need for more code archiving is clear.
I am glad that you wrote that.
You rightly critique the per-paper approach to archiving, but that itself is an improvement over the previous practice.
sharper00 says
I think the issue of source code availability is important but more so for the wider acceptance of scientific work than for the progression of scientific knowledge itself.
Simply put: people want to replicate the work of scientists to ensure the results are reliable and not simply the word of a small group of insiders. The ins and outs of how the debate evolved are something we could all write about extensively, but when a topic is controversial it’s a lot to ask of people not in the scientific community to trust in peer review etc. when they could just have the code and data instead.
I think the ability of others to independently analyse and implement GISTEMP has done a lot to dispel the idea that warming trends are purely an artifact of analysis. Naturally it hasn’t removed all criticism, but there are those who will never be satisfied; the goal should be to satisfy those who are reasonable.
[Response: Fair enough, but despite all code and all data being available, and it being independently coded, and it being independently validated, I am still seeing accusations of corruption, fraud and data mishandling. And these are still making news. It is precisely because there are loudmouths who will never be satisfied, and plenty of people around who don’t check their claims, that the notion of perfect archiving suddenly improving public appreciation of the science is nonsense. People will continue to use stale talking points as long as this has political connotations. – gavin]
SecularAnimist says
Gavin wrote: “… in demanding it, you actually set back the work on getting anything done.”
Of course, with regard to climate science, for some people that’s a feature, not a bug.
sharper00 says
“I am still seeing accusations of corruption, fraud and data mishandling.”
Certainly and they’ll continue because not everyone is playing the same game. I think we’re all aware there are those who know nothing about climate and care even less but are only involved in order to dismiss the entire thing because it doesn’t match their preconceptions.
Which is why I say above that the goal should be to satisfy those who are reasonable. When data and code are locked away, reasonable people will ask why and become concerned about the accusations made about such a thing. When data and code are open and independently implemented, reasonable people scoff at claims that they are manipulated.
I mentioned GISTEMP above, the independent reconstructions should put to bed the concerns any reasonable person might have that warming trends are simply inserted into the data. Of course that still leaves the daily rants about adjustments at location X but there’s basically nothing that can be done about those.
“the notion of perfect archiving suddenly improving public appreciation of the science is nonsense.”
I don’t believe it will, any more than problems of unarchived/unavailable code and data instantly damaged public perceptions of science. However, over time, freely available code will serve to increase confidence while unavailable code will do the opposite.
sam marshall says
@hank: Yes, ‘millions of hours between failures’ means ‘hours of usage’. So if the software is being used in one million sites, and each hour on average one site experiences a failure, then they have a million hours between failures.
It is possible to imagine cases where this metric is pointless (an obvious one would be some kind of input that will not occur by random chance/wide-scale usage but only at a specific date or time, such as the Y2.038K bug; if your software runs on millions of sites for years with few failures, that’s great and you’ll have a good average, but it’s still possible that every one of those million ‘reliable’ sites will fail drastically in 2038) but typically it’s probably a pretty good measure.
I once worked on a system which monitored the same metric and managed a quality 1.5 hours between failures. :) Largely my fault, too. Since decommissioned.
Excepting code for nuclear power stations and rockets and whatever, most code is not written to anything like the specifically-checked or million-hour standards. However, it’s not true that outside that realm code is very rarely read. Lots of organisations, including the one I work at, have code-review policies where developers have new code casually reviewed either by peers or by a senior developer before committing it.
Even though our reliability requirement is along the lines of ‘eh it’s a website, if it breaks we’ll patch it’, rather than the ‘oops, if it breaks we might accidentally launch a nuclear strike’ level, the review process is still quite important for us.
I would second the recommendation for a distributed VCS such as Git when used in this context. Archive quality code in a proper version control system that can be publicly referenced, and you can easily point to the exact version used. For throwaway crappy code written just for that paper, or a few lines of Matlab or whatever it was, sure, go ahead and archive it with the paper.
Incidentally the same process might apply to datasets depending on how they are stored and structured; in other words if you use a standard dataset there is no point archiving the whole thing with the paper, but if that dataset is stored in a version control system, you can simply reference the precise version that you used.
Because distributed VCS are, er, distributed, you have security in the event that (for example) GitHub goes bust. Not a problem, everyone who uses it has a full copy; you can stick it onto some other host and all the history, tags, other references, etc should still work.