This week has been dominated by questions of replication and of what standards are required to serve the interests of transparency and/or science (not necessarily the same thing). Possibly a recent example of replication would be helpful in showing up some of the real (as opposed to manufactured) issues that arise. The paper I’ll discuss is one of mine, but in keeping with our usual stricture against too much pro-domo writing, I won’t discuss the substance of the paper (though of course readers are welcome to read it themselves). Instead, I’ll focus on the two separate replication efforts I undertook in order to do the analysis. The paper in question is Schmidt (2009, IJoC), and it revisits two papers published in recent years purporting to show that economic activity is contaminating the surface temperature records – specifically de Laat and Maurellis (2006) and McKitrick and Michaels (2007).
Both of these papers were based on analyses of publicly available data – the EDGAR gridded CO2 emissions, UAH MSU-TLT (5.0) and HadCRUT2 in the first paper, UAH MSU-TLT, CRUTEM2v and an eclectic mix of economic indicators in the second. In the first paper (dLM06), no supplementary data were placed online, while the second (MM07) placed the specific data used in the analysis online along with an application-specific script for the calculations. In dLM06 a new method of analysis was presented (though a modification of their earlier work), while MM07 used standard multiple regression techniques. Between them these papers and their replication touch on almost all of the issues raised in recent posts and comments.
Data-as-used vs. pointers to online resources
MM07 posted their data-as-used, and since those data were drawn from dozens of different sources (GDP, coal use, population, etc., as well as temperature), with trends calculated and then gridded, recreating the dataset from scratch would have been difficult, to say the least. Thus I relied on their data collation in my own analysis. However, this means that the economic data and their processing were not independently replicated. Depending on what one is looking at, this might or might not be an issue (and it wasn’t for me).
On the other hand, dLM06 provided no data-as-used, making do with pointers to the online servers for the three principal data sets they used. Unlike for MM07, the preprocessing of their data for their analysis was straightforward – the data were already gridded, and the only required step was regridding to a specific resolution (from 1ºx1º online to 5ºx5º in the analysis). However, since the data used were not archived, the text in the paper had to be relied upon to explain exactly what data were used. It turns out that the EDGAR emissions are disaggregated into multiple source types, and the language in the paper wasn’t explicit about precisely which source types were included. This became apparent when the total emissions I came up with differed from the number given in the paper. A quick email to the author resolved the issue: they hadn’t included aircraft, shipping or biomass sources in their total. This made sense, and did not affect the calculations materially.
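For concreteness, the regridding step is the sort of thing that takes a dozen lines in any language. Here is a minimal sketch in R, using a synthetic 1ºx1º field rather than the actual EDGAR files; for emission totals one would sum or area-weight rather than take a simple mean, but the bookkeeping is the same:

```r
# Minimal sketch (synthetic field, not the EDGAR data): block-average a
# 1x1 degree global grid down to 5x5 degrees, skipping missing cells.
emis_1x1 <- matrix(runif(360 * 180), nrow = 360, ncol = 180)  # lon x lat

regrid_5x5 <- function(field) {
  out <- matrix(NA_real_, nrow = 72, ncol = 36)
  for (i in 1:72) {
    for (j in 1:36) {
      block <- field[((i - 1) * 5 + 1):(i * 5), ((j - 1) * 5 + 1):(j * 5)]
      out[i, j] <- mean(block, na.rm = TRUE)
    }
  }
  out
}

emis_5x5 <- regrid_5x5(emis_1x1)
dim(emis_5x5)   # 72 x 36
```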
Data updates
In all of the data used, there are ongoing updates to the raw data. For the temperature records, there are variations over time in the processing algorithms (for satellites as well as surface stations); for emissions and economic data, there are updates in reporting or estimation; and in all cases the correction of errors is an ongoing process. Since my interest was in how robust the analyses were, I spent some time reprocessing the updated datasets. This involved downloading the EDGAR3 data, the latest UAH MSU numbers, the latest CRUTEM2/HadCRUT2v numbers, and alternative versions of the same (such as the RSS MSU data, HadCRUT3v and GISTEMP). In many cases, these updates are in different formats, have different ‘masks’ and required specific and unique processing steps. Given the complexity of (and my unfamiliarity with) the economic data, I did not attempt to update those, or even ascertain whether updates had occurred.
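One recurring chore in this kind of comparison is putting different products onto a common mask before comparing them, so that differences in coverage don’t masquerade as differences in the data. A minimal sketch in R, with synthetic fields standing in for the real products:

```r
# Minimal sketch (synthetic fields): compare two products only where
# both report data, by building a common coverage mask first.
set.seed(0)
a <- matrix(rnorm(72 * 36), 72, 36); a[sample(length(a), 200)] <- NA
b <- matrix(rnorm(72 * 36), 72, 36); b[sample(length(b), 300)] <- NA

common <- !is.na(a) & !is.na(b)
c(mean(a[common]), mean(b[common]))   # averages over the shared coverage only
```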
In these two papers, then, we have two of the main problems often alluded to. It is next to impossible to recreate exactly the calculation used in dLM06 since the data sets have changed in the meantime. However, since my scientific interest is in what their analysis says about the real world, any conclusion that was not robust to that level of minor adjustment would not have been interesting. By redoing their calculations with the current data, or with different analyses of analogous data, it is very easy to see that there is no such dependency, and thus reproducing their exact calculation becomes moot. In the MM07 case, it is very difficult for someone coming from the climate side to test the robustness of their analysis to updates in the economic data, and so that wasn’t done. Thus while we have the potential for an exact replication, we are no wiser about its robustness to possibly important factors. I was, however, easily able to test the robustness of their calculations to changes in the satellite data source (RSS vs. UAH) or to updates in the surface temperature products.
Processing
MM07 used an apparently widespread statistics program called STATA and archived a script for all of their calculations. While this might have been useful for someone familiar with this proprietary software, it is next to useless for someone who doesn’t have access to it. STATA scripts are extremely high level, implying they are easy to code and use, but since the underlying code in the routines is not visible or public, they provide no means by which to translate the exact steps taken into a different programming language or environment. However, the calculations mainly consisted of multiple linear regression, which is a standard technique for which other packages are readily available. I’m an old-school Fortran programmer (I know, I know), and so I downloaded a Fortran package that appeared to have the same functionality and adapted it to my needs. Someone using Matlab or R could have done something very similar. It was then a simple matter to check that the coefficients from my calculation and those in MM07 were practically the same, and that there was a one-to-one match in the nominal significance (which was also calculated differently). This also provides a validation of the STATA routines (which I’m sure everyone was concerned about).
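For illustration, a replication of this kind of regression in R takes only a few lines. The sketch below uses synthetic data and hypothetical variable names (not the actual MM07 columns), but the pattern – fit the model, read off the coefficients and their nominal significance, compare with the published table – is the same:

```r
# Minimal sketch (synthetic data, hypothetical column names). In practice
# `dat` would be read from the archived data-as-used file instead.
set.seed(10)
n   <- 400   # arbitrary number of grid cells for the synthetic example
dat <- data.frame(trop_trend = rnorm(n), gdp_growth = rnorm(n),
                  pop_growth = rnorm(n), coal_growth = rnorm(n))
dat$surf_trend <- 0.9 * dat$trop_trend + 0.02 * dat$gdp_growth + rnorm(n, sd = 0.1)

fit <- lm(surf_trend ~ trop_trend + gdp_growth + pop_growth + coal_growth,
          data = dat)
summary(fit)   # coefficients and their nominal (OLS) significance

# the replication check: difference from the published values (hypothetical here)
published <- c(0.9, 0.02, 0.0, 0.0)
round(unname(coef(fit))[-1] - published, 3)
```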
The processing in dLM06 was described plainly in their paper. The idea is to define area masks as a function of the emissions data and calculate the average trend – two methods were presented (averaging over the area and then calculating the trend, or calculating the trends and then averaging them over the area). With complete data these methods are equivalent, though not quite when there are missing data, and the uncertainties in the trend are more straightforward in the first case. It was pretty easy to code this up myself, so I did. It turns out that the method used in dLM06 was not the one they said, but again, having coded both, it is easy to test whether that mattered (it didn’t).
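A minimal sketch in R of the two orders of operation, using entirely synthetic data and an arbitrary mask, shows why they coincide when the data are complete (the OLS trend is a linear operation, so the trend of the mean equals the mean of the trends):

```r
# Minimal sketch (synthetic data): the two averaging orders described above,
# over a mask meant to stand in for "high emission" grid cells.
set.seed(1)
nyr  <- 25
temp <- array(rnorm(72 * 36 * nyr), dim = c(72, 36, nyr))   # lon x lat x year
mask <- matrix(runif(72 * 36) > 0.7, nrow = 72)             # arbitrary area mask

years <- seq_len(nyr)
trend <- function(y) coef(lm(y ~ years))[2]   # OLS slope per year

# Method 1: average over the area first, then take the trend of the mean series
mean_series   <- apply(temp, 3, function(f) mean(f[mask], na.rm = TRUE))
trend_of_mean <- trend(mean_series)

# Method 2: take the trend in each grid cell, then average the trends over the mask
cell_trends    <- apply(temp, c(1, 2), trend)
mean_of_trends <- mean(cell_trends[mask], na.rm = TRUE)

c(trend_of_mean, mean_of_trends)   # equal (to rounding) with complete data
```

With missing data the two versions diverge slightly, and having both coded makes that sensitivity trivial to test.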
Replication
Given the data from various sources and my own code for the processing steps, I ran a few test cases to show that I was getting basically the same results in the same circumstances as were reported in the original papers. That worked out fine. Had there been any further issues at this point, I would have sent out a couple of emails, but this was not necessary. Jos de Laat had helpfully replied to two previous questions (concerning what was included in the emissions and the method used for the average trend), and I’m sure he or the other authors involved would have been happy to clarify anything else that might have come up.
Are we done? Not in the least.
Science
Much of the conversation concerning replication appears to be based on the idea that a large fraction of scientific errors, incorrect conclusions or problematic results stem from errors in coding or analysis. The idealised implication is that if we could just eliminate coding errors, science would be much more error-free. While there are undoubtedly individual cases of this (this protein folding code, for instance), the vast majority of papers that turn out to be wrong or non-robust are so because of incorrect basic assumptions, overestimates of the power of a test, some wishful thinking, or a failure to take account of other important processes. (It might be a good idea for someone to tally this in a quantitative way – any ideas for how that might be done?)
In the cases here, the issues that I thought worth exploring from a scientific point of view were not whether the arithmetic was correct, but whether the conclusions drawn from the analyses were. To test that, I varied the data sources and the time periods used, assessed the effect of spatial auto-correlation on the effective number of degrees of freedom, and, most importantly, looked at how these methodologies stacked up in numerical laboratories (GCM model runs) where I knew the answer already. That was the bulk of the work and where all the science lies – the replication of the previous analyses was merely a means to an end. You can read the paper to see how that all worked out (actually, even the abstract might be enough).
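To give a flavour of the ‘numerical laboratory’ logic, here is a minimal sketch in R using entirely synthetic fields (crudely smoothed white noise rather than GCM output): regress a spatially smooth ‘trend’ field on a spatially smooth ‘economic’ field that by construction has nothing to do with it, and count how often the nominal per-gridcell significance test cries wolf:

```r
# Minimal sketch (synthetic fields, crude smoothing): nominal OLS p-values
# treat every grid box as independent, so a 5% test rejects far more than
# 5% of the time when both fields are spatially correlated.
set.seed(2)
n_lon <- 72; n_lat <- 36

smooth_field <- function(f, k = 9) {   # crude separable boxcar smoother
  w <- rep(1 / k, k)
  f <- apply(f, 2, function(col) as.numeric(stats::filter(col, w, circular = TRUE)))
  t(apply(f, 1, function(row) as.numeric(stats::filter(row, w, circular = TRUE))))
}

econ <- smooth_field(matrix(rnorm(n_lon * n_lat), n_lon, n_lat))

pvals <- replicate(500, {
  noise <- smooth_field(matrix(rnorm(n_lon * n_lat), n_lon, n_lat))
  fit   <- lm(as.vector(noise) ~ as.vector(econ))
  summary(fit)$coefficients[2, 4]      # nominal p-value on the slope
})
mean(pvals < 0.05)                     # well above 0.05
```

The details of the smoothing are arbitrary, but the qualitative point stands: when both fields are spatially correlated, counting every grid box as an independent data point badly overstates significance.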
Bottom line
Despite minor errors in the printed description of what was done and no online code or data, my replication of the dLM06 analysis and its application to new situations was more thorough than what I was able to do with MM07, despite their more complete online materials. Precisely because I recreated the essential tools myself, I was able to explore the sensitivity of the dLM06 results to all of the factors I thought important. While I did replicate the MM07 analysis, the fact that I was dependent on their initial economic data collation means that some potentially important sensitivities did not get explored. In neither case was replication trivial, though neither was it particularly arduous. In both cases there was enough information to scientifically replicate the results despite very different approaches to archiving. I consider that both sets of authors clearly met their responsibilities to the scientific community to have their work be reproducible.
However, the bigger point is that reproducibility of an analysis does not imply correctness of the conclusions. This is something that many scientists clearly appreciate, and it probably lies at the bottom of the community’s slow uptake of online archiving standards, since they mostly aren’t necessary for demonstrating scientific robustness (as in these cases, for instance). In some sense, it is a good solution to an unimportant problem. Non-scientists do not necessarily share this point of view, and an explicit link is often made between any flaw in code or description, however minor, and the dismissal of a result. However, no such conclusion is warranted until the “does it matter?” question has been fully answered. The unsatisfying part of many online replication attempts is that this question is rarely explored.
To conclude? Ease of replicability does not correlate with the quality of the scientific result.
And oh yes, the supplemental data for my paper are available here.
Martin Vermeer says
Ed #41 the carpenter:
There’s a reason why “skeptic” was in quotation marks ;-)
Richard Steckis says
You said:
“MM07 used an apparently widespread statistics program called STATA and archived a script for all of their calculations. While this might have been useful for someone familiar with this proprietary software, it is next to useless for someone who doesn’t have access to it. STATA scripts are extremely high level, implying they are easy to code and use, but since the underlying code in the routines is not visible or public, they provide no means by which to translate the exact steps taken into a different programming language or environment.”
That statement is not exactly true. There are a limited number of resources available to guide people in the conversion of STATA scripts to R:
http://wiki.r-project.org/rwiki/doku.php?id=getting-started:translations:stata2r
http://wiki.r-project.org/rwiki/doku.php?id=guides:demos:stata_demo_with_r
It may be time consuming but it is doable without the proprietary software.
Alternatively, I am sure McKitrick would be happy to provide an alternative R script for the analyses. Well he would if he values open replication of results.
You only have to ask I guess.
Glen Raphael says
21. dhogaza:
Yes, QA performs both types of testing, but having the info available to exactly replicate a usage case does not preclude more elaborate or independent tests! So far as I can tell from Gavin’s example, MM07’s data provision is strictly better than dLM06’s because it’s the only one that allows both precise and approximate reconstruction. You have the freedom to roll your own methods or include your own alternative data sources, and the security of being able to exactly replicate the reference result set if you get confused by a possible bug and need to track down where in the process differences are creeping in – the best of both worlds.
(And it sounds like both MM07 and dLM06 are strictly better than most of the studies that skeptics complain about. Maybe standards are improving.)
Most of the time when a developer “lies” about how his code works he doesn’t know he’s lying because he has fooled himself too. Again, this is human nature – I don’t regard it as a symptom of either dishonesty or incompetence. (It is often said that QA “keeps developers honest” in regard to claims made about the code, and that’s part of the role I see for “auditors” as well.)
31. Gavin:
Yes, replication is fine even if it’s only theoretical. When the issue comes up, what we code-obsessed nitpickers want to see is the actual code you used to generate your result, even if it’s in COBOL or uncommented APL or requires million-dollar hardware. What matters is that somebody could in theory reproduce your work precisely if they really needed to; that, for us, is part of what makes the work “science”. And yeah, you might still hear people whine about other issues – insufficient comments or inelegant code or whatever, which is fine, but all they have a reasonable right to expect is to see the code you yourself used, in whatever state it’s in. Anything more is gravy.
You wrote “the total emissions I came up with differed from the number given in the paper. A quick email to the author resolved the issue”. Ah, but what if the author of the paper got hit by a bus? What if he’s too busy to respond or just doesn’t feel like doing so? MM07-style disclosure means you don’t have to worry about any of that; those who want to see what was done can figure it out if they need to.
dhogaza says
Of course. But in the context of results reported in a research paper – hate to belabor the point, but you’ve apparently missed it – replication of the “usage case”, i.e. the computation(s) used to generate those results, is all that’s important regarding the correctness of those results.
Someone above posted an example of exactly what I’m stating, i.e. that extended use of some code showed a serious bug, but the work done in the generation of results for a particular paper didn’t hit that bug. Therefore … the conclusions of the paper were not affected.
And quit trying to back off from your original statement regarding the possibility that a researcher might be lying about their research results. Not “lying”. Lying. Your meaning was clear. If you regret it, man up and apologize.
Mark says
dhogaza #55, he CAN’T man up and apologize. Doing so would be admitting being wrong. And his entire point is that if you’re wrong in one thing, no matter how small, you can be readily accused of being wrong elsewhere in bigger things and you then have to prove you’re NOT wrong.
So no apologizing is allowed.
Barton Paul Levenson says
Glen Raphael writes:
Then you don’t understand what reproducibility means in science. It means you get the same results, not that you use exactly the same procedure and code. If you can’t reproduce the same results independently, you aren’t really doing any useful work. You need the method and the algorithm. You don’t need the code.
michel says
I do find this very puzzling. They have data, they have code, they have run the code against the data. Why would they not want to publish it? It makes no sense. You have lots of more or less convoluted and emotional points here about this, but that’s the bottom line: why on earth not?
Don’t think this is missing something. It’s the whole and only point.
Mark says
58, why does publishing it make sense? More work. No benefit. cost/benefit analysis negative.
Pop along to Oracle and tell them it makes no sense to have closed source with copyrights, since open source with copyrights gives them the same power.
Martin Vermeer says
michel, you wouldn’t be able to run the code if it bit you in the bacon. The data and the code are out there to the satisfaction of any competent scientist. All you have to do is call little scripts in the right order on the right intermediate files. The Unix way, the science way. There is no end-user hand-holding mega-script. Writing one would be a major additional chore — precisely the intention of the bitchers.
I know from experience, been there, done that. You don’t. That’s a big point you’re missing. Figure it out.
Glen Raphael says
55. dhogoza
56. mark
I regret using the word “lying” in such a politically-charged context. I apologize for doing so.
Although I didn’t intend by my generic hypothetical example to accuse anyone in particular of doing anything in particular, I can see how the background context of this discussion might have suggested one or more specific targets. I did not intend that implication, do regret it, and will try to be more careful when using similar language in the future.
Richard Steckis says
#56 BPL:
“You need the method and the algorithm”
I have looked at McKitrick’s code. I am halfway through converting it to R code. I have only learnt R over the last month or so and do not know STATA at all. But getting the data and some of the analyses into R is a no-brainer.
McKitrick has well documented his method and the algorithms used in the code file.
He used Hausman’s Test which is apparently well used in Econometrics. R has a package that will perform the Hausman test.
What I am trying to say is that with just the code and data files, McKitrick’s work is easily replicable.
[Response: I certainly didn’t disagree – in fact that was my main point. With a little expertise (with R in your case), all of these scientific results are quite easy to replicate. – gavin]
Ross McKitrick says
Hi Gavin,
I had thought this post would be about the actual findings in your IJOC paper, and on that point I disagree with your interpretation of your results. But that can wait for another day. The immediate point of your post seems, to me, to be that there is a difference between reproducing results versus replicating an effect; and a difference between necessary and sufficient disclosure for replication. Full disclosure of data and code sufficient for reproducing the results does not ensure an effect can be replicated on a new data set: Agreed. But that is not an argument against full disclosure of data and code. Such disclosure substantially reduces the time cost for people to investigate the effect, it makes it easy to discover and correct coding and calculation errors (as happened to me when Tim Lambert found the cosine error in my 2004 code) and it takes off the table a lot of pointless intermediate issues about what calculations were done. Assuming you are not trying to argue that authors actually should withhold data and/or code–i.e. assuming you are merely pointing out that there is more to replication than simply reproducing the original results–one can hardly argue with what you are saying herein.
[Response: And why would anyone want to? ;) – gavin]
I do, however, dispute your suggestion that I am to blame for the fact that dispensing with the spatial autocorrelation issue has not appeared in a journal yet. Rasmus posted on this issue at RC in December 2007. I promptly wrote a paper about it and sent it to the JGR. The editor sent me a note saying: “Your manuscript has the flavour of the ‘Response’ but there are no scientists that have prepared a ‘Comment’ to challenge your original paper. Therefore, I don’t see how I can publish the current manuscript.” So I forwarded this to Rasmus and encouraged him to write up his RC post and submit it to the journal so our exchange could be refereed. Rasmus replied on Dec 28 2007 “I will give your proposition a thought, but I should also tell you that I’m getting more and more strapped for time, both at work and home. Deadlines and new projects are coming up…” Then I waited and waited, but by late 2008 it was clear he wasn’t going to submit his material to a journal. I have since bundled the topic in with another paper, but that material is only at the in-review stage. And of course I will go over it all when I send in a reply to the IJOC.
Briefly, spatial autocorrelation of the temperature field only matters if it (i) affects the trend field, and (ii) carries over to the regression residuals. (i) is likely true, though not in all cases. (ii) is generally not true. Remember that the OLS/GLS variance matrix is a function of the regression residuals, not the dependent variable. But even if I treat for SAC, the results are not affected.
[Response: I disagree, the significance of any correlation is heavily dependent on the true effective number of degrees of freedom, and assuming that there are really 400+ dof in your correlations is a huge overestimate (just look at the maps!). I’m happy to have a discussion on how the appropriate number should be calculated, but a claim that the results are independent of that can’t be sustained. I look forward to any response you may have. – gavin]
Nicolas Nierenberg says
#58 why do you keep bringing up Oracle? I’m sure you realize that a commercial developer of software, where the software itself is the product, has no connection to the output of a scientific researcher.
Other than a couple of people the conversation seems to be converging on the fact that providing the code and data is preferable. Dr. Schmidt has said that this will be done in the case of the Steig paper. [editor note: this was done in the case of Steig et al with respect to code though perhaps not with as much hand-holding as you seem to want. some of these data are proprietary (NASA), but will be made available in the near future]
By the way the statement that scientific replication means getting the same result with your own methods just isn’t correct. In the case of experimental data, the idea is to produce the same result using the same methods. It is very critical in the case of experimentation to completely understand the methods used in the original experiment.
It is also possible then to vary the methods to see if the result is robust.
In this case the purpose of replication is to make sure that you completely understand what was done before varying the tests or adding new tests to see if the result is robust.
Hank Roberts says
> vary the methods to see if the result is robust
I believe this is not what “robust” means in this field. Would someone knowledgeable check the sense there? As I understood it “robust” means some of the data sets can be omitted — for example if you’re looking at ice core isotopes, tree ring measures, and species in a sequence of strata, a robust conclusion would be one where any one of those proxies could be omitted.
I am sure that given complete access to someone’s work, it would be possible to vary the method used until it broke beyond some point and, if one were a PR practitioner, then proclaim it worthless.
You can break anything. The point is: do you know the limits of the tool, within which it is usable and beyond which it fails, and how it fails?
dhogaza says
The claim was made that open source is “standard practice” in the commercial world. I originally provided a set of counterexamples which then devolved into nit-picking attempts to demonstrate that since Oracle does support a handful of open source projects, that the “standard practice” claim is correct.
Which is different than demands that it is MANDATORY and that scientific results can’t be trusted unless a source code is available.
Which, of course, means that any paper based on the results of STATA or other proprietary software needs to be immediately rejected by the “auditing” community, if they follow their demands to a logical conclusion.
Which they won’t, as long as the “right people” are using STATA. As evidence I note that the rock-throwing “auditing” crowd isn’t demanding full source disclosure, or the use of open source software, from M&M.
I can’t think of any reason for this double standard … (snark)
Bill DeMott says
Many of those concerned about fraud and mistakes in scientific publications seem to forget about or misunderstand the peer review process. Top journals reject a high percentage of submitted papers, and scientific reviewers typically do a good job of critical, objective evaluation. Acceptance of a paper depends on a number of issues including, most importantly, the support of conclusions by evidence and the originality and significance of the results.
People who submit manuscripts for peer review (or act as editors or reviewers) know how detailed and picky, but sometimes helpful and insightful, reviewers can be. Authors also appreciate detailed and critical reviews because they want their papers to be as clear and as error-free as possible when they appear in print. Often it is most helpful when a reviewer expresses a misunderstanding, because this usually means that the authors were unclear, even to specialists.
I’ve been a reviewer or editor for about 600 manuscripts submitted to peer-reviewed ecological journals (something that I need to document for annual reports). Scientists place trust in the review process and I think that this trust is well placed. This does not mean that no questionable work gets through. However, peer reviewers are very good at finding mistakes, misstatements, questionable citations, problems with statistics and data analysis, etc. When a reviewer feels that more documentation or basic data are needed, he or she can ask for it. If a paper is considered a potentially significant contribution, editors give authors a chance to defend and/or revise their manuscripts in response to the reviewers’ comments. Oftentimes, controversial (and important) papers receive very critical reviews, but different reviewers may criticize for completely different reasons. In this situation, in my opinion, the editor has a rationale for supporting the authors in many cases.
Recently, many journals have been allowing authors to submit “supplemental online supporting material” with more details about methods, statistics, etc. I occasionally see a paper in my field that I think has basic problems, but most published scientific research is made credible by the review process. Often papers are unimpressive the first time I read them, but seem much more valuable when I become interested in the specific issues that they deal with. In contrast, blog postings are not reviewed and should be viewed with skepticism unless the poster is an author talking about his or her own peer-reviewed publications.
Nicolas Nierenberg says
Editor, all I have seen is a pointer to a set of routines implemented in matlab. It is far less than “hand holding” to point someone to a routine library and say that some combination of these routines were used. I am not saying that replication is impossible with this amount of information, but it would be a lot of work with an uncertain outcome. Admittedly I haven’t spent a lot of time on this, but I am sure that there is quite a bit in the use of those routines which is at the discretion of the user.
Before someone starts talking about incompetent people etc. I am the founder of two software companies, and did most of the initial technical development on both of them. I am qualified to comment on this.
I will repeat my earlier statement, which has not been contradicted, whatever Dr. Steig would make available to other researchers if asked should be posted. This appears to be more than a pointer to general purpose statistical routines.
I also note that the idea that some of this is proprietary is new to these threads. What portion is NASA claiming as proprietary?
Also nothing about that library resolves the questions about the gridded AVHRR data as far as I know.
Contrast this with the MM paper outlined here. Even if Dr. Schmidt didn’t use the particular software that was originally used, he could see the code and comments, which would allow him to determine what steps were taken. I don’t think Dr. Schmidt is contending that he replicated MM without reviewing that code.
[Response: The code is too high level to be useful in replication if you aren’t running the exact same software. Calculations are written “regress surf trop slp dry dslp” and “test g e x” and the like. The written descriptions in this case were much more useful. – gavin]
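For readers without Stata, a rough R analogue of the quoted lines might look like the sketch below. It is illustrative only: the variable names are taken from the quoted commands, the data are synthetic stand-ins so the snippet runs on its own, and the actual MM07 regression includes more terms than are shown here.

```r
# Rough, hypothetical R analogue of the quoted Stata commands.
set.seed(11)
n   <- 400
dat <- data.frame(trop = rnorm(n), slp = rnorm(n), dry = rnorm(n),
                  dslp = rnorm(n), g = rnorm(n), e = rnorm(n), x = rnorm(n))
dat$surf <- 0.9 * dat$trop + rnorm(n, sd = 0.1)

# "regress surf trop slp dry dslp ..." becomes:
fit <- lm(surf ~ trop + slp + dry + dslp + g + e + x, data = dat)
summary(fit)

# "test g e x" is a joint Wald test that the g, e and x coefficients are all zero:
library(car)
linearHypothesis(fit, c("g = 0", "e = 0", "x = 0"))
```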
dhogaza says
Well, I founded a compiler company, worked as a compiler writer for much of my life, and currently manage an open source project, which I founded, used quite widely in fairly arcane circles.
And *I* don’t feel qualified to “audit” climate science papers, nor the statistics being evaluated, etc. Source to the software used isn’t really of help other than allowing me to say, “oh, it compiles and runs and gives the same answers as in the paper”, but that’s it.
And that’s true of the rest of the software engineering rock-throwing crowd that’s insisting here that they need the source code to properly vet such papers.
This exercise is really only relevant to those who 1) really do believe that climate science is fraudulent and that researchers do lie about results (and it’s clear from reading CA and WUWT that this population exists and is very vocal in demanding code etc) or 2) don’t understand what is, and is not, of scientific value. Simply verifying that the same code run on the same dataset gives the same results really tells us nothing.
It’s telling, don’t you think, that scientists working in the field aren’t those making these demands? That it’s really only amateurs who are asking?
Do you ask for the source of the flight deck software before taking a flight? If you got it, could you, having founded two software companies, evaluate whether or not the software interacts properly with ground based navigation systems, correctly implements the model of the physics governing the flight characteristics of the airplane that allows modern cockpits to land an airplane automatically, etc?
Now, the scientific community’s interested in increasing sharing in ways that meet their needs. It appears that online availability of datasets has moved a lot more quickly than the availability of source to code. This seems totally reasonable as it meets the needs of scientists to a far larger extent.
pascal says
Gavin
It is not really the topic of your post, but I don’t understand the importance of the problem.
The RSS TLT trend is about the same as the surface temperature trend.
Over the last 30 years, both are about 0.16°C/dec.
The anthropogenic heat flux was, in 1998, 0.03 W/m2 (AR4, chapter 2).
Very roughly, the anthropogenic heat trend is therefore about 0.005 W/m2 per decade.
So even if the latter were treated as a TOA forcing (which it is not), and if we suppose a climate sensitivity of 0.8°C per W/m2, the temperature anomaly trend from anthropogenic heat would be about 0.004°C/dec.
Thus, we can estimate that the RSS trend is “polluted” by anthropogenic heat by only 0.004°C/dec (if we suppose a constant lapse rate), or only about 3%.
Since the RSS and surface trends are the same, can we say that urban or economic effects on surface trends are completely insignificant?
[Response: Well, that presupposes that RSS is correct and UAH wrong – which is unclear. It is clear that there is structural uncertainty in those numbers which makes it difficult to assess whether there is a true discrepancy between them and the surface data. Thus it isn’t a priori a waste of time to look for surface effects – as I suggest in the paper, there may be issues with local climate forcings being badly specified as well as potential contamination of the signal by noise. However, you are correct, the ancillary data (ocean heat content increases, phenology shifts, glacier retreat, Arctic sea ice etc.) all point towards the surface data not being fundamentally incorrect. – gavin]
Mark says
Good post. In response to your response to Walt (#48)… I don’t know of an online system that provides all the data-citation functionality, but it is certainly something the data archiving community is working on. As one example, the International Polar Year data policy requires formal data acknowledgement and IPY provides some guidelines on “how to cite a data set” (http://ipydis.org/data/citations.html), but it is still an evolving practice. There are technical challenges that are slowly being addressed through better versioning and greater use of unique object identifiers, but the greater challenges lie in the culture of science. In many ways, data should be viewed as a new form of publication that correspondingly should be acknowledged, vetted, and curated. You rightly point out that few scientific errors are the result of bad analysis or coding, but I suspect erroneous results may often be due to errors in the data that may not be well characterized. As you say, replication is only part of the issue.
Mark Parsons
National Snow and Ice Data Center
Nicolas Nierenberg says
dhogaza,
If you go back to the original open source comment it was made by Dr. Schmidt, and it referred to the use of open source by enterprises not the publication of it. Everyone knows that not all software is open source so I don’t know what you are trying to prove. The question was whether the use of open source is standard/normal in enterprises. My answer is that it clearly is. Of course the use of proprietary software is also standard/normal. Most of the discussion after that has been a waste of time.
I would say that if I were king it would be mandatory to make code and data available for scientific papers where code and data are used. I don’t know if I would require that it always be based on open source.
I want to make it clear that for me this is not at all an issue of trust. It is an issue of understanding. Dr. Schmidt was curious about the conclusions of two papers. His ability to quickly replicate MM07, because the data and at least a high-level description of the code were available, allowed him to focus on the interesting parts of his analysis of the result.
Mr. Roberts,
Robust means exactly what Dr. Schmidt implied in his post. He looked at things like varying the start dates and time periods to see if the results changed. I don’t know what you mean when you say that you can break anything. If changing the start and end periods changes the results, then they might not be robust. This is part of the argument that Dr. Schmidt makes in his paper.
BTW for those of you having issues with Captcha. I have found that if you click on the little circular arrows you can quickly get to a readable entry.
Ray Ladbury says
It would appear that some folks are confusing “the model” with “the code”. They aren’t the same thing. Unless you understand the physical model, of which the code is an expression/approximation, you probably won’t be able to come to an intelligent assessment of either model or code.
[Response: I think there may be more fundamental miscommunication. Michael Tobis and others see the build scripts and makefiles and data formatting as part of the code – and from a computer science stand point they are right. The scientists on the other hand are much more focussed on the functional part of the algorithm (i.e. RegEM in this case) as being the bit of code that matters – and they too are right. This is possibly the nub of the issue. – gavin]
dhogaza says
The claim, not question, was that open source is a “standard practice” in commercial enterprises.
This is a much stronger claim. Feel free to do the google to see what is meant by a “standard practice” in engineering etc.
Obviously no one disagrees that a mix of open source and proprietary software is used in commercial enterprises.
It’s all a dodge anyway. Ray Ladbury, above, makes the crucial distinction between model and code, same point I’ve made in a couple of posts earlier. Code access only makes sense if you understand the model (or statistical analysis), or mistrust the researchers. Which camp a large number of wannabee “auditors” fall into is obvious from their posts at places like CA and WUWT.
Mark says
Hank, #64, the best I can make of the context in which *I* hear “robust” is that a robust answer is one that doesn’t change if you got something unexpected wrong.
E.g. the measurement 15.6 was actually mis-transcribed and should have been 16.5. If the answer didn’t change when you redid it, then the answer is robust.
OR
You used a 1-degree resolution model. If you used a 2 degree resolution model or a 1/2 degree resolution model, and the answer is the same, then the answer is robust.
and so on.
A robust answer can ONLY result from someone doing the work again THEMSELVES without following exactly the same path (if you add 1 and 1 to get 2, don’t expect to get anything different if someone else runs the same code to add 1 and 1 together); if the answer is the same, the result is robust.
If the answer is different, then that shows how the answer may be variable based on assumptions made.
Both are useful answers from a repeat of the work.
But running the same code, on the same machine, with the same data? That only proves:
a) you didn’t deliberately lie
b) you didn’t print out the wrong report
Nicolas Nierenberg says
Mark,
I generally agree, but that doesn’t mean it is useful or even best practice to do all the work again yourself. For example Dr. Schmidt didn’t have to go and gather up all the econometric data.
The useful work is the change, not the replication.
duBois says
“Robust” doesn’t refer to the experiment. It refers to the hypothesis and the underlying reality it’s describing. It’s “robust” because one needn’t approach it with kid gloves in a delicately restricted fashion. The hypothesis is sound enough to be pounded on. Etc.
Chris Ferrall says
In response to Ross, Dr. Schmidt writes “the true effective number of degrees of freedom, and assuming that there are really 400+ dof in your correlations is a huge overestimate (just look at the maps!).”
Here there appears to be a simple misunderstanding. “Effective degrees of freedom” is not a concept used in applied econometrics. That doesn’t mean the issue is not dealt with. I believe you are simply saying that if the Gauss-Markov theorem assumptions do not hold, then the usual estimated variance matrix is too small. Non-diagonal terms in the variance of the residuals qualify as a violation. However, I believe Ross is trying to say that they used Generalized Least Squares, which is robust to this kind of deviation from plain-vanilla OLS. If I recall correctly, their ‘main’ result does not use GLS, but they explored these types of issues in subsequent sections. There is a tradeoff between using GLS when it is not called for and using OLS, so it is accepted practice to report OLS and focus on it, then follow up with sensitivity tests and not dwell on those results if they turn out to be ignorable.
One can imagine mapping this approach onto one in which we adjust critical values as if we did not have N-k degrees of freedom. But Dr. Schmidt, you seem to think that correlation among observables is itself a problem to be addressed, and the focus on a priori effective degrees of freedom seems to support that view. But this type of correlation is not an issue even for OLS unless the residuals display SIGNIFICANT violations of Gauss-Markov. In that case, GLS is a completely standard approach. The textbooks referenced by MM07 provide complete discussions of these concerns.
[Response: Possibly there is a confusion of terms here. The issue is not that the coefficients that emerge from OLS should be different (they aren’t), but that their significance will be inflated. Let’s give a very simple example: I have a data set with one hundred pairs of data and I calculate the correlation. Now I duplicate that data exactly, so that I have 200 pairs of numbers. The regression is identical. However, if I tried to calculate the significance of r (which goes roughly like 1/sqrt(n)) using n=200 instead of 100, I would find that magically my nominal significance level has increased dramatically. The same thing is happening in this case because neighbouring points are not independent. Take ‘e’, the educational attainment: this has only a few dozen degrees of freedom, and it doesn’t matter how many times you sample it, you can’t increase the effective ‘n’. – gavin]
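The duplication example is easy to verify directly. A minimal sketch in R with synthetic numbers: the correlation coefficient is unchanged, but the nominal p-value shrinks because n is overstated.

```r
# Minimal sketch of the duplication example above (synthetic numbers).
set.seed(3)
x <- rnorm(100)
y <- 0.2 * x + rnorm(100)

cor.test(x, y)                   # r and its nominal p-value with n = 100
cor.test(rep(x, 2), rep(y, 2))   # identical r, much smaller p with n = 200
```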
Walt says
Hi Gavin,
As Mark Parsons pointed out, there are indeed efforts being developed to formalize dataset citation, documentation, etc., and NSIDC is involved in several. Most notable for NSIDC is the IPY data that Mark mentions. At a larger scale is the Group on Earth Observations (GEO), working to implement a Global Earth Observation System of Systems (GEOSS). But these efforts are still in their infancy and there are many issues still to resolve. One such issue is that most of these efforts have focused on data (satellite, aircraft, in situ, etc.) and not model outputs, but it seems to me that model outputs should be considered in the discussions as well.
JohnWayneSeniorTheFifth says
I note the implicit defense of not archiving data/methods completely is slipped in with “a quick email to the author resolved the issue”.
What if a quick email to the author is met with “why would I waste my time on you?”, or just plain silence – what then? If everything is archived, personal animosity between parties cannot hinder the process.
[Response: But there is always the potential for error and/or ambiguity in what was done – that doesn’t go away because the code looks complete. – gavin]
Mike Walker says
“[Response: But there is always the potential for error and/or ambiguity in what was done – that doesn’t go away because the code looks complete. – gavin]”
But the probability is greatly diminished? And that is certainly worth something, is it not?
[Response: Not sure how you’d be able to evaluate that, maybe, maybe not. – gavin]
Gillian says
Maybe the northern hemisphere winter triggers a focus on icy topics here in the past few weeks, but the current heatwaves and fires (death toll 175 and rising) in Australia right now might trigger some discussion about the effects of climate change on arid areas. It’s not at all that you neglect this topic, but for us in Australia right now, melting ice is not the most salient aspect of climate change.
As one of our media commentators said this week, our firestorms are an indication that climate change will make the financial crisis look like a garden party.
Thanks for the great work you do.
dhogaza says
While not nearly as horrifying in human or property loss thus far, the western US is also experiencing more frequent and more intense fires.
We probably share one problem with SE australia that makes it potentially difficult to tease out any global warming signal – fire suppression. But with recent work showing that trees are dying at an increased rate across the West (almost certainly due to global warming), northward expansion of insect pests and invasive non-native pest species (while some are disappearing in their more southern reaches), and a bunch of other ecological changes, it’s getting harder and harder to ignore the role climate change will play in the future, and is playing now. I know, for instance, that fire season has been starting a couple of weeks earlier in recent years than three decades ago …
The brou-ha-ha over the fact that a *portion* of the US has been experiencing a cold winter not only ignores Australia’s exceptionally warm summer, but also the fact that the west coast of the US has had an overall mild winter. Water managers are already wailing about the low snowpack in the mountains and expressing worries about summer.
sidd says
Mr Nierenberg wrote at 10:34 am on the 9th of February:
“By the way the statement that scientific replication means getting the same result with your own methods just isn’t correct. In the case of experimental data, the idea is to produce the same result using the same methods.”
I cannot agree. If I publish a result that the boiling point at STP of a certain compound is 190 C, you hardly need to use exactly the same thermometer and the same heating pad, and the same beaker to replicate my result.
JS says
Gavin,
In your reply to Chris you state that the standard errors may be wrong in OLS. This is well recognised and in this circumstance you use GLS – apparently as McKitrick did. (Or, you can do Newey-West corrections, or White corrections or other well-established approaches as the circumstances dictate.)
So, yes there would seem to be a confusion of terms – the use of GLS would seem to directly address your concerns – as Chris has suggested.
GLS allows for a covariance matrix that is not the identity – it explicitly allows for the possibility that you raise that adjacent (or other) entries are not independent in a clearly specified manner through the specification of the covariance matrix. Thus, one does not need to talk about effective degrees of freedom, one has a parameterisation that deals with departures of the covariance matrix from the identity. Heck, if you use White’s correction it doesn’t even require you to specify the form of heteroskedasticity.
[Response: Let’s make a rule – any digressions into econometrics-speak require links to define terms, and a sincere effort not to bludgeon readers with unreferenced jargon, which, while it might make perfect sense, is not what is required on a climate science blog. For a start, instead of heteroskedasticity, can we say unequal variance (or similar)? Thanks.
I’ll cheerfully admit to not being an econometrician, and I’m not going to get involved in esoteric-named-test duels (way above my pay grade!). I am however willing to discuss (and learn), but you have to also see where I’m coming from here. The tests used by MM07 give statistical significance to regressions with synthetic model data that can have had no extraneous influences from economic variables. This is prima facie evidence that the statistical significance is overstated, as is the fragility of ‘noteworthy’ correlations when RSS MSU data is used instead of UAH. My estimates of the spatial correlation were an attempt to see why that might be. The numbers I used are in the supplementary data, and so if you (or someone else) can show me that all these various tests or corrections get it right for the synthetic data (i.e. showing that the correlations are not significant), then I’ll be willing to entertain the notion that they might work in the real world. – gavin]
[Response: One little puzzle though. The significance reported in MM07 Table 2 is the same as I got in my replication – but I was just using OLS and no corrections. If GLS made a difference (and if it didn’t then we are back to the my original point), where is that reported? – gavin]
James says
It just occurred to me that almost everyone here seems to be overlooking the main reason why scientific results should be replicable. It’s not so that they can be checked, or “audited” by those hoping to find some trivial error that they can use as a talking point. It’s so the results can be used as a basis for further work.
If for instance I read a computer science journal and see a paper on a new algorithm to do X, I’m not likely to code it just to check whether the authors knew what they were doing. If I do try to implement their algorithm, it’s because I’ve found that I need some way of doing X in order to achieve Y.
[Response: Of course. And whether that further work gets done clearly marks the difference between people who genuinely interested and those who are grandstanding. – gavin]
Nicolas Nierenberg says
sidd,
If you can’t understand the difference between methods and objects then perhaps programming isn’t for you ;-)
Nicolas Nierenberg says
Dr. Schmidt,
Could you comment on the fact that while you show significant relationships with model data they are of the inverse sign to MM? You note this in the paper, but that seems relevant to me.
[Response: Sure, I think those relationships are spurious. They aren’t really significant (despite what the calculation suggests) and therefore whether they are positive or negative is moot. If anyone thinks otherwise, they have to explain why models with no extraneous contamination show highly significant correlations with economic factors that have nothing to do with them, and which disappear if you sub-sample down to the level where the number of effective degrees of freedom is comparable to the number of data points used. – gavin]
JS says
My apologies for using jargon. Here are some links that might make some slight amends:
GLS
White’s correction for heteroskedasticity
Newey West standard errors (A Stata help page – I couldn’t resist, it was the best reference I could find on a quick Google.)
Section 4.1 states that “Equation (3) was estimated using Generalized Least Squares…following White”. The table notes say that the standard errors in Table 2 are ‘robust’. This implies to me, based on the text, that they have used the White correction in reporting the standard errors in this table. It could be that the White correction makes no difference, so you get the same results with OLS, but to delve deeper I might suggest you contact the author for more details ;)
But, recognising that this is taking this thread a little away from its original focus I will curtail comment here. (Or is this a case in point about replication and what is necessary?)
[Response: Thanks (not sure I’m much the wiser though). (By the way, I have nothing against Stata, it seems extremely powerful. It’s just that I don’t have it on my desktop). – gavin]
michel says
I still don’t get it. Whatever the code is, if its in snippets of script, or in one large program, why not release it? Why do you have to do extra work? Just release whatever you used, and let people make what they can of it.
And don’t all pile in and tell me I have never done things, and don’t understand things, like write shell scripts, run shell scripts or write programs, or use Unix, that I do every day, thank you!
[Response: Samuel Johnson may have once written that “I did not have time to write you a short letter, so I wrote you a long one instead.” Encapsulating what is necessary in an efficient form is precisely what scientists do, and this is the case whether it’s in a paper or an online submission. My working directories are always a mess – full of dead ends, things that turned out to be irrelevent or that never made it into the paper, or are part of further ongoing projects. Some elements (such a one line unix processing) aren’t written down anywhere. Extracting exactly the part that corresponds to a single paper and documenting it so that it is clear what your conventions are (often unstated) is non-trivial. – gavin]
Thor says
#79
[Response: But there is always the potential for error and/or ambiguity in what was done – that doesn’t go away because the code looks complete. – gavin]
Ambiguity is generally larger in prose than in code.
Scientific papers today seem to rely on a lot more data than just a few years back. Computer processing is no longer optional, it has become a requirement. Thus, the code exists, and it will also be highly unambiguous, regardless of what language was used.
One simple way of assuring that replication is possible is to make the code available. If someone wants to run it, fine. But the main thing is that code availability reduces the algorithmic ambiguity that will always be in the paper itself.
tamino says
Re: #88 (JS)
If you use the White correction to least squares, it’s a very restrictive form of GLS which assumes the noise is uncorrelated (although heteroskedastic), so the variance-covariance matrix is not the identity matrix but is still diagonal. If the var-covar matrix is not actually diagonal, then the White correction doesn’t give the right answer.
Newey-West (as referenced in Stata) isn’t GLS at all; it’s a correction to the standard errors for autocorrelated/heteroskedastic noise when using OLS. That’s a perfectly valid procedure, and is computationally simpler than GLS, but not quite as precise.
But the bottom line is: for any method to be right you have to get the variance-covariance matrix right. If you assume no correlation (i.e., a diagonal matrix) when it ain’t, your results will be incorrect.
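For those who want to see what these corrections look like in practice, here is a minimal sketch in R using the sandwich and lmtest packages, with synthetic data and autocorrelated noise (this is not the MM07 calculation, just an illustration of the machinery):

```r
# Minimal sketch (synthetic data): OLS with plain, White and Newey-West
# standard errors. None of these rescue the inference if the true
# variance-covariance structure (e.g. spatial correlation) is left unmodelled.
library(sandwich)
library(lmtest)

set.seed(4)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$y <- 1 + 0.5 * dat$x1 + as.numeric(arima.sim(list(ar = 0.6), 100))

fit <- lm(y ~ x1 + x2, data = dat)

coeftest(fit)                                    # plain OLS standard errors
coeftest(fit, vcov = vcovHC(fit, type = "HC0"))  # White: unequal variance only
coeftest(fit, vcov = NeweyWest(fit, lag = 4))    # Newey-West: + autocorrelation
```

As tamino notes above, any of these is only as good as the variance-covariance structure it assumes.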
Mark says
re: #90.
How can ambiguity be “larger”? What does “larger” mean?
And if English isn’t your first language, this shows that ambiguity can happen in code too. NOBODY has it as their first language.
There is a reason why ADA is still being used and that’s because it was designed as a formally provable programming language. This being the case, it is obvious to deduce that most languages (and all the popular ones) are not formally provable.
And if they can’t be formally proved, that must mean there is ambiguity in what the program is saying it’s doing and what it really IS doing.
Mark says
re: 75. You missed the point there.
You shouldn’t even be recreating the process exactly. As someone else put it, if you are told in a paper that water boils at 100C at 1 atmosphere, you don’t need to take the same thermometer to check if they’re right.
In fact, if their thermometer wasn’t calibrated correctly, you may be WORSE off if you used their thermometer.
So do the work again yourself. If you use exactly the same programs, exactly the same data and exactly the same procedures all you’ve proven is that the original didn’t mistype.
Whooo. Big it up for the science.
I mean, when it came to cold fusion, people were trying to replicate the process. And when failing, said so. And the original authors said “You have to do it EXACTLY our way”. Now do you think if, instead of trying it again themselves, they went and put the data from the experiment they did into the program they used, you’d get the wrong result? Possibly, but just as likely you’d get the same result because the measurements were incorrectly done or someone in the next lab was using a polonium target or something.
So redoing their work rather than redoing their experiment could easily have proven cold fusion.
Way to work!
Barton Paul Levenson says
Nicolas Nierenberg writes:
And we objected because your answer is clearly insane. It is NOT “standard” for software companies to make their code open source. If it were, there probably wouldn’t be any software companies.
Thor says
#92 Mark,
I’m sure you understood perfectly well what I meant – even though my prose might have had a slightly higher level of ambiguity than intended. And yes you are correct, English is not my first language.
My entire point is that algorithmic ambiguity should be at as low a level as feasible. Formal proofs in that context are just silly.
dhogaza says
Not silly, just impractical.
Tamino above illustrates a point several of us have been making about reading code in order to “audit science”:
You need to have sufficient knowledge to have proper insight into what the underlying algorithm, statistical analysis or physical model being programmed actually does.
In tamino’s example, if your assumption regarding correlation is wrong, it doesn’t matter a bit whether the code doing the analysis is correct or buggy as hell: you will get an incorrect result.
Now, scroll backwards to michel’s comment, in which he talks about why he wants the code at a level which seems fairly representative of the rock-throwing “free the source” hacker mob:
This clearly demonstrates the misconception, seemingly held by that crowd, that understanding something about software (and operating environments and how to type on a keyboard) is somehow relevant to really understanding what a piece of scientific programming does.
Mark says
Thor 95, And you missed my point. NOBODY outside a mom’s basement at 47 years of age has “C” as their first language.
The algorithms are in the paper and that is what the code SHOULD be doing.
Write your own code to the algorithm presented. And, like the airplane fly-by-wire servers, you have a MUCH better answer.
The source code might be nice, but so would a pony. Do we go demanding a pony too?
Mark says
Tamino #91 and there’s a great example of why the code isn’t really worth as much as the paper itself.
As a computer programmer (though not a CS major), I am all “WTF???” about that.
So to make appropriate use of the code and paper (enough to enable us to DEMAND it all be available), we need a team of
climatologists
statisticians
logicians
CS majors
and a typist.
Anyone out there asking for the code got such a group ready to work on it???
Nicolas Nierenberg says
Mr. Levenson,
I’m sorry if my wording confused you. In my business, the word “enterprises” generally refers to the users of software, not the publishers. So please read my comments in that context. Many if not most large enterprises have open source as at least a portion of their corporate standard.
It doesn’t seem like that profitable a discussion anyway since I don’t believe that Dr. Schmidt is arguing that the software they developed is proprietary. I think instead his point had to do with a perception that I would require the use of open source software. I wouldn’t although all things being equal it would be preferable.
Nicolas Nierenberg says
Dr. Schmidt, you comment that, bottom line, you think the correlations in MM07 are spurious. I find it interesting that, when a significant number of points is withheld from the regression, it is still successful at predicting the values of the held-out points. I am not an expert on this type of statistical analysis, but it seems to me that this is a good test of whether it is completely spurious.
[Response: But this test was done picking at random. If there is spatial correlation, picking one point instead of another point in the same region will give pretty much the same result (there is a lot of redundancy). So you can in fact discard most points and find the same correlation, but this actually confirms that the points are not all independent. A better test would have been whether results from one region (say Europe and surroundings) predicted values elsewhere. – gavin]
I guess I’m also not completely surprised that there would be some correlation between the results of model output and the factors studied in MM07. The models are designed to emulate the real world, and to some degree must have been corrected to improve this emulation. This would mean that they correlate generally with real world temperature distributions over historical periods as measured. This is the same historical measurement system studied in MM07 as I understand it.
[Response: Regional down to grid-box trends for short time periods (we are looking at a 23 year period) are not robust in coupled models due to the different realisations of internal variability. These models do not include any temperature data either (though they do have volcanic effects included at the right time, and good estimates of the other forcings). – gavin]