“Reproducible research” is a hot question

I have long been interested in reproducible research, and as a manuscript author, reviewer and, more recently, editor, I have tried to make sure that no key information is missing and that methods are described in full detail and are, of course, valid.

Although the problem has always existed, I think that in recent years papers and reports with badly described methods have become more frequent. There are, I think, many reasons for this:

  1. the pressure to publish quickly and frequently as a condition for career advancement,
  2. the overload on reviewers’ time and the pressure from journals to get manuscript reviews submitted within a few days,
  3. journals’ ever stricter limits on the maximum number of “free” pages, and
  4. the practice by some journals of publishing methods at the end of papers or in a smaller typeface, implying that methods are unimportant to most readers and irrelevant for understanding the results described (which is a false premise).


Some frequent ways of unintentionally misrepresenting experimental results

Many students, and some researchers, are unaware that each of the following practices is statistically invalid and could be considered ‘research-results manipulation’ (i.e. cheating):

  1. Repeating an experiment until the result is statistically significant (see the simulation sketch after this list).
  2. Reporting only a ‘typical’ (=nice-looking) replication of the experiment, and presenting statistics (tests of significance and/or parameter estimates such as means and standard errors) based only on this subset of the data.
  3. Presenting a subset of the data chosen using a subjective criterion.
  4. Not reporting that outliers have been removed from the data presented or used in analyses.
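
To make the first point concrete, here is a minimal simulation sketch (my own addition, with assumed sample sizes and an assumed maximum number of re-runs, not taken from any real study) of how re-running an experiment until p < 0.05 inflates the false-positive rate even when there is no true effect:

```python
# Assumed example: an experiment with NO true effect is re-run until p < 0.05,
# up to five attempts, and only the "significant" run would be reported.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def run_until_significant(max_attempts=5, n=10, alpha=0.05):
    """Re-run a two-sample t-test experiment until p < alpha or attempts run out.
    Both groups share the same true mean, so any 'significant' result is spurious."""
    for _ in range(max_attempts):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(0.0, 1.0, n)
        _, p = stats.ttest_ind(a, b)
        if p < alpha:
            return True
    return False

n_sim = 10_000
rate = sum(run_until_significant() for _ in range(n_sim)) / n_sim
print(f"Nominal alpha: 0.05; observed false-positive rate: {rate:.3f}")
# With up to 5 attempts the rate is roughly 1 - 0.95**5, i.e. above 0.2.
```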

RG: Do we actually need (or understand) more than basic statistics?

Link to the original Q&A thread at ResearchGate

This is another topic worth looking at, and especially worth thinking about. I copy my answer here; it is to some extent off-topic (you will need to follow the link above to read the original question and the other answers):

Students I have supervised frequently seem to think that statistical tests come first, rather than being a source of guidance on how far we can stretch the inferences we make by “looking at the data” and derived summaries. They just describe effects as statistically significant or not, which results in very boring “results” sections lacking the information that the reader wants to know. When I read a paper I want to know the direction and size of an effect and what patterns are present in the data [illustrated with a small sketch after this answer]; if there is a test, it should help us decide how much caution we need to exercise until additional evidence becomes available. Many students and experienced researchers who “worship” p-values and the use of strict risk levels ignore how powerful and important the careful design of experiments is, and how the frequently seen use of “approximate” randomization procedures, or the approach of repeating an experiment until the results become significant, invalidates the p-values they report.

[edited 5 min later] As I read again what I wrote it feels off-topic, but what I am trying to say is that not only the proliferation of p-values, and especially the use of fixed risk levels, but also, many times, how results are presented, reflects a much bigger problem: statistics being taught as a mechanical and exact science based on clear and fixed rules. Oversimplifying the subtleties and the degree of subjectivity involved in any data analysis, especially with respect to which assumptions are reasonable and how the experimental protocol determines which assumptions are tenable, simply fails to provide the most useful training for anybody doing experimental research. So, in my opinion, yes, we need to understand much more than basic statistics in terms of principles, but this does not mean that we need to know advanced statistical procedures unless we use them or assess work that uses them.
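
As an aside to the answer above, here is a minimal sketch, using made-up data, of the kind of summary I have in mind: reporting the direction and size of an effect together with its uncertainty and the actual p-value, rather than only stating whether p < 0.05. The data and group sizes are hypothetical illustrations, not taken from any particular study.

```python
# Hypothetical data; the point is the form of the report, not the numbers.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(10.0, 2.0, 12)  # 12 replicates per group, assumed
treated = rng.normal(11.5, 2.0, 12)

diff = treated.mean() - control.mean()
# Standard error of the difference and an approximate 95% confidence interval
se = np.sqrt(treated.var(ddof=1) / treated.size + control.var(ddof=1) / control.size)
df = treated.size + control.size - 2          # simple approximation to the d.f.
t_crit = stats.t.ppf(0.975, df)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se
_, p = stats.ttest_ind(treated, control)

# Report direction, size, uncertainty, and the actual p-value:
print(f"Treated - control: {diff:.2f} units "
      f"(95% CI {ci_low:.2f} to {ci_high:.2f}), p = {p:.3f}")
```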


RG: What prevents you from using a p-value other than 0.05 as your statistical significance cut-off?

Link to the original Q&A thread at ResearchGate

Even though there were already 84 answers, I added my own answer:

… for me, choosing the critical p-value is not a statistical question. It is in the realm of the effective real-world cost of making the wrong decision. In research, it mainly relates to balancing “false positive” and “false negative” decisions. So, mostly informally, researchers sometimes set the critical value at 0.1 (10%) when replication is low. On the other hand, when we have many replicates, we will find statistically significant differences that are biologically irrelevant. [Added only here: The 5% level tends to work not too badly for the numbers of replicates used by many of us.]
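
[Added only in this post: a rough simulation, with assumed numbers, of the “many replicates” point, showing a trivially small difference becoming statistically significant when replication is very high.]

```python
# Assumed numbers for illustration: a 0.1-unit shift on a mean of 100 (0.1%),
# which would usually be biologically irrelevant, tested with huge replication.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 1_000_000                     # extremely many replicates per group
a = rng.normal(100.0, 10.0, n)    # "control"
b = rng.normal(100.1, 10.0, n)    # "treatment": tiny true difference

t, p = stats.ttest_ind(a, b)
print(f"Mean difference: {b.mean() - a.mean():.3f} units, p = {p:.1e}")
# p is almost certainly far below 0.05 even though the effect is trivially small.
```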

In my opinion, in every scientific publication, whatever critical value we use for discussing and interpreting the results, the actual p-values should always be given. Not doing so just discards valuable information. Of course, one historical reason for not reporting actual values was the laborious calculation needed to obtain them by interpolation from printed tables.

The situation has far-reaching consequences for regulatory-compliance studies, environmental impact assessments, and safety. I would not want to take a 1-in-20 risk of making the wrong decision about a possibly lethal side-effect of a new medicine, while it might be acceptable to take that risk when comparing the new medicine to a currently used medicine known to be highly effective [but maybe not when comparing against a placebo]. In such cases we would want to minimize one of the two risks, rather than balance the risks of false positive and false negative decisions. In other words, we would minimize the probability of the type of mistake that we need or want to avoid.

I have avoided statistical jargon to make this understandable to more readers. Statisticians call these Type I and Type II errors, and there is plenty of literature on this. In any case, I feel most comfortable with Tukey’s view on hypothesis testing, and his idea that we can NEVER ACCEPT the null hypothesis. We can either get evidence that A > B or that A < B, or we do not have enough evidence to decide which one is bigger. Of course, in practice, using power analysis we can assess whether we could have detected a difference large enough to be relevant in practice. However, this is conceptually very different from accepting that there is no difference or no effect.
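
[Added only in this post: a minimal power-analysis sketch by simulation, with assumed effect size, standard deviation and replication, of the last point: when power is low, “not significant” cannot be read as “no effect”.]

```python
# Assumed numbers: 6 replicates per group, SD = 2.0, and a true difference of
# 2.0 units that we regard as relevant in practice.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def simulated_power(true_diff, sd, n_per_group, alpha=0.05, n_sim=5000):
    """Fraction of simulated experiments in which the t-test gives p < alpha."""
    hits = 0
    for _ in range(n_sim):
        a = rng.normal(0.0, sd, n_per_group)
        b = rng.normal(true_diff, sd, n_per_group)
        _, p = stats.ttest_ind(a, b)
        if p < alpha:
            hits += 1
    return hits / n_sim

power = simulated_power(true_diff=2.0, sd=2.0, n_per_group=6)
print(f"Estimated power to detect a 2.0-unit difference: {power:.2f}")
# If power is well below 1, a non-significant result says little about whether
# a relevant difference exists; it is not evidence for "no effect".
```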

[I would like to see students, and teachers, commenting on this problem, and how it fits with their understanding of the use of statistics in real situations. Please just comment below. I will respond to any comments, and write a follow-up post on the effect of using different numbers of replicates on inferences derived from data.]