Scientists, Data Scientists And Significance
Written by Mike James
Monday, 15 April 2019

## Misusing Significance

So the procedure for significance testing is:

• State what your experimental procedure is. For example, I will reject the hypothesis that this is a fair coin if I see more than x heads or tails.

• State what the significance is for this experiment. This is the probability of a type I error, that is, of rejecting the null hypothesis when it is true. For example, at a significance of 0.05 my experiment will wrongly reject a true null hypothesis about 5 times in 100 repeats.

• State what the power is. This is the probability of rejecting the null hypothesis when it is false, that is, one minus the probability of a type II error (failing to reject a false null hypothesis). For example, if the coin's probability of falling heads is 0.25 or less, or 0.75 or greater, then the experiment has a power close to 1.

Once again note that it is the experiment that is quantified in terms of significance and power and not a particular realization of the experiment.
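The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not fix the number of tosses or the threshold x, so `n = 50`, `alpha = 0.05` and the helper names `rejection_prob` and `smallest_threshold` are hypothetical choices, and the "biased" coin is taken to have a heads probability of exactly 0.75.

```python
from math import comb

def rejection_prob(n, x, p):
    """Probability of seeing more than x heads or more than x tails
    in n tosses of a coin whose heads probability is p."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if k > x or (n - k) > x
    )

def smallest_threshold(n, alpha):
    """Smallest x for which the rule 'reject if more than x heads or tails'
    has a type I error of at most alpha when the coin is fair."""
    for x in range(n // 2, n + 1):
        if rejection_prob(n, x, 0.5) <= alpha:
            return x

n = 50          # assumed number of tosses
alpha = 0.05    # assumed target significance

# Step 1: the procedure, fixed before any data is collected.
x = smallest_threshold(n, alpha)

# Step 2: the significance, the chance of rejecting a fair coin.
significance = rejection_prob(n, x, 0.5)

# Step 3: the power, the chance of rejecting a coin with P(heads) = 0.75.
power = rejection_prob(n, x, 0.75)

print(f"Reject if more than {x} heads or tails in {n} tosses")
print(f"significance = {significance:.4f}, power at p=0.75 = {power:.4f}")
```

Because the number of heads is discrete, the achieved significance is usually somewhat below the target `alpha` rather than exactly equal to it.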

It is important that the experiment is performed in exactly the way that would result in the probabilities calculated if it were repeated. For example, if you selected the best of three lots of 50 tosses, then the repeated experiment would not be described by the significance and power you calculated for a single lot.
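To see how much the best-of-three shortcut distorts things, here is a rough calculation. It assumes, purely for illustration, that each lot uses the rule "reject if more than 32 heads or tails in 50 tosses", that the three lots are independent, and that "best" means reporting the lot that rejects if any of them does.

```python
from math import comb

def rejection_prob(n, x, p):
    """Probability of more than x heads or more than x tails in n tosses."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if k > x or (n - k) > x
    )

n, x = 50, 32   # assumed rule for a single lot of tosses

# Type I error of the experiment as it was designed: one lot, fair coin.
single_lot = rejection_prob(n, x, 0.5)

# Type I error if three independent lots are run and the most extreme one
# is reported: the fair coin is rejected if ANY of the three lots rejects.
best_of_three = 1 - (1 - single_lot) ** 3

print(f"one lot:       {single_lot:.4f}")
print(f"best of three: {best_of_three:.4f}")
```

Under these assumptions the real type I error is close to three times the figure that would be quoted for a single lot.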

A common distortion of the procedure is to do the experiment first and work out the significance afterwards. If the result is better than 0.05, it is quoted as significant at the 5% level; if it is better than 0.01, it is quoted as significant at the 1% level. This does not make it an experiment with a significance of 0.01: if you repeat the procedure, you will still erroneously reject the null hypothesis 5 times in 100. The "extra" significance is spurious.

A much bigger problem is the repeated experiment situation. If you are using experiments that have a significance of 5%, then if you repeat the experiment 100 times you should expect to see five significant results purely by chance, even when there is nothing to find. I was once asked why, in a ten by ten correlation matrix, there were always a handful of good significant correlations. When I explained why this was always the case, I was told that the researcher was going to forget what he had just discovered and I was never to repeat it. Yes, measuring lots of things and being surprised at a handful of significant results is an important experimental tool. If repeated attempts at finding something significant were replaced by something more reliable, the number of papers in many subjects would drop to a trickle. This is a prime cause of the irreproducibility of results: a repeat generally finds the same number of significant results, just a different set.
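The correlation matrix story is easy to check with some arithmetic. The sketch below assumes that every correlation is tested at the 5% level, that none of the variables is really related to any other, and, as a simplification, that the tests are independent (correlations that share a variable are not strictly independent, so treat the second figure as approximate).

```python
from math import comb

n_variables = 10   # a ten by ten correlation matrix
alpha = 0.05       # significance level of each individual test

# Distinct off-diagonal entries: the matrix is symmetric, so 10 choose 2.
n_pairs = comb(n_variables, 2)

# Expected number of "significant" correlations when nothing is going on.
expected_false_positives = n_pairs * alpha

# Chance of at least one such result, treating the tests as independent.
prob_at_least_one = 1 - (1 - alpha) ** n_pairs

print(f"{n_pairs} correlations tested")
print(f"expected false positives: {expected_false_positives:.2f}")
print(f"P(at least one false positive): {prob_at_least_one:.3f}")
```

A couple of spurious correlations per matrix is the expected outcome, not a discovery.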

Another really big problem is that most researchers don't quote any sort of power figure for their results. The reason is that in many cases the sample sizes are so small that the power is very low. Many studies have a power below 0.5. For example, if you toss the coin only 10 times, the power to detect a bias as large as 0.25 or 0.75 is only about 0.5. What this means is that half of the experiments will fail to detect quite a sizable bias. There are plenty of real world cases where null results are very likely to be due to a lack of power and when it comes to, say, comparing the negative effects of drugs this can be lethal.
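The 10-toss example can be worked through directly. The exact power depends on which rejection rule is used, and the article does not state one, so the sketch below simply tabulates a few candidate rules of the form "reject if more than x heads or tails" against a coin with an assumed heads probability of 0.75.

```python
from math import comb

def rejection_prob(n, x, p):
    """Probability of more than x heads or more than x tails in n tosses."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if k > x or (n - k) > x
    )

n = 10
results = {}
for x in (7, 8, 9):
    size = rejection_prob(n, x, 0.5)     # type I error for a fair coin
    power = rejection_prob(n, x, 0.75)   # chance of catching a 0.75 coin
    results[x] = (size, power)
    print(f"more than {x} heads or tails: size = {size:.3f}, power = {power:.3f}")
```

Under these assumptions a power of about a half is only reached with the loosest rule, whose significance is roughly 0.11. The rule that keeps the significance below 0.05 catches the biased coin only about a quarter of the time.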

Whenever I have tried to explain this, with the advice "you need more data", I have always been met with the response that "more data is impossible; what can we do with what we have?". There are many subjects that would dry up if power estimates were made mandatory.
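How much more data is "more data"? As a rough guide, the sketch below searches for the smallest number of tosses at which a 5% two-sided test has a power of at least 0.9 against a coin with heads probability 0.75. The target power of 0.9, the bias of 0.75 and the helper names are all assumptions made for the sake of the example.

```python
from math import comb

def rejection_prob(n, x, p):
    """Probability of more than x heads or more than x tails in n tosses."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if k > x or (n - k) > x
    )

def power_at(n, alpha, p_biased):
    """Power of the strictest-possible rule with significance <= alpha."""
    for x in range(n // 2, n + 1):
        if rejection_prob(n, x, 0.5) <= alpha:
            return rejection_prob(n, x, p_biased)
    return 0.0

alpha = 0.05         # assumed significance level
target_power = 0.9   # assumed power requirement
p_biased = 0.75      # assumed size of the bias worth detecting

# First sample size that meets the power requirement.
n_needed = next(
    n for n in range(2, 501)
    if power_at(n, alpha, p_biased) >= target_power
)
print(f"tosses needed: {n_needed}")
```

Because the binomial distribution is discrete, power does not rise perfectly smoothly with the number of tosses, so the first sample size that clears the target is a guide rather than a guarantee for every larger one.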

## Not Significant Enough

The suggestions for cleaning up significance range from getting researchers to rephrase their conclusions to estimating confidence intervals. All I can say to this is that if significance testing is misunderstood, confidence intervals are an even deeper mystery. Don't go there please.

So what is the solution?

There is a solution, but many disciplines will simply be unable to accept it. Consider for a moment physics - often a standard by which to judge a scientific procedure. When the apple fell on Newton's head, he didn't have to consider probability. The apple fell to earth with a probability so close to 1 that it wasn't even worth considering. An old-fashioned physics experiment is so certain that many physicists don't know anything much about statistics - a bit of error estimation is quite enough. Putting this another way, physics doesn't work with 5% significance; it uses fantastically small significance levels.

When physicists have to take chance into account they continue this high standard. For example, the discovery of the Higgs boson needed data that was five standard deviations away from the results predicted by a model where it didn't exist. That is, a rough significance level of 0.0000003, or 1 in 3.5 million - think about that compared to 5 in 100. Particle physics in general requires a significance of 0.003 to announce evidence of a particle and 0.0000003 to announce a discovery.
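The link between "standard deviations" and significance levels comes from the tail area of the normal distribution. The short sketch below assumes a normal model; whether a one-sided or two-sided tail is quoted is a matter of convention, so both are shown for the three-sigma case.

```python
from math import erfc, sqrt

def one_sided_p(z):
    """Probability that a standard normal variable exceeds z."""
    return 0.5 * erfc(z / sqrt(2))

p_five_sigma = one_sided_p(5.0)             # "discovery" threshold
p_three_sigma_one = one_sided_p(3.0)        # "evidence" threshold, one tail
p_three_sigma_two = 2 * p_three_sigma_one   # same threshold, both tails

print(f"5 sigma, one-sided: {p_five_sigma:.2e} (1 in {1 / p_five_sigma:,.0f})")
print(f"3 sigma, one-sided: {p_three_sigma_one:.5f}")
print(f"3 sigma, two-sided: {p_three_sigma_two:.5f}")
```

The five-sigma figure matches the 1 in 3.5 million quoted above when a one-sided tail is used, and the two-sided three-sigma tail is the one that rounds to 0.003.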

If we want reproducible results we need to demand much smaller significance levels and be aware of the power of the experiments we perform.

Of course, data scientists have lots of data and could use significance levels similar to physics. The big problem is that with publications decrying the use of significance and alternatives being suggested, the chances are that they will be seduced by lesser procedures that result in just as irreproducible results.

Show me the data, show me the evidence, and make it good.

#### Related Articles

The Monty Hall Problem

What's a Sample of Size One Worth?

Data Science Course For Everyone Now Online

Coursera Offers MOOC-Based Master's in Data Science
