Scientists, Data Scientists And Significance
Written by Mike James   
Monday, 15 April 2019

Misusing Significance

So the procedure for significance testing is:

  • State what your experimental procedure is. For example: I will toss the coin a fixed number of times and reject the hypothesis that it is a fair coin if I see more than x heads or more than x tails.

  • State what the significance is for this experiment. This is the probability of a type I error, that is, of rejecting the null hypothesis when it is in fact true. For example, at a significance of 0.05 my experiment will erroneously reject the null hypothesis 5 times in 100 repeats.

  • State what the power is. This is the probability of rejecting the null hypothesis when it is false, i.e. one minus the probability of a type II error - failing to reject the null hypothesis when it is false. For example, if the coin's probability of falling heads is 0.25 or less, or 0.75 or greater, then the experiment has a power close to 1.

Once again note that it is the experiment that is quantified in terms of significance and power and not a particular realization of the experiment.
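
To make this concrete, here is a minimal sketch in Python. The numbers - 100 tosses, a two-sided test at roughly the 5% level, a biased coin with P(heads) = 0.75 - are illustrative choices, not figures from any particular study. It works out the rejection region, the significance the experiment actually achieves and its power against the biased coin, all before a single real coin is tossed:

from scipy.stats import binom

n = 100        # tosses in one run of the experiment
p_null = 0.5   # the null hypothesis: a fair coin

# Find the widest symmetric rejection region {heads <= lo or heads >= hi}
# whose total probability under the null is no more than 0.05.
for k in range(n // 2, -1, -1):
    alpha = binom.cdf(k, n, p_null) + binom.sf(n - k - 1, n, p_null)
    if alpha <= 0.05:
        lo, hi = k, n - k
        break

print(f"Reject the fair-coin hypothesis if heads <= {lo} or heads >= {hi}")
print(f"Significance of the experiment (type I error rate): {alpha:.4f}")

# Power against a coin that actually lands heads with probability 0.75.
p_alt = 0.75
power = binom.cdf(lo, n, p_alt) + binom.sf(hi - 1, n, p_alt)
print(f"Power against a coin with P(heads) = {p_alt}: {power:.4f}")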

It is important that the experiment is performed in exactly the way that was specified, so that the calculated probabilities really would apply if it were repeated. For example, if you selected the best of three lots of 50 tosses, then the repeated experiment would no longer be described by the stated significance and power - as the simulation below illustrates.
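
As a rough illustration of how much this distorts things, the following sketch simulates the "best of three" selection with an honest coin. The per-batch threshold - reject if heads <= 17 or heads >= 33 in 50 tosses - is an assumed figure chosen to sit near the 5% level; the point is only that picking the most extreme of three batches inflates the real type I error rate to roughly three times the single-batch figure:

import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
n, batches, trials = 50, 3, 100_000

# Two-sided rejection region for a single batch of 50 fair-coin tosses.
lo, hi = 17, 33
alpha_single = binom.cdf(lo, n, 0.5) + binom.sf(hi - 1, n, 0.5)

false_rejections = 0
for _ in range(trials):
    heads = rng.binomial(n, 0.5, size=batches)        # three honest batches
    best = heads[np.argmax(np.abs(heads - n / 2))]    # keep the most extreme one
    if best <= lo or best >= hi:                      # apply the single-batch test
        false_rejections += 1

print(f"Significance of one honest batch:       {alpha_single:.3f}")
print(f"Type I error rate with best-of-three:   {false_rejections / trials:.3f}")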

A common distortion of the procedure is to do the experiment first and work out the significance afterwards. If the result comes in better than 0.05 you quote it as significant at the 5% level; if it comes in better than 0.01 you quote it as significant at the 1% level. But this is not an experiment with a significance of 0.01, because the procedure you are actually following - reject whenever the result clears the 5% hurdle - will still erroneously reject the null hypothesis 5 times in 100 repeats. The "extra" significance is spurious.

A much bigger problem is the repeated experiment situation. If you are using experiments that have a significance of 5%, then repeating the experiment 100 times you should expect to see five significant results purely by chance. I was once asked why a ten by ten correlation matrix always contained a handful of apparently significant correlations. When I explained why this was always the case, I was told that the researcher was going to forget what he had just discovered and I was never to repeat it.

Yes, measuring lots of things and being surprised at a handful of significant results is an important experimental tool. If repeated attempts at finding something significant were replaced by something more reliable, the number of papers in many subjects would drop to a trickle. This is a prime cause of the irreproducibility of results: a repeat generally finds the same number of significant results, just a different set of them.
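
You can see the effect without any real data at all. The sketch below - with an arbitrary choice of 100 samples - correlates ten columns of pure noise and counts how many of the 45 distinct pairs come out "significant" at the 5% level. Typically it is two or three, and a different two or three each time you change the seed:

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n_samples, n_vars = 100, 10
data = rng.normal(size=(n_samples, n_vars))   # pure noise - no real relationships

significant, pairs = 0, 0
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        r, p = pearsonr(data[:, i], data[:, j])
        pairs += 1
        if p < 0.05:
            significant += 1

print(f"{significant} of {pairs} correlations are 'significant' at the 5% level")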

Another really big problem is that most researchers don't quote any sort of power figure for their results. The reason is that in many cases the sample sizes are so small that the power is very low - many studies have a power below 0.5. For example, if you toss the coin only 10 times, the power to detect a bias as large as 0.25 or 0.75 is only around 0.5, which means that half of such experiments will fail to detect quite a sizable bias. There are plenty of real-world cases where null results are very likely to be due to a lack of power, and when it comes to, say, comparing the negative effects of drugs this can be lethal.
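
As a sketch of the problem, the function below works out the significance and power of the two-sided coin test for a range of sample sizes, again against a coin with P(heads) = 0.75. The exact figures depend on precisely where the rejection threshold is placed, so they will not match any quoted number exactly, but the pattern - small samples give feeble power - is the point:

from scipy.stats import binom

def coin_test_power(n, p_alt=0.75, target_alpha=0.05, p_null=0.5):
    """Significance and power of a two-sided n-toss test of a fair coin."""
    # Widest symmetric rejection region with significance at most target_alpha.
    for k in range(n // 2, -1, -1):
        alpha = binom.cdf(k, n, p_null) + binom.sf(n - k - 1, n, p_null)
        if alpha <= target_alpha:
            lo, hi = k, n - k
            break
    power = binom.cdf(lo, n, p_alt) + binom.sf(hi - 1, n, p_alt)
    return alpha, power

for n in (10, 20, 50, 100):
    alpha, power = coin_test_power(n)
    print(f"n = {n:3d}  significance = {alpha:.3f}  power = {power:.2f}")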

Whenever I have tried to explain this, with the advice "you need more data", I have always been met with the response that "more data is impossible; what can we do with what we have?". There are many subjects that would dry up if power estimates were made mandatory.

Not Significant Enough

The suggestions for cleaning up significance range from getting researchers to rephrase their conclusions to estimating confidence intervals. All I can say to this is that if significance testing is misunderstood, confidence intervals are an even deeper mystery. Don't go there please.

So what is the solution?

There is a solution, but many disciplines will simply be unable to accept it. Consider for a moment physics - often a standard by which to judge a scientific procedure. When the apple fell on Newton's head, he didn't have to consider probability. The apple fell to earth with a probability so close to 1 that it wasn't even worth considering. An old-fashioned physics experiment is so certain that many physicists don't know anything much about statistics - a bit of error estimation is quite enough. Putting this another way, physics doesn't work with 5% significance; it uses fantastically small significance levels.

When physicists have to take chance into account they keep to this high standard. For example, the discovery of the Higgs boson needed data that was five standard deviations away from the results predicted by a model in which it didn't exist. That is, a rough significance level of 0.0000003, or 1 in 3.5 million - think about that compared to 5 in 100. Particle physics in general requires a significance of 0.003 to announce evidence of a particle and 0.0000003 to announce a discovery.
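
A quick way to check the five-sigma figure is to ask for the one-tailed probability of a standard normal result at least five standard deviations out:

from scipy.stats import norm

p_five_sigma = norm.sf(5)   # P(Z >= 5) for a standard normal
print(f"p = {p_five_sigma:.2e}, i.e. about 1 in {1 / p_five_sigma:,.0f}")
# Compare with 0.05 - 1 in 20 - for the conventional 5% level.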

If we want reproducible results we need to demand far more stringent significance levels and be aware of the power of the experiments we perform.

Of course, data scientists have lots of data and could work at significance levels similar to those of physics. The big problem is that, with publications decrying the use of significance and alternatives being suggested, the chances are that they will be seduced by lesser procedures that produce results that are just as irreproducible.

Show me the data, show me the evidence, and make it good.

More Information

Statistical Inference in the 21st Century: A World Beyond p < 0.05

Related Articles

MINE - Finding Patterns in Big Data

How Not To Shuffle - The Knuth Fisher-Yates Algorithm

The Monty Hall Problem

What's a Sample of Size One Worth?       

Reading Your Way Into Big Data

What is a Data Scientist and How Do I Become One?

Data Science Course For Everyone Now Online

Coursera Offers MOOC-Based Master's in Data Science
