Google Flu Prediction - Beware The Media Effect
Written by Mike James   
Saturday, 16 February 2013

Google's flu prediction web site is a really good idea, but recently it got it wrong - all due to a media effect that you might think would be very easy to take into account.

 

As far as data mining or statistics go, this is a very simple idea - there is a correlation between the number of cases of flu and the number of searches on the topic of flu. It is a very reasonable idea. Of course, not all searches on the topic of flu are from people who have the flu, but if you gather some data it seems that the signal-to-noise ratio is very good. In fact, it is so good that Google has a site that shows you the current geographical prevalence of flu, and there is even a paper in Nature explaining how good the system is.

The important point is that the predictions Google makes lead the CDC's data, which is based on reported cases, by as much as 14 days - enough time for people to react to the information.

You can see this lead in the following video:

 

However, according to a report in Nature, this year it isn't working out quite as well. Flu in the US started to rise in November 2012 and peaked just after Christmas. Google's curve seems to follow the trend, but it overestimates the CDC's figures by a factor of two or more in some regions.

The problem seems to be the obvious one - a media effect. The noise in Google's figures comes from searches about flu made by people who don't actually have flu. If this number remains constant then it should be possible to factor it out. The problem this year is that the media has been very active on the topic of flu, and this is very likely to have caused people to search for general news items on flu or to search for the sort of data that Google Flu Trends provides.
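To see why a constant level of noise is harmless while a media-driven surge is not, here is a minimal sketch in Python. It is not Google's actual method: the linear relationship between searches and cases, the constants RATE and BASELINE, and all of the numbers are assumptions invented purely for illustration.

```python
import numpy as np

# Assumption: searches = RATE * cases + BASELINE, with made-up constants.
RATE = 2.0        # hypothetical flu searches generated per flu case
BASELINE = 40.0   # hypothetical constant volume of searches by people without flu

# Hypothetical calibration season in which the baseline really is constant.
cases_calib = np.array([10.0, 20.0, 40.0, 80.0, 50.0, 25.0])
searches_calib = RATE * cases_calib + BASELINE

# A straight-line fit recovers the rate (slope) and the baseline (intercept).
slope, intercept = np.polyfit(cases_calib, searches_calib, 1)

def estimate_cases(searches):
    """Invert the fitted line, i.e. factor out the constant baseline."""
    return (np.asarray(searches, dtype=float) - intercept) / slope

# Hypothetical new season: news coverage adds searches unrelated to illness.
cases_new = np.array([30.0, 60.0, 90.0])
media_extra = np.array([0.0, 60.0, 200.0])  # assumed media-driven searches
searches_new = RATE * cases_new + BASELINE + media_extra

estimates = estimate_cases(searches_new)
overshoot = estimates / cases_new  # 1.0 means the estimate is exact

print("estimated cases:", estimates)
print("overshoot factor:", overshoot)
```

With no extra media-driven searches the estimate matches the true case count; as the media term grows, the same calibration overestimates by an ever larger factor, which is the kind of failure described in the Nature report.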

Figure (Google Flu Trends map): the redder the region, the more flu.

So while Google's technique is as good as, or better than, more expensive ways of keeping track of an epidemic, it fails for a very obvious reason.

Of course, this is fixable. All Google has to do is find a search-based variable, or any easy-to-obtain variable, that is correlated with media attention and build a model that factors this in. This is easy in principle, but comes with the usual practical problems of implementation.
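As a rough illustration of what "factoring in" a media variable could look like, the following Python sketch fits two ordinary least-squares models to synthetic data: one using search volume alone and one that also uses a media-attention proxy. Everything here is hypothetical, including the data, the coefficients, the `media` variable and the helper names `fit_linear` and `rmse`. A linear regression is just one simple choice of model, not the one Google uses.

```python
import numpy as np

def fit_linear(features, cases):
    """Least-squares fit of cases against the given features plus a constant."""
    X = np.column_stack(list(features) + [np.ones(len(cases))])
    coeffs, *_ = np.linalg.lstsq(X, cases, rcond=None)
    return coeffs

def rmse(coeffs, features, cases):
    """Root-mean-square error of a fitted model on the given data."""
    X = np.column_stack(list(features) + [np.ones(len(cases))])
    return float(np.sqrt(np.mean((X @ coeffs - cases) ** 2)))

# Synthetic data: every number below is an assumption for illustration only.
rng = np.random.default_rng(0)
weeks = 60
true_cases = 100 + 80 * np.sin(np.linspace(0, 3 * np.pi, weeks)) ** 2
media = rng.uniform(0, 50, weeks)                    # hypothetical media-attention proxy
searches = 2.0 * true_cases + 3.0 * media + 40.0     # searches inflated by media coverage
reported = true_cases + rng.normal(0, 2.0, weeks)    # noisy reported case counts

naive = fit_linear([searches], reported)             # searches only
adjusted = fit_linear([searches, media], reported)   # searches plus media proxy

naive_err = rmse(naive, [searches], reported)
adjusted_err = rmse(adjusted, [searches, media], reported)

print("naive RMSE:   ", naive_err)
print("adjusted RMSE:", adjusted_err)
print("media coefficient:", adjusted[1])
```

In this toy setup the media coefficient comes out negative: the model learns to subtract the searches that coverage generates. The practical difficulty the article alludes to is finding a real-world proxy that tracks media attention as cleanly as the synthetic one does here.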

There is also a more general lesson to be learned. Social media and search data may distill the current interests of the crowd, but the cause of those interests may vary in erratic ways. Wasn't there something about correlation not being the same as causation...


More Information

When Google got flu wrong

Google Flu Trends

Related Articles

Google Civic Information API

The Significance Of Big Data

Twitter Can't Predict Elections Either

How the Music Flows from Place to Place

 
