1 Billion Web Pages = 1 Million Dollars?

Written by Mike James

Sunday, 16 December 2012

It seems to be easier that we thought to win "Who Wants to be a Millionaire". All you need is a little data mining and access to the web and answering the questions is almost trivial.

Who wants to be a Millionaire is a multiple choice style quiz. A question is asked and a set of possible answers are presented to the contestant. All the contestant has to do is pick the correct answer. There are some interesting strategic complications, such as being able to opt for help from a friend, or reducing the number of alternatives but this is more or less the outline of the game. The rewards are staged and getting a question wrong means that the contestant gets the winnings at the last "milestone" - which could be zero if no milestones have been passed.

whowantstobeamillionare

The first question is can we automate the answering of the multiple choice question using all the huge amounts of data on the web?

The answer, according to Shyong (Tony) K. Lam, David M Pennock, Dan Cosley and Steve Lawrence from four different US Universities, is that it turns out to be very easy. It isn't even really a matter of AI - simple data mining will suffice, coupled with a few statistical methods. You can get an improvement with some simple language processing but this doesn't seem to be a big one.

The basic method was to take each question and pair it with each possible answer and build a search query. When this query is submitted to Google all you have to do to pick the right answer is to count the number of pages returned. In other words you are using the number of pages returned as a measure of how of the the question occurs along with the answer on the web. The answer with the highest correlation with the question is assumed to be correct.

This amazingly simple approach gets over 55% of the questions right. If you add an additional processing step that retrieves the first ten web pages and measures the distance in the text between the question and the answer and use this to weight the results,you can do a little better. But to get a reasonable improvement you have to do some syntactic analysis and work out the proximity of the noun phrases in the question to the possible answers. Using this method, the accuracy goes up to 69%. Using a combination of methods gets you to 70% accuracy, which is impressive given the simple nature of the method.

The team also tried out the method with different search engines and you might want to know that the results were:

Engine	Combined performance
Google	70%
AltaVista	68%
AllTheWeb	66%
MSN(Bing)	58%

I think this means that you probably want to take Google with you onto that particular quiz show.

The researchers give examples of the types of questions the method tended to get wrong and they are interesting because they give some idea of what it is that limits simple non-interpretive data mining might be subject to:

Common Sense. How many legs does a fish have? 0, 2, or 4? This information may exist on the web, but is probably not spelled out.
Multiple Plausible Answers. What does the letter "]" stand for in the computer company name "IBM"? Information, International, Industrial, or Infrastructure? "Information" probably appears just as often as "international" in the context of IBM.
Polysemy. Which of these parts of a house shares its name with a viewing area on a computer screen? Wall,Root, Window, or Basement? The words "root" and "computer" often co-occur (e.g., the Unix superuser). This question also suggests that biases in the content of the web originally by and for technical, computer-literate usersmay hamper using the web as a general knowledge base in some instances.
Non-Textual Knowledge. Which of these cities is located in Russia? Kiev, Minsk, Odessa, or or Omsk? The program doesn't know how to read maps.
Alternative Representations. Who is Flash Gordon's
archenemy? Doctor Octopus, Sinestro, Ming the Merciless, or Lex Luthor? The word "archenemy" usually appears as two words ("arch enemy") on Flash Gordon (and other) pages.

You might care to ponder how to improve things so at these problems are no longer problems - it isn't easy.

The researchers also tackled the problem of working out what the optimal strategy was in terms of risks and rewards as the game progressed. in other words when should you walk away with what you have and give up the quest for the one million dollars.

Overall the analysis is interesting and it gives you some idea just how effective simple methods are, if you have enough data.

whowantstobeamillionare

More Information

1 Billion Pages = 1 Million Dollars? Mining the Web to Play "Who Wants to be a Millionaire?"

Data Analytics for Weathering Storms

The Significance Of Big Data

Microsoft's New Research Center into Social Data

How the Music Flows from Place to Place

Deep Blue Turns 15

Watson wins Jeopardy! - trick or triumph

Watson wins on Jeopardy

To be informed about new articles on I Programmer, install the I Programmer Toolbar, subscribe to the RSS feed, follow us on, Twitter, Facebook, Google+ or Linkedin, or sign up for our weekly newsletter.

Comments

or email your comment to: comments@i-programmer.info

The Strange Case Of the JavaScript Trademark
15/01/2025

JavaScript is the most widely used programming language in the world. Now in its 30th year, JavaScript is firmly established as the language that powers the web. But it has a problem. Oracle owns the [ ... ]

+ Full Story

InfluxDB 3 Core Released In Public Alpha
03/02/2025

Core, the new open source InfluxDB offering, promises enhanced performance and more freedom in using it under a permissive license.

+ Full Story

More News

Last Updated ( Sunday, 16 December 2012 )

More Information

Related Articles

Comments