Yahoo Releases Record Machine Learning Dataset
Written by Kay Ewbank   
Tuesday, 19 January 2016

Yahoo has made thirteen terabytes of anonymized user-news item interaction data available for developers to use in machine learning applications.

This is billed as the largest machine learning dataset ever made publicly available. It records the interactions of about 20 million Yahoo users with news items from February 2015 to May 2015, and contains around 100 billion events. The Yahoo News Feed dataset was drawn from the news feeds of several Yahoo properties, including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate.

Writing about the dataset, Suju Rajan of Yahoo Labs said:

"Our goals are to promote independent research in the fields of large-scale machine learning and recommender systems, and to help level the playing field between industrial and academic research. The dataset is available as part of the Yahoo Labs Webscope data-sharing program, which is a reference library of scientifically-useful datasets comprising anonymized user data for non-commercial use."

In addition to the interaction data, Yahoo is providing a range of categorized demographic information for a subset of the anonymized users. The demographic information includes age range, gender, and generalized geographic data. On the item side, the dataset contains the title, summary, and key-phrases of each news article. The interaction data is timestamped with the relevant local time and also contains partial information about the device used to access the news feed. Rajan says this "allows for interesting work in contextual recommendation and temporal data mining."

The dataset has already led to work by Yahoo on a scalable recommendation system based on the concept of Factorization Machines, and to a research paper investigating user engagement based on the amount of time that users spend on content items. Yahoo Research has also been using the data to investigate behavior modeling, recommender systems, large-scale and distributed machine learning, ranking, online algorithms, content modeling, and time-series mining.
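For readers unfamiliar with Factorization Machines, the sketch below shows the standard second-order scoring function of the model in plain Python. It is only an illustration of the general technique, not Yahoo's system: the function names, the example feature layout (one-hot user, item and device slots) and all the weights are made up for the example.

```python
from typing import List


def fm_score(x: List[float], w0: float, w: List[float], v: List[List[float]]) -> float:
    """Second-order factorization machine score for one feature vector.

    x  : feature values (length n)
    w0 : global bias
    w  : per-feature linear weights (length n)
    v  : per-feature latent factors (n rows, k columns)
    """
    n = len(x)
    k = len(v[0]) if v else 0
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    # Pairwise interactions computed in O(n * k) using the identity
    # sum_{i<j} <v_i, v_j> x_i x_j
    #   = 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i (v_if x_i)^2 ]
    for f in range(k):
        s = sum(v[i][f] * x[i] for i in range(n))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


def fm_score_naive(x: List[float], w0: float, w: List[float], v: List[List[float]]) -> float:
    """The same model written as an explicit sum over feature pairs (for checking)."""
    n = len(x)
    total = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(v[i], v[j]))
            total += dot * x[i] * x[j]
    return total


# Hypothetical example: slots are [user_A, user_B, item_X, device_mobile].
x_example = [1.0, 0.0, 1.0, 1.0]
w0_example = 0.1
w_example = [0.2, -0.1, 0.3, 0.05]
v_example = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.5, 0.1]]
score = fm_score(x_example, w0_example, w_example, v_example)
```

The appeal for data of this kind is that every pair of features, such as a user and a device type, gets an interaction weight through the shared latent factors, even if that exact pair was rarely observed together.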

The hope is that this data will be used by researchers, data scientists, and machine learning enthusiasts in academia who need an extensive, “real-world” dataset. The researchers believe that this dataset can become the benchmark for large-scale machine learning and recommender systems.

More Information

Yahoo Newsfeed Dataset

Yahoo Tumblr Post

Last Updated ( Tuesday, 19 January 2016 )