Facebook Shares Large Data Sets to Help Improve its AI and Data Science Algorithms

Facebook published resources related to its AI Research project, organized towards the goal of automatic text understanding and reasoning. The datasets released consist of:

  • The (20) bAbI tasks
  • The Children's Book Test
  • The Movie Dialog dataset
  • The SimpleQuestions dataset

One of the larger ones (1.6 GB compressed) is the Children's Book Test, which can be downloaded here. The following is an extract from the public Facebook research page:

The Children’s Book Test (CBT) is designed to measure directly how well language models can exploit wider linguistic context. The CBT is built from books that are freely available thanks to Project Gutenberg. Details and baseline results on this dataset can be found in the paper:

Felix Hill, Antoine Bordes, Sumit Chopra and Jason Weston. The Goldilocks Principle: Reading Children's Books with Explicit Me... arXiv:1511.02301.

After allocating books to training, validation or test sets, we formed example ‘questions’ from chapters in the book by enumerating 21 consecutive sentences. In each question, the first 20 sentences form the context, and a word is removed from the 21st sentence, which becomes the query. Models must identify the answer word among a selection of 10 candidate answers appearing in the context sentences and the query. For finer-grained analyses, we evaluated four classes of question by removing distinct types of word: Named Entities, (Common) Nouns, Verbs and Prepositions.
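The construction described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used to build the CBT: the function name `make_question`, the `XXXXX` blank marker, and the assumption that the caller supplies the nine wrong candidates are all hypothetical choices made for this example.

```python
import random
import re

# Assumed marker for the removed word; the extract above does not specify one.
PLACEHOLDER = "XXXXX"


def make_question(sentences, target_word, distractors, rng=None):
    """Build one cloze-style question from 21 consecutive sentences.

    sentences   -- list of exactly 21 consecutive sentences from a chapter
    target_word -- the word to remove from the 21st sentence (the answer)
    distractors -- 9 other words, chosen by the caller (hypothetical input)
    rng         -- optional random.Random instance, for reproducible shuffling
    """
    if len(sentences) != 21:
        raise ValueError("expected exactly 21 consecutive sentences")
    if len(distractors) != 9:
        raise ValueError("expected 9 distractors, giving 10 candidates in total")

    # The first 20 sentences are the context; the 21st becomes the query.
    context, last = sentences[:20], sentences[20]

    # Blank out the first whole-word occurrence of the answer in sentence 21.
    pattern = r"\b" + re.escape(target_word) + r"\b"
    query, n_subs = re.subn(pattern, PLACEHOLDER, last, count=1)
    if n_subs == 0:
        raise ValueError("target word does not occur in the 21st sentence")

    # Mix the answer in with the distractors so its position gives nothing away.
    candidates = [target_word] + list(distractors)
    (rng or random).shuffle(candidates)

    return {
        "context": context,
        "query": query,
        "answer": target_word,
        "candidates": candidates,
    }


def accuracy(questions, predict):
    """Fraction of questions for which predict(question) returns the answer."""
    correct = sum(predict(q) == q["answer"] for q in questions)
    return correct / len(questions)
```

A model is then scored by how often it picks the removed word out of the 10 candidates, which is what `accuracy` computes for any `predict` function that takes a question and returns one candidate. Restricting `target_word` to a single word type (for example, only verbs) would give the four question classes mentioned in the extract.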

Here is an example question (context + query) from Alice in Wonderland by Lewis Carroll:

For other large data sets released by public companies, check out this page (Yahoo, 1.5 TB) 


Replies to This Discussion

I like the "remove word from 21st sentence and 'guess' it" test.  Hadn't seen this before, though, obviously, it may have been used elsewhere.  Anyway, someone had a good imagination.

Question to "Laetitia Van Cauwenberge", how do you decide which fake (or real) identity to use on a given post?  If you are fake, does that mean that you are a fake data scientist, and that all the supposed data science you do is fake data science?

© 2017 Data Science Central
