Facebook has published resources from its AI Research project, aimed at automatic text understanding and reasoning. The released datasets include:
The Children’s Book Test (CBT), designed to measure directly how well language models can exploit wider linguistic context. The CBT is built from books that are freely available thanks to Project Gutenberg. Details and baseline results on this dataset can be found in the paper:
Felix Hill, Antoine Bordes, Sumit Chopra and Jason Weston. The Goldilocks Principle: Reading Children's Books with Explicit Me... arXiv:1511.02301.
After allocating books to either training, validation or test sets, we formed example ‘questions’ from chapters in the book by enumerating 21 consecutive sentences. In each question, the first 20 sentences form the context, and a word is removed from the 21st sentence, which becomes the query. Models must identify the answer word among a selection of 10 candidate answers appearing in the context sentences and the query. For finer-grained analyses, we evaluated four classes of question by removing distinct types of word: Named Entities, (Common) Nouns, Verbs and Prepositions.
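To make the construction concrete, here is a minimal Python sketch of how one such question could be assembled from 21 tokenized sentences. It is not the authors' code. The function name `make_question`, the `XXXXX` placeholder, and the random sampling of distractor words are all assumptions made for illustration; the description above only says that the 10 candidates appear in the context sentences and the query, not how they are picked.

```python
import random

CONTEXT_LEN = 20     # sentences of context, as described above
NUM_CANDIDATES = 10  # candidate answers per question, as described above
BLANK = "XXXXX"      # placeholder for the removed word (assumption of this sketch)


def make_question(sentences, answer_index, rng):
    """Build one question from 21 consecutive tokenized sentences.

    sentences    -- list of 21 lists of word tokens
    answer_index -- position of the word to remove in the 21st sentence
    rng          -- a random.Random instance, so results are reproducible
    """
    if len(sentences) != CONTEXT_LEN + 1:
        raise ValueError("expected exactly 21 sentences")

    context = sentences[:CONTEXT_LEN]
    query = list(sentences[CONTEXT_LEN])
    answer = query[answer_index]
    query[answer_index] = BLANK

    # Distractors are drawn from words in the context and the query.
    # Random sampling is an assumption; a real pipeline would likely
    # restrict distractors to the same word type as the answer.
    vocab = {word for sent in context for word in sent} | set(query)
    pool = sorted(vocab - {answer, BLANK})
    if len(pool) < NUM_CANDIDATES - 1:
        raise ValueError("not enough distinct words for distractors")

    candidates = rng.sample(pool, NUM_CANDIDATES - 1) + [answer]
    rng.shuffle(candidates)
    return {
        "context": context,
        "query": query,
        "candidates": candidates,
        "answer": answer,
    }
```

Called on a sliding window of 21 sentences from a chapter, with `answer_index` pointing at a named entity, noun, verb or preposition, this yields one question of the corresponding class. A model is then scored on whether it picks `answer` out of `candidates`.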
Here is an example of a question (context + query) from Alice in Wonderland by Lewis Carroll:
For other large data sets released by public companies, check out this page (Yahoo, 1.5 TB)
I like the "remove word from 21st sentence and 'guess' it" test. Hadn't seen this before, though, obviously, it may have been used elsewhere. Anyway, someone had a good imagination.
Question to "Laetitia Van Cauwenberge", how do you decide which fake (or real) identity to use on a given post? If you are fake, does that mean that you are a fake data scientist, and that all the supposed data science you do is fake data science?