Has the pace of information growth started to slow?

It all started as a little data science project, possibly a job interview question for applicants: how would you compute the number of entries on Wikipedia?

The idea was to use large keyword lists (say 5,000,000 keywords) and check how many keywords from these lists have a Wikipedia entry, using a web crawler to run 5,000,000 searches on Wikipedia. Based on the number of Wikipedia entries found in your list (say 400,000), you would estimate the size of Wikipedia (say 6,000,000 articles). The scaling step requires an estimate of what fraction of all Wikipedia titles your keyword list covers. Other strategies would consist of checking how many entries start with aa, ab, ac, ad, etc.
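Here is a minimal sketch of the arithmetic behind that estimate, not a definitive implementation. It assumes you can estimate the coverage of your keyword list, that is, the fraction of all Wikipedia titles that appear in the list; the post's illustrative numbers (400,000 hits, 6,000,000 articles) imply a coverage of about 1/15. The function names, the `has_entry` lookup, and the toy data are all hypothetical: `has_entry` stands in for the crawler's search, and nothing here queries Wikipedia. The sketch also looks up only a random sample of the list rather than running all 5,000,000 searches, which is a variation on the method described above.

```python
import random


def estimate_hits(keywords, has_entry, sample_size, rng):
    """Estimate how many keywords in the list have an entry.

    Looks up only a random sample instead of the whole list, then
    scales the observed hit rate back up to the full list size.
    """
    sample = rng.sample(keywords, sample_size)
    hit_rate = sum(1 for k in sample if has_entry(k)) / sample_size
    return hit_rate * len(keywords)


def estimate_total(hits_in_list, coverage):
    """Scale list hits up to the whole encyclopedia: total = hits / coverage.

    coverage is the ASSUMED fraction of all entries whose title
    appears in the keyword list; it must lie in (0, 1].
    """
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    return hits_in_list / coverage


if __name__ == "__main__":
    # Toy stand-in for the crawler: a set lookup instead of a web search.
    rng = random.Random(0)
    keywords = [f"kw{i}" for i in range(50_000)]
    entries = set(rng.sample(keywords, 4_000))  # pretend 8% have an entry

    hits = estimate_hits(keywords, entries.__contains__, 5_000, rng)
    print("estimated hits in list:", round(hits))

    # Illustrative numbers from the post: 400,000 hits, coverage of 1/15.
    print("estimated total:", round(estimate_total(400_000, 400_000 / 6_000_000)))
```

The weak point is the coverage figure: if it is off by a factor of two, so is the final estimate. One way to pin it down is to check what share of a random sample of known article titles appears in your keyword list.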

Wikipedia actually publishes very precise statistics and historical data about its size and growth, which makes this project even more interesting: you can start with your own statistical model and keyword lists, and then check your estimate against real data!

The English Wikipedia actually has about 4,000,000 entries (I wonder whether the Chinese version is growing faster). This metric, and especially its growth curve, is one among many you could use to measure the growth of information, as opposed to the growth of big data.

Wikipedia's growth has slowed dramatically since 2007. Is it just a problem with Wikipedia, or did things really start to change in 2007? I'll leave it to you to answer this question. Also keep in mind that even though fewer new articles are being added, there may be a lot of activity and updates on old articles, and this is not factored into the overall growth. In connection with the slowdown, pace of innovation or creation might indeed be a better term than pace of growth.

Comment

Comment by Vincent Granville on December 14, 2013 at 10:52am

Apparently, it is an internal Wikipedia issue. They've regrouped articles and now favor more general articles over specialized ones, or something like that. Incidentally, the slowdown (it's still growth, just not as fast) started when innovation budgets were cut because of the Great Recession. But it might just be a coincidence, or a very, very indirect side effect (for instance, fewer editors, as they suddenly spent all their time trying to find new jobs).

Comment by Rob Burton on December 14, 2013 at 6:52am

Is it possible that we've now documented most of the available information, so that the amount left undocumented is shrinking?

© 2017 Data Science Central