With the holiday shopping season officially kicked off, sales data is on everyone's mind. But that's not the only data people are talking about. Talk of data brokers and consumer tracking is becoming more commonplace, as is a certain backlash against big data. This week, we highlight some articles touching on these subjects and round out the discussion with some coverage of natural language processing techniques.
Everyone knows about online tracking by large tech companies. According to NBC, an astonishing 78% of the top 1 million websites give Google their web behavior data, realistically via Google Analytics. Facebook is second, with some 32% of the top 1 million sites providing it their data. But our data is being collected by more than the Internet behemoths. Less well known are the data brokers that collect all sorts of data on individuals, both online and off. Their primary purpose is to "create profiles and then sell those profiles to businesses looking to advertise a product" (NBC). The level of detail is eye-opening, with credit bureaus like Experian even collecting pay stubs. There is so much data and so little transparency that after the Ashley Madison hacks, the Wall Street Journal openly worried about a hack of a data broker. It's unfortunate that most of this data is used primarily for advertising and marketing. Used responsibly, it could actually cut the number of ads in circulation while increasing conversions. It could also radically change the way insurance is managed.
Big data is so ubiquitous that it's only natural for people to start challenging its merits. To be clear, nobody is saying big data doesn't have value. Rather, the emphasis is on the value of small, or lean, data. From the authors' perspective, the point is to be smart about the data you collect. For Hollie Russon Gillman, it's about "human-sized data" for civic engagement. These are great points, but are we talking about the same thing here? Both of these articles really focus on data collection. Since collecting data (particularly in the field) can be costly, it's imperative that surveys be well designed and their scope kept fairly limited. Small data is thus more or less a given.
Big data, on the other hand, can mean one of two things. One is the truly massive datasets that the Internet titans collect; according to CIO magazine, some 31 million messages are sent via Facebook per minute. Most of this data is passively collected, which leads to the second meaning of big data: finding value in data after the fact. Big data isn't about collecting massive amounts of data to answer specific questions. Rather, it's about finding relationships and other structures within data that can augment existing knowledge.
In short, the difference between lean data and big data comes down to intent. Survey data is purposefully collected to answer specific questions. Big data is often about re-purposing previously collected data to discover new relationships. That said, from a cost and simplicity perspective, it's generally advisable to stick with smaller data unless you absolutely need big data.
In the world of startups, applications built around conversational AI are hot. These include personal assistants like Siri and Facebook M. However, there is a newer class of conversational AI taking on tasks ranging from scheduling meetings to running errands. I know a little about this space, since I'll be announcing my own conversational AI service soon. So how do these apps work? There is a continuum of intelligence afforded to these apps, ranging from simple chatbots to sophisticated neural networks to hybrid "cyborgs" that get assistance from humans.
At the core of most systems is semantic parsing, or extracting meaning from natural language. One approach is based on detailed linguistic models, such as Combinatory Categorial Grammar (CCG), which melds combinatory logic and the lambda calculus with linguistics. In other words, it uses a logical framework that describes semantic structures as typed functions. What separates CCG from other models, like context-free grammars, is that CCG is considered cognitively plausible. From an application perspective, CCG can be used both to parse language and to generate (a.k.a. realize) sentences from the grammar. To get started, take a look at the OpenCCG home page. You'll also want to read the "rough guide" on how to specify grammars. Unfortunately, while OpenNLP has R bindings, there don't appear to be any bindings for OpenCCG. That said, someone intrepid could connect to the library via rJava or use system calls to access the executables.
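To make the "typed functions" idea concrete, here is a minimal, hypothetical sketch (plain Python, not OpenCCG's API) of CCG's two most basic combinators, forward and backward application. The categories and lexicon are made up for illustration:

```python
# Atomic categories are strings; complex categories are (result, slash, arg)
# tuples. "X/Y" seeks its argument Y to the right; "X\Y" seeks it to the left.
NP = "NP"
S = "S"
VP = (S, "\\", NP)    # S\NP: a verb phrase looks left for its subject
TV = (VP, "/", NP)    # (S\NP)/NP: a transitive verb looks right for its object

def forward_apply(left, right):
    """Forward application:  X/Y  Y  =>  X."""
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    return None

def backward_apply(left, right):
    """Backward application:  Y  X\\Y  =>  X."""
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None

# A toy lexicon, as a CCG grammar would assign categories to words.
lexicon = {"John": NP, "Mary": NP, "loves": TV}

# Derive "John loves Mary":
vp = forward_apply(lexicon["loves"], lexicon["Mary"])  # (S\NP)/NP + NP -> S\NP
s = backward_apply(lexicon["John"], vp)                # NP + S\NP -> S
print(s)  # S
```

A real CCG additionally pairs each category with a lambda-calculus term, so the same derivation that proves "John loves Mary" is a sentence also builds its meaning representation.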
While models like CCG are conceptually elegant and rich in their expressiveness, they are also difficult to understand and tedious to construct. An altogether different approach is to use deep learning to model all of the semantic structures. This approach generally eschews up-front representational models and instead trains a neural network on a large dataset. It's truly impressive that these models can learn such features from scratch. Many large datasets exist to ease the training process. The drawback is that you need an army of computers to train the model; otherwise it can take weeks for the network to converge, as mentioned in one paper. If you want to charge ahead, a good place to start is Stanford's course aptly titled Deep Learning for Natural Language Processing.
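As a toy illustration of the idea of learning representations from data instead of hand-building them, here is a hypothetical sketch: a bag-of-embeddings sentiment classifier trained end to end with plain numpy. The corpus, dimensions, and learning rate are all made-up assumptions; real systems train far deeper networks on millions of examples:

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = [("good great fun", 1), ("bad awful boring", 0),
          ("great fun", 1), ("awful bad", 0)]
vocab = sorted({w for text, _ in corpus for w in text.split()})
idx = {w: i for i, w in enumerate(vocab)}

d = 8                                    # embedding dimension (arbitrary)
E = rng.normal(0, 0.1, (len(vocab), d))  # word vectors, learned from scratch
w = np.zeros(d)                          # classifier weights
b = 0.0

def forward(words):
    ids = [idx[t] for t in words]
    x = E[ids].mean(axis=0)              # average the word vectors
    p = 1 / (1 + np.exp(-(x @ w + b)))   # sigmoid: P(positive)
    return ids, x, p

for epoch in range(200):                 # plain gradient descent on log loss
    for text, y in corpus:
        ids, x, p = forward(text.split())
        g = p - y                        # dLoss/dlogit
        E[ids] -= 0.5 * g * w / len(ids) # backprop into the embeddings
        w -= 0.5 * g * x
        b -= 0.5 * g

print(forward("great fun".split())[2] > 0.5)   # learned positive?
print(forward("bad boring".split())[2] < 0.5)  # learned negative?
```

The key point is that no one told the model which words are positive; the embeddings and classifier co-adapt from raw examples, which is the same principle that, at vastly larger scale, makes the compute demands mentioned above so heavy.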
As a career tip, anyone who wants to work at Google or Facebook in AI should work through all the suggested reading and build some models of comparable performance. I suspect you'll get hired rather quickly. To create models in R, the most direct approaches are probably H2O or TensorFlow, which I wrote about a few weeks ago.
We all know that models and algorithms can have unexpected outcomes. The Nieman Foundation for Journalism runs a "heat-seeking Twitter bot" called Fuego. It collects and ranks articles shared amongst people talking about the future of journalism. Sometimes the juxtaposition of results can be profound.