
Averaging numbers is easy, but what about averaging text? I ask because the Oxford dictionary recently added a new term: big data.

So how do you define 'big data'? Do you pick a few experts (and how would you select them?) and ask them to write the definition? Or do you ask thousands of practitioners (crowdsourcing) and somehow average their responses? How would you automatically 'average' their opinions, possibly after clustering the responses into a small number of groups? It looks like this would require (a rough code sketch follows the list):

  • using taxonomies,
  • standardizing, cleaning, and stemming all keywords, very smartly (using stop lists and exception lists; for instance, booking is not the same as book),
  • using n-grams, but carefully (some keyword permutations are not equivalent; keep a list of these exceptions, e.g. data mining is not the same as mining data),
  • matching the cleaned/normalized keyword sequences found in the responses against taxonomy entries, to assign categories to each response and further simplify the text averaging process,
  • maybe a rule set associated with the taxonomy, such as average(entry A, entry C) = xyz,
  • a synonym dictionary.
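
Here is a minimal Python sketch of the normalization step with stop and exception lists. Every word list below is an illustrative placeholder, and the suffix-stripping 'stemmer' is deliberately crude:

    # Toy normalization pipeline; every list below is a placeholder.
    STOP_WORDS = {"the", "a", "of", "and"}
    # Words that must NOT be stemmed (booking is not the same as book;
    # mining is protected so that "data mining" survives intact).
    STEM_EXCEPTIONS = {"booking", "mining"}
    # Ordered n-grams that are not equivalent to their permutations.
    ORDER_SENSITIVE = {("data", "mining")}

    def naive_stem(word):
        # Crude suffix stripping; a real system would use a proper stemmer.
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def normalize(text):
        tokens = [t.lower() for t in text.split()]
        tokens = [t for t in tokens if t not in STOP_WORDS]
        return [t if t in STEM_EXCEPTIONS else naive_stem(t) for t in tokens]

    def same_phrase(a, b):
        # Keyword permutations are equivalent unless listed as order-sensitive.
        if tuple(a) in ORDER_SENSITIVE or tuple(b) in ORDER_SENSITIVE:
            return tuple(a) == tuple(b)
        return sorted(a) == sorted(b)

    print(normalize("Booking a flight"))                        # ['booking', 'flight']
    print(normalize("Data mining books"))                       # ['data', 'mining', 'book']
    print(same_phrase(["data", "mining"], ["mining", "data"]))  # False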

On a different note, have you ever heard of software that automatically summarizes text? That would be a fantastic tool!


Replies to This Discussion

To those who claim that you can't add up oranges and apples, here's my answer: average(apples, oranges) = edible, widespread fruits.

 

If you used the "mode" (the most frequently occurring value) of your data clusters, then average(apples, oranges, apples, kiwi) would yield "apples".
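
In code, that mode-of-the-labels notion of "average" is nearly a one-liner with Python's collections.Counter; the fruit labels are just this thread's toy data:

    from collections import Counter

    # "Averaging" categorical data by taking the mode of the labels.
    labels = ["apples", "oranges", "apples", "kiwi"]
    mode, count = Counter(labels).most_common(1)[0]
    print(mode, count)  # apples 2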

It seems to me, if you average a variety of fruits, you wind up with juice.

Or wine.

Nick Galemmo said:

It seems to me, if you average a variety of fruits, you wind up with juice.

Seems to depend upon the definition of 'average' with relation to 'inputs' :)

There is a free library for text summarization:

http://libots.sourceforge.net/

Comment by Doug Lautzenheiser:

There was a story about a high school boy who wrote an app called Summly to summarize text. See: http://venturebeat.com/2012/11/01/summly-launch/ . 

Yahoo acquired the app for $30 million USD. See: http://appleinsider.com/articles/13/03/25/yahoo-buying-ios-app-summ... .

However, the story gets weirder: some report that the teenager didn't even have rights to the core technology. See: http://www.businessinsider.com/the-17-year-old-that-yahoo-paid-30-m... .

Instead, it looks like the text summarization software was developed by SRI International, the same people behind Apple's Siri app.

I like this question. People think that numbers just work so easily with math, but it's not true. Math is just as symbolic as words are. What is the average of 3.2 Hz and 5.1 gorillas? What, exactly, is 5.1 gorillas? 3.2 and 5.1 work abstractly in math, but as soon as you apply the units it makes no sense. Average is meeting in the middle of all things. Median assumes a sort of spatial ordering on a linear dimension and is the middle as defined by the objects thus sorted. Not every math concept applies, I would think; roots don't apply in any way I can think of. Similarly, ontological or taxonomic relationships don't have a direct one-to-one relationship with a math structure, though we can use a lot of math to discover those relationships given enough data.

Thanks, this was a nice mind-expanding question to get my day started with!

Vincent Granville said:

To those who claim that you can't add up oranges and apples, here's my answer: average(apples, oranges) = edible, widespread fruits.

Summarizing text is a pretty widely studied problem in Natural Language Processing. There's half a chapter on it in the Jurafsky & Martin textbook, and work in the area goes back at least to the 1950s (Luhn).
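
For flavor, here's a bare-bones extractive summarizer in the spirit of Luhn's frequency approach: score each sentence by how often its content words occur in the whole text, then keep the top-scoring sentences. The tokenization and stop list are deliberately simplistic stand-ins, not a serious implementation:

    import re
    from collections import Counter

    STOP = {"the", "a", "an", "of", "and", "to", "is", "in", "it", "that"}

    def summarize(text, n_sentences=2):
        # Very crude sentence and word tokenization.
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP]
        freq = Counter(words)

        def score(sentence):
            # Luhn-style: sum the whole-text frequencies of the content words.
            tokens = re.findall(r"[a-z']+", sentence.lower())
            return sum(freq[t] for t in tokens if t not in STOP)

        top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
        # Re-emit the chosen sentences in their original order.
        return " ".join(s for s in sentences if s in top)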

Getting back to the main topic of the post: if I were looking to "average" text, I'd start by computing differences among entries in the corpus at various levels of abstraction. Feature vectors of references to salient terms, for example, are one abstraction that might be useful. Some abstraction of rhetorical intent would help too; that might start with basic sentiment analysis, but it'd be even better to employ Rhetorical Structure Theory analysis and perhaps even more complex approaches from Computational Rhetoric (see particularly the work of DiMarco and her team - some really fascinating stuff there).
Some further thoughts. Comparing texts on the same subject is complicated in part by the number of (already complex) dimensions. You want to compare the denotative content: what each text refers to, what claims it makes, etc. You want to compare the intentional content: what the writer appears to intend to communicate (leaving completely aside all the issues with intentionality and philosophy of language, and assuming a hopelessly naive model of language use as the transmission of ideas). And you want to compare the subjective content: the opinions the author expresses, and how they set the tone of the piece.

These can diverge radically between two texts. Two authors can express the same denotative content with completely opposite subjective stances and different intents. And you want your "average" (and other metrics - obviously in a case like this it'll often be useful to have some analog of the deviation!) to account for all of those dimensions.
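
As a toy illustration of the salient-term feature vector idea at the denotative level only (the intentional and subjective dimensions would each need their own features), here's a cosine comparison of term-frequency vectors; the vocabulary is a made-up placeholder:

    import math
    from collections import Counter

    # Placeholder "salient term" vocabulary; a real system would derive it
    # from the corpus (e.g., via tf-idf) and add sentiment/rhetoric features.
    SALIENT = ["data", "volume", "velocity", "variety", "analytics"]

    def term_vector(text):
        counts = Counter(text.lower().split())
        return [counts[t] for t in SALIENT]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    a = term_vector("big data means volume velocity and variety of data")
    b = term_vector("big data is analytics on data at volume")
    print(round(cosine(a, b), 3))  # ~0.772: similar, but not identical, topics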

Here's a possible approach that might yield some useful information. Start by defining a feature vector appropriate to the subject of the texts. Create a Maximum Entropy Markov Model using that feature vector and train it on a sample group selected at random from the texts. Then see how well the MEMM predicts the remaining texts. If it does well, the texts cluster relatively tightly around that feature vector; if it doesn't, they don't. You'd probably want to run this a few times using different random samples to cut down on artifacts. That's just off the top of my head; I don't know if the results would actually be useful (and making any sort of real argument about them is clearly a lot harder than the sort of handwaving I'm doing here). The tricky part is defining that feature vector (and implementing the NLP analysis to be able to set it).

The handy thing about using a feature vector and an MEMM is that the vector could include all the sorts of features Mirko mentioned above, like whether a particular term or n-gram appears in an entry.
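
A real MEMM over a rich feature vector is beyond a forum post, but the train-on-a-random-sample / score-the-rest loop is easy to sketch. In the stand-in below, a plain add-one-smoothed bigram model replaces the MEMM (an admitted simplification), and held-out average log-probability replaces "how well it predicts the remaining texts"; repeated random splits smooth out sampling artifacts, as suggested above. The corpus is toy data:

    import math
    import random
    from collections import Counter

    def bigram_counts(texts):
        pair, uni = Counter(), Counter()
        for text in texts:
            tokens = text.lower().split()
            uni.update(tokens)
            pair.update(zip(tokens, tokens[1:]))
        return pair, uni

    def avg_logprob(text, pair, uni, vocab_size):
        # Add-one smoothed bigram log-probability per token.
        tokens = text.lower().split()
        total = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            total += math.log((pair[(prev, cur)] + 1) / (uni[prev] + vocab_size))
        return total / max(len(tokens) - 1, 1)

    def cluster_tightness(texts, trials=5, train_frac=0.5):
        # Train on a random sample, score the held-out texts, repeat.
        scores = []
        for _ in range(trials):
            shuffled = texts[:]
            random.shuffle(shuffled)
            cut = int(len(shuffled) * train_frac)
            train, held_out = shuffled[:cut], shuffled[cut:]
            pair, uni = bigram_counts(train)
            scores.extend(avg_logprob(t, pair, uni, len(uni) + 1) for t in held_out)
        return sum(scores) / len(scores)  # higher (less negative) = tighter cluster

    corpus = [
        "big data means high volume high velocity high variety",
        "big data is data of high volume and high velocity",
        "my cat prefers tuna to kibble",
    ]
    print(cluster_tightness(corpus))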
