GIP IG Barometer: a text-mining approach to Internet Governance Monitoring

The Internet Governance (IG) Barometer methodology presents a quantitative summary of the main developments in the Internet Governance arena based on computational text- and data-mining approaches. The IG Barometer is based on the statistical modeling of large collections of textual documents: it is, in essence, an advanced discourse processing system. These collections - called text corpora - are obtained by querying various online media sources with IG-specific keywords and search terms, retrieving the most relevant IG news, articles, papers, etc. Thus, the IG Barometer reflects the status of the debate as represented in the media - not human expert judgment on particular IG issues. The IG Barometer was developed for the Geneva Internet Platform (GIP) by DiploFoundation; the GIP is an initiative supported by the Swiss authorities and operated by DiploFoundation.

For the purposes of analysis, the IG debate is first “dissected” into a number of IG Issues (e.g. Network neutrality, Cybersecurity, Multilingualism, IoT, E-money and virtual currencies, and Child safety online, to name only a few), following the IG taxonomy developed by Dr Jovan Kurbalija (cf. “An Introduction to Internet Governance”, Kurbalija, 2014). The IG Barometer methodology results in the computation of four scores for each IG Issue: Relevance, Specificity, Diversity, and Positivity, all expressed as percentile ranks - similar to the reporting of standardized test results. Each of these scores reflects the relative position of a particular IG Issue with respect to all other issues encompassed by the analysis. The computation of the scores is based upon the previously statistically modeled IGF Session Transcripts Text Corpus: a collection of hundreds of manually tagged session transcripts from the Internet Governance Forum 2006-14, rich with meta-data, encompassing the codification of expert IG knowledge as represented in the various IGF sessions, workshops, and fora. All IG Barometer computations are supported by the IG Terminological Model, a hand-picked and manually tagged selection of approximately 5,000 of the most relevant IG keywords, terms and phrases.

Here we describe conceptually the elements upon which the computation of the IG Barometer scores is founded; the interpretation of the four IG Barometer scores is provided immediately afterwards. For technical and mathematical details, we refer to the IG Barometer White Paper (to become available in October, 2015).

IG Terminological Model

The language used in the IG arena is highly specific. We need a tool to recognize this specificity against the background of ordinary language use. The IG Terminological Model is a hand-picked selection of IG-relevant keywords and phrases. They are used to describe any IG text document by an array of their frequencies of occurrence - a count of how many times each phrase or keyword appears in the text. Such data are typical inputs to computational text-mining procedures. The IG Terminological Model currently encompasses approximately 5,000 relevant keywords, terms and phrases, including the names of the most important IG processes, institutional or organisational actors, and technical, political and diplomatic terms and phrases.
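To make the "array of frequencies" idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the three-term list and the sample sentence are invented, and the matching rule (case-insensitive, whole-word) is a guess rather than the rule used in the actual IG Terminological Model.

```python
import re

def term_frequencies(text, terms):
    """Return one count per term: case-insensitive, whole-word matches."""
    lowered = text.lower()
    counts = []
    for term in terms:
        # \b word boundaries keep a short term from matching inside a longer word
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        counts.append(len(re.findall(pattern, lowered)))
    return counts

# Hypothetical three-term model; the real one holds about 5,000 entries.
terms = ["network neutrality", "cybersecurity", "IGF"]
doc = ("The IGF session on cybersecurity also touched on network neutrality. "
       "Cybersecurity dominated the discussion.")

print(term_frequencies(doc, terms))  # [1, 2, 1]
```

Each document then becomes a fixed-length vector of counts, one position per term, which is the usual input shape for text-mining procedures.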

image

The IG Terminological Model. The 350 most frequently used keywords and phrases tagged in the IGF Session Transcripts Corpus (2006-14); the size of a word or phrase scales with how often it was used in the IGF.

IG Taxonomy

The IG Taxonomy used to categorise and describe the field of IG is based upon Dr Jovan Kurbalija’s work, presented and updated through many editions of his “An Introduction to Internet Governance” handbook, which is widely used in IG introductory courses for diplomats and other scholars and has been translated into several world languages. The taxonomy is based on the dissection of the IG arena into seven baskets (or clusters): Infrastructure and Standardisation, Security, Human Rights, Legal, Development, Economic, and Socio-Cultural. Each basket encompasses several IG Issues on the next level of taxonomic organisation. As the IG debate evolves, the number of IG Issues used for analytical purposes varies; around forty IG Issues are routinely used in the computation of the IG Barometer scores.

image

IG Taxonomy. A subset of 40 currently used IG Issues is represented. The arrows indicate which IG issue is found to be closely similar in its usage to another issue in the IGF Session Transcripts Corpus (2006-14).

image

IG Taxonomy. A selection of IG Issues is represented on a 2D map obtained by ordinal multidimensional scaling (Smacof MDS). The proximity of issues scales with the similarity in their usage in the IGF Session Transcripts Corpus (2006-14).

IGF Knowledge Model

A collection of several hundred session transcripts from the IGF 2006-14 was manually inspected and tagged by a number of highly experienced IG experts as being representative of different IG Issues. This collection of session transcripts comprises the IGF Session Transcripts Text Corpus. The IGF Text Corpus was statistically modeled by applying contemporary approaches in computational text-mining (e.g. topic models via Latent Dirichlet Allocation), resulting in the IGF Knowledge Model. This model encompasses quantitative information about the distribution of importance of thousands of IG-relevant terms across all available session transcripts and all IG Issues from the current version of the taxonomy. This quantitative summary of the IGF session transcripts is used to “project” the expert knowledge on IG issues - as it is represented in the IGF discussions - onto any other collection of text documents of interest. It represents a combination of human expert knowledge with the power of computational modeling. The essential information upon which the IG Barometer scores rely comes from this application of the Knowledge Model to any new collection of documents.
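The “projection” step can be pictured, very loosely, as comparing a new document's term-frequency vector with a per-issue profile of term weights. The Python sketch below uses plain cosine similarity as a stand-in. It is not Latent Dirichlet Allocation and not the method used in the Knowledge Model; the term order, the issue profiles and all weights are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical term order: ["encryption", "malware", "translation", "script"]
# Hypothetical per-issue term weights, standing in for a fitted model.
issue_profiles = {
    "Cybersecurity": [0.6, 0.4, 0.0, 0.0],
    "Multilingualism": [0.0, 0.0, 0.7, 0.3],
}

def project(doc_vector, profiles):
    """Score a document's term-frequency vector against every issue profile."""
    return {issue: cosine(doc_vector, weights) for issue, weights in profiles.items()}

doc_vector = [3, 1, 0, 0]  # term counts for a new, unseen document
scores = project(doc_vector, issue_profiles)
best = max(scores, key=scores.get)
print(best)  # Cybersecurity
```

The point of the sketch is only the data flow: a model learned from expert-tagged transcripts yields per-issue profiles, and any new document can be scored against them without being tagged by hand.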

image

IGF Knowledge Model. The similarity structure among 40 IG Issues as computed from the 160-topic model of the IGF Session Transcripts (2006-14). The structure is produced by hierarchical clustering in R.

IGF Barometer Media Corpus

Periodically, we query various online media sources with a carefully selected list of search queries in order to retrieve the most relevant and up-to-date news, articles, papers and other texts relevant to the IG arena. Thousands of texts are typically processed by applying the IG Terminological Model, the Taxonomy, and the Knowledge Model before the IG Barometer scores for each IG Issue are computed. Relying on the previously described knowledge modeling steps, we automatically recognize the presence of each relevant IG Issue in the source documents and apply additional computational and statistical procedures to compute the IG Barometer scores for each of them.

IG Barometer Scores

The IG Barometer basic quantitative summary encompasses four scores:

Relevance. The relevance score describes the relative importance of each IG issue at the present moment. This score is computed from two components: explicit relevance, which relies on the exact count of issue-specific keywords and phrases in the analysed sources, and implicit relevance, which accounts for any presence in the sources of semantics similar to the issue-specific semantics. Thus the IG Barometer can “sense” the trace of - an association to, precisely speaking - some particular IG issue in the source documents even if the issue-specific keywords and phrases are used less frequently or avoided altogether.

Specificity. The most important words and phrases used in the debate on each IG issue are studied with respect to their overall frequencies of use. The more unique their usage in a particular IG issue relative to the whole IG debate, the more specific the language used to discuss that IG issue is considered to be. The specificity score describes the degree of linguistic and semantic specialization of the debate on a particular IG issue.

Diversity. The diversity score measures the variation that is present in the use of the most specific IG-relevant words and phrases used to discuss a particular IG issue. Imagine a debate in which all stakeholders use the same or similar terms in a more or less similar manner, and a parallel debate where one can detect two or more groups that differ in the way they make use of the most relevant words and phrases. The latter debate is considered to be more diversified, the former less diversified.

Positivity. Sentiment analysis - a psychologically-based automated method that considers the presence of emotionally charged words in a collection of text documents to estimate the overall affective tone of the discourse - is performed over the collection of source documents. The positivity score, indicating the degree to which the present debate is charged with more positive emotions, is computed for each IG-relevant issue.
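A lexicon-based reading of this idea can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the tiny word lists are invented, and the ratio used (positive words as a share of all charged words) is one simple choice, not the definition used for the IG Barometer.

```python
import re

# Hypothetical mini-lexicon; real sentiment lexicons hold thousands of entries.
POSITIVE = {"progress", "agreement", "secure", "benefit"}
NEGATIVE = {"threat", "attack", "breach", "failure"}

def positivity(text):
    """Share of emotionally charged words that are positive.

    Returns None when the text contains no charged words at all.
    """
    words = re.findall(r"[a-z]+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    charged = pos + neg
    return pos / charged if charged else None

sample = "The agreement marks progress despite the recent breach."
print(positivity(sample))  # two positive words, one negative word
```

Raw values like this one would still have to be computed per IG issue and converted to percentile ranks before they resemble the reported score.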

image

IG Barometer scores. Relevance, Specificity, Diversity, and Positivity for Cybersecurity, as computed in September 2015. The relevance of this IG issue dropped by 5% in comparison to the summer months of 2015.

All IG Barometer scores are expressed as percentile ranks, implying that there will be some particular IG issue that scores a maximum of 100% points on a particular scale each month (the percentile rank of a score is the percentage of scores that are equal to or lower than it).
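The percentile-rank definition given in parentheses translates directly into code. The Python sketch below applies it to made-up raw scores; the issue names are taken from the text, the numbers are not real IG Barometer values.

```python
def percentile_ranks(scores):
    """Map each issue to the percentage of scores that are <= its own score."""
    values = list(scores.values())
    n = len(values)
    return {
        issue: 100.0 * sum(v <= s for v in values) / n
        for issue, s in scores.items()
    }

# Invented raw scores for four issues, for illustration only.
raw = {
    "Cybersecurity": 0.82,
    "Network neutrality": 0.40,
    "Multilingualism": 0.15,
    "IoT": 0.40,
}
ranks = percentile_ranks(raw)
print(ranks)
# {'Cybersecurity': 100.0, 'Network neutrality': 75.0, 'Multilingualism': 25.0, 'IoT': 75.0}
```

Because the comparison is "equal to or lower than", the top-scoring issue always lands on 100, and tied issues share the same rank.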

The idea and the concept of the IG Barometer were developed by Dr Jovan Kurbalija, Founding Director of DiploFoundation. Mathematical definitions of the four IG Barometer scores were developed by Dr Goran Milovanović, Data Scientist, DiploFoundation, who also developed the text-mining framework for DiploFoundation (DTAF) and the Geneva Internet Platform upon which the IG Barometer computations rely. None of the computations for the IG Barometer would be possible without the previous knowledge engineering efforts of the following IG experts: Barbara Rosen Jacobson, Matthew Bugeja, Razee Liechti, Sorina Teleanu, Roxana Radu, and Vladimir Radunović.

All text-mining procedures upon which the computations of the IG Barometer rely are developed in the scope of the DiploFoundation’s Text-Analytics Framework (DTAF), a set of computational and statistical procedures and stand-alone software applications developed in the programming language R, relying on various R packages and many other open-source contributions from the scientific community. We thank all R contributors for making such a powerful system for technical computing available to the community.

Originally published at Geneva Internet Platform: GIP IG Barometer

 
