
Contributed by Jake Lehrhoff, John Montroy, and Chris Neimeth. They took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program between Sept 23 and Dec 18, 2015. This post is based on their fourth class project (due in the 9th week of the program).

For the greater part of 2015, Hillary Clinton has been under fire for her near-exclusive use of a personal email account on a non-governmental server during her time as Secretary of State. The press was certainly not the type a presidential candidate hopes for 18 months before an election. Amidst the continued investigation into the 2012 attacks on the Benghazi consulate, scandal rolled into scandal. State Department lawyers uncovered the former Secretary of State's potential breach of email protocol while working through documents for the House Select Committee on Benghazi. Investigators and pundits alike questioned Mrs. Clinton's continued counsel from Sidney Blumenthal, a family friend and advisor who held no role in the State Department. As if Mrs. Clinton weren't already dominating the 24-hour news cycle as the Benghazi deposition loomed, suddenly there was even more to the conversation: rather than questioning Mrs. Clinton's foresight and commitment to defending United States citizens working overseas, news anchors had us questioning our very trust of a POTUS frontrunner.

[Photo: Hillary Clinton during the 11-hour Benghazi hearing]

Months later, after an 11-hour Benghazi hearing that included one representative symbolically ripping up a blank piece of paper and another piling a desk with a stack of the former Secretary of State's private-server emails for dramatic effect, perhaps Bernie Sanders said it best: "Enough of the emails. Let's talk about the real issues facing America." Even for those who agree with the Vermont Senator's sentiment, it can be difficult to disengage from the discussion, especially now that we have the actual emails. After a series of Freedom of Information Act lawsuits, each passing month sees newly released emails from the original 55,000 pages. Nearly 8,000 of the released emails were cleaned and hosted on Kaggle.com, a website for data science competitions, inviting users to "uncover the political landscape" in these documents. Given how tired this scandal has grown, we decided the only thing left to do was to build a tool that lets you do your own investigation. No matter what your feelings are about Hillary Clinton, we hope you enjoy The Hillary Clinton Email Explorer.

*The website is currently in beta, so please do let us know if you experience any outages.


 

Overview

The landing page of the Explorer provides basic exploratory data analysis of the dataset. Many of the 7,945 released emails were redacted, and far more will likely never be released. What was available to us can paint a variety of pictures. Overall, Mrs. Clinton's emails were extremely short, averaging just 19 words, while Sidney Blumenthal's averaged over 600 words. It appears that Mr. Blumenthal sent far more emails than he received, but we know nothing about the content of the over 40,000 unreleased emails.
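A figure like the 19-word average takes only a couple of lines of pandas once the emails are loaded (a minimal sketch; the `Emails` DataFrame and its loader, getEmails(), appear in The Code section below):

# Sketch: average body-text length, in words, for each sender.
words_per_email = Emails['ExtractedBodyText'].fillna('').str.split().str.len()
print(words_per_email.groupby(Emails['MetadataFrom']).mean().sort_values(ascending=False).head(10))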


 

EPT Counter

The Email-Person-Topic Counter allows you to type in a series of search words and select senders to see the count of emails containing those words. Above, you can see that "Benghazi," "Libya," and "attack" appear in surprisingly few emails. One would imagine that the majority of emails concerning those topics remain unreleased or classified. But finer details can be gleaned: though Jake Sullivan and Huma Abedin both held the title of Deputy Chief of Staff, the content of these emails suggests that Huma was less involved in the affairs in Benghazi.

This simple tool can answer so many questions. Who says "please" and "thank you"? Who is concerned with basic administrative tasks? It seems that Mrs. Clinton is quite courteous in her emails (although she does prefer "pls" to "please"). We also see that Huma Abedin is more concerned with administrative tasks than her counterpart or her superior, Cheryl Mills.


Here's another question: how does everyone refer to President Obama? Only Mrs. Clinton uses the shortened POTUS, while Mr. Blumenthal is more likely than the rest to just use the president's first name. Overall, "Obama" is the most popular, while very few use the formal "President Obama." Fascinating! Please play with it yourself! If you find anything particularly interesting, post it in the comments below. A few quick instructions: separate your search terms with a comma and no space (e.g., "Iran,Iraq,Syria"), and hold the shift key to select multiple senders. To get you started, why don't you check out who wants tea and who wants coffee? And for a sanity check, search "tea party" as well, to make sure we know what kind of tea everyone is talking about!


 

Wordcloud Generator

To add to our investigation of the content within the released emails, we developed a wordcloud application. Select the name of one of the top contributors and see a cloud of the most common words in his or her emails. Size represents how frequently a given word appeared in the emails; the color is merely cosmetic. A quick scan of Mrs. Clinton's wordcloud shows that she is largely concerned with high-level administrative tasks: "thx," "pls," "print," "will," "call," "time," and "know" all appear prominently. Sidney Blumenthal's wordcloud shows more focused content covering a range of political topics; "Obama," "political," "Israel," and "issue" are all easy to locate. This wordcloud gives a peek into the nature of Mr. Blumenthal's advice for the former Secretary. It's also reassuring that even the highest-level political advisors in the country aren't above abbreviating "you" down to a single letter. Check out the wordclouds for the other top contributors to get a sense of their roles within Mrs. Clinton's State Department.
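For readers who want to reproduce something similar, here is a minimal sketch using the third-party wordcloud package (the package choice and its options are assumptions on our part, not a description of our production code; the cleaning helpers it calls appear in The Code section below):

# Sketch: build a wordcloud from one sender's cleaned email text.
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def makeWordcloud(df, person):
    bodies = df['ExtractedBodyText'].loc[
        df.MetadataFrom.str.contains(person, case=False)].fillna('').tolist()
    text = rmBoring(rmNonAlpha(' '.join(str(b) for b in bodies)))  # cleaning helpers defined below
    wc = WordCloud(width=800, height=400, background_color='white').generate(text)
    plt.imshow(wc)
    plt.axis('off')
    plt.show()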


 

Sentiment Analysis

The sentiment analysis tab provides the most distinctive view of the data. "Sentiment" runs from -1 to 1, depicting a range from completely negative to completely positive content. Sentiment was determined using TextBlob, a Python package that scores text by comparing it to its own lexicons of positive and negative words. Each query returns two graphs. The first gives the density of a given sender's sentiment; the highest peak shows the sentiment (x-axis) that is most common for that sender. Mrs. Clinton's emails were largely positive, with peaks at 0.2 and 0.5. This isn't surprising given all the "please" and "thank you" phrases we discovered in her emails. On the right we see the average sentiment of emails sent to each of the recipients along the x-axis. While Mrs. Clinton is positive with everyone, she is most positive with Mr. Blumenthal and least positive with Mr. Sullivan. When selecting a sender besides Mrs. Clinton, it's important to remember that sentiment toward any recipient other than the Secretary of State comes from a small sample of emails. However, there is still understanding to be gleaned. Cheryl Mills and Jake Sullivan, though positive toward Hillary Clinton and Huma Abedin, are not so rosy with each other. In fact, Jake Sullivan's sentiment in emails to Cheryl Mills is actually negative.
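As a standalone illustration of what TextBlob's polarity score does (a minimal sketch, independent of our application code; the example sentences are our own):

# TextBlob assigns each text a polarity score between -1 and 1.
from textblob import TextBlob

print(TextBlob("Thank you so much, this is wonderful news.").polarity)  # strongly positive words push the score up
print(TextBlob("This is a terrible, unacceptable outcome.").polarity)   # negative words push it down
print(TextBlob("Pls print.").polarity)                                  # words outside the lexicon leave it near 0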


The Code

All the code necessary to run this website can be found in our GitHub repository. The Hillary Clinton Email Explorer website was built with Flask, a Python-based web development framework. The design of the website comes from Bootstrap, which puts the stylings of the entire internet at your fingertips. Each page of the website has its own HTML file that contains the structure of the page. A series of Bootstrap CSS files decorate the pages while JavaScript files add functionality. While we worked with the HTML to populate the pages with our material, we did not have to touch the CSS or JS files that Bootstrap provides. The functions below are housed in the app's __init__.py file.
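As a rough illustration of how these pieces fit together, here is a minimal, self-contained sketch of a Flask route rendering a Bootstrap-styled template (the route and template name are hypothetical, not one of our actual pages):

# Minimal sketch of the Flask wiring described above (hypothetical page and template names).
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route('/explorer')
def explorer():
    # Flask looks up the Bootstrap-based HTML file in the templates/ folder;
    # keyword arguments become variables available inside the template.
    query = request.args.get('query')
    return render_template('explorer.html', query=query)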

Get the emails

The data lives in a MySQL database; the function below queries it and populates a pandas DataFrame with the given column names.

# The snippets below assume the usual imports at the top of __init__.py, e.g.:
#   import re, StringIO, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
#   from datetime import datetime
#   from textblob import TextBlob
#   from flask import Flask, request, send_file

def getEmails():
    """Pull every released email out of MySQL and into a pandas DataFrame."""
    # `mysql` is the Flask MySQL extension object configured at app startup (see the sketch below)
    con = mysql.connect()
    cur = con.cursor()
    sql = """SELECT * FROM EmailsC"""
    cur.execute(sql)
    Emails = cur.fetchall()
    Emails2 = [tuple(elm) for elm in Emails]
    EmailsFinal = pd.DataFrame(Emails2, columns=[
        u'Id', u'DocNumber', u'MetadataSubject', u'MetadataTo', u'MetadataFrom',
        u'SenderPersonId', u'MetadataDateSent', u'MetadataDateReleased',
        u'MetadataPdfLink', u'MetadataCaseNumber', u'MetadataDocumentClass',
        u'ExtractedSubject', u'ExtractedTo', u'ExtractedFrom', u'ExtractedCc',
        u'ExtractedDateSent', u'ExtractedCaseNumber', u'ExtractedDocNumber',
        u'ExtractedDateReleased', u'ExtractedReleaseInPartOrFull',
        u'ExtractedBodyText', u'RawText'])
    return EmailsFinal
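The `mysql` connection object used above is configured when the app starts. A minimal sketch of that setup, assuming the flask-mysql extension and placeholder credentials (the host, user, password, and database name here are hypothetical):

# Assumed setup for the `mysql` object via the flask-mysql extension; all credentials are placeholders.
from flask import Flask
from flaskext.mysql import MySQL

app = Flask(__name__)
app.config['MYSQL_DATABASE_HOST'] = 'localhost'
app.config['MYSQL_DATABASE_USER'] = 'user'            # placeholder
app.config['MYSQL_DATABASE_PASSWORD'] = 'password'    # placeholder
app.config['MYSQL_DATABASE_DB'] = 'clinton_emails'    # placeholder database name

mysql = MySQL()
mysql.init_app(app)

# Load the table once at startup so the functions below can share one DataFrame.
Emails = getEmails()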

 

Cleaning the Data

The following functions use regular expressions to strip unnecessary or problematic elements from the text. The first removes symbols; the second strips boilerplate phrases from the emails, particularly those that mark an email as part of the House Benghazi committee's document production. While these regular expressions certainly strip out unwanted material, it is possible that they remove a few words that were in fact innocuous parts of the true body text. However, as the words in question carry no sentiment, we do not feel that we are risking the loss of pertinent data.

def rmNonAlpha(texts):
    """Remove non-alphabetic characters (roughly) from a string or a list of strings."""
    if isinstance(texts, list):
        # the hyphen sits at the end of the character class so it is matched literally, not as a range
        stripped = [re.sub(r'[[\]()<>{}!:,;_|\."\'\\-]', '', text) for text in texts]
        ctext = [re.sub(r'\s+', ' ', piece) for piece in stripped]
    elif isinstance(texts, (str, unicode)):
        ctext = re.sub(r'[(){}<>,\.!?;:\'"/\\\_|]', '', texts)
    return ctext


def rmBoring(texts):
    """
    Remove boring stuff.
    Warning: strong assumptions ahead...but we gotta do some chopping.
    """
    # overhead stuff (header lines)
    ctext = re.sub(r'^From .*\n', '', texts, flags=re.MULTILINE)
    ctext = re.sub(r'^To .*\n', '', ctext, flags=re.MULTILINE)
    ctext = re.sub(r'^Case No .*\n', '', ctext, flags=re.MULTILINE)
    ctext = re.sub(r'^Sent .*\n', '', ctext, flags=re.MULTILINE)
    ctext = re.sub(r'^Doc No .*\n', '', ctext, flags=re.MULTILINE)
    ctext = re.sub(r'^Subject .*\n', '', ctext, flags=re.MULTILINE)

    # other misc
    ctext = re.sub(r'.*@.*', '', ctext)  # email addresses
    ctext = re.sub(r'(?i)(monday|tuesday|wednesday|thursday|friday|saturday|sunday).*\d{3,4} [AP]M\n', '', ctext, flags=re.MULTILINE)  # timestamps
    ctext = re.sub(r'Fw .*\n', '', ctext, flags=re.MULTILINE)  # forward line
    ctext = re.sub(r'Cc .*\n', '', ctext, flags=re.MULTILINE)  # Cc line
    ctext = re.sub(r'B[56(7C)]', '', ctext)  # FOIA exemption markers such as B5, B6, B7(C)

    # House Benghazi committee production stamps
    ctext = re.sub(r'Date 05132015.*\n', '', ctext, flags=re.MULTILINE)
    ctext = re.sub(r'STATE DEPT .*\n', '', ctext, flags=re.MULTILINE)
    ctext = re.sub(r'SUBJECT TO AGREEMENT.*\n', '', ctext, flags=re.MULTILINE)
    ctext = re.sub(r'US Department of State.*\n', '', ctext, flags=re.MULTILINE)
    return re.sub(r'\s+', ' ', ctext).lower()
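A quick usage example, chaining the two helpers on a single email body (assuming `Emails` has already been loaded as sketched above):

# Example: clean one raw email body before any counting or sentiment scoring.
raw = Emails['ExtractedBodyText'].fillna('').iloc[0]
print(rmBoring(rmNonAlpha(raw))[:200])  # first 200 characters of the cleaned, lowercased text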

 

Counts by keyword

The following two functions create counts of emails that mention particular topics. The first takes a single person and counts the emails from that person containing each of the given keywords; the second applies that function to all of the selected senders.

 

def CountsByKeyword(df, col, person, topics, StartDate='2009-01-01', EndDate='2013-01-01'):
    """
    Returns a dict of total mention counts per keyword for the given person.
    Returns counts for the passed-in time frame; defaults to the entire timeframe.
    The 'col' parameter controls which field you're getting counts by.
    """
    if not isinstance(topics, (str, unicode, list)):
        raise TypeError('\'topics\' parameter must be either str or list')
    if isinstance(topics, (str, unicode)):
        topics = [topics]  # a single keyword is treated as a one-item list

    person = '(' + person + ')'
    StartDate = datetime.strptime(StartDate, '%Y-%m-%d')
    EndDate = datetime.strptime(EndDate, '%Y-%m-%d')
    return {topic: df[col].loc[
                (df[col].str.contains(person, case=False))
                & (df['ExtractedBodyText'].str.contains(topic, case=False))
                & (df['MetadataDateSent'] > StartDate)
                & (df['MetadataDateSent'] <= EndDate)].count()
            for topic in topics}

def buildCounterDF(personlist, topiclist):
    """Build a long-format DataFrame of (Person, Topic, count) rows for the EPT Counter."""
    PersonThing = list()
    PersonTopic = pd.DataFrame()
    topiclist = topiclist.split(',')  # search terms arrive as a single comma-separated string
    for person in personlist:
        PersonThing.append(
            tuple((person,
                   CountsByKeyword(Emails, col='MetadataFrom', person=person, topics=topiclist)))
        )
    for item in PersonThing:
        tdf = pd.DataFrame.from_dict(item[1], orient='index')
        tdf['Person'] = item[0]
        tdf.reset_index(level=0, inplace=True)
        tdf.rename(columns={'index': 'Topic', 0: 'count'}, inplace=True)
        tdf = tdf[['Person', 'Topic', 'count']]
        PersonTopic = PersonTopic.append(tdf)
    return PersonTopic
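For example, the Benghazi query shown in the EPT Counter section corresponds to a call like this (the exact MetadataFrom name strings are assumptions on our part):

# Example: count emails from selected senders that mention each search term.
senders = ['Hillary Clinton', 'Sullivan, Jacob J', 'Abedin, Huma']  # assumed MetadataFrom spellings
print(buildCounterDF(senders, 'Benghazi,Libya,attack'))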

 

Make Sentiment

Sentiment is determined with TextBlob. Two functions are necessary: the first creates the data that populates the first sentiment graph, which shows the density of sentiments across a given sender's corpus of emails; the second determines sentiment by recipient. The extra argument, "personlist", is populated with the recipients selected from the dropdown menu in the application.

def GetSentimentPerPerson(df, person):
    """Score every email from `person`, returning rows with a nonzero sentiment."""
    text = df[['MetadataFrom', 'ExtractedBodyText']].loc[
        df.MetadataFrom.str.contains('(' + person + ')')]

    text.ExtractedBodyText = text.ExtractedBodyText.apply(
        lambda x: rmBoring(rmNonAlpha(x)).decode('ascii', 'ignore'))
    text['sentiment'] = text['ExtractedBodyText'].apply(lambda x: TextBlob(x).polarity)
    return text.loc[text.sentiment != 0]  # only return meaningful scores


def GetSentimentForPeople(df, target, personlist):
    """Return (recipient, polarity) tuples for emails sent by `target` to each recipient."""
    sentimentlist = list()
    stoplist = set('for a of the and to in on from'.split())

    for person in personlist:
        bodies = df['ExtractedBodyText'].loc[
            (df.MetadataFrom.str.contains('(' + target + ')'))
            & (df.MetadataTo.str.contains('(' + person + ')'))].values.tolist()

        # join the emails into one document, dropping the most common stopwords
        words = ' '.join(str(body) for body in bodies).split()
        text = ' '.join(word for word in words if word not in stoplist)
        text = rmBoring(rmNonAlpha(text)).decode('ascii', 'ignore')

        sentimentlist.append(tuple((person, TextBlob(text).polarity)))

    return sentimentlist
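And a usage example mirroring the two sentiment plots above (again, the name strings passed in are assumptions):

# Example: overall sentiment distribution for one sender, then sentiment toward selected recipients.
hrc = GetSentimentPerPerson(Emails, 'Hillary Clinton')  # assumed sender string
print(hrc['sentiment'].describe())
print(GetSentimentForPeople(Emails, 'Hillary Clinton', ['Sullivan, Jacob J', 'Abedin, Huma']))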

 

Creating Visualizations

The HTML templates running the website contain Flask's Jinja template code, as seen below. The request.args.get calls take the two inputs, "target" and "personlist", and use them to request "sentplot" and "sentpeopleplot", two plots that are defined back in __init__.py and eventually populate the sentiment page.

{% if request.args.get('target') != None and request.args.get('personlist') != None %}
<div class="panel panel-primary">
  <div class="panel-heading">
    <h3 class="panel-title" style="text-align:center;">Graph Results</h3>
  </div>
  <div class="panel-body">
    <span><center>
      <img src="{{ url_for('sentplot', target = target, personlist = personlist) }}" alt="Distribution of sentiment" style="position:relative;top:-10px;" height=600 width=650>
      <img src="{{ url_for('sentpeopleplot', target = target, personlist = personlist) }}" alt="Personal sentiment towards others" style="position:relative;top:-10px;" height=600 width=650>
    </center></span>
  </div>
</div>
{% endif %}

 

Back in __init__.py these graphs are defined much like they would be in a normal Python shell. The only unique pieces are the "@app.route" lines. You may notice, as you play with the website, that your graphing requests appear in the URL itself; the route string maps where the arguments appear in the URL so they can be passed into the function. The plt.clf() calls clear matplotlib of any existing plot, making room for the new visualization. The functions return send_file(img, mimetype='image/png'), a PNG of the otherwise typical plot.

 

@app.route('/fig/<target>/<personlist>/sentplot.png')
def sentplot(target, personlist):
    target = re.sub(r'\+', ' ', target)  # '+' in the URL stands in for a space
    EmailSnt = GetSentimentPerPerson(Emails, target)
    plt.clf()
    sns.distplot(EmailSnt.sentiment)
    plt.xlim(-1, 1)
    plt.title('Email Sentiment: {}'.format(target), fontsize=16)
    fig = plt.gcf()
    img = StringIO.StringIO()
    fig.savefig(img)
    img.seek(0)
    return send_file(img, mimetype='image/png')


@app.route('/fig/<target>/<personlist>/sentpeopleplot.png')
def sentpeopleplot(target, personlist):
    target = re.sub(r'\+', ' ', target)
    personlist = personlist.split(',')
    s = GetSentimentForPeople(Emails, target, personlist)
    s = pd.DataFrame(s, columns=['Person', 'Sentiment'])
    plt.clf()
    sns.barplot(x='Person', y='Sentiment', data=s)
    plt.ylabel('Sentiment')
    plt.title('How {} feels'.format(target))
    fig = plt.gcf()
    img = StringIO.StringIO()
    fig.savefig(img)
    img.seek(0)
    return send_file(img, mimetype='image/png')

Debugging

Flask has an intuitive debugging interface. If the app fails, a traceback appears on the page and lets you run Python as if the web browser were your console. The final grey line contains the original error and then prompts you to use the window like a console. Here we've typed the object "personlis" when we meant "personlist." We can recreate the error and work out what the line ought to be. In a world where bugs can too often feel beyond one's reach, Flask's debugging system is a welcome and effective tool.

 

[Screenshot: Flask's in-browser debugger showing the "personlis" traceback]
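For anyone recreating the setup, this in-browser debugger only appears when the app is launched in debug mode, which in a typical Flask project looks something like this (a minimal sketch with an assumed package layout; never leave debug mode on in production):

# run.py: start the Explorer locally with Flask's interactive debugger enabled.
from app import app  # assumed layout: the Flask app object lives in app/__init__.py

if __name__ == '__main__':
    app.run(debug=True)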

Conclusion

This project was a fascinating topic wrapped in an exciting technical challenge. As data scientists, we here at NYC Data Science Academy are not full stack engineers, and yet with the help of Flask and Bootstrap, we were able to create a functional, attractive website to showcase our work.

 

Hillary Clinton's released Secretary of State emails contain a wealth of information. Our website only begins to tap into all that content. For all it can dig out of this dataset, we have yet to show you a single, complete email. So, to conclude, here is one particularly important example. Enjoy!

 

[Image: the released "Gefilte fish" email]
