Webscraping the WSJ

Contributed by Joseph van Bemmelen.

There are many newspapers available in the New York City area that cater to different segments of the population. This project focuses on the Wall Street Journal (WSJ), an international newspaper with a high circulation in the New York area. For this project, I scraped details of articles over a three-week period in order to analyze some basic metrics of the newspaper as well as some of the topics that the WSJ focuses on.

Using Python and Scrapy, I scraped several metrics for each article:

  1. title
  2. subtitle
  3. section(s)
  4. author(s)
  5. publication time
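
As a rough illustration of how these five fields might be held once scraped, here is a minimal sketch using only the standard library. The names `ArticleRecord`, `clean_text`, and `build_record` are hypothetical and not taken from the project, and the sketch assumes the timestamp arrives as an ISO 8601 string. In a Scrapy project the record would more typically be declared as a `scrapy.Item`.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class ArticleRecord:
    """One scraped article. Field names are illustrative, not the author's."""
    title: str
    subtitle: Optional[str] = None
    sections: List[str] = field(default_factory=list)
    authors: List[str] = field(default_factory=list)
    published: Optional[datetime] = None


def clean_text(raw: Optional[str]) -> Optional[str]:
    """Collapse whitespace left over from HTML; return None if nothing remains."""
    if raw is None:
        return None
    text = " ".join(raw.split())
    return text or None


def build_record(raw: dict) -> ArticleRecord:
    """Turn a dict of raw scraped strings into a typed record.

    Assumption: "published" is an ISO 8601 string, and "sections" and
    "authors" are lists of strings (an article can have several of each).
    """
    published_raw = clean_text(raw.get("published"))
    return ArticleRecord(
        title=clean_text(raw.get("title")) or "",
        subtitle=clean_text(raw.get("subtitle")),
        sections=[s for s in map(clean_text, raw.get("sections", [])) if s],
        authors=[a for a in map(clean_text, raw.get("authors", [])) if a],
        published=datetime.fromisoformat(published_raw) if published_raw else None,
    )
```

Keeping sections and authors as lists matters for the later counts, since a single article can belong to more than one section or have more than one author.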

Below is an example of an article from the WSJ that I scraped, with the individual metrics highlighted:

[Figure: example WSJ article with the scraped metrics highlighted]

In total, the project covered 4,126 articles from three consecutive weeks in August 2016. Several interesting findings emerged from these metrics. First, the number of articles published online varied by day of the week. The count appears to grow during the week, peaking on Wednesday and Thursday, while Saturday and Sunday have the lowest numbers on average. This may reflect reader demand, as readers may be more likely to check news sources during the week than on the weekend. As a result, there may also be fewer writers working on the weekend and fewer articles being published.
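
The day-of-week comparison above can be sketched as follows. This is a minimal example assuming each article's publication time is already a `datetime`. The function name `mean_articles_per_weekday` is hypothetical, and the project's actual aggregation may have differed.

```python
from collections import Counter, defaultdict
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def mean_articles_per_weekday(timestamps):
    """Average number of articles per calendar day, grouped by weekday.

    Only dates with at least one article are counted, so a day with no
    articles at all would not pull the average down.
    """
    # Count articles on each calendar date first.
    per_date = Counter(ts.date() for ts in timestamps)

    # Then collect the daily counts under their weekday (0 = Monday).
    by_weekday = defaultdict(list)
    for day, n_articles in per_date.items():
        by_weekday[day.weekday()].append(n_articles)

    return {
        WEEKDAYS[wd]: sum(counts) / len(counts)
        for wd, counts in sorted(by_weekday.items())
    }
```

Averaging per calendar date, rather than summing all articles by weekday, keeps the comparison fair when the sample does not contain the same number of each weekday.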

Additionally, the number of articles varies by section. The largest section actually consists of articles syndicated from the Associated Press (AP). These do not usually show up on the WSJ home page and include short articles on sports scores and winning lottery numbers. The next largest sections by number of articles are Business and Markets. As its name indicates, the WSJ focuses heavily on markets and business news, which is reflected in the number of articles published in these sections.

Going one level below the section, we can examine the topics discussed by looking at the words that appear in article titles. During this three-week period, “Trump” was the most common word across all published article titles, appearing close to 150 times (over 3.5% of titles). By comparison, “Clinton” appeared around 60 times (1.5%). Another popular topic by word count during this period was “China” (around 80 times). The discrepancy between “Trump” and “Clinton” could be a result of Trump being part of more newsworthy events during this period, or a result of the WSJ focusing on one candidate more than the other.
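
A count like this can be sketched with the standard library. The stopword list, the tokenization rule, and the function names below are assumptions made for illustration, not the project's actual choices. The sketch tracks total occurrences separately from the number of titles containing a word, since a percentage of titles depends on the latter.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; a real analysis would
# likely use a fuller one).
STOPWORDS = {"the", "a", "an", "of", "to", "in", "on", "for",
             "and", "at", "as", "is", "with", "by"}


def tokenize(title):
    """Lowercase a title and split it into alphabetic words.

    Splitting on non-letters turns "Trump's" into "trump" and "s";
    single-letter tokens and stopwords are then dropped.
    """
    words = re.findall(r"[a-z]+", title.lower())
    return [w for w in words if len(w) > 1 and w not in STOPWORDS]


def title_word_stats(titles):
    """Return (total occurrences, number of titles containing each word)."""
    occurrences = Counter()
    titles_with = Counter()
    for title in titles:
        words = tokenize(title)
        occurrences.update(words)       # every occurrence counts
        titles_with.update(set(words))  # each title counts at most once
    return occurrences, titles_with


def share_of_titles(word, titles_with, n_titles):
    """Fraction of titles that contain the word at least once."""
    return titles_with[word] / n_titles if n_titles else 0.0
```

Calling `occurrences.most_common(10)` on the first counter would then list the ten most frequent title words.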

To further explore the WSJ’s coverage of both candidates, we can look at the correlations between mentions of the candidates’ names and other words found in the same title.

[Figure: words most correlated with each candidate's name in article titles]

As seen above, both candidates are most strongly correlated with their first names. After his first name, Trump is most often associated with Trump Tower, immigration or immigrants, and his campaign. Clinton, on the other hand, is most often associated with the Clinton Foundation, Chelsea, Huma Abedin, and emails. Some of these associations may have more negative connotations, but it is hard to measure whether the newspaper is merely reporting the news or focusing disproportionately on negative events for one candidate over the other (such as “emails” for Clinton). By looking at multiple newspapers during the same period, we could compare which newspapers focus relatively more or less on news events that boost or hurt a candidate. That way, we might be able to estimate whether one newspaper leans towards a candidate more than another newspaper does.
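
The post does not say which correlation measure was used. One plausible choice, sketched below as an assumption, is the phi coefficient: the Pearson correlation between two yes/no indicators, here "title contains the candidate's name" and "title contains the other word". The function names are hypothetical.

```python
import math
import re
from collections import Counter


def title_words(title):
    """Set of lowercase alphabetic words in a title (single letters dropped)."""
    return {w for w in re.findall(r"[a-z]+", title.lower()) if len(w) > 1}


def phi(n, n_a, n_b, n_ab):
    """Phi coefficient for two 0/1 indicators over n titles.

    n_a and n_b are the numbers of titles containing each word, and n_ab
    is the number containing both. Returns 0.0 when either word appears
    in every title or in none, since the correlation is undefined there.
    """
    denom = math.sqrt(n_a * (n - n_a) * n_b * (n - n_b))
    return (n * n_ab - n_a * n_b) / denom if denom else 0.0


def correlated_words(titles, target, top=5):
    """Words ranked by their title-level correlation with `target`."""
    word_sets = [title_words(t) for t in titles]
    n = len(word_sets)
    doc_freq = Counter(w for ws in word_sets for w in ws)
    co_freq = Counter(w for ws in word_sets if target in ws for w in ws)
    scores = {
        w: phi(n, doc_freq[target], doc_freq[w], co_freq[w])
        for w in doc_freq
        if w != target
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

With only a few thousand titles, words that appear just once or twice can score high by chance, so a minimum-frequency cutoff would be a sensible addition before reading much into the ranking.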

Lastly, sentiment analysis on WSJ article titles proved largely fruitless. The classification algorithm was unable to interpret many of the words in the titles, possibly because of the large number of names and places that appear in them. The words that could be categorized skewed positive, which many would argue is not the emotion most often evoked by newspaper titles.
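
The post does not name the classifier, so the following is only a toy illustration of the coverage problem, assuming a simple word-lexicon approach. The lexicon and the function name `score_titles` are invented for the example. Proper nouns such as names and places fall outside the lexicon and go unscored.

```python
import re

# Toy sentiment lexicon, invented for illustration only. Real work would
# use a published lexicon or a trained classifier.
LEXICON = {
    "gain": 1, "gains": 1, "win": 1, "wins": 1, "rise": 1, "strong": 1,
    "loss": -1, "fall": -1, "falls": -1, "crisis": -1, "weak": -1,
}


def score_titles(titles, lexicon=LEXICON):
    """Score titles word by word and report how many words were covered."""
    matched = unmatched = net_score = 0
    for title in titles:
        for word in re.findall(r"[a-z]+", title.lower()):
            if word in lexicon:
                matched += 1
                net_score += lexicon[word]
            else:
                unmatched += 1  # names, places, and neutral words land here
    total = matched + unmatched
    return {
        "coverage": matched / total if total else 0.0,
        "net_score": net_score,
        "matched": matched,
        "unmatched": unmatched,
    }
```

Reporting coverage alongside the score makes it visible when a result rests on only a small fraction of the words.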

Future research on other newspapers, such as the New York Times or New York Post, could provide additional insight into the topics that different newspapers focus on and whether some newspapers lean politically to one side on a topic compared to others. Additionally, this project was unable to measure an article's popularity by comment count because of the WSJ’s website structure. Using Selenium, we might be able to learn more about article popularity and retrieve full article text for additional text analysis.

As always, please feel free to reach out to me with any comments, criticism, or other feedback on this project. Thank you!