Michael Grogan's Blog (13)

Summarizing Economic Bulletin Documents with TF-IDF

A key strength of NLP (natural language processing) is its ability to process large amounts of text and then summarise it to extract meaningful insights.

In this example, a selection of economic bulletins in PDF format from 2018 to 2019 is analysed in order to gauge economic sentiment. The bulletins in question are sourced from the European Central Bank website. TF-IDF is used to rank…
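As a rough illustration of how TF-IDF ranks terms, here is a minimal sketch using only the Python standard library. It assumes the simplest textbook weighting (term frequency multiplied by an unsmoothed log inverse document frequency), which may differ from the variant used in the full post, and the three sentences are hypothetical stand-ins for bulletin text.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document (unsmoothed TF-IDF)."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical sentences, not taken from any real bulletin.
docs = [
    "inflation remains subdued in the euro area",
    "growth in the euro area remains moderate",
    "inflation expectations and wage growth",
]
scores = tf_idf(docs)

# Rank the terms of the first document from most to least distinctive.
ranked = sorted(scores[0], key=scores[0].get, reverse=True)
print(ranked[0])
```

A term that appears in only one document receives the highest weight, while a term that appears in every document scores zero under this weighting.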


Added by Michael Grogan on July 11, 2019 at 12:28pm — No Comments

Deploying Python application using Docker and AWS

The use of Docker in conjunction with AWS can be highly effective when it comes to building a data pipeline.

Have you ever been in this situation? You are building a model in Python that you need to send to a third party, e.g. a client or a colleague. However, the person on the other end cannot run the code! Maybe they don't have the right libraries installed, or their system is not configured correctly.

Whatever the reason, Docker alleviates this…


Added by Michael Grogan on July 5, 2019 at 8:30am — No Comments

Multilevel Modelling of U.S. Home Loan Data

The housing market has changed considerably in the past decade, with more stringent lending criteria for housing now being enforced.

A key objective of financial institutions is to minimise the risk of mortgage lending by ensuring that the debtor is ultimately able to repay the loan.

In this example, multilevel modelling techniques are used to analyse data from the Federal Home Loan Bank…


Added by Michael Grogan on July 3, 2019 at 3:01am — No Comments

Predicting Hotel Cancellations with Support Vector Machines and SARIMA

Hotel cancellations can cause issues for many businesses in the industry. Not only is there the lost revenue as a result of the customer cancelling, but this can also cause difficulty in coordinating bookings and adjusting revenue management practices.

Data analytics can help to overcome this issue by identifying the customers who are most likely to cancel, allowing a hotel chain to adjust its marketing strategy accordingly.

To investigate how machine learning can…


Added by Michael Grogan on July 2, 2019 at 3:00am — No Comments

Visualizing New York City WiFi Access with K-Means Clustering

Visualization has become a key application of data science in the telecommunications industry.

Specifically, telecommunication analysis is highly dependent on the use of geospatial data. This is because telecommunication networks in themselves are geographically dispersed, and analysis of such dispersions can yield valuable insights regarding network structures, consumer demand and availability.


To illustrate this point, a k-means clustering algorithm is used…


Added by Michael Grogan on February 19, 2019 at 3:44am — No Comments

Image Recognition with Keras: Convolutional Neural Networks

Image recognition and classification is a rapidly growing field in the area of machine learning. In particular, object recognition is a key feature of image classification, and the commercial implications of this are vast.

For instance, image classifiers will increasingly be used to:

  • Replace passwords with facial recognition
  • Allow autonomous vehicles to detect obstructions
  • Identify geographical features from satellite imagery



Added by Michael Grogan on February 17, 2019 at 11:00am — No Comments

Variance-Covariance Matrix: Stock Price Analysis in R

A variance-covariance matrix shows the variance of each variable on its diagonal, while the off-diagonal entries show the covariance between every pair of variables.
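To make the layout of such a matrix concrete, here is a minimal sketch. The post itself works in R; this illustration is written in plain Python instead, uses the sample (n - 1) denominator, and runs on two made-up series rather than real stock prices.

```python
def covariance_matrix(columns):
    """Sample variance-covariance matrix for a list of equal-length columns."""
    n = len(columns[0])
    k = len(columns)
    means = [sum(col) / n for col in columns]
    matrix = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            # Sum of cross-deviations, divided by n - 1 (sample estimate).
            matrix[i][j] = sum(
                (a - means[i]) * (b - means[j])
                for a, b in zip(columns[i], columns[j])
            ) / (n - 1)
    return matrix


# Hypothetical series; the second is exactly twice the first.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
cov = covariance_matrix([x, y])

for row in cov:
    print(row)
```

The diagonal entries `cov[0][0]` and `cov[1][1]` are the variances of `x` and `y`, and the matrix is symmetric because the covariance of `x` with `y` equals the covariance of `y` with `x`.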

Why do we use variance-covariance matrices?

A variance-covariance matrix is particularly useful when it comes to analysing the volatility between elements of a group of data. For instance, a variance-covariance matrix has particular applications when it comes to…


Added by Michael Grogan on June 30, 2018 at 4:30am — No Comments

Linear regression in Python: Use of numpy, scipy, and statsmodels

The numpy, scipy, and statsmodels libraries are frequently used to generate regression output. Which library a user chooses often depends on the data in question, among other considerations. Here, we will go through how to use each of them to generate regression output.
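For reference, the quantity all three libraries estimate in the simplest case is the ordinary least squares line. The sketch below computes it by hand in plain Python so that it runs without any of those libraries installed; it is not the API of numpy, scipy, or statsmodels, and the data points are made up.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on the line y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(x, y)
print(slope, intercept)
```

The libraries add what this sketch leaves out, such as standard errors, p-values, and support for multiple regressors.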

Linear Regression using numpy and…


Added by Michael Grogan on August 26, 2017 at 6:30am — No Comments

Creating maps in R using ggplot2 and maps libraries

Here is how we can use the maps, mapdata and ggplot2 libraries to create maps in R.

In this example, we’re going to create a world map marking Beijing and Shanghai, both cities in China. The map will display the Northern Hemisphere from Europe to Asia.





cities =…


Added by Michael Grogan on August 22, 2017 at 4:00am — 1 Comment

Creating functions in R

Functions are used to simplify a series of calculations.

For instance, let us suppose that there exists an array of numbers which we wish to add to another variable. Instead of carrying out separate calculations for each number in the array, it would be much easier to simply create a function that does this for us automatically.

A function in R generally works by:

(a) Defining the variables to include in the function and the calculation. e.g. to add two…


Added by Michael Grogan on August 12, 2017 at 5:30am — No Comments

Create PostgreSQL Database In Linux And Connect To R

PostgreSQL is a commonly used relational database system for storing and managing large amounts of data effectively.

Here, you will see how to:

1) create a PostgreSQL database using the Linux terminal

2) connect the PostgreSQL database to R using the “RPostgreSQL” library

Creating our PostgreSQL database

In this example, we are going to create a simple database containing a table of dates, cities, and average temperatures in degrees Celsius.

We will name…


Added by Michael Grogan on August 7, 2017 at 7:30am — No Comments

Data Cleaning and Wrangling With R

One of the big issues when working with data in any context is cleaning and merging datasets. You will often find yourself having to collate data across multiple files, and will need to rely on R to carry out tasks that you would normally handle with commands like VLOOKUP in Excel.

The tips I give below for data manipulation in R are not exhaustive - there are a myriad of ways in which…


Added by Michael Grogan on July 10, 2017 at 6:00pm — 1 Comment

Python: Implementing a k-means algorithm with sklearn

Below is an example of how sklearn in Python can be used to develop a k-means clustering algorithm.

The purpose of k-means clustering is to partition the observations in a dataset into a specified number of clusters in order to aid analysis of the data. It is therefore particularly valuable for data visualisation.
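To show what the partitioning involves, here is a minimal sketch of the standard assign-then-update loop (Lloyd's algorithm) in plain Python. The post itself relies on sklearn; this hand-rolled version is only an illustration, it seeds the centroids with the first k points to stay deterministic, and the 2-D points are made up.

```python
def k_means(points, k, iterations=10):
    """Lloyd's algorithm on 2-D points, seeded with the first k points."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels, centroids


# Hypothetical points forming two well-separated groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]
labels, centroids = k_means(points, 2)
print(labels)
print(centroids)
```

In practice a library implementation is preferable, since it handles smarter initialisation and convergence checks that this sketch omits.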

This post explains how to:

  1. Import kmeans and PCA through the sklearn…

Added by Michael Grogan on June 17, 2017 at 8:00am — 9 Comments

© 2019 Data Science Central ®