In this tutorial, we will cover how to extract information from a matrimonial website using R. We will do this through web scraping, which is the process of converting data that is available in an unstructured format on a website into a structured format that can be used for further analysis.

We will use an R package called rvest, which was created by Hadley Wickham. This package simplifies the process of scraping web pages.

Web Scraping in R

Install the required packages

To download and install the rvest package, run the following command. We will also use dplyr, which is useful for data manipulation tasks.
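A minimal sketch of the installation step, assuming both packages are installed from CRAN under their usual names. The mirror URL and the check for already-installed packages are assumptions added for illustration:

```r
# Packages used in this tutorial
pkgs <- c("rvest", "dplyr")

# Keep only the packages that are not installed yet
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

# Install the missing ones (the CRAN mirror below is an assumption)
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs, repos = "https://cloud.r-project.org")
}
```

Running the block a second time does nothing, because every package is already available by then.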

Load the required Libraries

To load the libraries into your R session, run the code below.
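A minimal sketch of the loading step, assuming both packages are already installed:

```r
library(rvest)  # read_html(), html_nodes(), html_text(), html_attr()
library(dplyr)  # data manipulation helpers; also makes the %>% pipe available
```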

Scrape Information from Matrimonial Website

First, we need to understand the structure of the URL. See the URLs below.

https://www.jeevansathi.com/punjabi-brides-girls
https://www.jeevansathi.com/punjabi-grooms-boys

The first URL takes you to the page that shows girls' profiles from the Punjabi community, whereas the second URL lists boys' profiles from the Punjabi community.
We need to split the main URL into its elements so that we can build it for any combination of community and profile type.
Main_URL = Static_URL + Mother_Tongue + Brides_Grooms

The following R code shows how to prepare the main URL. In the code, you need to provide the following details:

  1. Whether you are looking for girls'/boys' profiles. Type bride to see girls' profiles. Enter groom to check out boys' profiles.
  2. Select Mother Tongue. For example, punjabi, tamil etc.
# Looking for bride/groom
Bride_Groom = "bride"
# Possible Values : bride, groom

# Select Mother Tongue
Mother_Tongue = "punjabi"
# Possible Values
# punjabi
# tamil
# bengali
# telugu
# kannada
# marathi

# URL
if (tolower(Bride_Groom) == "bride") {
  html = paste0('https://www.jeevansathi.com/', tolower(Mother_Tongue), '-brides-girls')
} else {
  html = paste0('https://www.jeevansathi.com/', tolower(Mother_Tongue), '-grooms-boys')
}

See the output:

[1] "https://www.jeevansathi.com/punjabi-brides-girls" 
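The same URL logic can also be wrapped in a function, which makes it easier to reuse for other communities. This is only a sketch: the function name build_profile_url() and its argument names are hypothetical, and it assumes every mother tongue follows the same URL pattern as the Punjabi example.

```r
# Hypothetical helper: Main_URL = Static_URL + Mother_Tongue + Brides_Grooms
build_profile_url <- function(mother_tongue, bride_groom = "bride") {
  static_url  <- "https://www.jeevansathi.com/"
  bride_groom <- tolower(bride_groom)

  # Stop early on anything other than the two supported values
  if (!bride_groom %in% c("bride", "groom")) {
    stop("bride_groom must be 'bride' or 'groom'")
  }

  suffix <- if (bride_groom == "bride") "-brides-girls" else "-grooms-boys"
  paste0(static_url, tolower(mother_tongue), suffix)
}

url_brides <- build_profile_url("punjabi", "bride")
url_grooms <- build_profile_url("Tamil", "groom")
```

Because the inputs are lower-cased inside the function, "Tamil" and "tamil" produce the same URL.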

Extract Profile IDs

First, you need to select parts of an HTML document using CSS selectors, which is what html_nodes() does. SelectorGadget, a free Chrome extension, is the easiest and quickest way to find out which selector pulls the data that you are interested in.

How to use SelectorGadget: Click on a page element that you would like your selector to match (it will turn green). SelectorGadget will then generate a minimal CSS selector for that element and highlight (in yellow) everything that the selector matches.

text = read_html(html) %>% html_nodes(".profileContent .color11 a") %>% html_text()
profileIDs = data.frame(ID = text)
         ID
1  ZARX0345
2  ZZWX5573
3  ZWVT2173
4  ZAYZ6100
5  ZYTS6885
6  ZXYV9849
7   TRZ8475
8   VSA7284
9  ZXTU1965
10 ZZSA6877
11 ZZSZ6545
12 ZYSW4809
13 ZARW2199
14 ZRSY0723
15 ZXAT2801
16 ZYXX8818
17 ZAWA8567
18  WXZ2147
19 ZVRT8875
20 ZWWR9533
21 ZYXW4043

The basic functions in rvest are user-friendly and robust. These functions are explained below:

  1. read_html() : creates an HTML document from a URL, a file or a string
  2. html_nodes() : extracts pieces out of an HTML document using CSS selectors
  3. html_nodes(".class") : selects nodes by CSS class
  4. html_nodes("#class") : selects nodes (<div>, <span>, <pre>, etc.) by id
  5. html_text() : extracts only the text from an HTML tag
  6. html_attr() : extracts the contents of a single attribute
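The functions listed above can be tried without touching a live website. The sketch below parses a small, made-up HTML string whose structure only imitates the selector used earlier. The IDs and href values are invented for illustration and are not taken from the real site.

```r
library(rvest)

# Made-up HTML snippet so the example runs without network access
page <- read_html('
  <div class="profileContent">
    <span class="color11"><a href="/profile/AAA0001">AAA0001</a></span>
    <span class="color11"><a href="/profile/BBB0002">BBB0002</a></span>
  </div>')

# Select the <a> tags nested inside .profileContent and .color11
links <- html_nodes(page, ".profileContent .color11 a")

ids   <- html_text(links)          # text between <a> and </a>
hrefs <- html_attr(links, "href")  # value of the href attribute

profiles <- data.frame(ID = ids, link = hrefs)
```

Here html_text() returns the visible link text, while html_attr() returns the value stored in the tag's href attribute.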

Difference between .class and #class

1. .class targets the following element:
<div class="class"></div>

2. #class targets the following element:
<div id="class"></div>
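The difference can be checked on a tiny, made-up document that contains both kinds of element. This is a sketch for illustration only, and the text inside each <div> is invented:

```r
library(rvest)

# One element with class="class" and one with id="class"
page <- read_html('
  <div class="class">matched by the class selector</div>
  <div id="class">matched by the id selector</div>')

by_class <- html_text(html_nodes(page, ".class"))  # matches class="class"
by_id    <- html_text(html_nodes(page, "#class"))  # matches id="class"
```

Each selector matches exactly one of the two elements, even though both use the same name.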
To read the complete tutorial,  check out the link Webscraping with R
