We are all surrounded by data, and it tells us a great deal that informs our decisions and suggests next steps. Data is collected from sources such as the web, databases, and log files. It is then thoroughly cleaned and reshaped, and finally analyzed and explored to uncover the hidden patterns and trends that are essential to business decision-making. Extracting data from the web is usually easy when a website offers an API, but what if it doesn't? In that case, web scraping is an excellent way to extract unstructured data from the web and put it into a structured format such as Excel, CSV, or a database.
Web Scraping with Selenium WebDriver
Open Page in Chrome Browser:
Just provide the address (an XPath or any other locator) of the data to be extracted, and Selenium WebDriver pulls it from the page with a single API call: find_element_by_xpath in older Selenium releases, or find_element(By.XPATH, ...) in Selenium 4, where the older helper methods were removed. See how easy it is?
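As a rough sketch of that step, the helper below collects the text of every element matching an XPath. It uses the Selenium 4 style find_elements call; the function names extract_texts and scrape are made up for illustration, and scrape assumes Chrome and a matching driver are installed locally.

```python
def extract_texts(driver, xpath):
    """Return the visible text of every element that matches `xpath`."""
    # "xpath" is the string value behind Selenium's By.XPATH constant.
    elements = driver.find_elements("xpath", xpath)
    return [element.text for element in elements]


def scrape(url, xpaths):
    """Open `url` in Chrome and return {field name: list of extracted texts}.

    `xpaths` maps a field name to the XPath that locates its elements.
    """
    # Imported here so extract_texts can be reused without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are available
    try:
        driver.get(url)
        return {name: extract_texts(driver, xp) for name, xp in xpaths.items()}
    finally:
        driver.quit()  # always close the browser, even if extraction fails
```

A call might look like scrape(page_url, {"Movie Name": "//h3/a"}), where both the URL and the XPath are placeholders that depend entirely on the page being scraped.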
The data that WebDriver extracts from the page is stored in Python lists, with a separate list for each field.
Sample data for Movie Name, Votes, and Director is shown here; the rest of the data is stored in its own Python lists in the same way.
['Bajirao Mastani', 'Queen', 'Bhaag Milkha Bhaag', 'Barfi!', 'Zindagi Na Milegi Dobara']
['17,362', '39,518', '39,731', '52,308', '41,731']
['Director: Sanjay Leela Bhansali', 'Director: Vikas Bahl', 'Director: Anurag Basu']
There is no visible relationship between these lists: if someone needs the release year and director for a particular movie, it is hard to pick them out. So let's put the data into a structured, more meaningful format that makes sense to anyone looking at it, by binding the values for each movie into a Python dictionary:
{
"Director": "Director: Sanjay Leela Bhansali",
"Votes": "17,362",
"RunTime": "A historical ... (158 mins.)",
"Year": 2015,
"Genre": "Drama",
"Movie Name": "Bajirao Mastani",
"Rating": "7.2"
}
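One possible way to do this binding is to zip the per-field lists together, one dictionary per movie. This is only a sketch: bind_records is a hypothetical helper, it assumes every list has one entry per movie in the same order, and it uses just the first two movies from the sample lists above.

```python
def bind_records(columns):
    """Turn {'field': [v1, v2, ...]} into [{'field': v1, ...}, {'field': v2, ...}]."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        # Positional binding only makes sense when the lists line up.
        raise ValueError("all lists must have the same number of entries")
    keys = list(columns)
    return [dict(zip(keys, row)) for row in zip(*columns.values())]


movies = bind_records({
    "Movie Name": ["Bajirao Mastani", "Queen"],
    "Votes": ["17,362", "39,518"],
    "Director": ["Director: Sanjay Leela Bhansali", "Director: Vikas Bahl"],
})

for movie in movies:
    print(movie)
```

The length check matters in practice: if one field is missing for some movie on the page, the lists drift out of step and values end up attached to the wrong title.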
The data in a Python dictionary (key: value pairs) looks better and makes more sense now. Look carefully, however, and you will see that it is not yet in the right form for data manipulation: the Votes value contains a comma, Director contains the unwanted text "Director:", and Rating and RunTime are not stored as the correct data types. Let's clean this data to get it into shape for analysis:
{
"Director": "Sanjay Leela Bhansali",
"Votes": 17362,
"RunTime": 158,
"Year": 2015,
"Genre": "Drama",
"Movie Name": "Bajirao Mastani",
"Rating": 7.2
}
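The cleanup described above could be sketched as follows. The function name clean_movie and the regular expression for the runtime are assumptions; the pattern simply looks for a number followed by "mins" inside parentheses, as in the sample record.

```python
import re


def clean_movie(raw):
    """Return a copy of a scraped movie dict with cleaned, typed values."""
    cleaned = dict(raw)
    # Drop the leading "Director:" label and surrounding whitespace.
    cleaned["Director"] = raw["Director"].replace("Director:", "", 1).strip()
    # Remove thousands separators before converting to an integer.
    cleaned["Votes"] = int(raw["Votes"].replace(",", ""))
    cleaned["Rating"] = float(raw["Rating"])
    # Pull the minutes out of text such as "... (158 mins.)".
    match = re.search(r"\((\d+)\s*mins?\.?\)", raw["RunTime"])
    cleaned["RunTime"] = int(match.group(1)) if match else None
    return cleaned


raw_movie = {
    "Director": "Director: Sanjay Leela Bhansali",
    "Votes": "17,362",
    "RunTime": "A historical ... (158 mins.)",
    "Year": 2015,
    "Genre": "Drama",
    "Movie Name": "Bajirao Mastani",
    "Rating": "7.2",
}
print(clean_movie(raw_movie))
```

When no runtime is found in the text, the sketch stores None rather than guessing, so the gap stays visible for later handling.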
The entire movie data set is now stored in Python dictionaries, but for further analysis it needs to be loaded into a Pandas DataFrame so that we can use Pandas' rich data structures and built-in functions. Import the data into a DataFrame.
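A minimal sketch of that import, assuming the cleaned records are held in a list of dictionaries. The first record is the cleaned example above; the second carries only the fields shown in the sample lists, with everything else deliberately left out so that Pandas fills the gaps with NaN.

```python
import pandas as pd

movies = [
    {
        "Director": "Sanjay Leela Bhansali",
        "Votes": 17362,
        "RunTime": 158,
        "Year": 2015,
        "Genre": "Drama",
        "Movie Name": "Bajirao Mastani",
        "Rating": 7.2,
    },
    # Partial record for illustration: the missing keys become NaN.
    {"Movie Name": "Queen", "Votes": 39518, "Director": "Vikas Bahl"},
]

# Each dictionary becomes a row; the keys become the column names.
df = pd.DataFrame(movies)
print(df)
```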
There are some missing values in this data, but Pandas provides excellent features for handling missing and null values. For three of the movies, RunTime is not available on the page, so for further analysis we will replace the missing values with the mean of the available values in the RunTime column.
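Filling the gaps with the column mean might look like the sketch below. The titles and most runtimes here are made-up placeholders rather than values from the scraped page; only the approach is the point.

```python
import pandas as pd

df = pd.DataFrame({
    "Movie Name": ["Movie A", "Movie B", "Movie C", "Movie D"],  # placeholder titles
    "RunTime": [158.0, None, 142.0, None],  # None marks a runtime missing from the page
})

# mean() skips NaN by default, so only the available runtimes are averaged.
mean_runtime = df["RunTime"].mean()

# Replace every missing runtime with that mean.
df["RunTime"] = df["RunTime"].fillna(mean_runtime)
print(df)
```

Assigning the result back to the column, rather than filling in place, keeps the operation explicit and avoids surprises with chained assignment.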
Comment
Hi,
I am a beginner at scraping. This post was really useful for me and well explained. I would like to know how to navigate to the next pages to extract all the reviews, and how to avoid being blocked by the server when sending requests from a program. I tried to scrape all the reviews using Beautiful Soup, but after a few pages it blocks me from scraping. I would be happy to learn how to advance my knowledge further. Thank you.
© 2021 TechTarget, Inc.