To extract data from websites, you can use a data extraction tool such as Octoparse. These tools pull data from websites automatically and save it in formats such as Excel, JSON, CSV, and HTML, or send it to your own database via APIs. Extracting thousands of lines of data takes only a few minutes, and the best part is that no coding is required.
Take Google Search as an example. Let’s say we are interested in information related to “smoothie” and we want to extract all the titles, descriptions, and webpage URLs from the search results. To extract data from Google Search, you can use a web scraping template. A template is a preformatted crawler that is ready to use without any configuration. There are over 50 templates to choose from, ranging from eCommerce websites like Amazon and eBay to social media channels like Facebook, Twitter, and Instagram. Octoparse offers custom templates as well.
To use the templates, you need to have Octoparse installed on your computer. Select the “Task Template” mode, then navigate to the Google Search web scraping template under the “search engine” category.
Open the template. Check the instructions and the sample output to make sure that this template will get you the data you need. You can hover the cursor over the data fields to see which elements on the website will be extracted.
Check the parameters to get a better idea of what you need to input. Parameters vary from template to template: a template may ask for a URL, a keyword, a list of URLs or keywords, the number of pages that you want to scrape, and so on. In this case, we need to input the search term “smoothie”.
Proceed by clicking “use template”, then enter “smoothie” and hit “save and run”. If it’s a one-time project, you can simply run the crawler on your local computer. If you are handling an ongoing project, you can instead schedule the extraction on the Octoparse cloud platform. When the extraction is done, you can export the data into many formats, such as Excel, CSV, and txt.
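If you plan to post-process the exported data in code, it helps to see what the same records look like in two common export formats. The sketch below is a minimal illustration using only Python's standard library. It is not Octoparse's export code, and the sample rows, field names, and example.com URLs are made up for illustration.

```python
import csv
import io
import json
from typing import Dict, List

# Hypothetical sample rows, shaped like the title/description/URL fields
# discussed in this article. The values are invented for illustration.
ROWS: List[Dict[str, str]] = [
    {
        "title": "Smoothie recipes",
        "description": "Easy recipes, ready in minutes.",
        "url": "https://example.com/recipes",
    },
    {
        "title": "Best smoothies",
        "description": "A roundup of popular blends.",
        "url": "https://example.com/best",
    },
]

FIELDS = ["title", "description", "url"]


def to_csv(rows: List[Dict[str, str]], fields: List[str]) -> str:
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json(rows: List[Dict[str, str]]) -> str:
    """Serialize rows to a JSON array of objects."""
    return json.dumps(rows, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    print(to_csv(ROWS, FIELDS))
    print(to_json(ROWS))
```

CSV suits spreadsheets and flat tables, while JSON keeps field names attached to every record, which is convenient when the data feeds another program.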
We have just seen how to use a web scraping template to extract web data from Google Search. You can also build your own crawler in a few clicks using the “Advanced Mode”. It requires some configuration, but it is highly flexible in terms of data extraction.
In Advanced Mode, if you are trying to extract data on a large scale, you can enter a list of as many as 10,000 URLs into the box. In this case, since we are only scraping one website, let’s just paste our target URL into the box and click “save URL” to proceed.
Switch the browser to Firefox 45. Once Octoparse has loaded the webpage in the built-in browser, build the pagination by clicking the “Next” page button and choosing “Loop click next page” on the Action Tips panel. The pagination loop we just built then appears in the workflow area.
Now we can extract the data. Click on the title of a search result and click “select all”. Once all the titles are selected, they’ll be highlighted in green. Click “extract text of the selected element” to extract all the titles. Let’s pause for a moment to take a look at the workflow. We just built an extraction loop inside the pagination loop. The entire extraction process works like this: the bot first opens the webpage, extracts the titles on the first page one by one, and then goes on to the next page and repeats the extraction until it is stopped or completed.
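The workflow described above, an extraction loop nested inside a pagination loop, can be sketched in a few lines of Python. This is a conceptual illustration only, not how Octoparse is implemented. The in-memory `PAGES` dictionary, the page names, and the `max_pages` limit are assumptions that stand in for real web pages and for the "stopped or completed" condition.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for a paginated site: each page lists its result
# titles and names the page that the "Next" button would lead to.
PAGES: Dict[str, dict] = {
    "page1": {"titles": ["Smoothie recipes", "Best smoothies"], "next": "page2"},
    "page2": {"titles": ["Green smoothie guide"], "next": None},
}


def crawl(start: str, max_pages: int = 10) -> List[str]:
    """Collect titles page by page until there is no next page
    or the page limit is reached."""
    extracted: List[str] = []
    current: Optional[str] = start
    visited = 0

    # Pagination loop: one iteration per results page.
    while current is not None and visited < max_pages:
        page = PAGES[current]

        # Extraction loop: one iteration per title on the current page.
        for title in page["titles"]:
            extracted.append(title)

        # Equivalent of clicking "Next"; None means the last page was reached.
        current = page["next"]
        visited += 1

    return extracted


if __name__ == "__main__":
    for title in crawl("page1"):
        print(title)
```

The outer loop ends either when there is no next page (completed) or when the page limit is hit (stopped), which mirrors the two ways the extraction can finish.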
You can follow the same method to extract the descriptions. Finally, to extract the URLs, click on the “A” tag and choose “extract the URL of the selected link”. When the description and URL fields show up in the upper right corner, they have been extracted successfully. Now we can edit the field names, save the scraping task, and start the extraction.
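To see what "extract text" and "extract the URL of the selected link" mean at the HTML level, here is a minimal sketch using Python's built-in `html.parser` module. It reads the visible text and the `href` attribute of each `<a>` tag. The `LinkExtractor` class, the `extract_links` helper, and the sample HTML are hypothetical; real search result pages have far more complex markup, and this is not how Octoparse itself selects elements.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional

# Hypothetical markup standing in for a tiny results page.
SAMPLE_HTML = (
    '<div><a href="https://example.com/a">Smoothie recipes</a>'
    "<p>Easy recipes.</p>"
    '<a href="https://example.com/b">Best <b>smoothies</b></a></div>'
)


class LinkExtractor(HTMLParser):
    """Collect the text and href of every <a> tag that has an href."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Dict[str, str]] = []
        self._href: Optional[str] = None  # href of the link being read
        self._text: List[str] = []        # text pieces inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.links.append({"title": title, "url": self._href})
            self._href = None


def extract_links(html: str) -> List[Dict[str, str]]:
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    for link in extract_links(SAMPLE_HTML):
        print(link["title"], link["url"])
```

The link text corresponds to the title field and the `href` value to the URL field, which is why the two are extracted with different actions.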
Besides Google, data extraction tools can pull data from many other websites, and they are widely used across industries. For example, companies can extract data from Yellowpages, Yelp, and Google Maps to generate sales leads.