Home » Uncategorized

How Xpath Plays Vital Role In Web Scraping

XPath is a language for finding information in structured documents like XML or HTML. You can say that XPath is (sort of) SQL for XML or HTML files. XPath is used to navigate through elements and attributes in an XML or HTML document.

To understand XPath we must be clear about elements and nodes which are the building blocks of XML and HTML. Let’s talk about them. Here is an example element in an HTML document:

   <a class=”hyperlink” href=http://www.google.com>google</a>

Copy the above text to a file, name it as sample.html and open it in a browser. This will end up as a text link displaying the words “google” and it will take you to www.google.com. For each element there are three main parts: The typethe attributes, andthe text. They are listed below:

 a                                 Type
class,  href                Attributes
google                       Text 


Let’s grab some XPath developer tools. I am on Firebug for Firefox or you can use Chrome’s developer tools. We will now form some XPath expressions to extract data from the above element. We will also verify the XPath by using Firebug Console.

For extracting the text “google”:

   //a[@href]/text()    

   //a[@class=”hyperlink”]/text()
  

For extracting the hyperlink i.e. ”www.google.com” :

   //a/@href
//a[@class=”hyperlink”]/@href

That’s all with a single element but in reality, you need to deal with more complex forms.

Let’s proceed to the idea of nodes, and its familial relationship of HTML elements. Look at this example code:

<div title=”Section1″>

   <table id=”Search”>

       <tr class=”Yahoo”>Yahoo Search</tr>

       <tr class=”Google”>Google Search</tr>

   </table>

</div>

Notice the </div> at the bottom? That means the table and tr elements are contained within the div. These other elements are considered descendants of the div. The table is a child, and the tr is a grandchild (and so on and so forth). The two tr elements are considered siblings each other. This is vital, as XPath uses these relationships to find your element.

So suppose you want to find the Google item. Any of the following expressions will work:

   //tr[@class=’Google’]
   //div/table/tr[2]
  //div[@title=”Section1″]//tr[text()=”Google Search”]

So let’s analyze the expressions. We start at the top element (also known as a node). The // means to search all descendants, / means to just look at the current element’s children. So //div means look through all descendants for a div element. The brackets [] specify something about that element. So we can look for an attribute with the @ symbol, or look for text with the text() function. We can chain as many of these together as we can.

Here is a quick reference:

   //             Search all descendant elements
   /              Search all child elements
   []             The predicate (specifies something about the element you are looking for)
   @           Specifies an element attribute. (For example, @title)
    
   .               Specifies the current node (useful when you want to look for an element’s children in the predicate)
   ..              Specifies the parent node
  text()       Gets the text of the element.
    
In the context of web scraping, XPath is a nice tool to have in your belt, as it allows you to write specifications of document locations more flexibly than CSS selectors.

Read the original article here