Last modified: Sep 12, 2023 By Alexander Williams

Scrapy xpath() - Find By Xpath

Today, we're going to learn how to find by Xpath using Scrapy. And explore how to effectively find and extract data using XPath expressions.

Visit this guide if you have not set up your Scrapy project before beginning.

What is Xpath

XPath (XML Path Language) is a language used for navigating XML and HTML documents.
You can use XPath to select elements and attributes within an XML or HTML document by specifying their paths.

How to get the Xpath of any element

By using your browser, you can simply get the Xpath of any element. Here are the steps:

  1. Open the web page in Google Chrome.
  2. Open Chrome Developer Tools by pressing Ctrl + Shift + I (or Cmd + Option + I on macOS) or right-clicking and selecting "Inspect" or pressing F12.
  3. In Chrome Developer Tools, go to the "Elements" tab.
  4. Locate the element you want to find the XPath for by hovering over it or selecting it in the "Elements" tab.
  5. Right-click on the element.
  6. From the context menu, hover over "Copy" and then select "Copy XPath."

And, here's an example from Chrome:

Get Xpath Using Chrome

How to Find By Xpath

To find by Xpath, we will use the xpath() method, which is used to locate and select elements from a web page's HTML or XML source code using XPath expressions.

Here is the syntax of the xpath() method:

response.xpath('xpath')

To select all the <a> elements on a page, you can use the following XPath expression:

response.xpath('//a')

To select all <a> elements with the class attribute set to "button," you can use:

response.xpath('//a[@class="button"]')

To select all <p> elements that are children of a <div> element with class "article," you can use an XPath expression like this:

response.xpath('//div[@class="article"]/p')

To select all <h2> elements with the exact text "Latest News," you can use:

response.xpath('//h2[text()="Latest News"]')

These are just a few examples of selecting elements using the xpath() method. Now, let's make a Scrapy spider that gets all of the article titles and links from the home page of pytutorial.com.

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://pytutorial.com']

    def parse(self, response):
        articles = response.css('.post-title a')

        for article in articles:
            # Extract text from each selected <a> element (title)
            title = article.css('::text').get()
            
            # Extract the href attribute (link) from the <a> element
            link = article.css('::attr(href)').get()

            # Print or yield the extracted data as needed
            yield {'title': title, 'link': link}

To run the spider, here is the command:

scrapy crawl myspider -O data.json

Output of data.json:

[
{"title": "Scrapy - Find by ID Attribute", "link": "/scrapy-find-by-id"},
{"title": "Scrapy css() - Find By Class Selector", "link": "/scrapy-css-find-by-class-selector"},
{"title": "How to Install and Setup Scrapy", "link": "/how-to-install-and-setup-scrapy"},
{"title": "How to Append Multiple Items to List in Python", "link": "/python-append-multiple-list"},
{"title": "How to Use BeautifulSoup clear() Method", "link": "/how-to-use-beautifulsoup-clear-method"},
{"title": "Python: Add Variable to String & Print Using 4 Methods", "link": "/python-variable-in-string"},
{"title": "How to Use Beautifulsoup  select_one() Method", "link": "/how-to-use-beautifulsoup-select_one-method"},
{"title": "How To Solve ModuleNotFoundError: No module named in Python", "link": "/how-to-solve-modulenotfounderror-no-module-named-in-python"},
{"title": "Beautifulsoup Get All Links", "link": "/beautifulsoup-get-all-links"},
{"title": "Beautifulsoup image alt Attribute", "link": "/beautifulsoup-image-alt-attribute"}
]

Conclusion

By the end of this article, you will be able to find elements by the XPath selector using Scrapy. If you want to learn how to find elements by CSS selector, please visit the following articles:

Scrapy css() - Find By Class Selector

Additionally, if you'd like to explore finding elements by their ID attribute, you can check out:

Scrapy - Find by ID Attribute