Last modified: Sep 13, 2023 By Alexander Williams
Scrapy - Get Href Attribute
In this article, we will explain how to get href attributes from web pages with Scrapy, using XPath and CSS selectors.
Ensure your Scrapy project is set up before you begin by following this guide.
Get href Using XPath selectors
You can use XPath to get elements' href. This is what the code looks like:
response.xpath('//a/@href').extract()
response: This is the web page's response object, received after sending the HTTP request to the start URL.
.xpath('//a/@href'): This is an XPath selector that extracts the href attribute (@href) of every anchor (<a>) element on the page. In other words, it selects all links on the page.
.extract(): Returns the selected values as a list of strings.
In the next step, we'll write a spider that gets all links (href) from the pytutorial.com website.
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://pytutorial.com']

    def parse(self, response):
        links = response.xpath('//a/@href').extract()
        for link in links:
            yield {'link': link}
To execute the spider, we'll run the following command:
scrapy crawl myspider -O data.json
And this is the data.json result:
[
{"link": "/"},
{"link": "/category/python-tutorial"},
{"link": "/category/django-tutorial"},
{"link": "#"},
{"link": "/online/email-extractor-online-free"},
{"link": "/online/calculate-text-read-time-online"},
{"link": "/online/html-to-markdown-converter-online"},
{"link": "/online/tools/"},
{"link": "/about-us"},
{"link": "/contact-us"},
{"link": "#search"},
{"link": "/scrapy-find-by-xpath"},
{"link": "/scrapy-find-by-id"},
{"link": "/scrapy-css-find-by-class-selector"},
{"link": "/how-to-install-and-setup-scrapy"},
{"link": "/python-append-multiple-list"},
{"link": "/how-to-use-beautifulsoup-clear-method"},
{"link": "/python-variable-in-string"},
{"link": "/how-to-use-beautifulsoup-select_one-method"},
{"link": "/how-to-solve-modulenotfounderror-no-module-named-in-python"},
{"link": "/beautifulsoup-get-all-links"},
{"link": "https://www.facebook.com/Pytutorial-108500610683725/?modal=admin_todo_tour"},
{"link": "https://twitter.com/pytutorial"},
{"link": "https://www.youtube.com/@pytutorial9501"},
{"link": "/privacy-policy/"},
{"link": "/dmca/"}
]
As you can see, the output contains every link on the page, including relative paths such as /about-us and fragment links such as #search.
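Many of the extracted values are relative paths, so you may want absolute URLs before storing or following them. The sketch below shows one possible approach with the standard library's urljoin; the helper name to_absolute and the choice to skip fragment-only links are assumptions made for this example, not part of the original spider. Inside a spider, response.urljoin(link) does the same job using the response's own URL as the base.

```python
from urllib.parse import urljoin

# Base URL taken from the spider's start_urls.
base_url = 'http://pytutorial.com'

# A few values copied from the data.json output above.
raw_links = ['/', '/about-us', '#search', 'https://twitter.com/pytutorial']

def to_absolute(base, links):
    """Hypothetical helper: resolve links against base, skipping fragment-only links."""
    absolute = []
    for link in links:
        # Links such as "#" or "#search" point inside the same page.
        if link.startswith('#'):
            continue
        # urljoin leaves absolute URLs untouched and resolves relative ones.
        absolute.append(urljoin(base, link))
    return absolute

absolute_links = to_absolute(base_url, raw_links)
print(absolute_links)
```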
Get href Using CSS selectors
CSS selectors can also be used to get the href attribute. Here is the code:
response.css("a::attr(href)").extract()
Now, let's get all href attributes (links) from pytutorial.com using CSS selectors.
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://pytutorial.com']

    def parse(self, response):
        links = response.css("a::attr(href)").extract()
        for link in links:
            yield {'link': link}
The result is the same as with the XPath selector.
Conclusion
Scrapy provides efficient ways to extract HREF attributes from web pages using XPath and CSS selectors.
By following the steps outlined in this article, you can scrape links and href attributes from websites using Scrapy spiders.