Last modified: Sep 10, 2023, by Alexander Williams
How to Install and Set Up Scrapy
Scrapy is an open-source web crawling and scraping framework written in Python. It is used to crawl websites and extract structured data from them.
This guide will teach you how to install and set up a Scrapy project.
Scrapy Installation
To install Scrapy, we can use pip:
pip install scrapy
If you haven't installed pip yet, check out this article: How to Install PIP on Windows/Mac/Linux Easily.
You can also install Scrapy using conda with the following command:
conda install -c conda-forge scrapy
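To verify the installation, ask Scrapy for its version:
scrapy version
If everything went well, this prints the installed version, e.g. Scrapy 2.10.1.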
Scrapy Setup
After installing Scrapy, let's set up a project.
Create a Scrapy Project
To create a new project, run:
scrapy startproject project_pytutorial
Replace project_pytutorial with the name you want to give your Scrapy project.
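Then move into the new directory, since commands such as scrapy crawl must be run from inside a project:
cd project_pytutorial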
The project structure looks like this:
├── project_pytutorial
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── __pycache__
│   ├── settings.py
│   └── spiders
│       └── spider_app.py
└── scrapy.cfg
scrapy.cfg: the configuration file for your Scrapy project.
project_pytutorial/items.py: the file where you define the data structures (items) you will scrape.
project_pytutorial/spiders/: the directory where your spiders live. It starts out empty apart from __init__.py; spider_app.py is the file we'll create next.
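For example, items.py could declare a minimal item for this tutorial. This is just an illustrative sketch; the PageItem class and its title field are not generated by startproject:
import scrapy

class PageItem(scrapy.Item):
    # The page title we'll extract in the spider.
    title = scrapy.Field()
The spider below yields a plain dict instead, which Scrapy also accepts; defining items becomes useful once you start validating data or passing it through pipelines.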
Create a Spider App
Now, we'll write a spider in the project_pytutorial/spiders/spider_app.py file:
import scrapy

class MySpider(scrapy.Spider):
    # Spider name (used by "scrapy crawl").
    name = 'myspider'

    # Starting URL for web scraping.
    start_urls = ['http://pytutorial.com']

    # Parse method for processing web page responses.
    def parse(self, response):
        # Extract the text of the first <h1> element.
        title = response.css('h1::text').get()
        # Yield the extracted title.
        yield {'title': title}
This spider extracts the text of the first <h1> tag from pytutorial.com.
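If you want to experiment with selectors before putting them in a spider, Scrapy's interactive shell fetches a page and exposes it as response:
scrapy shell 'https://pytutorial.com'
>>> response.css('h1::text').get()
'Pytutorial | Python and Django Tutorials Blog'
get() returns the first match (or None if nothing matches), while getall() returns a list of every match.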
To run the spider, execute this command from inside the project directory:
scrapy crawl myspider -O data.json
The command scrapy crawl myspider -O data.json runs the spider named myspider (the name we set in the class) and saves the scraped data to a JSON file named data.json. The uppercase -O overwrites the file if it already exists, while the lowercase -o option appends to it. The run produces output similar to this:
2023-09-10 20:46:58 [scrapy.utils.log] INFO: Scrapy 2.10.1 started (bot: project_pytutorial)
2023-09-10 20:46:58 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.10.6 (main, Nov 2 2022, 18:53:38) [GCC 11.3.0], pyOpenSSL 23.2.0 (OpenSSL 3.1.2 1 Aug 2023), cryptography 41.0.3, Platform Linux-5.15.0-53-generic-x86_64-with-glibc2.35
2023-09-10 20:46:58 [scrapy.addons] INFO: Enabled addons:
[]
2023-09-10 20:46:58 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'project_pytutorial',
'FEED_EXPORT_ENCODING': 'utf-8',
'NEWSPIDER_MODULE': 'project_pytutorial.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['project_pytutorial.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2023-09-10 20:46:58 [asyncio] DEBUG: Using selector: EpollSelector
2023-09-10 20:46:58 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2023-09-10 20:46:58 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2023-09-10 20:46:58 [scrapy.extensions.telnet] INFO: Telnet Password: 3d046408575915e3
2023-09-10 20:46:58 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2023-09-10 20:46:58 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-09-10 20:46:58 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-09-10 20:46:58 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-09-10 20:46:58 [scrapy.core.engine] INFO: Spider opened
2023-09-10 20:46:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-09-10 20:46:58 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-09-10 20:46:58 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://pytutorial.com/robots.txt> from <GET http://pytutorial.com/robots.txt>
2023-09-10 20:46:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pytutorial.com/robots.txt> (referer: None)
2023-09-10 20:46:59 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://pytutorial.com/> from <GET http://pytutorial.com>
2023-09-10 20:47:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pytutorial.com/> (referer: None)
2023-09-10 20:47:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pytutorial.com/>
{'title': 'Pytutorial | Python and Django Tutorials Blog'}
2023-09-10 20:47:00 [scrapy.core.engine] INFO: Closing spider (finished)
2023-09-10 20:47:00 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: data.json
2023-09-10 20:47:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 880,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,
'downloader/response_bytes': 12850,
'downloader/response_count': 4,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/301': 2,
'elapsed_time_seconds': 1.775556,
'feedexport/success_count/FileFeedStorage': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 9, 10, 19, 47, 0, 337537),
'httpcompression/response_bytes': 47819,
'httpcompression/response_count': 1,
'item_scraped_count': 1,
'log_count/DEBUG': 8,
'log_count/INFO': 11,
'memusage/max': 60801024,
'memusage/startup': 60801024,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2023, 9, 10, 19, 46, 58, 561981)}
2023-09-10 20:47:00 [scrapy.core.engine] INFO: Spider closed (finished)
data.json now contains:
[
{"title": "Pytutorial | Python and Django Tutorials Blog"}
]
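As a quick sanity check, you can load the exported file back into Python. A minimal sketch:
import json

# Read the items the spider exported.
with open('data.json') as f:
    items = json.load(f)

print(items[0]['title'])
# Pytutorial | Python and Django Tutorials Blog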
Conclusion
In this guide, we've learned how to install Scrapy and set up a project. We've also built a simple spider app that extracts the text of the first <h1> tag.