Last modified: Sep 10, 2023, by Alexander Williams

How to Install and Set Up Scrapy

Scrapy is an open-source web crawling and scraping framework written in Python. It lets you crawl websites and extract structured data from their pages.

This guide will teach you how to install and set up a Scrapy project.

Scrapy Installation

To install Scrapy, we can use the pip command:

pip install scrapy

If you haven't installed pip yet, check out this article: How to Install PIP on Windows/Mac/Linux Easily.

You can also install Scrapy using conda with the following command:

conda install -c conda-forge scrapy
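To verify that the installation succeeded, check the installed version from the command line:

scrapy version

This prints the installed Scrapy version.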

Scrapy Setup

After installing Scrapy, let's set up a project.

Create a Scrapy Project

To create a new project, run:

scrapy startproject project_pytutorial

Replace project_pytutorial with the name you want to give your Scrapy project.

The project structure looks like this:

├── project_pytutorial
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg
  • scrapy.cfg: This is the configuration file for your Scrapy project.
  • project_pytutorial/items.py: This is the file where you define the data structures that you will be scraping (a minimal example follows this list).
  • project_pytutorial/spiders/: This is the directory where you will create your spiders.
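For instance, items.py could hold an item class for the page title we scrape below. This ArticleItem class is just an illustrative sketch, not something startproject generates:

import scrapy

class ArticleItem(scrapy.Item):
    # Field that will hold the extracted page title.
    title = scrapy.Field()

The spider below yields plain dicts for simplicity, but it could yield ArticleItem instances instead.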

Create a Spider App

Now, create a file named spider_app.py inside the spiders directory and write the spider:

import scrapy

class MySpider(scrapy.Spider):
    # Spider name
    name = 'myspider'
    
    # Starting URL for web scraping.
    start_urls = ['http://pytutorial.com']

    # Parse method for processing web page responses.
    def parse(self, response):
        # Extract the text of the first <h1> element.
        title = response.css('h1::text').get()
        
        # Yield the extracted title.
        yield {'title': title}

This spider gets the text of the first <h1> tag from pytutorial.com.
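If you wanted every <h1> on the page rather than only the first, .getall() returns all matches as a list. A minimal variation of the parse method:

def parse(self, response):
    # .getall() returns every matching text node, not only the first.
    for heading in response.css('h1::text').getall():
        yield {'title': heading}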

To run the spider, execute this command:

scrapy crawl myspider -O data.json

This command runs the spider named myspider and saves the scraped data to a JSON file named data.json.
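Note that the uppercase -O flag overwrites data.json on every run; since Scrapy 2.4, the lowercase -o flag appends to an existing file instead:

scrapy crawl myspider -o data.json

Running the crawl produces a log similar to this: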

2023-09-10 20:46:58 [scrapy.utils.log] INFO: Scrapy 2.10.1 started (bot: project_pytutorial)
2023-09-10 20:46:58 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.10.6 (main, Nov  2 2022, 18:53:38) [GCC 11.3.0], pyOpenSSL 23.2.0 (OpenSSL 3.1.2 1 Aug 2023), cryptography 41.0.3, Platform Linux-5.15.0-53-generic-x86_64-with-glibc2.35
2023-09-10 20:46:58 [scrapy.addons] INFO: Enabled addons:
[]
2023-09-10 20:46:58 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'project_pytutorial',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'NEWSPIDER_MODULE': 'project_pytutorial.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['project_pytutorial.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2023-09-10 20:46:58 [asyncio] DEBUG: Using selector: EpollSelector
2023-09-10 20:46:58 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2023-09-10 20:46:58 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2023-09-10 20:46:58 [scrapy.extensions.telnet] INFO: Telnet Password: 3d046408575915e3
2023-09-10 20:46:58 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2023-09-10 20:46:58 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-09-10 20:46:58 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-09-10 20:46:58 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2023-09-10 20:46:58 [scrapy.core.engine] INFO: Spider opened
2023-09-10 20:46:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-09-10 20:46:58 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-09-10 20:46:58 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://pytutorial.com/robots.txt> from <GET http://pytutorial.com/robots.txt>
2023-09-10 20:46:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pytutorial.com/robots.txt> (referer: None)
2023-09-10 20:46:59 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://pytutorial.com/> from <GET http://pytutorial.com>
2023-09-10 20:47:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pytutorial.com/> (referer: None)
2023-09-10 20:47:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pytutorial.com/>
{'title': 'Pytutorial | Python and Django Tutorials Blog'}
2023-09-10 20:47:00 [scrapy.core.engine] INFO: Closing spider (finished)
2023-09-10 20:47:00 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: data.json
2023-09-10 20:47:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 880,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 4,
 'downloader/response_bytes': 12850,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/301': 2,
 'elapsed_time_seconds': 1.775556,
 'feedexport/success_count/FileFeedStorage': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 9, 10, 19, 47, 0, 337537),
 'httpcompression/response_bytes': 47819,
 'httpcompression/response_count': 1,
 'item_scraped_count': 1,
 'log_count/DEBUG': 8,
 'log_count/INFO': 11,
 'memusage/max': 60801024,
 'memusage/startup': 60801024,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2023, 9, 10, 19, 46, 58, 561981)}
2023-09-10 20:47:00 [scrapy.core.engine] INFO: Spider closed (finished)

data.json:

[
  {"title": "Pytutorial | Python and Django Tutorials Blog"}
]
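To double-check the output, you can load the file back with Python's built-in json module; a quick sketch:

import json

# Read the scraped items from the feed file.
with open('data.json', encoding='utf-8') as f:
    items = json.load(f)

print(items[0]['title'])
# Pytutorial | Python and Django Tutorials Blog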

Conclusion

In this guide, we've learned how to install and set up the Scrapy framework. We've also built a simple spider that extracts the text of the first <h1> tag from a page.