Last modified: Jan 19, 2026 by Alexander Williams
Web Scraping Guide with BeautifulSoup for Beginners
Web scraping is a powerful skill. It lets you collect data from websites.
Python and BeautifulSoup make this easy. This guide will teach you the basics.
You will learn to extract information from HTML pages step by step.
What is Web Scraping?
Web scraping is automated data collection from the web. It's like copying and pasting but faster.
It is used for price comparison, research, and data analysis. Always check a website's robots.txt file and terms of service.
Scraping responsibly is crucial. Do not overload servers.
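Python's standard library can check robots.txt rules for you. Here is a minimal sketch using urllib.robotparser; the sample rules below are made up for illustration (a real script would load the file with set_url() and read()):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# In a real script: rp.set_url('https://example.com/robots.txt'); rp.read()
# Here we parse sample rules inline for illustration.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch('*', 'https://example.com/page'))          # True: allowed
print(rp.can_fetch('*', 'https://example.com/private/data'))  # False: disallowed
```

If can_fetch() returns False, your scraper should skip that URL.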
Prerequisites
You need Python installed on your computer. Basic Python knowledge helps.
We will use two main libraries: requests and beautifulsoup4.
Open your terminal or command prompt to get started.
Step 1: Install Required Libraries
First, install the necessary packages. Use the pip package manager.
pip install requests beautifulsoup4
This command downloads and installs both libraries. You only need to do this once.
Step 2: Import Libraries
Create a new Python file. Start by importing the modules.
import requests
from bs4 import BeautifulSoup
The requests library fetches web pages. BeautifulSoup parses the HTML content.
Step 3: Fetch a Web Page
Use requests.get() to download a page. We will use a simple example page.
URL = 'http://example.com'
response = requests.get(URL)
# Check if the request was successful
if response.status_code == 200:
    print('Page fetched successfully!')
else:
    print('Failed to retrieve page')
The status_code 200 means success. Always handle possible errors.
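Requests can also fail before you ever get a status code (DNS errors, timeouts). One common pattern, sketched here as a small helper (the function name is our own), wraps the call in try-except and checks the status in one place:

```python
import requests

def fetch_page(url, timeout=10):
    """Return the response for url, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
        return response
    except requests.exceptions.RequestException as err:
        print(f'Failed to retrieve {url}: {err}')
        return None
```

Calling fetch_page('http://example.com') then returns a response you can pass to BeautifulSoup, or None on failure.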
Step 4: Parse HTML with BeautifulSoup
Create a BeautifulSoup object. This object lets you navigate the HTML structure.
soup = BeautifulSoup(response.content, 'html.parser')
We pass the page content and the parser type. 'html.parser' is built into Python.
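BeautifulSoup accepts any HTML string, not just downloaded pages, which makes it easy to experiment without hitting a server. A quick sketch with an inline snippet:

```python
from bs4 import BeautifulSoup

# Any HTML string works, not just response.content
html = '<html><body><h1>Hello</h1><p>World</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.h1.text)  # Hello
print(soup.p.text)   # World
```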
Step 5: Explore the Page Structure
Use your browser's Developer Tools. Right-click on a webpage and select "Inspect".
This shows the HTML code. Identify the tags containing your target data.
Look for unique class names or IDs. This makes extraction precise.
Step 6: Extract Data by Tag Name
Find elements using their tag name. Use the find() or find_all() methods.
# Find the first h1 tag
title_tag = soup.find('h1')
print(title_tag.text)
# Find all paragraph tags
all_paragraphs = soup.find_all('p')
for p in all_paragraphs:
    print(p.text)
Output:
Example Domain
This domain is for use in illustrative examples...
find() returns the first match. find_all() returns a list of all matches.
Step 7: Extract Data by Class or ID
Tags often have class or ID attributes. These are more specific selectors.
# Find element with a specific class
div_with_class = soup.find('div', class_='example-class')
# Find element with a specific ID
main_content = soup.find(id='main')
Note: class_ has an underscore because 'class' is a Python keyword.
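As an alternative to find() with class_ or id, BeautifulSoup also supports CSS selectors through select() and select_one(). A small sketch (the HTML below is a made-up example):

```python
from bs4 import BeautifulSoup

html = '''
<div id="main">
  <div class="example-class">First</div>
  <div class="example-class">Second</div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# '.name' selects by class, '#name' selects by ID
divs = soup.select('.example-class')
print([d.text for d in divs])          # ['First', 'Second']
print(soup.select_one('#main')['id'])  # main
```

select() always returns a list; select_one() returns the first match or None.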
Step 8: Extract Attributes and Links
You can get attributes like 'href' from links. Treat the tag like a dictionary.
# Find the first link tag
link = soup.find('a')
print('Link Text:', link.text)
print('Link URL:', link['href'])
Output:
Link Text: More information...
Link URL: https://www.iana.org/domains/example
This is useful for collecting all links on a page.
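For example, to collect every link you can loop over find_all('a') and use .get('href'), which returns None instead of raising an error when a tag has no href attribute (the HTML below is an inline example):

```python
from bs4 import BeautifulSoup

html = '''
<a href="https://example.com/one">One</a>
<a>No href here</a>
<a href="https://example.com/two">Two</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# .get('href') returns None for the middle tag, so it is filtered out
links = [a.get('href') for a in soup.find_all('a') if a.get('href')]
print(links)  # ['https://example.com/one', 'https://example.com/two']
```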
Step 9: Putting It All Together: A Complete Script
Let's build a script that scrapes book titles from a mock page.
import requests
from bs4 import BeautifulSoup
# Target URL (a mock book listing site)
url = 'https://books.toscrape.com/catalogue/page-1.html'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
# Find all article tags with class 'product_pod'
books = soup.find_all('article', class_='product_pod')
for book in books:
    # Find the h3 tag inside the article, then the 'a' tag inside it
    title_tag = book.h3.a
    title = title_tag['title']  # The title is stored in the 'title' attribute
    print(title)
This script finds all book articles. It then extracts the title from each one.
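Once titles are extracted, you will usually want to save them. A minimal sketch that writes a list of titles to a CSV file using Python's built-in csv module (the filename and sample titles here are our own, standing in for the list the scraper collects):

```python
import csv

# Stand-in for the titles collected by the scraper above
titles = ['A Light in the Attic', 'Tipping the Velvet']

with open('books.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title'])  # header row
    for title in titles:
        writer.writerow([title])
```

Inside the scraping loop, you could append each title to a list first, then write the file once at the end.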
Step 10: Handle Common Issues
Websites change. Your script might break if the HTML structure updates.
Use try-except blocks to handle missing elements gracefully.
try:
    price = soup.find('p', class_='price_color').text
except AttributeError:
    price = 'Price not found'
This prevents your program from crashing. For more complex debugging, see our guide on Debug and Test BeautifulSoup Scripts Efficiently.
Next Steps and Best Practices
You now know the basics. Real-world projects need more techniques.
To scrape data across many pages, learn to Scrape Multiple Pages with BeautifulSoup.
Modern sites load data dynamically. For this, check our guide on Scrape AJAX Content with BeautifulSoup.
Always scrape ethically. Add delays between requests. Respect robots.txt rules.
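The simplest way to add delays is time.sleep() between requests. The helper below is our own sketch, not a standard API; it pauses before every request after the first:

```python
import time

def fetch_all(urls, fetch, delay=1.0):
    """Fetch each URL with `fetch`, pausing `delay` seconds between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # be polite: don't hammer the server
        results.append(fetch(url))
    return results
```

In a real script you would call it as fetch_all(urls, requests.get, delay=2.0).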
Conclusion
BeautifulSoup is a fantastic tool for beginners. It turns messy HTML into structured data.
You learned to install, fetch, parse, and extract data. Start with simple projects.
Practice on sites that allow scraping. Always be respectful of server resources.
Happy scraping!