Last modified: Jan 12, 2026, by Alexander Williams
Avoid Getting Blocked While Scraping with BeautifulSoup
Web scraping is a powerful tool, but websites often block scrapers, and getting blocked stops your data collection. This guide shows you how to scrape responsibly and avoid bans.
BeautifulSoup is a parsing library; it relies on a library like requests to fetch web pages, and that fetching stage is where blocking happens. Servers detect non-human traffic patterns and block the offending IP address.
To avoid this, you must mimic a real browser and respect the website's rules. Let's explore the key techniques.
1. Use Realistic Headers
Every HTTP request carries headers that tell the server about your client. The User-Agent header is critical: the default one sent by Python's requests library is an instant giveaway.
Always set a custom User-Agent and rotate between strings from different common browsers. This makes your requests appear more human.
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

url = 'https://example.com'
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
You can store a list of user agents and pick one at random for each request, as in the sketch below. This simple step reduces your block risk.
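A minimal sketch of that rotation, assuming you maintain your own list (the User-Agent strings below are only illustrative examples):

import random
import requests

# Illustrative User-Agent strings; replace them with current ones from real browsers
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
]

def get_with_random_agent(url):
    # Pick a different User-Agent for every request
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers)

response = get_with_random_agent('https://example.com')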
2. Implement Polite Delays
Rapid, consecutive requests are a red flag: they strain servers and look like a denial-of-service attack pattern. You must slow down.
Use time.sleep() between requests and add random intervals to mimic human reading speed. This is crucial when scraping multiple pages.
import time
import random

for page in range(1, 6):
    url = f'https://example.com/page/{page}'
    response = requests.get(url, headers=headers)
    # Process page with BeautifulSoup here...

    # Wait 2 to 5 seconds before the next request
    delay = random.uniform(2, 5)
    print(f"Scraped page {page}, waiting {delay:.2f} seconds.")
    time.sleep(delay)
Scraped page 1, waiting 3.41 seconds.
Scraped page 2, waiting 4.87 seconds.
Scraped page 3, waiting 2.15 seconds.
This is the simplest form of rate limiting. Always use it.
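If you prefer something reusable, a small helper class can enforce a randomized minimum gap between calls. This is one possible sketch, not a standard library feature:

import time
import random

class RateLimiter:
    """Enforce a randomized minimum delay between successive requests."""
    def __init__(self, min_delay=2, max_delay=5):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.last_request = 0.0

    def wait(self):
        # Sleep only for whatever part of the random delay has not already passed
        elapsed = time.time() - self.last_request
        delay = random.uniform(self.min_delay, self.max_delay)
        if elapsed < delay:
            time.sleep(delay - elapsed)
        self.last_request = time.time()

limiter = RateLimiter()
# Call limiter.wait() right before each requests.get() in your loop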
3. Respect robots.txt
The robots.txt file is a website's rulebook: it tells bots which pages are off-limits. Ignoring it is unethical, and it also increases your chance of being blocked.
Check the file before you scrape. Python's built-in urllib.robotparser module can help you avoid disallowed paths.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

user_agent = 'MyScraperBot'
url_to_check = 'https://example.com/admin'

if rp.can_fetch(user_agent, url_to_check):
    print("Scraping is allowed.")
else:
    print("Scraping is disallowed by robots.txt.")
Respecting this file builds goodwill. It is a fundamental scraping practice.
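Some sites also declare a Crawl-delay in robots.txt. Continuing from the snippet above, robotparser can read it so you can honor it when it is present; a small sketch:

import time

# crawl_delay() returns the Crawl-delay value for this user agent, or None if unset
crawl_delay = rp.crawl_delay(user_agent)
if crawl_delay:
    time.sleep(crawl_delay)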
4. Handle Sessions and Cookies
Some sites use sessions and track user state with cookies. A requests Session object manages cookies automatically and makes your scraper look like a persistent user.
session = requests.Session()
# First request gets and stores cookies
initial_page = session.get('https://example.com/login', headers=headers)
# Subsequent requests use the same session and cookies
data_page = session.get('https://example.com/dashboard', headers=headers)
soup = BeautifulSoup(data_page.content, 'html.parser')
This is more efficient. It also appears more natural to the server.
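If the site expects a form login, the same session can submit it first. The sketch below continues the session above; the /login URL and the username and password field names are assumptions, so check the site's actual form:

# Hypothetical form field names; inspect the real login form before using this
login_data = {'username': 'your_user', 'password': 'your_password'}
session.post('https://example.com/login', data=login_data, headers=headers)

# Cookies set during login are reused automatically afterwards
dashboard = session.get('https://example.com/dashboard', headers=headers)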
5. Use Proxies for Large-Scale Scraping
For heavy scraping, one IP address is not enough. Proxies distribute your requests across many IP addresses, which prevents any single one from being flagged. This is essential for large-scale scraping.
You can use free or paid proxy services, but always test your proxies for reliability and rotate them with each request.
import random
import requests

proxies_list = [
    {'http': 'http://proxy1:port', 'https': 'https://proxy1:port'},
    {'http': 'http://proxy2:port', 'https': 'https://proxy2:port'},
]

proxy = random.choice(proxies_list)
try:
    response = requests.get(url, headers=headers, proxies=proxy, timeout=5)
    print("Request successful with proxy.")
except requests.exceptions.RequestException:
    print("Proxy failed, try another.")
For a robust setup, consider dedicated guides on proxies and user agents.
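For illustration, a simple rotation helper can cycle through random proxies and skip any that fail. This is only a sketch, and the proxy addresses above are placeholders:

import random
import requests

def fetch_with_proxy(url, headers, proxies_list, max_attempts=3):
    # Try up to max_attempts different proxies before giving up
    for _ in range(max_attempts):
        proxy = random.choice(proxies_list)
        try:
            return requests.get(url, headers=headers, proxies=proxy, timeout=5)
        except requests.exceptions.RequestException:
            continue  # This proxy failed; pick another one
    return None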
6. Handle Errors and Retries Gracefully
Blocks often manifest as HTTP errors. Common codes are 403 (Forbidden), 429 (Too Many Requests), and 503 (Service Unavailable). Your code must handle these.
Implement a retry mechanism with backoff. The `tenacity` or `retrying` libraries are great for this.
from tenacity import retry, stop_after_attempt, wait_exponential

# reraise=True re-raises the original HTTPError instead of tenacity's RetryError
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10), reraise=True)
def make_request(url):
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # Raises an HTTPError for bad status codes
    return response

try:
    r = make_request(url)
except requests.exceptions.HTTPError as e:
    print(f"Request failed after retries: {e}")
This makes your scraper resilient. It pauses and retries instead of hammering a blocked endpoint.
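A 429 response sometimes includes a Retry-After header that tells you how long to wait. As a minimal sketch, assuming the header carries a number of seconds (it can also be an HTTP date), you can honor it before retrying:

import time
import requests

response = requests.get(url, headers=headers)
if response.status_code == 429:
    # Fall back to 60 seconds if the server does not say how long to wait
    wait_seconds = int(response.headers.get('Retry-After', 60))
    time.sleep(wait_seconds)
    response = requests.get(url, headers=headers)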
7. Avoid Scraping During Peak Hours
Websites have traffic patterns. Server load is high during business hours. Scraping then adds unnecessary stress.
Schedule your scrapers for off-peak times. Nighttime or early morning is better. This is considerate. It also reduces the chance of being noticed.
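One simple approach is to have the script check the clock and only run during a quiet window. The hours below and the run_scraper() entry point are assumptions for illustration; adjust them to the target site's time zone and traffic:

from datetime import datetime

def is_off_peak(start_hour=1, end_hour=6):
    # Treat 1 AM to 6 AM local time as off-peak (an assumption; tune per site)
    hour = datetime.now().hour
    return start_hour <= hour < end_hour

if is_off_peak():
    run_scraper()  # Hypothetical entry point for your scraping job
else:
    print("Waiting for off-peak hours.")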
8. Use Headless Browsers Sparingly
Sometimes you need JavaScript rendering, which BeautifulSoup cannot do. Tools like Selenium fill that gap, but they are slow and more easily detected.
Use them only when absolutely necessary; for static HTML, requests with BeautifulSoup is the better choice. If you must render pages, learn to combine BeautifulSoup and Selenium efficiently, as in the sketch below.
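When a page truly requires JavaScript, a common pattern is to let Selenium render it and hand the resulting HTML to BeautifulSoup. A rough sketch, assuming Selenium 4 with Chrome available locally:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument('--headless=new')  # Run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get('https://example.com')
    # Hand the fully rendered HTML to BeautifulSoup for parsing
    soup = BeautifulSoup(driver.page_source, 'html.parser')
finally:
    driver.quit()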
Conclusion
Avoiding blocks is about being polite and smart. You are a guest on the website's server. Act like one.
Always use headers and delays. Respect robots.txt. Use proxies for large jobs. Handle errors gracefully.
These practices keep your scrapers running. They also maintain a respectful data ecosystem. Happy and responsible scraping!