Last modified: Jan 19, 2026 by Alexander Williams
Advanced BeautifulSoup Pagination & Infinite Scroll
Web scraping often requires data from many pages, not just one. This tutorial covers advanced methods for handling pagination and infinite scroll, two patterns that are common on modern websites.
We assume you already know basic BeautifulSoup. Let's begin.
Understanding Pagination Patterns
Pagination splits content across multiple pages. Links like "Next" or page numbers are used.
Your scraper must find and follow these links. It must collect data from each page it visits.
First, identify the pagination structure on your target site. Look for a common URL pattern.
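As a quick check, you can fetch the first page and print whatever pagination links it exposes. The sketch below uses a placeholder URL and assumes the links live inside an element with a `pagination` class; both are assumptions you should adjust for your target site.

import requests
from bs4 import BeautifulSoup

# Hypothetical starting page; replace with your target site
url = "https://example.com/items"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Assumes the pagination links sit inside an element with class "pagination"
pagination = soup.find(class_='pagination')
if pagination:
    for link in pagination.find_all('a'):
        print(link.get_text(strip=True), link.get('href'))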
Scraping Static Pagination
Static pagination uses direct links to numbered pages. The URL often changes predictably.
For example, a site might use `?page=2` in its URL. You can loop through these page numbers.
Here is a Python script to scrape such a site. We use requests and BeautifulSoup.
import requests
from bs4 import BeautifulSoup

base_url = "https://example.com/items?page="
all_items = []

for page_num in range(1, 6):  # Scrape pages 1 to 5
    url = base_url + str(page_num)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all item elements (adjust selector as needed)
    items = soup.find_all('div', class_='item')
    for item in items:
        title = item.find('h2').text.strip()
        all_items.append(title)

    print(f"Scraped page {page_num}")

print(f"Total items collected: {len(all_items)}")
Output:

Scraped page 1
Scraped page 2
Scraped page 3
Scraped page 4
Scraped page 5
Total items collected: 50
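In practice you often do not know the page count in advance. One option, sketched below with the same placeholder URL pattern and selector as above, is to keep requesting pages until one comes back empty:

import requests
from bs4 import BeautifulSoup

base_url = "https://example.com/items?page="  # same placeholder pattern as above
all_items = []
page_num = 1

while True:
    response = requests.get(base_url + str(page_num))
    soup = BeautifulSoup(response.content, 'html.parser')
    items = soup.find_all('div', class_='item')
    if not items:
        break  # an empty page usually means we are past the last one

    all_items.extend(item.find('h2').text.strip() for item in items)
    page_num += 1

print(f"Scraped {page_num - 1} pages, {len(all_items)} items in total.")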
Following "Next" Button Links
Some sites use a "Next" link instead of predictable page numbers, so the URL of the following page cannot be built in advance.
Your script must find the link to the next page on each page it scrapes. It stops when no "Next" link exists.
This approach is more robust, because it adapts to the site's specific HTML structure.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://example.com/items"
all_data = []

while url:
    print(f"Fetching: {url}")
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Scrape data from current page
    items = soup.find_all('article')
    for item in items:
        all_data.append(item.text.strip())

    # Find the link to the next page
    next_link = soup.find('a', string='Next')
    if next_link and next_link.get('href'):
        # Resolve relative URLs against the current page
        url = urljoin(url, next_link['href'])
    else:
        url = None  # Exit loop

print(f"Scraping complete. Collected {len(all_data)} items.")
Handling Infinite Scroll with BeautifulSoup
Infinite scroll loads content dynamically as you scroll. It uses JavaScript and AJAX.
BeautifulSoup alone cannot handle this. It only parses static HTML.
You need to find the data source. Often it's a JSON API endpoint.
Inspect the network traffic in your browser's developer tools. Look for XHR or Fetch requests that fire as you scroll.
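Before reaching for a heavier tool, it is worth confirming that the content really is loaded by JavaScript. A minimal check, using a placeholder URL and item selector, is to compare what a plain request returns with what you see in the browser:

import requests
from bs4 import BeautifulSoup

# Hypothetical infinite-scroll page; replace with your target URL
url = "https://example.com/scroll-page"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# If this count is much lower than what the browser shows after scrolling,
# the remaining items are loaded dynamically via JavaScript/AJAX
items = soup.find_all('div', class_='scroll-item')
print(f"Items present in the raw HTML: {len(items)}")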
Finding and Parsing the Data API
Many sites load data via a hidden JSON API. The URL might contain parameters like `offset`.
You can simulate these requests with requests. Then parse the JSON response directly.
This method is very efficient. It avoids downloading unnecessary HTML.
import requests

api_url = "https://example.com/api/items"
params = {'offset': 0, 'limit': 20}
all_items = []

while True:
    response = requests.get(api_url, params=params)
    data = response.json()
    items = data.get('results', [])
    if not items:
        break  # No more data

    for item in items:
        all_items.append(item['title'])

    print(f"Fetched batch with offset {params['offset']}")
    params['offset'] += params['limit']  # Prepare for next batch

print(f"Total items from API: {len(all_items)}")
For more on AJAX, see our AJAX scraping guide.
Combining BeautifulSoup with Selenium
Sometimes the API is hard to find. You can use Selenium to control a real browser.
Selenium scrolls the page and loads all content. Then you pass the HTML to BeautifulSoup.
This method is slower but powerful. It works on almost any site.
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
driver.get("https://example.com/scroll-page")

# Scroll to bottom multiple times to load content
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # Wait for new content to load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# Now parse the fully loaded page
soup = BeautifulSoup(driver.page_source, 'html.parser')
items = soup.find_all('div', class_='scroll-item')
print(f"Found {len(items)} items after scroll.")

driver.quit()
Best Practices and Error Handling
Always respect the website's robots.txt file. Add delays between requests.
Use try-except blocks to handle network errors. Log your scraping progress.
Set a User-Agent header to mimic a real browser. This helps avoid blocks.
For large projects, follow our large-scale best practices.
import requests
import time
from random import uniform

headers = {'User-Agent': 'Mozilla/5.0'}
base_url = "https://example.com/page?num="

for page in range(1, 10):
    url = base_url + str(page)
    try:
        resp = requests.get(url, headers=headers, timeout=10)
        resp.raise_for_status()  # Check for HTTP errors
    except requests.exceptions.RequestException as e:
        print(f"Error on page {page}: {e}")
        break

    # Process page with BeautifulSoup here...
    time.sleep(uniform(1, 3))  # Random delay between 1 and 3 seconds
Conclusion
You now know advanced BeautifulSoup techniques. You can scrape paginated and infinite scroll sites.
For static pagination, loop through URLs or find "Next" links. For infinite scroll, find the JSON API or use Selenium.
Always scrape ethically and legally. Check a site's terms of service before scraping.
Start with a simple beginner's guide if needed. Then tackle these advanced patterns.
Happy scraping!