Last modified: Jan 12, 2026 by Alexander Williams

Use Regex with BeautifulSoup for Web Scraping

Web scraping often requires finding specific patterns.

BeautifulSoup is great for parsing HTML structure.

But sometimes you need more flexible search power.

That's where regular expressions, or regex, come in.

Combining them unlocks precise data extraction.

Why Combine Regex and BeautifulSoup?

BeautifulSoup's find_all method is powerful.

You can search by tag name, class, or ID.

But text or attribute patterns can be complex.

Regex lets you match dynamic, non-standard content.

Use it for partial matches, patterns, or variations.

This combo is perfect for messy, real-world data.
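
For instance, an exact string search matches only one fixed value, while a regex catches a whole family of variations. Here is a minimal sketch (the HTML is invented for illustration):


from bs4 import BeautifulSoup
import re

# Invented HTML with slightly varying class names
soup = BeautifulSoup('<p class="note-1">A</p><p class="note-2">B</p>', 'html.parser')

# An exact string only matches one class value
print(soup.find_all('p', class_='note-1'))  # only the first paragraph

# A regex matches every numbered variation
print(soup.find_all('p', class_=re.compile(r'note-\d+')))  # both paragraphs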

Importing the Necessary Modules

First, ensure BeautifulSoup is installed (pip install beautifulsoup4).

You need BeautifulSoup and the re module.

The re module is built into Python.

Here is the basic import setup.


from bs4 import BeautifulSoup
import re  # Python's regular expression module

# Your HTML content or fetching logic goes here

Using Regex with find_all for Text

The find_all method accepts a regex object.

Use the text parameter to search string content (newer BeautifulSoup versions prefer the name string, covered below).

Pass a compiled regex pattern from re.compile.

This finds tags whose text matches the pattern.

Let's look at a practical example.


html_doc = """
<div>
    <p>Price: $19.99</p>
    <p>Cost: €15.50</p>
    <p>Sale: $25.00</p>
    <p>Normal text here.</p>
</div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Regex to find text containing a dollar amount
price_pattern = re.compile(r'\$\d+\.\d{2}')

# Find all <p> tags with text matching the dollar pattern
price_tags = soup.find_all('p', text=price_pattern)

for tag in price_tags:
    print(tag.text)

Price: $19.99
Sale: $25.00

Only paragraphs with dollar prices were selected.

The euro price and normal text were ignored.

This is very useful for extracting specific data formats.
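
You can also reuse the compiled pattern to pull out just the matched amount from each tag's text. Continuing from the example above:


# Extract only the dollar amount from each matching paragraph
for tag in price_tags:
    match = price_pattern.search(tag.text)
    if match:
        print(match.group())

$19.99
$25.00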

Using Regex with find_all for Attributes

You can also search tag attributes with regex.

Common uses include class names, IDs, or href links.

Use a dictionary for the attrs parameter.

The value can be a regex pattern object.

This helps find elements with dynamic attributes.


html_doc = """
<div>
    <a class="btn-primary" href="/page1">Link 1</a>
    <a class="btn-secondary" href="/archive/2023/post">Link 2</a>
    <a class="btn" href="/static/image.jpg">Link 3</a>
</div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Regex to find href attributes containing 'archive'
# (BeautifulSoup matches with re.search, so no .* wildcards are needed)
href_pattern = re.compile(r'archive')

# Find all <a> tags where href matches the pattern
archive_links = soup.find_all('a', attrs={'href': href_pattern})

for link in archive_links:
    print(link.get('href'), "-", link.text)

/archive/2023/post - Link 2

This technique is key for scraping multiple pages.

You can find all links that follow a certain URL pattern.
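
As a sketch, you can then turn the matched relative links into absolute URLs before fetching them (the base URL below is a placeholder):


from urllib.parse import urljoin

base_url = 'https://example.com'  # placeholder for the real site

for link in archive_links:
    print(urljoin(base_url, link.get('href')))

https://example.com/archive/2023/post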

Using Regex with the string Parameter

BeautifulSoup has a string parameter too.

It is the modern name for the text parameter, which remains as a deprecated alias in recent versions.

Used without a tag name, it returns the matching NavigableString objects directly instead of their parent tags.

The usage is otherwise identical to the text example.


# Re-parse the first example's HTML, which contains a 'Price:' string
soup = BeautifulSoup("<div><p>Price: $19.99</p><p>Normal text here.</p></div>", 'html.parser')

# Using the 'string' parameter returns the matching strings themselves
result = soup.find_all(string=re.compile(r'Price:'))
print(result)

['Price: $19.99']

Common Regex Patterns for Web Scraping

Here are useful patterns to keep in your toolkit.

Match emails: r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

Match phone numbers: r'\+?\d[\d\s-]{7,}\d'

Match URLs: r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+'

Match specific word patterns in classes: r'product-\d+'

These help extract structured data from chaos.
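
As a quick sketch of one pattern in action, you can run it over a page's full text with findall (the HTML below is invented for illustration):


html_doc = """
<div>
    <p>Contact: support@example.com</p>
    <p>Sales: sales@example.org</p>
</div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

email_pattern = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

# get_text() flattens the page to plain text for findall
print(email_pattern.findall(soup.get_text()))

['support@example.com', 'sales@example.org']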

Practical Example: Extracting Product IDs

Let's build a complete, realistic example.

We will scrape a product list.

We need to find items with specific ID patterns.

We'll use regex on the class attribute.


html_doc = """
<ul class="products">
    <li class="item prod-12345">Product A</li>
    <li class="item prod-abcde">Product B</li>
    <li class="item prod-67890">Product C</li>
    <li class="item">Product D</li>
</ul>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Pattern to match class containing 'prod-' followed by digits
id_pattern = re.compile(r'prod-\d+')

# Find li tags where any class matches the pattern
product_items = soup.find_all('li', attrs={'class': id_pattern})

for item in product_items:
    print(f"Found: {item.text} with classes: {item.get('class')}")

Found: Product A with classes: ['item', 'prod-12345']
Found: Product C with classes: ['item', 'prod-67890']

Product B was skipped as its ID had letters.

Product D had no prod- class at all.

This shows precise, pattern-based filtering.
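
To pull out just the numeric part of each ID, you could add a capture group to the same pattern:


# A capture group around the digits isolates the ID number
id_capture = re.compile(r'prod-(\d+)')

for item in product_items:
    for cls in item.get('class', []):
        match = id_capture.search(cls)
        if match:
            print(match.group(1))

12345
67890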

Tips and Best Practices

Always test your regex patterns separately first.

Use tools like regex101.com to validate them.

Pre-compile patterns with re.compile for reuse.

This improves performance in loops.

Combine regex with other BeautifulSoup filters.

You can still use class_ or id with regex.
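
For example, class_ accepts a regex object just like attrs values do (the HTML below is invented for illustration):


soup = BeautifulSoup('<a class="btn-primary">Go</a><a class="nav-link">Home</a>', 'html.parser')

# Matches any class starting with 'btn'
buttons = soup.find_all('a', class_=re.compile(r'^btn'))
print(buttons)

[<a class="btn-primary">Go</a>]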

For large-scale projects, consider rotating proxies and setting a realistic User-Agent header.

This reduces the chance of being blocked while scraping.
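
Here is a minimal sketch using the requests library (the URL and proxy address are placeholders; adapt them to your setup):


import requests

headers = {'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0)'}
proxies = {'http': 'http://proxy.example.com:8080'}  # placeholder proxy

response = requests.get('https://example.com', headers=headers, proxies=proxies)
soup = BeautifulSoup(response.text, 'html.parser')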

Also, be prepared to handle broken HTML gracefully.

Real websites are rarely perfectly formatted.
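
If html.parser struggles with badly broken markup, a more lenient parser such as html5lib (pip install html5lib) can repair it the way a browser would:


# html5lib closes unclosed tags instead of choking on them
soup = BeautifulSoup('<p>Unclosed <b>tags', 'html5lib')
print(soup.find('b').text)

tags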

Conclusion

Regular expressions make BeautifulSoup incredibly powerful.

They allow you to find elements based on complex patterns.

You can search within text content or tag attributes.

This is essential for professional web scraping tasks.

Start with simple patterns and gradually increase complexity.

Mastering this combination will solve most data extraction challenges.

Remember to scrape responsibly and respect website terms.

Now you have the tools to tackle dynamic web data.