Last modified: Jan 12, 2026 by Alexander Williams

Extract HTML Tables to Pandas with BeautifulSoup

Web scraping is a key data skill. HTML tables hold valuable structured data. This guide shows you how to extract it.

We will use Python's BeautifulSoup and Pandas. You will learn to find, parse, and convert tables into a DataFrame.

Why Scrape Tables into Pandas?

Tables on websites contain organized data. Think financial reports, sports stats, or product listings.

Manually copying this data is slow and error-prone. Automated scraping is fast and reliable.

Pandas DataFrames are perfect for this data. They allow for easy cleaning, analysis, and exporting.

Prerequisites and Setup

First, ensure you have Python installed. Then, install the necessary libraries using pip.


pip install beautifulsoup4 pandas requests lxml

The requests library fetches web pages. beautifulsoup4 parses HTML. pandas structures data. lxml is a fast parser.

Now, import these libraries in your Python script.


import requests
from bs4 import BeautifulSoup
import pandas as pd

Step 1: Fetch and Parse the HTML

Use requests.get() to download the webpage. Pass the URL as an argument.

Then, create a BeautifulSoup object. It parses the HTML and gives you a searchable document tree.


# URL of the page containing the table
url = 'https://example.com/data-table'

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content using the 'lxml' parser
    soup = BeautifulSoup(response.content, 'lxml')
else:
    print(f"Failed to retrieve page. Status code: {response.status_code}")

Always check the status code. A status code of 200 means success. For pages that render content with JavaScript, you might need to combine BeautifulSoup and Selenium.
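
Some sites also reject requests that lack browser-like headers. Here is a minimal sketch of a more defensive fetch; the User-Agent string is just a placeholder, while the timeout parameter and raise_for_status() are standard requests features.


import requests

url = 'https://example.com/data-table'

# A browser-like User-Agent (placeholder value); some sites block the default
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'}

# timeout stops the request from hanging forever
response = requests.get(url, headers=headers, timeout=10)

# Raises requests.HTTPError for any 4xx or 5xx response
response.raise_for_status()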

Step 2: Locate the Target Table

HTML tables are defined with the <table> tag. Use BeautifulSoup's find() or find_all() methods.

Inspect the webpage to find your table's unique identifier. This could be an id or class.


# Find the first table on the page
table = soup.find('table')

# Or find a table with a specific class
table = soup.find('table', class_='data-table')

# Or find a table with a specific id
table = soup.find('table', id='main-data')

# To find all tables and select one
all_tables = soup.find_all('table')
target_table = all_tables[1]  # Selects the second table

If the HTML is messy, learn to handle broken HTML with BeautifulSoup. This ensures robust scraping.
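
Also note that find() returns None when nothing matches, so guard against that before calling methods on the result. A minimal sketch:


table = soup.find('table', id='main-data')

# find() returns None when no element matches; fail early with a clear error
if table is None:
    raise ValueError('Could not find the target table on the page')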

Step 3: Extract Table Headers

Table headers are in <th> tags, usually inside a <thead> or the table's first <tr>. We extract their text.


# Initialize an empty list for headers
headers = []

# Find all header cells. Often in the first row or thead.
header_row = table.find('tr')  # Gets the first row
for th in header_row.find_all('th'):
    header_text = th.get_text(strip=True)
    headers.append(header_text)

# If no <th> tags, try getting text from the first row's <td> tags
if not headers:
    for td in header_row.find_all('td'):
        headers.append(td.get_text(strip=True))

print("Headers:", headers)

Headers: ['Rank', 'City', 'Population']

Step 4: Extract Table Rows and Data

Data cells are in <td> tags within <tr> rows. We loop through rows, then cells.


# Initialize an empty list for all row data
table_data = []

# Find all rows in the table body
rows = table.find_all('tr')[1:]  # Skip the header row

for row in rows:
    # Get all data cells in this row
    cells = row.find_all('td')
    # Extract text from each cell and strip whitespace
    row_data = [cell.get_text(strip=True) for cell in cells]
    # Only add row if it has data (avoids empty rows)
    if row_data:
        table_data.append(row_data)

print("First few rows:", table_data[:2])

First few rows: [['1', 'Tokyo', '37,400,000'], ['2', 'Delhi', '31,400,000']]

Step 5: Create the Pandas DataFrame

Now, pass the headers and data to the pd.DataFrame() constructor. This creates your structured dataset.


# Create the DataFrame
df = pd.DataFrame(table_data, columns=headers)

# Display the first few rows
print(df.head())

  Rank      City  Population
0    1     Tokyo  37,400,000
1    2     Delhi  31,400,000
2    3  Shanghai  27,100,000

You now have a powerful Pandas DataFrame. You can analyze, filter, and visualize this data easily.
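
Before cleaning, a quick inspection helps. Keep in mind that every scraped column is still a string at this point:


# Check dimensions and column dtypes
print(df.shape)   # e.g. (3, 3)
print(df.dtypes)  # every column is 'object' (strings) until converted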

Handling Complex Table Structures

Real-world tables can be tricky. They may have rowspans, colspans, or nested elements.

For complex extractions, consider using regex with BeautifulSoup for web scraping. It helps match specific patterns.
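
As an alternative for tricky layouts, pandas' built-in read_html() parses every table on a page and handles many rowspan and colspan cases for you. A minimal sketch, reusing the response from Step 1 (read_html needs a parser such as lxml, which we installed earlier):


from io import StringIO

import pandas as pd

# read_html() returns a list of DataFrames, one per <table> on the page;
# wrapping the HTML in StringIO avoids a deprecation warning in recent pandas
tables = pd.read_html(StringIO(response.text))
complex_df = tables[0]  # pick the table you need by position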

Sometimes data is split across many pages. Our guide on scraping multiple pages with BeautifulSoup shows you how.
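
As a starting point, here is a minimal pagination sketch. The ?page= query parameter is a hypothetical example; adapt it to the site's real URL scheme.


import time

import requests
from bs4 import BeautifulSoup

all_rows = []

for page in range(1, 4):
    # The ?page= parameter is an assumption; check the site's actual URLs
    response = requests.get(f'https://example.com/data-table?page={page}', timeout=10)
    soup = BeautifulSoup(response.content, 'lxml')
    table = soup.find('table')
    if table is None:
        break  # no table on this page; assume we ran out of pages
    for row in table.find_all('tr')[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        if cells:
            all_rows.append(cells)
    time.sleep(1)  # pause between requests to be polite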

Data Cleaning and Post-Processing

Scraped data often needs cleaning. Pandas provides excellent tools for this.


# Convert 'Population' from string with commas to integer
df['Population'] = df['Population'].str.replace(',', '').astype(int)

# Convert 'Rank' to integer
df['Rank'] = df['Rank'].astype(int)

# Set 'Rank' as the index
df.set_index('Rank', inplace=True)

print(df.dtypes)
print(df.head())

City          object
Population     int64
dtype: object
           City  Population
Rank
1         Tokyo    37400000
2         Delhi    31400000
3      Shanghai    27100000

Saving Your Scraped Data

After cleaning, save your DataFrame. Common formats are CSV, Excel, or JSON.


# Save to CSV
df.to_csv('city_population.csv', index=True)

# Save to Excel (requires openpyxl or xlsxwriter)
# df.to_excel('city_population.xlsx', index=True)

# Save to JSON
df.to_json('city_population.json', orient='records')

For a dedicated guide on exporting, read how to save scraped data to CSV with BeautifulSoup.

Best Practices and Tips

Always respect the website's robots.txt file. Do not overload their servers with rapid requests.
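
Python's standard library can check robots.txt for you. A minimal sketch using urllib.robotparser; the URLs are placeholders:


from urllib import robotparser

# Load and parse the site's robots.txt (placeholder URL)
rp = robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

# can_fetch() reports whether the given user agent may fetch the URL
if rp.can_fetch('*', 'https://example.com/data-table'):
    print('Allowed to scrape this page')
else:
    print('Disallowed by robots.txt')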

Use time.sleep() between requests when scraping multiple pages. This is polite and helps you avoid bans.

For large-scale scraping, use BeautifulSoup with proxies and user agents. This helps mimic real browsers.
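
Both are plain keyword arguments in requests. A minimal sketch; the proxy address and User-Agent string are placeholders, not working values:


# Placeholder proxy and User-Agent; substitute real values
proxies = {
    'http': 'http://proxy.example.com:8080',
    'https': 'http://proxy.example.com:8080',
}
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'}

response = requests.get(url, headers=headers, proxies=proxies, timeout=10)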

Check your data types after scraping. Strings often need conversion to numbers or dates.
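
pd.to_numeric() with errors='coerce' is a safe way to convert: values that fail to parse become NaN instead of raising an exception. A minimal sketch, using a raw string column like the pre-cleaning Population column as an example:


# Coerce unparseable values to NaN instead of raising
df['Population'] = pd.to_numeric(
    df['Population'].str.replace(',', ''), errors='coerce'
)

# Inspect any rows that failed to convert
print(df[df['Population'].isna()])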

Conclusion

Extracting table data to a Pandas DataFrame is a powerful technique. It automates data collection from the web.

The process is simple. Fetch HTML, find the table, extract headers and rows, then build the DataFrame.

This skill opens doors to vast data sources. You can track prices, analyze trends, or gather research data.

Start with simple tables. Then tackle more complex structures. Happy scraping!