Last modified: Oct 06, 2024 By Alexander Williams

How to Extract Data from Tables Using BeautifulSoup

This tutorial covers how to extract data from HTML tables using BeautifulSoup, a Python library for parsing HTML and XML that is widely used for web scraping.

BeautifulSoup: Extracting Data from Tables

To extract data from tables, we'll use BeautifulSoup's methods to locate and parse table elements. Let's start with the basic syntax for finding table elements.

syntax:


soup.find_all('table')
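
`find_all('table')` returns every matching table in document order, while `find('table')` returns only the first match, or `None` when nothing matches. A minimal sketch of the difference, using a made-up two-table snippet (the `id` values here are illustrative assumptions, not from a real page):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML with two tables; the ids are made up for illustration
html_source = '''
    <table id="first"><tr><td>A</td></tr></table>
    <table id="second"><tr><td>B</td></tr></table>
'''

soup = BeautifulSoup(html_source, 'html.parser')

# find_all returns all matching tables in document order
tables = soup.find_all('table')
table_ids = [table.get('id') for table in tables]
print(table_ids)

# find returns only the first match, or None when nothing matches
first_table = soup.find('table')
missing = soup.find('table', id='does-not-exist')
print(first_table.get('id'), missing)
```

Checking for `None` before calling methods on the result of `find` avoids an `AttributeError` on pages where the table is absent.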

Now, let's dive into some practical examples.

In the following example, we'll extract data from a simple HTML table:


from bs4 import BeautifulSoup

# HTML source with a table
html_source = '''
    <table>
        <tr>
            <th>Name</th>
            <th>Age</th>
        </tr>
        <tr>
            <td>John</td>
            <td>30</td>
        </tr>
        <tr>
            <td>Jane</td>
            <td>25</td>
        </tr>
    </table>
'''

# Parsing
soup = BeautifulSoup(html_source, 'html.parser')

# Find the table
table = soup.find('table')

# Extract data from the table
for row in table.find_all('tr'):
    columns = row.find_all(['th', 'td'])
    print([column.text for column in columns])

output:

['Name', 'Age']
['John', '30']
['Jane', '25']

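In this example, we found the table, then iterated through its rows and extracted the text of each cell. Passing the list ['th', 'td'] to find_all matches both header and data cells, which is why the header row appears in the output.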
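
Printed lists are fine for inspection, but rows are often easier to work with as dictionaries keyed by the header cells. The sketch below assumes the first row is the header row. `rows_to_dicts` is a hypothetical helper name, and `get_text(strip=True)` is used instead of `.text` to trim the surrounding whitespace that real-world markup often contains:

```python
from bs4 import BeautifulSoup

# Same kind of table as above, with stray whitespace added around one value
html_source = '''
    <table>
        <tr><th>Name</th><th>Age</th></tr>
        <tr><td> John </td><td>30</td></tr>
        <tr><td>Jane</td><td>25</td></tr>
    </table>
'''

def rows_to_dicts(table):
    """Hypothetical helper: map each data row to {header: value}."""
    rows = table.find_all('tr')
    if not rows:
        return []
    # Assumption: the first row holds the column headers
    headers = [cell.get_text(strip=True) for cell in rows[0].find_all(['th', 'td'])]
    records = []
    for row in rows[1:]:
        values = [cell.get_text(strip=True) for cell in row.find_all('td')]
        records.append(dict(zip(headers, values)))
    return records

soup = BeautifulSoup(html_source, 'html.parser')
records = rows_to_dicts(soup.find('table'))
print(records)
```

All values stay strings; converting a column such as Age to `int` is left to the caller.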

BeautifulSoup: Extracting Data from Specific Columns

Sometimes you might want to extract data from specific columns. Here's how you can do that:


from bs4 import BeautifulSoup

# HTML source with a more complex table
html_source = '''
    <table id="data-table">
        <tr>
            <th>Name</th>
            <th>Age</th>
            <th>City</th>
        </tr>
        <tr>
            <td>John</td>
            <td>30</td>
            <td>New York</td>
        </tr>
        <tr>
            <td>Jane</td>
            <td>25</td>
            <td>London</td>
        </tr>
    </table>
'''

# Parsing
soup = BeautifulSoup(html_source, 'html.parser')

# Find the table by id
table = soup.find('table', id='data-table')

# Extract names and cities (1st and 3rd columns)
for row in table.find_all('tr')[1:]:  # Skip the header row
    columns = row.find_all('td')
    if len(columns) >= 3:
        name = columns[0].text
        city = columns[2].text
        print(f"Name: {name}, City: {city}")

output:

Name: John, City: New York
Name: Jane, City: London
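
Hard-coded indexes break when a page reorders its columns. One alternative, sketched here under the assumption that the first row uses `th` cells with unique labels, is to look up column positions by header text. `column_values` is a hypothetical helper, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup

html_source = '''
    <table id="data-table">
        <tr><th>Name</th><th>Age</th><th>City</th></tr>
        <tr><td>John</td><td>30</td><td>New York</td></tr>
        <tr><td>Jane</td><td>25</td><td>London</td></tr>
    </table>
'''

def column_values(table, wanted):
    """Hypothetical helper: return tuples of the wanted columns, chosen by header text."""
    rows = table.find_all('tr')
    headers = [th.get_text(strip=True) for th in rows[0].find_all('th')]
    # list.index raises ValueError if a requested header is not present
    indexes = [headers.index(name) for name in wanted]
    result = []
    for row in rows[1:]:
        cells = row.find_all('td')
        # Skip rows that are too short to contain every wanted column
        if len(cells) > max(indexes):
            result.append(tuple(cells[i].get_text(strip=True) for i in indexes))
    return result

soup = BeautifulSoup(html_source, 'html.parser')
table = soup.find('table', id='data-table')
pairs = column_values(table, ['Name', 'City'])
for name, city in pairs:
    print(f"Name: {name}, City: {city}")
```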

BeautifulSoup: Handling Tables with Colspan and Rowspan

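Tables with colspan and rowspan are tricky because a spanning cell appears only once in the HTML but occupies several columns or rows, so reading the cells row by row yields rows of different lengths. Here's an approach to handle such cases: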


from bs4 import BeautifulSoup

# HTML source with colspan and rowspan
html_source = '''
    <table>
        <tr>
            <th>Name</th>
            <th colspan="2">Contact</th>
        </tr>
        <tr>
            <td rowspan="2">John Doe</td>
            <td>Email</td>
            <td>john@example.com</td>
        </tr>
        <tr>
            <td>Phone</td>
            <td>123-456-7890</td>
        </tr>
    </table>
'''

# Parsing
soup = BeautifulSoup(html_source, 'html.parser')

# Find the table
table = soup.find('table')

# Function to handle colspan and rowspan
def extract_cell_data(cell):
    colspan = int(cell.get('colspan', 1))
    rowspan = int(cell.get('rowspan', 1))
    return [cell.text] * colspan, rowspan

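# Extract data from the table, copying rowspan cells down into later rows
data = []
pending = {}  # column index -> (text, rows still to fill)

def fill_pending(row_data):
    # Append cells that span into this row from earlier rows
    while len(row_data) in pending:
        col = len(row_data)
        text, remaining = pending[col]
        row_data.append(text)
        if remaining > 1:
            pending[col] = (text, remaining - 1)
        else:
            del pending[col]

for row in table.find_all('tr'):
    row_data = []
    for cell in row.find_all(['th', 'td']):
        fill_pending(row_data)
        cell_data, cell_rowspan = extract_cell_data(cell)
        for text in cell_data:
            if cell_rowspan > 1:
                pending[len(row_data)] = (text, cell_rowspan - 1)
            row_data.append(text)
    fill_pending(row_data)  # rowspan cells in the trailing columns

    data.append(row_data)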

# Print the extracted data
for row in data:
    print(row)

output:

['Name', 'Contact', 'Contact']
['John Doe', 'Email', 'john@example.com']
['John Doe', 'Phone', '123-456-7890']

This approach repeats a cell's text in every column and row position it spans, so each extracted row has one entry per table column. It assumes that colspan and rowspan hold positive integers and that spans do not overlap.

Conclusion

Extracting data from tables using BeautifulSoup involves finding the table elements, iterating through rows and columns, and handling special cases like colspan and rowspan. With these techniques, you can effectively scrape and process tabular data from HTML sources.

Remember to always respect website terms of service and robots.txt files when scraping data, and consider using APIs when available instead of scraping directly from HTML.