Last modified: Mar 04, 2023 By Alexander Williams
Understand How to Work with Tables in BeautifulSoup
This article covers the basics of working with HTML tables in BeautifulSoup. Specifically, we will go over how to:
- Find the table within HTML
- Find the table headers
- Retrieve the table columns
- Find the table by class
- Find the table by ID
- Find a table nested inside another table
- Find all tables
By the end of this article, you will understand how to work with tables in BeautifulSoup.
Find table within HTML
To find a table within HTML using BeautifulSoup, follow the code below:
from bs4 import BeautifulSoup
html_doc = '''
<table class="table-1">
<tr>
<th>Company</th>
<th>Contact</th>
<th>Country</th>
</tr>
<tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
</tr>
<tr>
<td>Centro comercial Moctezuma</td>
<td>Francisco Chang</td>
<td>Mexico</td>
</tr>
</table>
<table class="table-2">
<tr>
<th>Company</th>
<th>Contact</th>
<th>Country</th>
</tr>
<tr>
<td>Facebook</td>
<td>Mark</td>
<td>USA</td>
</tr>
<tr>
<td>Centro comercial Newyork</td>
<td>Newyork Chang</td>
<td>USA</td>
</tr>
</table>
'''
# Parse HTML
soup = BeautifulSoup(html_doc, 'html.parser')
# Find the first table
table = soup.find('table')
print(table)
As you can see in the HTML code provided, there are two tables with different class attributes. soup.find('table') returns only the first matching table, so the code above prints:
<table class="table-1">
<tr>
<th>Company</th>
<th>Contact</th>
<th>Country</th>
</tr>
<tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
</tr>
<tr>
<td>Centro comercial Moctezuma</td>
<td>Francisco Chang</td>
<td>Mexico</td>
</tr>
</table>
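One detail worth keeping in mind is that find() returns None when the document contains no matching tag. The short sketch below illustrates that case. The HTML string and the variable names (empty_soup, missing, message) are made up for this illustration and are not part of the example above.

```python
from bs4 import BeautifulSoup

# A document with no <table> element at all (made up for this sketch)
empty_soup = BeautifulSoup("<p>No tables here</p>", "html.parser")

# find() returns None when nothing matches
missing = empty_soup.find("table")

if missing is None:
    message = "No table found"
else:
    message = f"Found a table with {len(missing.find_all('tr'))} rows"

print(message)
```

Checking for None before calling find_all() on the result avoids an AttributeError on pages where the table is absent.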
If you need to find all the rows in a table, you can use the find_all() function and specify the tr tag as the parameter.
# Find the first table
table = soup.find('table')
# find all rows in the table
rows = table.find_all('tr')
print(rows)
Output:
[<tr>
<th>Company</th>
<th>Contact</th>
<th>Country</th>
</tr>, <tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
</tr>, <tr>
<td>Centro comercial Moctezuma</td>
<td>Francisco Chang</td>
<td>Mexico</td>
</tr>]
Additionally, if you want to extract the headers of a table, you can use the find_all() function and specify the th tag as the parameter.
# find headers in the table
headers = table.find_all('th')
print(headers)
Output:
[<th>Company</th>, <th>Contact</th>, <th>Country</th>]
To get the text inside each header cell, you can use the .text property as shown in the following example:
for header in headers:
    print(header.text)
Output:
Company
Contact
Country
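If you would rather collect the header names into a list than print them one by one, a list comprehension with get_text(strip=True) is one option. This is a minimal sketch: the HTML snippet (with deliberate extra spaces around the first header) and the names html_snippet, soup_headers, and header_names are assumptions made for the illustration.

```python
from bs4 import BeautifulSoup

# Small made-up table; note the extra spaces around "Company"
html_snippet = """
<table>
  <tr>
    <th> Company </th>
    <th>Contact</th>
    <th>Country</th>
  </tr>
</table>
"""

soup_headers = BeautifulSoup(html_snippet, "html.parser")

# get_text(strip=True) removes leading and trailing whitespace from the text
header_names = [th.get_text(strip=True) for th in soup_headers.find_all("th")]

print(header_names)
```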
To find and retrieve the text of each <td> cell in a table, you can follow these steps:
- Iterate over the rows of the table.
- Find all <td> tags within each row.
- Iterate over the <td> tags.
- Retrieve the text inside each <td> tag.
Here is an example:
# Find the first table
table = soup.find('table')
# find all rows in the table
rows = table.find_all('tr')
for row in rows:
    cells = row.find_all('td')  # Find all <td> tags
    for cell in cells:
        print(cell.text)  # Get <td> text
Output:
Alfreds Futterkiste
Maria Anders
Germany
Centro comercial Moctezuma
Francisco Chang
Mexico
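The same steps can be combined with the header lookup to turn each row into a dictionary keyed by column name. The sketch below is one possible way to do it, not the only one. It uses a compact copy of the first table, and the names sample_html, sample_table, column_names, and records are hypothetical.

```python
from bs4 import BeautifulSoup

# Compact copy of the first table from this article
sample_html = """
<table class="table-1">
<tr><th>Company</th><th>Contact</th><th>Country</th></tr>
<tr><td>Alfreds Futterkiste</td><td>Maria Anders</td><td>Germany</td></tr>
<tr><td>Centro comercial Moctezuma</td><td>Francisco Chang</td><td>Mexico</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table")

# Column names come from the <th> cells
column_names = [th.get_text(strip=True) for th in sample_table.find_all("th")]

records = []
for tr in sample_table.find_all("tr"):
    cells = tr.find_all("td")
    if not cells:  # the header row has only <th> cells, so skip it
        continue
    values = [td.get_text(strip=True) for td in cells]
    records.append(dict(zip(column_names, values)))

for record in records:
    print(record)
```

A list of dictionaries like this is easy to pass on to other tools, for example the csv module or a DataFrame constructor.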
Find all tables
In Beautiful Soup, we can use either the find_all() or select() function to locate all tables within HTML. Here's an example of how to use each function:
1. find_all():
html_doc = '''
<table class="table-1">
<tr>
<th>Company</th>
<th>Contact</th>
<th>Country</th>
</tr>
<tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
</tr>
<tr>
<td>Centro comercial Moctezuma</td>
<td>Francisco Chang</td>
<td>Mexico</td>
</tr>
</table>
<table class="table-2">
<tr>
<th>Company</th>
<th>Contact</th>
<th>Country</th>
</tr>
<tr>
<td>Facebook</td>
<td>Mark</td>
<td>USA</td>
</tr>
<tr>
<td>Centro comercial Newyork</td>
<td>Newyork Chang</td>
<td>USA</td>
</tr>
</table>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
tables = soup.find_all('table') # Find all tables using find_all()
2. select():
html_doc = '''
<table class="table-1">
<tr>
<th>Company</th>
<th>Contact</th>
<th>Country</th>
</tr>
<tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
</tr>
<tr>
<td>Centro comercial Moctezuma</td>
<td>Francisco Chang</td>
<td>Mexico</td>
</tr>
</table>
<table class="table-2">
<tr>
<th>Company</th>
<th>Contact</th>
<th>Country</th>
</tr>
<tr>
<td>Facebook</td>
<td>Mark</td>
<td>USA</td>
</tr>
<tr>
<td>Centro comercial Newyork</td>
<td>Newyork Chang</td>
<td>USA</td>
</tr>
</table>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
tables = soup.select('table') # Find all tables using select()
Both the find_all() and select() functions will return a list of all tables in the HTML code. You can then iterate over this list to access each table individually.
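As a rough illustration of iterating over the result, the sketch below loops over the tables and reads each table's class attribute and row count. The shortened HTML and the names two_tables_html, found_with_find_all, found_with_select, and table_classes are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Shortened two-table document, made up for this sketch
two_tables_html = """
<table class="table-1"><tr><td>Alfreds Futterkiste</td></tr></table>
<table class="table-2"><tr><td>Facebook</td></tr></table>
"""

two_tables_soup = BeautifulSoup(two_tables_html, "html.parser")

found_with_find_all = two_tables_soup.find_all("table")
found_with_select = two_tables_soup.select("table")

table_classes = []
for tbl in found_with_find_all:
    # class is a multi-valued attribute, so Beautiful Soup returns it as a list
    table_classes.append(tbl.get("class"))
    print(tbl.get("class"), len(tbl.find_all("tr")))
```

Here both calls find the same two tables. select() accepts any CSS selector, while find_all() filters by tag name and attributes.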
Find table by class
To find a table by class, we can use the find() function and specify the class in the class_ parameter.
html_doc = '''
<table class="table-1">
<tr>
<th>Company</th>
<th>Contact</th>
<th>Country</th>
</tr>
<tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
</tr>
<tr>
<td>Centro comercial Moctezuma</td>
<td>Francisco Chang</td>
<td>Mexico</td>
</tr>
</table>
<table class="table-2">
<tr>
<th>Company</th>
<th>Contact</th>
<th>Country</th>
</tr>
<tr>
<td>Facebook</td>
<td>Mark</td>
<td>USA</td>
</tr>
<tr>
<td>Centro comercial Newyork</td>
<td>Newyork Chang</td>
<td>USA</td>
</tr>
</table>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
# Find Table by Class
table = soup.find("table", class_="table-1")
In the example above, we want to find the table with the table-1 class.
Another function that can be used to find a table by class is select_one(), which takes a CSS selector, as you can see in the following example:
table = soup.select_one("table.table-1")
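Finding a table by ID works the same way. The tables in this article do not have id attributes, so the sketch below assumes two hypothetical IDs, customers and partners, purely for illustration. It shows both the id keyword argument of find() and the # form of a CSS selector with select_one().

```python
from bs4 import BeautifulSoup

# The id values below are hypothetical; the article's tables have no ids
id_html = """
<table id="customers" class="table-1"><tr><td>Alfreds Futterkiste</td></tr></table>
<table id="partners" class="table-2"><tr><td>Facebook</td></tr></table>
"""

id_soup = BeautifulSoup(id_html, "html.parser")

# Option 1: pass the id as a keyword argument to find()
by_find = id_soup.find("table", id="customers")

# Option 2: use a CSS id selector with select_one()
by_select = id_soup.select_one("table#partners")

print(by_find["class"])
print(by_select.td.get_text())
```

Since an ID should be unique within a page, find() or select_one() is usually enough and there is no need to collect a list.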
Extract table inside other tables
To extract tables nested inside other tables using Beautiful Soup, you can use the same method for extracting any other element. Here is an example:
html_doc = '''
<table class="outer-table">
<tr>
<td>Outer Table Cell 1</td>
<td>Outer Table Cell 2</td>
</tr>
<tr>
<td colspan="2">
<table class="inner-table">
<tr>
<td>Inner Table Cell 1</td>
<td>Inner Table Cell 2</td>
</tr>
<tr>
<td>Inner Table Cell 3</td>
<td>Inner Table Cell 4</td>
</tr>
</table>
</td>
</tr>
</table>
'''
# Parse HTML
soup = BeautifulSoup(html_doc, 'html.parser')
# find the outer table
outer_table = soup.find('table', {'class': 'outer-table'})
# find the inner table within the outer table
inner_table = outer_table.find('table', {'class': 'inner-table'})
# iterate through the rows of the inner table and extract data from each cell
rows = inner_table.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    for cell in cells:
        print(cell.text)
Output:
Inner Table Cell 1
Inner Table Cell 2
Inner Table Cell 3
Inner Table Cell 4
The code above works in four steps:
- Find the outer table that contains the nested table.
- Find the inner table within the outer table.
- Iterate through the rows of the inner table.
- Extract data from each cell as required.
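One thing to watch with nested tables is that find_all() searches all descendants by default, so calling it on the outer table also returns the rows of the inner table. The sketch below shows how recursive=False limits the search to direct children. It assumes the html.parser parser and markup without tbody elements (with a tbody, the rows would not be direct children of the table). The names nested_html, outer, all_rows, and own_rows are hypothetical.

```python
from bs4 import BeautifulSoup

# Shortened nested-table document, made up for this sketch
nested_html = """
<table class="outer-table">
<tr><td>Outer Table Cell 1</td><td>Outer Table Cell 2</td></tr>
<tr><td><table class="inner-table">
<tr><td>Inner Table Cell 1</td></tr>
</table></td></tr>
</table>
"""

nested_soup = BeautifulSoup(nested_html, "html.parser")
outer = nested_soup.find("table", class_="outer-table")

# Default search: descends into the inner table as well
all_rows = outer.find_all("tr")

# recursive=False: only <tr> tags that are direct children of the outer table
own_rows = outer.find_all("tr", recursive=False)

print(len(all_rows), len(own_rows))
```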
Conclusion
In this article, we covered how to find tables in BeautifulSoup, read their headers, rows, and cells, select a table by class, and extract tables nested inside other tables. As you can see, BeautifulSoup provides a convenient set of functions for extracting tables and other structured data from HTML documents.
I hope this article helps you.