Last modified: Jul 01, 2023 By Alexander Williams
Python: Parsing an HTML File or Multiple Files Using Beautiful Soup
To parse an HTML file in Python, we'll need the Beautiful Soup library. Beautiful Soup is a Python library that simplifies parsing HTML and XML documents.
Installing Beautiful Soup
To get started, install Beautiful Soup with the following command:
pip install beautifulsoup4
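To confirm that the installation worked, a quick check like the one below may help. It is a minimal sketch that assumes beautifulsoup4 is installed in the active environment; the variable name installed_version is only for illustration.

```python
# Sanity check: importing bs4 fails with ImportError if the package is missing
import bs4

# The bs4 package exposes its release number as a string
installed_version = bs4.__version__
print(installed_version)
```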
Parsing an HTML File Using Beautiful Soup
To parse an HTML file in Python, follow these steps:
- Open the file
- Parse the file
In this example, file1.html contains the HTML content. We'll parse the file and extract its title tag using Beautiful Soup.
Here is an example:
from bs4 import BeautifulSoup
# Open File
with open('files/file1.html') as f:
    # Parse HTML File
    soup = BeautifulSoup(f, 'html.parser')
# Find Title Tag
print(soup.title)
Output:
<title>pytutorial | The Simplest Python and Django Tutorials</title>
If you want to get the text inside the tag, check out BeautifulSoup: Get Text value of Element
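As a quick illustration of getting the text rather than the whole tag, here is a minimal sketch. The HTML string is made up for the example. It uses the get_text() method of a tag, and it checks for None first because soup.title is None when the document has no title tag.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html = "<html><head><title>Example page</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# soup.title is None when no <title> tag exists, so guard before reading its text
title_tag = soup.title
title_text = title_tag.get_text(strip=True) if title_tag is not None else None
print(title_text)
```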
Parsing Multiple HTML Files Using Beautiful Soup
To parse every file in a directory, we can use the os.listdir() function. This function returns a list of the names of all files and directories in the specified directory.
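Because os.listdir() returns bare names rather than full paths, each name has to be joined with the directory before the file can be opened. The sketch below illustrates this with a temporary directory; the file names are hypothetical and exist only for the demonstration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as html_dir:
    # Create two hypothetical files so there is something to list
    for name in ("a.html", "notes.txt"):
        with open(os.path.join(html_dir, name), "w") as f:
            f.write("<title>demo</title>")

    # os.listdir() returns names only, in arbitrary order, so sort them here
    names = sorted(os.listdir(html_dir))
    # Keep only HTML files and build a full path for each one
    html_paths = [os.path.join(html_dir, n) for n in names if n.endswith(".html")]
    html_names = [os.path.basename(p) for p in html_paths]

print(names)
print(html_names)
```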
In the following example, we'll extract the title tag from all HTML files in the directory:
import os
from bs4 import BeautifulSoup
# Specify the directory containing HTML files
html_dir = 'path/to/html/files'
# Iterate over the files in the directory
for file_name in os.listdir(html_dir):
    # Check if the file is an HTML file
    if file_name.endswith('.html'):
        # Construct the file path
        file_path = os.path.join(html_dir, file_name)
        # Open and parse the HTML file
        with open(file_path) as file:
            soup = BeautifulSoup(file, 'html.parser')
        # Extract Title Tag
        print(soup.title)
The code performs the following steps:
- Specify the directory containing HTML files
- Iterate over the files in the directory
- Check if the file is an HTML file
- Construct the file path
- Open and parse the HTML file
- Find the title tag
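The steps above can also be wrapped in a reusable function. The following is a sketch rather than the article's original code: collect_titles is a hypothetical helper name, the sample files are invented for the demonstration, and the encoding="utf-8" argument assumes the files are UTF-8 encoded.

```python
import os
import tempfile
from bs4 import BeautifulSoup


def collect_titles(html_dir):
    """Return {file name: title text or None} for each .html file in html_dir."""
    titles = {}
    for file_name in sorted(os.listdir(html_dir)):
        # Skip anything that is not an HTML file
        if not file_name.endswith(".html"):
            continue
        file_path = os.path.join(html_dir, file_name)
        # Assumption: the files are UTF-8 encoded
        with open(file_path, encoding="utf-8") as f:
            soup = BeautifulSoup(f, "html.parser")
        # soup.title is None when the file has no <title> tag
        titles[file_name] = soup.title.get_text(strip=True) if soup.title else None
    return titles


# Usage with hypothetical files in a temporary directory
with tempfile.TemporaryDirectory() as demo_dir:
    samples = {
        "one.html": "<html><head><title>First</title></head></html>",
        "two.html": "<html><body><p>No title here</p></body></html>",
        "skip.txt": "not html",
    }
    for name, content in samples.items():
        with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as f:
            f.write(content)
    result = collect_titles(demo_dir)

print(result)
```

Returning a dictionary instead of printing inside the loop makes the titles easier to reuse, and files without a title tag show up as None rather than raising an error.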
Conclusion
This article covered the basics of using Beautiful Soup to parse a single HTML file and every HTML file in a directory. The same approach can be used to extract other elements from HTML documents.