Last modified: Jan 10, 2023 By Alexander Williams
Python: Parse an HTML File Using BeautifulSoup
In this tutorial, I'll show you how to parse a single HTML file, and then multiple files, using the BeautifulSoup library.
If you're not familiar with BeautifulSoup, take a look at the BeautifulSoup documentation.
Parse a file using BeautifulSoup
To parse an HTML file in Python, we need to follow these steps:
- Open the file
- Parse the file
In my case, I have a file named file1.html that contains HTML content.
In the following code, we'll open file1.html and then get its title tag.
from bs4 import BeautifulSoup

with open('files/file1.html') as f:
    # read the file
    content = f.read()
    # parse the HTML
    soup = BeautifulSoup(content, 'html.parser')
    # print the title tag
    print(soup.title)
Output:
<title>pytutorial | The Simplest Python and Django Tutorials</title>
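The two steps above can also be sketched in a slightly different way. This is only an illustration under a few assumptions: the sample page and the sample.html file name are made up and written to a temporary directory so the code runs on its own, the file is opened with an explicit utf-8 encoding, and the open file handle is passed straight to BeautifulSoup instead of calling f.read() first. It also shows soup.title.string, which gives only the text inside the title tag.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical sample page (not from the tutorial's files directory).
sample_html = "<html><head><title>Sample page</title></head><body><p>Hello</p></body></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sample.html"
    path.write_text(sample_html, encoding="utf-8")

    # An explicit encoding avoids depending on the platform default.
    with open(path, encoding="utf-8") as f:
        # BeautifulSoup accepts an open file handle, so f.read() is optional.
        soup = BeautifulSoup(f, "html.parser")

title_tag = soup.title          # the whole <title> tag
title_text = soup.title.string  # only the text inside the tag
print(title_tag)
print(title_text)
```

Use soup.title when you want the tag itself, and soup.title.string when you only need its text.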
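As you can see, we opened the file with the with statement, which closes the file automatically when the block ends.
Now let's do the same thing by calling open() without with. In that case the file is not closed automatically, so it should be closed with f.close() once it has been read.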
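from bs4 import BeautifulSoup

f = open('files/file1.html')
content = f.read()
# close the file manually, since there is no with block
f.close()
# parse the HTML
soup = BeautifulSoup(content, 'html.parser')
# print the title tag
print(soup.title)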
Output:
<title>pytutorial | The Simplest Python and Django Tutorials</title>
Parse multiple files using BeautifulSoup and glob
To parse all the HTML files in a directory, we can use the glob module.
With this module, we can retrieve the pathnames that match a specified pattern.
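To see what pattern matching means here, the sketch below creates a few empty files in a temporary directory and asks glob for the ones ending in .html. The file names are hypothetical and exist only for the demonstration. Since glob.glob() returns matches in arbitrary order, the result is sorted.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file names, created empty just to have something to match.
    for name in ("a.html", "b.html", "notes.txt"):
        open(os.path.join(tmp_dir, name), "w").close()

    # "*" matches any run of characters within a single file name.
    pattern = os.path.join(tmp_dir, "*.html")
    # Keep only the base names and sort them for a stable result.
    matched = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(matched)  # ['a.html', 'b.html']
```

The notes.txt file is skipped because it does not match the *.html pattern.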
In the following code, we'll get the title tag from every HTML file in the files directory.
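import glob

from bs4 import BeautifulSoup

# collect every .html file in the files directory
files = glob.glob('files/*.html')
for fi in files:
    with open(fi) as f:
        content = f.read()
    # parse the HTML and print the title tag
    soup = BeautifulSoup(content, 'html.parser')
    print(soup.title)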
Output:
<title>Google</title>
<title>Flavio Copes</title>
<title>pytutorial | The Simplest Python and Django Tutorials</title>
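If you want to keep the titles instead of printing them, one possible approach is to collect them in a dictionary. The sketch below is an illustration, not part of the original code: collect_titles is a hypothetical helper, and the two sample pages are made up and written to a temporary directory so the example runs on its own. It also covers files without a title tag, where soup.title is None.

```python
import glob
import os
import tempfile

from bs4 import BeautifulSoup


def collect_titles(pattern):
    """Map each matching file name to its title text, or None if it has no <title>."""
    titles = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            soup = BeautifulSoup(f, "html.parser")
        # soup.title is None when the document has no <title> tag.
        titles[os.path.basename(path)] = soup.title.string if soup.title else None
    return titles


# Hypothetical sample files, one with a title and one without.
pages = {
    "one.html": "<html><head><title>First</title></head></html>",
    "two.html": "<html><body><p>No title here</p></body></html>",
}
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, html in pages.items():
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as f:
            f.write(html)
    result = collect_titles(os.path.join(tmp_dir, "*.html"))

print(result)
```

With the tutorial's own files, the call would be collect_titles('files/*.html').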