Last modified: April 28, 2021

Python: Parse an HTML File Using BeautifulSoup

In this tutorial, I'll show you how to parse a single HTML file, and then multiple files, using the BeautifulSoup library.

If you're new to BeautifulSoup, take a look at the BeautifulSoup documentation.

Parse a file using BeautifulSoup

To parse an HTML file in Python, we need to follow these steps:

  1. Open the file
  2. Parse the file

For this example, I have a file named file1.html, stored in a files directory, that contains HTML content.

In the following code, we'll open file1.html and then print its title tag.


from bs4 import BeautifulSoup

with open('files/file1.html') as f:
    # Read the file
    content = f.read()
    # Parse the HTML
    soup = BeautifulSoup(content, 'html.parser')
    # Print the title tag
    print(soup.title)

Output:

<title>pytutorial | The Simplest Python and Django Tutorials</title>
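BeautifulSoup also accepts an open file object directly, so the separate read() call is optional. The sketch below is only an illustration: it writes a small stand-in page to a temporary directory (the sample.html name and its contents are made up, not taken from file1.html) and uses soup.title.string to get the text inside the tag rather than the whole tag.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical stand-in for file1.html, written to a temporary directory
html = "<html><head><title>Sample page</title></head><body><p>Hi</p></body></html>"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.html"
    path.write_text(html, encoding="utf-8")

    # Pass the file object straight to BeautifulSoup instead of calling read()
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

title_tag = soup.title          # the whole <title> element
title_text = soup.title.string  # only the text inside it

print(title_tag)   # <title>Sample page</title>
print(title_text)  # Sample page
```

Passing the file object saves one line, while reading the content first keeps the raw HTML around if you need it later.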

As you can see, we opened the file with a with statement, which closes it automatically once the block ends.

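Now let's do the same thing by calling open() directly. In this case, we have to close the file ourselves.


f = open('files/file1.html')
content = f.read()
# Close the file manually, since there is no with block to do it
f.close()
# Parse the HTML
soup = BeautifulSoup(content, 'html.parser')
# Print the title tag
print(soup.title)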

Output:

<title>pytutorial | The Simplest Python and Django Tutorials</title>
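Both versions print the same title. What differs is who closes the file. Here is a minimal sketch of that difference, using a throwaway file created only for the demonstration (demo.html is a made-up name) and checking the file object's closed attribute in each case.

```python
import os
import tempfile

# Throwaway file for the demonstration; the name is hypothetical
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.html")
with open(path, "w", encoding="utf-8") as out:
    out.write("<title>demo</title>")

# with statement: the file is closed as soon as the block ends
with open(path, encoding="utf-8") as f:
    f.read()
closed_after_with = f.closed

# Plain open(): the file stays open until close() is called
g = open(path, encoding="utf-8")
g.read()
closed_before_close = g.closed
g.close()
closed_after_close = g.closed

print(closed_after_with, closed_before_close, closed_after_close)  # True False True

# Clean up the throwaway file and directory
os.remove(path)
os.rmdir(tmp_dir)
```

For that reason, the with form is usually the safer default: the file is closed even if an exception is raised while reading or parsing.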

Parse multiple files using BeautifulSoup and glob

To parse all the HTML files in a directory, we can use the glob module.

With this module, we can retrieve the pathnames that match a specified pattern.
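To see how pattern matching behaves on its own, here is a small sketch that uses only the standard library. The file names are made up for the example, and the files are created empty in a temporary directory so the snippet can run anywhere.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create a few hypothetical files to match against
    for name in ("a.html", "b.html", "notes.txt"):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write("")

    # '*' matches any run of characters within a single path segment;
    # glob.escape() protects the directory part from being read as a pattern
    pattern = os.path.join(glob.escape(tmp), "*.html")
    matches = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(matches)  # ['a.html', 'b.html']
```

Only the two .html files match; notes.txt is skipped because it doesn't fit the pattern.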

In the following code, we'll print the title tag of every HTML file in the files directory.

import glob

from bs4 import BeautifulSoup

# Collect the path of every .html file in the files directory
files = glob.glob('files/*.html')

for fi in files:
    with open(fi) as f:
        content = f.read()
        soup = BeautifulSoup(content, 'html.parser')
        print(soup.title)

Output:

   
<title>Google</title>
<title>Flavio Copes</title>
<title>pytutorial | The Simplest Python and Django Tutorials</title>