Last modified: Jul 01, 2023 By Alexander Williams

Python: Parsing an HTML File or Multiple Files Using Beautiful Soup

To parse an HTML file in Python, we'll need the Beautiful Soup library. Beautiful Soup is a Python library that simplifies the process of parsing HTML and XML documents.

Installing Beautiful Soup

To get started, install Beautiful Soup with the following command:

pip install beautifulsoup4

Parsing an HTML File Using Beautiful Soup

To parse an HTML file in Python, follow these steps:

  1. Open the file
  2. Parse the file

In this example, file1.html is an HTML file stored in the files directory. We'll parse it and extract the title tag using Beautiful Soup.

Here is an example:

from bs4 import BeautifulSoup

# Open File
with open('files/file1.html') as f:
    # Parse HTML File
    soup = BeautifulSoup(f, 'html.parser')
    # Find Title Tag
    print(soup.title)      

Output:

<title>pytutorial | The Simplest Python and Django Tutorials</title>

If you want to get the text inside the tag, check out BeautifulSoup: Get Text value of Element
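As a quick illustration of getting the text inside the tag, here is a minimal sketch. It parses an inline HTML string (a stand-in for the contents of file1.html, which is an assumption) so it runs on its own, and title_text is a hypothetical helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Inline HTML used as a stand-in for the contents of file1.html (assumption)
html = (
    "<html><head>"
    "<title>pytutorial | The Simplest Python and Django Tutorials</title>"
    "</head><body></body></html>"
)


def title_text(markup):
    """Return the text inside <title>, or None if there is no title tag."""
    soup = BeautifulSoup(markup, "html.parser")
    # soup.title is None when the document has no <title> tag
    if soup.title is None:
        return None
    # get_text() returns the tag's text without the surrounding markup
    return soup.title.get_text()


print(title_text(html))
```

Checking for None first avoids an AttributeError on documents that have no title tag.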

Parsing Multiple HTML Files Using Beautiful Soup

To parse every HTML file in a directory, we first list the directory's contents with the os.listdir() function. This function returns a list of the names of all files and directories in the specified directory.
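To make the behavior of os.listdir() concrete, the sketch below builds a throwaway directory and filters its entries by extension. The helper name list_html_files and the sample file names are assumptions made for illustration only.

```python
import os
import tempfile


def list_html_files(directory):
    """Return the sorted names of entries in `directory` ending with .html."""
    # os.listdir() returns bare names (not full paths) in arbitrary order,
    # so we sort them to get a predictable result
    return sorted(
        name for name in os.listdir(directory) if name.endswith(".html")
    )


# Build a temporary directory so the example runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.html", "b.html", "notes.txt"):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write("<title>demo</title>")
    # Subdirectories are listed too, which is why the extension check matters
    os.mkdir(os.path.join(tmp, "subdir"))
    print(list_html_files(tmp))  # ['a.html', 'b.html']
```

Because the returned names are relative, they need to be joined with the directory path before they can be opened.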

In the following example, we'll extract the title tag from all HTML files in the directory: 

import os

from bs4 import BeautifulSoup

# Specify the directory containing HTML files
html_dir = 'path/to/html/files'

# Iterate over the files in the directory
for file_name in os.listdir(html_dir):
    # Check if the file is an HTML file
    if file_name.endswith('.html'):
        # Construct the file path
        file_path = os.path.join(html_dir, file_name)

        # Open and parse the HTML file
        with open(file_path) as file:
            soup = BeautifulSoup(file, 'html.parser')
            # Extract Title Tag
            print(soup.title)

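Here is what the code does:

  1. Specify the directory containing the HTML files (html_dir)
  2. Iterate over the entries in the directory with os.listdir()
  3. Check whether the entry name ends with .html
  4. Build the full path with os.path.join(), then open and parse the file
  5. Print the title tag with soup.title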
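The same steps can also be wrapped in a function that returns the titles instead of printing them. This is only a sketch under stated assumptions: collect_titles is a hypothetical name, the files are assumed to be UTF-8 encoded, and the demo files are created in a temporary directory.

```python
import os
import tempfile

from bs4 import BeautifulSoup


def collect_titles(html_dir):
    """Map each .html file name in `html_dir` to its title text (or None)."""
    titles = {}
    for file_name in sorted(os.listdir(html_dir)):
        if not file_name.endswith(".html"):
            continue
        file_path = os.path.join(html_dir, file_name)
        # The encoding is an assumption; adjust it to match your files
        with open(file_path, encoding="utf-8") as file:
            soup = BeautifulSoup(file, "html.parser")
        # soup.title is None for documents without a <title> tag
        title = soup.title
        titles[file_name] = title.get_text() if title is not None else None
    return titles


# Demo with two made-up files, one of which has no title
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "a.html": "<html><head><title>First</title></head></html>",
        "b.html": "<p>no title here</p>",
    }
    for name, markup in samples.items():
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(markup)
    print(collect_titles(tmp))  # {'a.html': 'First', 'b.html': None}
```

Returning a dictionary keeps the parsing separate from the output, so the results can be printed, saved, or filtered later.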

Conclusion

This article covered the basics of using Beautiful Soup to parse a single HTML file and every HTML file in a directory. The same approach can be extended to extract other tags and to automate tasks that process many HTML documents.