Last modified: Sep 04, 2023 By Alexander Williams

Beautifulsoup Get All Links

If you want to extract all the links (anchor tags) from HTML content, from a <div> element, or from a webpage, you're in the right place.

Get all the links from HTML content

Use the find_all() method together with the .get() method to extract all the links from HTML content. Here is an example:

from bs4 import BeautifulSoup

# Your HTML content
html_content = """
<html>
  <body>
    <a href="https://example.com">Example Website</a>
    <a href="https://openai.com">OpenAI</a>
    <a href="https://github.com">GitHub</a>
  </body>
</html>
"""

# Parse the HTML content
soup = BeautifulSoup(html_content, 'html.parser')

# Find all anchor tags
links = soup.find_all('a')

# Get and print the href attribute (link) for each anchor tag
for link in links:
    href = link.get('href')
    if href:
        print(href)

The output is:

https://example.com
https://openai.com
https://github.com

In this particular example, we follow these steps:

  1. Find all the <a> tags within the HTML.
  2. Iterate through the obtained results.
  3. Get the link for each <a> tag using the .get() method.
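As a shorthand, find_all() also accepts href=True, which matches only <a> tags that actually carry an href attribute, so the if href check becomes unnecessary. A minimal sketch, using a small sample of the HTML from above plus one anchor without an href for illustration:

```python
from bs4 import BeautifulSoup

# Sample HTML: one link with an href, one anchor without
html_content = """
<html>
  <body>
    <a href="https://example.com">Example Website</a>
    <a name="anchor-only">No href here</a>
  </body>
</html>
"""

soup = BeautifulSoup(html_content, 'html.parser')

# href=True keeps only <a> tags that have an href attribute
for link in soup.find_all('a', href=True):
    print(link['href'])
```

This prints only https://example.com, since the second anchor has no href.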

Get all the links from div element

To extract all the links located inside a <div> element, follow these steps:

  1. Use the find() method to locate the <div> element.
  2. Find all links (anchor tags) inside the <div> element.
  3. Get the href attribute (link) for each link using the get() method.

Now let's see an example:

from bs4 import BeautifulSoup

# Your HTML content
html_content = """
<html>
  <body>
    <div id="content">
      <a href="https://example.com">Example Website</a>
      <a href="https://openai.com">OpenAI</a>
      <a href="https://github.com">GitHub</a>
    </div>
    <div id="other">
      <a href="https://stackoverflow.com">Stack Overflow</a>
      <a href="https://python.org">Python.org</a>
    </div>
  </body>
</html>
"""

# Parse the HTML content
soup = BeautifulSoup(html_content, 'html.parser')

# Find the specific div element by its id
div_element = soup.find('div', {'id': 'content'})

# Find all anchor tags within the div
links_in_div = div_element.find_all('a')

# Get and print the href attribute (link)
for link in links_in_div:
    href = link.get('href')
    if href:
        print(href)

Output:

https://example.com
https://openai.com
https://github.com

Get all the links from Webpage

In this part of the lesson, we'll learn how to get all the links from the https://pytutorial.com page. We'll use the requests library to fetch the page's source code.

If you haven't installed the requests library yet, you can install it by using the following command:

pip install requests

Now, let's look at an example of how to get all of the links on the https://pytutorial.com page.

import requests
from bs4 import BeautifulSoup

# Send an HTTP GET request to the URL
url = "https://pytutorial.com"
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all anchor tags (links) in the HTML
    links = soup.find_all('a')

    # Extract and print the href attribute (link) for each anchor tag
    for link in links:
        href = link.get('href')
        if href:
            print(href)
else:
    print(f"Failed to retrieve the web page. Status code: {response.status_code}")

This code will print every link found on the page. Note that some href values may be relative paths rather than full URLs.
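Real pages often mix absolute links with relative ones such as /about. A short sketch using urljoin from the standard library's urllib.parse to resolve every link against the page URL — the sample HTML below is made up for illustration, standing in for response.text:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical page source; in practice this would be response.text
base_url = "https://pytutorial.com"
html_content = """
<a href="/about">About</a>
<a href="https://example.com">Example</a>
"""

soup = BeautifulSoup(html_content, 'html.parser')

# urljoin resolves relative paths and leaves absolute URLs untouched
for link in soup.find_all('a', href=True):
    print(urljoin(base_url, link['href']))
```

Here /about becomes https://pytutorial.com/about, while the absolute URL passes through unchanged.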

Conclusion

In this tutorial, we've learned how to get all the links from HTML text, from inside a <div> element, and from a webpage.

Note that find_all() and find() can be replaced with select() and select_one().
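For instance, the <div> lookup above can be rewritten with a CSS selector: select('div#content a') finds all anchors inside that div in a single call. A sketch using a trimmed version of the earlier sample HTML:

```python
from bs4 import BeautifulSoup

# Trimmed version of the earlier two-div sample
html_content = """
<div id="content">
  <a href="https://example.com">Example Website</a>
</div>
<div id="other">
  <a href="https://python.org">Python.org</a>
</div>
"""

soup = BeautifulSoup(html_content, 'html.parser')

# CSS selector: <a> tags that are descendants of <div id="content">
for link in soup.select('div#content a'):
    print(link.get('href'))
```

This prints only https://example.com, since the selector skips the other div entirely.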