Last modified: Sep 04, 2023 By Alexander Williams
Beautifulsoup Get All Links
If you want to extract all the links (anchor tags) from HTML content, from a <div> element, or from a webpage, you're in the right place.
Get all the links from HTML content
Use the find_all() method together with the .get() method to extract all the links from HTML content. Here is an example:
from bs4 import BeautifulSoup
# Your HTML content
html_content = """
<html>
<body>
<a href="https://example.com">Example Website</a>
<a href="https://openai.com">OpenAI</a>
<a href="https://github.com">GitHub</a>
</body>
</html>
"""
# Parse the HTML content
soup = BeautifulSoup(html_content, 'html.parser')
# Find all anchor tags
links = soup.find_all('a')
# Get and print the href attribute (link) for each anchor tag
for link in links:
    href = link.get('href')
    if href:
        print(href)
The output is:
https://example.com
https://openai.com
https://github.com
In this particular example, we follow these steps:
- Find all the <a> tags within the HTML.
- Iterate through the obtained results.
- Get the link for each <a> tag using the .get() method.
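The same steps can also be written with a CSS selector. This is a minimal sketch (not from the original article): select('a[href]') matches only anchor tags that actually carry an href attribute, so the None check becomes unnecessary.

```python
from bs4 import BeautifulSoup

html_content = """
<html>
<body>
<a href="https://example.com">Example Website</a>
<a name="top">Anchor without href</a>
</body>
</html>
"""

soup = BeautifulSoup(html_content, 'html.parser')

# select() takes a CSS selector; a[href] skips anchors without an href
hrefs = [a['href'] for a in soup.select('a[href]')]
print(hrefs)
```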
Get all the links from div element
To extract all the links located inside a <div> element, follow these steps:
- Use the find() method to locate the <div> element.
- Find all links (anchor tags) inside the <div> element.
- Get the href attribute (link) for each link using the .get() method.
Now let's see an example:
from bs4 import BeautifulSoup
# Your HTML content
html_content = """
<html>
<body>
<div id="content">
<a href="https://example.com">Example Website</a>
<a href="https://openai.com">OpenAI</a>
<a href="https://github.com">GitHub</a>
</div>
<div id="other">
<a href="https://stackoverflow.com">Stack Overflow</a>
<a href="https://python.org">Python.org</a>
</div>
</body>
</html>
"""
# Parse the HTML content
soup = BeautifulSoup(html_content, 'html.parser')
# Find the specific div element by its id
div_element = soup.find('div', {'id': 'content'})
# Find all anchor tags
links_in_div = div_element.find_all('a')
# Get and print the href attribute (link)
for link in links_in_div:
    href = link.get('href')
    if href:
        print(href)
Output:
https://example.com
https://openai.com
https://github.com
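One pitfall worth guarding against: find() returns None when no matching <div> exists, so calling find_all() on the result raises an AttributeError. Here is a small defensive sketch; the id 'missing' is a hypothetical value used purely for illustration:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><body><p>No divs here</p></body></html>', 'html.parser')

# 'missing' is a hypothetical id; no such div exists in this HTML
div_element = soup.find('div', {'id': 'missing'})

# find() returned None, so check before calling find_all() on it
if div_element is not None:
    links = [a.get('href') for a in div_element.find_all('a')]
else:
    links = []
print(links)
```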
Get all the links from Webpage
In this part of the lesson, we'll learn how to get all the links from the https://pytutorial.com page. We'll use the requests library to fetch the page's source code.
If you haven't installed the requests library yet, you can install it with the following command:
pip install requests
Now, let's look at an example of how to get all of the links on the https://pytutorial.com page.
import requests
from bs4 import BeautifulSoup
# Send an HTTP GET request to the URL
url = "https://pytutorial.com"
response = requests.get(url)
# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find all anchor tags (links) in the HTML
    links = soup.find_all('a')
    # Extract and print the href attribute (link) for each anchor tag
    for link in links:
        href = link.get('href')
        if href:
            print(href)
else:
    print(f"Failed to retrieve the web page. Status code: {response.status_code}")
This code will print all links.
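Keep in mind that href values scraped from a real page are often relative (e.g. "/about" or "#section"). If you need absolute URLs, the standard-library urljoin() can resolve them against the page's base URL. A minimal sketch, with example hrefs made up for illustration:

```python
from urllib.parse import urljoin

base_url = "https://pytutorial.com"

# Hypothetical hrefs as they might appear in scraped HTML
hrefs = ["/tags/python", "https://example.com/page", "#section"]

# urljoin() leaves absolute URLs untouched and resolves relative ones
absolute = [urljoin(base_url, h) for h in hrefs]
print(absolute)
```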
Conclusion
Ultimately, we've learned how to get all links from HTML text, from inside a <div>, and from a webpage.
Note that find_all() and find() can be replaced with select() and select_one().
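To illustrate that last point, here is a brief sketch of the div example rewritten with CSS selectors; select_one('div#content') plays the role of find('div', {'id': 'content'}), and select('a') plays the role of find_all('a'):

```python
from bs4 import BeautifulSoup

html = '<div id="content"><a href="https://example.com">Example</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# CSS-selector equivalents of find() / find_all()
div = soup.select_one('div#content')   # like soup.find('div', {'id': 'content'})
links = div.select('a')                # like div.find_all('a')
hrefs = [a.get('href') for a in links]
print(hrefs)
```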