Last modified: Feb 22, 2023 By Alexander Williams
BeautifulSoup: Get All Links
In this article, we'll learn how to use BeautifulSoup to get all links from HTML code and web pages.
Get all links from HTML Code
We can use the find_all() or select() methods to get all links from HTML code.
Using find_all()
Here's an example of how to use find_all() in BeautifulSoup to get all links from HTML code:
from bs4 import BeautifulSoup
html = '''
<a href="example1.com">example1</a>
<div>
<a href="example2.com">example2</a>
<a href="example3.com">example3</a>
<a href="example4.com">example4</a>
</div>
'''
soup = BeautifulSoup(html, "html.parser") # Parse HTML
links = soup.find_all("a") # Get All <a> Tag
for link in links:
print(link['href']) # Print Link
Output:
example1.com
example2.com
example3.com
example4.com
As you can see, we've:
- Used the find_all() method to find all <a> tags
- Looped through the list of <a> tags
- Printed the value of each tag's href attribute
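Note that link['href'] raises a KeyError if an <a> tag has no href attribute. Here is a minimal sketch that uses the safer .get() method to skip such tags (the extra tag without an href is added only for illustration):

from bs4 import BeautifulSoup

html = '<a>no link here</a> <a href="example5.com">example5</a>'
soup = BeautifulSoup(html, "html.parser")

for link in soup.find_all("a"):
    href = link.get("href") # Returns None instead of raising KeyError
    if href:
        print(href) # Prints: example5.com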
Let's see how to get only the links inside the div tag.
div = soup.find("div") # Get Div
links = div.find_all("a") # Get All <a> Tag
for link in links:
print(link['href']) # Print Link
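If the page contains more than one div, you can target a specific one before collecting its links, for example by its id. A short sketch, assuming a hypothetical id of "content":

div = soup.find("div", id="content") # Find the <div> with id="content" (hypothetical id)
if div: # find() returns None if no such div exists
    for link in div.find_all("a"):
        print(link['href'])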
Using select()
We can also use select() to get all links from HTML, as demonstrated in the following example.
from bs4 import BeautifulSoup
html = '''
<a href="example1.com">example1</a>
<div>
<a href="example2.com">example2</a>
<a href="example3.com">example3</a>
<a href="example4.com">example4</a>
</div>
'''
soup = BeautifulSoup(html, "html.parser") # Parse HTML
links = soup.select("a") # Get <a> Tags
for link in links:
print(link['href']) # Print Link
Output:
example1.com
example2.com
example3.com
example4.com
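Because select() accepts any CSS selector, you can also filter on attributes. For example, the selector a[href] matches only <a> tags that actually have an href attribute, which avoids a KeyError on anchors without one. A minimal sketch reusing the soup object from the example above:

links = soup.select("a[href]") # Only <a> tags that carry an href attribute
for link in links:
    print(link['href']) # Print link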
Here is an example of using select() to extract links inside the div tag.
from bs4 import BeautifulSoup
html = '''
<a href="example1.com">example1</a>
<div>
<a href="example2.com">example2</a>
<a href="example3.com">example3</a>
<a href="example4.com">example4</a>
</div>
'''
soup = BeautifulSoup(html, "html.parser") # Parse HTML
links = soup.select("div a") # Get All Links Inside Div
for link in links:
print(link['href']) # Print Link
Output:
example2.com
example3.com
example4.com
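CSS selectors also let you control how strict the nesting is. The selector "div a" matches any <a> descendant of a div, while "div > a" matches only direct children; with this sample HTML both give the same output, but on more deeply nested pages the results can differ. A small sketch using the same soup object:

links = soup.select("div > a") # <a> tags that are direct children of a <div>
for link in links:
    print(link['href'])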
Get All Links From a Web Page
To get all links from a web page, we need the requests library to get the web page's source by making an HTTP request to the URL of the page.
In the following example, we will get all the links from the homepage of "pytutorial.com".
from bs4 import BeautifulSoup
import requests # pip install requests
w = requests.get("https://pytutorial.com") # Get Page Source
soup = BeautifulSoup(w.text, "html.parser") # Parse
links = soup.find_all("a")
for link in links:
    print(link['href'])
Output:
/category/python-tutorial
/category/django-tutorial
/about-us
/contact-us
/find-a-word-in-a-list-python
/find-a-word-in-a-list-python
/remove-comma-in-number-python
/remove-comma-in-number-python
/convert-your-django-project-to-a-static-site-and-host-it-for-free
/convert-your-django-project-to-a-static-site-and-host-it-for-free
/how-to-use-beautifulsoup-to-extract-title-tag
/how-to-use-beautifulsoup-to-extract-title-tag
/remove-first-character-of-string-in-python
/remove-first-character-of-string-in-python
/python-variable-in-string
/python-variable-in-string
/Python-check-internet-connection
/Python-check-internet-connection
/python-capture-screenshot-mouse-clicked
/python-capture-screenshot-mouse-clicked
/how-to-use-pyautogui-python-library
/how-to-use-pyautogui-python-library
/how-to-use-glob-module-in-python
/how-to-use-glob-module-in-python
https://www.facebook.com/Pytutorial-108500610683725/?modal=admin_todo_tour
https://twitter.com/pytutorial
https://www.youtube.com/@pytutorial9501
As you can see, all the links have been retrieved successfully. It is worth noting that you can also use select() to achieve the same result.
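Notice that most of the links above are relative paths. If you need absolute URLs, one common approach is to join each href with the page's base URL using urllib.parse.urljoin. A minimal sketch, assuming the soup object from the requests example above:

from urllib.parse import urljoin

base_url = "https://pytutorial.com"
for link in soup.find_all("a"):
    href = link.get("href") # Skip <a> tags without an href
    if href:
        print(urljoin(base_url, href)) # e.g. /about-us -> https://pytutorial.com/about-us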
Conclusion
In conclusion, the BeautifulSoup library in Python makes it easy to get all the links from an HTML document or web page using the find_all() or select() methods. To extract links from a web page, we typically need the requests library first to retrieve the HTML source code of the page we're interested in.