Last modified: Jan 28, 2026 by Alexander Williams
Python Get HTML from URL: Fetch Web Data
Fetching HTML from a URL is a common task. It is the first step in web scraping and data extraction. Python makes this process simple. You can use built-in or third-party libraries.
This guide will show you how to get HTML content. We will cover two main methods. You will learn about the requests library and the built-in urllib. We will also discuss error handling and best practices.
Why Fetch HTML from a URL?
Getting the raw HTML of a web page is the foundation of most scraping workflows. It lets you automate data collection, monitor website changes, and gather information for analysis.
Common use cases include price tracking, news aggregation, and research. Before you scrape, always check the website's robots.txt file. Respect the site's terms of service.
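As a quick illustration, the standard library's urllib.robotparser can read robots.txt for you. This is a minimal sketch; the URL and path below are placeholders.

from urllib.robotparser import RobotFileParser

# Point the parser at the site's robots.txt (placeholder URL)
parser = RobotFileParser()
parser.set_url('https://example.com/robots.txt')
parser.read()

# can_fetch() reports whether the given user agent may request the path
if parser.can_fetch('*', 'https://example.com/some-page'):
    print("robots.txt allows fetching this page")
else:
    print("robots.txt disallows fetching this page")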
Method 1: Using the Requests Library
The requests library is the most popular choice. It is simple and user-friendly. You need to install it first.
Use pip to install it.
pip install requests
Once installed, you can use it in your script. Import the library and use the get() function. This function sends a GET request to the URL.
import requests

# Define the target URL
url = 'https://example.com'

# Send a GET request
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Get the HTML content as text
    html_content = response.text
    print(html_content[:500])  # Print first 500 characters
else:
    print(f"Failed to retrieve page. Status code: {response.status_code}")
Output (truncated):

<!doctype html>
<html>
<head>
    <title>Example Domain</title>
    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
    }
...
The response.text attribute contains the HTML. The status_code tells you if the request succeeded. A 200 code means success.
Always handle errors. A network issue or a wrong URL can cause failure. Your code should not crash.
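As a sketch of defensive fetching, you can catch requests.exceptions.RequestException, which covers connection errors, timeouts, and invalid URLs. The 10-second timeout below is only an example value.

import requests

url = 'https://example.com'

try:
    # A timeout prevents the script from hanging on an unresponsive server
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        html_content = response.text
    else:
        print(f"Request returned status code {response.status_code}")
except requests.exceptions.RequestException as e:
    # Covers connection errors, timeouts, invalid URLs, and more
    print(f"Request failed: {e}")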
Method 2: Using urllib (Built-in)
Python has a built-in module called urllib. You do not need to install anything. It is part of the standard library.
It is a bit more verbose than requests. But it works without extra dependencies. Use urllib.request.urlopen() to open a URL.
from urllib.request import urlopen

url = 'https://example.com'

try:
    # Open the URL
    response = urlopen(url)
    # Read the HTML content and decode it to a string
    html_content = response.read().decode('utf-8')
    print(html_content[:500])
except Exception as e:
    print(f"An error occurred: {e}")
The read() method returns raw bytes. You must decode them to a string. decode('utf-8') works for most websites.
Wrap the call in a try-except block. This handles network errors gracefully. It is a good practice.
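If you want finer-grained handling than a bare except, urllib raises HTTPError for bad status codes and URLError for network problems. The sketch below also reads the charset declared in the response headers, falling back to UTF-8; the URL is a placeholder.

from urllib.request import urlopen
from urllib.error import HTTPError, URLError

url = 'https://example.com'

try:
    response = urlopen(url)
    # Use the charset declared in the Content-Type header, if any
    charset = response.headers.get_content_charset() or 'utf-8'
    html_content = response.read().decode(charset)
    print(html_content[:500])
except HTTPError as e:
    print(f"Server returned an error status: {e.code}")
except URLError as e:
    print(f"Failed to reach the server: {e.reason}")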
Handling Common Issues
Fetching HTML is not always straightforward. You may encounter problems. Let's look at solutions.
Dealing with HTTP Errors
Web servers return error codes. A 404 means the page was not found. A 403 means access is forbidden.
Use the status code to decide what to do. In requests, the Response object has a raise_for_status() method. It raises an exception for 4xx and 5xx status codes.
import requests

url = 'https://httpstat.us/404'  # A test URL that returns 404

try:
    response = requests.get(url)
    response.raise_for_status()  # Will raise HTTPError for 4xx/5xx
    html = response.text
except requests.exceptions.HTTPError as err:
    print(f"HTTP error: {err}")
Setting a User-Agent Header
Some websites block the default Python User-Agent because it marks the request as a bot. You can set a custom User-Agent header to mimic a browser.
import requests

url = 'https://example.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
response = requests.get(url, headers=headers)
print(response.request.headers['User-Agent'])
This makes your request look like it's from a web browser. It can help avoid being blocked.
Managing Relative URLs
HTML often contains relative links. You may need to convert them to absolute URLs. This is important for crawling.
You can use Python's urllib.parse.urljoin for this. For a detailed guide, see our article on Python urljoin: Build URLs Correctly.
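Here is a small sketch of how urljoin resolves a relative link against the page it came from; the URLs are placeholders.

from urllib.parse import urljoin

base_url = 'https://example.com/articles/python/'

# Relative links found in the page resolve against the base URL
print(urljoin(base_url, 'scraping.html'))   # https://example.com/articles/python/scraping.html
print(urljoin(base_url, '/about'))          # https://example.com/about
print(urljoin(base_url, '../index.html'))   # https://example.com/articles/index.html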
Parsing the HTML Content
Getting the HTML is just the first step. To extract data, you need to parse it. Libraries like Beautiful Soup and lxml are great for this.
Here is a quick example using Beautiful Soup with the requests library.
import requests
from bs4 import BeautifulSoup

url = 'https://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Get the page title
page_title = soup.title.string
print(f"Page Title: {page_title}")

# Find all paragraph tags
paragraphs = soup.find_all('p')
for p in paragraphs:
    print(p.get_text())
First, fetch the HTML. Then, pass it to Beautiful Soup. You can now search and navigate the HTML structure easily.
Best Practices and Ethics
Always scrape responsibly. Do not overload servers. Add delays between requests. Use caching when possible.
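A minimal way to add delays is time.sleep between requests. The URLs and the one-second pause below are just an example, not a recommended rate for any particular site.

import time
import requests

urls = ['https://example.com/page1', 'https://example.com/page2']

# Reusing a Session also reuses the underlying connection
session = requests.Session()
for url in urls:
    response = session.get(url)
    print(url, response.status_code)
    time.sleep(1)  # Pause between requests to avoid overloading the server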
Check for an API first. Many websites offer official APIs. They are a more reliable data source.
Read the website's robots.txt file. It tells you which pages you can and cannot scrape. Respect these rules.
Identify yourself. Use a proper User-Agent string. Some sites require it.
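One common way to do this, sketched below, is to send a descriptive User-Agent that names your project and gives a contact address; the string itself is a made-up example.

import requests

headers = {
    # A descriptive User-Agent so site operators can identify and contact you
    'User-Agent': 'MyResearchBot/1.0 (+https://example.com/bot; contact@example.com)'
}
response = requests.get('https://example.com', headers=headers)
print(response.status_code)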
Conclusion
Fetching HTML from a URL in Python is easy. The requests library is the best tool for most jobs. The built-in urllib works when you cannot install packages.
Remember to handle errors. Set appropriate headers. Parse the HTML with a library like Beautiful Soup.
Use this power responsibly. Happy coding!