Last modified: Jan 28, 2026 By Alexander Williams
Python urljoin: Build URLs Correctly
Working with URLs is a common task in Python. You might be building a web scraper or integrating with an API. Manually joining strings to create URLs is error-prone. A missing slash or an extra dot can break your application.
Python provides a robust solution. The urljoin function from the urllib.parse module handles this complexity for you. It intelligently combines a base URL with a relative path.
This article will guide you through using urljoin effectively. You will learn its syntax, see practical examples, and understand its behavior with different inputs.
What is urllib.parse.urljoin?
The urllib.parse module is part of Python's standard library. It offers functions to break down and construct URLs. The urljoin function is one of its most useful tools.
Its primary job is to resolve a relative URL against a base URL. This mimics how web browsers work. When you click a link like "/about", the browser knows to attach it to the current site's address.
urljoin does this programmatically, ensuring your URLs are always valid and well-formed.
Importing and Basic Syntax
First, you need to import the function. Here is the basic syntax.
from urllib.parse import urljoin
# Syntax: urljoin(base, url, allow_fragments=True)
joined_url = urljoin(base_url, relative_path)
The function takes two main arguments. The first is the base URL string. The second is the url to be joined. The optional allow_fragments argument controls the handling of the fragment (the part after a #).
It returns a single, complete URL string.
How urljoin Works: Key Behaviors
Understanding urljoin's logic is crucial. It doesn't just concatenate strings. It follows specific rules from web standards.
Joining Relative Paths
This is the most common use case. You have a base page URL and a link to another page on the same site.
from urllib.parse import urljoin
base = "https://www.example.com/docs/tutorial/"
path = "chapter1.html"
result = urljoin(base, path)
print(result)
https://www.example.com/docs/tutorial/chapter1.html
The function correctly appended the relative path to the base. It handled the trailing slash intelligently.
Handling Absolute Paths
What if the second argument is an absolute path (starting with /)? urljoin replaces the entire path of the base URL.
base = "https://www.example.com/docs/tutorial/"
path = "/api/v2/users"
result = urljoin(base, path)
print(result)
https://www.example.com/api/v2/users
Notice how /docs/tutorial/ was completely replaced by /api/v2/users. This is the correct behavior for web links.
When the Second Argument is a Full URL
If the second argument is a complete URL (with scheme like http://), urljoin simply returns it. The base URL is ignored.
base = "https://www.example.com/"
path = "https://www.anotherexample.com/page"
result = urljoin(base, path)
print(result)
https://www.anotherexample.com/page
This is useful when parsing HTML. You might find links to other websites. urljoin correctly identifies and keeps them as absolute URLs.
Dealing with Parent Directory (..)
urljoin understands navigation symbols like .. which means "go up one directory".
base = "https://www.example.com/docs/tutorial/chapter1/"
path = "../images/photo.jpg"
result = urljoin(base, path)
print(result)
https://www.example.com/docs/images/photo.jpg
It correctly resolved .. by moving from /chapter1/ up to /docs/ before adding the image path.
Common Pitfalls and Solutions
While powerful, urljoin has behaviors that can surprise beginners.
Missing Trailing Slash in Base URL
If your base URL lacks a trailing slash and you join a relative path, the last segment is treated as a file and is removed.
base = "https://www.example.com/docs/tutorial" # No trailing slash
path = "page.html"
result = urljoin(base, path)
print(result)
https://www.example.com/docs/page.html
The word "tutorial" was replaced. To fix this, ensure your base URL ends with a / if it represents a directory. You can use other urllib.parse functions like urlparse to analyze and normalize URLs first.
Query Parameters and Fragments
urljoin also handles query strings (?key=value) and fragments (#section).
base = "https://www.example.com/search?q=python"
path = "&sort=date"
result = urljoin(base, path)
print(result)
https://www.example.com/search?q=python&sort=date
It appended the new query parameter correctly. However, logic with complex queries can be tricky. Always test with your specific URL structures.
Practical Example: Building a Simple Scraper
Let's see urljoin in a real-world scenario. We'll find all links on a webpage and convert them to absolute URLs.
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
def get_absolute_links(page_url):
"""Fetch a page and return a list of absolute links on it."""
try:
response = requests.get(page_url)
response.raise_for_status()
except requests.exceptions.RequestException as e:
print(f"Error fetching {page_url}: {e}")
return []
soup = BeautifulSoup(response.text, 'html.parser')
absolute_links = []
for link_tag in soup.find_all('a', href=True):
href = link_tag['href']
# Use urljoin to convert relative href to absolute URL
absolute_url = urljoin(page_url, href)
absolute_links.append(absolute_url)
return absolute_links
# Example usage
base_page = "https://books.toscrape.com/catalogue/category/books_1/index.html"
links = get_absolute_links(base_page)
print(f"Found {len(links)} links.")
print("First 5 absolute links:")
for link in links[:5]:
print(f" - {link}")
This script is a foundation for a web crawler. The urljoin function is essential. It ensures every link, whether relative like ./book1.html or absolute, is stored as a full, usable URL.
Conclusion
Python's urljoin function is a small but mighty tool. It saves you from the tedious and bug-prone process of manually stitching URLs together. By understanding its rules for relative paths, absolute paths, and full URLs, you can write more robust web-related code.
Always use urljoin when dealing with URLs from external sources like websites or APIs. It handles edge cases you might not initially consider. Remember to pay attention to trailing slashes in your base URLs, as this affects the joining behavior.
Integrating urljoin into your projects promotes cleaner code and prevents common web scraping errors. It's a best practice that demonstrates a solid understanding of web fundamentals in Python.