Last modified: Jan 28, 2026 by Alexander Williams
Extract Domain from URL in Python
Working with web data often involves URLs. You might need to extract just the domain name. This is a common task in web scraping, data analysis, and logging.
Python provides excellent tools for this. The urllib.parse module is part of the standard library. It is the best way to parse URLs correctly.
Why Use urllib.parse?
You might think splitting the string by '/' is enough. It is not. URLs can be complex. They have schemes, subdomains, ports, and paths.
Manual string splitting fails on edge cases. The urllib.parse module handles all URL components. It ensures your code is robust and reliable.
Using a dedicated parser is a best practice. It saves you from bugs and security issues.
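As a quick illustration (the URL here is hypothetical), compare a naive split with urlparse on an address that carries credentials and a port:

from urllib.parse import urlparse

# A URL with userinfo and a port shows where manual slicing goes wrong
url = "https://user:secret@www.example.com:8080/path"
print(url.split('/')[2])       # user:secret@www.example.com:8080
print(urlparse(url).hostname)  # www.example.com

The hostname attribute strips the userinfo and port for you, which manual slicing would have to handle by hand.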
Using urlparse to Get Network Location
The core function is urlparse. It breaks a URL string into its components. These are stored in a ParseResult object.
You access the netloc attribute. This stands for "network location". For most standard HTTP/HTTPS URLs, netloc contains the domain and port.
from urllib.parse import urlparse
# Example URL
url = "https://www.example.com:8080/path/to/page?query=python"
parsed_url = urlparse(url)
print("Full parsed object:", parsed_url)
print("\nNetwork Location (netloc):", parsed_url.netloc)
Full parsed object: ParseResult(scheme='https', netloc='www.example.com:8080', path='/path/to/page', params='', query='query=python', fragment='')
Network Location (netloc): www.example.com:8080
The output shows parsed_url.netloc returned "www.example.com:8080". This includes the subdomain and port. Often, you just want the main domain.
Extracting the Clean Domain Name
The netloc attribute often contains extra parts. You may need to strip the subdomain (www) and the port number (:8080).
You can use simple string operations on the netloc value. The split method is very useful here.
def get_domain(url):
    """Extracts the clean domain (e.g., example.com) from a URL."""
    parsed = urlparse(url)
    # Remove the port number if present
    netloc = parsed.netloc.split(':')[0]
    # Split on dots and keep the last two parts for common domains
    parts = netloc.split('.')
    if len(parts) > 2:
        # This simple heuristic strips subdomains like 'www', but it
        # misfires on multi-part suffixes such as '.co.uk' (shown below).
        # For a robust solution, use a Public Suffix List library.
        domain = '.'.join(parts[-2:])
    else:
        domain = netloc
    return domain

# Test the function
test_urls = [
    "https://www.example.com/page",
    "http://blog.example.co.uk/article",
    "https://example.com:443/home",
    "ftp://files.example.org/data"
]

for url in test_urls:
    print(f"URL: {url}")
    print(f"Domain: {get_domain(url)}\n")
URL: https://www.example.com/page
Domain: example.com
URL: http://blog.example.co.uk/article
Domain: co.uk
URL: https://example.com:443/home
Domain: example.com
URL: ftp://files.example.org/data
Domain: example.org
Notice the second result: the simple logic returned "co.uk", which is the public suffix, not the domain. For accurate results, you need to account for public suffixes like .com, .co.uk, or .github.io.
For production code, do not rely on simple splitting. Use a proper library if domain accuracy is critical.
Handling Public Suffixes with tldextract
The best way to get the true registrable domain is with a library. The tldextract library uses the Public Suffix List. It separates the domain, subdomain, and suffix correctly.
First, install it using pip: pip install tldextract.
import tldextract

def get_domain_tld(url):
    """Accurately extracts the domain and suffix using tldextract."""
    extracted = tldextract.extract(url)
    # Combine the registrable domain (e.g., 'example') with its
    # public suffix (e.g., 'com') to get the full domain
    return f"{extracted.domain}.{extracted.suffix}"

# Test with the same URLs
for url in test_urls:
    print(f"URL: {url}")
    print(f"Domain (tldextract): {get_domain_tld(url)}\n")
URL: https://www.example.com/page
Domain (tldextract): example.com
URL: http://blog.example.co.uk/article
Domain (tldextract): example.co.uk
URL: https://example.com:443/home
Domain (tldextract): example.com
URL: ftp://files.example.org/data
Domain (tldextract): example.org
Now "example.co.uk" is correctly identified. The library knew that "co.uk" is a public suffix. The registrable domain is "example". This is the correct result.
For any serious project, tldextract is the recommended tool.
Common Pitfalls and Best Practices
Here are key points to remember.
Always use urllib.parse.urlparse as the first step. It correctly handles the URL structure.
Be aware of internationalized domain names (IDNs). They may contain non-ASCII characters. Libraries like tldextract handle them.
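If you need the ASCII (punycode) form of such a hostname yourself, Python's built-in idna codec can produce it; note that it implements the older IDNA 2003 rules, while the third-party idna package covers IDNA 2008:

# Convert a non-ASCII hostname to its ASCII (punycode) form
hostname = "münchen.de"
print(hostname.encode("idna"))  # b'xn--mnchen-3ya.de'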
Remember that URLs might not have a scheme (like "http://"); without one, urlparse behaves differently, so it is safer to ensure your URL string has a scheme before parsing, as the sketch below shows. Sometimes you may need to build or reconstruct URLs from parts; in such cases, knowing how to build URLs correctly with urljoin is an essential complementary skill.
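A minimal sketch of the difference:

from urllib.parse import urlparse

bare = "www.example.com/page"
print(urlparse(bare).netloc)               # '' -- the host ended up in .path
print(urlparse("https://" + bare).netloc)  # 'www.example.com'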
Validate your input. Not every string is a valid URL. Your code should handle errors gracefully.
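One minimal approach, assuming you require a scheme and a network location (the helper name is illustrative):

from urllib.parse import urlparse

def is_probable_url(value):
    """Loose sanity check: require both a scheme and a network location."""
    try:
        parsed = urlparse(value)
    except ValueError:  # e.g., malformed IPv6 literals raise ValueError
        return False
    return bool(parsed.scheme and parsed.netloc)

print(is_probable_url("https://example.com"))  # True
print(is_probable_url("not a url"))            # False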
Conclusion
Extracting a domain from a URL in Python is straightforward with the right tools.
For basic needs, use urllib.parse.urlparse() and process the netloc attribute. For accurate, production-ready results, use the tldextract library.
This approach ensures your code handles all URL complexities. It works with subdomains, ports, and international domains.
You can now confidently parse URLs in your web projects.