py4u blog

How to Extract Domain Name Without Subdomain Using Python urlparse

In today’s data-driven world, URLs are everywhere—from web scraping and analytics to security auditing and content aggregation. A common task when working with URLs is extracting the main domain name (e.g., example.com) while excluding subdomains (e.g., blog.example.com or api.dev.example.com).

Python’s built-in urllib.parse module (formerly urlparse in Python 2) is a go-to tool for parsing URLs, but it doesn’t directly return the "main domain" by default. Instead, it extracts the full network location (e.g., blog.api.example.co.uk), which includes subdomains. To isolate the main domain, we need additional logic or libraries to handle complex top-level domains (TLDs) like .co.uk, .com.au, or .org.uk.

In this blog, we’ll explore how to use urllib.parse as a starting point and combine it with other techniques to reliably extract the main domain without subdomains. We’ll cover manual parsing, third-party libraries, edge cases, and best practices to ensure accuracy.

2026-01

Table of Contents#

  1. Understanding Domain Names and Subdomains
  2. Python’s urllib.parse Module: A Primer
  3. Limitations of urllib.parse for Main Domain Extraction
  4. Methods to Extract Main Domain Without Subdomain
  5. Practical Examples
  6. Handling Edge Cases
  7. Best Practices
  8. Conclusion
  9. References

1. Understanding Domain Names and Subdomains#

Before diving into code, let’s clarify key terms:

  • URL: A full web address (e.g., https://blog.api.example.co.uk:8080/path?query=1).
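  • Netloc: The network location part of a URL: the host (subdomains, main domain, and TLD) plus an optional port and optional credentials (e.g., blog.api.example.co.uk:8080).
  • Subdomain: A prefix to the main domain (e.g., blog, api, or blog.api in blog.api.example.co.uk).
  • Main Domain: The core domain name (e.g., example in example.co.uk). Joined with its suffix (example.co.uk), it forms what is usually called the registrable or registered domain.
  • Top-Level Domain (TLD): The suffix (e.g., .com, .co.uk, .org). TLDs can be "generic" (gTLDs like .com) or "country-code" (ccTLDs like .uk). Strictly speaking, the TLD is only the last label (.com, .uk); multi-label suffixes such as .co.uk or .com.au are "public suffixes" (effective TLDs) under which names are registered. This post uses "TLD" loosely for both.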

Example Breakdown:
For https://blog.api.example.co.uk/path, the components are:

  • Subdomain: blog.api
  • Main domain: example
  • TLD: co.uk
  • Full main domain (without subdomains): example.co.uk

2. Python’s urllib.parse Module: A Primer#

The urllib.parse module (Python 3) parses URLs into components like scheme (https), netloc, path, query, etc. For our task, the netloc attribute is critical because it contains the full domain (including subdomains).

Basic Usage of urllib.parse#

To extract netloc from a URL:

from urllib.parse import urlparse
 
url = "https://blog.api.example.co.uk:8080/path?query=1"
parsed_url = urlparse(url)
 
print("Scheme:", parsed_url.scheme)  # Output: https
print("Netloc:", parsed_url.netloc)  # Output: blog.api.example.co.uk:8080
print("Path:", parsed_url.path)      # Output: /path

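Here, netloc is blog.api.example.co.uk:8080. To drop the port (if present), read the hostname attribute rather than splitting on ':', since a split breaks on IPv6 literals and on URLs that embed credentials:

netloc = parsed_url.hostname  # Host without port: "blog.api.example.co.uk"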

Now we have the full domain string (blog.api.example.co.uk), but we need to extract example.co.uk from this.
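
To see how the parsed result behaves on less tidy inputs, here is a minimal sketch using only the standard library. The sample URLs and the helper name host_and_port are made up for illustration.

```python
from urllib.parse import urlparse

def host_and_port(url):
    """Return (hostname, port) as parsed by urllib.parse."""
    parsed = urlparse(url)
    # .hostname drops credentials, the port, and IPv6 brackets, and lowercases the host.
    # .port is an int, or None when the URL has no explicit port.
    return parsed.hostname, parsed.port

print(host_and_port("https://blog.api.example.co.uk:8080/path?query=1"))
print(host_and_port("https://user:secret@Blog.Example.com/"))
print(host_and_port("http://[2001:db8::1]:8080/"))

# For comparison, splitting netloc on ':' breaks on an IPv6 literal:
print(urlparse("http://[2001:db8::1]:8080/").netloc.split(":")[0])  # [2001
```

The hostname attribute is None when the URL has no network location at all, so callers should check for that before going further.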

3. Limitations of urllib.parse for Main Domain Extraction#

urllib.parse excels at splitting URLs into components, but it cannot isolate the main domain from subdomains on its own. For example:

  • If netloc is blog.example.com, urllib.parse can’t tell us that example.com is the main domain (it returns the full blog.example.com).
  • For complex TLDs like co.uk, it can’t distinguish between example.co.uk (main domain) and co.uk (TLD).

To solve this, we need to identify the TLD in the netloc string, then work backward to find the main domain.

4. Methods to Extract Main Domain Without Subdomain#

4.1 Manual Parsing with the Public Suffix List#

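The Public Suffix List (PSL) is a community-maintained list, hosted by Mozilla, of the suffixes under which domain names can be registered (e.g., .com, .co.uk, .github.io). It covers both registry-operated suffixes such as .co.uk and privately operated ones such as .github.io. To extract the main domain manually: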

  1. Split the netloc into parts (e.g., blog.api.example.co.uk → ['blog', 'api', 'example', 'co', 'uk']).
  2. Use the PSL to identify the TLD (e.g., co.uk).
  3. The main domain is the part before the TLD plus the TLD (e.g., example + co.uk → example.co.uk).
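
The three steps above can be sketched in a few lines. This is a minimal illustration, not a full PSL implementation: SAMPLE_SUFFIXES is a tiny hand-picked subset assumed for the example, and the function name registrable_domain is hypothetical. The real list also has wildcard and exception rules that this sketch ignores.

```python
# Tiny illustrative subset of the Public Suffix List (an assumption for this
# sketch; the real list has thousands of entries).
SAMPLE_SUFFIXES = {"com", "org", "uk", "co.uk", "org.uk", "com.au", "github.io"}

def registrable_domain(host, suffixes=SAMPLE_SUFFIXES):
    """Return suffix plus one label, or None if no known suffix matches."""
    labels = host.lower().rstrip(".").split(".")
    # Try the longest candidate suffix first so "co.uk" wins over "uk".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                return None  # The host is itself a public suffix
            return ".".join(labels[i - 1:])
    return None  # No known suffix (e.g., "localhost")

print(registrable_domain("blog.api.example.co.uk"))  # example.co.uk
print(registrable_domain("example.github.io"))       # example.github.io
print(registrable_domain("co.uk"))                   # None
```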

However, manually implementing this is error-prone:

  • The PSL is large (thousands of entries) and updates frequently.
  • TLDs can have unusual structures (e.g., .xn--p1ai for Russian .рф).

4.2 Using Third-Party Libraries#

Third-party libraries automate PSL checks and simplify main domain extraction. The most popular are:

tldextract#

tldextract is a robust library that uses the PSL to split a domain into subdomain, domain, and suffix (TLD). The main domain is domain + suffix.

Installation:

pip install tldextract

Usage:

import tldextract
 
# Extract components from netloc
netloc = "blog.api.example.co.uk"
extracted = tldextract.extract(netloc)
 
print("Subdomain:", extracted.subdomain)  # Output: blog.api
print("Domain:", extracted.domain)        # Output: example
print("Suffix:", extracted.suffix)        # Output: co.uk
 
main_domain = f"{extracted.domain}.{extracted.suffix}"  # Combine domain + suffix
print("Main Domain:", main_domain)  # Output: example.co.uk

The subdomain attribute holds everything to the left of the main domain. To get the main domain without subdomains, ignore extracted.subdomain and join domain and suffix.

publicsuffixlist#

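publicsuffixlist is another library built on the PSL. Its publicsuffix() method returns the public suffix (TLD), and you can derive the main domain from there. The library also provides a privatesuffix() method that returns the main domain (suffix plus one label) directly.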

Installation:

pip install publicsuffixlist

Usage:

from publicsuffixlist import PublicSuffixList
 
psl = PublicSuffixList()
netloc = "blog.api.example.co.uk"
 
# Get the public suffix (TLD)
suffix = psl.publicsuffix(netloc)  # Output: co.uk
 
# Split netloc into parts and find the main domain
parts = netloc.split('.')
suffix_parts = suffix.split('.')
main_domain_parts = parts[-(len(suffix_parts) + 1):]  # Take domain + suffix parts
main_domain = '.'.join(main_domain_parts)  # Output: example.co.uk

A naive approach might split the netloc by dots and take the last 2-3 parts (e.g., ['example', 'co', 'uk'] → example.co.uk). However, any fixed count fails for:

  • Multi-part TLDs: keeping the last two parts works for example.com → example.com, but example.co.uk → co.uk is wrong.
  • Uncommon suffixes: for example.github.io, github.io is itself a public suffix, so the main domain is example.github.io, but naive logic returns github.io.

Example of Flawed Custom Logic:

def naive_extract_main_domain(netloc):
    parts = netloc.split('.')
    # Assume the TLD is always a single part, so keep the last 2 parts (fails for .co.uk)
    return '.'.join(parts[-2:])
 
print(naive_extract_main_domain("example.co.uk"))  # Output: co.uk (WRONG)
print(naive_extract_main_domain("example.com"))    # Output: example.com (correct)

Conclusion: Custom logic is unreliable due to the complexity of TLDs. Use libraries like tldextract instead.

5. Practical Examples#

Let’s combine urllib.parse and tldextract to extract main domains from full URLs.

Example 1: Basic URL with Subdomains#

from urllib.parse import urlparse
import tldextract
 
url = "https://blog.example.com/path/to/post"
parsed_url = urlparse(url)
netloc = parsed_url.hostname  # Host only: any port or credentials are dropped
extracted = tldextract.extract(netloc)
main_domain = f"{extracted.domain}.{extracted.suffix}"
 
print("Main Domain:", main_domain)  # Output: example.com

Example 2: URL with Port and Multiple Subdomains#

from urllib.parse import urlparse
import tldextract
 
url = "http://api.dev.example.co.uk:8080/data"
parsed_url = urlparse(url)
netloc = parsed_url.hostname  # "api.dev.example.co.uk" (port removed)
extracted = tldextract.extract(netloc)
main_domain = f"{extracted.domain}.{extracted.suffix}"  # "example.co.uk"

Example 3: URL Without Subdomains#

from urllib.parse import urlparse
import tldextract
 
url = "https://example.org"
netloc = urlparse(url).netloc  # "example.org"
extracted = tldextract.extract(netloc)
main_domain = f"{extracted.domain}.{extracted.suffix}"  # "example.org" (no subdomains)
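
The three examples share the same steps, so they can be folded into one helper. What follows is a minimal sketch, assuming tldextract is installed. The function name main_domain_from_url and its extract parameter are illustrative and not part of either library; the parameter exists only so that a different splitter can be swapped in, for example in tests.

```python
from urllib.parse import urlparse

def main_domain_from_url(url, extract=None):
    """Return the main domain of url, the bare host if it has no
    public suffix, or None if the URL has no host at all."""
    if extract is None:
        import tldextract  # Third-party; imported lazily so the default is optional
        extract = tldextract.extract
    host = urlparse(url).hostname  # Drops port and credentials
    if not host:
        return None
    result = extract(host)
    if not result.suffix:
        return host  # IP address, localhost, or unknown suffix
    return f"{result.domain}.{result.suffix}"
```

Calling main_domain_from_url("http://api.dev.example.co.uk:8080/data") should give example.co.uk, matching Example 2.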

6. Handling Edge Cases#

Edge Case 1: IP Addresses Instead of Domains#

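If netloc is an IP address (e.g., 192.168.1.1), there is no public suffix to find: tldextract returns the whole address in domain and leaves suffix empty, so joining domain and suffix would produce 192.168.1.1. with a trailing dot. Check for an empty suffix:

import tldextract
 
netloc = "192.168.1.1"
extracted = tldextract.extract(netloc)
if not extracted.suffix:
    main_domain = netloc  # No public suffix (e.g., an IP): keep the host as-is
else:
    main_domain = f"{extracted.domain}.{extracted.suffix}"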
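
To detect IP addresses without relying on what tldextract returns, the standard library's ipaddress module can answer the question directly. A minimal sketch (is_ip_address is a hypothetical helper name):

```python
import ipaddress

def is_ip_address(host):
    """Return True if host is an IPv4 or IPv6 literal."""
    try:
        # Strip the square brackets that URLs put around IPv6 literals
        ipaddress.ip_address(host.strip("[]"))
        return True
    except ValueError:
        return False

print(is_ip_address("192.168.1.1"))  # True
print(is_ip_address("2001:db8::1"))  # True
print(is_ip_address("example.com"))  # False
```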

Edge Case 2: Internationalized Domain Names (IDNs)#

IDNs use non-ASCII characters (e.g., bücher.example.de) and are transmitted in their ASCII Punycode form (e.g., xn--bcher-kva.example.de). tldextract accepts either form as input:

netloc = "xn--bcher-kva.example.de"  # Punycode for "bücher.example.de"
extracted = tldextract.extract(netloc)
main_domain = f"{extracted.domain}.{extracted.suffix}"  # "example.de"
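
If you need to convert between the two forms yourself, for instance to normalize hosts before comparing them, Python's built-in idna codec can do it. A minimal sketch with hypothetical helper names; note that the built-in codec follows the older IDNA 2003 rules, so it may differ from newer implementations on some unusual names.

```python
def to_ascii_host(host):
    """Encode a Unicode hostname to its Punycode (ASCII) form."""
    return host.encode("idna").decode("ascii")

def to_unicode_host(host):
    """Decode a Punycode hostname back to Unicode."""
    return host.encode("ascii").decode("idna")

print(to_ascii_host("bücher.example.de"))           # xn--bcher-kva.example.de
print(to_unicode_host("xn--bcher-kva.example.de"))  # bücher.example.de
```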

Edge Case 3: Localhost or Intranet Domains#

For localhost or .local names (e.g., server.local), no public suffix matches, so tldextract returns suffix='' and puts the last label in domain. Handle these explicitly:

netloc = "localhost"
extracted = tldextract.extract(netloc)
if not extracted.suffix:
    main_domain = netloc  # "localhost": no public suffix, keep the host as-is
else:
    main_domain = f"{extracted.domain}.{extracted.suffix}"
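
A check that does not depend on any PSL library is to look at the last label of the host. The sketch below is one possible heuristic, not a complete rule: the set lists the special-use names .localhost, .test, .example, and .invalid (RFC 6761) plus .local (RFC 6762), and the helper name is_non_public_host is hypothetical.

```python
# Special-use top-level names reserved by RFC 6761 and RFC 6762
SPECIAL_USE_TLDS = {"localhost", "local", "test", "example", "invalid"}

def is_non_public_host(host):
    """Heuristic: True for single-label hosts and special-use names."""
    labels = host.lower().rstrip(".").split(".")
    # Single-label hosts ("intranet", "localhost") have no registrable domain
    return len(labels) == 1 or labels[-1] in SPECIAL_USE_TLDS

print(is_non_public_host("localhost"))         # True
print(is_non_public_host("server.local"))      # True
print(is_non_public_host("blog.example.com"))  # False
```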

7. Best Practices#

  1. Use tldextract for Reliability: It’s actively maintained, uses the PSL, and handles edge cases.
  2. Update Libraries: TLDs change (new ones are added), so update tldextract regularly:
    pip install --upgrade tldextract
  3. Validate Inputs: Ensure URLs are well-formed before parsing (use urllib.parse to check for netloc).
  4. Handle IPs and Localhosts: Explicitly check for IP addresses or non-public domains.
  5. Avoid Custom TLD Logic: TLDs are too complex to parse manually.
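
For the input validation in item 3, one common pitfall is a URL without a scheme: urlparse then puts everything into path and leaves netloc empty. A minimal sketch of a tolerant host extractor, assuming scheme-less input such as example.com/path should still be accepted (extract_host is a hypothetical helper name):

```python
from urllib.parse import urlparse

def extract_host(url):
    """Return the lowercase host of url, or None if none can be found."""
    parsed = urlparse(url)
    if not parsed.netloc:
        # No "//" authority marker was found, so retry with one prepended
        parsed = urlparse("//" + url)
    return parsed.hostname

print(extract_host("https://blog.example.com/path"))  # blog.example.com
print(extract_host("example.com/path"))               # example.com
print(extract_host(""))                               # None
```

Whether to accept scheme-less input at all is a design choice; rejecting it outright is the stricter option.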

8. Conclusion#

Extracting the main domain without subdomains requires more than just urllib.parse—it needs awareness of TLDs and the Public Suffix List. While urllib.parse extracts the netloc, libraries like tldextract use the PSL to reliably split subdomains, main domains, and TLDs.

By combining urllib.parse (to get netloc) with tldextract (to isolate the main domain), you can handle even complex cases like blog.api.example.co.uk or example.github.io.

9. References#