Build Your Own Broken Link Checker in Python for SEO Optimization

A broken link checker in Python offers a straightforward way to scan your website for dead URLs and keep your visitors happy. Rather than tediously clicking every link by hand, you can let Python do the boring work in minutes.

By building your own broken link finder in Python, you control how deep it crawls, what it reports, and how you fix issues. This approach works nicely for blogs, portfolios, small business sites, and even client projects that need a simple, repeatable SEO check.

What are broken links and why do they appear?

Broken links are links on your website that no longer work. When somebody clicks them, they land on an error page, usually a 404 “Page Not Found”, or sometimes hit a timeout or server error instead of the expected content.

Links break for simple, common reasons. You might remove or rename a page and forget to update the links that point to it. A website you linked to in the past may have changed its URL structure or gone offline entirely. Sometimes a small typo in a URL is enough to break a link.

In other cases, teams move old marketing campaigns, PDFs, or images to a different folder, and the original link no longer points to the correct file. To someone visiting your website, a broken link is like a dead end. They click a link expecting useful information and land on an error page.
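
In HTTP terms, a broken link is simply a request that comes back with an error status code or no response at all. The short, illustrative snippet below shows the idea; the URL is just a placeholder for a page that no longer exists:

import requests

# Placeholder URL standing in for a page that no longer exists
url = "https://example.com/this-page-does-not-exist"

try:
    response = requests.get(url, timeout=10)
    if response.status_code >= 400:
        print("Broken link:", url, "status:", response.status_code)
    else:
        print("Working link:", url, "status:", response.status_code)
except requests.RequestException:
    # Timeouts, DNS failures, and connection errors also count as broken
    print("Broken link (no response):", url)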

Benefits of removing broken links

Removing or fixing broken links gives your site several clear advantages:

  • Better user experience
  • Higher trust and brand image
  • Improved SEO performance
  • Stronger conversion paths and easier maintenance
  • More accurate internal navigation

Building a Broken Link Checker Using Python

To build a simple website link validator in Python, you only need two popular packages: requests and beautifulsoup4. They help you download pages and read the HTML to extract links. You can install them with pip:

pip install requests beautifulsoup4

The requests library makes HTTP calls in a clean and readable way. You use it to fetch each page and test whether a link returns a good status code or a dead response. The beautifulsoup4 library parses the HTML and lets you find all a, img, script, and other tags that contain URLs.
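
As a quick illustration of how the two libraries cooperate, the short sketch below fetches a single page and prints the raw href value of every anchor tag it finds (the URL is only an example):

import requests
from bs4 import BeautifulSoup

# Example page; replace with any page you want to inspect
page_url = "https://example.com"

response = requests.get(page_url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Every a tag with an href attribute holds a link worth checking
for tag in soup.find_all("a", href=True):
    print(tag["href"])

With that building block in mind, here is the complete broken link checker: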

import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin
from collections import deque

def is_valid_url(url):
    # Accept only input that parses to a URL with a scheme and a dotted domain
    if not url:
        return False
    if not url.startswith(("http://", "https://")):
        url = "https://" + url
    parsed = urlparse(url)
    if not parsed.scheme or not parsed.netloc:
        return False
    if "." not in parsed.netloc:
        return False
    return True

def normalize_url(url):
    # Add https:// when no scheme is given and drop the trailing slash so URLs compare consistently
    if not url.startswith(("http://", "https://")):
        url = "https://" + url.lstrip("/")
    return url.rstrip("/")

def is_internal(link_url, domain):
    # A link counts as internal when its host matches the start URL's domain
    return urlparse(link_url).netloc == domain

def check_url(url, session, timeout=10):
    # Probe the URL with a lightweight HEAD request; fall back to GET if the server rejects HEAD
    try:
        r = session.head(url, allow_redirects=True, timeout=timeout, verify=False)
        if r.status_code == 405:
            r = session.get(url, allow_redirects=True, timeout=timeout, verify=False, stream=True)
        status = r.status_code
        r.close()
        return status
    except Exception:
        return None

def main():
    while True:
        raw = input("Enter website URL (for example https://example.com): ").strip()
        if is_valid_url(raw):
            start_url = normalize_url(raw)
            break
        print("Invalid URL. Please enter a valid URL like https://example.com")

    parsed_start = urlparse(start_url)
    domain = parsed_start.netloc

    visited_urls = set()
    pending_urls = deque([start_url])
    internal_checked_urls = set()
    external_checked_urls = set()
    broken_urls = []

    session = requests.Session()
    session.headers.update({"User-Agent": "PythonBrokenLinkChecker/1.0"})
    # Silence the InsecureRequestWarning triggered by verify=False in the requests below
    requests.packages.urllib3.disable_warnings()

    while pending_urls:
        # Breadth-first crawl: process each queued page exactly once
        current_url = pending_urls.popleft()
        if current_url in visited_urls:
            continue

        visited_urls.add(current_url)
        print("Crawling:", current_url)

        try:
            resp = session.get(current_url, timeout=15, verify=False)
        except Exception:
            print("Failed to load page:", current_url)
            broken_urls.append({"url": current_url, "source": current_url, "status": "PAGE_ERROR"})
            continue

        content_type = resp.headers.get("content-type", "")
        if "text/html" not in content_type:
            resp.close()
            continue

        soup = BeautifulSoup(resp.text, "html.parser")
        resp.close()

        # Collect every tag type that can carry a URL reference
        tags = soup.find_all(["a", "img", "link", "script", "source"])

        for tag in tags:
            link = None
            if tag.has_attr("href"):
                link = tag["href"]
            elif tag.has_attr("src"):
                link = tag["src"]
            elif tag.has_attr("srcset"):
                parts = tag["srcset"].split(",")
                if parts:
                    link = parts[0].strip().split(" ")[0]

            if not link:
                continue
            if link.startswith("#"):
                continue

            # Resolve relative links against the current page URL
            full_link = urljoin(current_url, link)
            parsed_link = urlparse(full_link)

            if parsed_link.scheme not in ("http", "https"):
                continue

            if is_internal(full_link, domain):
                if full_link in internal_checked_urls:
                    continue
            else:
                if full_link in external_checked_urls:
                    continue

            status = check_url(full_link, session)

            if status is None or status >= 400:
                print("Broken:", full_link, "Status:", status)
                broken_urls.append({"url": full_link, "source": current_url, "status": status})
            else:
                if is_internal(full_link, domain):
                    internal_checked_urls.add(full_link)
                    if full_link not in visited_urls and full_link not in pending_urls:
                        pending_urls.append(full_link)
                else:
                    external_checked_urls.add(full_link)

    print()
    print("Scan finished")
    print("Visited URLs:", len(visited_urls))
    print("Internal checked URLs:", len(internal_checked_urls))
    print("External checked URLs:", len(external_checked_urls))
    print("Broken URLs:", len(broken_urls))

    if broken_urls:
        print("Broken URL list:")
        for item in broken_urls:
            print("Broken:", item["url"], "Status:", item["status"], "Found on:", item["source"])

if __name__ == "__main__":
    main()
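
To try it out, save the script under any name you like, for example broken_link_checker.py, and run it from a terminal:

python broken_link_checker.py

The script will ask for a starting URL and then crawl the site from there.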

Output:

[Screenshot: sample console output of the broken link checker, showing the pages it crawls, any broken URLs with their status codes, and the final summary]

Explanation

The is_valid_url function checks if the text the user enters looks like a real web address with a proper scheme and domain. The normalize_url function cleans the URL by adding https:// if needed and removing extra slashes at the end so all URLs use the same format. The is_internal function compares the link’s domain to the main site’s domain to decide if the link belongs to the same website or an external one. The check_url function sends a quick request to the link and returns its status code so you can tell if the link works or is broken.
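
If you want to see what these helpers do before running the full crawler, you can call them on their own, for example in a Python shell where the functions above are already defined. The expected results are shown as comments:

print(is_valid_url("example.com"))        # True - a dotted domain is accepted
print(is_valid_url("not a url"))          # False - no valid domain
print(normalize_url("example.com/blog/")) # https://example.com/blog
print(is_internal("https://example.com/about", "example.com"))  # True
print(is_internal("https://other-site.com", "example.com"))     # False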

Here is how the main function works, step by step (a stripped-down sketch of the crawl loop follows the list):

  • Ask the user for a website address, check that it looks valid, and clean it into a standard format.
  • Set up collections to track pages to visit, pages already visited, working internal links, working external links, and broken links.
  • Start with the main page in the pending list and reuse one web connection for all requests.
  • While pages remain, load each one and skip it if already visited or not valid HTML.
  • Scan the page for all links, turn them into full web addresses, and decide whether each one is internal or external.
  • For each new link, test it quickly. If it fails, record it as broken. If it works, save it as internal or external.
  • Add only new, working internal links to the pending list. When no pages are left, print a short summary of visited pages and broken links.
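
If the bullet points feel abstract, the stripped-down sketch below shows the same breadth-first idea in a compact form. It is only an illustration: it follows a tags only and leaves out the duplicate-link caching, content-type checks, and SSL-verification handling of the full script.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
from collections import deque

def crawl(start_url):
    # Simplified breadth-first crawl: visit internal pages, flag links that fail
    domain = urlparse(start_url).netloc
    visited = set()
    pending = deque([start_url])
    broken = []

    with requests.Session() as session:
        while pending:
            page = pending.popleft()
            if page in visited:
                continue
            visited.add(page)

            try:
                response = session.get(page, timeout=15)
            except requests.RequestException:
                broken.append(page)
                continue

            soup = BeautifulSoup(response.text, "html.parser")
            for tag in soup.find_all("a", href=True):
                link = urljoin(page, tag["href"])
                if not link.startswith(("http://", "https://")):
                    continue

                # Quick check: no response or a 4xx/5xx code means the link is broken
                try:
                    status = session.head(link, allow_redirects=True, timeout=10).status_code
                except requests.RequestException:
                    status = None

                if status is None or status >= 400:
                    broken.append(link)
                elif urlparse(link).netloc == domain and link not in visited:
                    pending.append(link)

    return visited, broken

# Example usage (placeholder URL):
# visited, broken = crawl("https://example.com")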

Conclusion

A custom Broken Link Checker in Python provides an easy way to protect both user experience and SEO. You are no longer dependent on online tools with usage limits, or stuck waiting for a third-party scan to finish. Instead, you run your own Python dead link detector whenever you change content or launch a new section. You catch broken internal links before they frustrate visitors and fix external links before they weaken your authority. You can even offer it as a service to clients.

If you want to explore web scraping in more depth, make sure to check out our related guide titled “Build a Python Web Scraper with Beautiful Soup”. It walks you through a clean and practical approach to extracting data from websites.