Overview
This Python script scans all Markdown files in the KServe repository for hyperlinks and verifies that each referenced resource exists, reporting any broken links.
Description
The script discovers Markdown files via glob patterns, extracts links using regex (both markdown-style [text](url) and plain URLs), resolves relative links against the GitHub repository URL, verifies local file links by checking the filesystem, and validates remote URLs concurrently using HTTP HEAD/GET requests. It includes retry logic for rate limiting (HTTP 429) and transient errors, respects GitHub's 60-request-per-minute rate limit, and exits with a non-zero code if any 404 errors are found. This makes it suitable for CI pipelines to prevent broken documentation links from accumulating.
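To illustrate the extraction step, here is a minimal sketch of regex-based link extraction. The patterns below are simplified assumptions for illustration, not the script's exact expressions:

```python
import re

# Hypothetical patterns, similar in spirit to what the script does:
# markdown-style links [text](url) and bare http(s) URLs.
MD_LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")
BARE_URL = re.compile(r"(?<![(\[])(https?://[^\s<>\"')\]]+)")

def extract_links(line: str) -> list[tuple[str, str]]:
    """Return (text, url) pairs found on a single Markdown line."""
    links = [(m.group(1), m.group(2)) for m in MD_LINK.finditer(line)]
    # Bare URLs carry no link text, so reuse the URL itself.
    links += [(u, u) for u in BARE_URL.findall(line)]
    return links

line = "See the [docs](https://kserve.github.io/website/) and ./README.md"
print(extract_links(line))  # → [('docs', 'https://kserve.github.io/website/')]
```

The negative lookbehind keeps the bare-URL pattern from double-counting URLs that already sit inside a markdown link.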
Usage
Run this script in CI pipelines or locally to verify that all documentation links in the KServe repository are valid. It is located in the hack/ directory alongside other development and CI utility scripts.
Code Reference
Source Location

hack/verify-doc-links.py
Signature
```python
#!/usr/bin/env python3
import concurrent.futures
import itertools
import re
from datetime import datetime, timedelta
from glob import glob
from os import environ as env
from os.path import abspath, dirname, exists, relpath
from time import sleep
from urllib.request import Request, urlopen
from urllib.parse import urlparse
from urllib.error import URLError, HTTPError

GITHUB_REPO = env.get("GITHUB_REPO", "https://github.com/kserve/kserve/")
BRANCH = "master"

def find_md_files() -> list[str]: ...
def get_links_from_md_file(md_file_path: str) -> list[tuple[int, str, str]]: ...
def test_url(file: str, line: int, text: str, url: str) -> tuple[str, int, str, str, int]: ...
def wait_before_retry(retry_time: datetime) -> datetime: ...
def set_retry_time() -> datetime: ...
def request_url(url: str, method: str = "HEAD", headers: dict = None, timeout: int = 10) -> int: ...
def verify_urls_concurrently(md_files: list[str]) -> list[tuple[str, int, str, str, int]]: ...
def verify_doc_links() -> int: ...
```
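The HEAD-then-GET pattern implied by the signatures above can be sketched as follows. The fallback conditions and the `0` sentinel for connection failures are assumptions made for this sketch, not taken from the script:

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def request_url(url: str, method: str = "HEAD",
                headers: dict = None, timeout: int = 10) -> int:
    """Return the HTTP status code, or the error code for HTTP errors."""
    req = Request(url, method=method, headers=headers or {})
    try:
        with urlopen(req, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        return err.code
    except URLError:
        return 0  # connection-level failure; sentinel chosen for this sketch

def check_url(url: str) -> int:
    status = request_url(url, "HEAD")
    # Some servers reject HEAD (405) or block it outright (403); fall back to GET.
    if status in (403, 405):
        status = request_url(url, "GET")
    return status
```

Starting with HEAD avoids downloading response bodies for the common case, which matters when checking hundreds of links concurrently.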
Invocation

```shell
python3 hack/verify-doc-links.py
```
I/O Contract
Inputs
| Input | Type | Description |
|-------|------|-------------|
| Markdown files | filesystem | All .md files found via glob patterns /**/*.md and /.github/**/*.md |
| GITHUB_REPO | env var | GitHub repository URL (default: https://github.com/kserve/kserve/) |
Outputs
| Output | Type | Description |
|--------|------|-------------|
| stdout | text | Progress messages and broken link reports |
| exit code | int | 0 if all links are valid, non-zero if broken links are found |
Excluded Paths
| Path | Reason |
|------|--------|
| /node_modules/ | Third-party dependencies |
| /temp/ | Temporary files |
| /.venv/ | Python virtual environment |
Excluded URL Patterns
| Pattern | Reason |
|---------|--------|
| URLs with <, >, $, {, } | Placeholder/template URLs |
| 0.0.0.0, localhost, :80, :90 | Local/non-public URLs |
| example.com, customdomain.com | Example domains |
| svc.cluster.local | Kubernetes internal DNS |
Key Functions
| Function | Description |
|----------|-------------|
| find_md_files() | Discovers all Markdown files using glob, excluding paths in excluded_paths |
| get_links_from_md_file() | Extracts links from a Markdown file using regex; resolves relative links to GitHub URLs |
| test_url() | Tests a single URL with HEAD, falling back to GET; handles 403, 405, 429, 503 with retries |
| request_url() | Makes an HTTP request with configurable method, headers, and timeout |
| verify_urls_concurrently() | Uses ThreadPoolExecutor with up to 60 parallel workers |
| verify_doc_links() | Main orchestrator: finds files, verifies URLs, reports broken links |
Rate Limiting Configuration
| Parameter | Value | Description |
|-----------|-------|-------------|
| parallel_requests | 60 | Maximum concurrent requests |
| retry_wait | 60 seconds | Wait time after HTTP 429 |
| extra_wait | 5 seconds | Additional buffer before retry |
Usage Examples
```shell
# Run from the repository root
python3 hack/verify-doc-links.py

# Override the GitHub repository URL
GITHUB_REPO=https://github.com/my-fork/kserve/ python3 hack/verify-doc-links.py

# Use in CI (non-zero exit on broken links)
python3 hack/verify-doc-links.py || echo "Broken links detected!"
```