webdown

by kelp

Web page to Markdown converter — transforms online documentation into clean Markdown or Claude XML for LLM context

Webdown is a Python CLI tool and library for converting web pages, especially documentation sites, into clean Markdown or Claude XML format. It solves the problem of feeding structured, relevant web content to LLMs like Claude by stripping unnecessary formatting, allowing CSS selector-based extraction, and supporting multi-page crawling. Output is optimized for Anthropic's Claude AI models, including specific XML formatting options.

Key features

Converts web pages to clean, readable Markdown
Crawls entire documentation sites with configurable depth
Extracts specific page sections using CSS selectors
Generates Claude XML format for Anthropic AI models
Automatically streams large pages to optimize memory

Languages

Python 80%, HTML 16%, Shell 3%, Makefile 2%

Top contributors

kelp (126)

Topics

cli, llm, markdown, python, web-scraping

README

Webdown

A Python CLI tool for converting web pages to clean, readable Markdown format. Webdown makes it easy to download documentation and feed it into an LLM coding tool.

Why Webdown?

Clean Conversion: Produces readable Markdown without formatting artifacts
Multi-Page Crawling: Crawl entire documentation sites with webdown crawl
Selective Extraction: Target specific page sections with CSS selectors
Claude XML Format: Optimized output format for Anthropic's Claude AI models
Progress Tracking: Visual download progress for large pages with -p flag
Optimized Handling: Automatic streaming for large pages (>10MB) with no configuration required

Use Cases

Documentation for AI Coding Assistants

Webdown is particularly useful for preparing documentation to use with AI-assisted coding tools like Claude Code, GitHub Copilot, or ChatGPT:

Convert technical documentation into clean Markdown for AI context
Extract only the relevant parts of large documentation pages using CSS selectors
Strip out images and formatting that might consume token context
Generate well-structured tables of contents for better navigation

# Example: Convert API docs and store for AI coding context
webdown https://api.example.com/docs -s "main" -I -c -w 80 -o api_context.md
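When the conversion is one step in a larger script, the same command can be assembled from Python. This is a minimal sketch rather than part of Webdown: `build_webdown_command` and `run_webdown` are hypothetical helpers, and they use only flags listed in this README.

```python
import subprocess


def build_webdown_command(url, output, selector=None, no_images=False,
                          compact=False, width=None):
    """Assemble a webdown argv list (hypothetical helper, not part of webdown)."""
    cmd = ["webdown", url]
    if selector:
        cmd += ["-s", selector]      # extract only the matching section
    if no_images:
        cmd.append("-I")             # exclude images
    if compact:
        cmd.append("-c")             # remove excessive blank lines
    if width is not None:
        cmd += ["-w", str(width)]    # wrap text at this width
    cmd += ["-o", output]
    return cmd


def run_webdown(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with selectors such as `div.content > article`.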

Installation

From PyPI

pip install webdown

With Homebrew

# Add the tap
brew tap kelp/tools

# Install webdown
brew install webdown

Install from Source

# Clone the repository
git clone https://github.com/kelp/webdown.git
cd webdown

# Install with pip
pip install .

# Or install with Poetry
poetry install

Usage

Basic usage:

webdown https://example.com/page.html -o output.md

Output to stdout:

webdown https://example.com/page.html

Options

-o, --output: Output file (default: stdout)
-t, --toc: Generate table of contents
-L, --no-links: Strip hyperlinks
-I, --no-images: Exclude images
-s, --css SELECTOR: CSS selector to extract specific content
-c, --compact: Remove excessive blank lines from the output
-w, --width N: Set the line width for wrapped text (0 for no wrapping)
-p, --progress: Show download progress bar (useful for large files)
--claude-xml: Output in Claude XML format for use with Claude AI
--no-metadata: Exclude metadata section from Claude XML output (metadata is included by default)
--no-date: Exclude current date from metadata in Claude XML output (date is included by default)

For more details on the Claude XML format, see the Anthropic documentation on Claude XML.

For large web pages (over 10MB), streaming mode is automatically used to optimize memory usage without any configuration required.
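To make the effect of `-c` and `-w` more concrete, here is a rough sketch of the kind of post-processing those options describe. It is an illustration only and not Webdown's implementation: the function names are hypothetical, and treating "excessive" as more than one consecutive blank line is an assumption.

```python
import re
import textwrap


def compact_blank_lines(text):
    """Collapse runs of two or more blank lines into a single blank line."""
    return re.sub(r"\n{3,}", "\n\n", text)


def wrap_paragraphs(text, width):
    """Wrap each blank-line-separated paragraph to `width`; 0 disables wrapping."""
    if width == 0:
        return text
    paragraphs = text.split("\n\n")
    return "\n\n".join(textwrap.fill(p, width=width) for p in paragraphs)
```

A real converter would also need to leave code blocks and tables unwrapped, which this sketch does not attempt.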

Examples

Generate markdown with a table of contents:

webdown https://example.com -t -o output.md

Extract only main content:

webdown https://example.com -s "main" -o output.md

Strip links and images:

webdown https://example.com -L -I -o output.md

Compact output with progress bar and line wrapping:

webdown https://example.com -c -p -w 80 -o output.md

Generate Claude XML format for use with Claude AI:

webdown https://example.com --claude-xml -o doc.xml

Claude XML with no metadata section:

webdown https://example.com --claude-xml --no-metadata -o doc.xml

Claude XML without the current date in metadata:

webdown https://example.com --claude-xml --no-date -o doc.xml

Crawling Multiple Pages

Crawl an entire documentation site:

webdown crawl https://docs.example.com/ -o ./output/

Crawl with depth and delay settings:

webdown crawl https://docs.example.com/ -o ./output/ --max-depth 5 --delay 2.0

Crawl from a sitemap:

webdown crawl --sitemap https://docs.example.com/sitemap.xml -o ./output/
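For context, a sitemap is an XML file whose `<url>` entries each carry a `<loc>` element with a page address. The sketch below shows how such a file can be read with the standard library. It illustrates the sitemap format itself, not how Webdown parses it, and `urls_from_sitemap` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return the <loc> values of every <url> entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
```

Sitemap index files, which point to other sitemaps through `<sitemap>` entries, are not handled here.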

Crawl with content options:

webdown crawl https://docs.example.com/ -o ./output/ -s "main" --claude-xml

Crawl Options

--max-depth N: Maximum crawl depth from seed URLs (default: 3)
--delay SECONDS: Delay between requests (default: 1.0)
--same-domain: Allow crawling any path on the same domain
--path-prefix PREFIX: Only crawl URLs starting with this prefix
--sitemap URL: Parse sitemap.xml instead of crawling links
--max-pages N: Maximum number of pages to crawl (0 for unlimited)
-q, --quiet: Suppress progress output
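The way the depth, prefix, and page limits interact can be sketched as a breadth-first traversal. This is a simplified model under stated assumptions, not Webdown's crawler: `plan_crawl` is a hypothetical function, the `links` dictionary stands in for fetching pages, the seed URL is assumed to count toward the page limit, and `--delay` is not modeled.

```python
from collections import deque


def plan_crawl(seed, links, max_depth=3, path_prefix=None, max_pages=0):
    """Return URLs in breadth-first order under the given limits.

    `links` maps each URL to the URLs it links to (a stand-in for fetching).
    A `max_pages` of 0 means unlimited.
    """
    visited = [seed]
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if target in seen:
                continue  # each page is visited at most once
            if path_prefix and not target.startswith(path_prefix):
                continue  # outside the allowed prefix
            if max_pages and len(visited) >= max_pages:
                return visited  # page budget exhausted
            seen.add(target)
            visited.append(target)
            queue.append((target, depth + 1))
    return visited
```

Breadth-first order means pages closest to the seed are collected first, so a page limit trims the deepest pages rather than an arbitrary branch.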

For complete documentation, use the --help flag:

webdown --help

Documentation

API documentation is available online at tcole.net/webdown.

You can also generate the documentation locally with:

make docs        # Generate HTML docs in the docs/ directory
make docs-serve  # Start a local documentation server at http://localhost:8080

Development

Prerequisites

Python 3.10+ (3.13 recommended)
Poetry for dependency management

Setup

# Clone the repository
git clone https://github.com/kelp/webdown.git
cd webdown

# Install dependencies with Poetry
poetry install
poetry run pre-commit install

# Optional: Start a Poetry shell for interactive development
poetry shell

Development Commands

We use a Makefile to streamline development tasks:

# Install dependencies
make install

# Run tests
make test

# Run tests with coverage
make test-coverage

# Run integration tests
make integration-test

# Run linting
make lint

# Run type checking
make type-check

# Format code
make format

# Run all pre-commit hooks
make pre-commit

# Run all checks (lint, type-check, test)
make all-checks

# Build package
make build

# Start interactive Poetry shell
make shell

# Generate documentation
make docs

# Start documentation server
make docs-serve

# Publishing to PyPI (maintainers only)
# See CONTRIBUTING.md for details on the release process
make build         # Build package
make publish-test  # Publish to TestPyPI (for testing)

# Show all available commands
make help

Poetry Commands

You can also use Poetry directly:

# Start an interactive shell in the Poetry environment
poetry shell

# Run a command in the Poetry environment
poetry run pytest

# Add a new dependency
poetry add requests

# Add a development dependency
poetry add --group dev black

# Update dependencies
poetry update

# Build package
poetry build

Python API Usage

Webdown can also be used as a Python library in your own projects:

from webdown.converter import convert_url_to_markdown, WebdownConfig

# Basic conversion
markdown = convert_url_to_markdown("https://example.com")

# Using the Config object for more options
config = WebdownConfig(
    url="https://example.com",
    include_toc=True,
    css_selector="main",
    compact_output=True,
    body_width=80,
    show_progress=True
)
markdown = convert_url_to_markdown(config)

# Save to file
with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown)

# Convert to Claude XML format (optimized for Anthropic's Claude AI)
from webdown.converter import convert_url_to_claude_xml, ClaudeXMLConfig

# Basic Claude XML conversion
xml = convert_url_to_claude_xml("https://example.com")

# With custom XML configuration
claude_config = ClaudeXMLConfig(
    include_metadata=True,   # Include title, URL, and date (default: True)
    add_date=True,           # Include current date in metadata (default: True)
    doc_tag="claude_documentation"  # Root document tag name (default)
)
xml = convert_url_to_claude_xml("https://example.com", claude_config)

# Save XML output
with open("output.xml", "w", encoding="utf-8") as f:
    f.write(xml)

# For more information on Claude XML format, see:
# https://docs.anthropic.com/claude/docs/advanced-data-extraction
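When converting several pages from a script, it can help to keep going after a failure and report the errors at the end. The sketch below is one possible way to do that. `convert_many` and the `page_NNN.md` naming scheme are hypothetical, and the converter is passed in as a callable so that the helper does not depend on any particular Webdown function signature.

```python
from pathlib import Path


def convert_many(urls, out_dir, convert):
    """Convert each URL with `convert` and write one .md file per URL.

    `convert` is any callable that takes a URL and returns Markdown text,
    for example webdown.converter.convert_url_to_markdown.
    Returns a dict mapping each failed URL to its error message.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failures = {}
    for index, url in enumerate(urls):
        try:
            markdown = convert(url)
        except Exception as exc:  # keep going so one bad page does not stop the batch
            failures[url] = str(exc)
            continue
        (out / f"page_{index:03d}.md").write_text(markdown, encoding="utf-8")
    return failures
```

Assuming the import shown earlier, a call might look like `convert_many(urls, "./output", convert_url_to_markdown)`. For whole documentation sites, `webdown crawl` is likely the simpler route.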

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)

Run tests to make sure everything works:

# Run standard tests
poetry run pytest

# Run tests with coverage
poetry run pytest --cov=webdown

# Run integration tests
poetry run pytest --integration

Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Please make sure your code passes all tests and type checks and follows our coding style (enforced by pre-commit hooks). We aim to maintain high code coverage (currently 93%). When adding features, please include tests.

For more details, see our Contributing Guide.

Support

If you encounter any problems or have feature requests, please open an issue on GitHub.

License

MIT License - see the LICENSE file for details.
