A small, dependency-minimal Rust web crawler that fetches a seed URL, extracts same-host links from the homepage, and saves HTML responses to disk.
Built for learning and small local crawl tasks — not a production spider.
TL;DR
What this is: A CLI and library that crawls a seed URL, follows same-host links found on the homepage, and writes each page to disk with structured stdout logging.
What this isn’t: A recursive or politeness-aware crawler. No robots.txt, rate limiting, or depth control by default.
Run: `cargo run --release -- "https://example.com" --out-dir crawl_out`
What it does
- Accepts a seed URL (or hostname) as CLI input
- Fetches the homepage once
- Extracts `<a href="...">` links on the first page
- Normalizes each link to an absolute URL
- Follows only same-host links
- Fetches each same-host page once
- Saves each response body in `out_dir` using a deterministic URL hash filename
- Logs crawl events to stdout with status and byte counts
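The extract, normalize, and filter steps above can be sketched with the standard library alone. This is a simplified illustration, not the code in `src/links.rs` or `src/url_util.rs` (the project depends on the `url` crate for real parsing). The function names `extract_hrefs`, `host_of`, `absolutize`, and `same_host_links` are hypothetical, and the sketch only handles double-quoted `href` values, absolute `http(s)` URLs, and root-relative paths.

```rust
/// Collect the values of double-quoted `href="..."` attributes in document order.
/// Naive scan: it does not parse HTML, so it also matches `href` outside `<a>` tags
/// and ignores single-quoted or unquoted attribute values.
fn extract_hrefs(html: &str) -> Vec<String> {
    let needle = "href=\"";
    let mut out = Vec::new();
    let mut rest = html;
    while let Some(start) = rest.find(needle) {
        let after = &rest[start + needle.len()..];
        match after.find('"') {
            Some(end) => {
                out.push(after[..end].to_string());
                rest = &after[end + 1..];
            }
            None => break, // unterminated attribute: stop scanning
        }
    }
    out
}

/// Lowercased host of an absolute http(s) URL, or None for anything else.
/// Simplification: IPv6 literals such as `[::1]` are not handled.
fn host_of(url: &str) -> Option<String> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    let authority = rest.split(|c: char| c == '/' || c == '?' || c == '#').next()?;
    let host_port = authority.rsplit('@').next()?; // drop userinfo
    let host = host_port.split(':').next()?; // drop port
    if host.is_empty() {
        None
    } else {
        Some(host.to_ascii_lowercase())
    }
}

/// Turn an href into an absolute URL. Only absolute http(s) URLs and
/// root-relative paths ("/about") are supported in this sketch.
fn absolutize(base: &str, href: &str) -> Option<String> {
    if href.starts_with("http://") || href.starts_with("https://") {
        return Some(href.to_string());
    }
    if href.starts_with('/') && !href.starts_with("//") {
        let scheme_end = base.find("://")? + 3;
        let authority_len = base[scheme_end..]
            .find('/')
            .unwrap_or(base.len() - scheme_end);
        return Some(format!("{}{}", &base[..scheme_end + authority_len], href));
    }
    None // relative paths, mailto:, fragments, etc. are out of scope here
}

/// Absolute links found in `html` that share a host with `base`.
fn same_host_links(base: &str, html: &str) -> Vec<String> {
    let base_host = host_of(base);
    extract_hrefs(html)
        .iter()
        .filter_map(|h| absolutize(base, h))
        .filter(|u| base_host.is_some() && host_of(u) == base_host)
        .collect()
}
```

For a page at `https://example.com/` containing links to `/about` and `https://other.org/`, `same_host_links` keeps only `https://example.com/about`. A real implementation would likely delegate resolution to `url::Url::join`, which also covers relative paths and edge cases this sketch skips.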
Project structure
| Module | Role |
|---|---|
| `src/main.rs` | CLI entrypoint using clap + tokio |
| `src/lib.rs` | Reusable crawler API |
| `src/engine.rs` | Crawl orchestration |
| `src/fetch.rs` | HTTP fetch wrapper with reqwest |
| `src/links.rs` | HTML link extraction |
| `src/storage.rs` | File path generation, save HTML |
| `src/url_util.rs` | URL normalization and same-host checks |
| `src/log.rs` | Logging abstraction (stdout + pluggable) |
Usage
Build and run from the project root:
`cargo run --release -- "https://example.com" --out-dir crawl_out`
Short form:
`cargo run --release -- example.com -o crawl_out`
Defaults:
- `out_dir`: `crawl_out`
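The short form passes a bare hostname, so the seed has to be turned into a full URL before fetching. How the project does this is not documented here. The sketch below shows one plausible approach, assuming a bare hostname defaults to `https://`; `normalize_seed` is a hypothetical name.

```rust
/// Turn CLI input into a seed URL.
/// Assumption (not confirmed by the project): bare hostnames default to https.
fn normalize_seed(input: &str) -> Option<String> {
    let trimmed = input.trim();
    if trimmed.is_empty() {
        return None;
    }
    if trimmed.starts_with("http://") || trimmed.starts_with("https://") {
        // Already a full http(s) URL: keep it unchanged.
        Some(trimmed.to_string())
    } else if trimmed.contains("://") {
        // Some other scheme (ftp://, file://, ...): reject in this sketch.
        None
    } else {
        Some(format!("https://{}", trimmed))
    }
}
```

With this sketch, `normalize_seed("example.com")` yields `Some("https://example.com")`, and an input that is already an `http://` or `https://` URL passes through unchanged.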
Output
- `crawl_out/<url_hash>.html`: `url_hash` is derived from the normalized final URL
- Stdout log events include: `seed`, `response`, `fetch`, `save`, `skip_links`, `link_skip`, `fetch_err`, `save_err`
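The hash function behind `url_hash` is not specified in this README. As an illustration of a deterministic URL-to-filename mapping, the sketch below uses 64-bit FNV-1a, which is an assumption chosen because it is short and produces the same value on every run and platform. `fnv1a_64` and `output_path` are hypothetical names.

```rust
use std::path::{Path, PathBuf};

// Standard 64-bit FNV-1a parameters.
const FNV_OFFSET: u64 = 0xcbf29ce484222325;
const FNV_PRIME: u64 = 0x100000001b3;

/// 64-bit FNV-1a: XOR each byte into the hash, then multiply by the prime.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash = FNV_OFFSET;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Build `<out_dir>/<16 hex digits>.html` from a normalized URL.
/// The same URL always maps to the same file name.
fn output_path(out_dir: &Path, normalized_url: &str) -> PathBuf {
    out_dir.join(format!("{:016x}.html", fnv1a_64(normalized_url.as_bytes())))
}
```

Because the file name depends only on the URL, re-crawling the same page overwrites its previous copy instead of creating a new file. A fixed algorithm is used here rather than `std::collections::hash_map::DefaultHasher`, whose output is not guaranteed to stay the same across Rust releases.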
Configuration
No configuration file. CLI args only.
Tests
No test files are currently included. The library is unit-test-friendly via Crawler::with_logger and CrawlConfig.
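A pluggable logger is what makes this kind of code easy to test: a test can swap stdout for an in-memory recorder and assert on the events. The sketch below shows the general pattern only. The `Logger` trait, `MemoryLogger`, and `record_fetch` are hypothetical stand-ins, not the project's actual types, and the real signature of `Crawler::with_logger` may differ.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the logging abstraction in `src/log.rs`.
trait Logger {
    fn log(&self, event: &str, detail: &str);
}

/// Test logger that records events in memory instead of printing them.
#[derive(Clone, Default)]
struct MemoryLogger {
    events: Arc<Mutex<Vec<(String, String)>>>,
}

impl Logger for MemoryLogger {
    fn log(&self, event: &str, detail: &str) {
        self.events
            .lock()
            .unwrap()
            .push((event.to_string(), detail.to_string()));
    }
}

impl MemoryLogger {
    /// Names of all recorded events, in the order they were logged.
    fn event_names(&self) -> Vec<String> {
        self.events
            .lock()
            .unwrap()
            .iter()
            .map(|(name, _)| name.clone())
            .collect()
    }
}

/// Hypothetical code under test: logs a fetch and its response summary.
/// The detail format is invented for this example.
fn record_fetch(logger: &dyn Logger, url: &str, status: u16, bytes: usize) {
    logger.log("fetch", url);
    logger.log("response", &format!("status={} bytes={}", status, bytes));
}
```

A test would create `MemoryLogger::default()`, pass it to the code under test, and compare `event_names()` against the expected sequence, with no network access or stdout capture needed.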
Dependencies
- `reqwest`: HTTP client
- `tokio`: async runtime
- `anyhow`: error handling
- `clap`: CLI
- `url`: URL parsing
Extending
- Add depth control (breadth-first / recursive crawl)
- Add robots.txt + rate limiting
- Add concurrency queue and dedupe URL set
- Add filter rules (patterns, content types)
- Instrument with structured logging / metrics
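As a starting point for the depth-control and dedupe items above, here is a minimal sketch of a breadth-first traversal with a visited set and a depth limit. It is not part of the current crawler. `crawl_order` is a hypothetical name, and the `links_of` callback stands in for "fetch the page and return its same-host links", so the example runs without any network access.

```rust
use std::collections::{HashSet, VecDeque};

/// Breadth-first visit order starting at `seed`, following links up to
/// `max_depth` hops away and visiting each URL at most once.
fn crawl_order<F>(seed: &str, max_depth: usize, mut links_of: F) -> Vec<String>
where
    F: FnMut(&str) -> Vec<String>,
{
    let mut seen: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<(String, usize)> = VecDeque::new();
    let mut order = Vec::new();

    seen.insert(seed.to_string());
    queue.push_back((seed.to_string(), 0));

    while let Some((url, depth)) = queue.pop_front() {
        order.push(url.clone());
        if depth == max_depth {
            continue; // depth limit reached: do not expand this page
        }
        for next in links_of(&url) {
            // `insert` returns false for URLs already seen, which dedupes the queue.
            if seen.insert(next.clone()) {
                queue.push_back((next, depth + 1));
            }
        }
    }
    order
}
```

With `max_depth = 1` this reproduces the current behavior described earlier (the homepage plus the pages it links to). Larger values go deeper, and the `seen` set keeps link cycles from causing repeat fetches.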
Notes
The crawler is intentionally simple. It does not enforce politeness controls by default.