Scrape politely: honor robots.txt and a crawl delay (the part most skip)
Gate any scraper behind a robots.txt check and a crawl delay so you only fetch what a site allows, at a rate it allows, using nothing but the Python standard library.
CI-verified, 2/2 fixtures passing.
Intended Use
Anyone running a scraper who wants to stay on the right side of a site. CI runs the gate against a fixture robots.txt and asserts an allowed path passes, a disallowed path is blocked, and the declared crawl-delay is read and honored. Pure stdlib, no network, no install. Pointing it at a live site is fenced.
Not for
- Treating robots.txt as the only rule. A site's terms of service can forbid scraping even where robots.txt is silent; this gate is necessary, not sufficient.
- High-rate crawling. The whole point is to go at the pace the site declares, not faster.
The Stack
Tested Against
python@3.12 (urllib.robotparser, stdlib)
Side effects & data flow
- Network: none, local only
- Writes: ./robots.txt
- Credentials: none required
Prerequisites
- Python 3
- A scraper to wrap (Scrapy, Crawl4AI, requests) for real use
Steps
- 1
Build the permission-and-pace gate and test it on a fixture
Parse a robots.txt with urllib.robotparser, then check an allowed path passes, a disallowed path is blocked, and the crawl-delay is read so your scraper can wait between requests. CI runs this against a fixture. In real use you point set_url at the live robots.txt, which is the fenced step.
cat > robots.txt <<'TXT'
User-agent: *
Disallow: /private/
Allow: /public/
Crawl-delay: 2
TXT
python3 - <<'PY'
import sys
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
with open("robots.txt") as f:
    rp.parse(f.read().splitlines())
rp.modified()  # mark as read so can_fetch / crawl_delay are live

ua = "FlowStacksBot"
allowed = rp.can_fetch(ua, "https://site.example/public/page")
blocked = rp.can_fetch(ua, "https://site.example/private/secret")
delay = rp.crawl_delay(ua) or 0

if not allowed:
    print("BAD: an allowed path was reported as blocked")
    sys.exit(1)
if blocked:
    print("BAD: a disallowed path was not blocked")
    sys.exit(1)
if delay < 1:
    print("BAD: crawl-delay was not read / honored")
    sys.exit(1)
print("scraping OK: robots.txt respected (allowed /public, blocked /private) and crawl-delay of " + str(int(delay)) + "s honored")
PY
- 2
Wrap your real scraper (the network step, not checked by CI)
Point the parser at the live site's robots.txt (rp.set_url + rp.read()), call can_fetch before every URL, and sleep the crawl-delay between requests. Scrapy has ROBOTSTXT_OBEY and DOWNLOAD_DELAY for the same thing. The live fetches are fenced.
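The wrapper described in this step might look like the sketch below. It is a minimal illustration, not a definitive implementation: PoliteGate, live_parser, and default_delay are hypothetical names introduced here, and the fallback of 1 second when a site declares no Crawl-delay is an assumption. The demo at the bottom stays offline by parsing the same rules as the step 1 fixture and passing in a fake fetch, sleep, and clock; for a real site you would build the parser with live_parser and pass your scraper's own fetch function.

```python
import time
from urllib.robotparser import RobotFileParser


class PoliteGate:
    """Hypothetical wrapper: asks robots.txt for permission, then paces requests."""

    def __init__(self, parser, user_agent, fetch, default_delay=1.0,
                 sleep=time.sleep, clock=time.monotonic):
        self.parser = parser
        self.user_agent = user_agent
        self.fetch = fetch  # any callable that takes a URL, e.g. your scraper's fetch
        # Assumption: fall back to default_delay when no Crawl-delay is declared.
        self.delay = parser.crawl_delay(user_agent) or default_delay
        self._sleep = sleep
        self._clock = clock
        self._last = None  # time of the previous allowed request

    def get(self, url):
        # Check 1: is this path allowed for our user agent?
        if not self.parser.can_fetch(self.user_agent, url):
            return None  # blocked by robots.txt, nothing is fetched
        # Check 2: has the crawl delay elapsed since the last request?
        if self._last is not None:
            wait = self.delay - (self._clock() - self._last)
            if wait > 0:
                self._sleep(wait)
        self._last = self._clock()
        return self.fetch(url)


def live_parser(robots_url):
    """Network step (the fenced part): download and parse a live robots.txt."""
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp


# Offline demo: same rules as the step 1 fixture, with a fake fetch, sleep, and clock.
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/", "Allow: /public/", "Crawl-delay: 2"])
rp.modified()  # mirrors step 1

waits = []  # records what the gate would have slept
gate = PoliteGate(rp, "FlowStacksBot", fetch=lambda url: "fetched " + url,
                  sleep=waits.append, clock=lambda: 0.0)

results = [
    gate.get("https://site.example/public/a"),
    gate.get("https://site.example/private/secret"),
    gate.get("https://site.example/public/b"),
]
print(results, waits)
```

Injecting sleep and clock is a design choice made for testability: the gate can be checked without real waiting, and the defaults (time.sleep, time.monotonic) apply when you leave them out.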
Eval, 2 fixtures
Last passed: verified today
- polite-ok (contains, timeout 30s · max $0). Expected: scraping OK: robots.txt respected (allowed /public, blocked /private) and crawl-delay of 2s honored
- clean-exit (exit_code, timeout 30s · max $0). Expected: 0
Results
The difference between a scraper and a nuisance is two checks: is this path allowed, and am I going slow enough. Python's urllib.robotparser does both with no dependencies, so you can wrap any crawler (Scrapy, Crawl4AI, requests) in a permission-and-pace gate.
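For Scrapy, the same two checks map onto built-in settings rather than a hand-written gate. The fragment below is a sketch for a hypothetical project's settings.py; the delay of 2 is copied by hand from the fixture's Crawl-delay, since DOWNLOAD_DELAY is a fixed value you set yourself, and the user agent name is reused from step 1.

```python
# settings.py fragment for a hypothetical Scrapy project
USER_AGENT = "FlowStacksBot"      # same agent name as in step 1
ROBOTSTXT_OBEY = True             # permission check: drop requests robots.txt disallows
DOWNLOAD_DELAY = 2                # pace check: seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = False  # keep each wait at the full delay instead of a randomized fraction
```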