Automation · Open Source · Free · Active · Machine-verified · beginner · ~10 min setup

Scrape politely: honor robots.txt and a crawl delay (the part most skip)

Gate any scraper behind a robots.txt check and a crawl delay so you only fetch what a site allows, at a rate it allows, using nothing but the Python standard library.

by Shilpa Mitra · verified today · v1.0.0

CI-verified, 2/2 fixtures passing.

Intended Use

Anyone running a scraper who wants to stay on the right side of a site. CI runs the gate against a fixture robots.txt and asserts that an allowed path passes, a disallowed path is blocked, and the declared crawl-delay is read so the scraper can honor it. Pure stdlib, no network, no install. Pointing it at a live site is a fenced step that CI does not run.

Not for

  • Treating robots.txt as the only rule: a site's terms of service can forbid scraping even where robots.txt is silent, so this gate is necessary, not sufficient
  • High-rate crawling: the whole point is to go at the pace the site declares, not faster

The Stack

Tested Against

python@3.12 (urllib.robotparser, stdlib)

Side effects & data flow

Network: none, local only
Writes: ./robots.txt
Credentials: none required

Prerequisites

  • Python 3
  • A scraper to wrap (Scrapy, Crawl4AI, requests) for real use

Steps

  1. Build the permission-and-pace gate and test it on a fixture

    Parse a robots.txt with urllib.robotparser, then check an allowed path passes, a disallowed path is blocked, and the crawl-delay is read so your scraper can wait between requests. CI runs this against a fixture. In real use you point set_url at the live robots.txt, which is the fenced step.

    cat > robots.txt <<'TXT'
    User-agent: *
    Disallow: /private/
    Allow: /public/
    Crawl-delay: 2
    TXT
    python3 - <<'PY'
    import sys
    from urllib.robotparser import RobotFileParser
    
    rp = RobotFileParser()
    with open("robots.txt") as f:
        rp.parse(f.read().splitlines())
    rp.modified()  # mark as read so can_fetch / crawl_delay are live
    
    ua = "FlowStacksBot"
    # can_fetch returns True when the agent MAY fetch the URL
    public_ok = rp.can_fetch(ua, "https://site.example/public/page")
    private_ok = rp.can_fetch(ua, "https://site.example/private/secret")
    delay = rp.crawl_delay(ua) or 0  # crawl_delay is None when none is declared
    
    if not public_ok:
        print("BAD: an allowed path was reported as blocked")
        sys.exit(1)
    if private_ok:
        print("BAD: a disallowed path was not blocked")
        sys.exit(1)
    if delay < 1:
        print("BAD: crawl-delay was not read / honored")
        sys.exit(1)
    
    print("scraping OK: robots.txt respected (allowed /public, blocked /private) and crawl-delay of " + str(int(delay)) + "s honored")
    PY
  2. Wrap your real scraper (the network step, not checked by CI)

    Point the parser at the live site's robots.txt (rp.set_url + rp.read()), call can_fetch before every URL, and sleep for the crawl-delay between requests. Scrapy has ROBOTSTXT_OBEY and DOWNLOAD_DELAY for the same purpose. The live fetches are fenced: CI never runs them.
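
A minimal sketch of that wrapper, under stated assumptions rather than as the recipe's own implementation. PoliteGate and load_live_parser are hypothetical names, and fetch is whatever callable your scraper uses to get a URL. The sleep and clock parameters exist only so the pacing can be tested without waiting. Only load_live_parser touches the network.

```python
import time
from urllib.robotparser import RobotFileParser


class PoliteGate:
    """Hypothetical permission-and-pace gate around any fetch callable."""

    def __init__(self, parser, user_agent, fetch,
                 sleep=time.sleep, clock=time.monotonic):
        self.parser = parser
        self.user_agent = user_agent
        self.fetch = fetch      # e.g. requests.get, or your crawler's fetch
        self.sleep = sleep
        self.clock = clock
        # crawl_delay() returns None when robots.txt declares no delay.
        self.delay = parser.crawl_delay(user_agent) or 0
        self._last = None       # clock reading when the last fetch finished

    def get(self, url):
        # Permission check: a disallowed URL never reaches fetch().
        if not self.parser.can_fetch(self.user_agent, url):
            return None
        # Pace check: wait out whatever remains of the crawl-delay.
        if self._last is not None:
            remaining = self.delay - (self.clock() - self._last)
            if remaining > 0:
                self.sleep(remaining)
        try:
            return self.fetch(url)
        finally:
            self._last = self.clock()


def load_live_parser(robots_url):
    """Fenced step: performs a real HTTP request for robots.txt."""
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp
```

In real use, something like PoliteGate(load_live_parser("https://site.example/robots.txt"), "FlowStacksBot", requests.get) would gate each request; that call hits the network, so it stays outside CI.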

Eval, 2 fixtures

Last passed: verified today
  • polite-ok · contains · timeout 30s · max $0

    Expected: scraping OK: robots.txt respected (allowed /public, blocked /private) and crawl-delay of 2s honored

  • clean-exit · exit_code · timeout 30s · max $0

    Expected: 0

Results

The difference between a scraper and a nuisance is two checks: is this path allowed, and am I going slowly enough? Python's urllib.robotparser answers both with no dependencies, so you can wrap any crawler (Scrapy, Crawl4AI, requests) in a permission-and-pace gate.
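
For Scrapy, the same two checks are configuration rather than code. A sketch of a settings.py fragment using Scrapy's ROBOTSTXT_OBEY and DOWNLOAD_DELAY settings. The agent name and the 2-second value are assumptions copied by hand from the fixture; for a real site, use the Crawl-delay it declares.

```python
# settings.py (sketch): Scrapy's built-in equivalents of the gate
USER_AGENT = "FlowStacksBot"      # assumption: same agent name as the fixture
ROBOTSTXT_OBEY = True             # skip URLs that robots.txt disallows
DOWNLOAD_DELAY = 2                # seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = False  # otherwise Scrapy waits a random 0.5x to 1.5x of the delay
```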
