Web scraping & data extraction
Pull clean, structured data off the web, the right way.
Open-source crawlers, browser agents, and converters for turning pages into LLM-ready data, plus the politeness layer most guides skip. Scrape only what a site allows: respect robots.txt, terms of service, and rate limits. Every recipe here verifies its extraction logic against a local fixture in CI; the live crawl is always yours to run.
Start here
Scrape politely: honor robots.txt and a crawl delay (the part most skip)
Gate any scraper behind a robots.txt check and a crawl delay so you only fetch what a site allows, at a rate it allows, using nothing but the Python standard library.
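A minimal sketch of that gate, assuming a hypothetical user agent name (`example-bot`), an inline robots.txt fixture instead of a live fetch, and a caller-supplied `fetch` function. It uses only `urllib.robotparser` and `time` from the standard library.

```python
import time
import urllib.robotparser

USER_AGENT = "example-bot"  # hypothetical name; use your crawler's real one

# Local fixture standing in for a site's robots.txt, so nothing is fetched here.
FIXTURE = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> urllib.robotparser.RobotFileParser:
    """Parse robots.txt text that has already been downloaded (or a fixture)."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser, user_agent: str, default: float = 1.0) -> float:
    """Use the site's Crawl-delay when it declares one, else a default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def allowed_urls(parser, user_agent: str, urls):
    """Keep only the URLs robots.txt permits for this user agent."""
    return [u for u in urls if parser.can_fetch(user_agent, u)]


def crawl(parser, user_agent: str, urls, fetch, sleep=time.sleep):
    """Fetch permitted URLs one at a time, pausing between requests."""
    delay = polite_delay(parser, user_agent)
    results = {}
    for url in allowed_urls(parser, user_agent, urls):
        results[url] = fetch(url)
        sleep(delay)
    return results


parser = build_parser(FIXTURE)
urls = ["https://example.com/docs/", "https://example.com/private/data"]
to_fetch = allowed_urls(parser, USER_AGENT, urls)
delay = polite_delay(parser, USER_AGENT)
```

Against a live site, the parser would typically be filled with `set_url(...)` followed by `read()` instead of the fixture. Passing `sleep` as a parameter keeps the gate testable without real waiting.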
Firecrawl: turn a page into the exact JSON you asked for
Author a Firecrawl extract request that returns schema-structured JSON (not just markdown) and validate the request shape before you spend a crawl on it.
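One way to check a request's shape before sending it is a small standard-library validator. This is a sketch under assumptions: it treats the request body as a dict with `urls`, an optional `prompt`, and a JSON Schema under `schema`, which should be confirmed against the current Firecrawl API reference. The helper `validate_extract_request` and the sample fields are hypothetical.

```python
import json


def validate_extract_request(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right.

    Assumed shape (verify against the Firecrawl docs): 'urls' is a non-empty
    list of http(s) strings and 'schema' is a JSON Schema of type 'object'.
    """
    problems = []

    urls = payload.get("urls")
    if not isinstance(urls, list) or not urls:
        problems.append("urls must be a non-empty list")
    elif not all(
        isinstance(u, str) and u.startswith(("http://", "https://")) for u in urls
    ):
        problems.append("every url must be an http(s) string")

    schema = payload.get("schema")
    if not isinstance(schema, dict):
        problems.append("schema must be a JSON Schema object")
    else:
        properties = schema.get("properties")
        if schema.get("type") != "object":
            problems.append("schema.type must be 'object'")
        if not isinstance(properties, dict) or not properties:
            problems.append("schema.properties must be a non-empty object")
        else:
            for name in schema.get("required", []):
                if name not in properties:
                    problems.append(f"required field '{name}' is not in properties")

    try:
        json.dumps(payload)  # the body must serialize before it can be sent
    except (TypeError, ValueError):
        problems.append("payload is not JSON-serializable")

    return problems


# Hypothetical request: the field names inside 'schema' are illustrative only.
request = {
    "urls": ["https://example.com/pricing"],
    "prompt": "Extract the plan name and monthly price.",
    "schema": {
        "type": "object",
        "properties": {
            "plan": {"type": "string"},
            "monthly_price": {"type": "number"},
        },
        "required": ["plan", "monthly_price"],
    },
}

problems = validate_extract_request(request)
```

Returning a list of problems rather than raising on the first one lets a CI job report every shape error in a single run.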
Crawl4AI: a page to clean, LLM-ready markdown (no API key)
Write a Crawl4AI run script that turns a page into clean markdown with a cache mode set, and verify the script is valid and shaped right before you point it at a site.
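The "valid and shaped right" check can be sketched with Python's `ast` module, which parses a script without importing or running it. The run script below is an assumption about typical Crawl4AI usage (`AsyncWebCrawler`, `CrawlerRunConfig`, `CacheMode`). The helper `check_script` and the two shape rules it enforces are hypothetical.

```python
import ast

# A run script held as text so it can be checked without crawl4ai installed.
SCRIPT = '''
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig


async def main():
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print(result.markdown)


asyncio.run(main())
'''


def call_name(node: ast.Call) -> str:
    """Return the called name for 'f(...)' or the attribute for 'obj.f(...)'."""
    func = node.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        return func.attr
    return ""


def check_script(source: str) -> list[str]:
    """Return shape problems; an empty list means the script passes."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    calls = [n for n in ast.walk(tree) if isinstance(n, ast.Call)]
    problems = []

    # Rule 1: some CrawlerRunConfig(...) call sets cache_mode explicitly.
    configs = [c for c in calls if call_name(c) == "CrawlerRunConfig"]
    if not any(kw.arg == "cache_mode" for c in configs for kw in c.keywords):
        problems.append("no CrawlerRunConfig call sets cache_mode")

    # Rule 2: the script actually calls arun(...) somewhere.
    if not any(call_name(c) == "arun" for c in calls):
        problems.append("script never calls arun")

    return problems


problems = check_script(SCRIPT)
```

Because the check works on the syntax tree, it can run in CI without a browser, a network connection, or the crawler library itself.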
The tools (9)
Points at a website, crawls its pages, renders JavaScript, and returns LLM-ready markdown or schema-structured JSON. Self-hostable or hosted API. License is AGPL-3.0 (SDKs MIT), so the core is strong copyleft, not permissive.
Open-source async crawler that turns a site into clean, LLM-ready markdown with no API key or account. Apache-2.0; uses a headless browser, so it handles JS-rendered pages.
An AI agent that drives a real browser like a person, clicking, scrolling, logging in, and filling forms, to reach data a plain crawler cannot. MIT. Pairs an LLM with a browser harness.
Professional scraping framework (Node/TS first, with a younger Python port): rotating proxies, automatic retries, browser fingerprinting, and request-queue management to keep crawls from getting blocked. Apache-2.0.
The veteran industrial-strength Python scraping framework: crawl millions of pages, extract with selectors, and export clean data. BSD-3-Clause, battle-tested for over a decade.
Microsoft's utility that converts files and pages (PDF, Office docs, HTML, images) into clean markdown an LLM can use. MIT. Note: it converts content you already have; it is not a web crawler.
Adaptive Python scraper that re-finds elements when a page's layout changes, with optional stealth (fingerprint spoofing, anti-bot bypass). BSD-3-Clause. The stealth features are dual-use, so use them only on sites whose terms and robots.txt allow it.
A curl build whose TLS and HTTP/2 handshake matches a real browser's fingerprint, so requests are not rejected for not looking like a browser. MIT. Dual-use; the actively maintained successor most people use now is curl_cffi (Python). Respect each site's terms and robots.txt.
Show it one example of what you want and it learns the pattern and extracts the rest, no selectors. MIT. Note: effectively unmaintained (last release 2022), so best for simple static pages.
Every recipe here ships with a CI badge that re-checks its extraction logic on each push. If a setup you bookmark stops working, the badge goes red before you do.