Web scraping & data extraction

Pull clean, structured data off the web, the right way.

Open-source crawlers, browser agents, and converters for turning pages into LLM-ready data, plus the politeness layer most guides skip. Scrape only what a site allows: respect robots.txt, terms of service, and rate limits. Every recipe here verifies its extraction logic against a local fixture in CI; running the live crawl is always up to you.

Start here

Automation · Free · Machine-verified

Scrape politely: honor robots.txt and a crawl delay (the part most skip)

Gate any scraper behind a robots.txt check and a crawl delay so you only fetch what a site allows, at a rate it allows, using nothing but the Python standard library.

Scrapy · beginner
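
A minimal sketch of the gate described above, using only `urllib.robotparser` and `time` from the standard library. The user-agent string, the one-second fallback delay, and the `PoliteGate` class name are assumptions made for illustration, not part of the recipe itself.

```python
import time
import urllib.robotparser

USER_AGENT = "example-recipe-bot"  # hypothetical; use a name that identifies your crawler
DEFAULT_DELAY = 1.0  # assumed fallback in seconds when robots.txt sets no Crawl-delay


def parse_robots(robots_txt):
    """Build a parser from robots.txt text (no network, handy for fixtures)."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def fetch_robots(robots_url):
    """Download and parse a live robots.txt, e.g. https://example.com/robots.txt."""
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(robots_url)
    parser.read()
    return parser


class PoliteGate:
    """Allow a fetch only if robots.txt permits it, spacing fetches by the crawl delay."""

    def __init__(self, parser, user_agent=USER_AGENT,
                 sleep=time.sleep, clock=time.monotonic):
        self.parser = parser
        self.user_agent = user_agent
        self.sleep = sleep
        self.clock = clock
        declared = parser.crawl_delay(user_agent)  # None if the file sets no delay
        self.delay = float(declared) if declared is not None else DEFAULT_DELAY
        self._last = None  # time of the previous permitted fetch

    def permit(self, url):
        """Return False for disallowed URLs; otherwise wait out the delay and return True."""
        if not self.parser.can_fetch(self.user_agent, url):
            return False
        if self._last is not None:
            remaining = self.delay - (self.clock() - self._last)
            if remaining > 0:
                self.sleep(remaining)
        self._last = self.clock()
        return True
```

In a scraper, call `gate.permit(url)` before every request and skip the URL when it returns `False`. Passing `sleep` and `clock` in as parameters is a design choice that lets a test check the waiting behavior without actually sleeping.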

Automation · Free · Machine-verified

Firecrawl: turn a page into the exact JSON you asked for

Author a Firecrawl extract request that returns schema-structured JSON (not just markdown) and validate the request shape before you spend a crawl on it.

Firecrawl · beginner
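
A hedged sketch of the "validate the request shape first" idea. It checks a request body locally and never contacts Firecrawl. The field names `urls`, `prompt`, and `schema` are an assumption about what an extract request looks like; confirm them against the current Firecrawl API reference before relying on them. `validate_extract_request` is a hypothetical helper, not part of any SDK.

```python
from __future__ import annotations

from urllib.parse import urlparse


def validate_extract_request(payload: dict) -> list[str]:
    """Return a list of problems with the request body; an empty list means it looks sane."""
    errors = []

    # Assumed field: a non-empty list of http(s) URLs to extract from.
    urls = payload.get("urls")
    if not isinstance(urls, list) or not urls:
        errors.append("urls must be a non-empty list")
    else:
        for url in urls:
            parts = urlparse(url) if isinstance(url, str) else None
            if parts is None or parts.scheme not in ("http", "https") or not parts.netloc:
                errors.append(f"not an http(s) URL: {url!r}")

    # Assumed field: a JSON Schema object describing the JSON you want back.
    schema = payload.get("schema")
    if not isinstance(schema, dict):
        errors.append("schema must be a JSON Schema object")
        return errors
    if schema.get("type") != "object":
        errors.append("schema.type must be 'object'")
    properties = schema.get("properties")
    if not isinstance(properties, dict) or not properties:
        errors.append("schema.properties must be a non-empty object")
        properties = {}
    for name in schema.get("required", []):
        if name not in properties:
            errors.append(f"required field {name!r} is not in properties")
    return errors


# Example request body (illustrative values only).
request_body = {
    "urls": ["https://example.com/product"],
    "prompt": "Extract the product title and price.",
    "schema": {
        "type": "object",
        "properties": {"title": {"type": "string"}, "price": {"type": "number"}},
        "required": ["title"],
    },
}
```

Running `validate_extract_request(request_body)` before sending anything catches a misspelled required field or a missing schema without spending a crawl.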

RAG · Free · Machine-verified

Crawl4AI: a page to clean, LLM-ready markdown (no API key)

Write a Crawl4AI run script that turns a page into clean markdown with a cache mode set, and verify the script is valid and shaped right before you point it at a site.

Crawl4AI · beginner
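
One way to check a run script before pointing it at a site is to parse it without executing it. The sketch below holds a small Crawl4AI script as a string and inspects it with the standard-library `ast` module. The script uses `AsyncWebCrawler`, `CrawlerRunConfig`, and `CacheMode` as found in recent Crawl4AI releases; treat the exact names as an assumption to confirm for your installed version. `check_script` and its three checks are hypothetical, chosen only for illustration.

```python
import ast

# The run script under test. It is only parsed here, never executed,
# so Crawl4AI does not need to be installed for the check to run.
RUN_SCRIPT = '''
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig


async def main():
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print(result.markdown)


asyncio.run(main())
'''


def check_script(source):
    """Return a list of problems; an empty list means the script parses and is shaped right."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]

    nodes = list(ast.walk(tree))
    problems = []

    # 1. The crawler class is imported from the crawl4ai package.
    imported = {
        alias.name
        for node in nodes
        if isinstance(node, ast.ImportFrom) and node.module == "crawl4ai"
        for alias in node.names
    }
    if "AsyncWebCrawler" not in imported:
        problems.append("AsyncWebCrawler is not imported from crawl4ai")

    # 2. Some call passes an explicit cache_mode keyword.
    calls = [node for node in nodes if isinstance(node, ast.Call)]
    if not any(kw.arg == "cache_mode" for call in calls for kw in call.keywords):
        problems.append("no call sets cache_mode")

    # 3. Something calls .arun(...), the method that performs the crawl.
    if not any(isinstance(c.func, ast.Attribute) and c.func.attr == "arun" for c in calls):
        problems.append("no .arun(...) call found")

    return problems
```

A static check like this only confirms syntax and shape. It says nothing about whether the crawl succeeds, which is why the live run stays a separate, deliberate step.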

The tools (9)

Firecrawl · Open Source

Point it at a website and it crawls the pages, renders JavaScript, and returns LLM-ready markdown or schema-structured JSON. Self-hostable or available as a hosted API. The core is licensed AGPL-3.0 (the SDKs are MIT), so it is strong copyleft, not permissive.

Framework · Free · 1 workflow

Crawl4AI · Open Source

Open-source async crawler that turns a site into clean, LLM-ready markdown with no API key or account. Apache-2.0; uses a headless browser, so it handles JS-rendered pages.

Library · Free · 1 workflow

Browser Use · Open Source

An AI agent that drives a real browser like a person, clicking, scrolling, logging in, and filling forms, to reach data a plain crawler cannot. MIT. Pairs an LLM with a browser harness.

Framework · Free

Crawlee · Open Source

Professional scraping framework (Node/TS first, with a younger Python port): rotating proxies, automatic retries, browser fingerprinting, and request-queue management to keep crawls from getting blocked. Apache-2.0.

Framework · Free

Scrapy · Open Source

The veteran industrial-strength Python scraping framework: crawl millions of pages, extract with selectors, and export clean data. BSD-3-Clause, battle-tested for over a decade.

Framework · Free · 1 workflow

MarkItDown · Open Source

Microsoft's utility that converts files and pages (PDF, Office docs, HTML, images) into clean markdown an LLM can use. MIT. Note: it converts content you already have; it is not a web crawler.

Library · Free

Scrapling · Open Source

Adaptive Python scraper that re-finds elements when a page's layout changes, with optional stealth (fingerprint spoofing, anti-bot bypass). BSD-3-Clause. The stealth features are dual-use, so use them only on sites whose terms and robots.txt allow automated access.

Library · Free

curl-impersonate · Open Source

A curl build whose TLS and HTTP/2 handshakes match a real browser's fingerprint, so requests are not rejected for failing to look like a browser. MIT. Dual-use; the actively maintained successor most people use now is curl_cffi (Python). Respect each site's terms and robots.txt.

Infra · Free

AutoScraper · Open Source

Show it one example of what you want and it learns the pattern and extracts the rest, with no selectors to write. MIT. Note: effectively unmaintained (last release 2022), so best for simple static pages.

Library · Free

Every recipe here ships with a CI badge that re-checks its extraction logic on each push. If a setup you bookmark stops working, the badge goes red before you do.
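
As a rough illustration of what "re-checks its extraction logic" can mean, the sketch below runs a small extractor against an HTML fixture held in a string, with no network access. The fixture markup, the `item` class name, and `extract_items` are all made up for this example; the recipes' actual fixtures and checks may differ.

```python
from html.parser import HTMLParser

# A tiny local fixture standing in for a saved copy of a real page (hypothetical markup).
FIXTURE = """<html><body>
<h1>Widgets</h1>
<ul>
  <li class="item"><a href="/w/1">Red widget</a></li>
  <li class="item"><a href="/w/2">Blue widget</a></li>
</ul>
<a href="/about">About</a>
</body></html>"""


class ItemLinkParser(HTMLParser):
    """Collect the link text and href of each <a> inside <li class="item">."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and "item" in (attrs.get("class") or "").split():
            self._in_item = True
        elif tag == "a" and self._in_item:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Only collect text while inside a link that belongs to an item.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.items.append({"title": "".join(self._text).strip(), "url": self._href})
            self._href = None
        elif tag == "li":
            self._in_item = False


def extract_items(html):
    parser = ItemLinkParser()
    parser.feed(html)
    parser.close()
    return parser.items


def test_extract_items_against_fixture():
    # The kind of assertion a CI job could run on every push.
    assert extract_items(FIXTURE) == [
        {"title": "Red widget", "url": "/w/1"},
        {"title": "Blue widget", "url": "/w/2"},
    ]
```

Because the fixture is local, a failure here points at the extraction logic rather than at a site being slow, down, or changed.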
