Web scraping & data extraction

Pull clean, structured data off the web, the right way.

Open-source crawlers, browser agents, and converters for turning pages into LLM-ready data, plus the politeness layer most guides skip. Scrape only what a site allows: respect robots.txt, terms of service, and rate limits. Every recipe here verifies its extraction logic against a local fixture in CI; the live crawl is always yours to run.
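
As an illustration of the politeness layer, here is a minimal sketch that uses only the Python standard library. It parses a robots.txt body, filters a URL list down to what the rules allow, and picks a delay between requests. The robots.txt text, the `example-bot` user agent, the helper names, and the 1-second fallback delay are all assumptions made for the example, not part of any recipe in this collection. Terms of service still have to be read by a person.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body. In a real crawl you would fetch the
# site's own /robots.txt instead of hard-coding it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "example-bot"  # hypothetical user-agent name


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network call)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, urls, user_agent: str = USER_AGENT):
    """Keep only the URLs the robots.txt rules allow for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def polite_delay(parser: RobotFileParser, user_agent: str = USER_AGENT,
                 default: float = 1.0) -> float:
    """Use the site's Crawl-delay if it declares one, else a fallback delay."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    candidates = [
        "https://example.com/blog/post",
        "https://example.com/private/data",
    ]
    print(allowed_urls(parser, candidates))
    print(polite_delay(parser))
```

In a crawl loop you would sleep for `polite_delay(...)` seconds between requests to the same host and skip anything `allowed_urls` filters out.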


The tools (9)

Firecrawl · Open Source

Point it at a website and it crawls the pages, renders JavaScript, and returns LLM-ready markdown or schema-structured JSON. Available self-hosted or as a hosted API. The core is licensed AGPL-3.0 (the SDKs are MIT), which is strong copyleft, not permissive.

Framework · Free · 1 workflow

Crawl4AI · Open Source

Open-source async crawler that turns a site into clean, LLM-ready markdown with no API key or account. Apache-2.0; uses a headless browser, so it handles JS-rendered pages.

Library · Free · 1 workflow

Browser Use · Open Source

An AI agent that drives a real browser like a person, clicking, scrolling, logging in, and filling forms, to reach data a plain crawler cannot. MIT. Pairs an LLM with a browser harness.

Framework · Free

Crawlee · Open Source

Professional scraping framework (Node/TS first, with a younger Python port): rotating proxies, automatic retries, browser fingerprinting, and request-queue management to keep crawls from getting blocked. Apache-2.0.

Framework · Free

Scrapy · Open Source

The veteran industrial-strength Python scraping framework: crawl millions of pages, extract with selectors, and export clean data. BSD-3-Clause, battle-tested for over a decade.

Framework · Free · 1 workflow

MarkItDown · Open Source

Microsoft's utility that converts files and pages (PDF, Office docs, HTML, images) into clean markdown an LLM can use. MIT. Note: it converts content you already have; it is not a web crawler.

Library · Free

Scrapling · Open Source

Adaptive Python scraper that re-finds elements when a page's layout changes, with optional stealth (fingerprint spoofing, anti-bot bypass). BSD-3-Clause. The stealth features are dual-use, so use them only on sites whose terms and robots.txt allow it.

Library · Free

curl-impersonate · Open Source

A curl build whose TLS and HTTP/2 handshake matches a real browser's fingerprint, so requests are not rejected for not looking like a browser. MIT. Dual-use. Many Python users now reach for curl_cffi, an actively maintained binding built on a fork of curl-impersonate. Respect each site's terms and robots.txt.

Infra · Free

AutoScraper · Open Source

Show it one example of what you want and it learns the pattern and extracts the rest, no selectors. MIT. Note: effectively unmaintained (last release 2022), so best for simple static pages.

Library · Free

Every recipe here ships with a CI badge that re-checks its extraction logic on each push. If a setup you bookmark stops working, the badge goes red before you do.
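
A fixture check of this kind might look roughly like the sketch below: extraction logic runs against a saved HTML snippet, never the live site, so CI stays fast and makes no network calls. The fixture markup, the `title` and `price` class names, and the `extract` helper are hypothetical and stand in for whatever a given recipe actually parses. A real recipe would typically load the fixture from a file in the repository.

```python
from html.parser import HTMLParser

# Hypothetical fixture, inlined here so the example is self-contained.
FIXTURE_HTML = """
<html><body>
  <h1 class="title">Example product</h1>
  <span class="price">19.99</span>
</body></html>
"""


class FieldExtractor(HTMLParser):
    """Collect the text of elements whose class attribute is a wanted field."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None  # field currently being read, if any
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.wanted:
            self.current = cls

    def handle_data(self, data):
        if self.current is not None:
            previous = self.fields.get(self.current, "")
            self.fields[self.current] = previous + data.strip()

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current field,
        # so nested markup inside a field is not supported.
        self.current = None


def extract(html: str, wanted=("title", "price")) -> dict:
    """Run the extractor over an HTML string and return the found fields."""
    parser = FieldExtractor(wanted)
    parser.feed(html)
    return parser.fields


def test_extract_against_fixture():
    """The check CI would run on each push (for example via pytest)."""
    assert extract(FIXTURE_HTML) == {"title": "Example product", "price": "19.99"}
```

If the site later changes its markup, the fixture test keeps passing until you refresh the fixture, so it guards the extraction logic, not the live page.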

