Crawl4AI: a page to clean, LLM-ready markdown (no API key)
Write a Crawl4AI run script that turns a page into clean markdown with a cache mode set, and verify the script is valid and shaped right before you point it at a site.
CI-verified, 2/2 fixtures passing.
Intended Use
For anyone building an LLM/RAG pipeline that needs clean markdown from pages they are allowed to scrape. CI compiles the run script (python3 py_compile, so no crawl4ai install is needed) and asserts that it uses AsyncWebCrawler.arun to produce markdown with a cache mode set. CI uses no browser, no network, and no fetch. The live crawl is fenced: it runs only on your machine, outside CI.
Not for
- Crawling sites whose terms or robots.txt forbid it. Set a cache mode and a delay, and stay within what the site allows.
- Expecting CI to prove the markdown is good. CI proves only that the script is valid and shaped right; the fetch and its output are the fenced step.
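One way to act on the robots.txt point above before crawling is to test a URL against the site's rules with Python's standard-library `urllib.robotparser`. This is a minimal sketch, not part of the recipe: the helper name `allowed_by_robots` and the `SAMPLE_ROBOTS` text are hypothetical, and the rules are parsed from a string you have already obtained, so the example itself touches no network. It says nothing about a site's terms of service, which still need reading by a human.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url.

    Hypothetical helper for illustration; it parses text you already have
    and performs no network request.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, standing in for a real site's robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    print(allowed_by_robots(SAMPLE_ROBOTS, "https://example.com/docs"))       # True
    print(allowed_by_robots(SAMPLE_ROBOTS, "https://example.com/private/x"))  # False
```

If the check returns False for the page you planned to crawl, the fenced step should not be run against it.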
The Stack
Tested Against
crawl4ai docs (2026-06)
python@3.12 (py_compile, stdlib)
Side effects & data flow
- Network: none, local only
- Writes: ./crawl.py
- Credentials: none required
Prerequisites
- pip install crawl4ai + its headless browser (only to actually crawl)
Steps
- 1
Write the run script and structure-check it
Write crawl.py using AsyncWebCrawler with a cache mode (politeness and repeatability) that prints result.markdown. CI compiles it and checks it uses the right calls. Running it needs crawl4ai and a browser installed, so that step is fenced.
cat > crawl.py <<'PY'
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            cache_mode=CacheMode.ENABLED,
        )
        print(result.markdown)

asyncio.run(main())
PY

python3 - <<'CHECK'
import py_compile, sys

src = open("crawl.py").read()
try:
    py_compile.compile("crawl.py", doraise=True)
except py_compile.PyCompileError:
    print("BAD: crawl.py does not compile"); sys.exit(1)

def need(tok, msg):
    if tok not in src:
        print("BAD: " + msg); sys.exit(1)

need("AsyncWebCrawler", "script does not use crawl4ai AsyncWebCrawler")
need(".arun(", "script never runs a crawl (arun)")
need("markdown", "script does not request markdown output")
need("CacheMode", "script sets no cache mode (politeness / repeatability)")
print("config OK: crawl4ai script compiles and uses AsyncWebCrawler.arun to produce markdown with a cache mode set")
CHECK
- 2
Run the crawl (the browser/network step, not checked by CI)
pip install crawl4ai, install its browser, and run the script against a site you are allowed to crawl. It renders the page and prints clean markdown ready to chunk into a knowledge base. The fetch is fenced.
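The "ready to chunk" step is not spelled out in the recipe. A minimal sketch of one possible approach follows: split the markdown at headings, then cap each piece by character count. The function name `chunk_markdown` and the `max_chars` default are assumptions made for illustration, and the input is assumed to be a plain string (for example `str(result.markdown)` from the run script).

```python
import re


def chunk_markdown(markdown: str, max_chars: int = 1000) -> list[str]:
    """Split markdown at ATX headings, then cap each chunk at max_chars.

    Hypothetical helper for illustration; not part of crawl4ai.
    """
    # Split just before every line that starts with 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks: list[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Oversized sections are cut on raw character count.
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks


if __name__ == "__main__":
    sample = "# Title\nIntro text.\n\n## Section A\nBody A.\n\n## Section B\nBody B.\n"
    for chunk in chunk_markdown(sample):
        print(repr(chunk))
```

Cutting on character count can split a sentence or a code block mid-way; a real pipeline would likely cut on paragraph boundaries or token counts instead.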
Eval, 2 fixtures
Last passed: verified today
- script-ok (contains), timeout 30s · max $0. Expected: config OK: crawl4ai script compiles and uses AsyncWebCrawler.arun to produce markdown with a cache mode set
- clean-exit (exit_code), timeout 30s · max $0. Expected: 0
Results
Crawl4AI renders a JS-heavy page in a headless browser and hands back clean markdown an LLM can read, with no API key, account, or per-page fee. CI checks the script compiles and uses the right calls; the actual fetch runs on your machine against sites you are allowed to crawl.
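The substring checks in step 1 would also pass on a script that only mentions the tokens in a comment. A stricter variant, sketched here as an assumption rather than as what CI actually runs, parses the script with the standard-library `ast` module and looks for a real `.arun(...)` call that passes a `cache_mode` keyword. The function name `calls_arun_with_cache_mode` is hypothetical.

```python
import ast


def calls_arun_with_cache_mode(source: str) -> bool:
    """True if source contains a call like <obj>.arun(..., cache_mode=...).

    Hypothetical helper for illustration; it inspects syntax only and
    never imports or runs the script.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "arun"
            and any(kw.arg == "cache_mode" for kw in node.keywords)
        ):
            return True
    return False


if __name__ == "__main__":
    good = (
        "async def main(crawler, CacheMode):\n"
        "    return await crawler.arun(url='https://example.com',\n"
        "                              cache_mode=CacheMode.ENABLED)\n"
    )
    comment_only = "# AsyncWebCrawler .arun( markdown CacheMode\n"
    print(calls_arun_with_cache_mode(good))          # True
    print(calls_arun_with_cache_mode(comment_only))  # False
```

This sketch only recognises `cache_mode` passed directly as a keyword to `arun`; a script that sets the cache mode some other way would need a broader check.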