RAG · Open Source · Free · Active · Machine-verified · beginner · ~10 min setup

Crawl4AI: turn a page into clean, LLM-ready markdown (no API key)

Write a Crawl4AI run script that turns a page into clean markdown with a cache mode set, and verify the script is valid and shaped right before you point it at a site.

by Shilpa Mitra · verified today · v1.0.0

CI-verified, 2/2 fixtures passing.

Intended Use

For anyone building an LLM/RAG pipeline that needs clean markdown from pages they are allowed to scrape. CI compiles the run script (python3 py_compile, so no crawl4ai install is needed) and asserts that it uses AsyncWebCrawler.arun to produce markdown with a cache mode set. CI uses no browser, no network, and no fetch; the live crawl is fenced off as a step you run yourself.

Not for

Crawling sites whose terms or robots.txt forbid it. Set a cache mode and a delay, and stay within what the site allows.
Expecting CI to prove the markdown is good. CI proves only that the script is valid and shaped right; the fetch and its output are the fenced step.
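
The robots.txt point above can be checked in code before any crawl. What follows is a minimal sketch, assuming you already hold the text of the site's robots.txt; the helper names allowed_to_fetch and crawl_delay and the SAMPLE rules are hypothetical, and only Python's standard urllib.robotparser is used. It says nothing about a site's terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
SAMPLE = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def _parse(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_to_fetch(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the rules permit user_agent to fetch url."""
    return _parse(robots_txt).can_fetch(user_agent, url)


def crawl_delay(robots_txt: str, user_agent: str = "*"):
    """Return the Crawl-delay in seconds for user_agent, or None if unset."""
    return _parse(robots_txt).crawl_delay(user_agent)


if __name__ == "__main__":
    print(allowed_to_fetch(SAMPLE, "https://example.com/docs/"))
    print(allowed_to_fetch(SAMPLE, "https://example.com/private/page"))
    print(crawl_delay(SAMPLE))
```

A crawl script could call allowed_to_fetch first and sleep for the returned delay between requests; how that is wired into crawl.py is left open here.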

The Stack

Crawl4AI · crawler

Tested Against

crawl4ai docs (2026-06)
python@3.12 (py_compile, stdlib)

Side effects & data flow

Network: none, local only
Writes: ./crawl.py (py_compile also writes a bytecode file under ./__pycache__/)
Credentials: none required

Prerequisites

pip install crawl4ai + its headless browser (only to actually crawl)

Steps

1
Write the run script and structure-check it

Write crawl.py so that it uses AsyncWebCrawler with a cache mode set (for politeness and repeatability) and prints result.markdown. CI compiles the script and checks that it uses the right calls. Running it needs crawl4ai and a browser installed, so that step is fenced.

cat > crawl.py <<'PY'
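import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode


async def main():
    # The context manager starts the crawler and closes it on exit.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            # Reuse cached results on repeat runs instead of refetching.
            cache_mode=CacheMode.ENABLED,
        )
        # Markdown rendering of the crawled page.
        print(result.markdown)


# Run the crawl only when executed as a script, not on import.
if __name__ == "__main__":
    asyncio.run(main())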
PY
python3 - <<'CHECK'
import py_compile
import sys

# Read the source once for the token checks below.
with open("crawl.py") as f:
    src = f.read()

# Syntax check only: compiling never imports crawl4ai or runs the crawl.
try:
    py_compile.compile("crawl.py", doraise=True)
except py_compile.PyCompileError:
    print("BAD: crawl.py does not compile")
    sys.exit(1)


def need(tok, msg):
    """Exit with an error if the token is missing from the script."""
    if tok not in src:
        print("BAD: " + msg)
        sys.exit(1)


need("AsyncWebCrawler", "script does not use crawl4ai AsyncWebCrawler")
need(".arun(", "script never runs a crawl (arun)")
need("markdown", "script does not request markdown output")
need("CacheMode", "script sets no cache mode (politeness / repeatability)")
print("config OK: crawl4ai script compiles and uses AsyncWebCrawler.arun to produce markdown with a cache mode set")
CHECK
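
The check above matches substrings, so a token that appears only in a comment would also pass. A stricter variant can walk the syntax tree instead. This is a minimal sketch using Python's standard ast module, not the check CI runs; the function name arun_calls_with_cache_mode is hypothetical, and it assumes the cache mode is passed as a cache_mode keyword directly to arun, as in crawl.py.

```python
import ast


def arun_calls_with_cache_mode(source: str) -> int:
    """Count calls shaped like <obj>.arun(..., cache_mode=...) in source."""
    tree = ast.parse(source)
    count = 0
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "arun"
            # Keyword names are in kw.arg; **kwargs entries have arg None.
            and any(kw.arg == "cache_mode" for kw in node.keywords)
        ):
            count += 1
    return count


if __name__ == "__main__":
    with open("crawl.py") as f:
        found = arun_calls_with_cache_mode(f.read())
    print("arun calls with cache_mode:", found)
```

Because ast.parse only parses, this still needs no crawl4ai install, browser, or network.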

2
Run the crawl (the browser/network step, not checked by CI)
Run pip install crawl4ai, install its browser, point the url in crawl.py at a site you are allowed to crawl, and run the script. It renders the page and prints clean markdown ready to chunk into a knowledge base. The fetch is fenced.
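
Chunking the printed markdown for a knowledge base might look like the following. It is a minimal sketch in plain Python, assuming heading-based splitting with a character cap is good enough for your retriever; the function name chunk_markdown and the 1000-character default are hypothetical, and it does not treat "#" lines inside code fences specially.

```python
import re


def chunk_markdown(markdown: str, max_chars: int = 1000) -> list[str]:
    """Split markdown at ATX headings, then cap each chunk at max_chars."""
    # Zero-width split just before any line that starts with 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Cut oversized sections into fixed-size character windows.
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks


if __name__ == "__main__":
    sample = "# Title\nintro\n\n## Section\nbody\n"
    for chunk in chunk_markdown(sample):
        print(repr(chunk))
```

Splitting on headings keeps each chunk tied to one topic; the character cap is a crude stand-in for a token budget.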

Eval, 2 fixtures

Last passed: verified today

script-ok · contains · timeout 30s · max $0
Expected: config OK: crawl4ai script compiles and uses AsyncWebCrawler.arun to produce markdown with a cache mode set
clean-exit · exit_code · timeout 30s · max $0
Expected: 0
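
The two fixtures can be reproduced locally along these lines. This is a sketch only: the page does not say how its CI evaluates fixtures, so the run_fixture helper and its return shape are hypothetical, and it assumes "contains" means a substring match on standard output.

```python
import subprocess
import sys


def run_fixture(cmd: list[str], expected: str, timeout: float = 30.0) -> dict:
    """Run cmd and report a substring match on stdout plus the exit code."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return {
        "contains": expected in proc.stdout,
        "exit_code": proc.returncode,
    }


if __name__ == "__main__":
    # Stand-in command; replace it with the command that runs the step 1 check.
    demo = [sys.executable, "-c", "print('config OK: demo')"]
    print(run_fixture(demo, "config OK"))
```

subprocess.run raises subprocess.TimeoutExpired if the command exceeds the timeout, which corresponds to the 30s limit listed for each fixture.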

Results

Crawl4AI renders a JS-heavy page in a headless browser and hands back clean markdown an LLM can read, with no API key, account, or per-page fee. CI checks the script compiles and uses the right calls; the actual fetch runs on your machine against sites you are allowed to crawl.
