Firecrawl: turn a page into the exact JSON you asked for
Author a Firecrawl extract request that returns schema-structured JSON (not just markdown) and validate the request shape before you spend a crawl on it.
Run this workflow
CI-verified, 2/2 fixtures passing.
Intended Use
Anyone extracting structured data from pages they are allowed to scrape. CI parses the extract request JSON and asserts that it targets a url, requests json output, and defines a schema with at least one field. CI uses no network, no key, and no crawl. The live crawl (with your key, or against a self-hosted AGPL instance) is fenced off from CI.
Not for
- Scraping sites whose terms or robots.txt forbid it. The schema is the easy part; permission is on you.
- Assuming the core is permissively licensed. Firecrawl's core is AGPL-3.0 (only the SDKs are MIT), so a closed product built on the self-hosted core inherits copyleft obligations.
The Stack
Tested Against
- firecrawl.dev docs / extract API (2026-06)
- node@20
Side effects & data flow
- Network: none, local only
- Writes: ./extract.json
- Credentials: none required
Prerequisites
- A Firecrawl API key or a self-hosted instance (only to run the crawl)
Steps
- 1
Author the extract request and validate it
Write the extract request: the url, formats including json, and a jsonOptions.schema describing the fields you want. CI parses it and checks the shape. Sending it to Firecrawl needs a key or your self-hosted instance, so that step is fenced.
# Write the request body to ./extract.json
cat > extract.json <<'JSON'
{
  "url": "https://example.com/pricing",
  "formats": ["markdown", "json"],
  "onlyMainContent": true,
  "jsonOptions": {
    "schema": {
      "type": "object",
      "properties": {
        "plan": { "type": "string" },
        "priceUsd": { "type": "number" }
      },
      "required": ["plan", "priceUsd"]
    }
  }
}
JSON
# Validate the request shape locally (no network, no key)
node -e '
const fs = require("fs");
const c = JSON.parse(fs.readFileSync("extract.json", "utf8"));
// Print the reason and exit non-zero so CI fails loudly.
function bad(m) { console.error("BAD: " + m); process.exit(1); }
if (!c.url || typeof c.url !== "string") bad("request has no url");
if (!Array.isArray(c.formats) || !c.formats.includes("json")) bad("formats must include json for structured extraction");
const sch = c.jsonOptions && c.jsonOptions.schema;
if (!sch || sch.type !== "object" || !sch.properties) bad("no jsonOptions.schema object with properties");
// Count the declared fields; an empty schema extracts nothing.
const fields = Object.keys(sch.properties).length;
if (fields < 1) bad("schema defines no fields");
console.log("config OK: Firecrawl extract targets a url, requests json output, and defines a " + fields + "-field schema");
'
- 2
Run the crawl (the network step, not checked by CI)
Send the request to the Firecrawl API with your key, or to your self-hosted instance. Firecrawl renders the page and returns JSON matching your schema. Only crawl pages you are allowed to crawl; CI never runs the network call or sees its data.
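The recipe leaves the network call to you. The following is a minimal sketch of what it might look like, not a verified client. The endpoint path `/v1/scrape`, the bearer-token header, the `data.json` response field, and the environment variable names `FIRECRAWL_API_KEY` and `FIRECRAWL_BASE_URL` are all assumptions that do not come from this recipe; check them against the Firecrawl docs for the version you target. Without a key the script skips the crawl, so it stays safe to run offline.

```javascript
// Sketch only: the endpoint path, auth header, env var names, and response
// shape are assumptions; confirm them against the Firecrawl docs you target.
const fs = require("fs");

// Build the HTTP request without sending it, so it can be inspected first.
function buildScrapeRequest(config, apiKey, baseUrl = "https://api.firecrawl.dev") {
  return {
    url: baseUrl.replace(/\/+$/, "") + "/v1/scrape", // assumed path
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: "Bearer " + apiKey,
      },
      body: JSON.stringify(config),
    },
  };
}

async function main() {
  const apiKey = process.env.FIRECRAWL_API_KEY; // hypothetical variable name
  if (!apiKey) {
    console.log("FIRECRAWL_API_KEY not set; skipping the live crawl");
    return;
  }
  const config = JSON.parse(fs.readFileSync("extract.json", "utf8"));
  // FIRECRAWL_BASE_URL (hypothetical) points at a self-hosted instance if set.
  const req = buildScrapeRequest(config, apiKey, process.env.FIRECRAWL_BASE_URL);
  const res = await fetch(req.url, req.init); // global fetch, Node 18+
  if (!res.ok) throw new Error("Firecrawl returned HTTP " + res.status);
  const payload = await res.json();
  // Assumed response shape: structured fields under data.json.
  console.log(JSON.stringify(payload.data && payload.data.json, null, 2));
}

main().catch((err) => { console.error(err.message); process.exit(1); });
```

Keeping request construction in its own function means the URL, headers, and body can be checked in a test without touching the network.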
Eval, 2 fixtures
Last passed: verified today
- schema-ok (contains), timeout 30s · max $0. Expected:
  config OK: Firecrawl extract targets a url, requests json output, and defines a 2-field schema
- clean-exit (exit_code), timeout 30s · max $0. Expected:
  0
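The two fixtures can be read as a substring check on stdout and an equality check on the exit code. The sketch below shows that reading. It is not the actual CI harness; the `checkFixture` function, the fixture object layout, and the sample `run` result are hypothetical names introduced for illustration.

```javascript
// Sketch of how the two fixtures could be evaluated; not the real harness.
// result: { stdout: string, exitCode: number } captured from running step 1.
function checkFixture(fixture, result) {
  switch (fixture.kind) {
    case "contains":
      // Passes when the expected text appears anywhere in stdout.
      return result.stdout.includes(fixture.expected);
    case "exit_code":
      // Passes when the process exit code equals the expected number.
      return result.exitCode === Number(fixture.expected);
    default:
      throw new Error("unknown fixture kind: " + fixture.kind);
  }
}

const expectedLine =
  "config OK: Firecrawl extract targets a url, requests json output, and defines a 2-field schema";

const fixtures = [
  { name: "schema-ok", kind: "contains", expected: expectedLine },
  { name: "clean-exit", kind: "exit_code", expected: "0" },
];

// Hypothetical captured result of a successful run of step 1.
const run = { stdout: expectedLine + "\n", exitCode: 0 };

for (const f of fixtures) {
  console.log(f.name + ": " + (checkFixture(f, run) ? "pass" : "fail"));
}
```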
Results
Firecrawl renders JavaScript and can return content as LLM-ready markdown or as JSON matching a schema you define, so a page becomes typed fields instead of a wall of text. CI checks the request is well-formed; the crawl itself runs on your key or your self-hosted instance.
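Once a crawl returns, it may be worth confirming that the result really carries the typed fields the schema asked for. The sketch below is a deliberately small check under stated assumptions: it covers only `required` and the primitive `type` values `string`, `number`, and `boolean` on top-level properties, and it is no substitute for a full JSON Schema validator. The `checkAgainstSchema` function and the sample results are hypothetical and are not real crawl output.

```javascript
// Minimal check of returned data against the flat schema from step 1.
// Covers only "required" and primitive "type" on top-level properties.
const PRIMITIVES = new Set(["string", "number", "boolean"]);

function checkAgainstSchema(data, schema) {
  if (data === null || typeof data !== "object" || Array.isArray(data)) {
    return ["result is not an object"];
  }
  const errors = [];
  for (const key of schema.required || []) {
    if (!(key in data)) errors.push("missing required field: " + key);
  }
  for (const [key, spec] of Object.entries(schema.properties || {})) {
    if (!(key in data)) continue; // absence is handled by the required check
    const actual = typeof data[key];
    if (PRIMITIVES.has(spec.type) && actual !== spec.type) {
      errors.push(key + " should be " + spec.type + ", got " + actual);
    }
  }
  return errors;
}

// The schema from extract.json in step 1.
const schema = {
  type: "object",
  properties: { plan: { type: "string" }, priceUsd: { type: "number" } },
  required: ["plan", "priceUsd"],
};

// Hypothetical results for illustration only.
const good = { plan: "Pro", priceUsd: 29 };
const wrongType = { plan: "Pro", priceUsd: "29" };

console.log(checkAgainstSchema(good, schema).length);   // 0 errors
console.log(checkAgainstSchema(wrongType, schema)[0]);  // priceUsd should be number, got string
```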