Automation · Open Source · Free · Active · Machine-verified · beginner · ~10 min setup

Firecrawl: turn a page into the exact JSON you asked for

Author a Firecrawl extract request that returns schema-structured JSON (not just markdown) and validate the request shape before you spend a crawl on it.

by Shilpa Mitra · verified today · v1.0.0

Run this workflow

CI-verified, 2/2 fixtures passing.

Intended Use

Anyone extracting structured data from pages they are allowed to scrape. CI parses the extract request JSON and asserts it targets a url, requests json output, and defines a schema with fields. No network, no key, no crawl. The live crawl (your key or self-hosted, AGPL) is fenced.

Not for

Scraping sites whose terms or robots.txt forbid it. The schema is the easy part; permission is on you.
Assuming the core is permissively licensed. Firecrawl's core is AGPL-3.0 (only the SDKs are MIT), so a closed product built on the self-hosted core inherits copyleft obligations.

The Stack

Firecrawl: crawler / extractor

Tested Against

firecrawl.dev docs / extract API (2026-06)
node@20

Side effects & data flow

Network: none, local only
Writes: ./extract.json
Credentials: none required

Prerequisites

A Firecrawl API key or a self-hosted instance (only to run the crawl)

Steps

1. Author the extract request and validate it

Write the extract request: the url, formats including json, and a jsonOptions.schema describing the fields you want. CI parses it and checks the shape. Sending it to Firecrawl needs a key or your self-hosted instance, so that step is fenced.

cat > extract.json <<'JSON'
{
  "url": "https://example.com/pricing",
  "formats": ["markdown", "json"],
  "onlyMainContent": true,
  "jsonOptions": {
    "schema": {
      "type": "object",
      "properties": {
        "plan": { "type": "string" },
        "priceUsd": { "type": "number" }
      },
      "required": ["plan", "priceUsd"]
    }
  }
}
JSON
node -e '
const fs = require("fs");
const c = JSON.parse(fs.readFileSync("extract.json", "utf8"));
function bad(m) { console.error("BAD: " + m); process.exit(1); }
if (!c.url || typeof c.url !== "string") bad("request has no url");
if (!Array.isArray(c.formats) || !c.formats.includes("json")) bad("formats must include json for structured extraction");
const sch = c.jsonOptions && c.jsonOptions.schema;
if (!sch || sch.type !== "object" || !sch.properties) bad("no jsonOptions.schema object with properties");
const fields = Object.keys(sch.properties).length;
if (fields < 1) bad("schema defines no fields");
console.log("config OK: Firecrawl extract targets a url, requests json output, and defines a " + fields + "-field schema");
'
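
The same checks can be wrapped in a function that returns a message instead of exiting, which makes the failure cases easier to see. This is a sketch and not part of the recipe's CI; `checkExtractRequest` is a hypothetical name, and the two sample requests are made up to trigger specific checks.

```javascript
// Hypothetical helper: mirrors the inline checks above, but returns
// an error message (or null) instead of calling process.exit.
function checkExtractRequest(c) {
  if (!c.url || typeof c.url !== "string") return "request has no url";
  if (!Array.isArray(c.formats) || !c.formats.includes("json")) {
    return "formats must include json for structured extraction";
  }
  const sch = c.jsonOptions && c.jsonOptions.schema;
  if (!sch || sch.type !== "object" || !sch.properties) {
    return "no jsonOptions.schema object with properties";
  }
  if (Object.keys(sch.properties).length < 1) return "schema defines no fields";
  return null; // null means the request shape passed every check
}

// A markdown-only request is rejected by the formats check.
console.log(checkExtractRequest({
  url: "https://example.com/pricing",
  formats: ["markdown"],
}));

// An empty properties object is rejected by the field-count check.
console.log(checkExtractRequest({
  url: "https://example.com/pricing",
  formats: ["json"],
  jsonOptions: { schema: { type: "object", properties: {} } },
}));
```

Returning a message rather than exiting keeps the checks usable from a test runner as well as from the command line.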

2. Run the crawl (the network step, not checked by CI)
Send the request to the Firecrawl API with your key, or to your self-hosted instance. Firecrawl renders the page and returns JSON matching your schema. Only crawl pages you are allowed to; the network call and its data are fenced.
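
A minimal sketch of that network step in Node 20 follows. It makes several assumptions that are not stated in this recipe: the endpoint path `/v1/scrape`, the `Authorization: Bearer` header, the response field `data.json`, and the environment variable names `FIRECRAWL_API_KEY` and `FIRECRAWL_BASE_URL`. Check each against the Firecrawl docs or your self-hosted instance before using it. The script skips the call entirely when no key is set, so a self-hosted instance that needs no key would require a small change.

```javascript
const fs = require("fs");

// Assumed hosted endpoint; override with FIRECRAWL_BASE_URL for self-hosting.
const DEFAULT_BASE = "https://api.firecrawl.dev";

// Build the fetch arguments without touching the network.
function buildScrapeCall(request, apiKey, baseUrl = DEFAULT_BASE) {
  return {
    url: baseUrl.replace(/\/+$/, "") + "/v1/scrape", // assumed path
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: "Bearer " + apiKey, // assumed auth scheme
      },
      body: JSON.stringify(request),
    },
  };
}

async function main() {
  const apiKey = process.env.FIRECRAWL_API_KEY;
  if (!apiKey) {
    console.log("FIRECRAWL_API_KEY not set; skipping the live crawl");
    return;
  }
  const request = JSON.parse(fs.readFileSync("extract.json", "utf8"));
  const call = buildScrapeCall(request, apiKey, process.env.FIRECRAWL_BASE_URL);
  const res = await fetch(call.url, call.init); // global fetch, Node 18+
  if (!res.ok) throw new Error("Firecrawl returned HTTP " + res.status);
  const body = await res.json();
  // Assumed response shape: structured fields under data.json.
  console.log(JSON.stringify(body.data && body.data.json, null, 2));
}

main().catch((err) => {
  console.error(err.message);
  process.exit(1);
});
```

Keeping `buildScrapeCall` separate from the `fetch` call means the request can be inspected or tested without a key and without a crawl.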

Eval, 2 fixtures

Last passed: verified today

schema-ok · contains · timeout 30s · max $0
Expected: config OK: Firecrawl extract targets a url, requests json output, and defines a 2-field schema
clean-exit · exit_code · timeout 30s · max $0
Expected: 0
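
The two fixture kinds can be read as a stdout substring check and an exit-code check. The sketch below illustrates that reading and is not the actual CI harness: `runFixture` and the fixture object layout are hypothetical, and a one-line Node script stands in for the recipe's validation step.

```javascript
const { spawnSync } = require("child_process");

// Hypothetical fixture runner: "contains" checks stdout for a substring,
// "exit_code" compares the process exit status.
function runFixture(fixture, command, args) {
  const r = spawnSync(command, args, { encoding: "utf8", timeout: fixture.timeoutMs });
  if (fixture.kind === "contains") return (r.stdout || "").includes(fixture.expected);
  if (fixture.kind === "exit_code") return r.status === fixture.expected;
  throw new Error("unknown fixture kind: " + fixture.kind);
}

// Stand-in for the recipe step: prints the success line and exits 0.
const line = "config OK: Firecrawl extract targets a url, requests json output, and defines a 2-field schema";
const args = ["-e", "console.log(" + JSON.stringify(line) + ")"];

const fixtures = [
  { name: "schema-ok", kind: "contains", expected: line, timeoutMs: 30000 },
  { name: "clean-exit", kind: "exit_code", expected: 0, timeoutMs: 30000 },
];

for (const f of fixtures) {
  const ok = runFixture(f, process.execPath, args);
  console.log(f.name + ": " + (ok ? "pass" : "fail"));
}
```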

Results

Firecrawl renders JavaScript and can return content as LLM-ready markdown or as JSON matching a schema you define, so a page becomes typed fields instead of a wall of text. CI checks the request is well-formed; the crawl itself runs on your key or your self-hosted instance.
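
Once a crawl returns, it can be worth confirming that the JSON really has the fields the schema asked for. The sketch below is a deliberately small check for flat schemas like the one in extract.json. It is not a full JSON Schema validator, `conformanceErrors` is a hypothetical name, and the sample results are made up for illustration rather than taken from real Firecrawl output.

```javascript
// Minimal conformance check: handles "required" plus the primitive types
// string, number, and boolean. Nested objects, arrays, and "integer" are
// out of scope for this sketch.
function conformanceErrors(schema, value) {
  const errors = [];
  for (const key of schema.required || []) {
    if (!(key in value)) errors.push("missing required field: " + key);
  }
  for (const [key, spec] of Object.entries(schema.properties || {})) {
    if (!(key in value)) continue; // absence is handled by the required check
    const actual = typeof value[key];
    if (spec.type && actual !== spec.type) {
      errors.push(key + ": expected " + spec.type + ", got " + actual);
    }
  }
  return errors;
}

// Same schema as in extract.json.
const schema = {
  type: "object",
  properties: {
    plan: { type: "string" },
    priceUsd: { type: "number" },
  },
  required: ["plan", "priceUsd"],
};

// Hypothetical results: the first conforms, the second has a string price.
console.log(conformanceErrors(schema, { plan: "Pro", priceUsd: 49 }));
console.log(conformanceErrors(schema, { plan: "Pro", priceUsd: "49" }));
```

For anything beyond flat schemas, a full JSON Schema validator library is the safer choice.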
