Automation · Open Source · Free · Active · Machine-verified · beginner · ~10 min setup

Firecrawl: turn a page into the exact JSON you asked for

Author a Firecrawl extract request that returns schema-structured JSON (not just markdown) and validate the request shape before you spend a crawl on it.

by Shilpa Mitra · verified today · v1.0.0

CI-verified, 2/2 fixtures passing.


Intended Use

Anyone extracting structured data from pages they are allowed to scrape. CI parses the extract request JSON and asserts that it targets a url, requests json output, and defines a schema with at least one field. No network, no key, no crawl. The live crawl (with your API key, or on a self-hosted, AGPL-licensed instance) is fenced off from CI.

Not for

  • Scraping sites whose terms or robots.txt forbid it. The schema is the easy part; permission is on you.
  • Assuming the core is permissively licensed. Firecrawl's core is AGPL-3.0 (only the SDKs are MIT), so a closed product built on the self-hosted core inherits copyleft obligations.

The Stack

Tested Against

firecrawl.dev docs / extract API (2026-06) · node@20

Side effects & data flow

Network: none, local only
Writes: ./extract.json
Credentials: none required

Prerequisites

  • A Firecrawl API key or a self-hosted instance (only to run the crawl)

Steps

  1. Author the extract request and validate it

    Write the extract request: the url, formats including json, and a jsonOptions.schema describing the fields you want. CI parses it and checks the shape. Sending it to Firecrawl needs a key or your self-hosted instance, so that step is fenced.

    cat > extract.json <<'JSON'
    {
      "url": "https://example.com/pricing",
      "formats": ["markdown", "json"],
      "onlyMainContent": true,
      "jsonOptions": {
        "schema": {
          "type": "object",
          "properties": {
            "plan": { "type": "string" },
            "priceUsd": { "type": "number" }
          },
          "required": ["plan", "priceUsd"]
        }
      }
    }
    JSON
    node -e '
    // Validate the shape of extract.json locally; nothing here touches the network.
    const fs = require("fs");
    const c = JSON.parse(fs.readFileSync("extract.json", "utf8"));
    // Print the reason and exit non-zero so CI fails on a malformed request.
    function bad(m) { console.error("BAD: " + m); process.exit(1); }
    if (!c.url || typeof c.url !== "string") bad("request has no url");
    if (!Array.isArray(c.formats) || !c.formats.includes("json")) bad("formats must include json for structured extraction");
    // The schema must be an object schema that declares at least one property.
    const sch = c.jsonOptions && c.jsonOptions.schema;
    if (!sch || sch.type !== "object" || !sch.properties) bad("no jsonOptions.schema object with properties");
    const fields = Object.keys(sch.properties).length;
    if (fields < 1) bad("schema defines no fields");
    console.log("config OK: Firecrawl extract targets a url, requests json output, and defines a " + fields + "-field schema");
    '
  2. Run the crawl (the network step, not checked by CI)

    Send the request to the Firecrawl API with your key, or to your self-hosted instance. Firecrawl renders the page and returns JSON matching your schema. Only crawl pages you are allowed to; the network call and its data are fenced.

Eval, 2 fixtures

Last passed: verified today
  • schema-ok · contains · timeout 30s · max $0

    Expected: config OK: Firecrawl extract targets a url, requests json output, and defines a 2-field schema

  • clean-exit · exit_code · timeout 30s · max $0

    Expected: 0
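For readers who want to reproduce the two fixture checks locally, here is a rough sketch of what a `contains` check and an `exit_code` check amount to. The actual CI harness is not shown on this page, so the `runFixture` function and the fixture object shape are hypothetical. A stand-in command replaces the step 1 script so the sketch runs on its own.

```javascript
const { spawnSync } = require("child_process");

// Hypothetical fixture shape: { name, kind, expected, timeoutMs }.
function runFixture(fixture, command, args) {
  const run = spawnSync(command, args, { encoding: "utf8", timeout: fixture.timeoutMs });
  // "contains" passes when stdout includes the expected text.
  if (fixture.kind === "contains") return (run.stdout || "").includes(fixture.expected);
  // "exit_code" passes when the process exits with the expected status.
  if (fixture.kind === "exit_code") return run.status === fixture.expected;
  throw new Error("unknown fixture kind: " + fixture.kind);
}

const fixtures = [
  { name: "schema-ok", kind: "contains", expected: "config OK:", timeoutMs: 30000 },
  { name: "clean-exit", kind: "exit_code", expected: 0, timeoutMs: 30000 },
];

// Stand-in command so the sketch is self-contained; swap in the step 1 script.
const standIn = ["-e", 'console.log("config OK: stand-in output")'];
for (const f of fixtures) {
  console.log(f.name, runFixture(f, process.execPath, standIn) ? "pass" : "fail");
}
```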

Results

Firecrawl renders JavaScript and can return content as LLM-ready markdown or as JSON matching a schema you define, so a page becomes typed fields instead of a wall of text. CI checks the request is well-formed; the crawl itself runs on your key or your self-hosted instance.
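Once a crawl returns, it can be worth confirming that the extracted object really carries the typed fields the schema asked for. The sketch below is a deliberately small check, not a full JSON Schema validator: it looks only at `required` and at primitive `type` keywords. The function name `checkFields` and the sample objects are hypothetical, and where the extracted object sits in the response depends on the API version you call.

```javascript
// Minimal sketch: handles "required" plus the types string, number, boolean,
// object, array, and null. Other keywords (for example "integer") are not covered.
function checkFields(value, schema) {
  if (value === null || typeof value !== "object" || Array.isArray(value)) {
    return ["result is not an object"];
  }
  const errors = [];
  for (const name of schema.required || []) {
    if (!(name in value)) errors.push("missing required field: " + name);
  }
  for (const [name, spec] of Object.entries(schema.properties || {})) {
    if (!(name in value) || !spec.type) continue;
    const v = value[name];
    const actual = Array.isArray(v) ? "array" : v === null ? "null" : typeof v;
    if (actual !== spec.type) errors.push(name + ": expected " + spec.type + ", got " + actual);
  }
  return errors;
}

// Same schema as in extract.json; the sample objects are made up for illustration.
const schema = {
  type: "object",
  properties: { plan: { type: "string" }, priceUsd: { type: "number" } },
  required: ["plan", "priceUsd"],
};
console.log(checkFields({ plan: "Pro", priceUsd: 49 }, schema));   // []
console.log(checkFields({ plan: "Pro", priceUsd: "49" }, schema)); // reports a type mismatch for priceUsd
```

For anything beyond a quick sanity check, a full JSON Schema validator library is the safer choice.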
