Content Protection

Stop web scraping with FormShield — a beacon plus edge model that fingerprints visitors and names scrapers and AI crawlers, including non-JS bots via server reporting

Scrapers and AI crawlers fetch your pages all day, and the worst of them never run JavaScript — so client-only analytics and most bot tools never see them. You end up guessing which traffic is real and which is quietly draining your content into someone else’s index or training set.

Content Protection pairs a lightweight beacon with an edge model that fingerprints visitors and names the scrapers and AI crawlers hitting your content — including the ones that never run a single line of JavaScript. Every hit becomes a scored observation with a decision and a list of reasons you read in the dashboard Logs.

When to use it

Reach for Content Protection when you want to know who is fetching your pages, not just how many requests came in.

Beacon (in the browser)

One async <script> tag fingerprints each visitor that runs JavaScript and posts the signals on every pageview. Fastest start, but blind to pure crawlers.

Server reporting (from your origin)

POST /v1/report from a Cloudflare Worker, edge middleware, or any backend captures the declared AI agents — GPTBot, ClaudeBot, PerplexityBot, Bytespider — that fetch HTML without running scripts.

The beacon and server reporting share one publishable key and feed the same observation stream. Crawler-heavy content sites run both: the beacon scores real humans, server reporting catches everything that skips JavaScript.

How it works

Drop the beacon on your pages
Add one async <script> tag pointing at https://api.formshield.dev/js/formshield.js with your publishable key and data-fs-mode="pageload". It auto-initializes from the data-fs-* attributes — no extra code.
```html
<script
  async
  src="https://api.formshield.dev/js/formshield.js"
  data-fs-project-key="fs_pub_live_…"
  data-fs-action="pageview"
  data-fs-mode="pageload"
></script>
```
On load it performs a signed handshake (POST /v1/handshake), then posts browser fingerprint and automation signals to POST /v1/collect on each pageview.
Catch the non-JS crawlers server-side

Pure crawlers fetch your HTML without running scripts, so the beacon never sees them. Report each request from your origin worker or backend with POST /v1/report, passing the visitor’s UA and IP.
Fire it with ctx.waitUntil (fire-and-forget) so it adds zero latency and your page never depends on FormShield being up. See server reporting for the complete Worker example.
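As a rough illustration of the fire-and-forget pattern, here is a minimal Worker-style sketch. It is not the official example: the `FORMSHIELD_KEY` binding, the helper names `buildReportBody` and `sendReport`, and the small structural interfaces are assumptions made for this sketch. The endpoint, headers, and body fields follow the request shown on this page.

```typescript
// Structural stand-ins so the sketch compiles without Cloudflare's type packages.
interface HeaderReader {
  get(name: string): string | null;
}
interface WaitUntilContext {
  waitUntil(promise: Promise<unknown>): void;
}
// Hypothetical binding name; store your publishable key however your platform prefers.
interface WorkerEnv {
  FORMSHIELD_KEY: string;
}
interface ReportBody {
  ua: string;
  ip: string;
  hostname: string;
  path: string;
  action: string;
}

// Build the /v1/report body from the incoming request's URL and headers.
function buildReportBody(requestUrl: string, headers: HeaderReader): ReportBody {
  const url = new URL(requestUrl);
  return {
    ua: headers.get("User-Agent") ?? "",
    ip: headers.get("CF-Connecting-IP") ?? "",
    hostname: url.hostname,
    path: url.pathname,
    action: "pageview",
  };
}

// Post the report and swallow every failure, so an outage never reaches the visitor.
async function sendReport(body: ReportBody, publishableKey: string): Promise<void> {
  try {
    await fetch("https://api.formshield.dev/v1/report", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${publishableKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(body),
    });
  } catch {
    // Deliberately ignored: reporting is best-effort.
  }
}

// In a real Worker module this object would be the default export.
const worker = {
  async fetch(request: Request, env: WorkerEnv, ctx: WaitUntilContext): Promise<Response> {
    const body = buildReportBody(request.url, request.headers);
    ctx.waitUntil(sendReport(body, env.FORMSHIELD_KEY)); // scheduled, not awaited
    return fetch(request); // pass the request through to the origin
  },
};
```

The report is handed to `ctx.waitUntil` rather than awaited, so the origin response is returned without waiting on FormShield.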
Read named, verified verdicts in your logs

The edge model scores every hit, classifies the user agent, and checks IP reputation. It names the bot (bot_id like gptbot or googlebot, plus the operating company) and, for operators that publish IP ranges — Google, Microsoft, OpenAI, DuckDuckGo — verifies the request really came from them.
A forged Googlebot from the wrong IP is flagged bot:spoofed. View the score, decision, and reasons per observation in the dashboard Logs.

Quickstart

Once the beacon is on your pages, add server reporting to catch the crawlers it can’t see. Report each origin request with the visitor’s UA and IP.

```bash
curl -X POST https://api.formshield.dev/v1/report \
  -H "Authorization: Bearer fs_pub_live_…" \
  -H "Content-Type: application/json" \
  -d '{
    "ua": "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)",
    "ip": "203.0.113.42",
    "hostname": "example.com",
    "path": "/pricing",
    "action": "pageview"
  }'
```

Response:

```json
{ "ok": true, "request_id": "rpt_a1b2c3d4e5f6" }
```

/v1/report returns an acknowledgement, not a verdict. Scoring happens server-side and the score, decision, and reasons land on the observation — read them in Logs. Never gate your response on this call.
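
To make the "acknowledgement, not a verdict" point concrete, the sketch below checks a parsed response against the two fields shown above and nothing else. The `ReportAck` type and `isReportAck` helper are names invented for this sketch, not part of any FormShield SDK.

```typescript
// The acknowledgement shape shown in the response above: no score, no decision.
interface ReportAck {
  ok: true;
  request_id: string;
}

// Narrow an unknown JSON value to ReportAck (hypothetical helper).
function isReportAck(value: unknown): value is ReportAck {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return candidate["ok"] === true && typeof candidate["request_id"] === "string";
}

// Example: keep the request_id for correlating with Logs, never branch on it.
const parsedAck: unknown = JSON.parse('{ "ok": true, "request_id": "rpt_a1b2c3d4e5f6" }');
const requestId: string | null = isReportAck(parsedAck) ? parsedAck.request_id : null;
```

Here `requestId` is only useful for finding the matching observation later; the response carries nothing to allow or block on.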

Endpoints

The beacon and server reporting use one publishable key (fs_pub_live_…), safe to expose in the browser.

| Endpoint | Caller | Purpose |
| --- | --- | --- |
| `GET /js/formshield.js` | browser | The beacon. Auto-initializes from `data-fs-*` attributes. |
| `POST /v1/handshake` | beacon | Signed handshake that proves a real browser ran the beacon. |
| `POST /v1/collect` | beacon | Posts fingerprint and automation signals on each pageview. |
| `POST /v1/report` | your origin | Server-side report of a request the beacon can’t see. |

Signals

Each observation is scored from these signals. They combine user-agent classification with self-hosted IP intelligence.

AI and search crawler identification

Declared agents get a bot:ai_crawler or bot:search_crawler reason plus the named operator. GPTBot, ClaudeBot, PerplexityBot, and Bytespider are recognized; verified benign search crawlers are credited toward allow while AI crawlers stay visible.
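
As an illustration of what user-agent classification can look like, the sketch below matches the declared agents named on this page by substring. It is an assumption-laden toy, not FormShield's classifier: the lookup table, the `classifyUserAgent` helper, and every `botId` other than `gptbot` and `googlebot` are made up for this sketch.

```typescript
type CrawlerReason = "bot:ai_crawler" | "bot:search_crawler";

interface DeclaredCrawler {
  botId: string; // lowercase token, in the style of "gptbot" / "googlebot"
  operator: string;
  reason: CrawlerReason;
}

// Illustrative table built from the agents named on this page.
const DECLARED_CRAWLERS: readonly DeclaredCrawler[] = [
  { botId: "gptbot", operator: "OpenAI", reason: "bot:ai_crawler" },
  { botId: "claudebot", operator: "Anthropic", reason: "bot:ai_crawler" },
  { botId: "perplexitybot", operator: "Perplexity", reason: "bot:ai_crawler" },
  { botId: "bytespider", operator: "ByteDance", reason: "bot:ai_crawler" },
  { botId: "googlebot", operator: "Google", reason: "bot:search_crawler" },
  { botId: "bingbot", operator: "Microsoft", reason: "bot:search_crawler" },
];

// Return the first declared crawler whose token appears in the UA, or null.
function classifyUserAgent(ua: string): DeclaredCrawler | null {
  const lowered = ua.toLowerCase();
  for (const crawler of DECLARED_CRAWLERS) {
    if (lowered.includes(crawler.botId)) return crawler;
  }
  return null;
}
```

A declared user agent is only a claim; on its own it says nothing about whether the request really came from that operator.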

Spoof detection via IP verification

For operators that publish their ranges, a request whose UA claims a crawler but whose IP is out of range is flagged bot:spoofed and scored high. A real crawler is confirmed (bot:verified); a forged one is caught.
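
The range check behind this can be pictured with a small IPv4 CIDR test. The sketch below is a simplified stand-in, not FormShield's implementation: it handles IPv4 only, the helper names are invented, and `192.0.2.0/24` is a documentation placeholder rather than any operator's real range.

```typescript
// Parse dotted-quad IPv4 into an unsigned integer, or null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// True when `ip` falls inside the IPv4 CIDR block `cidr` (for example "192.0.2.0/24").
function inCidr(ip: string, cidr: string): boolean {
  const [base, bitsText] = cidr.split("/");
  if (base === undefined || bitsText === undefined || !/^\d{1,2}$/.test(bitsText)) return false;
  const bits = Number(bitsText);
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(base);
  if (ipInt === null || baseInt === null || bits > 32) return false;
  // Compare network prefixes arithmetically to avoid 32-bit signed shifts.
  const blockSize = 2 ** (32 - bits);
  return Math.floor(ipInt / blockSize) === Math.floor(baseInt / blockSize);
}

// Map the range check onto the two reasons described above.
function verifyCrawlerIp(ip: string, publishedRanges: readonly string[]): "bot:verified" | "bot:spoofed" {
  return publishedRanges.some((range) => inCidr(ip, range)) ? "bot:verified" : "bot:spoofed";
}

// Placeholder range (TEST-NET-1), NOT a real crawler range.
const placeholderRanges = ["192.0.2.0/24"];
const insideResult = verifyCrawlerIp("192.0.2.77", placeholderRanges);
const outsideResult = verifyCrawlerIp("203.0.113.42", placeholderRanges);
```

With the placeholder range, `insideResult` is `"bot:verified"` and `outsideResult` is `"bot:spoofed"`, which mirrors how a forged crawler UA from an unrelated IP gets caught.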

Automation and missing-token tells

The signed handshake token proves a real browser ran the beacon. Its absence (client_token_missing) plus webdriver and headless markers (automation_detected) push the score toward block on the client path. Server reports correctly skip the missing-client penalty.

IP reputation on every hit

Every hit is checked for datacenter, VPN, proxy, residential-proxy, and scanner flags, and is tagged with country and ASN. A human UA from a datacenter range, or a desktop UA on a mobile IP, raises a consistency flag.

Key reasons

bot:ai_crawler

The user agent declares an AI crawler (GPTBot, ClaudeBot, PerplexityBot, Bytespider). The operator is named on the observation. AI crawlers stay visible rather than being silently allowed.

bot:search_crawler

The user agent declares a search crawler (Googlebot, Bingbot). A verified benign search crawler is credited toward allow.

bot:verified

The UA claims a crawler whose operator publishes IP ranges, and the request’s IP falls inside them. A real crawler is confirmed.

bot:spoofed

The UA claims a crawler but the IP is out of the operator’s published range. The hit is scored high — a forged Googlebot is caught.

automation_detected

Browser fingerprint flags webdriver or headless automation — a strong tell that pushes a client-path hit toward block.

client_token_missing

The signed handshake token is absent on a client-path hit, so no real browser ran the beacon. Server reports skip this penalty by design.

Common questions

How do I stop web scraping when the scraper does not run JavaScript?

The JS beacon only sees clients that execute scripts, so pure crawlers slip past it. Report each request from your origin with POST /v1/report, passing the visitor’s UA and IP from the incoming request (on Cloudflare, CF-Connecting-IP and User-Agent). FormShield classifies and scores it server-side, naming AI agents like GPTBot and ClaudeBot and verifying or flagging crawlers by IP range. Send it fire-and-forget with ctx.waitUntil so it adds zero latency.

Does /v1/report return a block decision I can act on inline?

No. /v1/report returns only { ok: true, request_id }; scoring happens server-side and the score, decision, and reasons are stored on the observation. View them in the dashboard Logs. Never gate your response on this call, and always wrap the fetch in try/catch so a FormShield outage can never break your page.

What does Content Protection cost?

The passive beacon (handshake / collect / report) is free — it costs zero credits, so you can instrument every pageview. Deep analysis costs 4 credits per request. Billing is in credits across all products; see Billing & Pricing for plans and overage.

Next steps

Server reporting

The full /v1/report reference — request body, fire-and-forget patterns, and a complete Cloudflare Worker.

Bot detection

Name and IP-verify crawlers, and allow or block them per project.

Pageview tracking

Every beacon attribute, the three modes, and the full observation shape.

Quickstart

Get scored pageviews flowing into your dashboard in minutes.
