Content Protection
Stop web scraping with FormShield — a beacon plus edge model that fingerprints visitors and names scrapers and AI crawlers, including non-JS bots via server reporting
Scrapers and AI crawlers fetch your pages all day, and the worst of them never run JavaScript — so client-only analytics and most bot tools never see them. You end up guessing which traffic is real and which is quietly draining your content into someone else’s index or training set.
Content Protection pairs a lightweight beacon with an edge model that fingerprints visitors and names the scrapers and AI crawlers hitting your content — including the ones that never run a single line of JavaScript. Every hit becomes a scored observation with a decision and a list of reasons you read in the dashboard Logs.
When to use it
Reach for Content Protection when you want to know who is fetching your pages, not just how many requests came in.
Beacon: one async <script> tag fingerprints each visitor that runs JavaScript and posts the signals on every pageview. It is the fastest way to start, but it is blind to pure crawlers.
Server reporting: POST /v1/report from a Cloudflare Worker, edge middleware, or any backend captures the declared AI agents (GPTBot, ClaudeBot, PerplexityBot, Bytespider) that fetch HTML without running scripts.
The beacon and server reporting share one publishable key and feed the same observation stream. Crawler-heavy content sites run both: the beacon scores real humans, server reporting catches everything that skips JavaScript.
How it works
- Drop the beacon on your pages

  Add one async <script> tag pointing at https://api.formshield.dev/js/formshield.js with your publishable key and data-fs-mode="pageload". It auto-initializes from the data-fs-* attributes, so no extra code is needed.

      <script
        async
        src="https://api.formshield.dev/js/formshield.js"
        data-fs-project-key="fs_pub_live_…"
        data-fs-action="pageview"
        data-fs-mode="pageload"
      ></script>

  On load it performs a signed handshake (POST /v1/handshake), then posts browser fingerprint and automation signals to POST /v1/collect on each pageview.

- Catch the non-JS crawlers server-side

  Pure crawlers fetch your HTML without running scripts, so the beacon never sees them. Report each request from your origin worker or backend with POST /v1/report, passing the visitor’s UA and IP.

  Fire it with ctx.waitUntil (fire-and-forget) so it adds zero latency and your page never depends on FormShield being up. See server reporting for the complete Worker example.

- Read named, verified verdicts in your logs

  The edge model scores every hit, classifies the user agent, and checks IP reputation. It names the bot (bot_id like gptbot or googlebot, plus the operating company) and, for operators that publish IP ranges (Google, Microsoft, OpenAI, DuckDuckGo), verifies the request really came from them.

  A forged Googlebot from the wrong IP is flagged bot:spoofed. View the score, decision, and reasons per observation in the dashboard Logs.
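The server-side step can be sketched as a Cloudflare Worker. This is a minimal illustration, not the complete example from the server reporting page. It assumes the Worker sits in front of your origin and that the publishable key is exposed as a binding named FORMSHIELD_KEY; that binding name, the WaitUntilContext stand-in type, and the helper names buildReport and sendReport are hypothetical. The body fields follow the /v1/report request shown in the Quickstart.

```typescript
// Minimal stand-in for the Workers ExecutionContext; only waitUntil is used.
interface WaitUntilContext {
  waitUntil(promise: Promise<unknown>): void;
}

// The parts of an incoming request that the report needs.
interface IncomingRequest {
  url: string;
  headers: { get(name: string): string | null };
}

// Build the report body from the incoming request (hypothetical helper).
function buildReport(request: IncomingRequest) {
  const url = new URL(request.url);
  return {
    ua: request.headers.get("User-Agent") ?? "",
    ip: request.headers.get("CF-Connecting-IP") ?? "",
    hostname: url.hostname,
    path: url.pathname,
    action: "pageview",
  };
}

// POST the report and swallow every failure, so an outage never reaches the page.
async function sendReport(body: object, key: string): Promise<void> {
  try {
    await fetch("https://api.formshield.dev/v1/report", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${key}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(body),
    });
  } catch {
    // Ignored on purpose: reporting is best-effort.
  }
}

// In a Worker module this object would be the default export.
const worker = {
  async fetch(
    request: Request,
    env: { FORMSHIELD_KEY: string }, // assumed binding name
    ctx: WaitUntilContext,
  ): Promise<Response> {
    // Schedule the report without awaiting it, then serve the page as usual.
    ctx.waitUntil(sendReport(buildReport(request), env.FORMSHIELD_KEY));
    return fetch(request);
  },
};
```

Because the report is handed to ctx.waitUntil rather than awaited, the response returns as soon as the origin answers, and the try/catch keeps a failed report from surfacing as an error.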
Quickstart
Once the beacon is on your pages, add server reporting to catch the crawlers it can’t see. Report each origin request with the visitor’s UA and IP.
    curl -X POST https://api.formshield.dev/v1/report \
      -H "Authorization: Bearer fs_pub_live_…" \
      -H "Content-Type: application/json" \
      -d '{
        "ua": "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)",
        "ip": "203.0.113.42",
        "hostname": "example.com",
        "path": "/pricing",
        "action": "pageview"
      }'

Response:

    { "ok": true, "request_id": "rpt_a1b2c3d4e5f6" }

/v1/report returns an acknowledgement, not a verdict. Scoring happens server-side, and the score, decision, and reasons land on the observation; read them in Logs. Never gate your response on this call.
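From a backend that is not a Worker, the same call might look like the TypeScript below. Treat it as a sketch rather than an official client: the names ReportAck, isReportAck, and reportPageview are hypothetical, the 2-second timeout is an arbitrary choice, and a runtime with a global fetch (for example Node 18 or later) is assumed. The request fields and the acknowledgement shape are taken from the curl example above.

```typescript
// Shape of the acknowledgement shown in the Quickstart response.
interface ReportAck {
  ok: boolean;
  request_id: string;
}

// Narrow an unknown JSON value to ReportAck instead of trusting the network.
function isReportAck(value: unknown): value is ReportAck {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.ok === "boolean" && typeof v.request_id === "string";
}

// Hypothetical helper: resolves to the request_id, or null on any failure.
async function reportPageview(
  key: string,
  hit: { ua: string; ip: string; hostname: string; path: string },
): Promise<string | null> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), 2000); // arbitrary limit
  try {
    const res = await fetch("https://api.formshield.dev/v1/report", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${key}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ ...hit, action: "pageview" }),
      signal: controller.signal,
    });
    const data: unknown = await res.json();
    return isReportAck(data) && data.ok ? data.request_id : null;
  } catch {
    return null; // an outage or timeout must never reach the caller
  } finally {
    clearTimeout(timer);
  }
}
```

The helper returns only an identifier because the acknowledgement carries no score or decision; there is nothing in it to branch on.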
Endpoints
The beacon and server reporting use one publishable key (fs_pub_live_…), safe to expose in the browser.
| Endpoint | Caller | Purpose |
|---|---|---|
| GET /js/formshield.js | browser | The beacon. Auto-initializes from data-fs-* attributes. |
| POST /v1/handshake | beacon | Signed handshake that proves a real browser ran the beacon. |
| POST /v1/collect | beacon | Posts fingerprint and automation signals on each pageview. |
| POST /v1/report | your origin | Server-side report of a request the beacon can’t see. |
Signals
Each observation is scored from these signals. They combine user-agent classification with self-hosted IP intelligence.
Declared agents get a bot:ai_crawler or bot:search_crawler reason plus the named operator. GPTBot, ClaudeBot, PerplexityBot, and Bytespider are recognized; verified benign search crawlers are credited toward allow while AI crawlers stay visible.
For operators that publish their ranges, a request whose UA claims a crawler but whose IP is out of range is flagged bot:spoofed and scored high. A real crawler is confirmed (bot:verified); a forged one is caught.
The signed handshake token proves a real browser ran the beacon. Its absence (client_token_missing) plus webdriver and headless markers (automation_detected) push the score toward block on the client path. Server reports correctly skip the missing-client penalty.
IP intelligence contributes datacenter, VPN, proxy, residential-proxy, and scanner flags, plus country and ASN. A human UA from a datacenter range, or a desktop UA on a mobile IP, raises a consistency flag.
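The idea behind bot:verified and bot:spoofed can be illustrated with a plain CIDR membership test. FormShield's own implementation is not described on this page, so the sketch below only shows the general technique under stated assumptions: it handles IPv4 only, the function names are hypothetical, and the range used is a documentation block from RFC 5737, not a real operator range.

```typescript
// Convert a dotted-quad IPv4 address to an unsigned integer; null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let n = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    n = n * 256 + octet;
  }
  return n;
}

// True when ip falls inside the CIDR block, e.g. "198.51.100.0/24".
function inCidr(ip: string, cidr: string): boolean {
  const [base, bitsText] = cidr.split("/");
  const bits = Number(bitsText);
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(base);
  if (ipInt === null || baseInt === null) return false;
  if (!Number.isInteger(bits) || bits < 0 || bits > 32) return false;
  // Drop the host portion of both addresses and compare the network prefixes.
  const hostSize = 2 ** (32 - bits);
  return Math.floor(ipInt / hostSize) === Math.floor(baseInt / hostSize);
}

type CrawlerReason = "bot:verified" | "bot:spoofed";

// A UA that claims a crawler is verified only if its IP is in a published range.
function checkCrawlerClaim(ip: string, publishedRanges: string[]): CrawlerReason {
  return publishedRanges.some((cidr) => inCidr(ip, cidr))
    ? "bot:verified"
    : "bot:spoofed";
}

// Placeholder range (RFC 5737 documentation block), NOT a real operator range.
const exampleRanges = ["198.51.100.0/24"];
console.log(checkCrawlerClaim("198.51.100.7", exampleRanges)); // bot:verified
console.log(checkCrawlerClaim("203.0.113.42", exampleRanges)); // bot:spoofed
```

A production check would also need IPv6 support and a way to keep the operators' published ranges current.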
Key reasons
- bot:ai_crawler: The user agent declares an AI crawler (GPTBot, ClaudeBot, PerplexityBot, Bytespider). The operator is named on the observation. AI crawlers stay visible rather than being silently allowed.
- bot:search_crawler: The user agent declares a search crawler (Googlebot, Bingbot). A verified benign search crawler is credited toward allow.
- bot:verified: The UA claims a crawler whose operator publishes IP ranges, and the request’s IP falls inside them. A real crawler is confirmed.
- bot:spoofed: The UA claims a crawler but the IP is out of the operator’s published range. The hit is scored high, so a forged Googlebot is caught.
- automation_detected: The browser fingerprint flags webdriver or headless automation, a strong tell that pushes a client-path hit toward block.
- client_token_missing: The signed handshake token is absent on a client-path hit, so no real browser ran the beacon. Server reports skip this penalty by design.
Common questions
How do I stop web scraping when the scraper does not run JavaScript?
The JS beacon only sees clients that execute scripts, so pure crawlers slip past it. Report each request from your origin with POST /v1/report, passing the visitor’s UA and IP from the incoming request (on Cloudflare, CF-Connecting-IP and User-Agent). FormShield classifies and scores it server-side, naming AI agents like GPTBot and ClaudeBot and verifying or flagging crawlers by IP range. Send it fire-and-forget with ctx.waitUntil so it adds zero latency.

Does /v1/report return a block decision I can act on inline?
No. /v1/report returns only { ok: true, request_id }; scoring happens server-side and the score, decision, and reasons are stored on the observation. View them in the dashboard Logs. Never gate your response on this call, and always wrap the fetch in try/catch so a FormShield outage can never break your page.

What does Content Protection cost?
The passive beacon (handshake / collect / report) is free: it costs zero credits, so you can instrument every pageview. Deep analysis costs 4 credits per request. Billing is in credits across all products; see Billing & Pricing for plans and overage.
Next steps
The full /v1/report reference — request body, fire-and-forget patterns, and a complete Cloudflare Worker.
Name and IP-verify crawlers, and allow or block them per project.
Every beacon attribute, the three modes, and the full observation shape.
Get scored pageviews flowing into your dashboard in minutes.