URL: /guides/bot-detection

---
title: Bot detection
description: How FormShield names, IP-verifies, and lets you allow or block crawlers and AI agents
---

FormShield identifies the bots that hit your site by their user agent. For
operators that publish their IP ranges, it goes further and **verifies** that the
request actually came from that operator. A user agent is trivial to forge, so a
"Googlebot" request from an IP outside Google's ranges is flagged as **spoofed**
and scored high, the opposite of how a genuine Googlebot is treated.

This page covers what FormShield identifies, the verified-versus-spoofed
distinction, the fields and reasons it adds to each observation, and how to allow
or block bots per project.

## What FormShield identifies

Every pageview and [server-reported](/guides/server-reporting) request runs through
a registry of known bots across three groups:

<CardGroup cols={3}>
  <Card title="AI Crawlers" icon="bot">
    GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, PerplexityBot, Google-Extended, Bytespider, CCBot, Amazonbot, and more.
  </Card>
  <Card title="Search Engines" icon="magnifying-glass">
    Googlebot, Bingbot, DuckDuckBot, Applebot, YandexBot, Baiduspider, and others.
  </Card>
  <Card title="SEO Tools" icon="chart-line">
    AhrefsBot, SemrushBot, MJ12bot, DotBot, DataForSeoBot, Screaming Frog.
  </Card>
</CardGroup>

A matched bot adds its identity to the observation: a stable `bot_id`, the
`bot_operator` (Google, OpenAI, Anthropic, …), the `bot_category`, and the group.
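As a rough illustration of user-agent matching, the sketch below looks up a request's user agent in a tiny registry. Everything here is an assumption made for the example: the `BotEntry` shape, the `ua_token` substring match, the `ahrefsbot` id, and the three entries themselves. It is not FormShield's actual registry or matching logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BotEntry:
    bot_id: str     # stable id, e.g. "googlebot"
    operator: str   # operating company
    category: str   # "ai_crawler", "search_crawler", or "seo_tool"
    ua_token: str   # substring looked for in the user agent (assumption)


# Hypothetical three-entry registry; a real one would be far larger.
REGISTRY = [
    BotEntry("gptbot", "OpenAI", "ai_crawler", "GPTBot"),
    BotEntry("googlebot", "Google", "search_crawler", "Googlebot"),
    BotEntry("ahrefsbot", "Ahrefs", "seo_tool", "AhrefsBot"),
]


def match_bot(user_agent: str) -> Optional[BotEntry]:
    """Return the first registry entry whose token appears in the user agent."""
    ua = user_agent.lower()
    for entry in REGISTRY:
        if entry.ua_token.lower() in ua:
            return entry
    return None  # no known bot matched
```

A match on user agent alone only says what the request claims to be, which is why the IP check described next matters.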

## Verified vs spoofed

Some operators publish the IP ranges their crawlers run from. For those, FormShield
checks that the request IP is actually inside the operator's ranges and records a
three-state result:

| `bot_verified` | Meaning |
| --- | --- |
| `true` | The user agent matches **and** the IP is in the operator's published ranges. A genuine crawler. |
| `false` | **Spoofed.** The user agent claims a crawler whose operator publishes ranges, but the IP is not in them. The classic impersonation pattern — scored high. |
| `null` | Unverifiable: the bot's operator publishes no ranges (the bot is named on user agent alone), or no IP was available. |
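The three states can be sketched with Python's standard `ipaddress` module. The `PUBLISHED_RANGES` table and the `verify` function are hypothetical, and the CIDR blocks are reserved documentation ranges, not any operator's real ranges. Treating an unparsable IP as "no IP available" is also an assumption of this sketch.

```python
import ipaddress
from typing import Optional

# Placeholder CIDRs from reserved documentation blocks, NOT real operator ranges.
PUBLISHED_RANGES = {
    "googlebot": ["192.0.2.0/24"],
    "gptbot": ["198.51.100.0/24"],
}


def verify(bot_id: str, ip: Optional[str]) -> Optional[bool]:
    """Return True (verified), False (spoofed), or None (unverifiable)."""
    ranges = PUBLISHED_RANGES.get(bot_id)
    if ranges is None or ip is None:
        return None  # operator publishes no ranges, or no IP was available
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return None  # assumption: an unparsable IP counts as "no IP"
    # True if the address falls inside any published range, else False (spoofed).
    return any(addr in ipaddress.ip_network(cidr) for cidr in ranges)
```

With these placeholder ranges, `verify("googlebot", "192.0.2.10")` is `True`, the same user agent from `203.0.113.5` is `False`, and a bot with no published ranges yields `None`.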

### Which bots are IP-verified

FormShield verifies against the operators that publish authoritative IP ranges:

| Operator | Verified bots |
| --- | --- |
| Google | Googlebot and its family (Googlebot-Image, Storebot-Google, Google-InspectionTool) |
| Microsoft | Bingbot, BingPreview |
| OpenAI | GPTBot, ChatGPT-User, OAI-SearchBot |
| DuckDuckGo | DuckDuckBot |

Every other crawler (ClaudeBot, PerplexityBot, Bytespider, Google-Extended,
Applebot, CCBot, Amazonbot, the SEO tools, and the rest) is named on its user
agent alone. These carry `bot_verified: null`: FormShield can tell you the request
*claims* to be that bot, but it can't confirm the claim from the IP. As more
operators publish IP ranges, more bots move into the verified set.

<Note>
  A spoofed crawler is a real signal of abuse. A forged "Googlebot" from a
  datacenter IP that isn't Google's is exactly what a scraper or attacker sends to
  slip past naive user-agent allowlists. FormShield scores it like the
  impersonation it is.
</Note>

## What it adds to an observation

Bot detection adds fields to the observation's metadata and reasons to its
`reasons` array.

| Field | Meaning |
| --- | --- |
| `bot_id` | The matched bot's stable id, e.g. `googlebot`, `gptbot`. `null` when no bot matched. |
| `bot_operator` | The operating company, e.g. `Google`, `OpenAI`. |
| `bot_category` | `ai_crawler`, `search_crawler`, or `seo_tool`. |
| `bot_verified` | `true` (IP-verified), `false` (spoofed), or `null` (unverifiable / no IP). |

| Reason | Meaning |
| --- | --- |
| `bot:id:<id>` | A bot was identified, e.g. `bot:id:googlebot`. |
| `bot:verified` | The bot is IP-verified — user agent and IP agree. |
| `bot:spoofed:<id>` | The user agent claims this bot, but the IP is out of the operator's ranges. |
| `bot:unverified` | The bot is named on user agent alone (no published ranges, or no IP). |

These sit alongside the user-agent reasons (`bot:ai_crawler`, `bot:search_crawler`,
and the agent name) described in [pageview tracking](/guides/pageview-tracking).
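To show how the fields and reasons in the two tables line up, here is a minimal sketch that derives the reason strings from `bot_id` and `bot_verified`. The `bot_reasons` function is hypothetical, and it assumes `bot:id:<id>` is always emitted alongside exactly one verification reason, which the tables suggest but do not state outright.

```python
from typing import List, Optional


def bot_reasons(bot_id: Optional[str], bot_verified: Optional[bool]) -> List[str]:
    """Build the bot-detection reasons implied by the two metadata fields."""
    if bot_id is None:
        return []  # no bot matched, so no bot reasons
    reasons = [f"bot:id:{bot_id}"]
    if bot_verified is True:
        reasons.append("bot:verified")            # user agent and IP agree
    elif bot_verified is False:
        reasons.append(f"bot:spoofed:{bot_id}")   # IP outside the operator's ranges
    else:
        reasons.append("bot:unverified")          # no published ranges, or no IP
    return reasons
```

Read the other way around, a `bot:spoofed:<id>` entry in `reasons` corresponds to `bot_verified` being `false` for that bot.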

## How bots affect the score

A verified search crawler and a spoofed one land at opposite ends of the score:

- **Verified search crawlers** (Googlebot, Bingbot, DuckDuckBot) default to
  **allow**. A bot user agent and a datacenter IP are expected for them rather
  than a sign of risk, so FormShield credits those signals back instead of
  penalizing them.
- **Verified AI crawlers** (GPTBot, OAI-SearchBot, …) stay visible at **review**.
  They are benign, but whether you serve them is a business decision, so
  FormShield surfaces them for you to allow or block per project rather than
  waving them through.
- **Spoofed crawlers** are scored high: a spoofed user agent on its own is enough
  to reach **block**.
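The three cases above can be summarized as a small lookup. This is only a sketch of the documented defaults, not the scoring model: `default_outcome` is a hypothetical name, and returning `None` for every other combination reflects that this guide does not say where those land.

```python
from typing import Optional


def default_outcome(bot_category: str, bot_verified: Optional[bool]) -> Optional[str]:
    """Map category and verification state to the default outcome named in this guide."""
    if bot_verified is False:
        return "block"   # a spoofed user agent alone is enough to reach block
    if bot_verified is True and bot_category == "search_crawler":
        return "allow"   # verified search crawlers default to allow
    if bot_verified is True and bot_category == "ai_crawler":
        return "review"  # verified AI crawlers stay visible at review
    return None          # not specified here; the overall score decides
```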

## Allow or block bots per project

Open the project's **Settings → Bots** in the [dashboard](https://formshield.dev/app).
Each group (AI Crawlers, Search Engines, SEO Tools) and each individual bot has a
three-state control:

| Rule | Effect |
| --- | --- |
| **Default** | No override — the score decides. |
| **Allow** | The bot's traffic is allowed past the score. |
| **Block** | The bot's traffic is blocked. |

A per-bot rule overrides its group's rule, so you can block a whole group and
allow one bot within it, or vice versa.

A shield icon marks the bots that are IP-verifiable. An **Allow** rule only ever
takes effect for those bots, and only when the request itself is verified.

<Warning>
  An **Allow** rule applies **only to an IP-verified bot**, that is, a request
  whose IP falls inside the operator's published ranges. A spoofed user agent, or
  any bot FormShield can't IP-verify, is never allow-listed. A forged "Googlebot"
  can't smuggle itself onto your allow rule; if it tries, FormShield records the
  attempt and leaves the block in place. A **Block** rule, by contrast, applies to
  any matching user agent, verified or not, because blocking a claimed bot is
  always safe.
</Warning>
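The precedence and the verification gate can be sketched as follows. The function name, the rule strings, and the group keys are hypothetical. The sketch also assumes that a per-bot rule left on Default falls through to the group rule, which is one reading of "no override".

```python
from typing import Dict, Optional


def effective_rule(
    bot_id: str,
    group: str,
    bot_verified: Optional[bool],
    bot_rules: Dict[str, str],
    group_rules: Dict[str, str],
) -> str:
    """Resolve which rule actually applies: "allow", "block", or "default"."""
    # A per-bot rule overrides its group's rule (assumption: unless it is "default").
    rule = bot_rules.get(bot_id, "default")
    if rule == "default":
        rule = group_rules.get(group, "default")

    if rule == "block":
        return "block"    # Block applies to any matching user agent
    if rule == "allow" and bot_verified is True:
        return "allow"    # Allow applies only to an IP-verified request
    return "default"      # otherwise the score decides
```

For example, with the AI Crawlers group set to Block and GPTBot set to Allow, a verified GPTBot request resolves to `"allow"`, a spoofed one falls back to `"default"` (where its high score applies), and every other AI crawler resolves to `"block"`.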

So a practical setup for crawler-heavy traffic: leave verified search crawlers on
**Default** (they already default to allow), and decide per AI crawler whether to
**Allow** it (you welcome it) or **Block** it (you don't want it training on your
content).

## Next steps

<CardGroup cols={2}>
  <Card title="Pageview tracking" icon="chart-line" href="/guides/pageview-tracking">
    The client-side beacon and the observation it produces.
  </Card>
  <Card title="Server reporting" icon="server" href="/guides/server-reporting">
    Capture and verify crawlers that never run JavaScript.
  </Card>
</CardGroup>