
Commit de7dbfe

tabletcorry and Xe authored
Split up AI filtering files (#592)
* Split up AI filtering files

  Create aggressive/moderate/permissive policies to allow administrators to choose their AI/LLM stance. The aggressive policy matches the existing default in Anubis. Removes the `Google-Extended` flag from `ai-robots-txt.yaml` as it doesn't exist in requests. Renames `ai-robots-txt.yaml` to `ai-catchall.yaml` as the file is no longer a copy of the source repo/file.

* chore: spelling
* chore: fix embeds
* chore: fix data includes
* chore: fix file name typo
* chore: Ignore READMEs in configs
* chore(lib/policy/config): go tool goimports -w

Signed-off-by: Xe Iaso <[email protected]>
Co-authored-by: Xe Iaso <[email protected]>
1 parent 77e0bbb commit de7dbfe

File tree: 19 files changed, +107 −18 lines changed


.github/actions/spelling/expect.txt

Lines changed: 25 additions & 0 deletions
@@ -18,7 +18,9 @@ blueskybot
 boi
 botnet
 BPort
+Brightbot
 broked
+Bytespider
 cachebuster
 Caddyfile
 caninetools
@@ -41,6 +43,7 @@ cloudflare
 confd
 containerbuild
 coreutils
+Cotoyogi
 CRDs
 crt
 daemonizing
@@ -49,6 +52,7 @@ Debian
 debrpm
 decaymap
 decompiling
+Diffbot
 discordapp
 discordbot
 distros
@@ -66,11 +70,15 @@ everyones
 evilbot
 evilsite
 expressionorlist
+externalagent
+externalfetcher
 extldflags
 facebookgo
+Factset
 fastcgi
 fediverse
 finfos
+Firecrawl
 flagenv
 Fordola
 forgejo
@@ -86,6 +94,7 @@ googlebot
 govulncheck
 GPG
 GPT
+gptbot
 grw
 Hashcash
 hashrate
@@ -97,8 +106,11 @@ hostable
 htmx
 httpdebug
 hypertext
+iaskspider
 iat
 ifm
+Imagesift
+imgproxy
 inp
 iss
 isset
@@ -146,11 +158,15 @@ nginx
 nobots
 NONINFRINGEMENT
 nosleep
+OCOB
 ogtags
+omgili
+omgilibot
 onionservice
 openai
 openrc
 pag
+Pangu
 parseable
 passthrough
 Patreon
@@ -185,18 +201,22 @@ RUnlock
 sas
 sasl
 Scumm
+searchbot
 searx
 sebest
 secretplans
 selfsigned
+Semrush
 setsebool
 shellcheck
+Sidetrade
 sitemap
 sls
 sni
 Sourceware
 Spambot
 sparkline
+spyderbot
 srv
 stackoverflow
 startprecmd
@@ -212,12 +232,15 @@ techarohq
 templ
 templruntime
 testarea
+Tik
+Timpibot
 torproject
 traefik
 unixhttpd
 unmarshal
 uvx
 Varis
+Velen
 vendored
 vhosts
 videotest
@@ -227,9 +250,11 @@ webmaster
 webpage
 websecure
 websites
+Webzio
 wordpress
 Workaround
 workdir
+wpbot
 xcaddy
 Xeact
 xeiaso

data/botPolicies.json

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
     "import": "(data)/bots/_deny-pathological.yaml"
   },
   {
-    "import": "(data)/bots/ai-robots-txt.yaml"
+    "import": "(data)/meta/ai-block-aggressive.yaml"
   },
   {
     "import": "(data)/crawlers/_allow-good.yaml"

data/botPolicies.yaml

Lines changed: 6 additions & 2 deletions
@@ -17,8 +17,12 @@ bots:
   import: (data)/bots/_deny-pathological.yaml
 - import: (data)/bots/aggressive-brazilian-scrapers.yaml

-# Enforce https://github.com/ai-robots-txt/ai.robots.txt
-- import: (data)/bots/ai-robots-txt.yaml
+# Aggressively block AI/LLM related bots/agents by default
+- import: (data)/meta/ai-block-aggressive.yaml
+
+# Consider replacing the aggressive AI policy with more selective policies:
+# - import: (data)/meta/ai-block-moderate.yaml
+# - import: (data)/meta/ai-block-permissive.yaml

 # Search engine crawlers to allow, defaults to:
 # - Google (so they don't try to bypass Anubis)
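Switching stances later is a one-line edit to the import list in botPolicies.yaml. A sketch, assuming the ai-block-moderate.yaml file named in this commit's diff:

```yaml
bots:
  # ... other imports unchanged ...
  # Swap the default aggressive stance for the moderate one:
  # - import: (data)/meta/ai-block-aggressive.yaml
  - import: (data)/meta/ai-block-moderate.yaml
```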

data/bots/ai-catchall.yaml

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# Extensive list of AI-affiliated agents based on https://github.com/ai-robots-txt/ai.robots.txt
+# Add new/undocumented agents here. Where documentation exists, consider moving to dedicated policy files.
+# Notes on various agents:
+# - Amazonbot: Well documented, but they refuse to state which agent collects training data.
+# - anthropic-ai/Claude-Web: Undocumented by Anthropic. Possibly deprecated or hallucinations?
+# - Perplexity*: Well documented, but they refuse to state which agent collects training data.
+# Warning: May contain user agents that _must_ be blocked in robots.txt, or the opt-out will have no effect.
+- name: "ai-catchall"
+  user_agent_regex: >-
+    AI2Bot|Ai2Bot-Dolma|aiHitBot|Amazonbot|anthropic-ai|Brightbot 1.0|Bytespider|CCBot|Claude-Web|cohere-ai|cohere-training-data-crawler|Cotoyogi|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|Factset_spyderbot|FirecrawlAgent|FriendlyCrawler|Google-CloudVertexBot|GoogleOther|GoogleOther-Image|GoogleOther-Video|iaskspider/2.0|ICC-Crawler|ImagesiftBot|img2dataset|imgproxy|ISSCyberRiskCrawler|Kangaroo Bot|meta-externalagent|Meta-ExternalAgent|meta-externalfetcher|Meta-ExternalFetcher|NovaAct|omgili|omgilibot|Operator|PanguBot|Perplexity-User|PerplexityBot|PetalBot|QualifiedBot|Scrapy|SemrushBot-OCOB|SemrushBot-SWA|Sidetrade indexer bot|TikTokSpider|Timpibot|VelenPublicWebCrawler|Webzio-Extended|wpbot|YouBot
+  action: DENY
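A quick way to sanity-check an unanchored alternation like this is to compile it and probe sample User-Agent strings. A minimal Go sketch using a small subset of the pattern (the full list lives in ai-catchall.yaml; this is not Anubis's actual matching code):

```go
package main

import (
	"fmt"
	"regexp"
)

// Subset of the ai-catchall alternation above. Unanchored alternatives
// like these match anywhere in the User-Agent header.
var aiCatchall = regexp.MustCompile(`Bytespider|CCBot|Diffbot|PerplexityBot|Timpibot`)

func main() {
	for _, ua := range []string{
		"Mozilla/5.0 (compatible; Bytespider; https://zhanzhang.toutiao.com/)",
		"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
	} {
		// Prints whether each sample UA would be caught by the subset.
		fmt.Printf("%t %s\n", aiCatchall.MatchString(ua), ua)
	}
}
```

The first UA matches (Bytespider is in the list), the second does not.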

data/bots/ai-robots-txt.yaml

Lines changed: 0 additions & 6 deletions
This file was deleted.

data/clients/ai.yaml

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+# User agents that act on behalf of humans in AI tools, e.g. searching the web.
+# Each entry should have a positive/ALLOW entry created as well, with further documentation.
+# Exceptions:
+# - Claude-User: No published IP allowlist
+- name: "ai-clients"
+  user_agent_regex: >-
+    ChatGPT-User|Claude-User|MistralAI-User
+  action: DENY

data/crawlers/ai-search.yaml

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+# User agents that index exclusively for search in AI systems.
+# Each entry should have a positive/ALLOW entry created as well, with further documentation.
+# Exceptions:
+# - Claude-SearchBot: No published IP allowlist
+- name: "ai-crawlers-search"
+  user_agent_regex: >-
+    OAI-SearchBot|Claude-SearchBot
+  action: DENY
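The file's comments ask for a positive/ALLOW companion entry per agent. A hypothetical sketch of such a companion rule (the name and layout are illustrative and not part of this commit; it would need to sort before the DENY rule to take effect):

```yaml
# Hypothetical companion ALLOW entry for a verified search agent.
- name: "oai-searchbot"
  user_agent_regex: OAI-SearchBot
  action: ALLOW
```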

data/crawlers/ai-training.yaml

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+# User agents that crawl for training AI/LLM systems
+# Each entry should have a positive/ALLOW entry created as well, with further documentation.
+# Exceptions:
+# - ClaudeBot: No published IP allowlist
+- name: "ai-crawlers-training"
+  user_agent_regex: >-
+    GPTBot|ClaudeBot
+  action: DENY

data/embed.go

Lines changed: 1 addition & 1 deletion
@@ -3,6 +3,6 @@ package data
 import "embed"

 var (
-	//go:embed botPolicies.yaml botPolicies.json all:apps all:bots all:clients all:common all:crawlers
+	//go:embed botPolicies.yaml botPolicies.json all:apps all:bots all:clients all:common all:crawlers all:meta
 	BotPolicies embed.FS
 )

data/meta/README.md

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+# meta policies
+
+Contains policies that exclusively reference policies in _multiple_ other data folders.
+
+Akin to "stances" that the administrator can take, with reference to various topics, such as AI/LLM systems.
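Given the per-category files added in this commit, a meta "stance" file is presumably just a list of cross-folder imports. A hypothetical sketch of what ai-block-moderate.yaml might contain (its actual contents are not shown in this diff):

```yaml
# Hypothetical: a moderate stance could deny training crawlers and the
# catchall while leaving AI search and client agents to other policies.
- import: (data)/bots/ai-catchall.yaml
- import: (data)/crawlers/ai-training.yaml
```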

0 commit comments
