feat: implement a DENY_AND_REROUTE action to redirect to tarpits #705


Open · wants to merge 8 commits into base: main
3 changes: 3 additions & 0 deletions .github/actions/spelling/expect.txt
@@ -122,6 +122,7 @@ healthcheck
hebis
hec
hmc
honeypots
hostable
htmlc
htmx
@@ -135,6 +136,7 @@ Imagesift
imgproxy
impressum
inp
Iocaine
IPTo
iptoasn
iss
@@ -264,6 +266,7 @@ subrequest
SVCNAME
tagline
tarballs
tarpit
tarrif
techaro
techarohq
2 changes: 2 additions & 0 deletions docs/docs/CHANGELOG.md
@@ -11,6 +11,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

- Added `DENY_AND_REROUTE` action for redirecting denied requests to external AI tarpits ([#61](https://github.com/json-kyle/anubis/issues/61))
- Fix OpenGraph passthrough ([#717](https://github.com/TecharoHQ/anubis/issues/717))
- Determine the `BIND_NETWORK`/`--bind-network` value from the bind address ([#677](https://github.com/TecharoHQ/anubis/issues/677))
- Implement localization system. Find locale files in lib/localization/locales/.
- Fix dynamic cookie domains functionality ([#731](https://github.com/TecharoHQ/anubis/pull/731))
75 changes: 67 additions & 8 deletions docs/docs/admin/policies.mdx
@@ -116,13 +116,14 @@

## Writing your own rules

There are three actions that can be returned from a rule:
There are four actions that can be returned from a rule:

| Action | Effects |
| :---------- | :-------------------------------------------------------------------------------- |
| `ALLOW` | Bypass all further checks and send the request to the backend. |
| `DENY` | Deny the request and send back an error message that scrapers think is a success. |
| `CHALLENGE` | Show a challenge page and/or validate that clients have passed a challenge. |
| Action | Effects |
|:-------------------|:----------------------------------------------------------------------------------|
| `ALLOW` | Bypass all further checks and send the request to the backend. |
| `DENY` | Deny the request and send back an error message that scrapers think is a success. |
| `DENY_AND_REROUTE` | Deny the request and redirect it to an external URL (e.g. a [tarpit](#tarpits)).  |
| `CHALLENGE`        | Show a challenge page and/or validate that clients have passed a challenge.        |

Name your rules in lower case using kebab-case. Rule names will be exposed in Prometheus metrics.

@@ -170,7 +171,7 @@
Challenges can be configured with these settings:

| Key | Example | Description |
| :----------- | :------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|:-------------|:---------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `difficulty` | `4` | The challenge difficulty (number of leading zeros) for proof-of-work. See [Why does Anubis use Proof-of-Work?](/docs/design/why-proof-of-work) for more details. |
| `report_as` | `4` | What difficulty the UI should report to the user. Useful for messing with industrial-scale scraping efforts. |
| `algorithm` | `"fast"` | The algorithm used on the client to run proof-of-work calculations. This must be set to `"fast"` or `"slow"`. See [Proof-of-Work Algorithm Selection](./algorithm-selection) for more details. |
@@ -242,13 +243,71 @@
In case your service needs it for risk calculation reasons, Anubis exposes information about the rules that any requests match using a few headers:

| Header | Explanation | Example |
| :---------------- | :--------------------------------------------------- | :--------------- |
|:------------------|:-----------------------------------------------------|:-----------------|
| `X-Anubis-Rule` | The name of the rule that was matched | `bot/lightpanda` |
| `X-Anubis-Action` | The action that Anubis took in response to that rule | `CHALLENGE` |
| `X-Anubis-Status` | The status and how strict Anubis was in its checks | `PASS` |

Policy rules are matched using [Go's standard library regular expressions package](https://pkg.go.dev/regexp). You can experiment with the syntax at [regex101.com](https://regex101.com); make sure to select the Golang flavor.
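As a quick standalone illustration (not code from Anubis itself; the pattern is just an example in the style of `user_agent_regex`), matching a user agent with Go's `regexp` package looks like this:

```go
package main

import (
	"fmt"
	"regexp"
)

// botPattern is an example pattern in the style used by user_agent_regex rules.
var botPattern = regexp.MustCompile(`(ChatGPT|GPTBot|Claude-Web|OpenAI|Anthropic)`)

// matchesBot reports whether the given User-Agent string matches the pattern.
func matchesBot(userAgent string) bool {
	return botPattern.MatchString(userAgent)
}

func main() {
	fmt.Println(matchesBot("Mozilla/5.0 (compatible; GPTBot/1.0)")) // true
	fmt.Println(matchesBot("Mozilla/5.0 (X11; Linux x86_64)"))      // false
}
```

Because Go's `regexp` uses RE2 syntax, some PCRE features (like backreferences) are unavailable, which is why the Golang flavor matters on regex101.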

### Deny and Reroute Configuration {#tarpits}

The `DENY_AND_REROUTE` action allows you to redirect denied requests to external AI tarpits or honeypots. This is useful for sending bot traffic to services like [Nepenthes](https://zadzmo.org/code/nepenthes/) or [Iocaine](https://iocaine.madhouse-project.org/) that specialize in wasting bots' time and resources.

<Tabs>
<TabItem value="json" label="JSON" default>

```json
{
"name": "ai-scrapers-to-tarpit",
"user_agent_regex": "(ChatGPT|GPTBot|Claude-Web|OpenAI|Anthropic)",
"action": "DENY_AND_REROUTE",
"reroute_to": "https://tarpit.example.com/honeypot"
}
```

</TabItem>
<TabItem value="yaml" label="YAML">

```yaml
- name: ai-scrapers-to-tarpit
user_agent_regex: (ChatGPT|GPTBot|Claude-Web|OpenAI|Anthropic)
action: DENY_AND_REROUTE
reroute_to: https://tarpit.example.com/honeypot
```

</TabItem>
</Tabs>

The `reroute_to` field must contain an absolute URL (including the scheme like `http://` or `https://`). When this rule matches, Anubis will send a `307 Temporary Redirect` response to redirect the client to the specified URL.
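The absolute-URL check can be sketched roughly like this (a standalone illustration using Go's `net/url`, mirroring the validation added in `lib/policy/config/config.go` but not the exact code):

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// validateRerouteTo sketches the checks applied to reroute_to:
// the value must be present, must parse as a URL, and must be
// absolute (i.e. carry a scheme such as http:// or https://).
func validateRerouteTo(raw string) error {
	if raw == "" {
		return errors.New("reroute_to is required for DENY_AND_REROUTE")
	}
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("invalid reroute_to URL: %w", err)
	}
	if !u.IsAbs() {
		return errors.New("reroute_to must be absolute (include a scheme)")
	}
	return nil
}

func main() {
	fmt.Println(validateRerouteTo("https://tarpit.example.com/honeypot")) // <nil>
	fmt.Println(validateRerouteTo("/honeypot"))                           // error: not absolute
}
```

Note that a bare path like `/honeypot` parses without error but fails `IsAbs`, which is why the scheme requirement is called out explicitly above.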

#### Usage

**Requirements:**
- `reroute_to` must be an absolute URL with scheme (`http://` or `https://`)
- Returns HTTP 307 Temporary Redirect to the specified URL

**Examples:**
```yaml
# Redirect suspicious AI scrapers with high weight
- name: ai-to-tarpit
action: DENY_AND_REROUTE
expression:
all:
- userAgent.contains("GPT") || userAgent.contains("Claude")
- weight > 10
reroute_to: https://tarpit.example.com/honeypot

# Reroute scrapers probing for unsecured PHP files. Useful for sites that don't serve PHP.
- name: php-scraper-reroute
action: DENY_AND_REROUTE
expression:
all:
- path.endsWith(".php") # PHP files are often targeted by bots
- weight > 5
reroute_to: https://example.com/not-found
```

## Request Weight

Anubis rules can also add or remove "weight" from requests, allowing administrators to configure custom levels of suspicion. For example, if your application uses session tokens named `i_love_gitea`:
1 change: 1 addition & 0 deletions docs/docs/admin/robots2policy.mdx
@@ -2,6 +2,7 @@
title: robots2policy CLI Tool
sidebar_position: 50
---
> "LET'S MAKE ROBOTS.TXT GREAT AGAIN!" - [Jason Cameron](https://jsn.cam/)

The `robots2policy` tool converts robots.txt files into Anubis challenge policies. It reads robots.txt rules and generates equivalent CEL expressions for path matching and user-agent filtering.

22 changes: 22 additions & 0 deletions lib/anubis.go
@@ -231,6 +231,17 @@ func (s *Server) checkRules(w http.ResponseWriter, r *http.Request, cr policy.Ch
lg.Debug("rule hash", "hash", hash)
s.respondWithStatus(w, r, fmt.Sprintf("%s %s", localizer.T("access_denied"), hash), s.policy.StatusCodes.Deny)
return true
case config.RuleDenyAndReroute:
s.ClearCookie(w, s.cookieName, cookiePath)
lg.Info("deny and reroute", "reroute_to", cr.RerouteTo)
if cr.RerouteTo == nil || *cr.RerouteTo == "" {
lg.Error("reroute URL is missing for DENY_AND_REROUTE action")
s.respondWithError(w, r, "Internal Server Error: administrator has misconfigured Anubis. Please contact the administrator and ask them to look for the logs around \"maybeReverseProxy.RuleDenyAndReroute\"")
return true
}
// note for others, would it be better to be reverse proxying here?
(Review comment from the author: I just had some concerns about Anubis performance and how often scrapers would end up on this path.)

http.Redirect(w, r, *cr.RerouteTo, http.StatusTemporaryRedirect)
return true
case config.RuleChallenge:
lg.Debug("challenge requested")
case config.RuleBenchmark:
@@ -432,6 +443,15 @@ func cr(name string, rule config.Rule, weight int) policy.CheckResult {
}
}

func crWithReroute(name string, rule config.Rule, weight int, rerouteTo *string) policy.CheckResult {
return policy.CheckResult{
Name: name,
Rule: rule,
Weight: weight,
RerouteTo: rerouteTo,
}
}

// Check evaluates the list of rules, and returns the result
func (s *Server) check(r *http.Request) (policy.CheckResult, *policy.Bot, error) {
host := r.Header.Get("X-Real-Ip")
@@ -456,6 +476,8 @@ func (s *Server) check(r *http.Request) (policy.CheckResult, *policy.Bot, error)
switch b.Action {
case config.RuleDeny, config.RuleAllow, config.RuleBenchmark, config.RuleChallenge:
return cr("bot/"+b.Name, b.Action, weight), &b, nil
case config.RuleDenyAndReroute:
return crWithReroute("bot/"+b.Name, b.Action, weight, b.RerouteTo), &b, nil
case config.RuleWeigh:
slog.Debug("adjusting weight", "name", b.Name, "delta", b.Weight.Adjust)
weight += b.Weight.Adjust
73 changes: 73 additions & 0 deletions lib/anubis_test.go
@@ -633,6 +633,79 @@ func TestRuleChange(t *testing.T) {
}
}

func TestDenyAndRerouteAction(t *testing.T) {
h := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
t.Log(r.UserAgent())
w.WriteHeader(http.StatusOK)
fmt.Fprintln(w, "OK")
})

pol := loadPolicies(t, "./testdata/deny_and_reroute_test.yaml", 4)

srv := spawnAnubis(t, Options{
Next: h,
Policy: pol,
})

ts := httptest.NewServer(internal.RemoteXRealIP(true, "tcp", srv))
defer ts.Close()

testCases := []struct {
userAgent string
expectedCode int
expectedURL string
}{
{
userAgent: "REROUTE_ME",
expectedCode: http.StatusTemporaryRedirect,
expectedURL: "https://example.com/tarpit",
},
{
userAgent: "DENY_ME",
expectedCode: http.StatusOK, // From status_codes config
},
{
userAgent: "ALLOW_ME",
expectedCode: http.StatusOK,
},
}

for _, tc := range testCases {
t.Run(tc.userAgent, func(t *testing.T) {
client := &http.Client{
CheckRedirect: func(req *http.Request, via []*http.Request) error {
// Don't follow redirects, we want to test the redirect response
return http.ErrUseLastResponse
},
}

req, err := http.NewRequestWithContext(t.Context(), http.MethodGet, ts.URL, nil)
if err != nil {
t.Fatal(err)
}

req.Header.Set("User-Agent", tc.userAgent)

resp, err := client.Do(req)
if err != nil {
t.Fatal(err)
}
defer resp.Body.Close()

if resp.StatusCode != tc.expectedCode {
t.Errorf("wanted status code %d but got: %d", tc.expectedCode, resp.StatusCode)
}

if tc.expectedURL != "" {
location := resp.Header.Get("Location")
if location != tc.expectedURL {
t.Errorf("wanted Location header %q but got: %q", tc.expectedURL, location)
}
}
})
}
}

func TestStripBasePrefixFromRequest(t *testing.T) {
testCases := []struct {
name string
1 change: 1 addition & 0 deletions lib/policy/bot.go
@@ -14,6 +14,7 @@ type Bot struct {
Weight *config.Weight
Name string
Action config.Rule
RerouteTo *string
}

func (b Bot) Hash() string {
7 changes: 4 additions & 3 deletions lib/policy/checkresult.go
@@ -7,9 +7,10 @@ import (
)

type CheckResult struct {
Name string
Rule config.Rule
Weight int
Name string
Rule config.Rule
Weight int
RerouteTo *string
}

func (cr CheckResult) LogValue() slog.Value {
36 changes: 27 additions & 9 deletions lib/policy/config/config.go
@@ -7,6 +7,7 @@ import (
"io/fs"
"net"
"net/http"
"net/url"
"os"
"regexp"
"strings"
@@ -31,22 +32,25 @@
ErrCantSetBotAndImportValuesAtOnce = errors.New("config.BotOrImport: can't set bot rules and import values at the same time")
ErrMustSetBotOrImportRules = errors.New("config.BotOrImport: rule definition is invalid, you must set either bot rules or an import statement, not both")
ErrStatusCodeNotValid = errors.New("config.StatusCode: status code not valid, must be between 100 and 599")
ErrRerouteURLRequired = errors.New("config.Bot: reroute_to URL is required when using DENY_AND_REROUTE action")
ErrInvalidRerouteURL = errors.New("config.Bot: invalid reroute_to URL")
)

type Rule string

const (
RuleUnknown Rule = ""
RuleAllow Rule = "ALLOW"
RuleDeny Rule = "DENY"
RuleChallenge Rule = "CHALLENGE"
RuleWeigh Rule = "WEIGH"
RuleBenchmark Rule = "DEBUG_BENCHMARK"
RuleUnknown Rule = ""
RuleAllow Rule = "ALLOW"
RuleDeny Rule = "DENY"
RuleDenyAndReroute Rule = "DENY_AND_REROUTE"
RuleChallenge Rule = "CHALLENGE"
RuleWeigh Rule = "WEIGH"
RuleBenchmark Rule = "DEBUG_BENCHMARK"
)

func (r Rule) Valid() error {
switch r {
case RuleAllow, RuleDeny, RuleChallenge, RuleWeigh, RuleBenchmark:
case RuleAllow, RuleDeny, RuleDenyAndReroute, RuleChallenge, RuleWeigh, RuleBenchmark:
return nil
default:
return ErrUnknownAction
@@ -65,6 +69,7 @@ type BotConfig struct {
Name string `json:"name" yaml:"name"`
Action Rule `json:"action" yaml:"action"`
RemoteAddr []string `json:"remote_addresses,omitempty" yaml:"remote_addresses,omitempty"`
RerouteTo *string `json:"reroute_to,omitempty" yaml:"reroute_to,omitempty"`

// Thoth features
GeoIP *GeoIP `json:"geoip,omitempty"`
@@ -80,6 +85,7 @@ func (b BotConfig) Zero() bool {
b.Action != "",
len(b.RemoteAddr) != 0,
b.Challenge != nil,
b.RerouteTo != nil,
b.GeoIP != nil,
b.ASNs != nil,
} {
@@ -163,8 +169,8 @@ func (b *BotConfig) Valid() error {
}
}

switch b.Action {
case RuleAllow, RuleBenchmark, RuleChallenge, RuleDeny, RuleWeigh:
switch b.Action { // todo(json) refactor to use method above
case RuleAllow, RuleBenchmark, RuleChallenge, RuleDeny, RuleDenyAndReroute, RuleWeigh:
// okay
default:
errs = append(errs, fmt.Errorf("%w: %q", ErrUnknownAction, b.Action))
@@ -180,6 +186,18 @@
b.Weight = &Weight{Adjust: 5}
}

if b.Action == RuleDenyAndReroute {
if b.RerouteTo == nil || *b.RerouteTo == "" {
errs = append(errs, ErrRerouteURLRequired)
} else {
if u, err := url.Parse(*b.RerouteTo); err != nil {
errs = append(errs, fmt.Errorf("%w: %v", ErrInvalidRerouteURL, err))
} else if !u.IsAbs() {
errs = append(errs, fmt.Errorf("%w: URL must be absolute (include scheme)", ErrInvalidRerouteURL))
}
}
}

if len(errs) != 0 {
return fmt.Errorf("config: bot entry for %q is not valid:\n%w", b.Name, errors.Join(errs...))
}