
feat(config): Add FCrDNS checker #682


Open: Axelen123 wants to merge 23 commits into base: main
Conversation

@Axelen123 commented Jun 17, 2025

Closes #431. This PR implements dynamic verification of bot IPs using DNS records. For details regarding how it works, see the documentation I have added. I am not sure if the way I added a new algorithm is the best way to implement this. Let me know if there is a better way.

  • Added a description of the changes to the [Unreleased] section of docs/docs/CHANGELOG.md
  • Added test cases to the relevant parts of the codebase
  • Ran integration tests with `npm run test:integration` (unsupported on Windows, please use WSL)
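For context on the technique: forward-confirmed reverse DNS (FCrDNS) reverse-resolves the client IP to hostnames, keeps only hostnames under a domain the bot vendor publishes, then forward-resolves those hostnames and accepts the client only if the original IP appears in the answers. A minimal standalone Go sketch of that loop (illustrative only, not the PR's actual code; the function name and domain list are made up here):

```go
package main

import (
	"context"
	"fmt"
	"net"
	"strings"
	"time"
)

// fcrdnsVerify reports whether ip passes forward-confirmed reverse DNS
// against one of the allowed domains (e.g. "googlebot.com."):
//  1. reverse-resolve the IP to hostnames (PTR records),
//  2. keep only hostnames under an allowed domain,
//  3. forward-resolve those hostnames and require the original IP to appear.
func fcrdnsVerify(ctx context.Context, ip net.IP, allowedDomains []string) (bool, error) {
	names, err := net.DefaultResolver.LookupAddr(ctx, ip.String())
	if err != nil {
		return false, err
	}
	for _, name := range names {
		trusted := false
		for _, dom := range allowedDomains {
			// The leading dot keeps lookalikes such as "evilgooglebot.com."
			// from matching "googlebot.com.".
			if strings.HasSuffix(name, "."+dom) {
				trusted = true
				break
			}
		}
		if !trusted {
			continue
		}
		addrs, err := net.DefaultResolver.LookupIPAddr(ctx, name)
		if err != nil {
			continue
		}
		for _, addr := range addrs {
			if addr.IP.Equal(ip) {
				return true, nil // forward lookup confirms the reverse record
			}
		}
	}
	return false, nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	ok, err := fcrdnsVerify(ctx, net.ParseIP("66.249.66.1"), []string{"googlebot.com."})
	fmt.Println(ok, err)
}
```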

@Axelen123 changed the title from Fcrdns to feat(lib/challenge): FCrDNS challenge method Jun 17, 2025
@Xe (Contributor) commented Jun 17, 2025

Hey, thanks for the contribution!

The CHALLENGE rule is meant more for client-facing challenges. You probably want something like the checker.Impl interface here. This will let you add an ALLOW rule for AppleBot et al.
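A hypothetical sketch of what Xe is suggesting, reusing the fcrdnsVerify helper from the sketch above. The real checker.Impl interface lives in the Anubis codebase and its exact signature is not quoted in this thread, so the shape below is an assumption:

```go
package fcrdns // hypothetical package name, for illustration only

import (
	"context"
	"net"
	"net/http"
	"time"
)

// Impl mirrors the assumed shape of Anubis's checker.Impl; the real
// interface in the repository may differ in name and signature.
type Impl interface {
	Check(r *http.Request) (bool, error)
}

// fcrdnsChecker passes a request when its remote address survives the
// forward-confirmed reverse DNS test, letting a policy attach an ALLOW
// rule to bots such as Applebot.
type fcrdnsChecker struct {
	domains []string      // e.g. []string{"applebot.apple.com."}
	timeout time.Duration // DNS budget per request
}

func (c *fcrdnsChecker) Check(r *http.Request) (bool, error) {
	host, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		return false, err
	}
	ctx, cancel := context.WithTimeout(r.Context(), c.timeout)
	defer cancel()
	// fcrdnsVerify is the helper from the sketch in the PR description above.
	return fcrdnsVerify(ctx, net.ParseIP(host), c.domains)
}
```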

@Axelen123 changed the title from feat(lib/challenge): FCrDNS challenge method to feat(config): Add FCrDNS checker Jun 22, 2025
@Axelen123 (Author) commented:

I have finished converting the implementation to a checker.

@Xe (Contributor) left a comment:

Approved modulo the change to checker.List#Check

@Axelen123 marked this pull request as draft June 25, 2025 21:17
@Axelen123 marked this pull request as ready for review June 26, 2025 19:24
@Axelen123 (Author) commented:

I have added CEL bindings and reverted to the previous checker behavior.
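The thread does not show the binding names Anubis actually exposes, so the following cel-go sketch is generic: it registers a made-up fcrdnsMatches(ip, domain) function and compiles an expression against it, which is the general mechanism CEL bindings use.

```go
package main

import (
	"fmt"

	"github.com/google/cel-go/cel"
	"github.com/google/cel-go/common/types"
	"github.com/google/cel-go/common/types/ref"
)

func main() {
	// fcrdnsMatches is a made-up name for illustration; the real CEL
	// surface added by this PR is not quoted in the conversation.
	env, err := cel.NewEnv(
		cel.Variable("remoteAddress", cel.StringType),
		cel.Function("fcrdnsMatches",
			cel.Overload("fcrdns_matches_string_string",
				[]*cel.Type{cel.StringType, cel.StringType},
				cel.BoolType,
				cel.BinaryBinding(func(ip, domain ref.Val) ref.Val {
					// An FCrDNS verifier (e.g. the earlier sketch) would
					// be called here with the two string arguments.
					_, _ = ip, domain
					return types.Bool(true) // placeholder result
				}),
			),
		),
	)
	if err != nil {
		panic(err)
	}
	ast, iss := env.Compile(`fcrdnsMatches(remoteAddress, "googlebot.com.")`)
	if iss != nil && iss.Err() != nil {
		panic(iss.Err())
	}
	fmt.Println("expression compiles:", ast != nil)
}
```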

Axelen123 and others added 5 commits June 26, 2025 21:27
If a client claims to be Googlebot but isn't from Google, that's kinda
suspicious and should be treated as such.

Signed-off-by: Xe Iaso <[email protected]>
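In CEL terms, using the made-up fcrdnsMatches binding from the sketch above plus an assumed userAgent string variable, the rule this commit message describes would read roughly:

```
userAgent.contains("Googlebot") && !fcrdnsMatches(remoteAddress, "googlebot.com.")
```

A policy would treat a match as suspicious and deny or more heavily challenge the client, which is exactly the behavior the commit describes.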
@Xe enabled auto-merge (squash) June 27, 2025 18:15
@Xe (Contributor) commented Jun 27, 2025

Thanks much! This is gonna let us do a lot of fun things :)

Signed-off-by: Xe Iaso <[email protected]>
@Xe disabled auto-merge June 27, 2025 18:20
@Xe enabled auto-merge (squash) June 27, 2025 18:23
herrbischoff added a commit to herrbischoff/anubis that referenced this pull request Jul 16, 2025
Ahrefs is a large SEO company used by single bloggers to large
enterprises. It may be beneficial to allow (or deny) them in Anubis. They
do publish rDNS entries, so once an Anubis version with TecharoHQ#682 is released,
this policy would benefit from setting up that check.

Crawler information: https://ahrefs.com/robot

Majestic is a UK-based specialist search engine and commercial SEO
entity. They claim to "spider the Web for the purpose of building a
search engine" with a distributed crawler. Defaults to allow, as it'd be
caught by the generic browser policy definition.

Crawler information: https://mj12bot.com

Screaming Frog is a smaller actor in the SEO space, and their crawler
occasionally attempts to access content despite being explicitly
excluded via robots.txt directives. As far as I could research, they
neither publish their IP ranges nor provide an information page for
their crawler. That is why this defaults to deny.

Company website: https://www.screamingfrog.co.uk

Checkmark Network is a brand and intellectual property protection
company. If you have no direct business with them, it is likely they
shouldn't be crawling your content in the first place. Defaults to deny
for this reason.

Crawler information: https://www.checkmarknetwork.com/spider.html/

Domainsbot collects information on domains and website data for
intellectual property disputes. Unless you have direct business with
them, there's likely no reason for them to be accessing your content.
Defaults to deny.

Crawler information: https://domainsbot.com/pandalytics/

zoominfo is a data mining and sales platform for enterprise use, feeding
the gathered information into a machine learning model. It is unlikely
to be of value to anyone else. Therefore, this defaults to deny.

Company website: https://www.zoominfo.com
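Putting this commit's defaults into a policy file might look like the snippet below. The field names are assumptions about the policy schema rather than lines copied from the Anubis repository, and the FCrDNS check from this PR would be layered on once a release includes it:

```yaml
# Illustrative sketch; key names are assumed, not copied from Anubis.
bots:
  - name: ahrefsbot
    user_agent_regex: AhrefsBot
    action: ALLOW # publishes rDNS (https://ahrefs.com/robot); pair with the FCrDNS check from TecharoHQ#682
  - name: screaming-frog
    user_agent_regex: Screaming Frog SEO Spider
    action: DENY # no published IP ranges or crawler info page
```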
Successfully merging this pull request may close these issues.

Dynamic validation of good bot IP addresses