Description
The solution chosen by the author after a heated discussion in #293 was to support an opt-out expressed in HTTP headers, via the well-known values "noindex" and "noimageindex" plus the ad-hoc values "noai" and "noimageai".
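For readers who want to see what that check amounts to, here is a minimal sketch. It assumes the tokens arrive in `X-Robots-Tag` response headers; the function name `robots_tag_opts_out` is hypothetical and is not taken from the project's code. User-agent-scoped directives (such as `somebot: noai`) are deliberately not handled.

```python
# Tokens named in this issue: two well-known ones plus the two ad-hoc ones.
OPT_OUT_TOKENS = frozenset({"noindex", "noimageindex", "noai", "noimageai"})


def robots_tag_opts_out(header_values):
    """Return True if any X-Robots-Tag value carries an opt-out token.

    header_values: iterable of raw header strings, one per X-Robots-Tag header.
    Assumption: directives are comma-separated and case-insensitive.
    """
    for value in header_values:
        tokens = {token.strip().lower() for token in value.split(",")}
        if tokens & OPT_OUT_TOKENS:
            return True
    return False
```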
This is already a good move: in Europe, any crawler associated with TDM and AI technologies MUST support opt-out, as stipulated by the European DSM Directive. You'll get more information here about that legal requirement. Because this software gathers images available for AI training, it should not include in its dataset images whose authors have opted out.
But "noai" and "noimageai" are not well-known tokens (even if you're not alone in trying them); nothing about them is standard so far. And robots.txt is not only about HTTP headers: directives can also live in a file stored at the root of the website (and in HTML meta tags, but that is not relevant here). Therefore your move does not really help the community establish trusted relationships between AI solutions and content providers (which is a requirement if you want content providers to see AI actors as partners, not enemies).
For this reason, a W3C Community Group made up of content providers and TDM actors decided two years ago to create an open specification, and has released it under the name TDMRep (for TDM Reservation Protocol). The home page of the group is there; it has 42 participants.
For those wondering, this specification also covers AI solutions. And this group didn't use robots.txt for clear reasons.
Adding support for a new HTTP header property called "tdm-reservation", and filtering out images when its value is 1 (a number), is a no-brainer. Adding support for a JSON file named tdmrep.json, hosted in the /.well-known directory of the web server on which the image is stored, is a bit more complex but still easy in Python (it is identical to the processing of the robots.txt file); and it is mandatory, even if less performant.
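To make the two checks concrete, here is a rough Python sketch. It makes several assumptions that should be verified against the TDMRep specification: that tdmrep.json is a JSON array of rules with "location" and "tdm-reservation" keys, that location patterns follow robots.txt-like matching (`*` wildcard, trailing `$` anchor), and that the first matching rule wins. All function names are hypothetical, and fetching and caching the file per host is left out.

```python
import json
import re
from urllib.parse import urlsplit


def header_reserves_tdm(headers):
    """headers: mapping of lower-cased response header names to values."""
    return headers.get("tdm-reservation", "").strip() == "1"


def tdmrep_url_for(image_url):
    """Build the well-known tdmrep.json URL for the host serving image_url."""
    parts = urlsplit(image_url)
    return f"{parts.scheme}://{parts.netloc}/.well-known/tdmrep.json"


def _pattern_to_regex(pattern):
    # Assumption: robots.txt-like patterns, '*' matches any sequence and a
    # trailing '$' anchors the end of the path; otherwise it is a prefix match.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def file_reserves_tdm(tdmrep_text, image_url):
    """Return True if the first rule matching the image path reserves TDM rights."""
    rules = json.loads(tdmrep_text)
    path = urlsplit(image_url).path or "/"
    for rule in rules:
        if _pattern_to_regex(rule["location"]).match(path):
            return rule.get("tdm-reservation") == 1
    return False  # no matching rule: no reservation expressed in the file
```

A downloader would then skip an image when either `header_reserves_tdm(...)` or `file_reserves_tdm(...)` returns True, ideally caching the parsed tdmrep.json per host, as is commonly done for robots.txt.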