
Page cannot be crawled due to robots.txt #4

@pvgenuchten

Description


We have set up Google crawling on demo.pygeoapi.io to research crawler behaviour on pygeoapi. First results are available, but they puzzle me a bit.

Google generally crawls pygeoapi pages correctly. One can indeed find demo pygeoapi results at, for example, https://www.google.com/search?q=site%3Ademo.pygeoapi.io. However, there are no results yet at https://toolbox.google.com/datasetsearch/search?query=site%3Ademo.pygeoapi.io

A weird thing is that when doing a 'live test' (a feature of Google Search Console) on the URL https://demo.pygeoapi.io/master/collections/lakes I get this error: "url not available to google, blocked by robots.txt"


However, https://demo.pygeoapi.io/master/collections/lakes?f=html runs fine in the 'live test'. This makes me wonder: does the 'live test' crawler send the proper Accept header?
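
For what it's worth, the content negotiation on this endpoint can be checked by hand. A minimal sketch using Python `requests` (the URL is the one from this issue; the Accept values are just examples, not what Google actually sends):

```python
import requests

URL = "https://demo.pygeoapi.io/master/collections/lakes"

# Request the same resource with two different Accept headers and compare
# the status code and Content-Type the server negotiates for each.
for accept in ("*/*", "text/html"):
    resp = requests.get(URL, headers={"Accept": accept}, timeout=10)
    print(accept, resp.status_code, resp.headers.get("Content-Type"))
```

If both requests succeed with the expected representations, the blocking is more likely on the robots.txt side than in pygeoapi's content negotiation.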

Another thing to improve: https://demo.pygeoapi.io/robots.txt does not return a proper robots.txt file, but instead a custom file-not-found page (with HTTP status 200!).
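
As a quick illustration (not a fix), the current behaviour can be verified with a couple of lines; the commented lines sketch the kind of plain-text robots.txt one would expect instead, with the actual rules of course up to the maintainers:

```python
import requests

resp = requests.get("https://demo.pygeoapi.io/robots.txt", timeout=10)
# Reportedly this returns the custom not-found page with status 200
# instead of a plain-text robots.txt.
print(resp.status_code, resp.headers.get("Content-Type"))
print(resp.text[:200])

# A minimal robots.txt that explicitly allows crawling would look like:
#
#   User-agent: *
#   Allow: /
```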

Let me know if you have any ideas.
