Why do we not use commoncrawl indices, and then possibly build upon them? #263

@sga-13

I do not know much about search engines, so I have been reading about them, and I stumbled upon Common Crawl. I know that Stract uses its own crawler, but I find the index smaller than I would like. I also searched for "commoncrawl" in the GitHub issues and found 2 issues in which people hosting Stract locally were advised to use Common Crawl's WARC files. So why does Stract not use them itself? Are they lacking something that I do not know of? Is there a restriction on using them, such as not being allowed in commercial projects? (I hope that is not the case, since their pages say multiple times that the data can be used by everyone.) Or is it purely a choice based on quality or something else? Maybe the average result quality is not that good, or the data/metadata provided does not meet Stract's expectations.
