feat: Add shortcut to copy first seed + fix incorrect crawl URL labels #2803
Looks good!
I think elsewhere in the application and our documentation, where we use "Page URL" it typically refers to pages that have been crawled rather than the Crawl Start URL(s) (i.e. seeds, though we don't use that vocabulary in the Browsertrix frontend). In other words, the results of crawling vs. the URLs used to configure crawl scope. Personally, I think that's a useful distinction worth keeping, so I'm hesitant to use Page URL in both contexts as proposed here. [edit: it was pointed out to me that I'm incorrect and we do currently use "Page URL" in the Page and List of Pages workflow config form, though I think the seed distinction is still a useful one]

For instance, even a single-page-scoped workflow could have a single Crawl Start URL but multiple crawled Page URLs if the box to include linked pages is checked. List of Pages is a little more ambiguous, but similarly, I suppose each URL could be considered a Crawl Start URL that, depending on workflow configuration options, might result in additional Page URLs being crawled for each.

Also perhaps worth noting that it's possible via the backend API to configure multiple seeds/Crawl Start URLs that each have their own scope types and additional configuration (e.g. includes/excludes) that override the default, though we have not made it possible to configure workflows that way via the frontend. That pattern will be familiar to many experienced web archivists, as it's common practice with Browsertrix Crawler and other crawlers like Heritrix.

It seems to me that we're creating some confusion for ourselves (myself included!) by avoiding use of the word "seed" in some ways, though I understand that we made that decision to try to make the interface friendly to people who aren't already web archiving experts. Maybe the best solution would be to consistently use "Crawl Start URL" interchangeably with seed, as we at least mostly do currently, and add "First" as a prefix as necessary within the UI when there are multiple. So for instance, "Copy Crawl Start URL" -> "Copy First Crawl Start URL" for workflows with List of Pages.

I'd be happy to make any backend changes necessary to facilitate that - it looks like the …
Resolves #2751
Partially addresses #2801
Changes
Manual testing
Screenshots
Follow-ups
This PR partially fixes an inconsistency where "Crawl Start URL" is used for all workflows, even though this label technically does not apply to Single Page and List of Pages crawl scopes. I created #2801 to address any remaining work needed to update docs.

An alternative solution would be to add scopeType to the crawlconfigs list JSON response (cc @tw4l). This way, we could conditionally display "Copy Crawl Start URL" or "Copy First Page URL" depending on the scope type.
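
For illustration, here is a rough sketch of what that conditional label could look like on the frontend. It is not code from this PR: the `WorkflowListItem` shape, the `copySeedLabel` helper, and the assumption that Single Page and List of Pages workflows both report a scopeType of "page" are all hypothetical, and the list response does not currently include scopeType at all.

```typescript
// Hypothetical shape of one item in the crawlconfigs list response,
// assuming `scopeType` were added to it as suggested above.
interface WorkflowListItem {
  scopeType?: string;
}

// Assumption: page-based workflows (Single Page, List of Pages) report "page".
const PAGE_SCOPE_TYPES = new Set<string>(["page"]);

// Hypothetical helper that picks the copy-action label for a workflow.
// Falls back to the existing label when scopeType is missing from the response.
function copySeedLabel(workflow: WorkflowListItem): string {
  if (workflow.scopeType !== undefined && PAGE_SCOPE_TYPES.has(workflow.scopeType)) {
    return "Copy First Page URL";
  }
  return "Copy Crawl Start URL";
}
```

Falling back to "Copy Crawl Start URL" when scopeType is absent would keep the current behavior unchanged until the backend change lands.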