feat: Add shortcut to copy first seed + fix incorrect crawl URL labels #2803

Open · 4 commits to merge into base: main

Conversation

SuaYoo
Member

@SuaYoo SuaYoo commented Aug 13, 2025

Resolves #2751
Partially addresses #2801

Changes

  • Adds option to copy first seed URL to clipboard to workflow actions
  • Switches instances of "Crawl Start URL" to "First Page URL" when the label can apply to any workflow scope type
  • Updates user guide to clarify "Page URL" vs. "Crawl Start URL"
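
A minimal sketch of the copy action described in the first bullet, not the code in this PR. The `Seed` and `ClipboardLike` shapes and both function names are hypothetical, introduced only for illustration; the clipboard is passed in so the logic can be exercised outside a browser.

```typescript
// Hypothetical shapes for illustration; the real Browsertrix types may differ.
type Seed = { url: string };
type ClipboardLike = { writeText(text: string): Promise<void> };

// Returns the first seed URL, or undefined when the workflow has no seeds.
function firstSeedUrl(seeds: readonly Seed[]): string | undefined {
  return seeds[0]?.url;
}

// Copies the first seed URL and reports whether anything was copied.
async function copyFirstSeedUrl(
  seeds: readonly Seed[],
  clipboard: ClipboardLike,
): Promise<boolean> {
  const url = firstSeedUrl(seeds);
  if (url === undefined) return false;
  await clipboard.writeText(url);
  return true;
}
```

In a browser, `navigator.clipboard` could be passed as the `clipboard` argument, since it exposes `writeText`.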

Manual testing

  1. Log in
  2. Go to "Crawling"
  3. Select the three-dot overflow menu icon for a workflow. Verify "Copy Crawl Start URL" option is shown
  4. Choose "Copy Crawl Start URL". Verify URL is copied to clipboard
  5. Go into workflow
  6. Select "Actions" dropdown. Verify "Copy Crawl Start URL" option is shown and works as expected

Screenshots

Page | Image/video
Crawl Workflows | Screenshot 2025-08-14 at 11 51 04 AM
Crawl Workflows (workflow action menu) | Screenshot 2025-08-14 at 11 50 34 AM
Workflow Editor | Screenshot 2025-08-12 at 7 01 47 PM
User Guide / Crawl Workflow Settings | Screenshot 2025-08-12 at 6 58 00 PM; Screenshot 2025-08-12 at 7 00 55 PM

Follow-ups

This PR partially fixes an inconsistency: "Crawl Start URL" is currently used for all workflows, even though the label technically does not apply to the Single Page and List of Pages crawl scopes. I created #2801 to address any remaining work needed to update docs.

An alternative solution would be to add scopeType to the crawlconfigs list JSON response (cc @tw4l). This way, we could conditionally display "Copy Crawl Start URL" or "Copy First Page URL" depending on the scope type.
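
A rough sketch of how that conditional label might look if scopeType were available on each listed workflow. The `ListedWorkflow` shape, the function name, the `"page"` value, and the fallback choice are all assumptions made for illustration, not the current API or frontend code.

```typescript
// Hypothetical list item shape; scopeType is optional because the list
// response does not currently include it.
type ListedWorkflow = { firstSeed: string; scopeType?: string };

function copyUrlActionLabel(workflow: ListedWorkflow): string {
  // Assumption: Single Page and List of Pages workflows both report "page".
  if (workflow.scopeType === "page") return "Copy First Page URL";
  // Fallback covers other scope types and responses that omit scopeType.
  return "Copy Crawl Start URL";
}
```

Which label to use as the fallback is a product decision; the sketch picks one only to stay total over its input.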

@SuaYoo SuaYoo marked this pull request as ready for review August 14, 2025 18:42
@SuaYoo SuaYoo changed the title from "feat: Add shortcut to copy first seed + update docs" to "feat: Add shortcut to copy first seed + fix incorrect crawl URL labels" Aug 14, 2025
Member

@emma-sg emma-sg left a comment

Looks good!

@tw4l
Member

tw4l commented Aug 14, 2025

I think elsewhere in the application and our documentation where we use "Page URL", it typically refers to pages that have been crawled, rather than the Crawl Start URL(s) (i.e. seeds, though we don't use that vocabulary in the Browsertrix frontend). In other words, the results of crawling vs. the URLs used to configure crawl scope. Personally, I think that's a useful distinction worth keeping, so I'm hesitant to use Page URL in both contexts as proposed here. [edit: it was pointed out to me that I'm incorrect and we do currently use "Page URL" in the Page and List of Pages workflow config form, though I think the seed distinction is still a useful one]

For instance, even a single page scoped workflow could have a single Crawl Start URL but multiple crawled Page URLs if the box to include linked pages is checked. List of Pages is a little more ambiguous, but similarly, I suppose each could be considered a Crawl Start URL that, depending on workflow configuration options, might result in additional Page URLs being crawled for each.

It's also perhaps worth noting that it's possible via the backend API to configure multiple seeds/Crawl Start URLs that each have their own scope type and additional configuration (e.g. includes/excludes) that override the default, though we have not made it possible to configure workflows that way via the frontend. That pattern will be familiar to many experienced web archivists, as it's common practice with Browsertrix Crawler and other crawlers like Heritrix.
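
To illustrate the per-seed override idea, here is a small sketch that resolves each seed's effective scope type against the workflow default. The `SeedConfig` and `CrawlConfig` shapes and their field names are assumptions for illustration and may not match the backend models exactly.

```typescript
// Hypothetical config shapes; field names are assumptions.
type SeedConfig = {
  url: string;
  scopeType?: string; // overrides the workflow default when set
  include?: string[];
  exclude?: string[];
};
type CrawlConfig = { scopeType: string; seeds: SeedConfig[] };

// Pairs each seed URL with the scope type that would apply to it.
function effectiveScopeTypes(
  config: CrawlConfig,
): { url: string; scopeType: string }[] {
  return config.seeds.map((seed) => ({
    url: seed.url,
    scopeType: seed.scopeType ?? config.scopeType,
  }));
}
```

With mixed per-seed scopes like this, a single workflow-level label such as "First Page URL" cannot describe every seed, which is part of the ambiguity raised above.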

It seems to me that we're creating some confusion for ourselves (myself included!) by avoiding use of the word "seed" in some ways, though I understand that we made that decision to try to make the interface friendly to people who aren't already web archiving experts. Maybe the best solution would be to consistently use "Crawl Start URL" interchangeably with seed as we at least mostly currently do, and add "First" as a prefix as necessary within the UI when there are multiple. So for instance, "Copy Crawl Start URL" -> "Copy First Crawl Start URL" for workflows with list of pages. I'd be happy to make any backend changes necessary to facilitate that - it looks like the crawlconfigs/ list endpoint isn't returning the global scopeType for workflows currently because we're excluding the entire config to avoid slowing the response down, but it would be easy to add that field.
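
The "First" prefix suggestion above could be sketched as a label helper keyed on how many seeds a workflow has. The function name and the idea of passing a seed count are hypothetical; the frontend might instead branch on scope type.

```typescript
// Adds the "First" prefix only when the workflow has more than one seed.
function crawlStartUrlCopyLabel(seedCount: number): string {
  return seedCount > 1
    ? "Copy First Crawl Start URL"
    : "Copy Crawl Start URL";
}
```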

Development

Successfully merging this pull request may close these issues.

[Feature]: Add options to copy workflow settings
3 participants