Skip to content

feat(website): add feature to search multiple string fields#5881

Open
chaoran-chen wants to merge 3 commits intomainfrom
multifieldsearch
Open

feat(website): add feature to search multiple string fields#5881
chaoran-chen wants to merge 3 commits intomainfrom
multifieldsearch

Conversation

@chaoran-chen
Copy link
Copy Markdown
Member

@chaoran-chen chaoran-chen commented Jan 26, 2026

resolves nothing

This PR adds a feature that allows the configuration of search fields that search multiple fields by constructing a LAPIS advanced query. In the preview, there are now two new search fields

  • identifier: searches accessionVersion, submissionId, insdcAccessionFull, bioprojectAccession, gcaAccession, biosampleAccession, insdcRawReadsAccession
  • contributor: searches authors, authorAffiliations, sequencedByOrganization, sequencedByContactName, submitter, groupName

This can be used to resolve [gap to prevent automated linking] pathoplexus/pathoplexus#766.

Screenshot

image

PR Checklist

  • [ ] All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)
    • Filter WNV (here) (also verified that refreshing the page correctly fills the form), download metadata, copy download download link, linkout to Nextclade

🚀 Preview: Add preview label to enable

@chaoran-chen chaoran-chen self-assigned this Jan 26, 2026
@chaoran-chen chaoran-chen added the preview Triggers a deployment to argocd label Jan 26, 2026
@claude claude bot added website Tasks related to the web application deployment Code changes targetting the deployment infrastructure labels Jan 26, 2026
@claude
Copy link
Copy Markdown
Contributor

claude bot commented Jan 26, 2026

This PR may be related to #1985 (Search: standardised approach for multi-entry search)

@theosanderson
Copy link
Copy Markdown
Member

Very cool - but I think this actually doesn't relate to #1985 unless I misunderstand: the idea there is about the type of search where you can enter multiple accessions and you're not trying to find all of them in a single sequence entry, but each one in a different sequence entry

@chaoran-chen
Copy link
Copy Markdown
Member Author

aah, right, I misread, thanks @theosanderson!

@chaoran-chen chaoran-chen force-pushed the multifieldsearch branch 2 times, most recently from 46cd32e to 8fed763 Compare January 26, 2026 21:28
@chaoran-chen chaoran-chen marked this pull request as ready for review January 26, 2026 21:52
@claude
Copy link
Copy Markdown
Contributor

claude bot commented Jan 26, 2026

Claude finished @chaoran-chen's task in 3m 9s —— View job


Code Review Complete ✓

I've reviewed this PR and found it to be a well-implemented feature with comprehensive testing. The implementation is clean and follows existing patterns in the codebase. Below is my detailed assessment:


Summary

This PR adds multi-field search functionality that allows users to search across multiple metadata fields using a single input. Two search fields are configured:

  • identifier: searches 7 accession-related fields
  • contributor: searches 6 contributor-related fields

The implementation constructs LAPIS advanced queries using regex patterns to search across multiple fields with OR logic.


Strengths

Good test coverage: Comprehensive Playwright tests covering search, URL params, filter removal, and downloads
Schema validation: Proper Zod schema definitions and JSON schema validation
Follows existing patterns: Integrates well with existing search infrastructure
URL state management: Correctly syncs with URL query parameters
User experience: Active filters display properly, fields can be removed via UI
Accessibility: Uses DisabledUntilHydrated to prevent Playwright race conditions


Issues Found

I've left inline comments on specific areas that need attention:

Security (1 issue):

  • Potential regex injection risk in field name handling (SequenceFilters.tsx:131)

Code Quality (3 issues):

  • Code duplication in search.ts query extraction logic
  • Advanced query construction could be extracted to a separate method
  • Performance optimization opportunity for Set creation

Testing (3 issues):

  • Hard-coded timeouts in Playwright tests (multiple locations)
  • Fragile CSV assertion that could break with format changes

Documentation (2 issues):

  • Missing documentation on regex escaping logic
  • Configuration could benefit from usage examples

Accessibility (1 issue):

  • Missing aria-describedby for multi-field search context

Recommendations

Must address before merge:

  1. Review the regex injection security concern and add validation for field names
  2. Replace waitForTimeout() calls with deterministic waits in tests

Should address:
3. Extract advanced query construction to a separate method for testability
4. Add documentation to makeCaseInsensitiveLiteralSubstringRegex explaining security measures
5. Make CSV download assertion more robust

Nice to have:
6. Add aria-describedby to help screen reader users understand which fields are searched
7. Add configuration comments explaining the multi-field search feature
8. Extract duplicate query extraction logic in search.ts


Testing Notes

The PR includes thorough integration tests that verify:

  • Search functionality with both identifier and contributor fields
  • URL parameter synchronization
  • Filter removal via UI
  • Metadata download with filtered results

Manual testing was performed on the preview environment for WNV searches.


@theosanderson
Copy link
Copy Markdown
Member

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: affc28b6cb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +129 to +131
if (value && typeof value === 'string' && value.trim()) {
const regex = makeCaseInsensitiveLiteralSubstringRegex(value.trim());
const fieldQueries = mfs.fields.map((f) => `${f}.regex='${regex}'`);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Escape quotes in advancedQuery regex literals

The new multi-field search builds advancedQuery by interpolating the user’s raw input into a single-quoted string (${f}.regex='${regex}'). makeCaseInsensitiveLiteralSubstringRegex escapes regex metacharacters but does not escape single quotes, so inputs containing apostrophes (e.g., “O'Connor”, “King’s College”) will terminate the string literal and produce a malformed advancedQuery, causing the search to fail or be parsed incorrectly. Please escape ' (or switch to a safer encoding/quoting strategy) before embedding user input in the advanced query string.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this sounds like it could be important - using O'Connor in the field does indeed crash the page!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or well not crash but we get the red error bar and lapis returns

"error": {
        "type": "about:blank",
        "title": "Bad Request",
        "status": 400,
        "detail": "Failed to parse advanced query (line 1:282): mismatched input 'C' expecting {')', '|', '&', AND, OR}."
    },

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's indeed a very good point, I'm not sure whether we have a way to escape in LAPIS, I'll check.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: LAPIS 0.8 has been released which supports escaping. Let's first update to 0.7 in #6146, then update to 0.8 and then come back to this PR.

Copy link
Copy Markdown
Contributor

@anna-parker anna-parker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tested this out by pasting an accession into the identifier search field for CCHf and I think it is sadly broken for multi segmented organisms! The page freezes:
image
image

and then I get an axios error

{"message":"Request failed with status code 400","name":"AxiosError","stack":"AxiosError: Request failed with status code 400\n at It (https://multifieldsearch.loculus.org/_astro/createAuthorizationHeader.CbxMgGzV.js:3:1073)\n at XMLHttpRequest.w (https://multifieldsearch.loculus.org/_astro/createAuthorizationHeader.CbxMgGzV.js:3:5793)\n at _.request (https://multifieldsearch.loculus.org/_astro/createAuthorizationHeader.CbxMgGzV.js:5:2079)\n at async Gn.request (https://multifieldsearch.loculus.org/_astro/createAuthorizationHeader.CbxMgGzV.js:11:4348)","config":{"transitional":{"silentJSONParsing":true,"forcedJSONParsing":true,"clarifyTimeoutError":false},"adapter":["xhr","http","fetch"],"transformRequest":[null],"transformResponse":[null],"timeout":0,"xsrfCookieName":"XSRF-TOKEN","xsrfHeaderName":"X-XSRF-TOKEN","maxContentLength":-1,"maxBodyLength":-1,"env":{},"headers":{"Accept":"application/json, text/plain, */*","Content-Type":"application/json"},"baseURL":"https://lapis-multifieldsearch.loculus.org/cchf","method":"post","url":"/sample/details","data":"{\"versionStatus\":\"LATEST_VERSION\",\"isRevocation\":\"false\",\"nucleotideMutations\":[],\"aminoAcidMutations\":[],\"nucleotideInsertions\":[],\"aminoAcidInsertions\":[],\"advancedQuery\":\"(accessionVersion.regex='(?i)PV920624' or submissionId.regex='(?i)PV920624' or insdcAccessionFull.regex='(?i)PV920624' or bioprojectAccession.regex='(?i)PV920624' or gcaAccession.regex='(?i)PV920624' or biosampleAccession.regex='(?i)PV920624' or insdcRawReadsAccession.regex='(?i)PV920624')\",\"fields\":[\"authorAffiliations\",\"authors\",\"geoLocAdmin1\",\"geoLocCountry\",\"hostNameScientific\",\"length_L\",\"length_M\",\"length_S\",\"lineage\",\"ncbiReleaseDate\",\"sampleCollectionDate\",\"accessionVersion\"],\"limit\":100,\"offset\":0,\"orderBy\":[{\"field\":\"sampleCollectionDate\",\"type\":\"descending\"}]}","params":{},"allowAbsoluteUrls":true},"code":"ERR_BAD_REQUEST","status":400}

Update: sorry when I first copied the error I somehow copied the whole page with all the sequence metadata

@chaoran-chen
Copy link
Copy Markdown
Member Author

Thanks for identifying the bug, @anna-parker! The perSegment=true case wasn't properly handled, this has now been addressed in 472d72a and I also added a test in 0e78b38.

@anna-parker anna-parker dismissed their stale review February 4, 2026 07:46

I am dismissing my blocker :-)

defaultOrder: descending
multiFieldSearches:
- name: identifier
displayName: Identifier
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe Sample Identifier would be clearer?

Copy link
Copy Markdown
Contributor

@anna-parker anna-parker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is looking really great but it would be good to fix the ' escape issue before merging. I also think it would be nice to add a little info button next to the field (like for mutations) and explain which fields are included in this search

@theosanderson theosanderson removed the preview Triggers a deployment to argocd label Feb 12, 2026
@chaoran-chen chaoran-chen added the preview Triggers a deployment to argocd label Mar 15, 2026
const value = this.fieldValues[mfs.name];
if (value && typeof value === 'string' && value.trim()) {
const regex = makeCaseInsensitiveLiteralSubstringRegex(value.trim());
const escapedRegex = regex.replace(/'/g, "\\'");

Check failure

Code scanning / CodeQL

Incomplete string escaping or encoding High

This does not escape backslash characters in the input.

Copilot Autofix

AI 14 days ago

In general, the problem should be fixed by ensuring that when we escape characters for embedding in a quoted context, we escape both the quote character and any backslashes, so that we do not end up with sequences where an attacker-controlled backslash changes the meaning of the following character. Instead of only replacing ' with \', we should first escape \ as \\, then escape ' as \', or use a small helper that does this in one step with a regular expression that covers both characters.

In this specific file, the only problematic code is on line 154:

const escapedRegex = regex.replace(/'/g, "\\'");

We should change it so that both backslashes and single quotes are escaped. To avoid partial escaping (e.g. double-escaping already escaped quotes), we can consistently transform the raw regex string in one pass. A simple, clear solution that does not change existing behavior except to handle backslashes correctly is:

const escapedRegex = regex.replace(/['\\]/g, (c) => `\\${c}`);

This replaces each occurrence of ' or \ with a backslash-prefixed version ('\', \\\). The rest of the surrounding logic (forming fieldQueries and advancedQueryParts) stays the same. No new imports or external dependencies are required; this uses built-in JavaScript string/regex features only. The change is localized to the line where escapedRegex is computed in website/src/components/SearchPage/DownloadDialog/SequenceFilters.tsx.

Suggested changeset 1
website/src/components/SearchPage/DownloadDialog/SequenceFilters.tsx

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/website/src/components/SearchPage/DownloadDialog/SequenceFilters.tsx b/website/src/components/SearchPage/DownloadDialog/SequenceFilters.tsx
--- a/website/src/components/SearchPage/DownloadDialog/SequenceFilters.tsx
+++ b/website/src/components/SearchPage/DownloadDialog/SequenceFilters.tsx
@@ -151,7 +151,7 @@
             const value = this.fieldValues[mfs.name];
             if (value && typeof value === 'string' && value.trim()) {
                 const regex = makeCaseInsensitiveLiteralSubstringRegex(value.trim());
-                const escapedRegex = regex.replace(/'/g, "\\'");
+                const escapedRegex = regex.replace(/['\\]/g, (c) => `\\${c}`);
                 const fieldQueries = mfs.fields.map((f) => `${f}.regex='${escapedRegex}'`);
                 advancedQueryParts.push(`(${fieldQueries.join(' or ')})`);
             }
EOF
@@ -151,7 +151,7 @@
const value = this.fieldValues[mfs.name];
if (value && typeof value === 'string' && value.trim()) {
const regex = makeCaseInsensitiveLiteralSubstringRegex(value.trim());
const escapedRegex = regex.replace(/'/g, "\\'");
const escapedRegex = regex.replace(/['\\]/g, (c) => `\\${c}`);
const fieldQueries = mfs.fields.map((f) => `${f}.regex='${escapedRegex}'`);
advancedQueryParts.push(`(${fieldQueries.join(' or ')})`);
}
Copilot is powered by AI and may make mistakes. Always verify output.
@anna-parker anna-parker removed the preview Triggers a deployment to argocd label Mar 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

deployment Code changes targetting the deployment infrastructure website Tasks related to the web application

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants