Generate search index #8230
base: current
text = text.replace(/^---[\s\S]*?---\n/m, '');

// Remove HTML comments
text = text.replace(/<!--[\s\S]*?-->/g, '');
Check failure
Code scanning / CodeQL: Incomplete multi-character sanitization (severity: High)
Flagged sequence: <!--
Copilot Autofix (AI, about 1 month ago):
Apply the HTML comment removal regular expression repeatedly until the string stops changing. A single pass can leave a new comment sequence behind when removing one match joins the text on either side of it, so looping ensures that nested or overlapping comment patterns are removed as well. Only the body of the stripMarkdown function changes: replace the existing single replace call for comments (line 77) with a do-while loop that applies the same regex until the string no longer changes.
No new imports or libraries are required, and the loop leaves the rest of the function's behavior unchanged.
@@ -73,8 +73,12 @@
 // Remove frontmatter (already handled by gray-matter, but just in case)
 text = text.replace(/^---[\s\S]*?---\n/m, '');

-// Remove HTML comments
-text = text.replace(/<!--[\s\S]*?-->/g, '');
+// Remove HTML comments (apply repeatedly in case of nested or overlapping comments)
+let previous;
+do {
+  previous = text;
+  text = text.replace(/<!--[\s\S]*?-->/g, '');
+} while (text !== previous);

 // Remove Docusaurus admonitions but keep content
 // Pattern: :::type [optional title]\ncontent\n:::
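The do-while loop in the suggested change can be exercised in isolation to see why a single pass is flagged. The sketch below is illustrative only: `stripHtmlComments` is a hypothetical standalone wrapper (the PR applies the loop inline inside stripMarkdown), and the input string is a made-up example.

```javascript
// Hypothetical standalone wrapper around the loop from the suggested change.
// In the PR this logic lives inline in stripMarkdown.
function stripHtmlComments(text) {
  let previous;
  do {
    previous = text;
    // Same regex as the original single-pass code.
    text = text.replace(/<!--[\s\S]*?-->/g, '');
  } while (text !== previous); // stop once a pass changes nothing
  return text;
}

// Made-up input: removing the inner comment joins "<!" and "-- y -->"
// into a new, complete comment.
const tricky = 'a<!<!-- x -->-- y -->b';

const singlePass = tricky.replace(/<!--[\s\S]*?-->/g, '');
console.log(singlePass);                // a<!-- y -->b
console.log(stripHtmlComments(tricky)); // ab
```

The single pass removes only the inner comment, and the leftover characters form a new one; the loop runs a second pass that removes it.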
text = text.replace(/\[([^\]]+)\]\([^\)]+\)/g, '$1');

// Remove HTML tags
text = text.replace(/<[^>]*>/g, '');
Check failure
Code scanning / CodeQL: Incomplete multi-character sanitization (severity: High)
Flagged sequence: <script
Copilot Autofix (AI, about 1 month ago):
To fix the issue, HTML tags must be removed completely, including tags that only take shape after an earlier replacement has removed the text around them. The recommended approach is to repeat the .replace() call in a loop until the string no longer changes, so that tags nested inside other tags, or otherwise oddly nested, are all removed. Apply this on line 116, replacing
text = text.replace(/<[^>]*>/g, '');
with a do-while loop that removes tags until none remain. No new dependencies are required, and the rest of the function can remain unchanged.
@@ -112,8 +112,12 @@
 // Remove links but keep text
 text = text.replace(/\[([^\]]+)\]\([^\)]+\)/g, '$1');

-// Remove HTML tags
-text = text.replace(/<[^>]*>/g, '');
+// Remove HTML tags using iterative replacement to handle nested/multi-character cases
+let _prevText;
+do {
+  _prevText = text;
+  text = text.replace(/<[^>]*>/g, '');
+} while (text !== _prevText);

 // Remove headers (but keep the text)
 text = text.replace(/^#{1,6}\s+/gm, '');
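Both suggested changes repeat the same do-while shape, so one option is to factor it into a shared helper. The sketch below is an assumption, not part of the PR: `replaceUntilStable` and `stripHtmlTags` are hypothetical names, and the sample inputs are invented.

```javascript
// Hypothetical helper (not in the PR): apply a replacement until the text
// stops changing. Assumes the replacement never produces text that the
// pattern matches again; otherwise the loop would not terminate.
function replaceUntilStable(text, pattern, replacement) {
  let previous;
  do {
    previous = text;
    text = text.replace(pattern, replacement);
  } while (text !== previous);
  return text;
}

// Same tag regex as the code under review.
const stripHtmlTags = (text) => replaceUntilStable(text, /<[^>]*>/g, '');

console.log(stripHtmlTags('<p>Hello <b>world</b></p>')); // Hello world
console.log(stripHtmlTags('<script'));                   // <script
```

The second call shows a limit of this pattern: an unterminated `<script` contains no `>`, so the regex never matches it, with or without the loop. If the generated index text is ever rendered as HTML, escaping at output time may still be worth considering.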
What are you changing in this pull request and why?
Adds a script to generate a search-index.json file. This file can be used to provide keyword search and structure for an MCP server.
Preview
https://docs-getdbt-com-git-generate-search-index-dbt-labs.vercel.app/search-index.json