@JKarlavige (Collaborator) commented Nov 26, 2025

What are you changing in this pull request and why?

⚠️ WIP

Adds a script to generate a search-index.json file. This file can be used to provide keyword search and structure for an MCP server.

Preview

https://docs-getdbt-com-git-generate-search-index-dbt-labs.vercel.app/search-index.json

vercel bot commented Nov 26, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Preview | Updated (UTC)
docs-getdbt-com | Ready | Preview | Dec 1, 2025 8:40pm

text = text.replace(/^---[\s\S]*?---\n/m, '');

// Remove HTML comments
text = text.replace(/<!--[\s\S]*?-->/g, '');

Check failure

Code scanning / CodeQL

Incomplete multi-character sanitization High

This string may still contain <!--, which may cause an HTML element injection vulnerability.

Copilot Autofix

AI about 1 month ago

The best fix is to apply the HTML comment removal regular expression repeatedly until the string stops changing. Removing one comment can join the surrounding text into a new <!-- sequence, so a single pass may leave a comment behind; looping ensures that nested or overlapping comment patterns are removed as well. Only the body of the stripMarkdown function needs to change: replace the existing single-pass comment replacement (line 77) with a do-while loop that applies the same regex until the string no longer changes.

No new imports or libraries are required for this fix: the loop uses only the built-in String.prototype.replace and does not otherwise change the function's behavior.
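
As a rough illustration of why one pass can fall short, the sketch below runs the same comment regex once and then to a fixed point on a crafted input. The helper names `stripCommentsOnce` and `stripCommentsLoop` and the sample string are hypothetical and are not part of the plugin.

```javascript
// Hypothetical helpers for illustration only; not the plugin's actual code.
const COMMENT_RE = /<!--[\s\S]*?-->/g;

// Single pass, equivalent to the original one-line replacement.
function stripCommentsOnce(text) {
  return text.replace(COMMENT_RE, '');
}

// Repeat the replacement until the string stops changing (a fixed point).
function stripCommentsLoop(text) {
  let previous;
  do {
    previous = text;
    text = text.replace(COMMENT_RE, '');
  } while (text !== previous);
  return text;
}

// Removing the inner comment joins "<!" and "-- hidden -->" into a new comment.
const input = 'a<!<!-- x -->-- hidden -->b';

console.log(stripCommentsOnce(input)); // a<!-- hidden -->b
console.log(stripCommentsLoop(input)); // ab
```

Note that the loop only removes complete comments: an opening <!-- with no closing --> (for example 'x<!-- y') is left unchanged by both helpers.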


Suggested changeset 1
website/plugins/buildSearchIndex/index.js

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/website/plugins/buildSearchIndex/index.js b/website/plugins/buildSearchIndex/index.js
--- a/website/plugins/buildSearchIndex/index.js
+++ b/website/plugins/buildSearchIndex/index.js
@@ -73,8 +73,12 @@
   // Remove frontmatter (already handled by gray-matter, but just in case)
   text = text.replace(/^---[\s\S]*?---\n/m, '');
   
-  // Remove HTML comments
-  text = text.replace(/<!--[\s\S]*?-->/g, '');
+  // Remove HTML comments (apply repeatedly in case of nested or overlapping comments)
+  let previous;
+  do {
+    previous = text;
+    text = text.replace(/<!--[\s\S]*?-->/g, '');
+  } while (text !== previous);
   
   // Remove Docusaurus admonitions but keep content
   // Pattern: :::type [optional title]\ncontent\n:::
EOF
Copilot is powered by AI and may make mistakes. Always verify output.
text = text.replace(/\[([^\]]+)\]\([^\)]+\)/g, '$1');

// Remove HTML tags
text = text.replace(/<[^>]*>/g, '');

Check failure

Code scanning / CodeQL

Incomplete multi-character sanitization High

This string may still contain <script, which may cause an HTML element injection vulnerability.

Copilot Autofix

AI about 1 month ago

To fix the issue, HTML tags must be removed completely, including tags that only take shape after an earlier replacement has run. The recommended approach is to repeat the .replace() call in a loop until the string no longer changes, so that any tag re-formed by removing an inner tag is removed on a later pass. Apply this on line 116, replacing
text = text.replace(/<[^>]*>/g, '');
with a do-while loop that removes tags until there is nothing left to remove. No new dependencies are required, and the rest of the function can remain unchanged.
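
A minimal sketch of the looped tag removal follows, along with one case worth checking by hand. The names `stripTagsLoop` and `stripTagsAndEscape` are hypothetical, and the escaping step is an assumption added for illustration rather than something the patch proposes. Whether escaping suits the search index depends on how consumers render the text, which the PR does not state.

```javascript
// Hypothetical helpers for illustration only; not the plugin's actual code.
const TAG_RE = /<[^>]*>/g;

// Repeat the tag removal until the string stops changing.
function stripTagsLoop(text) {
  let previous;
  do {
    previous = text;
    text = text.replace(TAG_RE, '');
  } while (text !== previous);
  return text;
}

// Assumption: after stripping, encode any leftover angle brackets so that
// an unterminated fragment such as "<script" cannot be read as markup.
function stripTagsAndEscape(text) {
  return stripTagsLoop(text).replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

console.log(stripTagsLoop('<p>Hello <b>world</b></p>')); // Hello world
console.log(stripTagsLoop('run <script src=x'));          // run <script src=x
console.log(stripTagsAndEscape('run <script src=x'));     // run &lt;script src=x
```

The second call shows that the loop removes only complete <...> pairs: a fragment with no closing > never matches the regex, so it passes through unchanged.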


Suggested changeset 1
website/plugins/buildSearchIndex/index.js

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/website/plugins/buildSearchIndex/index.js b/website/plugins/buildSearchIndex/index.js
--- a/website/plugins/buildSearchIndex/index.js
+++ b/website/plugins/buildSearchIndex/index.js
@@ -112,8 +112,12 @@
   // Remove links but keep text
   text = text.replace(/\[([^\]]+)\]\([^\)]+\)/g, '$1');
   
-  // Remove HTML tags
-  text = text.replace(/<[^>]*>/g, '');
+  // Remove HTML tags using iterative replacement to handle nested/multi-character cases
+  let _prevText;
+  do {
+    _prevText = text;
+    text = text.replace(/<[^>]*>/g, '');
+  } while (text !== _prevText);
   
   // Remove headers (but keep the text)
   text = text.replace(/^#{1,6}\s+/gm, '');
EOF