Generate search index #8230

JKarlavige · 2025-11-26T15:13:38Z

What are you changing in this pull request and why?

⚠️ WIP

Adds a script that generates a search-index.json file. This file can be used to provide keyword search and structure for an MCP server.

Preview

https://docs-getdbt-com-git-generate-search-index-dbt-labs.vercel.app/search-index.json

vercel · 2025-11-26T15:13:45Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Preview	Updated (UTC)
docs-getdbt-com	Ready	Preview	Dec 1, 2025 8:40pm

website/plugins/buildSearchIndex/index.js

+  text = text.replace(/^---[\s\S]*?---\n/m, '');
+
+  // Remove HTML comments
+  text = text.replace(/<!--[\s\S]*?-->/g, '');


The best fix is to apply the HTML comment removal regular expression repeatedly until the string stops changing. A single pass can leave behind a comment that only forms once an inner match has been removed, so repeating the replacement ensures that nested or overlapping comment patterns are removed entirely. Only the body of the stripMarkdown function needs to change: after text is defined, replace the existing single replace call for comments (line 77) with a do-while loop that applies the same regex until the string no longer changes.

No new imports or libraries are required for this fix. The loop reuses the existing regex and does not substantially change the code's functionality.
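
As a rough illustration of why one pass may not be enough, here is a minimal sketch of the loop in isolation. The helper name `replaceUntilStable` and the sample string are made up for this example and are not part of the PR; only the comment regex is taken from the diff.

```javascript
// Hypothetical helper (not in the PR): apply a replacement until the text stops changing.
function replaceUntilStable(text, pattern, replacement) {
  let previous;
  do {
    previous = text;
    text = text.replace(pattern, replacement);
  } while (text !== previous);
  return text;
}

// Same comment regex as in the diff.
const COMMENT = /<!--[\s\S]*?-->/g;

// Made-up input: removing the inner comment splices a new comment together.
const sample = 'a<!<!-- x -->-- hidden -->b';

const singlePass = sample.replace(COMMENT, '');
const stable = replaceUntilStable(sample, COMMENT, '');

console.log(singlePass); // a<!-- hidden -->b
console.log(stable);     // ab
```

With this input, the first pass removes only the inner comment, and the leftover pieces join into a new comment that a second pass then removes.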

website/plugins/buildSearchIndex/index.js

+  text = text.replace(/\[([^\]]+)\]\([^\)]+\)/g, '$1');
+
+  // Remove HTML tags
+  text = text.replace(/<[^>]*>/g, '');


To fix the issue, we need to ensure that HTML tags are removed completely, including tags formed by multi-character sequences that are only revealed after an earlier replacement. The recommended method is to repeat the .replace() operation in a loop until the string no longer changes, so that oddly nested tags are all removed. Apply this on line 116, replacing
text = text.replace(/<[^>]*>/g, '');
with a do-while loop that removes tags until there is nothing left to remove. No new dependencies are required for this fix, and the rest of the function can remain unchanged.
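
A small sketch for checking the tag loop against a single pass. The `stripTags` function and the sample inputs are hypothetical and only mirror the suggested change; the regex is the one from the diff.

```javascript
// Same tag regex as in the diff.
const TAG = /<[^>]*>/g;

// Hypothetical standalone version of the tag-stripping step.
function stripTags(text) {
  let previous;
  do {
    previous = text;
    text = text.replace(TAG, '');
  } while (text !== previous);
  return text;
}

// Made-up inputs, including oddly nested ones.
const samples = [
  '<p>Hello <b>world</b></p>',
  '<scr<script>ipt>alert(1)</scr</script>ipt>',
  '<<b>i>text',
];

for (const sample of samples) {
  const singlePass = sample.replace(TAG, '');
  const looped = stripTags(sample);
  // Print the looped result and whether one pass already matched it.
  console.log(JSON.stringify(looped), singlePass === looped);
}
```

For these inputs the single pass and the loop give the same result, most likely because `[^>]*` can itself span a `<`, so a match already swallows any nested opening bracket. If that holds in general, the loop here acts as a harmless safeguard that satisfies the code scanning check rather than changing the output.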

JKarlavige added 7 commits November 5, 2025 11:29

add initial search generator script

7cdf44a

adjust mdx and version handling in generator script

a871de0

adjust version handling in search generator

98eb1c4

add plugin to config and test script

a0e481f

add readme and reference docs

28ff56d

don't copy index to static folder

722788c

Merge branch 'current' into generate-search-index

28c69ac

github-advanced-security bot found potential problems Nov 26, 2025

vercel bot deployed to Preview November 26, 2025 15:15 View deployment

JKarlavige added 3 commits November 26, 2025 10:53

gitignore claude files

3be2d9b

include breadcrumbs and sidebar section in search-index

54f5fc9

generate index in static dir for local testing

3f5ea12

vercel bot deployed to Preview December 1, 2025 20:40 View deployment

@@ -73,8 +73,12 @@
               // Remove frontmatter (already handled by gray-matter, but just in case)
               text = text.replace(/^---[\s\S]*?---\n/m, '');
-              // Remove HTML comments
-              text = text.replace(/<!--[\s\S]*?-->/g, '');
+              // Remove HTML comments (apply repeatedly in case of nested or overlapping comments)
+              let previous;
+              do {
+                previous = text;
+                text = text.replace(/<!--[\s\S]*?-->/g, '');
+              } while (text !== previous);
               // Remove Docusaurus admonitions but keep content
               // Pattern: :::type [optional title]\ncontent\n:::

@@ -112,8 +112,12 @@
               // Remove links but keep text
               text = text.replace(/\[([^\]]+)\]\([^\)]+\)/g, '$1');
-              // Remove HTML tags
-              text = text.replace(/<[^>]*>/g, '');
+              // Remove HTML tags using iterative replacement to handle nested/multi-character cases
+              let _prevText;
+              do {
+                _prevText = text;
+                text = text.replace(/<[^>]*>/g, '');
+              } while (text !== _prevText);
               // Remove headers (but keep the text)
               text = text.replace(/^#{1,6}\s+/gm, '');
