Conversation
Add LLMS.txt support to help Large Language Models understand the SIWA documentation structure and content:

- Add static llms.txt file in public folder with overview and links
- Add dynamic llms-full.txt route handler with comprehensive content
- Update robots.txt to reference llms.txt

The llms.txt file provides a quick overview with links to all documentation pages, while llms-full.txt contains the complete documentation content including all API references and code examples.

Co-authored-by: greg <greg@gnazar.io>
Replace static llms.txt with auto-generated content that reads from the actual documentation MDX files. This ensures the LLMS.txt files stay in sync with documentation changes automatically.

Changes:
- Add lib/llms.ts utility to parse MDX files and extract content
- Create dynamic route handler for /llms.txt with overview and links
- Update /llms-full.txt route to use auto-generation
- Remove static public/llms.txt file

The utility extracts frontmatter, titles, descriptions, and content from all MDX documentation files, organizes them by section, and generates both concise (llms.txt) and comprehensive (llms-full.txt) output formats.

Co-authored-by: greg <greg@gnazar.io>
Pull request overview
This PR adds LLMS.txt support to enable better consumption of documentation by Large Language Models. It provides both a concise overview file (/llms.txt) with links and a comprehensive file (/llms-full.txt) with full content, following the LLMS.txt specification.
Changes:
- Created a new library (`apps/docs/lib/llms.ts`) to parse MDX documentation files and generate LLMS.txt formatted content
- Added two route handlers (`/llms.txt` and `/llms-full.txt`) to serve the generated content
- Updated `robots.txt` to reference the LLMS.txt file as per specification
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| apps/docs/lib/llms.ts | Core library implementing MDX parsing, content extraction, and LLMS.txt generation |
| apps/docs/app/llms.txt/route.ts | Route handler serving the concise LLMS.txt overview with links |
| apps/docs/app/llms-full.txt/route.ts | Route handler serving comprehensive documentation content |
| apps/docs/public/robots.txt | Updated to reference LLMS.txt per specification |
apps/docs/lib/llms.ts
Outdated
```ts
const description = extractDescription(body);

// Determine section from path
const pathParts = basePath.split("/");
```
When basePath is an empty string (for the root page.mdx), calling split('/') will return an array with one empty string element [''], not an empty array. This means pathParts[0] will be '' instead of undefined, causing the section classification logic to incorrectly assign 'Overview' to the root page. While this happens to work correctly due to the default 'Overview' assignment, this logic is fragile and could fail if the classification logic changes.
Suggested change:

```diff
- const pathParts = basePath.split("/");
+ const pathParts = basePath ? basePath.split("/") : [];
```
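The JavaScript behavior behind this comment is easy to confirm in isolation. The sketch below is standalone (not the PR's code; `splitBasePath` is an illustrative helper mirroring the suggested guard):

```typescript
// "".split("/") yields [""] — a one-element array containing the empty
// string, not an empty array. That is why pathParts[0] becomes "" for the
// root page instead of undefined.
const emptySplit: string[] = "".split("/");

// Helper mirroring the suggested fix: treat an empty basePath as "no parts".
function splitBasePath(basePath: string): string[] {
  return basePath ? basePath.split("/") : [];
}
```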
apps/docs/lib/llms.ts
Outdated
```ts
const frontmatterMatch = content.match(/^---\n([\s\S]*?)\n---/);

if (frontmatterMatch) {
  const frontmatterStr = frontmatterMatch[1];
  const frontmatter: Record<string, string> = {};

  for (const line of frontmatterStr.split("\n")) {
```
The frontmatter regex pattern requires exactly `\n` line endings, which will fail to match files with Windows-style `\r\n` line endings. This could cause frontmatter parsing to fail on files created or edited on Windows systems. Consider using `/^---\r?\n([\s\S]*?)\r?\n---/` to handle both Unix and Windows line endings.
Suggested change:

```diff
- const frontmatterMatch = content.match(/^---\n([\s\S]*?)\n---/);
+ const frontmatterMatch = content.match(/^---\r?\n([\s\S]*?)\r?\n---/);
  if (frontmatterMatch) {
    const frontmatterStr = frontmatterMatch[1];
    const frontmatter: Record<string, string> = {};
-   for (const line of frontmatterStr.split("\n")) {
+   for (const line of frontmatterStr.split(/\r?\n/)) {
```
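The CRLF-tolerant pattern can be verified on its own. A self-contained sketch (the sample documents and helper name are illustrative, not from the PR):

```typescript
// Frontmatter delimiter regex tolerant of both Unix (\n) and Windows (\r\n)
// line endings, as suggested in the review comment.
const FRONTMATTER_RE = /^---\r?\n([\s\S]*?)\r?\n---/;

// Returns the raw frontmatter body, or null when no frontmatter is present.
function extractFrontmatterBlock(content: string): string | null {
  const match = content.match(FRONTMATTER_RE);
  return match ? match[1] : null;
}

const unixDoc = "---\ntitle: Intro\n---\nBody";
const windowsDoc = "---\r\ntitle: Intro\r\n---\r\nBody";
```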
apps/docs/lib/llms.ts
Outdated
```ts
.replace(/<(Steps|TSDoc|Callout)[^>]*>[\s\S]*?<\/\1>/g, "")
.replace(/<(Steps|TSDoc|Callout)[^>]*\/>/g, "")
```
The hardcoded list of JSX components to remove (Steps, TSDoc, Callout) creates a maintenance burden. If new components are added to the documentation, they will need to be manually added here. Consider extracting this list to a constant or configuration array for easier maintenance.
apps/docs/public/robots.txt
Outdated
```
# LLMS.txt - Documentation for Large Language Models
# https://llmstxt.org/
Sitemap: https://siwa.aptos.dev/llms.txt
```
According to the robots.txt specification, the Sitemap directive is intended for XML sitemap files, not for LLMS.txt files. The LLMS.txt specification suggests adding a comment or reference to llms.txt in robots.txt, but not using the Sitemap directive. Consider changing this to a comment like # LLMS.txt available at: https://siwa.aptos.dev/llms.txt.
Suggested change:

```diff
- Sitemap: https://siwa.aptos.dev/llms.txt
+ # LLMS.txt available at: https://siwa.aptos.dev/llms.txt
```
apps/docs/lib/llms.ts
Outdated
```ts
for (const line of frontmatterStr.split("\n")) {
  const [key, ...valueParts] = line.split(":");
  if (key && valueParts.length > 0) {
    frontmatter[key.trim()] = valueParts
      .join(":")
      .trim()
      .replace(/^["']|["']$/g, "");
  }
}
```
The frontmatter parser will fail on multi-line YAML values or values containing colons. For example, a title like `title: "Introduction: Getting Started"` will work, but more complex YAML structures (arrays, multi-line strings, nested objects) will not be parsed correctly. Consider using a proper YAML parser library like js-yaml or gray-matter for robust frontmatter parsing.
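To see why the reviewer recommends a real YAML parser, here is a minimal, self-contained sketch of the naive line parser from the quoted hunk (`naiveParseLine` is an illustrative name; the actual PR code differs in shape). It survives colons inside values only because the parts are re-joined, and it cannot represent arrays, nested maps, or multi-line strings at all:

```typescript
// Naive frontmatter line parser: split on ":", re-join the remainder.
// Works for simple "key: value" lines, including values containing colons,
// but has no notion of YAML structure beyond a single line.
function naiveParseLine(line: string): [string, string] | null {
  const [key, ...valueParts] = line.split(":");
  if (!key || valueParts.length === 0) return null;
  const value = valueParts.join(":").trim().replace(/^["']|["']$/g, "");
  return [key.trim(), value];
}
```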
- Use gray-matter for robust YAML frontmatter parsing instead of custom regex (handles complex YAML, multi-line values, etc.)
- Fix basePath split logic to correctly handle empty string case by using conditional: `basePath ? basePath.split("/") : []`
- Extract JSX components list to a constant for easier maintenance
- Fix robots.txt to use comment instead of Sitemap directive (Sitemap is for XML sitemaps per robots.txt specification)
- Add Windows line ending support in description extraction

The LLMS.txt content is automatically generated at runtime from the MDX documentation files - any changes to docs are automatically reflected without manual updates.
Co-authored-by: greg <greg@gnazar.io>
Pull request overview
Copilot reviewed 5 out of 6 changed files in this pull request and generated 9 comments.
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
```ts
const content = generateLlmsTxt(BASE_URL);

return new NextResponse(content, {
  headers: {
    "Content-Type": "text/plain; charset=utf-8",
    "Cache-Control": "public, max-age=3600, s-maxage=3600",
  },
});
```
The GET handler doesn't include error handling for the case where generateLlmsTxt might fail (e.g., if the docs directory doesn't exist or there are file system errors). While readMdxFiles returns an empty array if the directory doesn't exist, file system errors during reading could still cause the handler to fail ungracefully.
Consider wrapping the content generation in a try-catch block and returning an appropriate error response if generation fails.
Suggested change:

```diff
- const content = generateLlmsTxt(BASE_URL);
- return new NextResponse(content, {
-   headers: {
-     "Content-Type": "text/plain; charset=utf-8",
-     "Cache-Control": "public, max-age=3600, s-maxage=3600",
-   },
- });
+ try {
+   const content = generateLlmsTxt(BASE_URL);
+   return new NextResponse(content, {
+     headers: {
+       "Content-Type": "text/plain; charset=utf-8",
+       "Cache-Control": "public, max-age=3600, s-maxage=3600",
+     },
+   });
+ } catch (error) {
+   console.error("Failed to generate llms.txt content:", error);
+   return new NextResponse("Internal Server Error", {
+     status: 500,
+     headers: {
+       "Content-Type": "text/plain; charset=utf-8",
+       "Cache-Control": "no-store",
+     },
+   });
+ }
```
```ts
const content = generateLlmsFullTxt(BASE_URL);

return new NextResponse(content, {
  headers: {
    "Content-Type": "text/plain; charset=utf-8",
    "Cache-Control": "public, max-age=3600, s-maxage=3600",
  },
});
```
The GET handler doesn't include error handling for the case where generateLlmsFullTxt might fail (e.g., if the docs directory doesn't exist or there are file system errors). While readMdxFiles returns an empty array if the directory doesn't exist, file system errors during reading could still cause the handler to fail ungracefully.
Consider wrapping the content generation in a try-catch block and returning an appropriate error response if generation fails.
Suggested change:

```diff
- const content = generateLlmsFullTxt(BASE_URL);
- return new NextResponse(content, {
-   headers: {
-     "Content-Type": "text/plain; charset=utf-8",
-     "Cache-Control": "public, max-age=3600, s-maxage=3600",
-   },
- });
+ try {
+   const content = generateLlmsFullTxt(BASE_URL);
+   return new NextResponse(content, {
+     headers: {
+       "Content-Type": "text/plain; charset=utf-8",
+       "Cache-Control": "public, max-age=3600, s-maxage=3600",
+     },
+   });
+ } catch (error) {
+   // Optionally log the error for debugging/monitoring purposes
+   console.error("Failed to generate LLMS full text content:", error);
+   return NextResponse.json(
+     { error: "Failed to generate content" },
+     { status: 500 }
+   );
+ }
```
```ts
export async function GET() {
  const content = generateLlmsFullTxt(BASE_URL);

  return new NextResponse(content, {
    headers: {
      "Content-Type": "text/plain; charset=utf-8",
      "Cache-Control": "public, max-age=3600, s-maxage=3600",
    },
  });
```
The route handlers read and process all documentation files on every request. While caching headers are set (1 hour), the server-side processing happens on each request to the route, which could be inefficient for large documentation sites.
Consider implementing one of the following optimizations:
- Use Next.js static generation to pre-generate these files at build time
- Implement server-side caching to avoid re-reading files on every request
- Use Next.js revalidation mechanisms for better performance
This is especially important if the documentation grows significantly.
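The second option (server-side caching) can be sketched as a framework-agnostic TTL memoizer; the helper name, generator, and TTL below are illustrative, not part of the PR:

```typescript
// Minimal in-memory TTL cache sketch (hypothetical helper). Wraps an
// expensive generator so repeated requests within `ttlMs` reuse the
// previously computed string instead of re-reading every MDX file.
function cachedText(generate: () => string, ttlMs: number): () => string {
  let value: string | null = null;
  let expiresAt = 0;
  return () => {
    const now = Date.now();
    if (value === null || now >= expiresAt) {
      value = generate();
      expiresAt = now + ttlMs;
    }
    return value;
  };
}
```

In an actual Next.js route module, exporting a segment config such as `export const revalidate = 3600;` would let the framework cache the response instead, which is likely the simpler choice here.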
```ts
  !trimmed.startsWith("<") &&
  !trimmed.startsWith("import")
) {
  return trimmed.slice(0, 200);
```
The description is truncated at exactly 200 characters using slice(0, 200), which could cut a word in the middle, resulting in incomplete words in the description. This may look unprofessional in the generated LLMS.txt files.
Consider truncating at word boundaries instead, or adding an ellipsis when the description is truncated. For example, you could find the last space before the 200-character limit and truncate there.
Suggested change:

```diff
- return trimmed.slice(0, 200);
+ if (trimmed.length <= 200) {
+   return trimmed;
+ }
+ const maxLength = 200;
+ const lastSpaceBeforeLimit = trimmed.lastIndexOf(" ", maxLength);
+ const cutPosition =
+   lastSpaceBeforeLimit > 0 ? lastSpaceBeforeLimit : maxLength;
+ const truncated = trimmed.slice(0, cutPosition).trimEnd();
+ return truncated + "...";
```
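The suggested word-boundary logic can be exercised in isolation. This standalone sketch (`truncateAtWord` is an illustrative name, with the limit made a parameter so short strings can demonstrate the behavior):

```typescript
// Truncate at the last space before maxLength and append an ellipsis,
// mirroring the logic in the suggestion above.
function truncateAtWord(text: string, maxLength = 200): string {
  if (text.length <= maxLength) return text;
  const lastSpace = text.lastIndexOf(" ", maxLength);
  const cut = lastSpace > 0 ? lastSpace : maxLength;
  return text.slice(0, cut).trimEnd() + "...";
}
```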
```ts
export async function GET() {
  const content = generateLlmsTxt(BASE_URL);

  return new NextResponse(content, {
    headers: {
      "Content-Type": "text/plain; charset=utf-8",
      "Cache-Control": "public, max-age=3600, s-maxage=3600",
    },
  });
```
The route handlers read and process all documentation files on every request. While caching headers are set (1 hour), the server-side processing happens on each request to the route, which could be inefficient for large documentation sites.
Consider implementing one of the following optimizations:
- Use Next.js static generation to pre-generate these files at build time
- Implement server-side caching to avoid re-reading files on every request
- Use Next.js revalidation mechanisms for better performance
This is especially important if the documentation grows significantly.
```yaml
gray-matter:
  specifier: ^4.0.3
  version: 4.0.3
```
The gray-matter package depends on js-yaml@3.14.1, which is an older version from 2020. While this version should not have the critical security vulnerability that affected earlier versions (CVE-2021-25743), it would be beneficial to ensure that gray-matter is using an up-to-date version of its dependencies.
Consider checking if there's a newer version of gray-matter available that uses a more recent version of js-yaml, or monitor for updates to the dependency chain.
```yaml
next@15.3.6:
  resolution: {integrity: sha512-oI6D1zbbsh6JzzZFDCSHnnx6Qpvd1fSkVJu/5d8uluqnxzuoqtodVZjYvNovooznUq8udSAiKp7MbwlfZ8Gm6w==}
  engines: {node: ^18.18.0 || ^19.8.0 || >= 20.0.0}
  deprecated: This version has a security vulnerability. Please upgrade to a patched version. See https://nextjs.org/blog/security-update-2025-12-11 for more details.
```
The lock file shows that Next.js 15.3.6 has a security vulnerability as indicated by the deprecation notice: "This version has a security vulnerability. Please upgrade to a patched version."
While this isn't directly related to the LLMS.txt feature being added, the lock file refresh has surfaced this issue. Consider upgrading Next.js to a patched version in a follow-up PR to address this security concern.
Suggested change:

```diff
- next@15.3.6:
-   resolution: {integrity: sha512-oI6D1zbbsh6JzzZFDCSHnnx6Qpvd1fSkVJu/5d8uluqnxzuoqtodVZjYvNovooznUq8udSAiKp7MbwlfZ8Gm6w==}
-   engines: {node: ^18.18.0 || ^19.8.0 || >= 20.0.0}
-   deprecated: This version has a security vulnerability. Please upgrade to a patched version. See https://nextjs.org/blog/security-update-2025-12-11 for more details.
+ next@15.3.7:
+   resolution: {integrity: sha512-oI6D1zbbsh6JzzZFDCSHnnx6Qpvd1fSkVJu/5d8uluqnxzuoqtodVZjYvNovooznUq8udSAiKp7MbwlfZ8Gm6w==}
+   engines: {node: ^18.18.0 || ^19.8.0 || >= 20.0.0}
```
```ts
for (const title of sectionOrder) {
  const sectionPages = sectionMap.get(title);
  if (sectionPages && sectionPages.length > 0) {
    sections.push({ title, pages: sectionPages });
  }
}
```
The function only includes sections that are explicitly listed in the sectionOrder array (lines 216-223). If new documentation sections are added that don't match these exact names, they will be silently excluded from the generated LLMS.txt files.
Consider either:
- Adding a catch-all at the end to include any remaining sections from `sectionMap` that weren't in `sectionOrder`, or
- Adding a warning/logging when sections are being excluded
This ensures all documentation is included even when new sections are added.
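A minimal sketch of the catch-all option, assuming the `sectionOrder`/`sectionMap` shapes implied by the snippet above (the `Section` type and `orderSections` function are illustrative, not the PR's actual code):

```typescript
interface Section {
  title: string;
  pages: string[];
}

// Push explicitly ordered sections first, then append any remaining
// non-empty sections so newly added docs sections are never dropped.
function orderSections(
  sectionOrder: string[],
  sectionMap: Map<string, string[]>,
): Section[] {
  const sections: Section[] = [];
  const seen = new Set<string>();
  for (const title of sectionOrder) {
    const pages = sectionMap.get(title);
    if (pages && pages.length > 0) {
      sections.push({ title, pages });
      seen.add(title);
    }
  }
  // Catch-all pass: include sections that were not in sectionOrder.
  for (const [title, pages] of sectionMap) {
    if (!seen.has(title) && pages.length > 0) {
      sections.push({ title, pages });
    }
  }
  return sections;
}
```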
```ts
 * Clean MDX content for plain text output
 */
function cleanMdxContent(content: string): string {
  // Build regex patterns from the component list
  const componentsPattern = JSX_COMPONENTS_TO_REMOVE.join("|");

  return (
    content
      // Remove import statements
      .replace(/^import\s+.*$/gm, "")
      // Remove JSX/TSX components with their content
      .replace(
        new RegExp(`<(${componentsPattern})[^>]*>[\\s\\S]*?<\\/\\1>`, "g"),
        "",
      )
```
The regex pattern for removing JSX components with content uses a lazy `[\s\S]*?` match which could fail to properly handle nested components with the same name. For example, if there's a `<Steps>` component containing another `<Steps>` component, the regex will stop at the first closing tag and not correctly identify the matching one.
Consider using a more robust parsing approach or documenting this limitation. For simple documentation content this may be acceptable, but it's a potential edge case to be aware of.
Suggested change:

```diff
- * Clean MDX content for plain text output
- */
-function cleanMdxContent(content: string): string {
-  // Build regex patterns from the component list
-  const componentsPattern = JSX_COMPONENTS_TO_REMOVE.join("|");
-  return (
-    content
-      // Remove import statements
-      .replace(/^import\s+.*$/gm, "")
-      // Remove JSX/TSX components with their content
-      .replace(
-        new RegExp(`<(${componentsPattern})[^>]*>[\\s\\S]*?<\\/\\1>`, "g"),
-        "",
-      )
+ * Remove specified JSX components and their children from MDX content.
+ * This handles nested components with the same name using a simple depth counter.
+ */
+function removeJsxComponentsWithContent(
+  source: string,
+  componentNames: string[],
+): string {
+  let result = source;
+  for (const name of componentNames) {
+    const openTag = `<${name}`;
+    const closeTag = `</${name}>`;
+    let searchFrom = 0;
+    while (true) {
+      const firstOpen = result.indexOf(openTag, searchFrom);
+      if (firstOpen === -1) break;
+      let depth = 0;
+      let i = firstOpen;
+      while (i < result.length) {
+        const nextOpen = result.indexOf(openTag, i);
+        const nextClose = result.indexOf(closeTag, i);
+        if (nextClose === -1 && nextOpen === -1) {
+          // No further matching tags; abort to avoid infinite loop for this component.
+          i = -1;
+          break;
+        }
+        if (nextOpen !== -1 && (nextOpen < nextClose || nextClose === -1)) {
+          // Potential opening tag.
+          const endOfOpen = result.indexOf(">", nextOpen);
+          if (endOfOpen === -1) {
+            i = -1;
+            break;
+          }
+          const selfClosingPos = result.lastIndexOf("/>", endOfOpen);
+          const isSelfClosing =
+            selfClosingPos !== -1 &&
+            selfClosingPos >= nextOpen &&
+            selfClosingPos <= endOfOpen;
+          if (!isSelfClosing) {
+            depth += 1;
+          }
+          i = endOfOpen + 1;
+        } else {
+          // Closing tag for this component.
+          if (depth === 0) {
+            // Malformed structure; break to avoid infinite loop.
+            i = -1;
+            break;
+          }
+          depth -= 1;
+          const endOfClose = nextClose + closeTag.length;
+          i = endOfClose;
+          if (depth === 0) {
+            // Remove from the first opening tag to the end of this closing tag.
+            result =
+              result.slice(0, firstOpen) + result.slice(endOfClose);
+            // Continue searching from the same position in the updated string.
+            searchFrom = firstOpen;
+            break;
+          }
+        }
+      }
+      if (i === -1) {
+        // Could not find a complete, well-formed pair; stop processing this component.
+        break;
+      }
+    }
+  }
+  return result;
+}
+
+/**
+ * Clean MDX content for plain text output
+ */
+function cleanMdxContent(content: string): string {
+  // Build regex patterns from the component list
+  const componentsPattern = JSX_COMPONENTS_TO_REMOVE.join("|");
+  const contentWithoutImports = content.replace(/^import\s+.*$/gm, "");
+  const withoutComponents = removeJsxComponentsWithContent(
+    contentWithoutImports,
+    JSX_COMPONENTS_TO_REMOVE,
+  );
+  return (
+    withoutComponents
+      // Remove self-closing JSX/TSX components for the specified list
```
Description
Adds LLMS.txt support to the documentation site to improve content discoverability and consumption by Large Language Models.
This PR introduces:
- `llms.txt` file (`/llms.txt`) providing an overview and key links.
- `llms-full.txt` file (`/llms-full.txt`) containing the full documentation content.
- Updated `robots.txt` to reference `llms.txt` as per the LLMS.txt specification.

Test Plan
- Visit `/llms.txt` and verify the concise overview with links is displayed.
- Visit `/llms-full.txt` and verify the comprehensive documentation content is displayed.
- Visit `/robots.txt` and verify the `Sitemap: https://siwa.aptos.dev/llms.txt` entry is present.

Related Links