feat: parse GitHub.copilot-chat/transcripts/*.jsonl event-stream format #70

Open
hora7ce wants to merge 3 commits into microsoft:main from hora7ce:feat/parse-transcript-jsonl


hora7ce commented May 27, 2026

Summary

Closes #64.

VS Code stores Copilot Chat sessions in two distinct locations inside each workspace's workspaceStorage entry:

| # | Location | Format | Status |
|---|----------|--------|--------|
| 1 | chatSessions/*.{json,jsonl} | JSON / JSONL session blobs | ✅ Already parsed |
| 2 | GitHub.copilot-chat/transcripts/*.jsonl | Newline-delimited event stream | ❌ Silently ignored (this PR) |

Sessions recorded only in format 2 never appeared in the dashboard. This PR makes the extension parse both formats transparently.


The transcript event-stream format

Each .jsonl file represents one session. Each line is a typed JSON event:

| Event type | Meaning |
|------------|---------|
| session.start | New conversation; carries sessionId and metadata |
| user.message | User turn with content text |
| assistant.message | AI response; may include toolRequests[] array |
| tool.execution_start | Tool call begins; carries toolCallId + toolName |
| tool.execution_complete | Tool call finishes |
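
For readers unfamiliar with the format, the event table above could be modelled as a discriminated union. This is only a sketch: the event type names and the fields sessionId, content, toolRequests, toolCallId and toolName come from the table, while the flat field layout, the `type` discriminant, the `name` field on a tool request, and the `toolCallId` on tool.execution_complete are assumptions, as is the helper `describeEvent`.

```typescript
// Hypothetical shapes for the transcript events. Only the event type names and
// the fields sessionId, content, toolRequests, toolCallId and toolName are
// taken from the PR description; everything else is an assumption.
interface ToolRequest {
  toolCallId: string;
  name?: string; // assumed field name for the requested tool
}

type TranscriptEvent =
  | { type: "session.start"; sessionId: string }
  | { type: "user.message"; content: string }
  | { type: "assistant.message"; content: string; toolRequests?: ToolRequest[] }
  | { type: "tool.execution_start"; toolCallId: string; toolName: string }
  | { type: "tool.execution_complete"; toolCallId: string }; // toolCallId assumed

// Switching on the `type` discriminant gives typed access to each payload.
function describeEvent(event: TranscriptEvent): string {
  switch (event.type) {
    case "session.start":
      return `session ${event.sessionId} started`;
    case "user.message":
      return `user: ${event.content}`;
    case "assistant.message":
      return `assistant: ${event.content} (${event.toolRequests?.length ?? 0} tool requests)`;
    case "tool.execution_start":
      return `tool ${event.toolName} started (${event.toolCallId})`;
    case "tool.execution_complete":
      return `tool call ${event.toolCallId} finished`;
  }
}
```

A union like this lets the compiler flag any handler that forgets an event type, which may matter if the format gains new events later.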

Changes

src/core/parser-vscode.ts

New private helpers — each kept small to stay within the ESLint complexity limit:

| Helper | Purpose |
|--------|---------|
| listTranscriptFiles(dir) | Lists *.jsonl files under a transcripts/ directory; returns [] when the dir does not exist |
| parseTranscriptLines(raw) | Deserialises the raw JSONL text; blank and corrupt lines are silently skipped |
| buildToolNameIndex(events) | Pre-indexes toolCallId → toolName from tool.execution_start events |
| collectToolsFromToolRequests(...) | Extracts tool names from assistant.message.toolRequests[], with toolCallId fallback to the pre-built index |
| buildRequestsFromTranscriptEvents(events, toolNames) | Groups events into SessionRequest[]; each user.message starts a new turn that is flushed when the next user.message (or end-of-stream) arrives |
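
To illustrate the first two parsing helpers, here is a minimal sketch of tolerant JSONL parsing and the toolCallId → toolName index. It is not the code from this PR: the `LooseEvent` shape is a stand-in for the real event type, and additionally rejecting lines without a string `type` field is an assumption about what counts as "corrupt".

```typescript
// Stand-in event shape for this sketch; the PR's real event type is not shown here.
interface LooseEvent {
  type: string;
  toolCallId?: string;
  toolName?: string;
}

// Parse JSONL text line by line, skipping blank lines and lines that fail to parse.
function parseTranscriptLines(raw: string): LooseEvent[] {
  const events: LooseEvent[] = [];
  for (const line of raw.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // blank line
    try {
      const parsed: unknown = JSON.parse(trimmed);
      // Assumption: a usable event is a JSON object with a string `type`.
      if (
        typeof parsed === "object" &&
        parsed !== null &&
        typeof (parsed as { type?: unknown }).type === "string"
      ) {
        events.push(parsed as LooseEvent);
      }
    } catch {
      // Corrupt line: skip it without failing the whole file.
    }
  }
  return events;
}

// Index toolCallId -> toolName using only tool.execution_start events.
function buildToolNameIndex(events: LooseEvent[]): Map<string, string> {
  const index = new Map<string, string>();
  for (const event of events) {
    if (event.type === "tool.execution_start" && event.toolCallId && event.toolName) {
      index.set(event.toolCallId, event.toolName);
    }
  }
  return index;
}
```

Building the index in a separate pass means a later lookup by toolCallId works regardless of whether the tool.execution_start line appears before or after the assistant.message that requested the call.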

New exported function:

  • parseTranscriptFile(filePath, wsId, wsName, harness, customInstructionsBytes?) — reads a transcript file, builds the Session object (or returns null for empty / unreadable files), and is the single public surface consumed by the integration points below.
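
The turn grouping and the null return described above might look roughly like the following. Treat it as a sketch under stated assumptions rather than the PR's implementation: `Ev`, `TurnSketch`, `groupIntoTurns` and `sessionFromEvents` are hypothetical names, the `name` field on a tool request is assumed, and joining multiple assistant messages with a newline is a guess. The field names messageText, responseText and toolsUsed are the ones listed in the test table.

```typescript
interface ToolRequest { toolCallId?: string; name?: string } // `name` is assumed
interface Ev {
  type: string;
  content?: string;
  toolRequests?: ToolRequest[];
  toolCallId?: string;
  toolName?: string;
}
interface TurnSketch { messageText: string; responseText: string; toolsUsed: string[] }
interface OpenTurn { messageText: string; responseParts: string[]; tools: Set<string> }

// Convert an in-progress turn into its final form; the Set has already deduplicated tools.
function closeTurn(turn: OpenTurn): TurnSketch {
  return {
    messageText: turn.messageText,
    responseText: turn.responseParts.join("\n"),
    toolsUsed: Array.from(turn.tools),
  };
}

// Each user.message opens a turn; the turn is flushed at the next user.message or at end of stream.
function groupIntoTurns(events: Ev[], toolNames: Map<string, string>): TurnSketch[] {
  const turns: TurnSketch[] = [];
  let open: OpenTurn | undefined;
  for (const event of events) {
    if (event.type === "user.message") {
      if (open) turns.push(closeTurn(open));
      open = { messageText: event.content ?? "", responseParts: [], tools: new Set<string>() };
    } else if (open && event.type === "assistant.message") {
      if (event.content) open.responseParts.push(event.content);
      for (const request of event.toolRequests ?? []) {
        // Prefer the name on the request; fall back to the pre-built index by toolCallId.
        const name = request.name ?? (request.toolCallId ? toolNames.get(request.toolCallId) : undefined);
        if (name) open.tools.add(name);
      }
    } else if (open && event.type === "tool.execution_start" && event.toolName) {
      open.tools.add(event.toolName);
    }
  }
  if (open) turns.push(closeTurn(open)); // end-of-stream flush
  return turns;
}

// A transcript with no user turns yields null instead of an empty session.
function sessionFromEvents(events: Ev[], toolNames: Map<string, string>): { requests: TurnSketch[] } | null {
  const requests = groupIntoTurns(events, toolNames);
  return requests.length === 0 ? null : { requests };
}
```

Collecting tool names in a Set per turn is one simple way to get the deduplication behaviour: a tool seen in both toolRequests and tool.execution_start is recorded once.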

Integration in processWorkspaceEntry and processWorkspaceEntryAsync:

  • After processing chatSessions/, both functions now also scan GitHub.copilot-chat/transcripts/ and add discovered sessions to the same sessions[] / sessionSourceIndex pipeline — fully transparent to the rest of the analyzer.
  • The async path accounts for transcript files in totalUnits so progress reporting stays accurate.
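
As a rough illustration of the integration points above, the directory listing and the progress budget could be sketched as follows. `countWorkUnits` is a hypothetical helper, and catching every read error (not only a missing directory) is a simplification of the "returns [] when the dir does not exist" behaviour described for listTranscriptFiles.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// List *.jsonl files directly inside a transcripts directory.
// Simplification: any read error (including a missing directory) yields [].
function listTranscriptFiles(dir: string): string[] {
  try {
    return fs
      .readdirSync(dir, { withFileTypes: true })
      .filter((entry) => entry.isFile() && entry.name.endsWith(".jsonl"))
      .map((entry) => path.join(dir, entry.name))
      .sort();
  } catch {
    return [];
  }
}

// Hypothetical helper: the progress total covers chat session files and transcript files alike.
function countWorkUnits(chatFiles: string[], transcriptFiles: string[]): number {
  return chatFiles.length + transcriptFiles.length;
}
```

Counting both file kinds up front keeps a progress bar from reaching 100% before the transcript files have been processed.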

src/core/parser-vscode.test.ts

Five new tests in a parseTranscriptFile describe block:

| Test | Acceptance criterion covered |
|------|------------------------------|
| Full flow: single turn with tool call | Maps event-stream turns to SessionData.requests[]; validates sessionId, workspaceId, harness, messageText, responseText, toolsUsed, agentMode, timestamp |
| Multi-turn grouping | Two user/assistant pairs produce two separate requests |
| No user messages | Returns null gracefully |
| All-corrupt JSONL lines | Returns null gracefully (guards the events.length === 0 path) |
| Tool deduplication | Same tool name appearing in both toolRequests and tool.execution_start collapses to a single entry |

Dependency

This branch is based on PR #63 (fix/vscode-server-log-discovery), which adds ~/.vscode-server path discovery. The two PRs are independent at the code level (they touch disjoint functions), but merging #63 first is recommended so that transcript sessions from VS Code Server users are labelled with the correct Local Agent (Server) harness.


Checklist

hora7ce added 3 commits May 26, 2026 10:35
…SH / devcontainer

findVsCodeDirs() only scanned desktop installation paths (~/.config/Code,
AppData, ~/Library/...) and missed the VS Code Server path used by WSL2,
Remote SSH, and Dev Containers:

  ~/.vscode-server/data/User/workspaceStorage
  ~/.vscode-server-insiders/data/User/workspaceStorage

Add both server editions to the scan on non-Windows platforms.

Also extend harnessFromPath() with .vscode-server-insiders and
.vscode-server checks (ordered most-specific first to avoid the Insiders
path matching the plain .vscode-server substring) so sessions discovered
via these paths are labelled 'Local Agent (Server)' or
'Local Agent (Server Insiders)' rather than the fallback 'Local Agent'.

Fixes microsoft#62
- README.extension.md: add Local Agent (Server) and Local Agent (Server Insiders) rows to Supported Harnesses table
- docs/content/_index.md: add server harness row to Multi-Harness Support table
- docs/content/getting-started/supported-tools.md: note Remote-WSL/SSH/devcontainer log paths under Local Agent section
- parser-vscode.ts: tighten harnessFromPath ordering comment (substring collision) and findVsCodeDirs platform guard comment per reviewer suggestions
- parser-vscode.test.ts: add findVsCodeDirs test covering server workspaceStorage path inclusion via temporary home directory
Fixes microsoft#64.

VS Code stores Copilot Chat sessions in two locations inside each
workspace's workspaceStorage entry:

  1. chatSessions/*.{json,jsonl}   — existing format (already parsed)
  2. GitHub.copilot-chat/transcripts/*.jsonl — newer event-stream format
     (silently ignored until now)

This commit adds support for the second format.

## New helpers (parser-vscode.ts)

- listTranscriptFiles(dir)  — lists *.jsonl files in a transcripts/ dir
- parseTranscriptLines(raw) — parses JSONL text into TranscriptEvent[]
- buildToolNameIndex(events) — pre-indexes toolCallId → toolName
- collectToolsFromToolRequests(...) — extracts tool names from assistant
  message toolRequests arrays with fallback to the pre-built index
- buildRequestsFromTranscriptEvents(events, toolNames) — groups events
  into per-turn SessionRequest[] (one request per user.message)
- parseTranscriptFile(filePath, wsId, wsName, harness, customInstrBytes)
  — public API: reads a transcript file and returns a Session or null

## Integration

processWorkspaceEntry / processWorkspaceEntryAsync now scan the
transcript directory alongside chatSessions and wire discovered sessions
into the same sessions[] / sessionSourceIndex pipeline so the dashboard
picks them up transparently.

The async path tracks transcript files in the same progress-reporting
budget as chat files (totalUnits includes both).

## Tests (parser-vscode.test.ts)

Five new cases in the parseTranscriptFile describe block:
  - full flow: session.start → user.message → assistant.message with
    tool calls → tool.execution_start/complete → final assistant.message
  - multi-turn: two user/assistant pairs produce two requests
  - empty session: no user messages → null
  - malformed file: all-corrupt lines → null (events.length === 0)
  - deduplication: same tool appearing in both toolRequests and
    tool.execution_start is deduplicated to a single entry

aymenfurter (Contributor) commented May 27, 2026

@san360 LGTM. Can you check whether this PR works and does not introduce any duplication? Thanks!

aymenfurter requested a review from san360 May 27, 2026 16:22

eyehiel commented May 28, 2026

I tried this PR because my data was empty, and with it my data is now parsed,
but for some reason I see duplicated sessions in the Timeline:

image
