Claude Logs Structure and Processing Documentation

This document describes Claude log files (JSONL format), their structure, and how StackUnderflow processes them to generate analytics while handling complex edge cases.

  1. Log File Structure
  2. Entry Types and Fields
  3. Message Formats
  4. Tool Usage
  5. Special Cases
  6. Processing Pipeline
  7. Deduplication and Tool Counting
  8. Storage
  9. Legacy Format
  10. Known Issues and Solutions

Log File Structure

Modern Claude Code (January 2026 and later) writes one JSONL file per session, organised by project:

~/.claude/projects/{project-path-slug}/{session-id}.jsonl

The slug is the absolute project path with path separators replaced by hyphens:

/Users/example/.claude/projects/-Users-example-dev-myproject/08fce8c2-8453-42da-a52c-e03472c24e0f.jsonl
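
A minimal sketch of that mapping, assuming path separators simply become hyphens (the adapter's real helper, referenced later in this document as _slug_for(), may handle additional edge cases):

def _slug_for(project_path: str) -> str:
    """Turn an absolute project path into the directory slug used
    under ~/.claude/projects/. Sketch only: assumes every path
    separator maps 1:1 to a hyphen, as in the example above."""
    return project_path.replace("/", "-")

assert _slug_for("/Users/example/dev/myproject") == "-Users-example-dev-myproject"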

ClaudeAdapter.enumerate() walks ~/.claude/projects/, yields a SessionRef for every .jsonl file it finds, and falls back to ~/.claude/history.jsonl for project directories that predate the per-project format (see Legacy Format).

While JSONL files are named after a primary session ID, they can contain log entries from multiple sessions:

  1. Conversation Continuation: When a conversation is continued after compaction or restart
  2. Cross-Session References: When Claude references work from another session
  3. Session Merging: When multiple related sessions are logged together

Best Practice: The adapter reads the sessionId field from each JSONL line and stores it on the Record. Filename stems are used as a fallback only when sessionId is absent.
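
A sketch of that precedence rule (the function name and bare-string return type are illustrative, not the adapter's actual signature; the real code builds a full Record):

import json
from pathlib import Path

def session_id_for(line: str, source_file: Path) -> str:
    """Prefer the per-line sessionId; fall back to the filename stem."""
    obj = json.loads(line)
    return obj.get("sessionId") or source_file.stem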

Entry Types and Fields

Every log line is a JSON object whose root type field identifies the entry type:

  • summary — Session or conversation summary
  • user — User messages (includes tool results)
  • assistant — Claude’s responses

Important: The root type field indicates the log entry type, NOT necessarily the message role.

Fields observed on log entries (not every field appears on every entry type):

  • type (string): Type of the entry
  • timestamp (ISO 8601): When the entry was created
  • uuid (string): Unique identifier for this entry
  • sessionId (string): Session identifier
  • parentUuid (string|null): UUID of the parent message
  • isSidechain (boolean): Whether this is a side conversation (e.g., Task tool)
  • userType (string): Type of user (e.g., “external”)
  • cwd (string): Current working directory
  • version (string): Claude version
  • message (object): The actual message content
  • requestId (string): API request identifier
  • message.id (string): Unique message ID (important for streaming)
  • toolUseResult (object|string): Detailed tool execution results
  • isCompactSummary (boolean): True for conversation summaries
Message Formats

A user entry wraps the message object; results of tools from previous assistant turns appear here as tool_result blocks:

{
  "type": "user",
  "message": {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "User's message text"
      },
      {
        "type": "tool_result",
        "tool_use_id": "tool_id",
        "content": "Tool execution result"
      }
    ]
  }
}
An assistant entry carries the model name, content blocks, and token usage:

{
  "type": "assistant",
  "message": {
    "id": "msg_id",
    "type": "message",
    "role": "assistant",
    "model": "claude-opus-4-20250514",
    "content": [
      {
        "type": "text",
        "text": "Claude's response text"
      },
      {
        "type": "tool_use",
        "id": "toolu_xxxxx",
        "name": "ToolName",
        "input": {
          "parameter": "value"
        }
      }
    ],
    "stop_reason": "tool_use",
    "usage": {
      "input_tokens": 1234,
      "output_tokens": 567,
      "cache_creation_input_tokens": 890,
      "cache_read_input_tokens": 123
    }
  }
}
A summary entry describes a conversation and points at its final message:

{
  "type": "summary",
  "summary": "Brief description of the conversation",
  "leafUuid": "uuid-of-last-message"
}

summary and compact_summary entries are skipped by the adapter (_role_from() returns None for them) — they are not inserted into the messages table.

Tool Usage

Tools seen in tool_use blocks include:

  • File Operations: Read, Write, Edit, MultiEdit
  • System: Bash, Grep, Glob, LS
  • Task Management: TodoWrite, TodoRead
  • Special: Task (launches sub-agents), WebFetch, WebSearch
  • Jupyter: NotebookRead, NotebookEdit

Tool results appear in subsequent user messages:

{
  "type": "tool_result",
  "tool_use_id": "toolu_xxxxx",
  "content": "Result of tool execution",
  "is_error": true
}

ClaudeAdapter._tools_from() walks the message.content array and collects every block whose type is "tool_use". The resulting tuple of names is stored in the Record.tools field and serialised as tools_json in the messages table.
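
A condensed sketch of that walk, assuming message.content is a list of block dicts (string-valued content simply yields no tools):

def _tools_from(message: dict) -> tuple[str, ...]:
    """Collect the name of every tool_use block in message.content."""
    content = message.get("content")
    if not isinstance(content, list):
        return ()  # e.g. plain-string content: nothing to collect
    return tuple(
        block["name"]
        for block in content
        if isinstance(block, dict) and block.get("type") == "tool_use"
    )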

Special Cases

Critical: Task tool operations are NOT individually logged:

  • Only the Task invocation and final result appear in logs
  • Internal tool operations by sub-agents are invisible
  • Token usage by sub-agents is NOT tracked
  • This causes apparent “missing” tool counts in analytics

Claude logs streaming responses as multiple entries with the same message ID:

// Entry 1: Text response
{
  "type": "assistant",
  "message": {
    "id": "msg_01Y9yWFraRY5ptb3Bqbvpmqx",
    "content": [{"type": "text", "text": "I'll implement..."}]
  }
}
// Entry 2: Tool use (same message ID)
{
  "type": "assistant",
  "message": {
    "id": "msg_01Y9yWFraRY5ptb3Bqbvpmqx",
    "content": [{"type": "tool_use", "name": "Write", ...}]
  }
}

When conversations approach context limits, Claude Code creates comprehensive summaries:

{
  "type": "user",
  "isCompactSummary": true,
  "message": {
    "role": "user",
    "content": [{
      "type": "text",
      "text": "This session is being continued from a previous conversation..."
    }]
  }
}
When the user rejects a tool invocation, the rejection is logged as a tool_result error:

{
  "type": "tool_result",
  "content": "The user doesn't want to proceed with this tool use...",
  "is_error": true
}

A mid-stream interruption appears as both an error AND a user message:

// As error
{
  "type": "tool_result",
  "content": "[Request interrupted by user for tool use]",
  "is_error": true
}
// As user message
{
  "type": "user",
  "message": {
    "content": [{"text": "[Request interrupted by user for tool use]no, don't..."}]
  }
}
Processing Pipeline

Data flows from the raw log directory to the API through the following stages:

~/.claude/
    |
    v
ClaudeAdapter (stackunderflow/adapters/claude.py)
    enumerate() -> SessionRef[]
    read(ref) -> Record[]
    |
    v
ingest/writer (stackunderflow/ingest/writer.py)
    ingest_file() -- one transaction per file,
    mtime + byte-offset tracking via ingest_log table
    |
    v
SQLite store (~/.stackunderflow/store.db)
    projects / sessions / messages / ingest_log tables
    |
    v
store/queries (stackunderflow/store/queries.py)
    get_project_stats() -- reconstructs RawEntry objects from raw_json,
    feeds the stats chain below
    |
    v
stats chain (stackunderflow/stats/)
    classifier -> enricher -> aggregator -> formatter
    |
    v
API routes (stackunderflow/routes/)

ingest/writer.run_ingest() compares each SessionRef’s (mtime, size) against the ingest_log table:

  • Unchanged (same mtime and size): skip entirely — no read, no transaction.
  • Appended (larger size, same or newer mtime): seek to processed_offset and read only the new bytes.
  • Truncated / rotated (size shrank): delete the ingest_log row and reparse from byte 0.

This means large projects pay for a filesystem stat check only, not a full reparse, on every poll.
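
A sketch of that three-way decision; the parameter names mirror the ingest_log columns described above, and the surrounding transaction handling (and the mtime check on the append path) is elided:

import os

def plan_ingest(path: str, logged_mtime: float, logged_size: int,
                processed_offset: int) -> tuple[str, int]:
    """Return (action, start_offset) for one tracked source file."""
    st = os.stat(path)
    if st.st_mtime == logged_mtime and st.st_size == logged_size:
        return ("skip", logged_size)         # unchanged: no read at all
    if st.st_size > logged_size:
        return ("append", processed_offset)  # read only the new bytes
    return ("reparse", 0)                    # shrank (or ambiguous): start over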

ClaudeAdapter._parse_line() converts a raw JSONL object into a Record dataclass:

# Role assignment
base_type = obj['type']  # 'user' | 'assistant' | 'summary' | ...
if base_type == 'user':
    role = 'user'
elif base_type == 'assistant':
    role = 'assistant'
elif base_type in ('summary', 'compact_summary'):
    return None  # skip — not a conversational record

Token counts come from message.usage; tool names from every "tool_use" block in message.content; the entire raw dict is preserved in Record.raw and written to messages.raw_json.
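
A sketch of the usage extraction, mapping the JSONL field names onto the column names used in the messages table (the defaults of 0 for lines without a usage block are an assumption):

def _tokens_from(message: dict) -> dict[str, int]:
    """Pull the four token counters out of message.usage."""
    usage = message.get("usage") or {}
    return {
        "input_tokens": usage.get("input_tokens", 0),
        "output_tokens": usage.get("output_tokens", 0),
        "cache_create_tokens": usage.get("cache_creation_input_tokens", 0),
        "cache_read_tokens": usage.get("cache_read_input_tokens", 0),
    }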

All timestamps are stored in UTC but displayed in the user’s local timezone:

  1. Frontend detects timezone offset: new Date().getTimezoneOffset()
  2. Backend converts UTC to local time for grouping
  3. Charts display dates in the user’s local timezone
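
A sketch of the backend side of step 2, assuming the frontend ships the raw getTimezoneOffset() value (minutes behind UTC, hence subtracted):

from datetime import datetime, timedelta

def to_local_day(ts_iso: str, tz_offset_minutes: int) -> str:
    """Group a UTC ISO-8601 timestamp into the user's local calendar day."""
    utc = datetime.fromisoformat(ts_iso.replace("Z", "+00:00"))
    local = utc - timedelta(minutes=tz_offset_minutes)
    return local.date().isoformat()

# 23:30 UTC on Jan 1 is already Jan 2 in UTC+3 (getTimezoneOffset() == -180)
assert to_local_day("2026-01-01T23:30:00Z", -180) == "2026-01-02"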

Deduplication and Tool Counting

When Claude Code crashes and restarts with --continue:

  • Duplicate messages appear in multiple files
  • Same message shows inconsistent tool counts
  • Incomplete assistant responses
  • Missing tool execution logs

The stackunderflow/stats/classifier.py module receives a list of RawEntry objects (reconstructed from messages.raw_json) and performs two-phase deduplication:

  1. Phase 1 — ID-based merge: Merges entries sharing the same message.id (keeping the longer content variant). This handles streaming responses where Claude emits multiple entries for the same message.

  2. Phase 2 — Exact duplicate drop: Drops exact duplicates by hashing timestamp + content + UUID. This handles entries duplicated across files after crash/continue scenarios.

The deduplication logic that was previously in pipeline/dedup.py now lives inside the stats chain at stackunderflow/stats/classifier.py. The on-disk records themselves are stored with duplicates intact — dedup is a query-time operation so the raw JSONL is always faithfully preserved.
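
A condensed sketch of the two phases, with RawEntry reduced to a plain dict (the real classifier tracks more state and merges content rather than keeping whole entries):

import hashlib

def _content_len(entry: dict) -> int:
    return len(str((entry.get("message") or {}).get("content", "")))

def deduplicate(entries: list[dict]) -> list[dict]:
    # Phase 1: merge entries sharing message.id, keeping the longer variant.
    by_id: dict[str, dict] = {}
    rest: list[dict] = []
    for entry in entries:
        msg_id = (entry.get("message") or {}).get("id")
        if msg_id is None:
            rest.append(entry)
        elif msg_id not in by_id or _content_len(entry) > _content_len(by_id[msg_id]):
            by_id[msg_id] = entry
    merged = list(by_id.values()) + rest

    # Phase 2: drop exact duplicates by hashing timestamp + content + UUID.
    seen: set[str] = set()
    unique: list[dict] = []
    for entry in merged:
        key = hashlib.sha256(repr((
            entry.get("timestamp"),
            (entry.get("message") or {}).get("content"),
            entry.get("uuid"),
        )).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique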

Other edge cases the pipeline must handle:

  1. Split Interactions: User message in file A, assistant response in file B
  2. Incomplete Tool Executions: Crash during tool execution
  3. Compact Summary Continuations: Sessions starting with summaries
  4. Missing Tool Logs: Tools used but not logged
  5. Streaming Response Merging: Multiple entries with same message ID
  6. Task Tool Sidechains: Sub-agent operations not logged
Storage

All parsed data lives in a single SQLite database:

~/.stackunderflow/store.db

The authoritative schema lives in stackunderflow/store/migrations/v001_initial.sql. Key tables:

Table       Purpose
projects    One row per (provider, slug) pair
sessions    One row per session UUID, FK to projects
messages    One row per parsed line, FK to sessions
ingest_log  One row per source file; tracks mtime, size, processed_offset

messages is the central table. Rows are keyed on (session_fk, seq) where seq is the byte offset of the line within its source file. Every row carries a raw_json column containing the full original JSONL object, so nothing is ever discarded during ingest — downstream consumers reconstruct whatever they need from the raw payload.
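
A sketch of how such offsets can be captured while reading, assuming the file is opened in binary mode so offsets stay byte-accurate with multi-byte UTF-8 content:

from typing import Iterator

def lines_with_offsets(path: str) -> Iterator[tuple[int, str]]:
    """Yield (byte_offset, line) pairs; the offset is what seq stores."""
    with open(path, "rb") as fh:
        while True:
            offset = fh.tell()
            raw = fh.readline()
            if not raw:
                break
            yield offset, raw.decode("utf-8")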

Selected messages columns:

  • seq (INTEGER) — byte offset used as a stable, monotonically increasing sequence number
  • role (TEXT) — "user" or "assistant"
  • model (TEXT) — model identifier when present in the source line
  • input_tokens, output_tokens, cache_create_tokens, cache_read_tokens (INTEGER)
  • tools_json (TEXT) — JSON array of tool names called in this message
  • raw_json (TEXT) — the complete original JSONL object
  • is_sidechain (INTEGER 0/1) — set when isSidechain is true in the source
  • uuid, parent_uuid (TEXT) — message threading fields from the JSONL

All typed query helpers that read from the store live in stackunderflow/store/queries.py. Application code imports helpers from there rather than writing raw SQL.
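
A hypothetical helper in that style (the function name and the join-column names, such as sessions.uuid and sessions.id, are illustrative assumptions, not taken from queries.py):

import json
import sqlite3
from collections import Counter

def tool_call_counts(conn: sqlite3.Connection, session_uuid: str) -> dict[str, int]:
    """Count tool invocations per tool name for one session via tools_json."""
    rows = conn.execute(
        "SELECT m.tools_json FROM messages m "
        "JOIN sessions s ON m.session_fk = s.id "
        "WHERE s.uuid = ? AND m.tools_json IS NOT NULL",
        (session_uuid,),
    )
    counts: Counter = Counter()
    for (tools_json,) in rows:
        counts.update(json.loads(tools_json))
    return dict(counts)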

Legacy Format

Before January 2026, Claude Code did not write per-project JSONL files. Instead, all prompts were appended to a single centralised file:

~/.claude/history.jsonl

Each line in that file has a different shape from modern per-project JSONL — notably it uses "project" (an absolute path string) and "timestamp" (milliseconds since epoch) rather than the nested "message" object modern sessions use.

ClaudeAdapter handles both formats transparently:

  • enumerate() checks each project directory for .jsonl files. If none are found but a .continuation_cache.json exists, it treats the project as legacy and yields a single synthetic SessionRef whose session_id starts with "legacy-" and whose file_path points at ~/.claude/history.jsonl.
  • read() detects the "legacy-" prefix and calls _read_history() instead of _read_jsonl().
  • _read_history() filters lines by _slug_for(obj["project"]), converts the millisecond timestamp to ISO 8601, and yields minimal Record objects (role "user", no token counts, no tools) — one per matching history line.

This means analytics for pre-January-2026 projects will show user prompts but no token counts or model information, since the legacy format does not record those fields.
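
A sketch of that filter-and-convert step, reusing the _slug_for() sketch from earlier (the real _read_history() yields Record objects rather than dicts):

import json
from datetime import datetime, timezone

def read_history(history_path: str, project_slug: str):
    """Yield minimal user records for one project from the legacy file."""
    with open(history_path, encoding="utf-8") as fh:
        for line in fh:
            obj = json.loads(line)
            if _slug_for(obj["project"]) != project_slug:
                continue  # line belongs to a different project
            ts = datetime.fromtimestamp(obj["timestamp"] / 1000, tz=timezone.utc)
            yield {"role": "user", "timestamp": ts.isoformat(), "raw": obj}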

Known Issues and Solutions

Cause: Same user message in multiple files after crash/continue
Solution: Two-phase deduplication in stats/classifier.py at query time; raw records are preserved intact in the store

Cause: Incomplete logging, Task tool limitations, streaming issues
Solution: Tool count reconciliation across all interaction versions during the classify → enrich chain

Cause: Incomplete assistant messages from crashes
Solution: Preserve model info during interaction merging in the stats chain; MAX(CASE WHEN model IS NOT NULL …) aggregation in get_session_stats()

Status: Documented in TODO
Workaround: Refresh individual project dashboards first

Together, the mechanisms above are designed to deliver:

  1. Accuracy: No duplicate messages, correct type classification
  2. Performance: Incremental ingest — only new bytes read per poll cycle
  3. Completeness: All tools counted accurately; raw JSONL always preserved
  4. Timezone Support: Correct local time display
  5. Reliability: Graceful handling of crashes and continuations