# Evals Export
The Evals view lets you label coding sessions with quality ratings and export them as structured datasets for AI evaluation frameworks. This is useful for building benchmarks, training data, and quality-assurance workflows.

## Eval workflow
### Mark sessions as eval-ready

From the Sessions view, select sessions and click **Mark as Eval Ready**. This sets `evalReady: true` on the session, making it appear in the Evals tab.

### Add expected output

For sessions you want to use as ground truth, write the expected output that the model should have produced.
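The two steps above amount to setting a few fields on the session record. This sketch is purely illustrative — the helper name, in-memory dict, and default status are assumptions, not the app's real API:

```python
from typing import Optional

# Illustrative sketch of what "Mark as Eval Ready" does to a session record.
# The dict shape mirrors the eval fields described later in this document;
# the helper and its defaults are hypothetical.
def mark_eval_ready(session: dict, expected_output: Optional[str] = None) -> dict:
    """Flag a session so it appears in the Evals tab."""
    session["evalReady"] = True
    # Newly marked sessions start unreviewed (see "Eval labels" below).
    session.setdefault("evalStatus", "needs_review")
    if expected_output is not None:
        session["expectedOutput"] = expected_output
    return session

session = {"id": "sess_123", "evalReady": False}
mark_eval_ready(session, expected_output="Use a context manager to close the file.")
print(session["evalReady"], session["evalStatus"])  # True needs_review
```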
## Eval labels

Each session can be tagged with one of these statuses:

| Status | Meaning | When to use |
|---|---|---|
| `golden` | Perfect response, suitable as ground truth | The assistant’s answer is exactly right, well-structured, and complete |
| `correct` | Acceptable response | The answer works but could be better in style or completeness |
| `incorrect` | Wrong or harmful response | The assistant made factual errors, produced broken code, or missed the point |
| `needs_review` | Not yet evaluated | The default for sessions awaiting human review |

The status is stored in the `sessions.evalStatus` field in the database.
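Since the status lives in a plain database column, filtering by label is a simple query. This sketch assumes a minimal schema — only the `evalStatus` column is named in this document; the rest is invented for illustration:

```python
import sqlite3

# Hypothetical sketch: querying sessions by eval status. Only the
# sessions.evalStatus column is documented; id and the seed data here
# are made up for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (id TEXT PRIMARY KEY, evalStatus TEXT)")
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?)",
    [("s1", "golden"), ("s2", "needs_review"), ("s3", "golden")],
)

golden_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM sessions WHERE evalStatus = ?", ("golden",)
    )
]
print(golden_ids)  # ['s1', 's3']
```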
## Eval metadata fields

Beyond the status label, each session supports these eval-specific fields:

| Field | Type | Description |
|---|---|---|
| `evalReady` | boolean | Whether the session appears in the Evals view |
| `evalStatus` | string | One of: `golden`, `correct`, `incorrect`, `needs_review` |
| `evalNotes` | string | Free-text notes about the quality or issues |
| `evalTags` | string[] | Custom tags for categorization (e.g., “refactoring”, “debugging”, “architecture”) |
| `expectedOutput` | string | The ideal response for ground-truth comparison |
| `reviewedAt` | number | Timestamp of the last review |
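The fields above can be modeled as a typed record for tooling that consumes exports. This Python sketch mirrors the table; it is illustrative, not the app's actual schema code:

```python
from typing import TypedDict

VALID_STATUSES = {"golden", "correct", "incorrect", "needs_review"}

class EvalMetadata(TypedDict, total=False):
    """Eval-specific session fields, mirroring the table above."""
    evalReady: bool
    evalStatus: str         # one of VALID_STATUSES
    evalNotes: str
    evalTags: list[str]     # e.g. ["refactoring", "debugging"]
    expectedOutput: str
    reviewedAt: int         # timestamp of the last review

def validate(meta: EvalMetadata) -> None:
    """Reject unknown status labels before export."""
    status = meta.get("evalStatus")
    if status is not None and status not in VALID_STATUSES:
        raise ValueError(f"unknown evalStatus: {status!r}")

meta: EvalMetadata = {"evalReady": True, "evalStatus": "golden", "evalTags": ["debugging"]}
validate(meta)  # passes; an unrecognized status would raise ValueError
print(meta["evalStatus"])  # golden
```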
## Export formats

### DeepEval JSON

For use with DeepEval, the open-source LLM evaluation framework. Each session exports as one JSON object per user-assistant turn pair. The `context` field includes all prior messages in the conversation up to that point.
If expectedOutput is set on the session, it is included in the expected_output field of the last turn.
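As a hypothetical illustration of the per-turn shape — the field names follow DeepEval's test-case conventions (`input`, `actual_output`, `expected_output`, `context`), but verify them against a real export:

```python
import json

# Hypothetical sketch of one exported user-assistant turn pair.
# The context list carries all prior messages up to this turn, and
# expected_output appears on the last turn when set on the session.
turn = {
    "input": "Why does this loop never terminate?",
    "actual_output": "The counter is never incremented; add i += 1.",
    "expected_output": "Point out that i is never incremented inside the loop.",
    "context": [
        "user: Here is my while loop...",
        "assistant: Can you share the full function?",
    ],
}
line = json.dumps(turn)
print(json.loads(line)["input"])
```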
### OpenAI Evals JSONL
For use with OpenAI Evals and similar chat-format evaluation tools. Each line is a JSON object representing the full conversation.

### Plain Text

A human-readable format that outputs the conversation as labeled text.

## Exporting from the UI
### Filter by status

Use the status filter to show only `golden`, `correct`, `incorrect`, or `needs_review` sessions.

### Select sessions

Use the checkboxes to select individual sessions, or **Select All** for the current filtered view.
## Exporting via API

For programmatic access, use the export API endpoint:

| Param | Values | Description |
|---|---|---|
| `format` | `deepeval`, `openai`, `text` | Export format |
| `status` | `golden`, `correct`, `incorrect`, `needs_review` | Filter by eval status |
| `tag` | Any string | Filter by eval tag |
| `limit` | Number | Maximum number of sessions to export |
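A request using these parameters might be built as follows. The host and endpoint path here are placeholders — the document does not specify them — but the query parameters match the table above:

```python
from urllib.parse import urlencode

# Hypothetical sketch of building an export request. Replace the host
# and path with your deployment's actual export endpoint.
params = {"format": "deepeval", "status": "golden", "limit": 50}
url = "http://localhost:3000/api/export?" + urlencode(params)
print(url)  # http://localhost:3000/api/export?format=deepeval&status=golden&limit=50
```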