
Evals Export

The Evals view lets you label coding sessions with quality ratings and export them as structured datasets for AI evaluation frameworks. This is useful for building benchmarks, training data, and quality assurance workflows.

Eval workflow

1. Mark sessions as eval-ready

   From the Sessions view, select sessions and click Mark as Eval Ready. This sets evalReady: true on the session, making it appear in the Evals tab.

2. Review and label

   Open each eval-ready session and assign a quality status and optional tags.

3. Add expected output

   For sessions you want to use as ground truth, write the expected output that the model should have produced.

4. Export

   Select labeled sessions and export in your chosen format.

Eval labels

Each session can be tagged with one of these statuses:
| Status | Meaning | When to use |
| --- | --- | --- |
| golden | Perfect response, suitable as ground truth | The assistant’s answer is exactly right, well-structured, and complete |
| correct | Acceptable response | The answer works but could be better in style or completeness |
| incorrect | Wrong or harmful response | The assistant made factual errors, produced broken code, or missed the point |
| needs_review | Not yet evaluated | Default for sessions that need human review |
The selected status is stored in the sessions.evalStatus field in the database.

Eval metadata fields

Beyond the status label, each session supports these eval-specific fields:
| Field | Type | Description |
| --- | --- | --- |
| evalReady | boolean | Whether the session appears in the Evals view |
| evalStatus | string | One of: golden, correct, incorrect, needs_review |
| evalNotes | string | Free-text notes about the quality or issues |
| evalTags | string[] | Custom tags for categorization (e.g., “refactoring”, “debugging”, “architecture”) |
| expectedOutput | string | The ideal response for ground-truth comparison |
| reviewedAt | number | Timestamp of the last review |
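
If you handle these fields in your own scripts, it can help to mirror them in a small typed structure and check the status value before use. The following is a minimal sketch: the `EvalMetadata` class and its `problems` method are hypothetical helpers, not part of the product, and only the field names and the four status values come from the tables above.

```python
from dataclasses import dataclass, field
from typing import Optional

# The four statuses listed in the Eval labels table.
EVAL_STATUSES = {"golden", "correct", "incorrect", "needs_review"}


@dataclass
class EvalMetadata:
    """Hypothetical client-side mirror of the eval fields on a session."""

    evalReady: bool = False
    evalStatus: str = "needs_review"  # status used for sessions awaiting review
    evalNotes: str = ""
    evalTags: list[str] = field(default_factory=list)
    expectedOutput: Optional[str] = None
    reviewedAt: Optional[float] = None  # timestamp; the unit is not stated in the docs

    def problems(self) -> list[str]:
        """Return a list of problems; an empty list means nothing was flagged."""
        found = []
        if self.evalStatus not in EVAL_STATUSES:
            found.append(f"unknown evalStatus: {self.evalStatus!r}")
        if not all(isinstance(tag, str) for tag in self.evalTags):
            found.append("evalTags must contain only strings")
        return found


labeled = EvalMetadata(evalReady=True, evalStatus="golden", evalTags=["debugging"])
mislabeled = EvalMetadata(evalStatus="great")
```

Here `labeled.problems()` is empty, while `mislabeled.problems()` reports the unknown status.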

Export formats

DeepEval JSON

For use with DeepEval, the open-source LLM evaluation framework. Each session exports as one JSON object per user-assistant turn pair:
{
  "input": "The user's message text",
  "actual_output": "The assistant's response text",
  "expected_output": "The expected response (if set)",
  "context": ["Previous messages in the conversation..."],
  "retrieval_context": []
}
A session with 5 user-assistant exchanges produces 5 JSON objects. The context field includes all prior messages in the conversation up to that point. If expectedOutput is set on the session, it is included in the expected_output field of the last turn.
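
To make the per-turn splitting concrete, here is a rough sketch of how a conversation could be turned into objects of this shape. It is not the exporter's actual code. It assumes that a turn pair is a user message immediately followed by an assistant message, that context entries are plain message strings, and that turns without an expected output simply omit the key; `to_deepeval_cases` is a hypothetical name.

```python
def to_deepeval_cases(messages, expected_output=None):
    """Turn an ordered list of {"role", "content"} messages into per-turn dicts.

    Assumption: a turn pair is a user message immediately followed by an
    assistant message; other roles only contribute to the context.
    """
    cases = []
    for i in range(len(messages) - 1):
        user, assistant = messages[i], messages[i + 1]
        if user["role"] == "user" and assistant["role"] == "assistant":
            cases.append({
                "input": user["content"],
                "actual_output": assistant["content"],
                # Every message before this turn, oldest first.
                "context": [m["content"] for m in messages[:i]],
                "retrieval_context": [],
            })
    # The session-level expected output is attached to the last turn only.
    if cases and expected_output is not None:
        cases[-1]["expected_output"] = expected_output
    return cases


conversation = [
    {"role": "user", "content": "Why does login loop?"},
    {"role": "assistant", "content": "Check AuthProvider."},
    {"role": "user", "content": "How do I fix it?"},
    {"role": "assistant", "content": "Guard the redirect."},
]
cases = to_deepeval_cases(conversation, expected_output="Guard the redirect in AuthProvider.")
```

Two exchanges yield two objects; the second carries the first exchange in its context and is the only one with an expected_output.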

OpenAI Evals JSONL

For use with OpenAI Evals and similar chat-format evaluation tools. Each line is a JSON object representing the full conversation:
{"messages": [{"role": "user", "content": "User prompt"}, {"role": "assistant", "content": "Assistant response"}]}
The messages array preserves the complete conversation order including system, user, assistant, and tool messages.
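
A downloaded JSONL export can be read back one line at a time. The sketch below is illustrative: `read_conversations` is a hypothetical helper, and the role check assumes that only the four roles named above appear in the file.

```python
import json
from collections import Counter

# Roles named in the format description above.
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def read_conversations(jsonl_text):
    """Parse JSONL text into a list of message lists, one per session."""
    conversations = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        messages = json.loads(line)["messages"]
        for message in messages:
            if message["role"] not in ALLOWED_ROLES:
                raise ValueError(f"line {line_no}: unexpected role {message['role']!r}")
        conversations.append(messages)
    return conversations


sample = (
    '{"messages": [{"role": "user", "content": "User prompt"}, '
    '{"role": "assistant", "content": "Assistant response"}]}\n'
)
conversations = read_conversations(sample)
# Count how often each role appears across all sessions.
role_counts = Counter(m["role"] for conv in conversations for m in conv)
```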

Plain Text

A human-readable format that outputs the conversation as labeled text:
## Session: Fix the login redirect bug

**User:** I'm seeing a redirect loop when I try to log in...

**Assistant:** The issue is in your AuthProvider component...

---
Tokens: 4,521 | Cost: $0.08 | Model: claude-sonnet-4-20250514
Status: golden | Tags: debugging, auth

Exporting from the UI

1. Open the Evals tab

   Click Evals in the sidebar. This shows only sessions where evalReady is true.

2. Filter by status

   Use the status filter to show only golden, correct, incorrect, or needs_review sessions.

3. Select sessions

   Use the checkboxes to select individual sessions, or Select All for the current filtered view.

4. Click Export

   Choose your format from the Export dropdown. The file downloads immediately.

Exporting via API

For programmatic access, use the export API endpoint:
curl -H "Authorization: Bearer osk_your_api_key" \
  "https://your-convex.convex.site/api/export?format=deepeval&status=golden"
Query parameters:
| Param | Values | Description |
| --- | --- | --- |
| format | deepeval, openai, text | Export format |
| status | golden, correct, incorrect, needs_review | Filter by eval status |
| tag | Any string | Filter by eval tag |
| limit | Number | Max sessions to export |
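
The same request can be made from a script. This is a minimal sketch using only the Python standard library; `build_export_url` and `fetch_export` are hypothetical helper names, the base URL and API key are the placeholders from the curl example, and the response is assumed to be UTF-8 text.

```python
import urllib.request
from urllib.parse import urlencode


def build_export_url(base_url, fmt, status=None, tag=None, limit=None):
    """Build the export URL, including only the filters that were provided."""
    params = {"format": fmt}
    if status is not None:
        params["status"] = status
    if tag is not None:
        params["tag"] = tag
    if limit is not None:
        params["limit"] = limit
    return f"{base_url.rstrip('/')}/api/export?{urlencode(params)}"


def fetch_export(base_url, api_key, **filters):
    """Download an export as text. Needs a real deployment URL and API key."""
    request = urllib.request.Request(
        build_export_url(base_url, **filters),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Only the URL is built here; no network call is made.
url = build_export_url(
    "https://your-convex.convex.site", "deepeval", status="golden", limit=50
)
```

With a real deployment, a call such as `fetch_export(base_url, api_key, fmt="deepeval", status="golden")` would return the export body.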

Use cases

Model comparison

Export the same set of “golden” sessions and run them against different models. Compare actual outputs against expected outputs to measure which model performs better on your specific coding tasks.
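
As a rough illustration of such a comparison, the sketch below scores each model by how closely its outputs match the expected outputs. Character-level similarity from `difflib` is a crude stand-in chosen to keep the example self-contained; an evaluation framework's own metrics are likely a better fit for real use. The function names and sample data are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(expected, actual):
    """Character-level similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, expected, actual).ratio()


def score_model(cases, outputs):
    """Average similarity over the cases that have an expected output."""
    scores = [
        similarity(case["expected_output"], output)
        for case, output in zip(cases, outputs)
        if case.get("expected_output")
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical data: one exported case and the answers from two models.
cases = [{"input": "Add two numbers", "expected_output": "return a + b"}]
model_a_outputs = ["return a + b"]
model_b_outputs = ["return a - b"]

score_a = score_model(cases, model_a_outputs)
score_b = score_model(cases, model_b_outputs)
```

Model A reproduces the expected output exactly and scores 1.0, while model B scores lower.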

Fine-tuning datasets

Use “golden” sessions as training data for fine-tuning. The DeepEval format provides input/output pairs that can serve as supervised training examples, while the OpenAI Evals JSONL format keeps each conversation in chat-message form.
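
If your fine-tuning tool expects chat-format JSONL rather than bare pairs, DeepEval-style records can be reshaped with a few lines of code. This sketch is an assumption about what such a conversion might look like: `to_chat_examples` is a hypothetical helper, it prefers expected_output over actual_output when both exist, and it drops the context field.

```python
import json


def to_chat_examples(records):
    """Convert DeepEval-style records into one chat-format JSON line each."""
    lines = []
    for record in records:
        # Prefer the curated expected output; fall back to the actual response.
        target = record.get("expected_output") or record["actual_output"]
        lines.append(json.dumps({
            "messages": [
                {"role": "user", "content": record["input"]},
                {"role": "assistant", "content": target},
            ]
        }))
    return "\n".join(lines)


records = [
    {"input": "Fix the redirect loop", "actual_output": "Guard the redirect.",
     "context": [], "retrieval_context": []},
    {"input": "Explain the fix", "actual_output": "It stops the loop.",
     "expected_output": "The guard prevents re-entry.", "context": [],
     "retrieval_context": []},
]
jsonl = to_chat_examples(records)
```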

Regression testing

When you update your prompts or switch models, re-run your evaluation suite against the labeled sessions to check for regressions.
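
One simple way to flag regressions is to compare per-session scores from a baseline run and a new run. The sketch below assumes you already have a score between 0 and 1 for each session from whatever metric you use; the session IDs, scores, and the `find_regressions` helper are all hypothetical.

```python
def find_regressions(baseline, candidate, tolerance=0.05):
    """Return IDs of sessions whose score dropped by more than `tolerance`.

    Both arguments map a session ID to a score in [0, 1]. Sessions missing
    from the candidate run are ignored rather than counted as regressions.
    """
    return sorted(
        session_id
        for session_id, old_score in baseline.items()
        if session_id in candidate and old_score - candidate[session_id] > tolerance
    )


baseline_scores = {"s1": 0.90, "s2": 0.80, "s3": 0.70}
candidate_scores = {"s1": 0.92, "s2": 0.60, "s3": 0.68}
regressed = find_regressions(baseline_scores, candidate_scores)
```

Only `s2` is flagged here: `s1` improved, and the drop on `s3` stays within the tolerance.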

Team quality review

Tag sessions from different team members, review them together, and build shared benchmarks for what “good” looks like for your codebase.
