
Evaluations

Test your AI instructor with simulated viewers before publishing

Evaluations let you test how your AI instructor handles real conversations before your video goes live. The system simulates a viewer interacting with your instructor, judges each response as pass or fail, and helps you identify areas for improvement.
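
Conceptually, each run is a loop: the simulated viewer sends a message, your instructor replies, and an evaluator judges that reply against your criteria. The Python sketch below only illustrates that loop; it is not Nesoi's actual code, and simulate_viewer, ask_instructor, and judge_response are hypothetical stand-ins for the persona model, the AI instructor, and the evaluator.

  # Illustrative sketch only, not Nesoi's implementation.
  # simulate_viewer, ask_instructor and judge_response are hypothetical
  # callables standing in for the persona model, the AI instructor and
  # the evaluator.
  def run_evaluation(persona, criteria, num_interactions,
                     simulate_viewer, ask_instructor, judge_response):
      transcript = []   # alternating (role, message) pairs
      verdicts = []     # one True (pass) / False (fail) verdict per turn
      for _ in range(num_interactions):
          viewer_msg = simulate_viewer(persona, transcript)
          reply = ask_instructor(viewer_msg, transcript)
          transcript += [("viewer", viewer_msg), ("instructor", reply)]
          verdicts.append(judge_response(reply, criteria))
      return transcript, verdicts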

Opening the Evaluations Panel

  1. Open a video in the editor
  2. Click the Evaluations tab
  3. You'll see a list of any previous evaluations, or an empty state if this is your first

Creating an Evaluation

Click the New button in the panel header to configure a test run.

Evaluation Name

A label to identify this test. It auto-generates with a timestamp (e.g., "Evaluation Jan 24 3:45 PM"), but you can rename it to something descriptive like "Beginner persona test" or "Off-topic handling check."

Viewer Persona

Describe the type of viewer you want to simulate. This controls how the simulated viewer behaves during the conversation.

The default persona is a student who works through the full course — engaging with lectures, participating in roleplays and interactive exercises, answering quiz questions, and progressing step by step. You can customize this to test specific scenarios:

  • "A confused student who struggles with technical terms and needs simple explanations"
  • "An advanced learner who already knows the basics and asks challenging follow-up questions"
  • "A distracted viewer who frequently goes off-topic"
  • "A skeptical viewer who questions claims and asks for evidence"

If you've edited the persona and want to go back, click Reset to default next to the label.

Evaluation Criteria

Instructions that define how your instructor's responses should be judged. This is written as a single paragraph and sent as one instruction block to the evaluator.

The default criteria cover relevance to lesson content, handling of off-topic questions, and engagement quality. You can rewrite this to focus on what matters most for your video — for example, "Stay grounded in the lesson, answer clearly, and politely redirect if the viewer asks something unrelated."

If you've edited the criteria and want to go back, click Reset to default next to the label.
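
Because the criteria travel as one instruction block, the evaluator effectively receives your paragraph wrapped in a grading prompt. The snippet below is only a guess at that general shape, shown for illustration; it is not the prompt Nesoi actually sends.

  # Hypothetical shape of a judge instruction built from the criteria
  # paragraph; the real prompt Nesoi uses is not documented here.
  def build_judge_instruction(criteria, viewer_msg, reply):
      return (
          "Grade the instructor response against these criteria:\n"
          f"{criteria}\n\n"
          f"Viewer: {viewer_msg}\n"
          f"Instructor: {reply}\n"
          "Answer PASS or FAIL."
      )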

Number of Interactions

How many back-and-forth conversation turns to simulate, from 1 to 100. Each interaction is one viewer message and one instructor response.

Repeat Evaluation

Run the same configuration multiple times (1–10 runs). Since AI responses naturally vary, multiple runs help you gauge consistency. Each run creates a separate result you can compare.
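
One way to read repeated runs is to compare their per-run pass rates rather than trusting a single sample. The snippet below is a hypothetical illustration of that comparison, not part of the product.

  # Hypothetical example: comparing pass rates across three repeat runs.
  # Each run is a list of per-turn verdicts (True = pass, False = fail).
  runs = [
      [True, True, False, True],   # run 1: 3 of 4 turns passed
      [True, True, True, True],    # run 2: all turns passed
      [True, False, False, True],  # run 3: 2 of 4 turns passed
  ]
  pass_rates = [sum(run) / len(run) for run in runs]
  print(pass_rates)                          # [0.75, 1.0, 0.5]
  print(max(pass_rates) - min(pass_rates))   # 0.5 -> results vary a lot between runs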

Personalization Variables

If your video uses personalization, you can fill in test values:

  • Viewer Name — the name the instructor will use when addressing the viewer
  • Personalization Questions — if your video's AI Instructions include personalization questions, they appear here with their original labels so you can fill in test answers

These fields appear automatically based on your video's configuration.
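
Conceptually, the test values you enter are substituted into the instructor's instructions the same way a real viewer's answers would be. The sketch below shows that idea with made-up variable names and template syntax; it does not reflect Nesoi's actual format.

  # Made-up illustration of filling personalization variables into an
  # instruction template; the variable names and syntax are hypothetical.
  template = "Address the viewer as {viewer_name}. Their stated goal: {goal}."
  test_values = {"viewer_name": "Jordan", "goal": "learn the basics of budgeting"}
  print(template.format(**test_values))
  # -> Address the viewer as Jordan. Their stated goal: learn the basics of budgeting.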

Watching an Evaluation Run

After clicking Start Evaluation, the conversation appears in real time:

  • Viewer messages (right side) show what the simulated viewer said
  • Instructor messages (left side) show your AI's responses
  • A progress bar tracks how many turns have completed

You can continue working in other tabs while the evaluation runs. Come back anytime to check progress.

To stop an evaluation early, click the Cancel button. Partial results are kept.

Understanding Results

Config Banner

At the top of every evaluation conversation, a config banner shows the settings used for that run:

  • Persona — the viewer persona description
  • Criteria — the evaluation criteria
  • Variables — any personalization variables that were set
  • Turns — the number of interactions configured
  • Duration — how long the evaluation took to complete

This makes it easy to remember what each evaluation was testing, even weeks later.

Pass / Fail

Each conversation turn is judged against your evaluation criteria and marked as pass or fail. The results appear as badges:

  • All passed — every scored turn met the criteria
  • N failed — the number of turns that didn't meet the criteria

The overall result is shown in the sidebar next to each evaluation and in the chat header when viewing results.
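
The badge is just a roll-up of the per-turn verdicts. As a rough illustration (not the product's code):

  # Rough illustration of how per-turn verdicts roll up into the badge text.
  def badge_text(verdicts):
      failed = sum(1 for passed in verdicts if not passed)
      return "All passed" if failed == 0 else f"{failed} failed"

  print(badge_text([True, True, True]))    # All passed
  print(badge_text([True, False, False]))  # 2 failed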

Status

  Status       Meaning
  Pending      Queued and waiting to start
  Running      In progress — conversation updating live
  Completed    Finished — pass/fail results and summary available
  Failed       Something went wrong (check the error message)
  Cancelled    Stopped before finishing — partial results available

Sidebar

The evaluations sidebar shows key information at a glance:

  • Status badge — current state of the evaluation
  • Pass/fail badge — overall result for completed evaluations
  • Turn count — how many turns were completed
  • Timestamp — when the evaluation was created
  • Personalization variables — the first two variable values, with a "+N more" indicator if there are additional ones

Recommended Workflow

  1. Run an initial evaluation with default settings to establish a baseline
  2. Review the results — look at which turns failed and read the summary
  3. Update your AI Instructions or video content based on findings
  4. Run another evaluation with the same configuration to measure improvement
  5. Repeat until all turns pass consistently

Testing Different Viewer Types

Create separate evaluations with different personas to get a well-rounded picture:

  • A "confused beginner" persona to test clarity
  • An "advanced expert" persona to test depth
  • A "distracted, off-topic" persona to test redirection

Tips

  • Start with defaults — the default persona and criteria work well for a general quality check
  • Use descriptive names — name evaluations after what you're testing so you can find them later
  • Test after major changes — run a fresh evaluation whenever you update AI Instructions or video content
  • Compare multiple runs — use 2–3 repeat runs to separate consistent issues from one-off variations
  • Keep criteria focused — a clear, concise paragraph gives better results than a long list of vague rules