LLM outputs are inherently non-deterministic. The same input can produce different outputs across runs. This variability can make evaluation results noisy and hard to interpret. Trials and scorer aggregations help you get statistically meaningful results by running each test case multiple times and combining scores in ways that match your evaluation goals.

When to use trials

Use trials when:
  • Measuring consistency: You want to know how reliably your capability produces correct outputs, not just whether it can.
  • Reducing noise: A single run might fail due to random variation, masking an otherwise good capability.
  • Testing robustness: You want to verify that your capability performs well across multiple attempts.
For simple deterministic checks or when evaluation time is critical, a single run (the default) is sufficient.
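For comparison, the default single-run setup is just an Eval with no trials option. The sketch below assumes a classifyIntent task and an ExactMatch equality scorer like those used in the examples that follow:
import { Eval } from 'axiom/ai/evals';

// Without a trials option, each case executes once and the reported
// score is simply that single run's score.
Eval('classify-intent', {
  data: [{ input: 'How do I reset my password?', expected: 'account' }],
  task: async ({ input }) => classifyIntent(input),
  scorers: [ExactMatch],
});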

Configuring trials

Add the trials parameter to your evaluation to run each case multiple times:
import { Eval, Scorer } from 'axiom/ai/evals';

Eval('classify-intent', {
  trials: 3, // Run each case 3 times
  data: [
    { input: 'How do I reset my password?', expected: 'account' },
    { input: 'The app crashes on startup', expected: 'bug' },
  ],
  task: async ({ input }) => classifyIntent(input),
  scorers: [ExactMatch],
});
Each case runs independently for the configured number of trials. The resulting per-trial scores are then aggregated according to the scorer’s aggregation strategy (Mean by default).

Aggregation strategies

Aggregations control how individual trial scores combine into a final score. Import them from axiom/ai/evals/aggregations:
import { Mean, Median, PassAtK, PassHatK } from 'axiom/ai/evals/aggregations';

Mean (default)

Computes the arithmetic mean of all trial scores. Use this when you care about average performance.
const AccuracyScorer = Scorer(
  'accuracy',
  ({ output, expected }) => output === expected ? 1 : 0,
  { aggregation: Mean() }
);

// Trials: [1, 0, 1] → Score: 0.67

Median

Returns the middle value of trial scores. Use this when you want to reduce the impact of outliers.
const LatencyScorer = Scorer(
  'latency-acceptable',
  ({ output }) => output.latencyMs < 1000 ? 1 : 0,
  { aggregation: Median() }
);

// Trials: [0, 1, 1] → Score: 1

PassAtK (pass@k)

Returns 1 if at least one trial meets the threshold, 0 otherwise. Use this when success on any attempt counts as a pass—common for generative tasks where multiple valid outputs exist.
const ToolCalledScorer = Scorer(
  'tool-called',
  ({ output }) => output.toolCalls.length > 0 ? 1 : 0,
  { aggregation: PassAtK({ threshold: 1 }) }
);

// Trials: [0, 1, 0] → Score: 1 (at least one passed)

PassHatK (pass^k)

Returns 1 if all trials meet the threshold, 0 otherwise. Use this when you need consistent, reliable behavior across every attempt.
const ConsistencyScorer = Scorer(
  'consistent-output',
  ({ output, expected }) => output.category === expected.category ? 1 : 0,
  { aggregation: PassHatK({ threshold: 1 }) }
);

// Trials: [1, 1, 0] → Score: 0 (not all passed)

User-friendly aliases

For readability, you can use these aliases:
  • AtLeastOneTrialPasses — alias for PassAtK
  • AllTrialsPass — alias for PassHatK
import { AtLeastOneTrialPasses, AllTrialsPass } from 'axiom/ai/evals/aggregations';
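The aliases are drop-in replacements for the aggregations they name. For example, the tool-called scorer above could use AtLeastOneTrialPasses instead of PassAtK (this sketch assumes the alias accepts the same threshold option):
const ToolCalledScorer = Scorer(
  'tool-called',
  ({ output }) => output.toolCalls.length > 0 ? 1 : 0,
  { aggregation: AtLeastOneTrialPasses({ threshold: 1 }) }
);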

Custom aggregations

Create your own aggregation by returning an object with a type string and an aggregate function:
import type { Aggregation } from 'axiom/ai/evals/aggregations';

const Min = (): Aggregation<'min'> => ({
  type: 'min',
  aggregate: (scores: number[]) => 
    scores.length === 0 ? 0 : Math.min(...scores),
});

const WorstCaseScorer = Scorer(
  'worst-case',
  ({ output, expected }) => output === expected ? 1 : 0,
  { aggregation: Min() }
);
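
// Trials: [1, 0, 1] → Score: 0 (lowest trial score)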

Complete example

This example runs each case 5 times with different aggregation strategies for different scorers:
import { Eval, Scorer } from 'axiom/ai/evals';
import { Mean, PassAtK, PassHatK } from 'axiom/ai/evals/aggregations';
import { classifyTicket } from './classify-ticket';

// Average accuracy across trials
const CategoryMatch = Scorer(
  'category-match',
  ({ output, expected }) => output.category === expected.category ? 1 : 0,
  { aggregation: Mean() }
);

// Pass if the model ever gets it right
const CanClassify = Scorer(
  'can-classify',
  ({ output, expected }) => output.category === expected.category ? 1 : 0,
  { aggregation: PassAtK({ threshold: 1 }) }
);

// Pass only if the model is consistent
const AlwaysCorrect = Scorer(
  'always-correct',
  ({ output, expected }) => output.category === expected.category ? 1 : 0,
  { aggregation: PassHatK({ threshold: 1 }) }
);

Eval('ticket-classification-reliability', {
  trials: 5,
  data: [
    { input: { content: 'App crashes on startup' }, expected: { category: 'bug' } },
    { input: { content: 'How do I export data?' }, expected: { category: 'question' } },
  ],
  task: async ({ input }) => classifyTicket(input),
  scorers: [CategoryMatch, CanClassify, AlwaysCorrect],
});
With 5 trials producing scores [1, 1, 0, 1, 1] for a single case:
Scorer           Aggregation   Result
category-match   Mean          0.8
can-classify     PassAtK       1
always-correct   PassHatK      0

Builder API

You can also configure trials using the builder pattern:
import { createEvalBuilder } from 'axiom/ai/evals';

createEvalBuilder('classify-intent', {
  data: testCases,
  task: classifyIntent,
  scorers: [ExactMatch],
})
  .withTrials(3)
  .run();

Best practices

Choose trials based on variability

More variable capabilities need more trials. Start with 3-5 trials for typical LLM tasks. For highly variable outputs (creative generation, complex reasoning), consider 10+.

Match aggregation to intent

  • Use Mean for general accuracy metrics
  • Use PassAtK for “can it ever do this?” questions
  • Use PassHatK for “is it reliable?” questions

Consider evaluation time

Trials multiply execution time. For a 10-case evaluation with 5 trials, you’re running 50 task executions. Balance statistical confidence against practical constraints.

Combine with flags

Use trials alongside flags and experiments to compare configurations with statistical rigor:
axiom eval --flag.model=gpt-4o
axiom eval --flag.model=gpt-4o-mini
Comparing mean scores across trials gives more reliable signals than single-run comparisons.

What’s next?