LLM outputs are inherently non-deterministic. The same input can produce different outputs across runs. This variability can make evaluation results noisy and hard to interpret. Trials and scorer aggregations help you get statistically meaningful results by running each test case multiple times and combining scores in ways that match your evaluation goals.

When to use trials

Use trials when:
  • Measuring consistency: You want to know how reliably your capability produces correct outputs, not just whether it can.
  • Reducing noise: A single run might fail due to random variation, masking an otherwise good capability.
  • Testing robustness: You want to verify that your capability performs well across multiple attempts.
For simple deterministic checks or when evaluation time is critical, a single run (the default) is sufficient.
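For comparison, the default single-run setup is just an Eval with no trials option. The sketch below assumes a classifyIntent task and an ExactMatch equality scorer like those used in the examples that follow:
import { Eval } from 'axiom/ai/evals';

// Without a trials option, each case executes once and the reported
// score is simply that single run's score.
Eval('classify-intent', {
  data: [{ input: 'How do I reset my password?', expected: 'account' }],
  task: async ({ input }) => classifyIntent(input),
  scorers: [ExactMatch],
});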

Configuring trials

Add the trials parameter to your evaluation to run each case multiple times:
import { Eval, Scorer } from 'axiom/ai/evals';

Eval('classify-intent', {
  trials: 3, // Run each case 3 times
  data: [
    { input: 'How do I reset my password?', expected: 'account' },
    { input: 'The app crashes on startup', expected: 'bug' },
  ],
  task: async ({ input }) => classifyIntent(input),
  scorers: [ExactMatch],
});
Each case runs independently for the configured number of trials. The resulting per-trial scores are then aggregated according to the scorer’s aggregation strategy (Mean by default).

Aggregation strategies

Aggregations control how individual trial scores combine into a final score. Import them from axiom/ai/evals/aggregations:
import { Mean, Median, PassAtK, PassHatK } from 'axiom/ai/evals/aggregations';

Mean (default)

Computes the arithmetic mean of all trial scores. Use this when you care about average performance.
const AccuracyScorer = Scorer(
  'accuracy',
  ({ output, expected }) => output === expected ? 1 : 0,
  { aggregation: Mean() }
);

// Trials: [1, 0, 1] → Score: 0.67

Median

Returns the middle value of trial scores. Use this when you want to reduce the impact of outliers.
const LatencyScorer = Scorer(
  'latency-acceptable',
  ({ output }) => output.latencyMs < 1000 ? 1 : 0,
  { aggregation: Median() }
);

// Trials: [0, 1, 1] → Score: 1

PassAtK (pass@k)

Returns 1 if at least one trial meets the threshold, 0 otherwise. Use this when success on any attempt counts as a pass—common for generative tasks where multiple valid outputs exist.
const ToolCalledScorer = Scorer(
  'tool-called',
  ({ output }) => output.toolCalls.length > 0 ? 1 : 0,
  { aggregation: PassAtK({ threshold: 1 }) }
);

// Trials: [0, 1, 0] → Score: 1 (at least one passed)

PassHatK (pass^k)

Returns 1 if all trials meet the threshold, 0 otherwise. Use this when you need consistent, reliable behavior across every attempt.
const ConsistencyScorer = Scorer(
  'consistent-output',
  ({ output, expected }) => output.category === expected.category ? 1 : 0,
  { aggregation: PassHatK({ threshold: 1 }) }
);

// Trials: [1, 1, 0] → Score: 0 (not all passed)

User-friendly aliases

For readability, you can use these aliases:
  • AtLeastOneTrialPasses — alias for PassAtK
  • AllTrialsPass — alias for PassHatK
import { AtLeastOneTrialPasses, AllTrialsPass } from 'axiom/ai/evals/aggregations';
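The aliases are drop-in replacements for the aggregations they name. For example, the tool-called scorer above could use AtLeastOneTrialPasses instead of PassAtK (this sketch assumes the alias accepts the same threshold option):
const ToolCalledScorer = Scorer(
  'tool-called',
  ({ output }) => output.toolCalls.length > 0 ? 1 : 0,
  { aggregation: AtLeastOneTrialPasses({ threshold: 1 }) }
);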

Custom aggregations

Create your own aggregation by returning an object with a type string and an aggregate function:
import type { Aggregation } from 'axiom/ai/evals/aggregations';

const Min = (): Aggregation<'min'> => ({
  type: 'min',
  aggregate: (scores: number[]) => 
    scores.length === 0 ? 0 : Math.min(...scores),
});

const WorstCaseScorer = Scorer(
  'worst-case',
  ({ output, expected }) => output === expected ? 1 : 0,
  { aggregation: Min() }
);
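
// Trials: [1, 0, 1] → Score: 0 (lowest trial score)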

Complete example

This example runs each case 5 times with different aggregation strategies for different scorers:
import { Eval, Scorer } from 'axiom/ai/evals';
import { Mean, PassAtK, PassHatK } from 'axiom/ai/evals/aggregations';
import { classifyTicket } from './classify-ticket';

// Average accuracy across trials
const CategoryMatch = Scorer(
  'category-match',
  ({ output, expected }) => output.category === expected.category ? 1 : 0,
  { aggregation: Mean() }
);

// Pass if the model ever gets it right
const CanClassify = Scorer(
  'can-classify',
  ({ output, expected }) => output.category === expected.category ? 1 : 0,
  { aggregation: PassAtK({ threshold: 1 }) }
);

// Pass only if the model is consistent
const AlwaysCorrect = Scorer(
  'always-correct',
  ({ output, expected }) => output.category === expected.category ? 1 : 0,
  { aggregation: PassHatK({ threshold: 1 }) }
);

Eval('ticket-classification-reliability', {
  trials: 5,
  data: [
    { input: { content: 'App crashes on startup' }, expected: { category: 'bug' } },
    { input: { content: 'How do I export data?' }, expected: { category: 'question' } },
  ],
  task: async ({ input }) => classifyTicket(input),
  scorers: [CategoryMatch, CanClassify, AlwaysCorrect],
});
With 5 trials producing scores [1, 1, 0, 1, 1] for a single case:
Scorer           Aggregation   Result
category-match   Mean          0.8
can-classify     PassAtK       1
always-correct   PassHatK      0

Builder API

You can also configure trials using the builder pattern:
import { createEvalBuilder } from 'axiom/ai/evals';

createEvalBuilder('classify-intent', {
  data: testCases,
  task: classifyIntent,
  scorers: [ExactMatch],
})
  .withTrials(3)
  .run();

Best practices

Choose trials based on variability

More variable capabilities need more trials. Start with 3-5 trials for typical LLM tasks. For highly variable outputs (creative generation, complex reasoning), consider 10+.

Match aggregation to intent

  • Use Mean for general accuracy metrics
  • Use PassAtK for “can it ever do this?” questions
  • Use PassHatK for “is it reliable?” questions

Consider evaluation time

Trials multiply execution time. For a 10-case evaluation with 5 trials, you’re running 50 task executions. Balance statistical confidence against practical constraints.

Combine with flags

Use trials alongside flags and experiments to compare configurations with statistical rigor:
axiom eval --flag.model=gpt-4o
axiom eval --flag.model=gpt-4o-mini
Comparing mean scores across trials gives more reliable signals than single-run comparisons.

What’s next?