When to use trials
Use trials when:
- Measuring consistency: You want to know how reliably your capability produces correct outputs, not just whether it can.
- Reducing noise: A single run might fail due to random variation, masking an otherwise good capability.
- Testing robustness: You want to verify that your capability performs well across multiple attempts.
Configuring trials
Add the `trials` parameter to your evaluation to run each case multiple times. Each case is executed `trials` times, and the scores from each trial are then aggregated according to the scorer's aggregation strategy (`Mean` by default).
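Since the exact configuration call depends on the SDK's API, here is a framework-agnostic sketch of what running `trials` per case amounts to. The `Task` and `Scorer` shapes below are assumptions for illustration, not the library's actual types:

```typescript
// Illustrative types — not the SDK's real exports.
type Task = (input: string) => Promise<string>;
type Scorer = (output: string, expected: string) => number;

// Run one case `trials` times and collect a score per trial.
// The returned scores are what an aggregation strategy (Mean by default)
// later collapses into a single result.
async function runCase(
  task: Task,
  scorer: Scorer,
  input: string,
  expected: string,
  trials: number,
): Promise<number[]> {
  const scores: number[] = [];
  for (let i = 0; i < trials; i++) {
    const output = await task(input); // each trial re-executes the task
    scores.push(scorer(output, expected));
  }
  return scores;
}
```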
Aggregation strategies
Aggregations control how individual trial scores combine into a final score. Import them from `axiom/ai/evals/aggregations`:
Mean (default)
Computes the arithmetic mean of all trial scores. Use this when you care about average performance.

Median
Returns the middle value of trial scores. Use this when you want to reduce the impact of outliers.

PassAtK (pass@k)
Returns 1 if at least one trial meets the threshold, 0 otherwise. Use this when success on any attempt counts as a pass; this is common for generative tasks where multiple valid outputs exist.

PassHatK (pass^k)
Returns 1 if all trials meet the threshold, 0 otherwise. Use this when you need consistent, reliable behavior across every attempt.

User-friendly aliases
For readability, you can use these aliases:
- `AtLeastOneTrialPasses`: alias for `PassAtK`
- `AllTrialsPass`: alias for `PassHatK`
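The four strategies above can be expressed as plain functions over a list of trial scores. This is a standalone sketch of their semantics, not the library's implementation; the `threshold` parameter is an assumption:

```typescript
// Mean: arithmetic average of all trial scores.
const mean = (scores: number[]): number =>
  scores.reduce((sum, s) => sum + s, 0) / scores.length;

// Median: middle value (average of the two middle values for even counts).
const median = (scores: number[]): number => {
  const sorted = [...scores].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
};

// pass@k: 1 if at least one trial meets the threshold.
const passAtK = (scores: number[], threshold = 1): number =>
  scores.some((s) => s >= threshold) ? 1 : 0;

// pass^k: 1 only if every trial meets the threshold.
const passHatK = (scores: number[], threshold = 1): number =>
  scores.every((s) => s >= threshold) ? 1 : 0;
```

For trial scores `[1, 1, 0, 1, 1]`, these give `mean = 0.8`, `median = 1`, `passAtK = 1`, and `passHatK = 0`, which makes the difference between "ever succeeds" and "always succeeds" concrete.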
Custom aggregations
Create your own aggregation by returning an object with a `type` string and an `aggregate` function:
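As a hypothetical example of that shape, here is a trimmed-mean aggregation that drops the single lowest trial score before averaging. Only the `{ type, aggregate }` structure comes from the description above; everything else is an illustrative assumption:

```typescript
// Shape implied by the docs: a `type` label plus an `aggregate` function
// from trial scores to a single score.
interface Aggregation {
  type: string;
  aggregate: (scores: number[]) => number;
}

// Hypothetical custom aggregation: ignore the worst trial, then average.
const TrimmedMean: Aggregation = {
  type: "trimmed-mean",
  aggregate: (scores) => {
    if (scores.length <= 1) return scores[0] ?? 0;
    const kept = [...scores].sort((a, b) => a - b).slice(1); // drop lowest
    return kept.reduce((sum, s) => sum + s, 0) / kept.length;
  },
};
```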
Complete example
This example runs each case 5 times with different aggregation strategies for different scorers. Suppose the trials produce scores [1, 1, 0, 1, 1] for a single case:
| Scorer | Aggregation | Result |
|---|---|---|
| `category-match` | Mean | 0.8 |
| `can-classify` | PassAtK | 1 |
| `always-correct` | PassHatK | 0 |
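The table's results follow directly from the trial scores. This check re-derives them with plain arithmetic (standalone code, not the library's aggregation functions):

```typescript
const trialScores = [1, 1, 0, 1, 1];

// Mean: (1 + 1 + 0 + 1 + 1) / 5 = 0.8
const meanResult =
  trialScores.reduce((sum, s) => sum + s, 0) / trialScores.length;

// PassAtK: at least one trial passed, so the result is 1.
const passAtKResult = trialScores.some((s) => s >= 1) ? 1 : 0;

// PassHatK: one trial failed, so the result is 0.
const passHatKResult = trialScores.every((s) => s >= 1) ? 1 : 0;
```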
Builder API
You can also configure trials using the builder pattern.

Best practices
Choose trials based on variability
More variable capabilities need more trials. Start with 3-5 trials for typical LLM tasks. For highly variable outputs (creative generation, complex reasoning), consider 10 or more.

Match aggregation to intent
- Use Mean for general accuracy metrics
- Use PassAtK for “can it ever do this?” questions
- Use PassHatK for “is it reliable?” questions
Consider evaluation time
Trials multiply execution time. For a 10-case evaluation with 5 trials, you're running 50 task executions. Balance statistical confidence against practical constraints.

Combine with flags
Use trials alongside flags and experiments to compare configurations with statistical rigor.

What's next?
- To run evaluations and compare results, see Run evaluations.
- To analyze trial-level data in the Console, see Analyze results.