Concept · Is it a real edge, or luck?
A measurement is statistically significant when the observed pattern is unlikely to have occurred by random chance. In strategy evaluation, it means we have enough trades to trust that the observed metrics reflect a true underlying edge.
If you flip a coin 5 times and get 4 heads, you cannot conclude the coin is biased: a fair coin produces 4 or more heads in 5 flips about 19% of the time. If you flip it 500 times and get 400 heads, you can conclude the coin is biased, because that outcome is essentially impossible by chance.
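The two coin-flip cases can be checked with an exact binomial tail sum. This is a minimal sketch using only the Python standard library; the helper name `binomial_tail` is illustrative and not part of any existing codebase.

```python
from math import comb

def binomial_tail(n: int, k: int, p: float = 0.5) -> float:
    """Probability of at least k successes in n independent trials,
    each succeeding with probability p (exact binomial sum)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# 4 or more heads in 5 flips of a fair coin
small = binomial_tail(5, 4)
# 400 or more heads in 500 flips of a fair coin
large = binomial_tail(500, 400)

print(f"5 flips, >=4 heads:     {small:.4f}")  # 0.1875, i.e. about 19%
print(f"500 flips, >=400 heads: {large:.3e}")  # vanishingly small
```

The same 80% heads rate is unremarkable at 5 flips and effectively impossible at 500; only the sample size changed.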
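A trading strategy is the same. A strategy with 35 trades and a 60% win rate could plausibly be a strategy with a true win rate below 50% that got lucky: the 95% confidence interval around that measurement runs from roughly 44% to 76%. A strategy with 500 trades and a 60% win rate cannot be explained that way; its interval is roughly 56% to 64%, so luck alone is no longer a plausible explanation.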
Significance is about the measurement, not the strategy. The strategy might have real edge or not — significance just tells you whether the observed performance is good evidence either way.
The 210-strategy fleet ranges from N=3 trades (a 50/200 daily variant) to N=10,574 (a 9/21 scalp on 1-minute candles). The metrics for the low-N strategies cannot be trusted as much as the metrics for high-N strategies, even when the headline numbers look better.
This is the single most important corrective to "PnL ranking" — high-PnL low-trade strategies are often statistical mirages.
For a measured win rate p̂ over N trades, the standard error is:
SE = sqrt( p̂ × (1 − p̂) / N )
The 95% confidence interval is roughly p̂ ± 1.96 × SE.
At N=30, the half-width of that CI is around ±17.5% at p̂ ≈ 40%, and around ±16.4% at p̂ = 30%. That means a measured 30% win rate is consistent with a true WR anywhere from ~14% to ~46%.
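The standard-error formula above translates directly into code. The sketch below is one possible implementation, assuming the normal (Wald) approximation that the formula implies; the function name `win_rate_ci` is hypothetical, and the approximation gets rough at small N or at win rates near 0% or 100%.

```python
from math import sqrt

def win_rate_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float, float]:
    """Return (standard error, lower bound, upper bound) for a measured
    win rate p_hat over n trades, using p_hat +/- z * SE.
    Bounds are clamped to the valid range [0, 1]."""
    se = sqrt(p_hat * (1 - p_hat) / n)
    half_width = z * se
    return se, max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

# How the interval narrows as the trade count grows, at a measured 40% win rate
for n in (30, 100, 300, 1000):
    se, lo, hi = win_rate_ci(0.40, n)
    print(f"N={n:>5}  SE={se:.3f}  95% CI = {lo:.1%} to {hi:.1%}")
```

Because the half-width shrinks with the square root of N, quadrupling the number of trades only halves the uncertainty.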
Same dataset, two completely different epistemic statuses. Note: a tightly pinned win rate (id511 passes that test) is not the same as a proven edge; id511's edge is still NOT-significant. Pinning the measurement and proving the advantage are two separate questions.
| Trade count | What you can do with the metrics |
|---|---|
| N < 30 | Discard. Numbers are noise. |
| 30 ≤ N < 100 | Hypothesize edge; validate elsewhere before trusting. |
| 100 ≤ N < 300 | Meaningful, but wide intervals — watch for regime sensitivity. |
| N ≥ 300 | Reliable estimates; PF and Sharpe close to true values. |
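The thresholds in the table can be applied mechanically before any ranking by PnL. The following is a small sketch of that lookup; the function name `confidence_tier` and the tier labels are hypothetical shorthand for the table rows, not identifiers from the engine or the API.

```python
def confidence_tier(n_trades: int) -> str:
    """Map a strategy's trade count to a tier from the table above.
    Labels are illustrative shorthand for each row."""
    if n_trades < 30:
        return "discard"      # numbers are noise
    if n_trades < 100:
        return "hypothesis"   # validate elsewhere before trusting
    if n_trades < 300:
        return "meaningful"   # wide intervals, watch regime sensitivity
    return "reliable"         # estimates close to true values

# The two extremes of the fleet mentioned earlier: N=3 and N=10,574
for n in (3, 10_574):
    print(n, confidence_tier(n))
```

Filtering or annotating strategies by tier first keeps low-N results from topping a leaderboard on headline numbers alone.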
wiki/qa-sessions/2026-05-17-session.md#q2 (first asked here)
/api/analytics perSymbol and tradeCount fields
Put it to the test
Spawn your variant, run it on the same engine, and read the edge-significance verdict — before you risk real money.