How Many Trades Do You Need for a Valid Backtest? The Real Math, Not "At Least 100"

A 20-trade backtest with a 65% win rate is compatible with a coin flip. Here is the actual math for how many trades you need before your numbers mean anything.

You ran a backtest (a simulation that replays your trading rules against historical price data to see how they would have performed). It produced some number of trades, maybe 23, maybe 4,800. And now you are asking the only question that matters: can I trust this?

The folk answer on every trading forum is "at least 100 trades." It is not wrong, exactly. It is just not an answer. It does not tell you what 100 trades buys you, what 30 fails to buy you, or why 5,000 can still lie to your face.

I build a backtesting simulator, so this question is my day job. Here is the actual math, in plain English.

Why trade count matters at all

Every backtest result is a sample. A trade (one complete round trip: an entry into a position and an exit out of it) is one data point. Your win rate (the percentage of trades that closed at a profit) is a statistic computed from those data points.

Small samples are noisy. Flip a fair coin 20 times and you will see 13 or more heads about 13% of the time, roughly one run in eight. That is a 65% or better "win rate" from a process with zero edge (edge meaning a genuine, repeatable advantage that produces profit beyond random chance). Nobody would conclude the coin is special. But show a trader a 20-trade backtest with a 65% win rate and the strategy gets named, capitalized, and deployed.
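To put a number on the coin-flip claim, here is a minimal Python sketch that sums the binomial probabilities directly. The function name prob_at_least is mine, chosen only for this illustration.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Probability of k or more successes in n independent trials,
    each succeeding with probability p (binomial upper tail)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Chance that a zero-edge coin shows a 65% or better "win rate" over 20 flips.
print(f"{prob_at_least(13, 20):.3f}")  # prints 0.132
```

About one 20-flip run in eight looks like a 65% strategy, with no edge anywhere in the process.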

The question "how many trades do I need" is really the question "how noisy is my measured win rate?" And statistics has a clean answer for that.

The math: margin of error shrinks with the square root of sample size

The margin of error (the plus-or-minus band around a measured number, inside which the true value plausibly sits) for a win rate follows a standard formula. At a 95% confidence level (meaning: if you reran this experiment many times, about 95% of the bands you draw would contain the true value), the margin of error is approximately:

margin of error ≈ 1.96 × sqrt( p × (1 − p) / n )

where p is the win rate and n is the number of trades. The key feature is that n lives under a square root. To cut your uncertainty in half, you need four times as many trades. Not twice as many. Four times.

Here is what that formula produces at p = 0.5, which is the worst case (p × (1 − p) is largest there) and a sensible default when you do not yet know your true win rate:

Trades (n)	95% margin of error on win rate
20	roughly ±22 points
50	roughly ±14 points
100	roughly ±10 points
300	roughly ±5.7 points
1,000	roughly ±3.1 points
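The table can be reproduced in a few lines. This is a sketch of the simple formula above, not PerpForge's code; margin_of_error and Z_95 are names I made up for the example.

```python
from math import sqrt

Z_95 = 1.96  # z-score for a 95% confidence level


def margin_of_error(n: int, p: float = 0.5, z: float = Z_95) -> float:
    """Approximate margin of error on a win rate p measured over n trades.
    Returned as a fraction: 0.10 means plus or minus 10 points."""
    if n <= 0:
        raise ValueError("n must be a positive number of trades")
    return z * sqrt(p * (1 - p) / n)


for n in (20, 50, 100, 300, 1000):
    print(f"{n:>5} trades: ±{100 * margin_of_error(n):.1f} points")
```

Quadrupling n halves the result: margin_of_error(400) is half of margin_of_error(100).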

Read that first row again. At 20 trades, a measured 60% win rate is consistent with a true win rate anywhere from roughly 38% to 82%. That band contains "loses money," "coin flip," and "best strategy you have ever seen." The backtest cannot distinguish between them. The result is not bad. It is not yet information.

At 100 trades, the band is about ±10 points. A 60% measured win rate now suggests a true rate somewhere between roughly 50% and 70%. Better, but the coin flip is still inside the band. You need the measured rate to clear 50% by more than the margin of error before you can start taking it seriously, and we cover that full significance test in the luck post.

At 1,000 trades, the band tightens to about ±3 points. Now a 60% win rate actually means something close to 60%.
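You can also turn the formula around and ask how many trades a target margin of error costs: n ≈ (1.96 / margin)² × p × (1 − p). The sketch below assumes the same simple approximation and the worst case p = 0.5; trades_needed is a hypothetical helper name, not part of any library.

```python
from math import ceil


def trades_needed(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """Smallest trade count whose approximate 95% margin of error
    is at or below `margin` (a fraction, e.g. 0.05 for 5 points)."""
    if margin <= 0:
        raise ValueError("margin must be positive")
    return ceil((z / margin) ** 2 * p * (1 - p))


for target in (0.10, 0.05, 0.03):
    print(f"±{100 * target:.0f} points needs about {trades_needed(target)} trades")
```

Halving the target from 10 points to 5 takes the requirement from 97 trades to 385, which is the four-times rule in action.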

One honest caveat: this is the simple approximation, and it gets unreliable at very small samples and extreme win rates. PerpForge uses the Wilson confidence interval (a refinement of the same idea that behaves correctly at small n) for its significance verdicts. The square-root scaling, which is the lesson here, holds for both once n is more than a handful of trades.
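For reference, here is a sketch of the textbook Wilson score interval. It is the standard published formula, not a copy of PerpForge's implementation, and wilson_interval is a name chosen for this example.

```python
from math import sqrt


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a win rate of `wins` out of `n` trades.
    Returns (low, high) as fractions between 0 and 1."""
    if n <= 0 or not 0 <= wins <= n:
        raise ValueError("need n > 0 and 0 <= wins <= n")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(13, 20)  # 65% win rate over 20 trades
print(f"{100 * low:.0f}% to {100 * high:.0f}%")
```

For 13 wins in 20 trades the interval runs from roughly 43% to 82%, so the coin flip is still comfortably inside. Unlike the simple formula, the band never dips below 0% or above 100%.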

Honest rules of thumb

These are heuristics, not laws. The formula above is the law. But if you want practical bands:

Under ~30 trades: an anecdote, not evidence. The error bars are so wide that almost any result is compatible with luck. Useful for checking your code runs. Not useful for deciding anything.
30 to 100 trades: suggestive, with wide error bars. Worth noting. Not worth conviction. A great-looking result here earns the strategy more testing, nothing more.
100 to 300 trades: worth a serious look. The margin of error runs from roughly ±10 points at 100 trades down to about ±6 points at 300. A strong result here can clear the luck threshold, especially with a clearly positive average profit per trade.
300+ trades: error bars tight enough to mean something. Roughly ±5.7 points at 300 trades and shrinking from there. At this scale, the statistics stop being the weakest link in your analysis. Something else becomes the weakest link, which brings us to the uncomfortable part.

The flip side: 5,000 trades on the same month proves less than you think

Here is where the folk wisdom really breaks down. Trade count is necessary, not sufficient.

The formula assumes your trades are independent samples from the conditions your strategy will actually face. But a 1-minute strategy (one that makes decisions on one-minute price candles, the bars summarizing each minute of price action) can generate 5,000 trades inside a single month. Statistically, the win-rate error bars are gorgeous: ±1.4 points. Practically, all 5,000 trades sampled one market regime (a sustained mode of market behavior: trending up, trending down, or chopping sideways).

If that month was a clean uptrend, you have measured, with great precision, how your strategy performs in clean uptrends. You have measured nothing about the chop that will eventually arrive. This is the seed of overfitting (when a strategy's rules end up tuned to the accidents of one specific stretch of data rather than to anything that repeats).

So the honest requirement is two-dimensional:

Enough trades for the error bars to tighten. The table above.
Enough time diversity for those trades to span different regimes: uptrends, downtrends, and sideways chop, ideally across years rather than weeks.

A 300-trade backtest spread over two years of mixed conditions is stronger evidence than a 5,000-trade backtest squeezed into one trending month. The first samples the world. The second samples a moment.

A strategy's win rate is rarely stable across regimes. The same parameter set that looks strong inside a clean trend often gives much of the edge back in a choppy stretch of the same window. That is why time diversity matters as much as trade count: a backtest that only ever saw one regime has not been tested against the regime it will eventually meet.
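One way to check this on your own results is to split the trades by regime before computing anything. The sketch below assumes you have already tagged each trade with a regime label; how you decide those labels is up to you and is not covered here. The trade list is made-up illustrative data, not a real backtest, and win_rate_by_regime is a hypothetical helper.

```python
from collections import defaultdict
from math import sqrt


def win_rate_by_regime(trades):
    """trades: iterable of (regime_label, pnl) pairs.
    Returns {label: (trade_count, win_rate, worst_case_margin)}."""
    outcomes = defaultdict(list)
    for regime, pnl in trades:
        outcomes[regime].append(pnl > 0)  # True counts as a win
    report = {}
    for regime, wins in outcomes.items():
        n = len(wins)
        margin = 1.96 * sqrt(0.25 / n)  # simple formula at p = 0.5
        report[regime] = (n, sum(wins) / n, margin)
    return report


# Illustrative data only: 10 trades in an uptrend, 10 in chop.
sample_trades = (
    [("uptrend", 1.2)] * 8 + [("uptrend", -0.9)] * 2
    + [("chop", 0.5)] * 3 + [("chop", -0.7)] * 7
)

for regime, (n, rate, margin) in win_rate_by_regime(sample_trades).items():
    print(f"{regime}: {n} trades, win rate {rate:.0%}, ±{100 * margin:.0f} points")
```

In this toy data the pooled win rate is 55%, which hides an 80% uptrend and a 30% chop. Note that each bucket's error bars are wider than the pooled ones, because each bucket has fewer trades.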

The crypto perp wrinkle: more trades means more fees

There is a tempting shortcut hiding in the table: if I need more trades, I will just drop to a shorter timeframe. A strategy that trades 40 times per year on the 4-hour chart might trade 400 times per year on the 15-minute chart. Faster statistical significance, right?

Yes, and also a faster bleed. Every trade pays trading fees (the cost an exchange charges on each trade, taken on both the entry and the exit). On perpetual futures (perps: crypto contracts that let you go long or short with leverage and no expiry date), those fees are charged on the full position size, not just on the margin you posted, so leverage makes the fee larger relative to your capital. A strategy that earns 0.4% per trade before fees and pays 0.1% per round trip keeps most of its edge. A faster strategy earning 0.1% per trade with the same costs keeps nothing.

So high-frequency strategies do not just need more trades. They need a bigger edge per trade to survive the toll booth. Many strategies that look profitable in a fee-free backtest are net losers the moment realistic costs enter.
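The fee arithmetic from the example above, as a small sketch. The 0.4%, 0.1%, and 0.1% figures come from the text; the function names are hypothetical, and this is plain subtraction per trade, not a model of PerpForge's fee engine.

```python
def net_edge_pct(gross_edge_pct: float, round_trip_fee_pct: float) -> float:
    """Per-trade edge left after fees, in percent of position size."""
    return gross_edge_pct - round_trip_fee_pct


def edge_kept(gross_edge_pct: float, round_trip_fee_pct: float) -> float:
    """Fraction of the gross edge that survives the round-trip fee."""
    return net_edge_pct(gross_edge_pct, round_trip_fee_pct) / gross_edge_pct


ROUND_TRIP_FEE_PCT = 0.1  # same cost for both strategies

for label, gross in (("slower strategy", 0.4), ("faster strategy", 0.1)):
    net = net_edge_pct(gross, ROUND_TRIP_FEE_PCT)
    kept = edge_kept(gross, ROUND_TRIP_FEE_PCT)
    print(f"{label}: net {net:.2f}% per trade, keeps {kept:.0%} of its edge")
```

The slower strategy keeps three quarters of its edge. The faster one keeps none of it, no matter how many trades it piles up.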

This is exactly why PerpForge charges a flat-rate trading fee on every entry and every exit in every backtest, instead of letting fee-free results flatter the fast strategies. And in the spirit of saying out loud what we do not simulate: funding rates, maker/taker fee splits, and slippage are not modeled yet. Fees and liquidation (the forced close of a position when losses eat through its margin) are. The admission is part of the receipt.

Switching on a flat trading fee hits the fastest timeframes hardest, because that is where the per-trade edge is thinnest and the round-trip toll is paid most often. The shorter the timeframe, the more the fee decides the verdict.

That is the whole method: square-root error bars, regime diversity, and costs on every leg. If you know a better way to decide when a backtest is trustworthy, show us. We will test it, credit you, and publish the change.

What this means for you

Count your trades, then look up the row. Under 30, you have an anecdote. Around 100, you have a hypothesis with ±10-point error bars. Past 300, the statistics start cooperating, and your attention should shift to regime diversity and fees, because those are now the likelier liars.

If you want to see this applied to real strategies, the PerpForge leaderboard shows backtest results with trade counts, fee-adjusted performance, and a significance verdict on every entry. Viewing is free, no signup. Testing your own variant is the product.

PerpForge is an educational simulator. No real money is traded, and nothing here is financial advice. 18+.
