5 minute read · Mar 3, 2026

How Long Should a Test Run? A Practical Guide to Statistical Power in Lending Experiments

Test length should be driven by event rates, decision volume, and your outcome window. This guide shows how to size lending experiments so results are decision-grade and governable.

The uncomfortable truth about test length

Most lending tests do not fail because the model is weak. They fail because the experiment cannot answer the question the business is asking.

Teams often ask, “How long should we run this test?” as if time is the unit of evidence. In lending, evidence is usually constrained by something else.

Events.

Defaults, early delinquency, hardship contacts, loss events, charge-offs, and even approval decisions all arrive at different speeds. A test can run for weeks and still be underpowered if the outcomes you care about are rare or slow to mature.

A more useful framing is this: A test should run until it has enough information to make a governed decision, and that “enough” is primarily a function of event counts, not calendar days.

What statistical power really means in lending

Power is the probability your test will detect a real change when the change exists.

In lending, that change might be:

  • A lower bad rate at the same approval rate
  • A higher approval rate at the same bad rate
  • Better margin through different exposure, pricing, or mix
  • Earlier detection of stress in existing accounts

Power depends on four inputs.

  • Baseline rate for the outcome you care about
  • Minimum change you want to detect
  • How much traffic and volume you can assign to the test
  • How much noise exists in the population and decisioning process.

You can run a test for a long time and still miss the answer if your minimum detectable change is too small, your event rate is too low, or your test allocation is too conservative.
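To make that concrete, here is a minimal sketch of the standard two-proportion sample size calculation. The 3 percent baseline bad rate, 10 percent relative reduction, and 80 percent power are illustrative assumptions, not recommendations.

    from statistics import NormalDist

    def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
        """Approximate accounts per arm to detect a shift from p1 to p2
        with a two-sided two-proportion z-test."""
        z = NormalDist()
        z_alpha = z.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
        z_beta = z.inv_cdf(power)           # 0.84 for 80 percent power
        variance = p1 * (1 - p1) + p2 * (1 - p2)
        return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

    baseline = 0.03           # illustrative baseline bad rate
    target = baseline * 0.90  # a 10 percent relative reduction
    n = sample_size_two_proportions(baseline, target)
    print(f"~{n:,.0f} booked accounts per arm")  # roughly 48,000 per arm

At a 3 percent bad rate, roughly 48,000 accounts per arm corresponds to about 1,400 events per arm, which is why event counts, not weeks, are the binding constraint.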

Step 1. Pick the decision, not the metric

A model does not go live. A decision goes live.

Before you size a test, write down the production decision you might change.

Examples:

  • Introduce a new risk signal as a second input into approval
  • Adjust limits in a specific range
  • Add a new routing policy for borderline applicants
  • Trigger earlier interventions in servicing

Then choose the outcomes that connect to that decision.

For origination tests, you usually need a mix of:

  • A fast signal, such as approval rate, take-up, or early delinquency entry
  • A slower truth metric, such as 60- or 90-day delinquency, loss, or charge-off.

For account management tests, you often need:

  • Leading indicators, such as utilization spikes, cash buffer compression, or hardship contacts
  • Longer outcomes, such as delinquency transitions and loss.

The outcomes you choose determine the speed of evidence.

Step 2. Set an outcome window that matches the product

Outcome windows are where lending experiments go wrong.

If your product has a short cycle, meaningful outcomes may show quickly. If your product has longer terms, the cleanest loss outcomes may not be observable for months.

You do not need to wait for perfection, but you do need to be explicit about what you are willing to act on.

A practical approach is to define two windows.

  • A leading window for early directional evidence
  • A decision window for a governed go-forward decision.

Example pattern for installment credit:

  • Leading window might include first payment default and early delinquency entry
  • Decision window might include 60- or 90-day delinquency and loss emergence.

You should also plan for censoring. Many accounts will not have matured outcomes by the time you want to decide. Your experiment design should anticipate that and define how you will treat incomplete observations.
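As a sketch of that discipline, the snippet below computes an event rate only over accounts old enough to have completed their outcome window. The field names and the 90-day window are hypothetical; the point is that immature accounts are excluded rather than silently counted as good.

    import pandas as pd

    def matured_event_rate(accounts, as_of, window_days=90):
        """Event rate among accounts whose outcome window has fully elapsed.
        Censored (too-young) accounts are excluded, not treated as good."""
        age_days = (pd.Timestamp(as_of) - accounts["booked_date"]).dt.days
        matured = accounts[age_days >= window_days]
        return matured["event"].mean(), len(matured)

    # Hypothetical data: a booked_date and a 0/1 event flag per account.
    df = pd.DataFrame({
        "booked_date": pd.to_datetime(
            ["2025-09-01", "2025-10-15", "2025-12-20", "2026-01-15"]),
        "event": [1, 0, 0, 0],
    })
    rate, n = matured_event_rate(df, as_of="2026-03-03")
    print(f"matured accounts: {n}, event rate: {rate:.1%}")  # 2 matured, 50.0%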

Step 3. Power comes from events, not applicants

In lending experiments, the outcomes you care about are often rare.

That means power is driven by event counts.

A simple way to build intuition is to translate time into expected events.

  • Applications per day times test allocation times days running equals applications in test
  • Applications in test times baseline event rate equals expected events

For example, if you expect 20 defaults in a month, you should not expect a precise estimate of default rate differences between two strategies. You might see direction, but you should not treat it as decision-grade evidence.
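A back-of-the-envelope version of that translation, with purely illustrative volumes:

    # Translate calendar time into expected events (all numbers illustrative).
    apps_per_day = 400   # total application volume
    allocation = 0.20    # share routed to the test arm
    event_rate = 0.03    # baseline event rate for the outcome
    days = 30

    apps_in_test = apps_per_day * allocation * days
    expected_events = apps_in_test * event_rate
    print(f"{apps_in_test:,.0f} applications -> {expected_events:,.0f} expected events")
    # 2,400 applications -> 72 expected events in a month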

A practical rule for planning is to ask: How many events do we need to detect the change we care about?

You can answer that without heavy math by working backward from governance expectations.

  • Directional learning usually needs enough events to see a stable pattern by segment.
  • Production changes usually require enough events to avoid making a decision off noise.

If your event rate is low, you have three levers:

  • Increase volume by widening scope
  • Increase allocation percentage
  • Use a faster proxy outcome while you wait for the truth metric.

Step 4. Define the minimum change worth detecting

Every lender has a threshold for what matters.

If you are only willing to change policy for a tiny improvement, you are signing up for a long test, high volume, or both.

Be explicit about what you want to detect.

Examples:

  • A 10 percent reduction in early delinquency at the same approval rate
  • A 2-point increase in approval rate with no increase in 60-day delinquency
  • A meaningful improvement in contribution margin per booked account.

If you set the minimum detectable change too small, your test will either run for a very long time or end with an inconclusive result.

In practice, teams often start with a change threshold that would matter economically, then refine it after reviewing baseline variability.
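To see how sharply that threshold drives test size, sweep the minimum detectable change through the same two-proportion formula used earlier. The 3 percent baseline is again an illustrative assumption:

    from statistics import NormalDist

    def n_per_arm(p1, p2, alpha=0.05, power=0.80):
        """Approximate accounts per arm for a two-sided two-proportion z-test."""
        z = NormalDist()
        za, zb = z.inv_cdf(1 - alpha / 2), z.inv_cdf(power)
        return (za + zb) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2

    baseline = 0.03  # illustrative baseline bad rate
    for reduction in (0.05, 0.10, 0.20):
        n = n_per_arm(baseline, baseline * (1 - reduction))
        print(f"{reduction:.0%} relative reduction -> ~{n:,.0f} accounts per arm")
    # 5%  -> ~198,000 per arm
    # 10% -> ~48,000 per arm
    # 20% -> ~11,500 per arm

Halving the minimum detectable change roughly quadruples the required volume, because sample size scales with the inverse square of the difference.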

Step 5. Protect the test from avoidable noise

Power is not only statistics. It is discipline.

Lending environments are full of moving parts that can erase signal.

Common sources of noise include:

  • Policy changes during the test
  • Channel mix shifts and marketing changes
  • Seasonality and pay-cycle effects
  • Operational changes in verification, funding, servicing
  • Data coverage changes that alter who gets scored.

If you change two things at once, you often cannot attribute the outcome to the new risk signal. That is a governance problem, not a modeling problem.

A practical control is to freeze what you can, and document what you cannot freeze.

Step 6. Avoid peeking traps and false certainty

Teams love early reads. Governance teams hate false certainty.

If you look at results every day and stop the test when the chart looks good, you increase the probability of a false positive. The effect is real even when everyone is acting in good faith.
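You can demonstrate the effect with a short simulation: two identical arms, a significance check every day, and a rule that stops at the first p-value below 0.05. Every "win" it finds is a false positive by construction. Volumes and rates are illustrative:

    import numpy as np
    from statistics import NormalDist

    norm = NormalDist()

    def p_value(x1, n1, x2, n2):
        """Two-sided two-proportion z-test (pooled)."""
        p = (x1 + x2) / (n1 + n2)
        se = (p * (1 - p) * (1 / n1 + 1 / n2)) ** 0.5
        if se == 0:
            return 1.0
        return 2 * (1 - norm.cdf(abs(x1 / n1 - x2 / n2) / se))

    rng = np.random.default_rng(7)
    days, daily_n, rate, sims = 60, 200, 0.03, 2000
    peek_hits = final_hits = 0
    for _ in range(sims):
        a = rng.binomial(daily_n, rate, days).cumsum()  # arm A events by day
        b = rng.binomial(daily_n, rate, days).cumsum()  # arm B, same true rate
        n = daily_n * np.arange(1, days + 1)
        peek_hits += any(p_value(a[d], n[d], b[d], n[d]) < 0.05 for d in range(days))
        final_hits += p_value(a[-1], n[-1], b[-1], n[-1]) < 0.05
    print(f"daily peeking:     {peek_hits / sims:.0%} false positives")
    print(f"single final read: {final_hits / sims:.0%} false positives")
    # Daily peeks typically inflate the nominal 5 percent rate several-fold.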

You can manage this without turning your organization into a statistics seminar. Set the guardrails up front:

  • Pre-commit review points
  • Pre-commit stop-and-extend rules
  • Treat early reads as directional
  • Base the final decision on the decision window you defined.

If you need faster feedback, do not over-peek. Increase allocation or scope, or use a proxy outcome while the final outcome matures.

A practical planning framework you can reuse

Use this framework to estimate test length without pretending there is a universal answer.

  1. Write the decision that might change
  2. Pick one primary outcome and one secondary outcome
  3. Define a leading window and a decision window
  4. Estimate baseline event rates for each outcome
  5. Estimate daily volume and your test allocation
  6. Compute expected events per week
  7. Decide how many events you need for a governed decision
  8. Set review points and stop-and-extend rules up front

If expected events are too low, adjust one of the levers:

  • Expand scope
  • Increase allocation
  • Choose a faster proxy outcome
  • Accept that the decision window must be longer
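A minimal sketch of steps 5 through 8 as arithmetic, with illustrative inputs you would replace with your own volumes, rates, and governance thresholds:

    def weeks_to_target_events(apps_per_day, allocation, event_rate, target_events):
        """Calendar weeks until the test arm accumulates the target event count."""
        events_per_week = apps_per_day * allocation * 7 * event_rate
        return target_events / events_per_week

    # Illustrative plan: 400 apps/day, 20% allocation, 3% baseline rate,
    # and a governed decision that needs ~300 events in the test arm.
    print(f"{weeks_to_target_events(400, 0.20, 0.03, 300):.0f} weeks")  # baseline: ~18

    # Pulling the levers above shortens the calendar:
    print(f"{weeks_to_target_events(400, 0.40, 0.03, 300):.0f} weeks")  # double allocation: ~9
    print(f"{weeks_to_target_events(400, 0.20, 0.06, 300):.0f} weeks")  # faster proxy outcome: ~9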

Where Carrington Labs fits

Carrington Labs is not a decision engine. Lenders retain policy and decisioning control.

Our role is to provide decision-ready risk analytics that can run alongside your existing stack, including in shadow mode. That makes it practical to generate clean evidence without taking uncontrolled production risk.

If your tests keep ending with inconclusive results, the issue is often not the model. It is the experiment design, the outcome window, or the operating discipline around change control.

Share this with your analytics and model governance partners, then align on event targets and stop-and-extend rules before you start the test.