
Most lending tests do not fail because the model is weak. They fail because the experiment cannot answer the question the business is asking.
Teams often ask, “How long should we run this test?” as if time is the unit of evidence. In lending, evidence is usually constrained by something else.
Events.
Defaults, early delinquency, hardship contacts, loss events, charge-offs, and even approval decisions all arrive at different speeds. A test can run for weeks and still be underpowered if the outcomes you care about are rare, or slow to mature.
A more useful framing is this: A test should run until it has enough information to make a governed decision, and that “enough” is primarily a function of event counts, not calendar days.
Power is the probability your test will detect a real change when the change exists.
In lending, that change might be a lower default rate at the same approval rate, a higher approval rate at the same expected loss, or fewer early delinquencies among approved accounts.
Power depends on four inputs: the baseline event rate, the minimum detectable change, the sample allocated to each test arm, and the evidence standard you have to meet.
You can run a test for a long time and still miss the answer if your minimum detectable change is too small, your event rate is too low, or your test allocation is too conservative.
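As a sketch of how those four inputs interact, here is a minimal two-proportion sample-size calculation in Python using the standard normal approximation. Every rate and threshold below is a hypothetical placeholder, not a recommendation.

```python
from statistics import NormalDist

def required_n_per_arm(p_base, p_alt, alpha=0.05, power=0.8):
    """Approximate accounts per arm to detect a shift in a binary
    outcome from p_base to p_alt (two-sided, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # evidence standard
    z_beta = NormalDist().inv_cdf(power)            # desired power
    p_bar = (p_base + p_alt) / 2
    num = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
           + z_beta * (p_base * (1 - p_base) + p_alt * (1 - p_alt)) ** 0.5) ** 2
    return num / (p_base - p_alt) ** 2

# Hypothetical: 4% baseline default rate, detect a drop to 3.4%
# (a 15% relative improvement) at 5% significance and 80% power.
n = required_n_per_arm(0.04, 0.034)
print(f"~{n:,.0f} accounts per arm, ~{0.04 * n:,.0f} baseline events per arm")
```

Note that the answer comes out in accounts and events, not days. Calendar time only enters once you divide by your volume.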
A model does not go live. A decision goes live.
Before you size a test, write down the production decision you might change.
Examples: moving an approval cutoff, changing a line assignment rule, or changing which accounts are prioritized for hardship outreach.
Then choose the outcomes that connect to that decision.
For origination tests, you usually need a mix of fast-arriving decision outcomes, such as approval rates, and slower credit outcomes, such as early delinquency and eventual charge-off.
For account management tests, you often need outcomes on existing accounts, such as delinquency transitions, hardship contacts, and realized loss events.
The outcomes you choose determine the speed of evidence.
Outcome windows are where lending experiments go wrong.
If your product has a short cycle, meaningful outcomes may show quickly. If your product has longer terms, the cleanest loss outcomes may not be observable for months.
You do not need to wait for perfection, but you do need to be explicit about what you are willing to act on.
A practical approach is to define two windows: an early window that gives a fast, directional read, and a mature window that you trust for the final decision.
Example pattern for installment credit: an early read on 30-plus day delinquency within the first few payment cycles, and a mature read on charge-off over a longer horizon, such as twelve months.
You should also plan for censoring. Many accounts will not have matured outcomes by the time you want to decide. Your experiment design should anticipate that and define how you will treat incomplete observations.
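One way to make that explicit is to classify accounts as matured or censored as of the decision date. Here is a minimal sketch, assuming a pandas DataFrame with hypothetical account-level fields:

```python
import pandas as pd

# Hypothetical account-level data: origination date plus a binary
# outcome that is only observable once the outcome window has elapsed.
accounts = pd.DataFrame({
    "account_id": [1, 2, 3, 4],
    "origination_date": pd.to_datetime(
        ["2024-01-15", "2024-03-01", "2024-06-10", "2024-08-20"]),
    "defaulted": [0, 1, 0, 0],
})

DECISION_DATE = pd.Timestamp("2024-09-30")
WINDOW_DAYS = 180  # e.g. default within 180 days of origination

# Accounts younger than the window are censored at decision time.
age_days = (DECISION_DATE - accounts["origination_date"]).dt.days
matured = accounts[age_days >= WINDOW_DAYS]
censored = accounts[age_days < WINDOW_DAYS]

print(f"matured: {len(matured)}, censored: {len(censored)}")
print(f"default rate among matured: {matured['defaulted'].mean():.1%}")
```

How you treat the censored group, whether you exclude it, read a shorter proxy window, or apply a survival adjustment, should be written down before the test starts.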
In lending experiments, the outcomes you care about are often rare.
That means power is driven by event counts.
A simple way to build intuition is to translate time into expected events.
For example, if you expect 20 defaults in a month, you should not expect a precise estimate of default rate differences between two strategies. You might see direction, but you should not treat it as decision-grade evidence.
A practical rule for planning is to ask: How many events do we need to detect the change we care about?
You can answer that without heavy math by working backward from governance expectations.
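To build that intuition numerically, here is a small back-of-the-envelope sketch, with all volumes and rates hypothetical, that translates a month of test traffic into expected events and shows how wide the uncertainty is at roughly 20 events per arm:

```python
from statistics import NormalDist

# Hypothetical monthly volumes and rates.
monthly_volume = 10_000     # applications entering the test each month
allocation = 0.10           # share routed to the challenger arm
event_rate = 0.02           # baseline default rate

events_per_month = monthly_volume * allocation * event_rate
print(f"expected challenger-arm events per month: {events_per_month:.0f}")

# With one month of challenger volume (~20 events), the 95% interval
# on the difference between two arms is wide relative to the baseline.
n = monthly_volume * allocation
se_diff = (2 * event_rate * (1 - event_rate) / n) ** 0.5
half_width = NormalDist().inv_cdf(0.975) * se_diff
print(f"95% half-width on the rate difference: ±{half_width:.2%} "
      f"vs a {event_rate:.0%} baseline")
```

In this hypothetical, the uncertainty on the difference is more than half the baseline rate itself, which is why 20 events gives direction, not decision-grade evidence.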
If your event rate is low, you have three levers: put more volume into the test through higher allocation or a wider eligible population, get to events sooner through a longer window or an earlier proxy outcome, or accept a larger minimum detectable change.
Every lender has a threshold for what matters.
If you are only willing to change policy for a tiny improvement, you are signing up for a long test, high volume, or both.
Be explicit about what you want to detect.
Examples: a relative reduction in early delinquency of at least 10 percent, or an approval rate lift of at least one percentage point at flat expected losses.
If you set the minimum detectable change too small, your test will either run for a very long time or end with an inconclusive result.
In practice, teams often start with a change threshold that would matter economically, then refine it after reviewing baseline variability.
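The reason the threshold matters so much is that required sample scales roughly with the inverse square of the detectable change: halving the threshold roughly quadruples the test. A quick illustration, using a simplified version of the earlier formula and hypothetical numbers:

```python
from statistics import NormalDist

def approx_n(p, rel_change, alpha=0.05, power=0.8):
    """Rough accounts per arm; uses a simplified pooled-variance term."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    delta = p * rel_change  # absolute change implied by the relative target
    return z ** 2 * 2 * p * (1 - p) / delta ** 2

# Hypothetical 4% baseline: each halving of the detectable change
# roughly quadruples the required sample per arm.
for rel in (0.20, 0.10, 0.05):
    print(f"detect {rel:.0%} relative change: ~{approx_n(0.04, rel):,.0f} per arm")
```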
Power is not only statistics. It is discipline.
Lending environments are full of moving parts that can erase signal.
Common sources of noise include concurrent policy, pricing, or marketing changes, shifts in the applicant mix, seasonality, macroeconomic movements, and changes to collections or servicing operations.
If you change two things at once, you often cannot attribute the outcome to the new risk signal. That is a governance problem, not a modeling problem.
A practical control is to freeze what you can, and document what you cannot freeze.
Teams love early reads. Governance teams hate false certainty.
If you look at results every day and stop the test when the chart looks good, you increase the probability of a false positive. The effect is real even when everyone is acting in good faith.
You can manage this without turning your organization into a statistics seminar by setting pre-defined milestones, such as an early operational health check, an interim read at a pre-agreed event count, and a final read once the target event count is reached, each with stop or extend rules agreed in advance.
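A small A/A simulation makes the mechanism concrete. With hypothetical rates and no true difference between arms, checking a z-test every day and stopping at the first significant read produces far more false positives than one pre-planned read:

```python
import numpy as np

rng = np.random.default_rng(7)
Z_CRIT = 1.96                                      # two-sided 5% threshold
DAYS, DAILY_N, RATE, SIMS = 60, 500, 0.02, 5000    # hypothetical A/A setup

# Cumulative daily event counts for two identical arms (no true effect).
a = rng.binomial(DAILY_N, RATE, size=(SIMS, DAYS)).cumsum(axis=1)
b = rng.binomial(DAILY_N, RATE, size=(SIMS, DAYS)).cumsum(axis=1)
n = DAILY_N * np.arange(1, DAYS + 1)               # cumulative sample per arm

# Pooled two-proportion z-statistic at every daily look.
p = (a + b) / (2 * n)
se = np.sqrt(p * (1 - p) * (2 / n))
z = np.where(se > 0, (a - b) / n / np.where(se > 0, se, 1), 0.0)

peeking_fpr = (np.abs(z) > Z_CRIT).any(axis=1).mean()   # stop at first "hit"
final_fpr = (np.abs(z[:, -1]) > Z_CRIT).mean()          # one pre-planned read

print(f"false positive rate, daily peeking: {peeking_fpr:.1%}")
print(f"false positive rate, single final read: {final_fpr:.1%}")
```

The single-read rate stays near the nominal 5 percent; the peeking rate is several times higher, even though nothing is wrong with the data or the test.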
If you need faster feedback, do not over-peek. Increase allocation or scope, or use a proxy outcome while the final outcome matures.
Use this framework to estimate test length without pretending there is a universal answer: work out the events required by your minimum detectable change, work out expected events per month from volume, allocation, and event rate, then divide and add the outcome window.
If expected events are too low, adjust one of the levers: raise allocation or widen the eligible population, extend the window or read an earlier proxy outcome, or accept a larger minimum detectable change, then re-estimate before launch.
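A minimal sketch of that arithmetic, with every input hypothetical: estimate the required sample from the minimum detectable change, convert it to enrollment months, and add the outcome window so the last cohort can mature.

```python
import math
from statistics import NormalDist

def months_to_decision(monthly_volume, allocation, p_base, rel_change,
                       window_months, alpha=0.05, power=0.8):
    """Indicative test length: months to accrue the required sample,
    plus the outcome window needed for the last cohort to mature."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    delta = p_base * rel_change
    n_per_arm = z ** 2 * 2 * p_base * (1 - p_base) / delta ** 2
    per_arm_monthly = monthly_volume * allocation / 2  # split across 2 arms
    return math.ceil(n_per_arm / per_arm_monthly) + window_months

# Hypothetical: 10k applications/month, 20% in the test (10% per arm),
# 4% baseline default, detect a 15% relative change, 6-month window.
print(months_to_decision(10_000, 0.20, 0.04, 0.15, window_months=6))
```

Running this with the hypothetical inputs gives a multi-year answer, which is exactly the kind of result that should send you back to the levers before launch rather than into an open-ended test.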
Carrington Labs is not a decision engine. Lenders retain policy and decisioning control.
Our role is to provide decision-ready risk analytics that can run alongside your existing stack, including in shadow mode. That makes it practical to generate clean evidence without taking uncontrolled production risk.
If your tests keep ending with inconclusive results, the issue is often not the model. It is the experiment design, the outcome window, or the operating discipline around change control.
Share this with your analytics and model governance partners, then align on event targets and stop-or-extend rules before you start the test.