Building and Breaking Models
If you’re a quantitative investor or trader, you build a model and then backtest it to see if it has worked in the past; if you’re like most people, you try to improve your model with repeated backtests. You’re operating under the assumption that there will be at least some modest resemblance between what has worked in the past and what will work in the future. (If you didn’t assume that, you wouldn’t backtest at all.)
But few backtesters, once their model is built, try to break it by subjecting it to stress tests. A truly robust model should withstand every moderate attempt to break it. Only then should it be put into practice.
This article will outline some techniques for stress testing quantitative models. I use Portfolio123 to build and backtest my models. If you model on a different platform, the specifics won't carry over directly, but I'll explain my techniques in language that can be adapted to other platforms.
General Guidelines for Stress Tests
Each model designed on Portfolio123 essentially consists of a ranking system and a universe to which it applies. You can add a lot of complexity to a model, but those are perhaps its two most important foundations. The universe starts from a root universe and then applies a variety of screening rules to eliminate stocks with low liquidity, high risk, low growth, high price, or whatever else you want to screen out. To run the tests below, you must fold as many of your screening and buy rules into your universe as you can. The model then buys the top-ranked stocks according to its ranking system, sells them when their rank falls below a certain point, and buys new stocks to replace them. The number of holdings and the sell rules are among the things we're going to vary to see if the models break.
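The rebalance logic just described can be sketched in a few lines. This is an illustrative simplification, not Portfolio123's actual implementation; the function name, the rank-percentile sell threshold, and the holding count are all assumptions chosen for the example.

```python
# Illustrative sketch of rank-based rebalancing: sell holdings whose
# rank percentile has fallen below a threshold, then refill from the
# top of the ranking. Names and thresholds are hypothetical.

def rebalance(holdings, universe, rank, n_stocks=20, sell_rank=80):
    """One rebalance step for a rank-driven model.

    holdings:  list of currently held stocks
    universe:  list of all stocks passing the screening rules
    rank:      function mapping a stock to its ranking-system score
    n_stocks:  target number of holdings
    sell_rank: sell when a holding's rank percentile drops below this
    """
    # Sort the universe from best-ranked to worst-ranked.
    ranked = sorted(universe, key=rank, reverse=True)
    # Convert positions to percentiles (top stock = 100).
    percentile = {s: 100 * (1 - i / len(ranked)) for i, s in enumerate(ranked)}
    # Keep holdings that still rank above the sell threshold.
    kept = [s for s in holdings if percentile.get(s, 0) >= sell_rank]
    # Refill the portfolio from the top of the ranking.
    for s in ranked:
        if len(kept) >= n_stocks:
            break
        if s not in kept:
            kept.append(s)
    return kept
```

Varying `n_stocks` and `sell_rank` in a sketch like this corresponds to the holdings and sell-rule variations discussed below.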
Portfolio123 offers four very different ways of testing a particular ranking system and universe. We can use:
- a screen backtest, which rebalances to equal weight every rebalance period;
- a rolling screen backtest, which holds stocks for a fixed period starting at staggered dates, letting you see overlapping returns;
- ranking system performance, which separates the universe into quantiles according to the ranking and shows how each would have performed; and
- an actual simulation of the model, with buy and sell rules.
To try to break a model, we should try all four of these tests, varying parameters. We should vary the number of stocks held at one time; we should vary the universe rules, testing on subsets of the universe or on an altogether different root universe; we should vary the factor weights in our ranking system; and we should vary the time period tested, including, if possible, testing it on a time period that hasn’t been tested before.
A test can be said to fail if the results show a negative alpha or a negative excess return when compared to a benchmark consisting of all the stocks in the root universe being tested, or if, in a rank performance test, the top bucket underperforms the middle buckets.
So let’s take a closer look at these stress tests.
Four Sets of Variations
- First, vary the number of stocks being held. Test it on the top ten, top twenty, top fifty, and top one hundred.
- Second, vary the root universe. Test it on the S&P 500, the Russell 1000, the S&P 1500, the Russell 3000, Canadian stocks, international stocks only, and universes consisting of only certain sectors. Be sure to adjust the slippage accordingly: testing on the Russell 3000 will require higher transaction costs than testing on the S&P 500. Change your screening rules by varying your hard limits by 10% or 15%.
- Third, vary the factor weights in your ranking system to a moderate degree. Add, say, 3% to five factors and subtract 3% from another five, or add 3% to five factors and normalize all the weights. Then restore your original weights and repeat with a different set of factors.
- Fourth, vary the time period. Test it over the last three years, ten years, and fifteen years; then go back and test it on other discrete time periods as well.
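Crossing the four variation axes with the four test types quickly produces an unmanageable grid, which is why a sampled subset makes sense. A sketch of that enumeration, with illustrative parameter values (not an exhaustive or official list):

```python
import itertools
import random

# Illustrative stress-test grid. Each axis lists example values only.
test_types = ["screen", "rolling_screen", "rank_performance", "simulation"]
holdings = [10, 20, 50, 100]
universes = ["SP500", "Russell1000", "SP1500", "Russell3000", "Canada", "Intl"]
weight_tweaks = ["base", "plus_minus_3pct", "plus_3pct_normalized"]
periods = ["3y", "10y", "15y"]

# The full cross-product: 4 * 4 * 6 * 3 * 3 = 864 combinations here,
# and easily thousands once you add more universes, tweaks, and periods.
grid = list(itertools.product(test_types, holdings, universes, weight_tweaks, periods))

# Running them all is impractical, so sample a manageable subset.
random.seed(0)  # fixed seed for reproducibility
subset = random.sample(grid, 12)
```

In practice you'd pick the dozen by judgment rather than at random, favoring the combinations most likely to expose a weakness, but the enumeration shows why some pruning is unavoidable.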
If you do all this using a screen backtest, a rolling screen backtest, a ranking system performance test, and a simulation, you’ll be doing thousands of stress tests. This is obviously impractical, so it’s best to design a subset of a dozen or so stress tests that will help you try to break your system.
Practical Stress Tests
For the purposes of this article, I designed five very different systems that backtested quite well, and then put them through a dozen stress tests. I won’t describe all these in detail, as it would be tedious in the extreme. Suffice it to say that three out of the five systems I tested failed a test. The two toughest tests were as follows:
I ran a ten-bucket rank performance test on one variation of my ranking systems over the last fifteen years, with a rebalance period that matched my actual average holding period and the S&P 1500 as my root universe. I then compared the top bucket to the middle buckets. In one case, the top bucket was lower than the average of the two middle buckets.
I ran a rolling screen backtest of the top 100 stocks, using the Russell 3000 as my root universe, over the last four years only, using another variation of my ranking systems, with a holding period that matched my actual average. In two cases, my backtest failed to beat the benchmark, in part because of realistic slippage costs and in part because the last four years have been pretty terrible for small-cap value stocks.
The natural impulse for a quantitative investor or trader is to try to create a system by tweaking various inputs so that when backtested it shows excellent returns. But unless the system is subjected to stress tests such as the ones I’ve discussed, it has a higher chance of breaking down when actually implemented with real money. Backtesting for failure may be just as important as backtesting for success.