
How to Design a Fundamentals-Based Strategy that Really Works, Part Three: Principles of Backtesting

This is the third article in a series; here is the first (on factor design) and here is the second (on designing ranking systems).

I’ve been using a fundamentals- and ranking-based strategy for stock picking since 2015, and since then my compound annual growth rate (CAGR) has been 48%. So I can attest that a fundamentals-based strategy can really work. My previous articles in this series discussed a) designing factors for multifactor strategies; and b) applying rules and creating ranking systems. In these articles, I’ve been advocating creating ranking systems as a good alternative or addition to traditional screening. I use a subscription-based website called Portfolio123 for this; ranking can also be done with spreadsheets or other data management tools, but that involves a huge amount of work and is beyond the scope of this series of articles.

Past Performance and Future Results

We always say that “past performance is no guarantee of future results”—or, even worse, “past performance is not indicative of future results,” which implies that there’s no relationship at all between them. If this were true, backtesting would be a complete waste of time. The whole idea behind backtesting is that there is a relationship between past performance and future results.

Laws and Theories

If you throw a ball into the air a hundred times and it always falls back down, you can assume that it will continue to do so every time you throw it into the air. Why? Because there’s a law behind it: the law of gravity. Similarly, there are absolutely reliable laws that govern portfolio performance, laws derived from statistics and mathematics.

Portfolio performance is governed not only by laws, but by theories. There are thousands of those, and they can be tested. But they’re not immutable or completely reliable. They could pertain to certain time periods and not to others. It’s important to recognize the difference.

Laws of Portfolio Performance

There are many of these, but the ones that I’ve found most useful are the laws of diversification, regression to the mean, outliers, and alpha and beta.

  • The law of diversification can be stated as follows: The standard deviation of a portfolio’s returns is more likely to decrease than to increase as more imperfectly correlated assets are added to it. (This law can be derived from the law of large numbers, which states that the average of a sample converges in probability toward an expected value as the sample gets larger.) In plain English, the greater the number of stocks you hold, the less fluctuation you’ll see in your returns, so long as the stocks are relatively uncorrelated.
  • The law of regression to the mean can be stated as follows: As long as there is a meaningful average of values, an extreme value will be more likely to become less extreme over time than to continue to be extreme. For example, if a particular industry has outperformed or underperformed other industries, it will be more likely to have average performance in the future than to continue to outperform or underperform. The same is true for factors, stocks, price returns, and so on. Regression to the mean can be overcome by certain tendencies, as one sees with stock momentum. But the law still pertains. It explains why, for instance, companies with very low free cash flow yields are more likely to have higher free cash flow yields in the future, while companies with very high yields are more likely to have lower yields in the future.
  • The law of outliers can be stated as follows: The impact of extremely large and small numbers upon results is likely to make them difficult to replicate. This law applies to averages, standard deviation, compounding, and regressions. For example, the average of 2, 3, 4, 5, 6, 7, 8, and 100 is close to 17, which is not at all representative of the sample. Similarly, a strategy that has a high return due to one or two outperforming stocks or periods is less likely to have its return replicated in another time period than one whose return is relatively unaffected by such stocks or periods. This law does not apply to medians, percentiles, and similar measures. Measures that are relatively impervious to outliers are called robust.
  • The law of alpha and beta is one I came up with and proved myself. It can be stated as follows: alpha and beta are inversely correlated so long as the market return tends to be positive. (There’s a corollary: alpha and beta are positively correlated in the rare case that the market return tends to be negative, and are uncorrelated when the market return is neutral.) This only applies when you calculate alpha and beta using a benchmark that constitutes the average returns of the securities from which you’re choosing (as is usually the case). This is a mathematical law that applies to all asset classes. However, this inverse correlation, while provable mathematically and testable empirically, is relatively weak. The takeaway, in plain English, is that low-beta portfolios are likely to outperform high-beta portfolios.
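To make the first and third laws concrete, here is a minimal numerical sketch. It assumes a simplified textbook case that is not part of my own testing: an equal-weighted portfolio in which every stock has the same volatility and every pair of stocks has the same correlation. The function name and the 40% volatility and 0.25 correlation figures are illustrative assumptions only.

```python
import math
import statistics


def equal_weight_portfolio_vol(n_assets, asset_vol, avg_correlation):
    """Volatility of an equal-weighted portfolio of n_assets stocks that all
    share one volatility and one pairwise correlation (a simplification)."""
    variance = asset_vol ** 2 * (1 / n_assets + (1 - 1 / n_assets) * avg_correlation)
    return math.sqrt(variance)


# Diversification: volatility falls as imperfectly correlated stocks are added,
# but it flattens out near asset_vol * sqrt(avg_correlation), here 0.20.
for n in (1, 5, 20, 100):
    print(n, round(equal_weight_portfolio_vol(n, 0.40, 0.25), 4))

# Outliers: one extreme value drags the mean away from the bulk of the sample,
# while the median (a robust measure) barely notices it.
sample = [2, 3, 4, 5, 6, 7, 8, 100]
print(statistics.mean(sample))    # 16.875
print(statistics.median(sample))  # 5.5
```

Note that the benefit of each additional stock shrinks quickly: under these assumed numbers, most of the reduction in volatility has already happened by twenty positions.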

Theories of Portfolio Performance

There are legions of these. A few examples that come quickly to mind are:

  • Stocks with low prices relative to their fundamentals tend to outperform stocks with high prices.
  • The stock prices of small companies tend to be more volatile than those of large companies.
  • Companies with growing sales and shrinking inventory tend to outperform companies with shrinking sales and growing inventory.
  • Companies that pay dividends regularly are likely to continue to pay dividends regularly.

All of these, and many thousands of others, are ready for testing to see if they can be backed by empirical data.

The Correlation Between Past Performance and Future Results

In order to figure out how best to backtest, you have to answer the following question to your satisfaction: Under what conditions does past (in-sample) performance best correlate with future (out-of-sample) results?

Further questions that this question raises include:

  • How many years should be backtested?
  • How large a portfolio should be backtested? (That is, should you backtest a portfolio with the same number of stocks as your out-of-sample portfolio, or would a more diversified one be better?)
  • What performance measures should be used?
  • Should backtested results be adjusted for outliers?
  • How will optimization affect the correlation?

The best way to backtest is the way that maximizes the correlation between past performance and future results.

Some people think that there is no correlation between them, and that backtesting just gives you false hope. People warn about data mining and misleading backtests. That’s why it’s so important to establish a correlation baseline.

For example, let’s say an academic study shows that a certain strategy gets a very good Sharpe ratio over the past fifty years, and it involves buying the top quintile of stocks in the S&P 500 and shorting the bottom quintile.

What the study does not answer is this question: if one buys the top quintile of stocks in the S&P 500 and shorts the bottom quintile for fifty years according to, say, one hundred different strategies, does the rank of these different strategies according to their Sharpe ratio correlate with the out-of-sample Sharpe ratio of exactly the same strategies, using the same methods, over the next, say, twenty years? Of course, this would have to be tested over various fifty-year periods and the following twenty years.

I have no answer to this question. I honestly have no idea if the answer is “Yes” or “No.” And if the answer is that there is no correlation, then all these academic studies are useless.

Furthermore, there is no standard for testing. Some academic studies use the top decile minus the bottom decile, some the top tercile minus the bottom tercile, some ignore the bottom quantile altogether. Some test over fifty years and some over less than twenty. They rely on a huge variety of different universes. Some use the Sharpe ratio as a performance measure, others use CAGR, and others use risk-adjusted alpha.

It astonishes me that I have not seen any such correlation study pertaining to academic methods such as these. They have been the standard for academic backtesting over the past fifty years, yet nobody has actually tried to see if the results are meaningful in a general way. In my opinion, prior to using a testing method, one must determine if that testing method will generally yield meaningful results.

Running a Correlation Study

Running a correlation study is a huge amount of work. One has to create a library of strategies, run equity curves, and measure their correlation. Doing this may not be worthwhile for most investors. So in this section of the article I’ve summarized my own conclusions.

Here’s the way I have run correlation studies—with a view not for writing academic papers but for coming up with a strategy that will work for the way I buy and sell stocks.

  1. Create fifty to a hundred different portfolio strategies. These should, ideally, have never been backtested. Get them from somewhere else, not from your own research. It’s especially important that they have never been optimized over a specific time period. On the other hand, the strategies should not be random. They should make sense, at least to someone. They should ideally be strategies that someone might use. The strategies should be flexible enough that you can decrease or increase the portfolio size. For example, you could use the various screening strategies that AAII has developed over the last few decades.
  2. Run all the strategies over as long a period as possible. Then rerun them with varying portfolio sizes.
  3. Compare in-sample and out-of-sample results. How long do you want your own strategy to last? Do you anticipate changing your strategy every six months, every three years, every ten years? Take your answer to that question and make that your “out-of-sample” period. Then measure the correlation between the total return of the out-of-sample period for your fifty or one hundred strategies and the return of the preceding in-sample periods: one year, two years, etc. up to twelve or fifteen years. (For example, a good correlation will show that the same strategies perform poorly in both periods and the same strategies perform well in both periods.) Ideally there should be a year between the end of the in-sample and the beginning of the out-of-sample period in order to avoid data leakage. Do this over and over again, as much as your total time period allows. Considering that mass of data, what is the length of the in-sample period that best correlates with the out-of-sample results? If the in-sample period is too long, the ranks of the strategies will cluster together and be almost indistinguishable; if it’s too short, the ranks may show very little correlation since regression to the mean is more prevalent than persistence over short periods. My own results are as follows: for a three-year out-of-sample period, a ten-year in-sample period is the most correlative, though almost anything longer than ten years will still correlate decently well. A five-year in-sample period is the worst—the correlations are often negative.
  4. Vary your portfolio size. Now that you’ve determined your ideal in-sample time period, measure the correlation of those in-sample time periods with various portfolio sizes to the out-of-sample periods with the portfolio size you’re actually going to use. If you double, triple, quadruple, or quintuple the number of positions in your in-sample period, does the correlation of the strategy ranks with the out-of-sample period improve? In my experience, about two to five times the number of positions gives me better results than using the same number of positions.
  5. Determine your portfolio backtesting method. You might find that a top-quintile vs bottom-quintile approach correlates best with your out-of-sample portfolio simulation; or you might find that a rolling backtest correlates best, or a simple simulation. (For me, the latter has always worked well.)
  6. Examine performance measures. If you’re interested in getting the highest return in the out-of-sample period, should you follow the strategy with the highest in-sample returns? Or is it better to use a different measure? What if you’re interested in getting the highest Sharpe ratio? Do Sharpe ratios in different periods have a strong correlation? Or is there another method to use in the in-sample period that will correlate better with an out-of-sample Sharpe ratio? Personally, I have found that the performance measure with the highest correlation to out-of-sample CAGR is in-sample alpha after trimming outliers.
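The core of step 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the exact procedure I run: it assumes you already have yearly returns for each strategy in a dictionary (the name returns_by_strategy and all function names are hypothetical), it uses total return as the performance measure, and it uses a Spearman rank correlation, which is one reasonable choice among several.

```python
import math


def total_return(yearly):
    """Compound a list of yearly returns (0.10 means +10%)."""
    growth = 1.0
    for r in yearly:
        growth *= 1.0 + r
    return growth - 1.0


def ranks(values):
    """Ranks starting at 1 for the smallest value; ties share an average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Rank correlation: Pearson correlation of the two rank vectors."""
    return pearson(ranks(x), ranks(y))


def average_rank_correlation(returns_by_strategy, in_years, out_years, gap_years=1):
    """Slide an in-sample / gap / out-of-sample window across the history and
    average the rank correlation of strategy returns over every position."""
    names = list(returns_by_strategy)
    n_years = len(returns_by_strategy[names[0]])
    window = in_years + gap_years + out_years
    if n_years < window:
        raise ValueError("history is shorter than one full window")
    correlations = []
    for start in range(n_years - window + 1):
        in_end = start + in_years
        out_start = in_end + gap_years
        in_sample = [total_return(returns_by_strategy[s][start:in_end]) for s in names]
        out_sample = [
            total_return(returns_by_strategy[s][out_start:out_start + out_years])
            for s in names
        ]
        correlations.append(spearman(in_sample, out_sample))
    return sum(correlations) / len(correlations)
```

Calling average_rank_correlation once for each candidate in-sample length (one year, two years, and so on) and comparing the results gives a rough answer to the question in step 3. Swapping total_return for another measure, or feeding in returns from runs with different portfolio sizes, extends the same loop to steps 4 and 6.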

Optimizing Your Strategy

One of the warnings I always hear is this: don’t optimize your strategy too much; simply run a few backtests to see if something works or doesn’t.

The logic of this is pretty clear. The more you optimize your strategy, the closer it will hew to the data that’s available. You will then end up with a strategy that is tailored to work beautifully for a very specific set of stocks during a very specific time period. Such a strategy may be less likely to work in a subsequent time period, and you will end up with illusory results.

On the other hand, if there is a correlation, even a slight one, between in-sample and out-of-sample returns, the laws of probability say that you will be more likely to do well out-of-sample with a strategy that performs well in-sample than one that performs badly.

One solution to this conundrum may be to test over various different time periods and various different groups of stocks. I break up my universe of stocks into five equal parts and optimize different strategies for each one. I then combine those five strategies into one. I also change the beginning and end dates of my backtests; some people (notably James O’Shaughnessy) subdivide their in-sample testing period into discrete testing periods and only pick strategies that work in most of them.
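The idea of subdividing the in-sample period can be sketched as follows. This is a simplified illustration rather than O’Shaughnessy’s actual procedure: it assumes you have periodic returns for a strategy and for a benchmark as equal-length lists, and the function names are hypothetical.

```python
def split_into_periods(values, n_periods):
    """Split a sequence into n_periods contiguous chunks of near-equal length.
    Assumes len(values) >= n_periods so that no chunk is empty."""
    size, extra = divmod(len(values), n_periods)
    chunks, start = [], 0
    for i in range(n_periods):
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def compound(returns):
    """Compound a list of periodic returns (0.10 means +10%)."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


def win_rate(strategy_returns, benchmark_returns, n_periods):
    """Fraction of discrete sub-periods in which the strategy beat the benchmark."""
    s_chunks = split_into_periods(strategy_returns, n_periods)
    b_chunks = split_into_periods(benchmark_returns, n_periods)
    wins = sum(compound(s) > compound(b) for s, b in zip(s_chunks, b_chunks))
    return wins / n_periods
```

A strategy with a high overall return but a low win rate probably owes its result to one or two exceptional sub-periods, which is exactly the situation the law of outliers warns about.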

There are a lot of things to optimize for when backtesting. You might want to optimize the rules that govern your universe of stocks. You might want to optimize the various elements of your portfolio management: how many stocks to hold at a time, how often you buy and sell them, what weights you give them, and so on. If you use a ranking system, like I do, you might want to optimize the weights of your various factors.

There are some things that backtesting may not tell you. Will a backtest ever give you really realistic slippage? I suggest setting your liquidity limits without backtesting. Instead, use your experience and look at the actual slippage you’re paying by comparing your transaction records for stocks with varying liquidity.

Your backtesting method may end up favoring a certain industry or sector that outperformed during your in-sample period. Beware of this. Different industries and sectors perform well at different times, and certain factors can help you concentrate on those. But don’t favor a method that will always choose the same industry/sector simply based on past results. The best investment strategies will rotate and/or diversify between industries and sectors so that you’re always on top. And you certainly don’t want to exclude certain industries just because they performed poorly during your in-sample period, or because doing so improves your in-sample backtests. Those may end up being the best industries in the near future.

Optimizing Portfolio Parameters

In my opinion, the best way to optimize your portfolio parameters is to use a factor-based stockpicking strategy that you or someone else developed a while ago and see what portfolio parameters would have worked best with it in the years since then. The worst way is to use a strategy that you’ve already backtested over the same period that you’re now optimizing.

Here are some questions you should ask:

  • How many stocks should be held?
  • What buy rules and sell rules should you use?
  • What weights should the stocks have?
  • Should you have a minimum holding period to reduce slippage?
  • Should you set a maximum number of stocks per industry or sector?
  • Should you set a maximum weight for a single position?

Optimizing Factors

If you’re using ranking, some factors might benefit from an approach that doesn’t favor the highest or the lowest values, but instead chooses something in between. For example, my ranking system works very well for microcaps and small caps but doesn’t work very well for nanocaps, mid-caps, and large caps. So I don’t want my size factors to be simply “smaller is better,” because that would emphasize nanocaps.

One way to optimize factors like these is to take your ranking system, exclude the factor in question, and test it on subsets of your universe that are split up according to the factor. So, for example, I could test my ranking system on stocks of various sizes and see which size stocks perform best and which perform worst. Or, if I were optimizing a growth factor, I could test my ranking system on stocks with various growth rates.
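A rough sketch of this kind of bucket test is below. It assumes a deliberately simplified input: a list of records, one per stock that your ranking system (with the factor in question excluded) would have bought, each holding the factor value and the stock’s subsequent return. The field names and the function name are hypothetical.

```python
def bucket_average_returns(stocks, factor_key, return_key, n_buckets=5):
    """Sort stocks by a factor, cut them into n_buckets near-equal groups, and
    return the mean subsequent return of each group, lowest factor values first.
    Assumes len(stocks) >= n_buckets so that no group is empty."""
    ordered = sorted(stocks, key=lambda s: s[factor_key])
    averages = []
    for b in range(n_buckets):
        lo = b * len(ordered) // n_buckets
        hi = (b + 1) * len(ordered) // n_buckets
        group = ordered[lo:hi]
        averages.append(sum(s[return_key] for s in group) / len(group))
    return averages


# Toy data: market cap in millions of dollars and a made-up subsequent return.
picks = [
    {"market_cap": 50, "fwd_return": 0.02},
    {"market_cap": 120, "fwd_return": 0.10},
    {"market_cap": 300, "fwd_return": 0.12},
    {"market_cap": 800, "fwd_return": 0.05},
    {"market_cap": 2000, "fwd_return": 0.03},
    {"market_cap": 5000, "fwd_return": 0.01},
]
print(bucket_average_returns(picks, "market_cap", "fwd_return", n_buckets=3))
```

If the middle buckets come out ahead of both ends, as they do in this made-up example, that is a sign the factor should reward in-between values rather than the extremes.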

Optimizing a Ranking System

Two questions that will immediately come up are:

  • which factors should you use?
  • what weights should they be assigned?

Fortunately, those questions can be easily answered at the same time by experimenting with a long list of factors and allowing yourself to assign 0% weights to many of them.

As I wrote above, I subdivide the universe of stocks I’m willing to invest in into five more-or-less random subuniverses and run my tests on each one. I vary my factor weights until I get the ranking systems that perform best in each of my universes—and the ones that perform best in all of them—and then I average the weights of those outperforming systems.
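The two mechanical parts of this process, splitting the universe and averaging the weights, might look like the sketch below. The optimization itself (varying the weights and running the backtests) is not shown, and the function names, the fixed random seed, and the example factor names are all illustrative assumptions.

```python
import random


def split_universe(tickers, n_parts=5, seed=0):
    """Shuffle the tickers and deal them into n_parts near-equal subuniverses.
    A fixed seed keeps the split reproducible from one test run to the next."""
    shuffled = list(tickers)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_parts] for i in range(n_parts)]


def average_weights(weight_sets):
    """Average factor weights across several ranking systems. A factor that is
    missing from one system counts as a 0% weight in that system."""
    factors = sorted({f for w in weight_sets for f in w})
    return {
        f: sum(w.get(f, 0.0) for w in weight_sets) / len(weight_sets)
        for f in factors
    }


# Two hypothetical optimized systems, with weights in percent.
best_systems = [
    {"value": 60, "growth": 40},
    {"value": 40, "quality": 60},
]
print(average_weights(best_systems))
```

Because every input system’s weights sum to 100%, the averaged weights do too, so the combined system needs no rescaling.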

A tip: don’t vary your factor weights by less than 2%. It’s a waste of time. One way to do this is to make all factor weights multiples of 2%, 2.5%, or 4%.
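To see why a coarse step matters, here is a small sketch that enumerates every possible set of weights on such a grid, with 0% weights allowed. The function name is hypothetical, and exhaustive enumeration is shown only to illustrate how fast the number of combinations grows. It is not how I search for weights.

```python
def weight_grids(n_factors, step_pct=4):
    """Yield every tuple of n_factors non-negative weights that are multiples
    of step_pct and sum to 100. step_pct must divide 100 evenly."""
    units = round(100 / step_pct)
    if abs(units * step_pct - 100) > 1e-9:
        raise ValueError("step_pct must divide 100 evenly")

    def compositions(remaining, slots):
        # All ways to hand out `remaining` units across `slots` factors.
        if slots == 1:
            yield (remaining,)
            return
        for first in range(remaining + 1):
            for rest in compositions(remaining - first, slots - 1):
                yield (first,) + rest

    for combo in compositions(units, n_factors):
        yield tuple(u * step_pct for u in combo)


# Number of candidate weightings for three factors at a 4% step.
print(sum(1 for _ in weight_grids(3, step_pct=4)))
```

Even with only three factors the grid holds hundreds of candidates, and the count grows very quickly as factors are added or the step shrinks, which is one reason a finer step is rarely worth the backtesting time.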

The amount of backtesting time this process takes can be extreme. Expect to spend weeks on it, or automate the process somehow. However, you really don’t have to be perfect in your optimization. A very rough approximation is good enough. You’ll find that the highest-ranked stocks are pretty much the same after a certain point.

Common Backtesting Mistakes

  • Optimizing only on the entire universe from which you’ll be choosing stocks.
  • Optimizing on a short and fixed time period.
  • Optimizing using a small number of positions.
  • Believing that an optimized result is achievable out-of-sample. (Remember that the word optimize implies that the result is closely tied to a specific sample of stocks in a specific time period.)
  • Basing your portfolio parameter backtests on a ranking system already optimized over the same time period.
  • Basing your optimization only on CAGR rather than on more robust performance measures.
  • Taking into account only overall performance rather than looking at performance during various discrete time periods. (Some robust performance measures can handle this for you.)
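As one example of a more robust performance measure, here is a minimal sketch of alpha after trimming outliers. The trimming rule is an assumption made for illustration: it drops the periods in which the strategy’s return differed most from the benchmark’s, in both directions, before running an ordinary least-squares regression. Other trimming rules are possible, and the function names are hypothetical.

```python
def alpha_beta(strategy, benchmark):
    """Ordinary least-squares intercept (alpha, per period) and slope (beta)
    of strategy returns regressed on benchmark returns."""
    n = len(strategy)
    mean_b = sum(benchmark) / n
    mean_s = sum(strategy) / n
    var_b = sum((b - mean_b) ** 2 for b in benchmark)
    cov = sum((b - mean_b) * (s - mean_s) for s, b in zip(strategy, benchmark))
    beta = cov / var_b
    return mean_s - beta * mean_b, beta


def trimmed_alpha(strategy, benchmark, trim_fraction=0.1):
    """Alpha after dropping the trim_fraction of periods with the lowest and
    the highest strategy-minus-benchmark return (an assumed trimming rule)."""
    pairs = sorted(zip(strategy, benchmark), key=lambda p: p[0] - p[1])
    k = int(len(pairs) * trim_fraction)
    kept = pairs[k:len(pairs) - k] if k else pairs
    kept_strategy, kept_benchmark = zip(*kept)
    return alpha_beta(kept_strategy, kept_benchmark)[0]


# Toy data: the strategy beats the benchmark by 0.2% every period, except for
# one period with a windfall gain of 50 percentage points.
benchmark = [0.01, -0.02, 0.03, 0.00, 0.02, -0.01, 0.04, 0.01, -0.03, 0.02]
strategy = [b + 0.002 for b in benchmark]
strategy[6] += 0.50

print(alpha_beta(strategy, benchmark)[0])   # inflated by the single windfall
print(trimmed_alpha(strategy, benchmark))   # close to the underlying 0.002
```

In this toy data the untrimmed alpha is more than ten times the underlying per-period edge, which is the kind of distortion that makes an optimized result hard to replicate out-of-sample.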

Stress-Testing Your Systems

As I wrote in a previous post on stress-testing quantitative models, it’s important not only to build a good strategy but also to try to destroy it. A strategy that succeeds over a variety of stress tests is ready for the long haul. Backtesting for failure can be just as important as backtesting for success.
