### Introduction

With the recent advent of large financial datasets, machine learning, and high-performance computing, analysts can backtest millions of alternative investment strategies. Backtest optimizers search for parameter combinations that maximize a strategy's simulated historical performance, which leads to backtest overfitting. The performance-inflation problem goes beyond backtesting: researchers and investment managers tend to report only positive results, a phenomenon known as selection bias. Failing to control for the number of tests involved in a given finding leads to overly optimistic performance expectations. The Deflated Sharpe Ratio (DSR) corrects for two major sources of performance inflation:

**selection bias under multiple tests** and **non-normally distributed returns**. By doing so, the DSR helps separate legitimate empirical findings from statistical flukes.

Backtesting is a good example. A backtest is a historical simulation of how a particular investment strategy would have performed in the past. While backtesting is a powerful and necessary research tool, it is also easily manipulated. We argue that the most important piece of information missing from almost every backtest published in academic journals and investment prospectuses is the number of trials attempted. Without it, it is impossible to assess the relevance of a backtest. Frankly, no matter how impressive the reported performance, a backtest whose author does not disclose the scope of the search behind the finding is worthless. Investors and journal referees should request this information whenever a backtest is submitted, although doing so will not eliminate the danger entirely.

### Multiple Testing

Now suppose we are interested in analyzing multiple strategies on the same dataset, with the goal of choosing the best, or at least a good one, for future use. A curious problem arises: as we test more and more strategies, each at the same significance level, the overall probability of selecting at least one bad strategy increases. This is the multiple testing problem, and it is so widespread and notorious that the American Statistical Association explicitly warns against it: running multiple tests on the same dataset increases the chance of obtaining at least one invalid result, and selecting a "significant" result from many parallel tests carries a serious risk of false conclusions.

The more trials we run, the more likely it is that an invalid strategy will pass the test and then be published as if it were the result of a single test.
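The growth of the family-wise false-positive rate can be made concrete. Below is a minimal sketch (the function name and the independence assumption are ours): with N independent tests, each run at significance level α, the probability of at least one Type I error is 1 − (1 − α)^N.

```python
# Probability of at least one false positive among n independent tests,
# each run at significance level alpha (illustrative sketch; names are ours).
def family_wise_error(n: int, alpha: float = 0.05) -> float:
    """P(at least one Type I error) across n independent tests."""
    return 1 - (1 - alpha) ** n

for n in (1, 10, 100):
    print(f"N={n:>3}: P(>=1 false positive) = {family_wise_error(n):.3f}")
# → N=  1: 0.050, N= 10: 0.401, N=100: 0.994
```

At N = 100, the chance of at least one spurious "discovery" exceeds 99%, even though each individual test was run at the conventional 5% level.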

### Selection Bias

Researchers who run multiple tests on the same data tend to publish only those that pass statistical significance tests, while hiding the rest. By not reporting negative results, they expose investors to a biased sample of outcomes. This problem, known as "selection bias," is caused by the combination of multiple testing and partial reporting; see Roulston and Hand [2013]. It takes many different forms:

**i)** analysts do not report the full set of experiments performed (the "file drawer effect");

**ii)** journals publish only "positive" results ("publication bias");

**iii)** hedge fund indices track only the performance of surviving funds, excluding those that blew up ("survivorship bias"); and

**iv)** managers self-report only their (so far) profitable strategies, and databases backfill their earlier track records ("self-selection bias," "backfill bias").

### Deflated Sharpe Ratio (DSR)

All of these phenomena have in common that key information is hidden from decision-makers, with consequences far beyond the nominal probability of a Type I error. Ignoring the full extent of the search makes the unlikely appear likely.

The Sharpe ratio (SR) is the most widely used performance statistic. It evaluates investments on a risk-adjusted basis rather than on raw returns. Portfolio managers are keen to improve their SR in order to rank higher in databases such as Hedge Fund Research and to attract larger capital allocations. Setting a constant cutoff threshold for the SR, above which a manager or strategy is considered for capital allocation, leads to the same selection bias discussed earlier: the false-positive rate keeps growing as more and more trials are considered.
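As a reference point, the SR is simply the mean of returns divided by their standard deviation, usually annualized. A minimal sketch in Python (the function name and the 252-trading-day annualization factor are our choices):

```python
import numpy as np

def sharpe_ratio(returns, periods_per_year=252):
    """Annualized Sharpe ratio: mean return over its standard deviation."""
    r = np.asarray(returns, dtype=float)
    return r.mean() / r.std(ddof=1) * np.sqrt(periods_per_year)

# Usage: 5 bps/day average return with ~1% daily volatility (made-up numbers).
rng = np.random.default_rng(1)
daily = 0.0005 + rng.normal(0.0, 0.01, size=1250)
print(f"annualized SR = {sharpe_ratio(daily):.2f}")
```

Note that the estimator itself is noisy: its sampling variance is what the multiple-testing machinery below exploits.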

### DSR Example

Suppose a strategist is studying seasonal patterns in the money market. He argues that the U.S. Treasury's auction cycle creates inefficiencies that can be exploited by selling off-the-run bonds a few days before an auction and buying the new issues a few days after it. He backtests variations of this idea, combining different pre- and post-auction windows, maturities, holding periods, stop losses, etc. He finds that many of the resulting portfolios deliver an annualized SR of about 2, with the best one achieving an SR of 2.5 on a daily sample spanning 5 years.
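The inflation that comes from trying many configurations can be reproduced with pure noise. Here is a sketch under assumed parameters (our choices: 100 trials, 5 years of daily Gaussian returns with zero true edge); even with no skill anywhere, the best backtest looks strong.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 100, 1250                              # trials, daily obs (~5 years)
returns = rng.normal(0.0, 0.01, size=(N, T))  # zero mean: no real skill
sr_daily = returns.mean(axis=1) / returns.std(axis=1, ddof=1)
sr_annual = sr_daily * np.sqrt(252)           # annualize each trial's SR

print(f"average annualized SR: {sr_annual.mean():+.2f}")
print(f"best    annualized SR: {sr_annual.max():+.2f}")  # selection-inflated
```

Reporting only the maximum, as a backtest optimizer implicitly does, presents sampling noise as a seemingly attractive Sharpe ratio.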

Excited by the results, he approaches investors to raise funds to run this strategy, arguing that an annual SR of 2.5 must be statistically significant. One investor, familiar with a recent paper published in the Journal of Portfolio Management, asks the strategist to disclose:

**i) the number of independent tests performed (N);**

**ii) the variance of the trials' SR estimates;**

**iii) the sample length (T);**

**iv) the skewness of the returns;**

**v) the kurtosis of the returns.**

The analyst responded with the requested figures. Soon after, the investor rejected the analyst's offer. Why? Because, after substituting those figures into the DSR formula, the investor determined that the (non-annualized) deflated Sharpe ratio fell short of the 95% confidence level: this was not a legitimate empirical finding.
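The investor's calculation can be sketched as follows. The hurdle SR0 is the expected maximum SR among N zero-skill trials, and the DSR is the probability that the observed SR exceeds that hurdle after adjusting for skewness and kurtosis. All numeric inputs below (N = 100, the variance of trial SRs, skew, kurtosis) are illustrative placeholders, not figures from the text; SR inputs are non-annualized.

```python
from math import sqrt, exp
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329   # Euler–Mascheroni constant
Z = NormalDist()                   # standard normal CDF and quantiles

def expected_max_sr(n_trials: int, var_sr: float) -> float:
    """Expected maximum SR (SR0) among n_trials zero-skill strategies."""
    return sqrt(var_sr) * (
        (1 - EULER_GAMMA) * Z.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * Z.inv_cdf(1 - 1 / (n_trials * exp(1)))
    )

def deflated_sharpe_ratio(sr, sr0, n_obs, skew, kurt):
    """Probability that the (non-annualized) SR beats the hurdle sr0."""
    num = (sr - sr0) * sqrt(n_obs - 1)
    den = sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return Z.cdf(num / den)

# Illustrative inputs only: annual SR 2.5 de-annualized; made-up trial variance.
sr = 2.5 / sqrt(252)
sr0 = expected_max_sr(n_trials=100, var_sr=0.01)
print(f"DSR = {deflated_sharpe_ratio(sr, sr0, n_obs=1250, skew=-3, kurt=10):.3f}")
```

With these placeholder inputs, the DSR falls well below 0.95, so the headline "SR = 2.5" would be rejected at the 95% confidence level.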

### Conclusion

Selection bias is ubiquitous in the financial literature, where backtests are often published without reporting the full range of trials involved in selecting a particular strategy. Thus, selection bias combined with backtest overfitting misleads investors into allocating capital to strategies that will systematically lose money.

In this paper, a test is proposed to determine whether an estimated SR is statistically significant after correcting for two leading sources of performance inflation: selection bias and non-normally distributed returns. The Deflated Sharpe Ratio (DSR) incorporates information about the discarded trials, such as

**i) the number of independent trials conducted,**

**ii) the variance of the trials' SR estimates,**

**iii) the sample length,**

**iv) the skewness of the returns, and**

**v) the kurtosis of the returns.**

**References:**

1. https://poseidon01.ssrn.com/delivery.php?ID=490074106114120089120085007020090104041021014087045043087028010025124102014086007075048120001122008096009116108069081003026086008007035089016093084065090070022088050014118117094085019067099072120126029104116118108118085110087002097113125064123081&EXT=pdf
