
Deflated Sharpe Ratio can reduce false discovery

 

Introduction

With the advent of large financial datasets, machine learning, and high-performance computing, analysts can backtest millions of alternative investment strategies. Backtest optimizers search combinations of parameters to maximize a strategy's simulated historical performance, leading to backtest overfitting. The performance-inflation problem goes beyond backtesting: more generally, researchers and investment managers tend to report only positive results, a phenomenon known as selection bias. Failure to control for the number of trials involved in a given finding can lead to overly optimistic performance expectations. The Deflated Sharpe Ratio (DSR) corrects for two major sources of performance inflation: selection bias under multiple testing and non-normally distributed returns. By doing so, DSR helps separate legitimate empirical findings from statistical flukes.

Backtesting is a good example. A backtest is a historical simulation of how a particular investment strategy would have performed in the past. While backtesting is a powerful and necessary research tool, it can also be easily manipulated. In this paper, we argue that the most important piece of information missing from almost all backtests published in academic journals and investment offerings is the number of trials attempted. Without this information, it is impossible to assess a backtest's relevance. Frankly, no matter how well it performs, a backtest for which the researcher does not disclose the scope of the search behind the findings is worthless. Investors and journal referees should request this information whenever a backtest is submitted, although doing so will not eliminate the danger entirely.

Multiple Testing

Now suppose we are interested in analyzing multiple strategies on the same dataset, with the goal of choosing the best, or at least a good, strategy for future use. Then a curious problem arises: as we test more and more strategies, each at the same significance level, the overall probability of selecting at least one bad strategy increases. This is called the multiple testing problem, and it is so widespread and notorious that the American Statistical Association explicitly warns that running multiple tests on the same dataset increases the chance of obtaining at least one invalid result. Selecting a "significant" result from multiple parallel tests carries a serious risk of false conclusions.

The more trials we run, the more likely it is that an invalid strategy will pass the test and then be published as if it were the result of a single test.

Selection Bias

Researchers who run multiple tests on the same data tend to report only the results that pass statistical significance tests, while hiding the rest. Because negative results go unreported, investors are exposed to a biased sample of outcomes. This problem is known as "selection bias" and arises from the combination of multiple testing and partial reporting; see Roulston and Hand [2013]. It takes many forms:
i) analysts do not report the full range of experiments performed ("file-drawer effect"),
ii) journals publish only "positive" results ("publication bias"),
iii) hedge-fund indices track only the performance of surviving funds, dropping those that blow up ("survivorship bias"), and
iv) managers report only their (so far) profitable strategies ("self-selection bias," "backfill bias"), etc.

Deflated Sharpe Ratio (DSR)

Under the null hypothesis of no skill, the expected maximum Sharpe ratio across N independent trials is approximately

E[max{SR_n}] ≈ sqrt(V[{SR_n}]) * ((1 - γ) * Z⁻¹[1 - 1/N] + γ * Z⁻¹[1 - 1/(N·e)])    (equation 1)

where V[{SR_n}] is the variance of the trials' estimated SRs, γ ≈ 0.5772 is the Euler-Mascheroni constant, e is Euler's number, and Z⁻¹[·] is the inverse of the standard normal CDF. The Deflated Sharpe Ratio is then

DSR = Z[(SR - SR*) * sqrt(T - 1) / sqrt(1 - γ₃·SR + ((γ₄ - 1)/4)·SR²)]    (equation 2)

while SR* = E[max{SR_n}] is the deflation threshold, T is the sample length, SR is the observed (non-annualized) Sharpe ratio, and γ₃ and γ₄ are the skewness and kurtosis of the returns.
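As a concrete sketch, the two DSR equations can be implemented with the Python standard library (the function and variable names are mine, not the paper's):

```python
from math import e, sqrt
from statistics import NormalDist

GAMMA = 0.5772156649  # Euler–Mascheroni constant

def expected_max_sr(n_trials, sr_variance):
    """Equation 1: E[max{SR_n}] across n_trials skill-less strategies."""
    z = NormalDist().inv_cdf
    return sqrt(sr_variance) * ((1 - GAMMA) * z(1 - 1 / n_trials)
                                + GAMMA * z(1 - 1 / (n_trials * e)))

def deflated_sharpe_ratio(sr, sr_star, t, skew, kurt):
    """Equation 2: probability that the observed (non-annualized) SR beats
    the hurdle sr_star, given sample length t and the return moments."""
    num = (sr - sr_star) * sqrt(t - 1)
    den = sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return NormalDist().cdf(num / den)

# The hurdle rises with the number of trials, even with zero true skill:
for n in (10, 100, 1000):
    print(f"N={n:5d} -> E[max SR] = {expected_max_sr(n, 1.0):.3f}")
```

Note that the hurdle grows without bound as trials accumulate, which is exactly why a fixed SR cut-off cannot control false discoveries.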

What all of these phenomena have in common is that key information is hidden from decision-makers, with consequences far greater than the nominal probability of Type I errors suggests. Ignoring the full extent of the search makes the unlikely likely.

The Sharpe ratio (SR) is the most widely used performance statistic. It evaluates investments on a risk-adjusted basis rather than on raw returns. Portfolio managers are keen to improve their SR in order to rank higher in databases such as Hedge Fund Research and to obtain a larger allocation of capital. Setting a constant cut-off threshold for SR, above which a portfolio manager or strategy is considered for capital allocation, leads to the same selection bias discussed earlier: the rate of false positives keeps growing as more and more trials are considered.
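For reference, a minimal SR calculation from a series of periodic excess returns (the 250-trading-day annualization factor is an assumption, not from the source):

```python
from math import sqrt
from statistics import mean, stdev

def sharpe_ratio(excess_returns, periods_per_year=250):
    """Annualized Sharpe ratio: mean excess return over its standard
    deviation, scaled by the square root of periods per year."""
    return mean(excess_returns) / stdev(excess_returns) * sqrt(periods_per_year)
```

This sample estimate is itself a random variable, which is what makes the multiple-testing inflation of max{SR} possible in the first place.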

Note that unless max{SR} >> E[max{SR}], the discovered strategy is likely to be a false positive.

DSR Example

Suppose a strategist is studying seasonal patterns in the money market. He argues that the U.S. Treasury's auction cycle creates inefficiencies that can be exploited by selling outstanding (off-the-run) bonds a few days before an auction and buying the new issues a few days after it. He backtests this idea over alternative configurations, combining different pre- and post-auction windows, maturities, holding periods, stop-losses, etc. He finds that many of the configurations generate an annualized SR of around 2, with the best one delivering an annualized SR of 2.5 on a 5-year daily sample.

Excited by the results, he approaches an investor to raise funds to run this strategy, arguing that an annualized SR of 2.5 must be statistically significant. The investor, familiar with a recent paper published in the Journal of Portfolio Management, asks the strategist to disclose:
i) the number of independent trials performed (N);
ii) the variance of the trials' results, V[{SR_n}];
iii) the sample length (T);
iv) the skewness of the returns (γ₃); and
v) the kurtosis of the returns (γ₄).

The strategist responds: N = 100, V[{SR_n}] = 0.5 (annualized), T = 1250, γ₃ = -3, γ₄ = 10.

Soon after, the investor declines the strategist's offer. Why? Because the investor determines that this is not a legitimate empirical finding at a 95% confidence level. Specifically, substituting into equation 1 and equation 2 yields:
E[max{SR_n}] ≈ 0.1132 (non-annualized), hence DSR ≈ 0.90 < 0.95.
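The investor's arithmetic can be reproduced in a short script. The inputs below are the ones assumed in this example, with the annualized figures de-annualized under a 250-trading-days-per-year convention:

```python
from math import e, sqrt
from statistics import NormalDist

nd = NormalDist()
GAMMA = 0.5772156649  # Euler–Mascheroni constant

N = 100                  # independent trials
var_sr = 0.5 / 250       # variance of the trials' SRs, de-annualized
sr = 2.5 / sqrt(250)     # best backtest's SR, de-annualized (~0.158)
T = 1250                 # 5 years of daily observations
skew, kurt = -3.0, 10.0  # skewness and kurtosis of the returns

# Equation 1: expected maximum SR among N skill-less trials
sr_star = sqrt(var_sr) * ((1 - GAMMA) * nd.inv_cdf(1 - 1 / N)
                          + GAMMA * nd.inv_cdf(1 - 1 / (N * e)))

# Equation 2: Deflated Sharpe Ratio
dsr = nd.cdf((sr - sr_star) * sqrt(T - 1)
             / sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2))

print(f"E[max SR] = {sr_star:.4f} (non-annualized), DSR = {dsr:.4f}")
```

With these inputs the DSR comes out near 0.90, below the 0.95 cut-off, so the "significant" annualized SR of 2.5 does not survive deflation.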

Conclusion

Selection bias is ubiquitous in the financial literature, where backtests are often published without reporting the full range of trials involved in selecting a particular strategy. Thus, selection bias combined with backtest overfitting misleads investors into allocating capital to strategies that will systematically lose money. 

In this paper, the DSR tests whether the estimated SR is statistically significant after correcting for two main sources of performance inflation: selection bias and non-normal returns. The Deflated Sharpe Ratio (DSR) incorporates information about the otherwise unreported trials, such as
i) the number of independent trials conducted,
ii) the variance of the trials' SRs,
iii) the sample length, and
iv) the skewness and kurtosis of the returns.

References:
1. https://poseidon01.ssrn.com/delivery.php?ID=490074106114120089120085007020090104041021014087045043087028010025124102014086007075048120001122008096009116108069081003026086008007035089016093084065090070022088050014118117094085019067099072120126029104116118108118085110087002097113125064123081&EXT=pdf


