Backtesting metrics

Seasonal Anomalies Jul 8, 2021

This article explains the backtesting metrics used to describe and evaluate the characteristics and performance of an anomaly or a portfolio of anomalies.

Count and Time

N°Years: the number of years of history of a financial instrument.
N°Trades: the number of trades.
N°Trades 1y: the average number of trades made per calendar year.
Avg Trade Duration (days): the average duration of a single trade, in calendar days.
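
As a rough illustration, the count and time metrics above could be computed from a list of trades as follows. The trade dates, the `n_years` value, and all variable names are hypothetical, and dividing the trade count by the number of years is an assumption about how the per-year average is obtained.

```python
from datetime import date

# Hypothetical trades as (entry date, exit date) pairs.
trades = [
    (date(2019, 1, 2), date(2019, 1, 9)),
    (date(2019, 7, 1), date(2019, 7, 5)),
    (date(2020, 1, 2), date(2020, 1, 12)),
    (date(2020, 7, 1), date(2020, 7, 4)),
]
n_years = 2  # assumed length of the instrument's history, in years

n_trades = len(trades)
n_trades_1y = n_trades / n_years  # assumption: simple average per year

# Duration of each trade in calendar days, then the average.
durations = [(exit_day - entry_day).days for entry_day, exit_day in trades]
avg_trade_duration_days = sum(durations) / n_trades
```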

Total Return and Profit

Total[R]: it is the total return in percentage.
Total profit: it is the total profit, in dollars, starting from a capital of 100.
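
A minimal sketch of how the two figures relate, assuming trade returns are compounded on the running capital (the article does not say whether returns are compounded or summed, so this is an assumption). The returns and variable names are hypothetical.

```python
# Hypothetical per-trade returns, in percent.
trade_returns_pct = [10.0, -5.0, 20.0]
initial_capital = 100.0  # the starting capital stated in the definition

# Assumption: each trade's return is applied to the current equity.
equity = initial_capital
for r in trade_returns_pct:
    equity *= 1 + r / 100

total_profit = equity - initial_capital               # in dollars
total_return_pct = total_profit / initial_capital * 100  # in percent
```

With a starting capital of 100, the total profit in dollars and the total return in percent are numerically the same.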

Yearly Trade Returns

Avg [R] annualized: it's the average return of the trades, in percentage, annualized.
Avg [R] 1y: it is the average yearly return of the trades.
Stdev [R] 1y: it's the standard deviation of the yearly returns.
RR 1y: the Reward / Risk, or RR, is the ratio between the yearly average return and the standard deviation of the yearly returns.
Winning % 1y: it is the percentage of the positive yearly returns.
Sharpe:
- the 'Sharpe ratio' is a metric to evaluate the risk-adjusted return of the Anomaly or the Portfolio (let's call them 'strategy').
- It indicates how well the strategy has performed in comparison to a 'Risk-Free' rate of return.
- It is computed as the ratio between the strategy's yearly excess return over the risk-free rate (the US 3-month T-Bill rate) and the standard deviation of the strategy's yearly returns.
Sortino:
- the 'Sortino ratio' is a metric to evaluate the risk-adjusted return of the anomaly/portfolio.
- It is similar to the Sharpe ratio and differs only in the denominator: it considers only the standard deviation of the downside, rather than that of the entire (upside + downside) distribution of returns.
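
The yearly metrics above can be sketched as follows. The yearly returns and the risk-free value are hypothetical. Two conventions are assumptions, since the article does not specify them: the sample standard deviation is used, and the Sortino denominator is the root mean square of the shortfalls below the risk-free rate, which is one common definition of downside deviation.

```python
import statistics

yearly_returns = [12.0, -4.0, 8.0, 15.0, -2.0]  # hypothetical, in percent
risk_free = 1.0  # hypothetical yearly risk-free rate, in percent

avg_r_1y = statistics.mean(yearly_returns)
stdev_r_1y = statistics.stdev(yearly_returns)  # assumption: sample stdev

rr_1y = avg_r_1y / stdev_r_1y
winning_pct_1y = 100 * sum(r > 0 for r in yearly_returns) / len(yearly_returns)

# Sharpe: excess return over the risk-free rate, per unit of total volatility.
sharpe = (avg_r_1y - risk_free) / stdev_r_1y

# Sortino: same numerator, but only shortfalls below the risk-free rate
# contribute to the denominator (assumed convention).
shortfalls = [min(0.0, r - risk_free) for r in yearly_returns]
downside_dev = (sum(s * s for s in shortfalls) / len(shortfalls)) ** 0.5
sortino = (avg_r_1y - risk_free) / downside_dev
```

Because the downside deviation ignores the upside swings, the Sortino ratio of this sample comes out higher than its Sharpe ratio.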

Trade Returns

Avg [R]: it is the average return of the trades.
Stdev [R]: it is the standard deviation of the returns.
Reward / Risk (RR): is the ratio between the average return and the standard deviation.
Winning %: it is the percentage of winning trades.
Profit Factor (PF):
- is an index of trading quality that summarizes, in a single number, the relationship between the risks taken and the results obtained.
- It is computed by dividing the sum of the profits by the sum of the losses.
Stability:
- it's the stability of the equity line of the backtest.
- It ranges from 0% (minimum stability) to 100% (maximum stability).
- A high stability means that the equity line has risen steadily and linearly over time.
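
As an illustration, the per-trade metrics with an explicit definition above could be computed like this. The trade returns and variable names are hypothetical, and the sample standard deviation is an assumption. Stability is left out because the article does not give its formula.

```python
import statistics

trade_returns = [2.0, -1.0, 3.0, -2.0, 4.0]  # hypothetical, in percent

avg_r = statistics.mean(trade_returns)
stdev_r = statistics.stdev(trade_returns)  # assumption: sample stdev
rr = avg_r / stdev_r
winning_pct = 100 * sum(r > 0 for r in trade_returns) / len(trade_returns)

# Profit Factor: sum of the profits divided by the sum of the losses.
gross_profit = sum(r for r in trade_returns if r > 0)
gross_loss = -sum(r for r in trade_returns if r < 0)
profit_factor = gross_profit / gross_loss
```

A Profit Factor above 1 means the winning trades earned more in total than the losing trades lost.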

Drawdown

Avg Dwn:
- is the average drawdown of the equity line of the backtest.
- The lower the average drawdown, the closer the equity line has stayed to its all-time highs.
Max Dwn:
- is the maximum drawdown of the equity line.
Max Dwn / Avg [R] 1y:
- is the ratio between the maximum drawdown of the anomaly and its average yearly return.
- It expresses how many years it could take to recover from a drawdown equal to the maximum historical drawdown.
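
A small sketch of the drawdown metrics, using a hypothetical equity line and a hypothetical average yearly return. It assumes drawdown is measured as the percentage distance from the running all-time high and that the average is taken over every point of the equity line, including points at a new high. The article does not state either convention.

```python
equity_line = [100, 110, 105, 120, 90, 100, 130]  # hypothetical equity values
avg_r_1y = 10.0  # hypothetical average yearly return, in percent

peak = equity_line[0]
drawdowns = []
for value in equity_line:
    peak = max(peak, value)  # running all-time high
    drawdowns.append((peak - value) / peak * 100)  # percent below the high

avg_dwn = sum(drawdowns) / len(drawdowns)
max_dwn = max(drawdowns)

# Max Dwn / Avg [R] 1y: rough number of years needed to recover
# a drawdown as deep as the historical maximum.
years_to_recover = max_dwn / avg_r_1y
```

Here the deepest drop is from 120 to 90, a 25% drawdown, which at 10% per year would take about 2.5 years to recover.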

Risk in the worst case scenario

C-VaR:
- C-VaR stands for 'Conditional Value at Risk', and is also called 'Expected Shortfall'.
- It is an advanced metric and is used in portfolio optimization for effective risk management.
- It is computed by averaging the “extreme” losses in the tail of the distribution of historical returns, beyond the Value at Risk (VaR) cutoff point, usually 99%. In other words, it is the average of the worst 1% (the worst centile) of historical returns.
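
The computation described above can be sketched as a short function. The function name `cvar`, its parameters, and the rule of keeping at least one observation in the tail are illustrative assumptions, not the article's implementation.

```python
def cvar(returns, confidence=0.99):
    """Average of the worst (1 - confidence) share of historical returns."""
    ordered = sorted(returns)  # worst returns first
    # Number of observations in the tail; the small epsilon guards against
    # floating-point error, and at least one observation is always kept.
    tail_size = max(1, int(len(ordered) * (1 - confidence) + 1e-9))
    tail = ordered[:tail_size]
    return sum(tail) / len(tail)


# Hypothetical sample: 200 returns from -100 to 99.
sample = [float(x) for x in range(-100, 100)]
worst_centile_avg = cvar(sample)  # average of the 2 worst returns
```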

Winning/Losing Streaks

Z-Score streaks:
- the Z-Score of streaks measures how likely it is that our streaks of trades (consecutive wins and consecutive losses) are random.
- It usually fluctuates between -3 and +3, but it can occasionally go beyond these levels.
  - A positive Z-score means that a profitable position is likely to be followed by a losing one, and a losing one by a winning one, so long winning and losing streaks are unlikely.
  - A Z-score of 0 means that we are dealing with completely random results.
  - A negative Z-score means that a profitable position is likely to be followed by more profitable positions, and a losing position by more losing positions, so winning or losing streaks are likely.
- For example, if the last trade was a winning one, we can expect the following one to be:
  - if the Z-Score is near +3 ==> losing
  - if the Z-Score is near 0 ==> 50% losing, 50% winning
  - if the Z-Score is near -3 ==> winning.
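
The article does not give the formula, so the sketch below uses the runs-test style expression commonly quoted for trading streaks, Z = (N(R - 0.5) - P) / sqrt(P(P - N) / (N - 1)), where N is the number of trades, R the number of streaks, and P = 2 x wins x losses. Treat it as an assumption about how the metric might be computed. The function name is hypothetical.

```python
import math


def streak_z_score(outcomes):
    """Z-score of win/loss streaks; outcomes is a list of booleans (True = win)."""
    n = len(outcomes)
    wins = sum(outcomes)
    losses = n - wins
    p = 2 * wins * losses
    if wins == 0 or losses == 0 or p <= n:
        raise ValueError("need both wins and losses, and more than two trades")
    # A streak (run) starts at the first trade and at every change of outcome.
    runs = 1 + sum(a != b for a, b in zip(outcomes, outcomes[1:]))
    return (n * (runs - 0.5) - p) / math.sqrt(p * (p - n) / (n - 1))


alternating = [True, False] * 5          # win, loss, win, loss, ...
clustered = [True] * 5 + [False] * 5     # five wins, then five losses

z_alternating = streak_z_score(alternating)
z_clustered = streak_z_score(clustered)
```

With this formula, strictly alternating results give a positive Z-score and clustered results give a negative one, which matches the interpretation above.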

Excess Metrics of Anomalies

Exc. Avg [R] ann.: it is the difference between the 'average gross return annualized' of the anomaly and the 'other trades' one.
Exc. Avg [R]: it is the difference between the 'average gross return' of the anomaly and the 'other trades' one.
Exc. RR: it's the difference between the 'reward / risk' of the anomaly and the 'other trades' one.
Exc. Winning %: it is the difference between the 'positive percentage' of the anomaly and the 'other trades' one.

Note that:

Excess metrics are computed over the 'Other Trades': trades made in the same instrument but in different periods from the Anomaly's.
For example, the other trades for "Apple TDW 1" are "Apple TDW 2,3,4,5".
Technicality: the 'other trades' returns are duration-adjusted according to the average trade duration of the anomaly. In the "Apple TDW 1" case, the returns of "Apple TDW 2,3,4,5" are divided by 4.
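
A minimal sketch of the duration adjustment and two of the excess metrics, following the "Apple TDW 1" case in which the 'other trades' returns are divided by 4. All return values, variable names, and the `winning_pct` helper are hypothetical.

```python
import statistics


def winning_pct(returns):
    """Percentage of positive returns (hypothetical helper)."""
    return 100 * sum(r > 0 for r in returns) / len(returns)


anomaly_returns = [0.4, -0.1, 0.3]  # hypothetical "Apple TDW 1" returns, in percent
other_returns = [0.8, 0.4, -0.4]    # hypothetical "Apple TDW 2,3,4,5" returns

# Duration adjustment: scale the 'other trades' returns down to the
# average trade duration of the anomaly (divided by 4 in this case).
duration_ratio = 4
adjusted_other = [r / duration_ratio for r in other_returns]

exc_avg_r = statistics.mean(anomaly_returns) - statistics.mean(adjusted_other)
exc_winning_pct = winning_pct(anomaly_returns) - winning_pct(adjusted_other)
```

Dividing by a positive constant changes the size of the 'other trades' returns but not their sign, so the adjustment affects the excess average return and leaves the winning percentage of the 'other trades' unchanged.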

Excess Metrics of Portfolios

Exc. Avg [R] 1y on Bench.: it's the average yearly return of the portfolio minus the benchmark one.
Exc. Avg [R] 1y on RF: it is the average yearly return of the portfolio minus the Risk Free one.
Exc. RR 1y on Bench.: the yearly Reward / Risk of the portfolio minus the benchmark one.
Winning % 1y on Bench.: the percentage of years in which the return of the portfolio has been higher than the benchmark one.
Winning % 1y on RF: it's the percentage of years in which the return of the portfolio has been higher than the Risk Free one.

Note that:

Excess metrics are computed over 'Benchmark' or over 'Risk-free rate' returns.
- The 'Benchmark' is the S&P 500.
- The 'Risk-free rate' is the annualized US 3 months T-Bill rate.
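
The portfolio excess metrics can be illustrated with three aligned series of yearly returns. The numbers and variable names below are hypothetical.

```python
import statistics

# Hypothetical yearly returns, in percent, for the same four years.
portfolio = [12.0, -3.0, 9.0, 6.0]
benchmark = [10.0, -5.0, 11.0, 4.0]  # e.g. the S&P 500
risk_free = [1.0, 1.0, 2.0, 2.0]     # e.g. the annualized US 3-month T-Bill rate

n_years = len(portfolio)

exc_avg_r_1y_on_bench = statistics.mean(portfolio) - statistics.mean(benchmark)
exc_avg_r_1y_on_rf = statistics.mean(portfolio) - statistics.mean(risk_free)

# Percentage of years in which the portfolio beat the reference series.
winning_pct_1y_on_bench = 100 * sum(p > b for p, b in zip(portfolio, benchmark)) / n_years
winning_pct_1y_on_rf = 100 * sum(p > f for p, f in zip(portfolio, risk_free)) / n_years
```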

Scores and Ratings

Score
- The 'Score' is the origin of the 'Rating'. Statistically speaking, the 'Score' is computed as '1 / p-value', where the p-value comes from a statistical test that differs according to the distribution of returns.
- There are 2 types of scores:
  - Score on zero: (for 'anomalies' and 'portfolios') measures how significantly a set of returns differs from a set of returns with zero mean.
  - Score on others: (for 'anomalies' only) measures how significantly a set of returns differs from the 'other trades' set of returns (explained above with the Apple TDW example).
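
To make the '1 / p-value' idea concrete, here is a sketch that uses an exact two-sided sign test as a stand-in. The article does not say which statistical tests are actually used, so the choice of test, the function names, and the sample returns are all assumptions made for illustration only.

```python
from math import comb


def sign_test_p_value(returns):
    """Two-sided exact sign test against zero (illustrative stand-in test)."""
    nonzero = [r for r in returns if r != 0]
    n = len(nonzero)
    positives = sum(r > 0 for r in nonzero)
    k = min(positives, n - positives)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def score(p_value):
    """Score as defined above: 1 / p-value."""
    return 1 / p_value


# Hypothetical trade returns: 9 positive, 1 negative.
returns = [1.2, 0.8, 2.1, 0.4, 1.5, 0.9, 1.1, 0.3, 0.7, -0.2]
p_value = sign_test_p_value(returns)
score_on_zero = score(p_value)
```

The smaller the p-value, the larger the Score: a p-value of about 0.02 maps to a Score of about 47, while a p-value of 0.5 maps to a Score of 2.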

There are 3 types of Ratings:

Rating
- is the 'Rating' derived from the 'Score' and goes from 0 to 5 stars.
- The higher the Rating, the more the returns are statistically significantly different from zero or from another set of returns.
- The Rating is assigned according to a clustering of the 'Score' values.
Rating FC
- it's a variation of the Rating.
- It is computed as the weighted average of 3 Ratings given in the same period (In-Sample, Out-of-Sample, or Entire-Backtest). The formula is: 0.4 'Rating Net' + 0.4 'Rating on Others' + 0.2 'Rating Gross'.
Rating FC summary
- it is a Rating derived from the 'Rating FC' computed over the entire backtest returns. The formula is like: 0.4 'Rating Net All' + 0.4 'Rating on Others All' + 0.2 'Rating Gross All'.
- I have written 'is like' because, in addition, the final formula gives:
  - a bonus to the Anomalies that were able to keep their good 'In-Sample' performance in the 'Out-of-Sample' period;
  - a malus (penalty) in the opposite case.
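
The weighted average behind the 'Rating FC' can be sketched as below. The function name and the sample ratings are hypothetical. The bonus/malus adjustment of the 'Rating FC summary' is not specified in the article, so it is deliberately left out.

```python
def rating_fc(rating_net, rating_on_others, rating_gross):
    """Weighted average of three Ratings from the same period (0 to 5 stars each)."""
    return 0.4 * rating_net + 0.4 * rating_on_others + 0.2 * rating_gross


# Hypothetical ratings for one period.
example = rating_fc(rating_net=5, rating_on_others=4, rating_gross=3)
```

Since the weights sum to 1, the result stays within the same 0 to 5 range as the input Ratings.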