Strategy Evaluation and Candidate Selection
Abstract
When evaluating the historical success of a strategy with any given stock, the ratio of recent wins to losses alone does not provide a statistical edge. However, when factors correlating with strategy success are identified, their accuracy over recent history gives valuable insight into the probability of success for a given position. When these correlating factors are measured against one another, there are clear advantages to using some metrics over others. When they are layered and used as decision-making criteria, they become filters with an array of possible combinations. These filters can be used to select candidates (stocks) for strategy implementation. After evaluation and testing across 1-year and 5-year research periods, the filters individually and collectively improved on the baseline, each with independent advantages and disadvantages. Backtesting revealed three successful filters that maintained largely independent candidate selection. A variable selection model that chooses which filter to implement based on recent success increased trade volume and net return while decreasing drawdown and volatility.
Goal
The goal of this research project is to evaluate a stock market strategy's success across an array of stocks and to find correlations between recent success and position outcome. Trends identified within a defined lookback period are then used to select candidates (stocks) in which to take positions.
Hypothesis
Recent strategy success with any given stock can be measured and used as an indicator of probable success.
Terms
Period (p): A length of time.
Buffer (buff): Time added to the lookback period to account for days lost during look-ahead calculations.
Mini Backtest Engine (mini): Simple daily resolution backtest engine built for this project.
Periodic Average Win (pAwin): The average net return during a defined recent period.
Periodic Average Time (pAtime): The average time taken to close a position in a defined recent period.
Count (count): The number of strategy triggers by a stock during a defined recent period.
Ratio (ratio): The measure of how often the correlating factors (pAwin and pAtime) correctly anticipated a trade's outcome during the lookback period.
Filter: Candidate selection criteria assessed at the trigger point of a strategy based on recent historical data.
Evaluation Process
The evaluation process is done in two parts, Research and Backtesting, both of which are coded in Python and utilize NumPy arrays for fast calculations. Data is sourced from TradingView and QuantConnect. Research is conducted in Jupyter notebooks, and Backtesting is completed using QuantConnect's LEAN engine.
Using TradingView, 200 NYSE tickers were pulled with a price greater than $10 and an average daily volume above 1 million shares. This universe stays constant throughout testing even though some tickers do not have 5 years of price history. The strategy trigger is a Lower Bollinger Band break at the open without any additional parameters. This trigger was chosen for its simplicity and to demonstrate how greatly a strategy can be improved with simple historical evaluation. Once triggered, a buy order is simulated and a trailing stop and limit are set using the Average True Range (ATR) of the stock.
Calculation Steps (repeated across all 200 tickers):
1. Pull history at hourly resolution for a period of lookback + buff
2. Calculate the Lower Bollinger Band
3. Consolidate the data into daily resolution
4. Calculate the Average True Range (ATR)
5. Run the data through mini
   - simulates trades at every trigger
   - logs return and time held
6. Evaluate the trade history to calculate pAwin, pAtime, and count
7. Evaluate the result history for trigger date - p
   - calculates ratio
Data is pulled at an hourly resolution to set the Bollinger Band and is then consolidated into a daily resolution while saving the lower band value at the open. ATR is calculated at a daily resolution, and neither the ATR nor the lower band calculation uses smoothing. Using ATR to set the stop and limit values gives each stock a unique setting based on its recent behavior. The data is then sent through mini, which operates at a daily resolution.
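Because the exact Bollinger window and multiplier are not stated above, the sketch below assumes the common 20-bar window with a 2.0 multiplier and a 14-day ATR window; both calculations are left unsmoothed, matching the description.

```python
import numpy as np

def lower_bollinger(close: np.ndarray, window: int = 20, mult: float = 2.0) -> np.ndarray:
    """Unsmoothed lower Bollinger Band: SMA minus mult * rolling std.
    The 20-bar window and 2.0 multiplier are assumed parameters."""
    band = np.full(close.shape, np.nan)
    for i in range(window - 1, len(close)):
        win = close[i - window + 1 : i + 1]
        band[i] = win.mean() - mult * win.std()
    return band

def atr(high: np.ndarray, low: np.ndarray, close: np.ndarray, window: int = 14) -> np.ndarray:
    """Unsmoothed ATR: simple mean of the true range over an assumed 14-day window."""
    prev_close = np.concatenate(([close[0]], close[:-1]))
    true_range = np.maximum.reduce([high - low,
                                    np.abs(high - prev_close),
                                    np.abs(low - prev_close)])
    out = np.full(close.shape, np.nan)
    for i in range(window - 1, len(close)):
        out[i] = true_range[i - window + 1 : i + 1].mean()
    return out
```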
Mini evaluates each row of data and checks for the trigger criteria. When the trigger criteria are met, the engine sets the stop (stop = open - ATR*2) and limit (limit = open + ATR). If the limit is broken, both stop and limit are updated (stop = price - ATR*2, limit = price + ATR). If the stop is broken, the position is closed, and the percentage return and time held are calculated. For the research portion, each position was given a 60-day window to be closed out; if a position lasts the full 60 days, the closing price is logged as the exit price. This window is narrowed to 20 days for real-time testing in LEAN to ensure that the data picture is as recent as possible.
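A minimal sketch of one simulated trade under that logic is shown below. Two details are assumptions, since the description does not specify them: the ATR is held fixed at its entry value while trailing, and the stop is checked before the limit within each daily bar.

```python
def simulate_trade(opens, highs, lows, closes, atrs, t, max_days=60):
    """One mini trade, assuming the trigger fired at daily bar t.
    Entry at the open; stop/limit set from the ATR at entry.
    ATR held fixed while trailing, and stop checked before limit
    within each bar -- both assumptions, not stated in the text."""
    entry = opens[t]
    a = atrs[t]
    stop, limit = entry - 2 * a, entry + a
    end = min(t + max_days, len(closes) - 1)
    for i in range(t, end + 1):
        if lows[i] <= stop:                 # stop broken: close the position
            return (stop - entry) / entry, i - t
        if highs[i] >= limit:               # limit broken: trail stop and limit
            stop, limit = highs[i] - 2 * a, highs[i] + a
    # 60-day window expired: log the closing price as the exit price
    return (closes[end] - entry) / entry, end - t
```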
Once all trades have been simulated, the history prior to each trade is evaluated to determine pAwin, pAtime, and count. For both research and backtesting, a lookback period of 8 weeks was used. The ratio is then calculated using pAwin and pAtime. This process is repeated across all 200 tickers.
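The per-trigger statistics reduce to simple averages over the lookback window. A sketch is below; the trade-record layout and the reading of "8 weeks" as calendar weeks are assumptions for illustration.

```python
from datetime import timedelta
from statistics import mean

def recent_stats(trades, trigger_date, lookback=timedelta(weeks=8)):
    """pAwin, pAtime, and count over the 8-week lookback preceding a trigger.
    `trades` is an assumed list of (entry_date, pct_return, days_held)
    tuples produced by mini."""
    window = [tr for tr in trades
              if trigger_date - lookback <= tr[0] < trigger_date]
    count = len(window)
    if count == 0:
        return 0.0, 0.0, 0
    p_awin = mean(tr[1] for tr in window)    # average net return
    p_atime = mean(tr[2] for tr in window)   # average days to close
    return p_awin, p_atime, count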
Ratio
In the early stages of this project, the ratio was calculated as the win rate over the lookback period. This proved to be an ineffective metric after testing in the research environment. However, there was a consistent correlation between winning trades and both pAwin and pAtime. If pAwin was positive, there was a higher likelihood that the return would be positive. Losing positions generally had a lower pAtime (closing within 10 days) and winning positions generally had a higher pAtime (closing after 20 days). Combining these two metrics and evaluating their accuracy during the lookback period creates a meaningful measurement; however, it drastically limits the number of candidates selected.
An important note for both the ratio calculation and the filter parameters is that the numbers 0 and 10 were chosen as extremely general goalposts to mitigate the risk of overfitting. They can be changed within a reasonable degree without significantly altering the results.
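The exact ratio formula is not published here, so the sketch below is only one plausible reading of the description above: it counts how often the pAwin/pAtime signal, using the stated goalposts of 0 and 10, matched the actual trade outcome during the lookback.

```python
def calc_ratio(history):
    """Hedged sketch of the ratio. `history` holds, for each trigger in
    the lookback, the (p_awin, p_atime, pct_return) values active at
    that trigger -- an assumed data layout. The ratio is taken as the
    fraction of triggers where the signal (pAwin > 0 and pAtime > 10)
    agreed with the realized outcome."""
    if not history:
        return 0.0
    hits = 0
    for p_awin, p_atime, ret in history:
        predicted_win = p_awin > 0 and p_atime > 10
        if predicted_win == (ret > 0):
            hits += 1
    return hits / len(history)
```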
Filters
Filters are added in layers and simulate decision-making criteria using the information available at and before the trigger point. Each filter accepts or rejects candidates based on their performance history. After thorough testing, the following filters were found most effective in the research environment.
Filter1_1: ratio > 0.25
Filter1_2: pAwin > 0
Filter1_3: count > 1

Filter2_1: pAwin > 0
Filter2_2: count > 1

Filter3_1: pAtime > 10
Filter3_2: pAwin > 0
Filter3_3: count > 1
Filter3_32: ratio > 0.25

*Filter3_32 does not include Filter3_3
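The naming suggests each layer includes the layers above it in its family (so Filter1_3 applies all three Filter1 conditions), with Filter3_32 layering the ratio condition on Filter3_2 rather than Filter3_3. The sketch below encodes the best-performing filters under that reading; the layering interpretation is an assumption.

```python
# Layered filter predicates over the per-candidate stats at the trigger point.
# The layering is inferred from the naming convention above.
FILTERS = {
    "Filter1_3":  lambda ratio, p_awin, p_atime, count: ratio > 0.25 and p_awin > 0 and count > 1,
    "Filter2_2":  lambda ratio, p_awin, p_atime, count: p_awin > 0 and count > 1,
    "Filter3_2":  lambda ratio, p_awin, p_atime, count: p_atime > 10 and p_awin > 0,
    "Filter3_3":  lambda ratio, p_awin, p_atime, count: p_atime > 10 and p_awin > 0 and count > 1,
    "Filter3_32": lambda ratio, p_awin, p_atime, count: p_atime > 10 and p_awin > 0 and ratio > 0.25,
}

def accepts(filter_name, ratio, p_awin, p_atime, count):
    """True if the candidate passes the named filter's layered criteria."""
    return FILTERS[filter_name](ratio, p_awin, p_atime, count)
```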
Research Results
Each of the 4 unique layers improves results independently, but when combined the strategy performs significantly better. While results steadily improve with each layer, overly narrow criteria can drastically limit the number of candidates as well as the potential return. The filters that performed best overall in the research environment were Filter1_3, Filter2_2, Filter3_2, Filter3_3, and Filter3_32. The 1-year test (2023-2024 data) gives a good visual example of expected performance, and these trends were echoed in the 3-year (2021-2024) and 5-year (2019-2024) results.
The baseline simulates taking a position at every opportunity based only on the trigger criteria. While taking all positions (not accounting for position size, redundant positions, etc.) produces the highest overall profit, it has a low average return and a low win rate. The Delta shows that the baseline lost 55% of its total gains to losses while taking 5-10x as many positions as the filters in the 1-year test.
Each filter simulates taking a position only if the candidate meets the filter criteria based on the preceding 8 weeks. While the overall return of the filters is lower than the baseline's, the Delta, Average Return, and Win Rate show significantly less volatility. There are clear advantages and disadvantages to each filter, reflecting a tradeoff between prediction accuracy (ratio) and overall return. In research, Filter3_32 has a high rate of accuracy but a significantly lower overall return than Filter3_2, which accepted the most candidates and took the most positions.
1-year Research
5-year Research
Backtesting in LEAN
QuantConnect’s LEAN engine allows users to program automated strategies, backtest them with data dating as far back as 1998, and deploy them live by connecting to a brokerage. The code developed for research was crafted with LEAN in mind so it could be carried over directly with minor changes. While LEAN has built-in features for indicators, order management, and risk management, the same calculation process as in research was used to ensure consistent measurements. This project utilizes a custom order and trailing-stop risk management system developed for previous projects, which has proven to outperform LEAN's native features in terms of time and computing power.
The initial parameters for backtesting were a starting cash balance of $100,000 and a position size of 5% of total portfolio value. The only condition for taking a position, other than the strategy trigger, is that the portfolio is not already invested in that asset. There is no look-ahead bias in this model, and each backtest takes significant hits in 2020 and during the general market decline of 2022. These periods were left in to accurately represent the research the model was based on and to allow the strategy and each applied filter to be evaluated without distortion.
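As a rough illustration, the sketch below shows how these initial parameters map onto a LEAN algorithm. The tickers, date range, and the trigger_fired helper are placeholders rather than the project's actual implementation.

```python
from AlgorithmImports import *

class FilteredStrategy(QCAlgorithm):
    """Minimal sketch of the backtest setup; the trigger, filter, and
    trailing-stop logic described above is reduced to placeholders."""

    def Initialize(self):
        self.SetStartDate(2023, 1, 1)    # 1-year test window
        self.SetEndDate(2024, 1, 1)
        self.SetCash(100000)             # starting cash balance
        self.position_pct = 0.05         # 5% of total portfolio value
        tickers = ["AA", "F"]            # placeholder; the real universe is 200 NYSE tickers
        self.symbols = [self.AddEquity(t, Resolution.Daily).Symbol for t in tickers]

    def OnData(self, data):
        for symbol in self.symbols:
            # Only condition beyond the trigger: not already invested in the asset
            if self.Portfolio[symbol].Invested:
                continue
            if self.trigger_fired(symbol, data):   # hypothetical helper
                self.SetHoldings(symbol, self.position_pct)

    def trigger_fired(self, symbol, data) -> bool:
        return False  # placeholder for the lower Bollinger Band break check
```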
The 1-year and 5-year backtests reflected the research finding that filters accept only 10-25% of possible candidates. Because of the massive order volume in the take-all baseline tests, the baseline's starting cash balance was increased to $1,000,000 with a position size of 0.5%, which allows it to reflect all possible trades while maintaining a similar position size progression as the filter tests (0.5% of $1,000,000 matches the filter tests' initial $5,000 position).
In the 1-year test the baseline quickly acquired assets until it reached 100% exposure, then held roughly 75% exposure regardless of market conditions. The 1-year filter tests showed an overall decrease in volatility, with exposure decreasing when market conditions were unfavorable. Filter3_2 and Filter3_32 performed significantly worse than research suggested, while Filter1_3, Filter2_2, and Filter3_3 met overall expectations from research. The underperformance of Filter3_2 and Filter3_32 can be attributed to the added condition preventing doubling down on a candidate, which decreased total orders across all filters.
The 5-year backtests confirmed the same trends: Filter1_3, Filter2_2, and Filter3_3 were the most successful. These filters achieved a higher average return than the baseline with a significant decrease in volatility. Filter2_2 achieved the best results overall, and the chart reports generated by QuantConnect give a good visual representation of the difference between the filtered candidates and the baseline.
1-year Backtests
Baseline 1-year Backtest 2023-2024
Filter2_2 1-year Backtest 2023-2024
5-year Backtests
Baseline 5-year Backtest 2019-2024
Filter2_2 5-year Backtest 2019-2024
Variable Filters
Filter2_2 achieved the best overall results in testing. However, Filter1_3 and Filter3_3 consistently beat the baseline while selecting different candidates than Filter2_2. Each of these filters alone had periods where no positions were taken because no candidates fit the criteria. With the goal of increasing the number of position opportunities while maintaining low volatility, an added step was developed to evaluate each filter on how well it performed over the preceding 8 weeks. Each day the algorithm selects the candidates of Filter1_3, Filter2_2, or Filter3_3 based on each filter's average return during the lookback period.
This added step dramatically increased the number of candidates, populating an array of new candidates nearly every trading day. Standalone, the filters might take new positions only once every few days, and there was previously no limit on how many positions the algorithm could take at one time. Initial testing determined that the algorithm now produced too many candidates, which would lead to overexposure, and further testing revealed that taking more than 5 positions at once did not improve results. To solve this, if the algorithm produces more than 5 candidates, only the 5 with the highest individual ratio are selected, as sketched below.
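The two added steps, picking the day's filter by recent performance and capping entries at the 5 highest-ratio candidates, can be sketched as follows. The data layouts (a mapping of filter name to lookback returns, and candidate dicts with a "ratio" key) are assumptions for illustration.

```python
def choose_filter(recent_returns):
    """Pick today's filter by its average return over the preceding 8 weeks.
    `recent_returns` maps a filter name ("Filter1_3", "Filter2_2",
    "Filter3_3") to the returns of its trades during the lookback."""
    def avg(returns):
        return sum(returns) / len(returns) if returns else float("-inf")
    return max(recent_returns, key=lambda name: avg(recent_returns[name]))

def cap_candidates(candidates, max_positions=5):
    """Keep only the 5 candidates with the highest individual ratio."""
    ranked = sorted(candidates, key=lambda c: c["ratio"], reverse=True)
    return ranked[:max_positions]
```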
Using this variable method across the same 1-year and 5-year testing periods resulted in a higher volume of total orders, higher net profit, and an on-par win rate compared with the standalone tests. While the maximum drawdown increased with the higher number of candidates, overall volatility remained moderate compared to the baseline. The 1-year backtest produced a 36% net return with only a 15.9% drawdown while filling 3x the orders of Filter2_2 and increasing the win rate to 42%.
In the initial 5-year backtest, order volume increased too quickly for a $100k portfolio, pushing it to 200% exposure and creating issues with order fulfillment. To present a clear picture of this model's potential while maintaining position sizing and mitigating overexposure, the starting portfolio value was increased to $200k with a position size of 2.5%. The 5-year backtest produced a 66% return with a 29.8% drawdown and an overall win rate of 40%. No additional risk mitigation protocols were added to achieve this result.
This model is now trading live on the QuantConnect platform through a brokerage connection.