Lessons from J.P. Finley Part 2:
The Finley Forecasting Problem and Implications for Modern AI
Recap
In part 1 of this 2-part post, we explored John Park Finley’s experimental tornado forecasting method that he devised and tested in the spring of 1884. From March through late June, Finley produced twice-daily tornado forecasts for different regions of the central and eastern CONUS. At the end of the experiment, he calculated his verification statistics and was pleased to find an accuracy of 96.6%. Even by today’s standards this is an impressive result. Based on analyses by Forecast Advisor, most major weather groups (NWS, AccuWeather, The Weather Channel, etc.) have forecast accuracies between 75% and 93% over a 1-month period for just basic forecast variables like temperature, rainfall, or wind speed.
Pushback
Naturally, a statistic this good drew skepticism, and perhaps rightfully so. One critic, geologist G. K. Gilbert, pointed out that “the occurrence of tornadoes in any one of the districts… is highly exceptional and that their non-occurrence is the rule” (Galway 1984). In essence, Gilbert pointed out that in rare event forecasting you can achieve impressive results by simply saying “nothing will happen”.
To illustrate this point, let’s imagine that we’re tasked with creating a model that predicts your chances of winning the lottery for any given ticket. According to a report by CNBC, the odds of winning the Mega Millions jackpot are 1 in 302,575,350. That’s a 0.00000033% chance per ticket, based on the number of winning combinations (1) compared to the number of total possible combinations (a lot). Using these statistics, we can simply have our model output the baseline probability that any given ticket will win and achieve an accuracy of 99.9999996695%. On paper that’s a fantastic model! In reality, it tells you absolutely nothing new or useful beyond the baseline statistics.
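The arithmetic above is easy to check in a few lines. A minimal sketch, using the CNBC odds figure quoted above:

```python
# Baseline "always predict a loss" model for the Mega Millions example.
# The odds figure is the CNBC value quoted above.
ODDS = 302_575_350                  # 1 winning combination in ~302.6 million

p_win = 1 / ODDS                    # ~0.00000033% chance per ticket
baseline_accuracy = 1 - p_win       # accuracy of always predicting "no win"

print(f"P(win) per ticket: {p_win * 100:.8f}%")
print(f"Baseline accuracy: {baseline_accuracy * 100:.10f}%")
```

The "model" never looks at the ticket at all, which is exactly why its sky-high accuracy carries no information.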
This is what Gilbert set out to demonstrate. By his calculations, if Finley had simply issued a “no tornado” forecast every single time, he would have achieved an accuracy of 98.2%. Gilbert then found that, by omitting all of Finley’s “no tornado” forecasts, accuracy fell to only 23%. As discussed in part 1, Finley’s approach does have merit and is quite impressive considering the state of the science in the late 19th century; however, his methodology simply isn’t sophisticated enough to reliably capture tornado events.
Takeaway #1: Class Imbalance
Nonetheless, Finley’s experiment still holds important lessons for today’s meteorologists - especially those developing supervised machine learning methods to forecast severe weather. The first, and perhaps most important, lesson is the need for balanced classes. What exactly do I mean by this? In supervised machine learning, the developer collects data pertaining to two or more classes (e.g., winning or not winning the lottery, or the occurrence versus non-occurrence of a tornado) to create a training dataset. In a well-behaved dataset, the classes are roughly equal in size, with a frequency ratio of approximately 1:1. In rare event forecasting, this frequency ratio can be highly skewed, as in the lottery example where the ratio of losing tickets to winning tickets is 302,575,350:1. Likewise, in tornado forecasting, the ratio of “no tornado occurred” to “a tornado occurred” can be enormous given how truly rare tornadoes are on a day-to-day basis across the world.
To elaborate on this point, consider the following scenario. You're a researcher who wants to develop a machine learning model that predicts tornadoes based purely on surface temperature, dewpoint, and wind speed. Ignoring the fact that these would be woefully insufficient predictors of tornado potential, let's assume this is enough data to develop a model. Assuming you're using hourly gridded data with a resolution of 40 km across the CONUS, this amounts to approximately 1000 grid points per hour. 2024 saw the second-highest number of tornadoes on record, so let's use the entire calendar year of 2024 as our training dataset. That gives you 8760 hours' worth of data, or approximately 8,760,000 grid-point samples for building a training dataset. If we again make the questionable, best-case assumption that each tornado report in 2024 occurred in its own unique grid-point hour, this gives you only 1796 grid-point hours that represent a tornado event. Conversely, you would have 8,758,204 non-tornado grid-point hours - a ratio of approximately 4877:1!
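The back-of-envelope numbers in this scenario can be reproduced directly (all figures are the scenario's stated assumptions, not measured values):

```python
# Class-imbalance estimate for the hypothetical tornado model above.
# All figures are the scenario's assumptions, not measured values.
GRID_POINTS_PER_HOUR = 1000   # rough count for a 40 km grid over the CONUS
HOURS = 8760                  # one calendar year of hourly data
TORNADO_SAMPLES = 1796        # one unique grid-point hour per 2024 tornado report

total_samples = GRID_POINTS_PER_HOUR * HOURS          # 8,760,000
null_samples = total_samples - TORNADO_SAMPLES        # 8,758,204
imbalance = null_samples / TORNADO_SAMPLES            # ~4877:1

print(f"{total_samples:,} samples, {null_samples:,} nulls, ~{imbalance:.0f}:1")
```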
This disparity between occurrence and non-occurrence can have significant impacts on a machine learning model if left unchecked. Common machine learning techniques such as random forests or neural networks will quickly learn that they can achieve high accuracy by simply issuing a “non-occurrence” forecast, similar to our lottery model from earlier. In fact, high levels of accuracy should be a warning sign to developers that their rare-event model may be useless in practical application.
The Solution
So how do we avoid this problem? It all comes down to balancing the classes in the training dataset. First of all, and perhaps obviously, developers will want to collect and retain as many rare events as possible. This will be the primary limiting factor for how much data your model will have to train and test with (in other words, how well your model can learn). Secondly, you’ll want to collect a similar quantity of null events with a focus on the quality of those null events. Null event samples that are trivial may provide little to no utility to the machine learning model and could result in poor model performance. For example, let’s say we’re collecting model soundings for tornado and non-tornado events for a random forest model. Soundings that feature zero CAPE and zero low-level SRH are easy to distinguish as “non-tornado” soundings by humans and will similarly be easily distinguishable to a machine learning algorithm. Instead, we’re interested in soundings with large CAPE and shear that are much more difficult to distinguish as a “tornado” or “non-tornado”. Those are high-quality null events that should be leveraged for training a machine learning model. Once the events are collected, the frequency ratio of the event and non-event classes should be approximately balanced with a ratio around 1:1. From here, over-sampling or k-fold cross validation techniques can be utilized if the overall sample size remains small (exactly what is considered “small” is typically dependent on the application).
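As a sketch of the balancing step described above, here is a minimal random-undersampling routine in plain Python. The function name and toy data are illustrative, and the event-quality screening discussed above would happen before this step:

```python
import random

def balance_classes(samples, labels, seed=42):
    """Undersample the majority (null) class to a ~1:1 ratio.

    A minimal sketch: a real pipeline would first screen null events
    for quality (e.g. keep high-CAPE/high-shear non-tornado soundings)
    and might oversample instead if the event class is very small.
    """
    rng = random.Random(seed)
    events = [s for s, y in zip(samples, labels) if y == 1]
    nulls = [s for s, y in zip(samples, labels) if y == 0]
    keep = rng.sample(nulls, min(len(events), len(nulls)))
    balanced = [(s, 1) for s in events] + [(s, 0) for s in keep]
    rng.shuffle(balanced)
    return balanced

# Toy example: 5 "tornado" events vs. 100 null events -> 5 of each class
data = [(f"event_{i}", 1) for i in range(5)] + [(f"null_{i}", 0) for i in range(100)]
balanced = balance_classes([s for s, _ in data], [y for _, y in data])
print(len(balanced), sum(y for _, y in balanced))
```

Discarding nulls at random is the crudest option; in practice you would preferentially keep the "hard" nulls (large CAPE and shear, no tornado) so the model learns the cases that actually matter.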
Takeaway #2: Verification Metrics
Another lesson from Finley’s results is the choice of verification metric. Finley, like many today, focused on accuracy as the best assessment of forecast skill, but this is far from the truth. While accuracy can tell you how often a model predicts the correct outcome, it can be misleading depending on the goal of the forecast (e.g. our lottery example). For rare event forecasting, other metrics such as probability of detection (POD), false alarm ratio (FAR), or critical success index (CSI) (among many others) can give valuable insight into model performance and characteristics that may or may not be desired. For example, an emergency manager may appreciate a model with a very high POD, even with a high FAR, if it means they are never surprised by an adverse weather event. Verification metrics can also be misconstrued to fit narratives. One common claim that’s easy to find in the weather community is that a certain model or forecast service has a high POD - that is, it rarely misses a rare event. That is all well and good, but beware if they don’t mention their FAR!
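These metrics are simple ratios of a 2x2 contingency table. Using the table commonly cited for Finley's 1884 experiment (28 hits, 72 false alarms, 23 misses, 2680 correct nulls), they recover the numbers discussed earlier:

```python
# Contingency-table metrics for Finley's 1884 experiment, using the
# commonly cited counts: hits, false alarms, misses, correct nulls.
hits, false_alarms, misses, correct_nulls = 28, 72, 23, 2680
n = hits + false_alarms + misses + correct_nulls      # 2803 forecasts

accuracy = (hits + correct_nulls) / n                 # ~0.966, Finley's 96.6%
always_no = (correct_nulls + false_alarms) / n        # ~0.982, Gilbert's "no tornado" model
pod = hits / (hits + misses)                          # ~0.55, probability of detection
far = false_alarms / (hits + false_alarms)            # 0.72, false alarm ratio
csi = hits / (hits + false_alarms + misses)           # ~0.23, Gilbert's 23% figure

print(f"ACC={accuracy:.3f} always-no={always_no:.3f} "
      f"POD={pod:.3f} FAR={far:.2f} CSI={csi:.3f}")
```

Note how one table yields a 96.6% accuracy, a 55% detection rate, and a 72% false alarm ratio all at once - each metric answers a different question.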
So which metric should be used when assessing rare-event forecasts? Unfortunately, this is a question with no clear-cut answer. Literature in the meteorological community gives a variety of possible answers. For example, Hitchens et al. 2013 utilize the Critical Success Index normalized to a practically perfect hindcast to assess forecast skill (the Armchair Forecaster verification routine is partially modeled after this scheme). Other studies, such as Doswell et al. 1990, suggest the Heidke Skill Score is preferable because it takes into account both false alarms and correct nulls and is more difficult to game compared to similar metrics, like the True Skill Statistic. Ebert and Milne 2022 also support the view that the Heidke Skill Score is perhaps the best single metric for assessing forecast performance, though they assert that any skill metric is superior to accuracy alone. Receiver Operating Characteristic (ROC) curves are also popular for visualizing forecast performance, as they allow developers to determine which forecast thresholds are most skillful and to easily compare performance across different models. Precision-Recall curves are similar, but focus only on the rare-event class while ignoring correct nulls. Saito and Rehmsmeier 2015 and Sofaer et al. 2018 argue that this makes a Precision-Recall curve a better method for determining forecast utility when dealing with rare events.
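Both skill scores mentioned above can be computed from the same commonly cited Finley contingency table; note how the Heidke Skill Score, which measures correct forecasts beyond a chance baseline, paints a far less flattering picture than Finley's raw accuracy:

```python
# Heidke Skill Score (HSS) and True Skill Statistic (TSS) for the
# commonly cited Finley table: hits, false alarms, misses, correct nulls.
a, b, c, d = 28, 72, 23, 2680
n = a + b + c + d

# HSS: fraction of correct forecasts beyond what random chance would give
expected_correct = ((a + c) * (a + b) + (b + d) * (c + d)) / n
hss = (a + d - expected_correct) / (n - expected_correct)   # ~0.36

# TSS (Peirce / Hanssen-Kuipers): POD minus probability of false detection
tss = a / (a + c) - b / (b + d)                             # ~0.52

print(f"HSS={hss:.3f} TSS={tss:.3f}")
```

A 96.6% accuracy collapses to a skill score of roughly 0.36 once chance agreement is accounted for, which is precisely the gap Gilbert was pointing at.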
While there does appear to be consensus within the rare-event forecasting community that accuracy is not a robust measure of forecast performance, the jury is still out on the optimal verification metric or technique. In a comprehensive review of rare-event forecasting techniques, Shyalika et al. 2024 point out that evaluation techniques remain a "gap in current literature". Specifically, they warn that "many studies still rely on standard and general evaluation metrics designed for balanced datasets, which is challenging because these metrics fail to accurately reflect model performance in rare events, leading to misleading assessments and suboptimal model improvements".
The lack of clear consensus can be frustrating for AI/ML model developers (although it offers potential opportunities for enterprising researchers!). However, that doesn't mean that determining forecast utility is an entirely fruitless effort; it just means that careful thought must be given to what verification metrics can and can't tell you. A comprehensive approach that utilizes multiple verification metrics and techniques will likely give you the most insight into how your forecast model is performing and when/where it will fail. As a last resort, you can always just use the LGTM (Looks Good To Me, credit to Kelton Halbert) method. Subjective evaluation by experienced users is how scientists with the Spring Forecasting Experiment gauge the performance of new forecast models, and it has proven useful for directing future model development. A combination of subjective assessment with multiple objective verification metrics will likely give you enough evidence of forecast performance to avoid heartbreak when you deploy your latest ML model in real-world scenarios.