Predicting Flood Damage of Northwest Florida Counties using Random Forest Regression Modeling - Spring 2026
Introduction
Flooding is caused by many mechanisms, including heavy rainfall that causes rivers to overflow, impervious surfaces that reduce water infiltration, and oversaturated soil (Javadinejad (2022); Lecce (2000)). Flash floods are considered one of the most severe natural disasters due to their rapid onset, unpredictability, infrastructure damage, and associated fatalities (Diakakis et al. (2020); Sadkou et al. (2024)). Infrastructure and agricultural damage costs increased steadily over the twentieth century (Saharia et al. (2017)). According to the IPCC, the severity and frequency of flood-causing severe storms will likely increase due to climate change (Intergovernmental Panel on Climate Change (2021); Halsnæs et al. (2023)). Due to Northwest Florida’s location relative to the Gulf of Mexico and major hurricane paths, most flooding research has focused on sea-level rise and coastal flooding, leaving inland flood risk understudied (Bilskie et al. (2014); Morss et al. (2024)). Tracking floods is important because they contribute to human deaths, property damage, ecosystem changes, and socioeconomic losses (W. Chen et al. (2020)). Growing evidence that machine learning models can accurately predict storm severity and susceptibility highlights a major opportunity to apply these models to Northwest Florida (Gensini et al. (2021); Akshay et al. (2021); Z. Chen et al. (2022)).
Random Forest (RF) is a machine learning algorithm that predicts a dependent variable from a set of independent variables (Salman, Kalakech, and Steiti (2024)). RF is an ensemble model built from many simple models known as decision trees (Salman, Kalakech, and Steiti (2024)). Errors produced during random forest training can be counteracted by increasing the number of samples and the number of trees (Wang et al. (2015)). Additionally, each decision tree’s selection of data is random, which prevents the model from memorizing the data (Salman, Kalakech, and Steiti (2024)). In a study in China, RF allowed the importance of each variable to be assessed and the contribution of each index to the overall risk to be calculated (Wang et al. (2015)). RF is well suited to hydrological studies because it can map flood risk for hazard assessment at a lower cost than traditional rainfall-runoff models (Zhu and Zhang (2022)). For instance, a machine learning study in 2024 using 16 flood risk factors (FRFs), including historic flood damage locations, found that several high and very high flood risk areas closely overlapped with FEMA’s 100-year floodplain in Tampa Bay (Dey et al. (2024)). In another hydrological study, RF reproduced the characteristics of the studied flood events and represented daily flood discharge well in comparison with hydromad modeling (Schoppa, Disse, and Bachmair (2020)). As research progresses, identifying key driving factors and validating them with machine learning has become central to precise flood prediction (Tan et al. (2024)).
Northwest Florida, often referred to as the Florida Panhandle, is home to 16 counties: Escambia, Santa Rosa, Okaloosa, Walton, Holmes, Washington, Bay, Jackson, Calhoun, Gulf, Liberty, Franklin, Gadsden, Leon, Wakulla, and Jefferson. Since January 1996, these counties have had 1,052 floods, 790 of which were flash floods and 262 non-flash floods. Flooding has caused approximately $656 million in property damage across the counties, with Escambia County ranking first at approximately $152 million since 1996 (National Oceanic and Atmospheric Administration (2024)). This study uses RF to predict flood damage in Northwest Florida based on housing units, population, median home value, flood type (flash flood vs. flood), precipitation (1-day, 3-day, and 7-day cumulative amounts), and county. This type of RF modeling has not been applied at the county level in Northwest Florida, and most flooding research there focuses on coastal flooding because the counties lie in major hurricane paths. County-level modeling matters because flooding is not uniform across counties and mitigation planning is an important part of preparing for natural hazards.
Methods
Random Forest is a supervised machine learning algorithm that introduces two sources of randomness during training. First, it uses bootstrap sampling of the training data, meaning data points are randomly selected with replacement so that the same data point may appear multiple times in a tree’s training sample, while others may not appear at all. Second, at each split within a tree, a random subset of predictor variables is considered when determining the best split. A tree continues to split until further splitting stops improving prediction or a preset depth limit is reached (Dutta, Paul, and Kumar (2023)).
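The two sources of randomness described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in this study: the function names, the feature labels, the row count of 200, and the choice of two candidate features per split are all hypothetical.

```python
import random

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one tree's training sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def out_of_bag(n_rows, in_bag):
    """Rows that were never drawn for this tree."""
    drawn = set(in_bag)
    return [i for i in range(n_rows) if i not in drawn]

def candidate_features(features, k, rng):
    """Pick k predictors at random (without replacement) to consider at one split."""
    return rng.sample(features, k)

rng = random.Random(42)  # fixed seed so the sketch is repeatable
features = ["rain_1d", "rain_3d", "rain_7d",
            "population", "housing_units", "median_home_value"]  # hypothetical labels
n_rows = 200  # hypothetical number of flood events

in_bag = bootstrap_indices(n_rows, rng)
oob = out_of_bag(n_rows, in_bag)
split_features = candidate_features(features, 2, rng)

print(len(in_bag), len(set(in_bag)), len(oob))
print(split_features)
```

Because sampling is done with replacement, some rows appear more than once in `in_bag` and others land in `oob`, which is why each tree sees a slightly different version of the data.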
In Random Forest, each tree starts at a root node containing all sampled data points. It then chooses a variable on which to split the data (e.g., flood versus flash flood) and divides the sampled data points by that criterion (Rigatti (2017)). The tree then creates further splits on either side of the original split and reorganizes the data points according to each new criterion, as seen in Figure 1. Each decision node corresponds to a binary split on an explanatory variable, such as rainfall amount, flood type, or number of housing units. The outcomes of these decisions are combined by the algorithm to produce a final prediction of flood damage for the Northwest Florida counties.
Random forest is a flexible, extensible, and straightforward machine learning method. It grows hundreds of decision trees and uses binary splitting to determine the best split among the candidate variables. Each tree is trained with its own random component, and a splitting function selects the best-fitting variable at each node. Random forest can handle binary, categorical, and numeric features and can be used for classification, clustering, and regression analysis. By combining many decision trees in an ensemble, the method extracts and applies information from the data effectively (Speiser et al. (2019)).
Random forest regression can handle large datasets and compute the importance of each variable. It can also handle high-dimensional data, since it builds regression trees with binary splits to predict the outcome. These properties make random forest suitable for flood damage prediction, where variable selection and the statistical characteristics of the data help identify the most useful predictors. The three datasets chosen for our analysis are combined to predict the outcome of interest, flood damage (Haddouchi and Berrado (2019)).
Formula
Mean squared error (MSE) is used in random forest regression to determine the optimal splits during training and to evaluate model performance. The lower the MSE, the greater the model accuracy. The difference between each actual value yᵢ and predicted value ŷᵢ is squared, (yᵢ − ŷᵢ)². The summation symbol Σ adds these squared errors over all n observations, and multiplying by 1 over the number of observations averages the total squared error (Hodson, Over, and Foks (2021)).
\[ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
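To make the role of MSE in choosing a split concrete, the sketch below computes the formula directly and then searches one predictor for the threshold that gives the lowest weighted MSE across the two child nodes. The midpoint search, the function names, and the rainfall and log-damage values are illustrative assumptions, not the procedure or data of this study.

```python
def mse(actual, predicted):
    """Mean squared error: the average of the squared differences."""
    n = len(actual)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / n

def node_mse(values):
    """MSE of a node that predicts the mean of its own target values."""
    mean = sum(values) / len(values)
    return mse(values, [mean] * len(values))

def best_split(x, y):
    """Try midpoints between sorted unique x values.

    Returns (threshold, weighted child MSE); threshold is None when no
    split improves on the unsplit node.
    """
    xs = sorted(set(x))
    best = (None, node_mse(y))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        score = (len(left) * node_mse(left) + len(right) * node_mse(right)) / len(y)
        if score < best[1]:
            best = (t, score)
    return best

# Made-up values: 1-day rainfall (inches) and log of total damage.
rain = [0.5, 1.0, 1.5, 4.0, 5.0, 6.0]
log_damage = [2.0, 2.2, 2.1, 5.0, 5.3, 5.1]

threshold, score = best_split(rain, log_damage)
print(threshold, round(score, 4))
```

In this toy data the low-rainfall and high-rainfall events have clearly different damage levels, so the search places the threshold between the two groups.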
Assumptions
- Each tree makes its own decisions.
- Data points are chosen randomly to reduce mistakes.
- Sufficient data ensures the trees have unique patterns and variety.
- Combining the predictions from different trees leads to a more accurate final result.
Advantages
- Easy to identify importance of features
- Multiple decision trees tend to cause little risk of overfitting (not too close to original training data)
- Very accurate due to multiple subgroups
- Less need for data management (pre-processing to perform random forest)
Limitations
- Slower to acquire observations for best decision making
- Bound by the lowest and highest observed values (cannot extrapolate)
- Difficult to interpret the decisions behind the ‘best features’ chosen by the trees
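The extrapolation limit can be illustrated with a toy forest of one-split trees ("stumps"). This is a sketch under simplifying assumptions: the stump-fitting helper, the 50-tree forest, and the rainfall and damage values are hypothetical and far smaller than a real random forest. Each leaf predicts the mean of the training targets that fall into it, so the averaged prediction cannot exceed the largest observed target.

```python
import random

def fit_stump(x, y):
    """One-split regression tree: choose the threshold with the lowest squared error."""
    xs = sorted(set(x))
    best = None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # bootstrap sample contained a single unique x value
        mean_y = sum(y) / len(y)
        return (float("inf"), mean_y, mean_y)
    return best[1:]  # (threshold, left leaf mean, right leaf mean)

def predict_forest(stumps, x_new):
    """Average the leaf means selected by each stump."""
    preds = [lm if x_new <= t else rm for t, lm, rm in stumps]
    return sum(preds) / len(preds)

rng = random.Random(0)
rain = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]          # made-up rainfall (inches)
damage = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]  # made-up damage, rising with rainfall

stumps = []
for _ in range(50):
    idx = [rng.randrange(len(rain)) for _ in rain]  # bootstrap sample
    stumps.append(fit_stump([rain[i] for i in idx], [damage[i] for i in idx]))

just_outside = predict_forest(stumps, 6.5)
far_outside = predict_forest(stumps, 20.0)
print(just_outside, far_outside)
```

Even though damage rises steadily with rainfall in the training data, a 20-inch event receives the same prediction as a 6.5-inch event, and neither exceeds the largest damage value seen in training.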
Analysis and Results
Data Sources:
Data for this project come from three sources: the NOAA Storm Events Database; daily precipitation observations from the Global Historical Climatology Network Daily (GHCN-Daily); and total population, total housing units, and median home value from the American Community Survey 5-year estimates. Variables utilized in this study are listed in Table 1, including the source for each variable.
| Variable | Description | Unit | Source |
|---|---|---|---|
| County | County where the storm event occurred | County name | NOAA Storm Events Database |
| Date | Date of the storm event | Date | NOAA Storm Events Database |
| Flood Type | Classification of event (Flood or Flash Flood) | Category | NOAA Storm Events Database |
| Property Damage | Estimated property damage caused by the storm | USD ($) | NOAA Storm Events Database |
| Rainfall (1-day) | Total precipitation during the storm day | Inches | NOAA GHCN-Daily |
| Rainfall (3-day) | Total precipitation during storm day and previous two days | Inches | NOAA GHCN-Daily |
| Rainfall (7-day) | Total precipitation during storm day and previous six days | Inches | NOAA GHCN-Daily |
| Population | Total county population | Persons | U.S. Census Bureau ACS |
| Housing Units | Total number of housing units in the county | Count | U.S. Census Bureau ACS |
| Median Home Value | Median value of owner-occupied housing units | USD ($) | U.S. Census Bureau ACS |
Data Processing
Data from the NOAA Storm Events Database was collected for the 16 counties in Northwest Florida. Property damage and crop damage amounts were combined to create Total Damage. Flood damage amounts ranged from $0 to hundreds of millions of dollars and many of the flood events resulted in $0 damage. To account for the numerous zeros, the log of total damage was used in the random forest models.
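Because the logarithm of zero is undefined, a log transform of a damage column that contains $0 events needs some adjustment. The text does not state which variant was used, so the sketch below assumes the common log(1 + x) form; the function names and damage values are hypothetical.

```python
import math

def log_damage(total_damage_usd):
    """log(1 + damage): maps $0 to 0.0 instead of failing on log(0)."""
    return math.log1p(total_damage_usd)

def to_dollars(log_value):
    """Invert the transform so predictions can be reported in dollars."""
    return math.expm1(log_value)

damages = [0.0, 5_000.0, 250_000.0, 120_000_000.0]  # made-up values
transformed = [log_damage(d) for d in damages]

for d, t in zip(damages, transformed):
    print(f"${d:>15,.0f} -> {t:.3f}")
```

The transform compresses a range spanning eight orders of magnitude into values between 0 and roughly 19, which keeps a few very large events from dominating the squared-error criterion. Any error metric computed on the transformed target is in log units rather than dollars.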
Data from the Global Historical Climatology Network Daily was collected and precipitation days without data were given a zero for precipitation. Precipitation data was already in inches so no further calculations were required.
Wakulla County did not have any precipitation gauges in the county, so precipitation data was collected from the four surrounding counties (Jefferson, Liberty, Leon, Franklin). All days were averaged for the four counties to offer a representation of Wakulla County. In addition, new columns were created for 1-day, 3-day, and 7-day average rainfall prior to a known flood event.
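The neighbor averaging and the rainfall windows can be sketched as follows. The daily values are made up, the function names are hypothetical, and the windows follow the definitions in Table 1 (totals over the storm day plus the preceding two or six days), which is an assumption about how the columns were built.

```python
def neighbor_average(series_by_county):
    """Average same-day precipitation across counties (equal-length lists)."""
    n_days = len(next(iter(series_by_county.values())))
    return [
        sum(series[d] for series in series_by_county.values()) / len(series_by_county)
        for d in range(n_days)
    ]

def trailing_total(daily, day_index, window):
    """Sum precipitation over the event day and the (window - 1) days before it."""
    start = max(0, day_index - window + 1)
    return sum(daily[start:day_index + 1])

# Made-up daily precipitation (inches) for eight consecutive days.
neighbors = {
    "Jefferson": [0.0, 0.0, 1.0, 2.0, 0.0, 0.0, 4.0, 0.0],
    "Liberty":   [0.0, 1.0, 1.0, 2.0, 0.0, 0.0, 2.0, 0.0],
    "Leon":      [0.0, 0.0, 2.0, 2.0, 0.0, 0.0, 4.0, 0.0],
    "Franklin":  [0.0, 1.0, 0.0, 2.0, 0.0, 0.0, 2.0, 0.0],
}

wakulla = neighbor_average(neighbors)
event_day = 6  # index of a hypothetical flood event

rain_1d = trailing_total(wakulla, event_day, 1)
rain_3d = trailing_total(wakulla, event_day, 3)
rain_7d = trailing_total(wakulla, event_day, 7)
print(wakulla)
print(rain_1d, rain_3d, rain_7d)
```

Note that treating missing precipitation days as zero, as described above, lowers both the neighbor average and the trailing totals on those days.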
Once the spreadsheets from all data sources were cleaned, they were merged into one spreadsheet using county name.
Data Visualization
According to the data collected from NOAA, Northwest Florida experienced 1,203 flooding events between January 1, 2000 and December 31, 2025. Flooding event distribution is represented in Figure 2. Of the flooding events, 77.1% are categorized as flash floods and 22.9% are categorized as floods. This data indicates that Northwest Florida counties experience short-duration, high-intensity rainfall events more often than long-duration rainfall events.
Further breaking down the dataset, Figure 3 illustrates the number of flash flood and flood events by county. Bay County experienced the highest number of flooding events, with over 150 flash floods and approximately 30 flood events recorded between January 1, 2000 and December 31, 2025. In contrast, Liberty County experienced the fewest flood events, with fewer than 25 flash floods and fewer than 10 flood events during the same time period. This pattern is likely influenced by each county’s geographic profile. Bay County has 708 square miles of water while Liberty County has only 7.6 square miles of water (U.S. Census Bureau (2024)).
As discussed in the introduction, rainfall events are forecasted to increase in frequency and intensity. The NOAA data set supports this pattern. In Figure 4, yearly flood event totals and total damage are plotted from January 1, 2020 to December 31, 2025. The right y-axis represents property damage in millions of dollars, and the left y-axis represents the number of flood events. The orange trend line indicates that the number of flood events is increasing (higher frequency). The red trend line shows that total property damage is slightly elevated across the 25-year period.
Table 2 presents the demographic variable totals obtained from the American Community Survey 5-year estimates. The table includes the population size, number of housing units, and median home value for each of the 16 Northwest Florida counties from the 2024 survey.
Table 2: Population, Housing Units, and Median Home Value by County, Northwest Florida Counties (ACS 5-Year Estimates - 2024)
| County | Population | Housing Units | Median Home Value ($) |
|---|---|---|---|
| Escambia County | 325,923 | 149,217 | $257,200 |
| Leon County | 297,542 | 137,793 | $301,800 |
| Okaloosa County | 216,599 | 103,864 | $351,200 |
| Santa Rosa County | 198,472 | 80,002 | $329,800 |
| Bay County | 186,393 | 108,362 | $310,500 |
| Walton County | 82,948 | 61,272 | $425,100 |
| Jackson County | 48,250 | 20,145 | $120,800 |
| Gadsden County | 43,710 | 19,144 | $174,900 |
| Wakulla County | 35,387 | 14,599 | $258,300 |
| Washington County | 25,529 | 11,044 | $171,000 |
| Holmes County | 19,513 | 8,659 | $110,200 |
| Gulf County | 15,131 | 9,362 | $250,000 |
| Jefferson County | 15,091 | 6,886 | $232,600 |
| Calhoun County | 13,492 | 5,683 | $145,300 |
| Franklin County | 12,553 | 8,562 | $273,300 |
| Liberty County | 7,687 | 3,238 | $127,300 |
Accurate representation of property damage is essential to creating an accurate and usable model. Due to inflation, property damage values from 20 years ago are not an accurate representation of present-day dollars. To improve accuracy, the NOAA data utilized for the model was reduced to the past six years, starting in January 2020. The sample size in this date range was large enough for the Random Forest model while still portraying meaningful property damage amounts.
To strengthen the model, rainfall data was collected from the Global Historical Climatology Network Daily (GHCN-Daily). Every county, with the exception of Calhoun County, had precipitation gauges. For Calhoun County, precipitation data was collected from the four surrounding counties (Bay, Gulf, Jackson, and Liberty Counties) and averaged per day. Figure 5 illustrates the monthly average rainfall per county in Northwest Florida.
As shown in Figure 5, rainfall amounts are highest during the summer months. However, Figure 6 indicates that flood events occur more frequently in September and April.
To investigate this difference, daily rainfall maximums were calculated per month (Figure 7). This figure shows that daily rainfall maximums are highest in September, suggesting that September experiences high-intensity but less frequent rainfall events, which may contribute to more frequent flooding events.
Random Forest
The first model used multiple variables to predict total damage from a known flood event. The variables were 1-day prior rainfall, 3-day prior rainfall, 7-day prior rainfall, population, housing units, and median home value. The R² and RMSE for the first model were 0.001 and 2.748, respectively. These numbers suggest that the model was unable to learn anything meaningful, possibly due to the number of flood events with zero damage.
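For readers less familiar with these two metrics, the sketch below shows one standard way to compute them. The text does not state which definition of R² was used, so the residual-based form here is an assumption, and the values are made up. It also shows why an R² near zero means the model does little better than predicting the average damage for every event.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    n = len(actual)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n)

def r_squared(actual, predicted):
    """1 - SS_res / SS_tot: share of target variance explained by the predictions."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot

actual = [8.0, 10.0, 12.0, 14.0]  # made-up log-damage values
mean_only = [11.0] * len(actual)  # a "model" that always predicts the mean

print(r_squared(actual, mean_only), rmse(actual, mean_only))
print(r_squared(actual, actual), rmse(actual, actual))
```

A model that always predicts the mean scores an R² of exactly 0, and a perfect model scores 1, so the values reported in this section sit very close to the mean-only baseline.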
To verify our assumption that the zeros in total damage were the issue with the first model, the second model kept only events with property damage greater than zero. The second model used the same predictors: 1-day prior rainfall, 3-day prior rainfall, 7-day prior rainfall, population, housing units, and median home value. The second model performed better than the first model, resulting in an R² and RMSE of 0.047 and 1.922, respectively. This suggests that removing the flood events with zero damage helped the model learn the relationship between rainfall, county characteristics, and damage more accurately. However, the model was still not predicting total damage very well.
To further improve the second model, the third model included exposure variables representing the people and structures exposed to rainfall. To calculate these new variables, rainfall was multiplied by population and by housing units. The third model still included 1-day prior rainfall, 3-day prior rainfall, 7-day prior rainfall, population, housing units, and median home value. The R² and RMSE for model three were 0.017 and 2.059, respectively. These results suggested that the new variables did not improve total damage prediction and instead introduced noise to the model.
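The exposure variables are simple products of existing columns. The sketch below is illustrative only: the text does not say which rainfall window was multiplied, so 1-day rainfall is assumed, and the function name, column names, and 2.5-inch rainfall value are hypothetical. The population and housing unit figures are Bay County's values from Table 2.

```python
def add_exposure_features(event):
    """Return a copy of the event with rainfall x population and rainfall x housing terms."""
    out = dict(event)
    out["rain_x_population"] = event["rain_1d"] * event["population"]
    out["rain_x_housing"] = event["rain_1d"] * event["housing_units"]
    return out

event = {
    "county": "Bay",
    "rain_1d": 2.5,            # made-up rainfall (inches)
    "population": 186_393,     # Table 2
    "housing_units": 108_362,  # Table 2
}

with_exposure = add_exposure_features(event)
print(with_exposure["rain_x_population"], with_exposure["rain_x_housing"])
```

Since population and housing units are constant within a county, each product is the rainfall column rescaled by a county-specific factor, which may help explain why these variables added little new information.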
As learned in model three, a complex model is not always the best model. Model four was simplified by utilizing 1-day prior rainfall, population, housing units, median home value, and only storms with total damage over zero dollars. The 3-day and 7-day rainfall variables were removed because past models listed those variables as least important. The 1-day rainfall variable remained useful in all models, as did housing units, population, and median home value. The R² and RMSE for model four were 0.036 and 1.959, respectively. This result suggested that short-term rainfall was more useful for predicting total damage than longer rainfall accumulations.
Overall, the most useful improvement was removing flood events with zero total damage and simplifying the predictor set. However, the fourth model was still limited in its predictive power, which is understandable since flood damage can be influenced by many other factors not included in the datasets used for this study. For example, soil type, permeability, and elevation can influence flood damage. To show the progression of our models, all R² and RMSE values can be seen in Figure 8.
Though the models did not have strong predictive power for total damage, the predicted damage by county could be calculated. In Figure 9, the top counties and their predicted total damage can be seen.