Predicting Flood Damage of Northwest Florida Counties using Random Forest Regression Modeling - Spring 2026

Author

Meagan Russell and Alexis Bjornstad (Advisor: Dr. Cohen)

Published

March 29, 2026

Introduction

Flooding is caused by many mechanisms, including heavy rainfall that causes rivers to overflow, impervious surfaces that reduce water infiltration, and/or oversaturated soil (Javadinejad (2022); Lecce (2000)). Flash floods are considered to be one of the most severe natural disasters due to their rapid onset, unpredictability, infrastructure damage, and associated fatalities (Diakakis et al. (2020); Sadkou et al. (2024)). Infrastructure and agricultural damage costs have steadily increased in the twentieth century (Saharia et al. (2017)). According to the IPCC, the severity and frequency of flood-causing severe storms will likely increase due to climate change (Intergovernmental Panel on Climate Change (2021); Halsnæs et al. (2023)). Due to Northwest Florida’s location relative to the Gulf of Mexico and major hurricane paths, most flooding research has focused on sea-level rise and coastal flooding, leaving inland flood risk understudied (Bilskie et al. (2014); Morss et al. (2024)). Tracking floods is important because they contribute to human deaths, property damage, ecosystem changes, and socioeconomic losses (W. Chen et al. (2020)). Growing evidence of machine learning models’ capabilities to accurately predict storm severity and susceptibility highlights a major opportunity to apply these models to Northwest Florida (Gensini et al. (2021); Akshay et al. (2021); Z. Chen et al. (2022)).

Random Forest (RF) is a type of machine learning algorithm that predicts the dependent variable based on independent variables (Salman, Kalakech, and Steiti (2024)). RF is an ensemble model built from many simple models known as decision trees (Salman, Kalakech, and Steiti (2024)). Errors produced by random forest training can be counteracted by increasing the number of samples and the number of trees (Wang et al. (2015)). Additionally, each decision tree’s selection of data is random, which prevents the model from memorizing the data (Salman, Kalakech, and Steiti (2024)). In a flood hazard study in China, RF allowed the importance of each variable to be assessed and the contribution of each index to the overall risk to be calculated (Wang et al. (2015)). RF is well suited to hydrological studies because it can map flood risk for hazard assessment at a low cost compared with traditional rainfall-runoff models (Zhu and Zhang (2022)). For instance, a machine learning study in 2024 using 16 flood risk factors (FRFs), including historic flood damage locations, found that several high and very high flood risk areas closely overlapped with FEMA’s 100-year floodplain in Tampa Bay (Dey et al. (2024)). In another hydrological study, RF was able to reproduce the characteristics of the studied flood events and represented the daily flood discharge well in comparison with the hydromad modeling (Schoppa, Disse, and Bachmair (2020)). As research progresses, identifying key driving factors and validating them via machine learning has become central to precise flood prediction (Tan et al. (2024)).

Northwest Florida, often referred to as the Florida Panhandle, is home to 16 counties: Escambia, Santa Rosa, Okaloosa, Walton, Holmes, Washington, Bay, Jackson, Calhoun, Gulf, Liberty, Franklin, Gadsden, Leon, Wakulla, and Jefferson. Since January 1996, these counties have had 1,052 floods, 790 of which were flash floods and 262 non-flash floods. Flooding has resulted in approximately $656 million in property damage throughout the counties, with Escambia County ranking number one in damage with approximately $152 million since 1996 (National Oceanic and Atmospheric Administration (2024)). This study uses RF to predict flood damage in Northwest Florida based on housing units, population, median home value, flood type (flash flood vs. flood), precipitation (1-day, 3-day, and 7-day cumulative amounts), and county. This type of RF modeling has not been used at a county level in Northwest Florida, and most flooding research there focuses on coastal flooding because the counties lie in major hurricane paths. A county-level model matters because floods are not uniform across counties and mitigation planning is an important part of preparing for natural hazards.

Methods

Random Forest is a supervised machine learning algorithm that introduces two sources of randomness during training. First, it uses bootstrap sampling of the training data, meaning data points are randomly selected with replacement so that the same data point may appear multiple times in a tree’s training sample, while others may not appear at all. Second, at each split within a tree, a random subset of predictor variables is considered when determining the best split. A tree continues to split until further splitting stops improving prediction or a preset depth limit is reached (Dutta, Paul, and Kumar (2023)).
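The two sources of randomness described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in this study; the function names, the predictor labels, and the ten stand-in events are all hypothetical.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: some repeat, some are left out."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(features, k, rng):
    """Pick k distinct predictors to consider at a single split."""
    return rng.sample(features, k)

rng = random.Random(42)   # fixed seed so the sketch is repeatable
rows = list(range(10))    # hypothetical stand-ins for 10 flood events
features = ["rain_1d", "rain_3d", "rain_7d",
            "population", "housing_units", "median_home_value"]

sample = bootstrap_sample(rows, rng)              # training sample for one tree
out_of_bag = sorted(set(rows) - set(sample))      # events this tree never sees
candidates = random_feature_subset(features, 2, rng)  # predictors tried at one split

print("bootstrap sample:", sample)
print("out-of-bag events:", out_of_bag)
print("split candidates:", candidates)
```

Events that a tree never draws are commonly called out-of-bag observations and can serve as a built-in validation set for that tree.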

In Random Forest, each tree starts at a root node containing all sampled data points. It then chooses a variable on which to split the data (e.g., flood versus flash flood) and divides all sampled data points by that criterion (Rigatti (2017)). The tree then creates further splits on either side of the original split and regroups the data points by each new criterion, as seen in Figure 1. Each decision node corresponds to a binary split on an explanatory variable, such as rainfall amount, flood type, or number of housing units. Each sequence of decisions ends in an outcome, and the outcomes of all trees are combined to create a final prediction of flood damage for the Northwest Florida counties.

Figure 1. Conceptual representation of the random forest algorithm
[Diagram: three example trees. Tree 1 splits on "Rainfall > 3 in?" (Yes: High Damage; No: Low Damage). Tree 2 splits on "Flood Type = Flash Flood?" (Yes: High Damage; No: Moderate Damage). Tree 3 splits on "Housing Units > 50,000?" (Yes: Moderate Damage; No: Low Damage). Every leaf feeds the Final Prediction, which is the average of the trees.]
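The three trees in Figure 1 can be written out as a toy forest. The figure labels its leaves only as Low, Moderate, and High Damage, so the numeric values assigned to those labels below are hypothetical placeholders, as is the example event.

```python
# Hypothetical numeric stand-ins for the figure's leaf labels.
LOW, MODERATE, HIGH = 1.0, 2.0, 3.0

def tree_1(event):
    # Rainfall > 3 in?
    return HIGH if event["rain_in"] > 3 else LOW

def tree_2(event):
    # Flood Type = Flash Flood?
    return HIGH if event["flood_type"] == "Flash Flood" else MODERATE

def tree_3(event):
    # Housing Units > 50,000?
    return MODERATE if event["housing_units"] > 50_000 else LOW

def forest_predict(event, trees=(tree_1, tree_2, tree_3)):
    """Final prediction is the average of the individual tree predictions."""
    predictions = [tree(event) for tree in trees]
    return sum(predictions) / len(predictions)

# Hypothetical event: heavy rain, non-flash flood, small county.
event = {"rain_in": 4.2, "flood_type": "Flood", "housing_units": 20_000}
prediction = forest_predict(event)
print(prediction)
```

For this event the trees return High, Moderate, and Low, so the averaged prediction falls in the middle of the scale.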

Random forest is a flexible, extensible, and straightforward machine learning method. It grows hundreds of decision trees and uses binary splitting to find the best split among the candidate variables at each node. Each tree is trained on its own random draw of the data, and the splitting function selects the best-fitting variable among those considered. Random forest can handle binary, categorical, and numeric features and can be used for classification, clustering, and/or regression analysis. By combining many decision trees into an ensemble, the method extracts and applies information from the data effectively (Speiser et al. (2019)).

Random forest regression analysis is capable of handling large datasets and computing the variable importance of each data component. It can also handle high-dimensional data because it produces regression trees with binary splits for outcome predictions. Random forest suits flood damage prediction because it identifies which predictors contribute most to model accuracy, based on variable selection and the statistical characteristics of the dataset. The three datasets chosen for our prediction analysis supply the variables on which the trees make the decisions that lead to the outcome of interest, predicted flood damage (Haddouchi and Berrado (2019)).

Formula

Mean squared error (MSE) is used in random forest regression to determine the optimal splits during training and to evaluate model performance. The lower the MSE, the greater the model accuracy. The term (yᵢ − ŷᵢ)² is the squared difference between the actual and predicted values for observation i. The summation symbol Σ adds these squared differences over all n observations, and the factor 1/n averages the total squared error (Hodson, Over, and Foks (2021)).

\[ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
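As an illustration of how this formula can score candidate splits, the sketch below computes MSE directly and then searches one predictor for the threshold that minimizes the size-weighted MSE of the two child nodes. The helper names (`node_mse`, `best_split`) and the five data points are hypothetical and are not taken from the study's data.

```python
def mse(actual, predicted):
    """Mean squared error: (1/n) * sum of (y_i - y_hat_i)^2."""
    n = len(actual)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / n

def node_mse(values):
    """MSE of a node that predicts the mean of its own values."""
    mean = sum(values) / len(values)
    return mse(values, [mean] * len(values))

def best_split(x, y):
    """Return (threshold, score) minimizing the size-weighted child MSE."""
    best = None
    # Skip the largest x so the right-hand child is never empty.
    for threshold in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        score = (len(left) * node_mse(left) + len(right) * node_mse(right)) / len(y)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical 1-day rainfall (inches) and log damage for five events.
rain = [0.5, 1.0, 2.5, 4.0, 5.5]
log_damage = [1.0, 1.2, 1.1, 3.0, 3.2]
threshold, score = best_split(rain, log_damage)
print(threshold, score)
```

In this toy data the damage values jump between 2.5 and 4.0 inches of rain, so the split at 2.5 leaves two nearly homogeneous groups and yields the lowest weighted MSE.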

Assumptions

  • Each tree makes its own decisions.
  • Data points are chosen randomly to reduce mistakes.
  • Sufficient data ensures the trees have unique patterns and variety.
  • Combining the predictions from different trees leads to a more accurate final result.

Advantages

  • Easy to identify importance of features
  • Multiple decision trees reduce the risk of overfitting (fitting too closely to the original training data)
  • Very accurate due to multiple subgroups
  • Less need for data management (pre-processing to perform random forest)

Limitations

  • Slower to acquire observations for best decision making
  • Bound by the lowest and highest observations (cannot extrapolate)
  • Difficult to interpret the decisions of ‘best features’ seen with trees
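The extrapolation limit listed above follows from how regression trees predict: each leaf returns the mean of the training targets that landed in it, so no prediction can fall outside the observed range. The sketch below shows this with a single hand-built tree; the thresholds, data, and function names are hypothetical.

```python
import bisect

def fit_leaves(x, y, thresholds):
    """Group training targets into leaves defined by sorted thresholds; store each leaf mean."""
    buckets = [[] for _ in range(len(thresholds) + 1)]
    for xi, yi in zip(x, y):
        buckets[bisect.bisect_left(thresholds, xi)].append(yi)
    return [sum(bucket) / len(bucket) for bucket in buckets]

def predict(xi, thresholds, leaf_means):
    """Return the mean of whichever leaf xi falls into."""
    return leaf_means[bisect.bisect_left(thresholds, xi)]

# Hypothetical training data where damage grows linearly with rainfall.
rain = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
damage = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
thresholds = [2.0, 4.0]   # leaves: rain <= 2, 2 < rain <= 4, rain > 4

leaf_means = fit_leaves(rain, damage, thresholds)
extreme = predict(20.0, thresholds, leaf_means)   # far beyond any observed rainfall
print(leaf_means, extreme)
```

A linear trend would suggest a damage value near 200 for 20 inches of rain, but the tree returns the mean of its top leaf, which cannot exceed the largest observed damage.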

Analysis and Results

Data Sources

Data for this project come from three sources: the NOAA Storm Events Database; daily precipitation observations from the Global Historical Climatology Network Daily (GHCN-Daily); and total population, total housing units, and median home value from the American Community Survey 5-year estimates. The variables used in this study are listed in Table 1, along with the source for each variable.

Table 1. Variables Used in Random Forest Model
| Variable | Description | Unit | Source |
| --- | --- | --- | --- |
| County | County where the storm event occurred | County name | NOAA Storm Events Database |
| Date | Date of the storm event | Date | NOAA Storm Events Database |
| Flood Type | Classification of event (Flood or Flash Flood) | Category | NOAA Storm Events Database |
| Property Damage | Estimated property damage caused by the storm | USD ($) | NOAA Storm Events Database |
| Rainfall (1-day) | Total precipitation during the storm day | Inches | NOAA GHCN-Daily |
| Rainfall (3-day) | Total precipitation during storm day and previous two days | Inches | NOAA GHCN-Daily |
| Rainfall (7-day) | Total precipitation during storm day and previous six days | Inches | NOAA GHCN-Daily |
| Population | Total county population | Persons | U.S. Census Bureau ACS |
| Housing Units | Total number of housing units in the county | Count | U.S. Census Bureau ACS |
| Median Home Value | Median value of owner-occupied housing units | USD ($) | U.S. Census Bureau ACS |

Data Processing

Data from the NOAA Storm Events Database was collected for the 16 counties in Northwest Florida. Property damage and crop damage amounts were combined to create Total Damage. Flood damage amounts ranged from $0 to hundreds of millions of dollars and many of the flood events resulted in $0 damage. To account for the numerous zeros, the log of total damage was used in the random forest models.
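The damage transformation can be sketched as follows. Because the logarithm of zero is undefined, this sketch assumes a log(1 + x) transform so that $0 events map to 0; the text does not state which variant was used, so that choice, the function names, and the three example events are assumptions.

```python
import math

def total_damage(property_damage, crop_damage):
    """Combine property and crop damage into a single total, in dollars."""
    return property_damage + crop_damage

def log_damage(total):
    """Assumed transform: log(1 + total), so a $0 event maps to 0.0."""
    return math.log1p(total)

def dollars_from_log(value):
    """Invert the transform to express a prediction in dollars again."""
    return math.expm1(value)

# Hypothetical events as (property damage, crop damage) in dollars.
events = [(0.0, 0.0), (25_000.0, 0.0), (1_500_000.0, 50_000.0)]
transformed = [log_damage(total_damage(p, c)) for p, c in events]
print(transformed)
```

Note that error metrics computed on the transformed target are in log units, not dollars, and predictions need the inverse transform before they are reported as dollar amounts.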

Data from the Global Historical Climatology Network Daily was collected and precipitation days without data were given a zero for precipitation. Precipitation data was already in inches so no further calculations were required.

Wakulla County did not have any precipitation gauges, so precipitation data was collected from the four surrounding counties (Jefferson, Liberty, Leon, and Franklin). Daily values were averaged across the four counties to represent Wakulla County. In addition, new columns were created for 1-day, 3-day, and 7-day average rainfall prior to a known flood event.
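The rainfall features can be sketched with two small helpers. This sketch follows the Table 1 definitions (totals over the storm day and the preceding days) and treats missing days as zero, as described above; the function names and all rainfall values are hypothetical.

```python
from datetime import date, timedelta

def cumulative_rain(daily, event_day, days):
    """Sum rainfall over `days` days ending on event_day; missing days count as 0."""
    return sum(daily.get(event_day - timedelta(days=k), 0.0) for k in range(days))

def average_neighbors(neighbor_series, day):
    """Mean of neighboring counties' readings on one day; missing readings count as 0."""
    return sum(series.get(day, 0.0) for series in neighbor_series) / len(neighbor_series)

# Hypothetical daily rainfall (inches) for one county; September 3 has no record.
daily = {
    date(2021, 9, 1): 0.5,
    date(2021, 9, 2): 1.0,
    date(2021, 9, 4): 2.0,
}
event_day = date(2021, 9, 4)
rain_1d = cumulative_rain(daily, event_day, 1)
rain_3d = cumulative_rain(daily, event_day, 3)
rain_7d = cumulative_rain(daily, event_day, 7)

# Hypothetical readings from four neighboring counties on the same day.
neighbors = [{event_day: 1.0}, {event_day: 2.0}, {}, {event_day: 1.0}]
proxy = average_neighbors(neighbors, event_day)
print(rain_1d, rain_3d, rain_7d, proxy)
```

Treating a missing reading as zero pulls the neighbor average down, so a gauge outage in one neighboring county lowers the proxy value for the county without gauges.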

Once the spreadsheets from all data sources were cleaned, they were merged into one spreadsheet using county name.
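A merge on county name can be sketched as a simple lookup join. The demographic values below are the Table 2 figures for Escambia and Liberty Counties; the event rows, the rainfall values, the field names, and the " County" suffix handling are hypothetical.

```python
# Hypothetical storm events; rainfall values are illustrative only.
events = [
    {"county": "Escambia County", "flood_type": "Flash Flood", "rain_1d": 3.1},
    {"county": "Liberty", "flood_type": "Flood", "rain_1d": 1.4},
]

# County-level demographics keyed by county name (values from Table 2).
acs = {
    "Escambia": {"population": 325_923, "housing_units": 149_217, "median_home_value": 257_200},
    "Liberty": {"population": 7_687, "housing_units": 3_238, "median_home_value": 127_300},
}

def merge_on_county(event_rows, county_table):
    """Attach county-level columns to each event; report names that fail to match."""
    merged, unmatched = [], []
    for row in event_rows:
        key = row["county"].replace(" County", "").strip()  # normalize the join key
        if key in county_table:
            merged.append({**row, **county_table[key]})
        else:
            unmatched.append(row["county"])
    return merged, unmatched

merged, unmatched = merge_on_county(events, acs)
print(merged[0])
print("unmatched:", unmatched)
```

Checking the unmatched list guards against silent row loss when sources spell county names differently. Because the join key is county alone, every event in a county receives the same demographic values.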

Data Visualization

According to the data collected from NOAA, Northwest Florida experienced 1,203 flooding events between January 1, 2000 and December 31, 2025. Flooding event distribution is represented in Figure 2. Of the flooding events, 77.1% are categorized as flash floods and 22.9% are categorized as floods. This indicates that Northwest Florida counties experience short-duration, high-intensity rainfall events more often than long-duration rainfall events.

Figure 2. Flash Flood vs Flood Events (2000–Present)

Further breaking down the dataset, Figure 3 illustrates the number of flash flood and flood events by county. Bay County experienced the highest number of flooding events, with over 150 flash floods and approximately 30 flood events recorded between January 1, 2000 and December 31, 2025. In contrast, Liberty County experienced the fewest flood events, with fewer than 25 flash floods and fewer than 10 flood events during the same time period. This pattern is likely influenced by each county’s geographic profile. Bay County has 708 square miles of water while Liberty County has only 7.6 square miles of water (U.S. Census Bureau (2024)).

Figure 3. Flood and Flash Flood Events by County

As discussed in the introduction, rainfall events are forecast to increase in frequency and intensity. The NOAA data set supports this pattern. In Figure 4, yearly flood event totals and total damage are plotted from January 1, 2000 to December 31, 2025. The right y-axis represents property damage in millions of dollars, and the left y-axis represents the number of flood events. The orange trend line shows that the number of flood events is increasing (higher frequency). The red trend line shows that total property damage rises slightly across the 25-year period.

Figure 4. Flood Events and Property Damage Over Time

Table 2 presents the demographic variable totals obtained from the American Community Survey 5-year estimates. The table includes the population size, number of housing units, and the median home values for each of the 16 Northwest Florida counties from the 2024 survey.

Table 2. Population, Housing Units, and Median Home Value by County
Northwest Florida Counties (ACS 5-Year Estimates - 2024)

| County | Population | Housing Units | Median Home Value ($) |
| --- | --- | --- | --- |
| Escambia County | 325,923 | 149,217 | $257,200 |
| Leon County | 297,542 | 137,793 | $301,800 |
| Okaloosa County | 216,599 | 103,864 | $351,200 |
| Santa Rosa County | 198,472 | 80,002 | $329,800 |
| Bay County | 186,393 | 108,362 | $310,500 |
| Walton County | 82,948 | 61,272 | $425,100 |
| Jackson County | 48,250 | 20,145 | $120,800 |
| Gadsden County | 43,710 | 19,144 | $174,900 |
| Wakulla County | 35,387 | 14,599 | $258,300 |
| Washington County | 25,529 | 11,044 | $171,000 |
| Holmes County | 19,513 | 8,659 | $110,200 |
| Gulf County | 15,131 | 9,362 | $250,000 |
| Jefferson County | 15,091 | 6,886 | $232,600 |
| Calhoun County | 13,492 | 5,683 | $145,300 |
| Franklin County | 12,553 | 8,562 | $273,300 |
| Liberty County | 7,687 | 3,238 | $127,300 |

Accurate representation of property damage is essential to creating an accurate and usable model. Due to inflation, property damage values from 20 years ago are not an accurate representation of present-day dollars. To improve accuracy, the NOAA data used for the model was limited to the past six years, starting in January 2020. The sample size in this date range was large enough for the Random Forest model while still portraying meaningful property damage amounts.

To strengthen the model, rainfall data was collected from the Global Historical Climatology Network Daily (GHCN-Daily). Every county, with the exception of Calhoun County, had precipitation gauges. For Calhoun County, precipitation data was collected from the four surrounding counties (Bay, Gulf, Jackson, and Liberty Counties) and averaged per day. Figure 5 illustrates the monthly average rainfall per county in Northwest Florida.

Figure 5. Average Monthly Rainfall by County

As shown in Figure 5, rainfall amounts are highest during the summer months. However, Figure 6 indicates that flood events occur more frequently in September and April.

Figure 6. Average Number of Flood Events per Month

To examine this difference, daily rainfall maximums were calculated per month (Figure 7). This figure shows that daily rainfall maximums are highest in September, suggesting that September experiences high-intensity but less frequent rainfall events, which may contribute to more frequent flooding events.

Figure 7. Maximum Daily Rainfall by Month
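The monthly maximum shown in Figure 7 can be computed by grouping daily observations by calendar month and keeping the largest value in each group. The sketch below assumes that grouping pools all years together; the function name and the rainfall values are hypothetical.

```python
from collections import defaultdict
from datetime import date

def monthly_max(daily):
    """Largest single-day rainfall for each calendar month, pooled across years."""
    result = defaultdict(float)
    for day, inches in daily.items():
        result[day.month] = max(result[day.month], inches)
    return dict(result)

# Hypothetical daily rainfall observations (inches).
daily = {
    date(2021, 7, 4): 2.1,
    date(2021, 7, 5): 2.4,
    date(2021, 9, 1): 0.5,
    date(2021, 9, 16): 6.2,
    date(2022, 9, 3): 4.0,
}
maxima = monthly_max(daily)
print(maxima)
```

A month can have a modest average and still a high maximum, which is the distinction between Figure 5 and Figure 7.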

Random Forest

The first model used multiple variables to predict total damage from a known flood event. The variables were 1-day prior rainfall, 3-day prior rainfall, 7-day prior rainfall, population, housing units, and median home value. The R² and RMSE for the first model were 0.001 and 2.748, respectively. These numbers suggest that the model was unable to learn anything meaningful, possibly due to the number of flood events with zero damage.
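For readers less familiar with these metrics, the sketch below computes R² and RMSE from scratch and shows what an R² near zero means: the model does no better than predicting the mean of the observed values for every event. The function names and the four data points are hypothetical.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    n = len(actual)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n)

def r_squared(actual, predicted):
    """1 - SS_res / SS_tot: share of variance explained relative to a mean-only model."""
    mean = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

# Hypothetical log-damage values and a "model" that always predicts their mean.
actual = [1.0, 2.0, 3.0, 4.0]
mean_only = [2.5, 2.5, 2.5, 2.5]

baseline_r2 = r_squared(actual, mean_only)
baseline_rmse = rmse(actual, mean_only)
print(baseline_r2, baseline_rmse)
```

A mean-only predictor scores an R² of exactly zero, so values such as 0.001 indicate performance almost indistinguishable from that baseline.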

To verify our assumption that the zeros in total damage were the issue with the first model, the second model kept only events with property damage greater than zero. The second model used the same predictors: 1-day prior rainfall, 3-day prior rainfall, 7-day prior rainfall, population, housing units, and median home value. The second model performed better than the first model, resulting in an R² and RMSE of 0.047 and 1.922, respectively. This suggests that removing the flood events with zero damage helped the model learn the relationship between rainfall, county characteristics, and damage more accurately. However, the model was still not predicting total damage very well.

To further improve on the second model, the third model included exposure variables representing the people and structures exposed to rainfall. To calculate these new variables, rainfall was multiplied by population and by housing units. The third model still included 1-day prior rainfall, 3-day prior rainfall, 7-day prior rainfall, population, housing units, and median home value. The R² and RMSE for model three were 0.017 and 2.059, respectively. These results suggested that the new variables did not improve total damage prediction and instead introduced noise to the model.
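The exposure variables are simple interaction terms and can be sketched as below. The text does not specify which rainfall window was multiplied, so the use of 1-day rainfall here is an assumption, and the field names and rainfall value are hypothetical; the population and housing figures are Liberty County's from Table 2.

```python
def add_exposure(row, rain_key="rain_1d"):
    """Return a copy of the row with rainfall-by-population and rainfall-by-housing terms."""
    out = dict(row)
    out["people_exposed"] = row[rain_key] * row["population"]
    out["structures_exposed"] = row[rain_key] * row["housing_units"]
    return out

# Hypothetical event in Liberty County with 2.0 inches of 1-day rainfall.
event = {"rain_1d": 2.0, "population": 7_687, "housing_units": 3_238}
with_exposure = add_exposure(event)
print(with_exposure)
```

Because population and housing units are constant within a county, these products are rescaled copies of the rainfall variable for each county, which may help explain why they added little new information.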

As learned from model three, a more complex model is not always a better model. Model four was simplified by using only 1-day prior rainfall, population, housing units, and median home value, and only storms with total damage over zero dollars. The 3-day and 7-day rainfall variables were removed because earlier models ranked them as least important. The 1-day rainfall variable remained useful in all models, as did housing units, population, and median home value. The R² and RMSE for model four were 0.036 and 1.959, respectively. This result suggested that short-term rainfall was more useful for predicting total damage than longer rainfall accumulations.

Overall, the most useful improvement was removing flood events with zero total damage and simplifying the predictor set. However, the fourth model was still limited in its predictive power, which is understandable since flood damage can be influenced by many other factors not included in the datasets used for this study. For example, soil type, permeability, and elevation can influence flood damage. To show the progression of our models, all R² and RMSE values can be seen in Figure 8.

Figure 8. Random Forest Model Performance Comparison

Though the models did not have strong predictive power for total damage, predicted damage by county could still be calculated. Figure 9 shows the top counties and their predicted total damage.

Figure 9. Average Predicted Flood Damage by County

References

Akshay, S. et al. 2021. “Predicting Flood Severity Using Machine Learning and Hybrid Models.” International Journal of Disaster Risk Reduction.
Bilskie, M. V., S. C. Hagen, S. C. Medeiros, and D. L. Passeri. 2014. “Dynamics of Sea Level Rise and Coastal Flooding on a Changing Landscape.” Geophysical Research Letters 41 (3): 927–34. https://doi.org/10.1002/2013GL058759.
Chen, Weixin, Yilei Li, Wei Xue, Hadi Shahabi, Shupeng Li, Hao Hong, Xue Wang, et al. 2020. “Modeling Flood Susceptibility Using Data-Driven Approaches of Naïve Bayes Tree, Alternating Decision Tree, and Random Forest Methods.” Science of The Total Environment 701: 134979. https://doi.org/10.1016/j.scitotenv.2019.134979.
Chen, Zhen, Qingsong Wu, Sipeng Han, Jungui Zhang, Peng Yang, and Xingwu Liu. 2022. “A Study on Geological Structure Prediction Based on Random Forest Method.” Artificial Intelligence in Geosciences 3: 226–36. https://doi.org/10.1016/j.aiig.2023.01.004.
Dey, Hemal, Md Munjurul Haque, Wanyun Shao, Matthew VanDyke, and Feng Hao. 2024. “Simulating Flood Risk in Tampa Bay Using a Machine Learning Driven Approach.” Npj Natural Hazards 1: 40. https://doi.org/10.1038/s44304-024-00045-4.
Diakakis, Michalis, Neofytos Boufidis, Jose Maria Salanova Grau, Emmanuel Andreadakis, and Iraklis Stamos. 2020. “A Systematic Assessment of the Effects of Extreme Flash Floods on Transportation Infrastructure and Circulation: The Example of the 2017 Mandra Flood.” International Journal of Disaster Risk Reduction 47: 101542. https://doi.org/10.1016/j.ijdrr.2020.101542.
Dutta, Pijush, Shobhandeb Paul, and Asok Kumar. 2023. “Comparative Analysis of Various Supervised Machine Learning Techniques for Diagnosis of COVID-19.” In Machine Learning for Healthcare Applications. Elsevier. https://doi.org/10.1016/B978-0-323-85172-5.00020-4.
Gensini, Vittorio A., Cody Converse, Walker S. Ashley, and Mateusz Taszarek. 2021. “Machine Learning Classification of Significant Tornadoes and Hail in the United States Using ERA5 Proximity Soundings.” Weather and Forecasting 36 (6): 2143–60.
Haddouchi, Mohammed, and Abdelaziz Berrado. 2019. “A Survey of Methods and Tools Used for Interpreting Random Forest,” 1–6. https://doi.org/10.1109/ICSSD47982.2019.9002770.
Halsnæs, Kirsten, Morten Andreas Dahl Larsen, Tanya Pheiffer Sunding, and Mads Lykke Dømgaard. 2023. “The Value of Advanced Flood Models, Damage Costs and Land Use Data in Cost-Effective Climate Change Adaptation.” Climate Services 32: 100424. https://doi.org/10.1016/j.cliser.2023.100424.
Hodson, Timothy O., Thomas M. Over, and Sydney S. Foks. 2021. “Mean Squared Error, Deconstructed.” Journal of Advances in Modeling Earth Systems 13 (12): e2021MS002681. https://doi.org/10.1029/2021MS002681.
Intergovernmental Panel on Climate Change. 2021. “Weather and Climate Extreme Events in a Changing Climate.” In Climate Change 2021: The Physical Science Basis, edited by V. Masson-Delmotte, P. Zhai, A. Pirani, S. L. Connors, C. Péan, S. Berger, N. Caud, et al. Cambridge University Press.
Javadinejad, Safieh. 2022. “Causes and Consequences of Floods: Flash Floods, Urban Floods, River Floods, and Coastal Floods.” Resources, Environment and Information Engineering 4 (1): 156–66. https://doi.org/10.25082/REIE.2022.01.002.
Lecce, S. A. 2000. “Spatial Variations in the Timing of Annual Floods in the Southeastern United States.” Journal of Hydrology 235 (3-4): 151–69. https://doi.org/10.1016/S0022-1694(00)00273-0.
Morss, Rebecca E., David Ahijevych, Kathryn R. Fossell, Alex M. Kowaleski, and Christopher A. Davis. 2024. “Predictability of Hurricane Storm Surge: An Ensemble Forecasting Approach Using Global Atmospheric Model Data.” Water 16 (11): 1523. https://doi.org/10.3390/w16111523.
National Oceanic and Atmospheric Administration. 2024. “Storm Events Database.” https://www.ncdc.noaa.gov/stormevents/.
Rigatti, Steven J. 2017. “Random Forest.” Journal of Insurance Medicine 47 (1): 31–39. https://doi.org/10.17849/insm-47-01-31-39.1.
Sadkou, Salma, Guillaume Artigue, Noémie Fréalle, Pierre-Alain Ayral, Séverin Pistre, Sophie Sauvagnargues, and Anne Johannet. 2024. “A Review of Flash-Floods Management: From Hydrological Modeling to Crisis Management.” Journal of Flood Risk Management 17 (3): e12999. https://doi.org/10.1111/jfr3.12999.
Saharia, Manabendra, Pierre-Emmanuel Kirstetter, Humberto Vergara, Jonathan J. Gourley, Yang Hong, and Marine Giroud. 2017. “Mapping Flash Flood Severity in the United States.” Journal of Hydrometeorology 18 (2): 397–411. https://doi.org/10.1175/JHM-D-16-0082.1.
Salman, Hasan Ahmed, Ali Kalakech, and Amani Steiti. 2024. “Random Forest Algorithm Overview.” Babylonian Journal of Machine Learning 2024: 69–79.
Schoppa, L., M. Disse, and S. Bachmair. 2020. “Evaluating the Performance of Random Forest for Large-Scale Flood Discharge Simulation.” Journal of Hydrology 590: 125531. https://doi.org/10.1016/j.jhydrol.2020.125531.
Speiser, Jaime Lynn, Michael E. Miller, Janet Tooze, and Edward Ip. 2019. “A Comparison of Random Forest Variable Selection Methods for Classification Prediction Modeling.” Expert Systems with Applications 134: 93–101. https://doi.org/10.1016/j.eswa.2019.05.028.
Tan, X. Z., Y. Li, X. X. Wu, C. Dai, X. L. Zhang, and Y. P. Cai. 2024. “Identification of the Key Driving Factors of Flash Flood Based on Different Feature Selection Techniques Coupled with Random Forest Method.” Journal of Hydrology: Regional Studies 51: 101624. https://doi.org/10.1016/j.ejrh.2023.101624.
U.S. Census Bureau. 2024. “American Community Survey 5-Year Estimates.” https://data.census.gov.
Wang, Z., C. Lai, X. Chen, B. Yang, S. Zhao, and X. Bai. 2015. “Flood Hazard Risk Assessment Model Based on Random Forest.” Journal of Hydrology 527: 1130–41. https://doi.org/10.1016/j.jhydrol.2015.06.008.
Zhu, Z., and Y. Zhang. 2022. “Flood Disaster Risk Assessment Based on Random Forest Algorithm.” Neural Computing and Applications 34: 3443–55. https://doi.org/10.1007/s00521-021-05757-6.