Introducing Spatial Heterogeneity via Regionalization Methods in Machine Learning Models for Geographical Prediction: A Spatially Conscious Paradigm
Published 2024-10-17
Keywords
- spatial heterogeneity
- regionalization
- machine learning
- spatial clustering
- geographical modelling
Copyright (c) 2024 Lukas Boegl, Ourania Kounadi
This work is licensed under a Creative Commons Attribution 4.0 International License.
Accepted 2024-10-17
Abstract
This study addresses the challenge of incorporating spatial heterogeneity in predictive modeling by introducing regionalization methods in the preprocessing step of the modeling workflow. Spatial heterogeneity, where the mean of attribute values varies across spatial units, poses difficulties for traditional models. To tackle this, we propose a novel approach called Regionalization Random Forest (RegRF), which combines Random Forest with regionalization techniques to enhance predictive performance. Regionalization combines multiple spatial objects into homogeneous regions, which are incorporated into predictive models, allowing models to capture local variations. This research investigates three key questions: (1) How does the predictive performance of RegRF vary when constructed using different regionalization techniques? (2) How does RegRF compare to benchmark methods, including both spatial statistical approaches and spatially conscious machine learning models like Geographically Weighted Random Forest (GW-RF)? Five regionalization methods—WARD, AZP, Kmeans, SKATER, and Max-p—are tested on datasets of varying sizes. Results show that RegRF significantly improves performance over "non-spatial" Random Forest models with minimal additional computation time. While RegRF performs competitively with Geographically Weighted Regression, it requires much less computational effort. GW-RF was not outperformed on smaller datasets but failed to complete for larger datasets. These findings suggest that RegRF can enhance machine learning models by accounting for spatial phenomena, with potential for further optimization.
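The idea of regionalization as a preprocessing step can be illustrated with a small, dependency-free sketch. It is not the authors' RegRF implementation: the synthetic grid, the plain k-means on coordinates, and the per-region mean used as a stand-in for the Random Forest learner are all assumptions made for illustration, and every name in it (`kmeans`, `units`, `regions`) is hypothetical. It only shows why giving a model access to region membership can help when the mean of the target differs between parts of the study area.

```python
import math
import random


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means; returns one region label per point."""
    # Deterministic initialisation: pick centroids spread over the input order.
    step = (len(points) - 1) // (k - 1) if k > 1 else 0
    centroids = [points[i * step] for i in range(k)]
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels


# Synthetic study area (assumption): a 12 x 4 grid of spatial units whose
# target mean is 10 in the western half and 20 in the eastern half.
rng = random.Random(0)
units = [(x, y) for x in range(12) for y in range(4)]
target = [(10.0 if x < 6 else 20.0) + rng.gauss(0, 1) for x, y in units]

# Preprocessing step: group the units into two regions.
regions = kmeans(units, k=2)

# Simple hold-out split: even-indexed units train, odd-indexed units test.
train = [i for i in range(len(units)) if i % 2 == 0]
test = [i for i in range(len(units)) if i % 2 == 1]

# "Non-spatial" baseline: a single global mean learned from the training units.
global_mean = sum(target[i] for i in train) / len(train)

# Region-aware stand-in model: one mean per region learned from the training units.
region_mean = {}
for r in set(regions):
    values = [target[i] for i in train if regions[i] == r]
    region_mean[r] = sum(values) / len(values)


def rmse(predict):
    """Root mean squared error of a predictor on the test units."""
    return math.sqrt(sum((predict(i) - target[i]) ** 2 for i in test) / len(test))


rmse_global = rmse(lambda i: global_mean)
rmse_regional = rmse(lambda i: region_mean[regions[i]])
print(f"global-mean RMSE:  {rmse_global:.2f}")
print(f"region-aware RMSE: {rmse_regional:.2f}")
```

In a fuller workflow, the region labels would typically be passed to the learner as an additional feature (for example, one-hot encoded) or used to fit one model per region; which of these the authors use is not stated in the abstract.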
Highlights:
- RegRF significantly increases the performance of the predictive models in comparison to "non-spatial" Random Forest models, while only taking a few seconds longer to compute.
- It competes with the well-established Geographically Weighted Regression while requiring only a fraction of the computational effort.
- It can be used for larger datasets while the Geographically Weighted Random Forest may not be able to finish computation.