Vol. 15 No. 4 (2024): (Issue in progress)
Research Article

Introducing Spatial Heterogeneity via Regionalization Methods in Machine Learning Models for Geographical Prediction: A Spatially Conscious Paradigm

Lukas Boegl
Department of Geography and Regional Research, University of Vienna, Universitätsstraße 7, 1010 Vienna, Austria
Ourania Kounadi
Department of Geography and Regional Research, University of Vienna, Universitätsstraße 7, 1010 Vienna, Austria
Median house price: full dataset, 20,640 points (left) and small dataset, 2051 points (right).

Published 2024-10-17

Keywords

  • spatial heterogeneity,
  • regionalization,
  • machine learning,
  • spatial clustering,
  • geographical modelling

How to Cite

Boegl, Lukas, and Ourania Kounadi. 2024. “Introducing Spatial Heterogeneity via Regionalization Methods in Machine Learning Models for Geographical Prediction: A Spatially Conscious Paradigm”. European Journal of Geography 15 (4):244-55. https://doi.org/10.48088/ejg.l.boe.15.4.244.255.
Received 2024-07-29
Accepted 2024-10-17

Abstract

This study addresses the challenge of incorporating spatial heterogeneity in predictive modeling by introducing regionalization methods in the preprocessing step of the modeling workflow. Spatial heterogeneity, where the mean of attribute values varies across spatial units, poses difficulties for traditional models. To tackle this, we propose a novel approach called Regionalization Random Forest (RegRF), which combines Random Forest with regionalization techniques to enhance predictive performance. Regionalization aggregates multiple spatial objects into homogeneous regions; these regions are then incorporated into the predictive model, allowing it to capture local variations. This research investigates the following key questions: (1) How does the predictive performance of RegRF vary when constructed using different regionalization techniques? (2) How does RegRF compare to benchmark methods, including both spatial statistical approaches and spatially conscious machine learning models like Geographically Weighted Random Forest (GW-RF)? Five regionalization methods (WARD, AZP, Kmeans, SKATER, and Max-p) are tested on datasets of varying sizes. Results show that RegRF significantly improves performance over "non-spatial" Random Forest models with minimal additional computation time. RegRF performs competitively with Geographically Weighted Regression while requiring much less computational effort. GW-RF was not outperformed on smaller datasets but failed to complete on larger ones. These findings suggest that RegRF can enhance machine learning models by accounting for spatial phenomena, with potential for further optimization.

Highlights:

  • RegRF significantly increases the performance of the predictive models in comparison to "non-spatial" Random Forest models, while only taking a few seconds longer to compute.
  • It competes with the well-established Geographically Weighted Regression, while only requiring a fraction of the computational effort.
  • It can be used for larger datasets, for which the Geographically Weighted Random Forest may not be able to finish computation.
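
The workflow summarized in the abstract (regionalize first, then feed the regions to a Random Forest) can be illustrated with a minimal sketch. It is not the authors' implementation. It rests on several assumptions that do not come from the article: the data are synthetic, spatially constrained Ward clustering from scikit-learn stands in for the WARD regionalization method, regions are built from covariates only (never from the target), and region membership enters the model as one-hot features. All function names are hypothetical.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import kneighbors_graph


def make_synthetic_data(n=600, seed=0):
    """Hypothetical data: the target mean differs between west and east halves."""
    rng = np.random.default_rng(seed)
    coords = rng.uniform(0.0, 10.0, size=(n, 2))
    east = (coords[:, 0] >= 5.0).astype(float)
    x1 = rng.normal(size=n) + 3.0 * east  # covariate with a spatial regime
    x2 = rng.normal(size=n)               # covariate without spatial structure
    X = np.column_stack([x1, x2])
    y = 2.0 * x2 + 8.0 * east + rng.normal(scale=0.5, size=n)
    return coords, X, y


def regionalize(coords, attributes, n_regions=6, n_neighbors=8):
    """Group points into regions with Ward clustering constrained by a
    k-nearest-neighbour graph (assumed stand-in for a regionalization method)."""
    connectivity = kneighbors_graph(coords, n_neighbors=n_neighbors,
                                    include_self=False)
    model = AgglomerativeClustering(n_clusters=n_regions, linkage="ward",
                                    connectivity=connectivity)
    return model.fit_predict(attributes)


def add_region_features(X, labels, n_regions):
    """Append one-hot region membership columns to the feature matrix."""
    one_hot = np.eye(n_regions)[labels]
    return np.hstack([X, one_hot])


def compare_models(X_plain, X_regions, y, seed=0):
    """Fit a Random Forest with and without region features on the same split
    and return both test-set R^2 scores."""
    Xp_tr, Xp_te, Xr_tr, Xr_te, y_tr, y_te = train_test_split(
        X_plain, X_regions, y, test_size=0.3, random_state=seed)
    scores = []
    for X_tr, X_te in ((Xp_tr, Xp_te), (Xr_tr, Xr_te)):
        rf = RandomForestRegressor(n_estimators=100, random_state=seed)
        rf.fit(X_tr, y_tr)
        scores.append(r2_score(y_te, rf.predict(X_te)))
    return scores[0], scores[1]


if __name__ == "__main__":
    coords, X, y = make_synthetic_data()
    labels = regionalize(coords, X, n_regions=6)
    X_aug = add_region_features(X, labels, n_regions=6)
    r2_plain, r2_regions = compare_models(X, X_aug, y)
    print(f"R^2 without regions: {r2_plain:.3f}")
    print(f"R^2 with regions:    {r2_regions:.3f}")
```

Whether the region features help depends on the data and on the regionalization method chosen; the sketch only shows where the regionalization step sits in the workflow and makes no claim about the size of any improvement.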
