Machine learning for snow depth estimation over the European Alps, using Sentinel-1 observations, meteorological forcing data and process-based model simulations

Boeykens, Lucas; Dunmire, Devon; Jans, Jonas-Frederik; Waegeman, Willem; De Lannoy, Gabriëlle; Beernaert, Ezra; Verhoest, Niko E. C.; Lievens, Hans

doi:10.5194/tc-20-3187-2026

Articles | Volume 20, issue 5

https://doi.org/10.5194/tc-20-3187-2026

© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.


Research article | 29 May 2026



Final revised paper (published on 29 May 2026)
Preprint (discussion started on 30 Jul 2025)

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2025-3327', Anonymous Referee #1, 08 Oct 2025

Summary:
This manuscript presents an extensive analysis of machine learning capabilities for snow depth estimation. The authors compare a variety of machine learning (XGBoost) model configurations and apply a threefold nested cross validation to evaluate their approach. The inputs to the machine learning model are remote sensing data from Sentinel-1 (of which PolSAR had not been used before to estimate snow depth), downscaled meteorological forcing data and physically-based model simulations. The authors then evaluate the importance of features in the machine learning model, and the spatial predictions of snow depth at unseen locations by the model.
The aims and findings of this study are interesting, with the main novelty being the inclusion of PolSAR variables as well as meteorological forcings or a physically-based model forced with those to predict snow depth at high resolution (100 m) over the Alps. I am impressed with the amount of data processing and careful methodological procedures that the authors went through, which seems very robust.
However, I think the manuscript should more clearly state the novelty of this study in comparison with Dunmire et al. (2024). While the authors claim that the snow depth estimations are improved with the inclusion of PolSAR and meteorological forcings, it is hard to see any significant improvement when comparing similar figures between the manuscripts. Even within this manuscript, it is often claimed that a method improves snow depth estimates without this being clearly visible in the figures. Furthermore, I have several concerns regarding the presentation of results, some of which are not very clear, and there are many instances of “results not shown”. I believe the authors need to improve the manuscript before it can be published, and I hope my comments below will help.
Main comments:
The title claims snow depth estimation over the European Alps, but there is no map of estimated snow depth over the European Alps, and no map of the predicted snow depth validation over the entire mountain range (as there is in Figure 2, and Figure 7, in Dunmire et al 2024).
About XGBoost: Besides referring the reader to Chen and Guestrin (2016), I think there should be at least a few lines description of what this ML model is and its characteristics, and why the authors (or previous authors) chose this model.
The inclusion in the ML model of physically-based model simulations with meteorological forcing yields an accuracy comparable to using the meteorological forcings directly as input to the ML model. As I understand it, there is therefore no advantage of using physically-based model simulations, as this adds unnecessary complexity. It seems the ML model learns the physics already with the meteorological forcing. I think this is an interesting finding and should be better discussed.
Section 4.1 states a couple of times that differences are significant because p <<0.05 (e.g. differences due to Table C1). However, the improvements are quite marginal (R2 0.88 vs R2 0.89; MAE 0.3m vs MAE 0.29 m). I think the authors should discuss the significance based on the absolute improvement, which is very little, and not the statistical significance, which in this case is clearly just due to the large sample size. See https://www.nature.com/articles/s41598-021-00199-5 and https://linkinghub.elsevier.com/retrieve/pii/S026151771730078X . With this in mind, the authors should revise this section carefully.
There are several instances of “results not shown” in the paper. I think they should all be included as they seem relevant (lines 396, 411, 465, 474, 484, 494, 501, 526, 550). There are also instances where a result is discussed but not seen on any figure (lines 397-399, 431-433, 497-498, 511-512, 534-537, 586)
About the novelty with respect to Dunmire et al (2024). Line 344 even states that the errors are slightly higher in this study than in Dunmire, and another example in line 434 shows very similar results. There should be a more open discussion about the little improvement, despite the novelty of this paper.
Sometimes in the manuscript, it seems that meteorological forcing data AND physically-based model simulations are used simultaneously in the ML model, but that is not the case. I suggest to change the following to OR (not in the title, as that is a list of all the inputs). (Line 8, 568)
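The point raised above about statistical versus practical significance can be illustrated with a minimal sketch. It assumes a two-sample z-test on independent, roughly normal errors, which is a simplification and not the test used in the manuscript. The MAE values (0.30 m and 0.29 m) are those quoted in the comment; the standard deviation `ASSUMED_SD` and the sample sizes are hypothetical, chosen only for illustration.

```python
import math


def two_sided_p_value(mean_diff, sd, n):
    """Two-sided p-value of a two-sample z-test with n samples per group."""
    standard_error = sd * math.sqrt(2.0 / n)
    z = mean_diff / standard_error
    return math.erfc(abs(z) / math.sqrt(2.0))


def cohens_d(mean_diff, sd):
    """Standardised effect size; it does not depend on the sample size."""
    return mean_diff / sd


MAE_DIFF = 0.30 - 0.29  # m, difference quoted in the referee comment
ASSUMED_SD = 0.40       # m, hypothetical spread of the errors (assumption)

for n in (100, 1_000_000):
    p = two_sided_p_value(MAE_DIFF, ASSUMED_SD, n)
    d = cohens_d(MAE_DIFF, ASSUMED_SD)
    print(f"n={n:>9,d}  p={p:.3g}  Cohen's d={d:.3f}")
```

Under these assumptions the same 0.01 m difference is far from significant for a small sample and highly significant for a very large one, while the effect size stays the same. This is why an absolute or standardised measure of improvement is more informative than the p-value alone.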
Comments by line number:
L34: a reference for “essential climate variables” is missing.
L40: I suggest to add example datasets: “measurements offer frequent data at many locations globally (e.g. Matiu et al., 2021, https://doi.org/10.5194/tc-15-1343-2021; Fontrodona-Bach et al., 2023, https://doi.org/10.5194/essd-15-2577-2023; Mortimer and Vionnet, 2025, https://doi.org/10.5194/essd-17-3619-2025)”.
L53: Does “this work” refer to the one in this manuscript? Not clear if it refers to the previous references.
L54: “an increasing snowpack DEPTH”?
L55: I recommend against the use of etc. Either complete the list or simply state the examples.
Lines 62-64 and 75-77 seem to be a repetition of each other regarding the current gap in knowledge.
L74: perhaps: “snow depth retrieval”?
L83: “compared to in-situ measurements.” This needs references.
L85: such as instead of e.g.
L88: This needs a reference at the end of the sentence.
L91: perhaps: “contribute to improving SD predictions”?
L103: coarse instead of course.
L116: The GHC needs a reference (and isn’t it GHCN?). Does the end of the sentence mean only Germany and Slovenia are taken from this dataset?
L145: Why does rescaling matter for interannual start of season differences? It is unclear what this means.
L158: How many are these remaining gaps? How many were filled?
L166: a quick definition of majority resampling would be useful.
L178: What other downscaling techniques?
Equation 1: Where does this downscaling equation come from? A reference or explanation is needed.
L223: With a rather long paper and a lot of specific nomenclature, it is sometimes easy to forget what LIA, or TPI mean, especially for unspecialised readers. I suggest to include a table or list in the appendix with all abbreviations used (or expand Table B1)
L239: Perhaps it is useful to remind the reader that here the input of meteorological data or physical model simulations is still not assessed. I thought there should be 5 configurations otherwise.
L241: Why “next” and not together with the previous?
L243: I suggest “The second configuration, focusing…” instead of “Conversely, the configuration…”
Table 2 caption: I suggest “within this study.”
Table 2: some of the features are presented in the text after the table is presented, therefore the reader does not know what all these variables are. I suggest to spell them out in the caption, or in the acronym table I suggested above.
3.2.3 Snow depth prediction: I do not understand how this title links to the paragraph. It seems that the paragraph is about standardization of features.
L276: If all folds contain at least one station from each of all the boxes of stations within 5 km of each other, how is this a blind validation? I am possibly understanding this wrong, please clarify this.
L278: The procedure for the temporal fold is also not entirely clear to me. What does it mean that sites were kept separate, but grouped and divided in 5 folds?
Figure 2: Please add a legend for the colours and textures.
L308: Does this mean that for these sites, the snow season is less than 10 days?
L314: The bias, although discussed, is not always shown in figures or tables. Please include it (e.g. Table 3, Table C1).
L332: Why is Table C1 not together with Table 3? As it seems quite important and thoroughly discussed.
L336: “the temporal framework overestimates model performance” what does this mean? Can model performance be overestimated or is snow depth overestimated?
L339: This says that the spatio-temporal framework provide a more realistic evaluation of model performance for this study, but lines 331-332 say that performance is highest for the temporal framework and progressively deteriorates in the spatial and spatio-temporal frameworks. These two statements contradict each other.
L349: in the figure c1b caption it says observed-predicted, so a negative bias would mean an overestimation of snow depth. Please standardise this.
L353-355: How is a deterioration of model performance seen as an accurate representation of model performance? This sentence is unclear.
L355: why “also”? Which other improvements were there?
L355-365: As stated in a general comment, I don’t see this little improvement as a significant improvement, I think it is the effect of the sample size.
L368: FSC instead of fSC.
Figure 3: It is difficult to see differences between configurations, perhaps a table in the supplement or Appendix would be useful.
L397-399: How are these results seen in Fig. 4a?
Figure 5: The predicted snow depth time series show a clear, long flat period in the middle of the accumulation period (especially in 5a and 5b), which does not match observations very well, suggesting that snowfalls and increasing snow depth in mid-winter are not well captured. This should be discussed. Also please state which sites these are (name, location, source of the data).
L505-507: Linking to one of my main comments, I think the results underscore the potential of using meteorological forcing data alone, as input to ML models (as the improvement of Snowclim is minimal). I think this should be included here.
Figure 6. I suggest adding the title of each configuration on each row, to make the figure more easily readable.
L531: again, the improvement seems quite minimal.
L542-543: The potential inability to correctly predict snow density is a key limitation for further refining this method to predict daily time series in the future. This could be discussed.
L539: The authors say weatherML and snowclimML overestimate snow depth, based on the biases. However, when comparing Figure 7b with the measured Figure 7d, it seems the opposite. It seems that 7b (weatherML) shows much lower snow depths than measured. In fact, the scatter plot suggests that weatherML outperforms snowclimML, but the snowclimML snow depth map resembles the observations more. This discrepancy should be clarified.
Figure 7. Why do maps have different MAE than their respective scatter plots? Why do the scatter plots have a low density of points when approaching 0 m snow depth?
Figure 8. It would be better to show the maps of snow depth with survey data, without survey data, the difference, and the measured maps, to enable a better comparison.
L593-595: Compare these results to estimates from other studies, such as the results from Dunmire et al 2024.
Equation A1: Can Wsat not become infinite if any weight is zero? Revise or clarify.
L633-635: what downscaling techniques and what parameters?
Figure B1: It would be interesting to see different scatter plots for the snow surveys and the point measurements.
Figure C1. Why not just a map with the bias per station, and compare it with the one from Dunmire et al. (2024) in their Figure 2?
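The question on L276 about whether the spatial validation is truly blind can be made concrete with a small check. The sketch below is not the authors' fold construction, which is not described on this page. It only shows, for hypothetical station and box identifiers, how to test whether a group of nearby stations ends up in more than one fold.

```python
from collections import defaultdict


def assign_folds_by_group(station_groups, n_folds):
    """Give every station in a group the same fold (round-robin over sorted group ids)."""
    group_ids = sorted(set(station_groups.values()))
    fold_of_group = {g: i % n_folds for i, g in enumerate(group_ids)}
    return {station: fold_of_group[g] for station, g in station_groups.items()}


def leaking_groups(station_groups, station_folds):
    """Return the groups whose stations are spread over more than one fold."""
    folds_per_group = defaultdict(set)
    for station, group in station_groups.items():
        folds_per_group[group].add(station_folds[station])
    return sorted(g for g, folds in folds_per_group.items() if len(folds) > 1)


# Hypothetical stations, each tagged with the box of neighbours it belongs to.
station_groups = {"s1": "box_a", "s2": "box_a", "s3": "box_b", "s4": "box_b", "s5": "box_c"}

grouped_folds = assign_folds_by_group(station_groups, n_folds=2)
split_folds = {"s1": 0, "s2": 1, "s3": 0, "s4": 1, "s5": 0}  # boxes split across folds

print(leaking_groups(station_groups, grouped_folds))
print(leaking_groups(station_groups, split_folds))
```

If nearby stations share a fold, no group is reported as leaking. If each box contributes a station to every fold, the model is validated at locations that have close neighbours in the training data, which is the concern raised in the comment.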

Citation: https://doi.org/10.5194/egusphere-2025-3327-RC1
- AC1: 'Reply on RC1', Lucas Boeykens, 27 Jan 2026
  
  We refer to the PDF file containing our responses to the reviewer’s comments.
  
  Citation: https://doi.org/10.5194/egusphere-2025-3327-AC1
EC1:
'Comment on egusphere-2025-3327', Francesco Avanzi, 18 Nov 2025
Dear authors, I am posting a review on behalf of the second reviewer for this manuscript.
-----------------------
Quantifying the volume of frozen water stored as seasonal mountain snow is a fundamental topic in hydrology and water resources management. The authors address the research need by modeling snow depth (SD) at in situ locations and for spatially distributed applications, using an XGBoost machine learning algorithm and forcing it with Sentinel-1 dual polarized synthetic aperture radar (PolSAR) data products and traditional inputs (e.g., precipitation, elevation, slope, aspect). The novelty of the work is using the PolSAR data in the ML workflow, and rigorously assessing the information gained by comparing to existing ML SD modeling workflows through model performance metrics, feature importance, and SHAP analysis, for a total of 5 ML SD models. The results indicate that the PolSAR features improved model performance at in situ and spatially distributed SD modeling locations, and were often a key model input in the feature importance assessments, though behind meteorological and fraction snow cover area products. A key element of the manuscript is the three-fold nested cross-validation technique used for model evaluation. It is rigorous and establishes a standard for others to follow as the method supports an independent assessment of the model’s spatial, temporal, and spatiotemporal prediction skill.
While the manuscript provides a publishable contribution to the hydrologic and cryosphere modelling community, the manuscript is quite long and could be much more focused. This is primarily due to the results and discussion section, which runs to 9 full pages. While the information is useful, many of the case-by-case examples could be placed in a supplementary information section so that the manuscript can communicate and discuss the key findings of the study. I suggest a minor revision, emphasizing a restructuring of the Results/Discussion section to have it more effectively communicate the key results and discuss their relevance to the snow modeling community.

Does the paper address relevant scientific questions within the scope of TC?

Yes, SD modeling is within the scope of TC. And the use of PolSAR data inputs within an ML application is novel.

Does the paper present novel concepts, ideas, tools, or data?

Yes, the use of PolSAR data inputs within an ML application is novel.

Are substantial conclusions reached?

Yes, the authors demonstrate that including PolSAR observations within an ML SD model produces a more skillful SD model (statistically significant) than those without.

Are the scientific methods and assumptions valid and clearly outlined?

Yes, the introduction identifies a research gap (it identifies several), the methods are quite clear (minor comments to address), and the results/discussion shares and interprets the findings.

Are the results sufficient to support the interpretations and conclusions?

Mostly. The results support the discussion interpretations. However, there are some conclusions (in the conclusions section) that are not supported.

Is the description of experiments and calculations sufficiently complete and precise to allow their reproduction by fellow scientists (traceability of results)?

The authors mention that the code will be provided upon manuscript acceptance. So at this time, one cannot assess the reproducibility. The authors provide links to get the data/model inputs.

Do the authors give proper credit to related work and clearly indicate their own new/original contribution?

Yes.

Does the title clearly reflect the contents of the paper?

Yes, but it could highlight the impacts more, rather than saying what is done.

Does the abstract provide a concise and complete summary?

The abstract shares a complete summary, but covers too much. It could be more concise with respect to the main experiments, results, and conclusions.

Is the overall presentation well-structured and clear?

Yes, it is well structured but it is long. The manuscript would substantially benefit from being more focused and highlighting key conclusions. Additional information could be referenced and placed in a supplemental information section.

Is the language fluent and precise?

Yes.

Are mathematical formulae, symbols, abbreviations, and units correctly defined and used?

Mostly. I suggest in Table 2 that the authors provide a definition of the variables (part of the ML configuration). This would support a broader audience’s understanding of the work and provide a clear reference for variables mentioned throughout the manuscript.

Should any parts of the paper (text, formulae, figures, tables) be clarified, reduced, combined, or eliminated?

Yes, the Results/Discussion section is quite lengthy. I suggest revising this section to focus on the key results and the conclusions they lead to. This would support a broader audience’s understanding of the authors’ key findings and support a synthesis of the information.
The introduction has great information but could improve in flow. Key research gaps are shared throughout the section, rather than the gaps being shared at the conclusion. Revising the introduction to have each paragraph build on the prior and clearly state the research gap(s) identified in the literature review would improve the section’s flow and support reader comprehension.
Additionally, I suggest additional commentary in the figure captions, highlighting the key takeaways in addition to the description of the labels. Having the authors highlight the key findings in the figure captions will support a more detailed understanding of the work for a broader audience. Lastly, I request that figure 8 includes the No Survey mapped product. Adding this plot to the figure would improve the understanding of the spatial performance of the in situ trained model for spatial prediction.

Are the number and quality of references appropriate?

Yes.

Is the amount and quality of supplementary material appropriate?

Yes, however, much of the manuscript results/discussion section would be appropriate for a supplementary material section so that the main document could be clearer and concise, aiding in interpretation to a broader audience.
Overall quality of the preprint ("general comments"): 7.9/10. It is a quality document, that with some refinement to the communication of the results, would easily move up to 9/10.

Scientific questions/issues ("specific comments")
In the conclusion (line 582-583), the authors state “These results showcase the potential of using PolSAR observations within non-machine learning applications, e.g., within a conceptual model.” This is new information for the reader, and I do not see where the results support this conclusion.
No other questions/issues.

Purely technical corrections ("technical corrections": typing errors, etc.).
No.

Line comments
107-108, 118: How do the in situ and point-based measurements represent the variability of SD throughout the study domain (European Alps)? What is the range in snow environments (e.g., Sturm snow classification), elevation, aspect, slope, etc. observed in the data? Sharing the distribution of training data would support more nuanced conclusions related to the model output, such as providing more context to model skill at locations other than in situ locations.
180-181: Can the authors provide a citation for their precipitation downscaling method?
259: Table 2. Can the authors include a description of the model variables here? This would provide a reference for the reader as they move through the manuscript and want to know which variables the authors relate to model skill and feature importance.
362/493: Table 3. The authors mention Bias as a model evaluation metric, but it is not in any table or figure. Can the authors add Bias throughout?
508: “SHAP value FI”, I think this is referring to Figure 7. Can the authors clarify what FI figure to look at, or include it if it is not present?
534-536: The elevation bands do not provide meaningful information without connecting to a snow environment (e.g., alpine, tundra, sub-alpine…) as, depending on the location, this could be alpine or forest. I suggest coupling the elevation band ranges with a snow environment type. Lastly, 1000m is substantial and could cover several snow environments. A more representative snow environment communication other than elevation will be more impactful to the readers.
542: Figure 8: I suggest adding the No Survey Data spatial map so the readers can visualize the snow depth prediction across the landscape. Based on the difference map (b,e) it appears that the No survey data map product may estimate the same snow depth for all pixels, not acknowledging terrain impacts on mountain snow depth distribution. Adding two more maps would show that this is not true, or if it is, highlight the need for spatial data to be included in models. This would also help support the claim of spatial SD leading to improved spatial patterns (line 556)
582-583: I do not know how the authors came to this conclusion and do not recall (nor can I find) supporting information within the manuscript to come to this conclusion. I suggest either removing it or ensuring that supporting findings are mentioned earlier in the document.
Citation: https://doi.org/10.5194/egusphere-2025-3327-EC1
- AC2: 'Reply on EC1', Lucas Boeykens, 27 Jan 2026
  
  We refer to the PDF file containing our responses to the editor's comments.
  
  Citation: https://doi.org/10.5194/egusphere-2025-3327-AC2

Peer review completion

AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload

ED: Reconsider after major revisions (further review by editor and referees) (29 Jan 2026) by Francesco Avanzi

AR by Lucas Boeykens on behalf of the Authors (29 Jan 2026) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (30 Jan 2026) by Francesco Avanzi

RR by Anonymous Referee #1 (06 Mar 2026)

Suggestions for revision or reasons for rejection

I sincerely appreciate the efforts that the authors have made to address or rebut my comments and I am satisfied with the revised manuscript, which has improved with regards to the stated novelty, and the presentation and impact of the results. I congratulate the authors for the good and rigorous work. There are still a couple minor considerations that should be addressed, after which I would be happy to see the paper published.

Follow up on comment 2 about XGBoost: What made the authors choose XGBoost after the comparative study with other models? Was it a better performance, ease of use, computational efficiency?

Note on comment 3: I am happy with the revised sentence and the authors’ take on my comment about the ML model learning the physics. Indeed, we do not know what the model learns, but what we know is that the relationship between snow depth and climate variables is reproduced.

I appreciate the efforts on Comment 4 and the extra analysis/figures, but something is still unclear to me. The authors state in Review Figure 2 and associated text, that the “sample size” refers to the number of sites used. However, the performance values of Table 3 (CombinationML) and Figure 6 (also CombinationML) are the same, and it is clear based on Figure 6 that the sample size is in the order of millions (all measurements used, not per site). Therefore, I suspect that these performance values are calculated based on all the points in the dataset, and not based on average points per site. Please clarify this and if that is the case, then again I would suggest against the use of significant differences based on p-value, for this study.
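The distinction drawn here between metrics pooled over all measurements and metrics averaged per site can be sketched as follows. The site names and error values are hypothetical and only serve to show that the two approaches yield different scores and different effective sample sizes. Nothing in the sketch is taken from the manuscript's data.

```python
from statistics import mean


def pooled_mae(errors_by_site):
    """MAE over all measurements; the sample size is the number of measurements."""
    all_errors = [abs(e) for errors in errors_by_site.values() for e in errors]
    return mean(all_errors), len(all_errors)


def per_site_mae(errors_by_site):
    """Mean of the per-site MAEs; the sample size is the number of sites."""
    site_maes = [mean([abs(e) for e in errors]) for errors in errors_by_site.values()]
    return mean(site_maes), len(site_maes)


# Hypothetical prediction errors in metres: one well-sampled site, one sparse site.
errors_by_site = {
    "site_a": [0.1, -0.1, 0.1, -0.1],
    "site_b": [0.6],
}

pooled_value, pooled_n = pooled_mae(errors_by_site)
site_value, site_n = per_site_mae(errors_by_site)
print(f"pooled:   MAE={pooled_value:.2f} m over n={pooled_n} measurements")
print(f"per site: MAE={site_value:.2f} m over n={site_n} sites")
```

A significance test inherits the sample size of whichever of these two quantities it is applied to. With pooled daily measurements the sample size grows into the millions, which is the situation the comment warns about.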

I am satisfied with the contextualization of the results against the Dunmire et al 2024 paper. While I acknowledge that the purpose of the study is not a model comparison, it is important to contextualise the results with a similar paper like this one.

I agree with reviewer 2 that given the length of the Appendix, it should be moved to a Supplement.

Data availability: Following open science practices, please provide the links for the freely downloadable snow depth data from the providers across the European Alps. Also indicate which snow depth measurements were provided through personal communication and from whom.

Thank you and all the best.
Rev 1.


RR by Anonymous Referee #2 (03 Apr 2026)

ED: Publish subject to minor revisions (review by editor) (05 Apr 2026) by Francesco Avanzi

AR by Lucas Boeykens on behalf of the Authors (16 Apr 2026) Author's response Author's tracked changes Manuscript

ED: Publish as is (16 Apr 2026) by Francesco Avanzi

AR by Lucas Boeykens on behalf of the Authors (23 Apr 2026) Manuscript

Short summary

We used AI to better estimate the height of the snowpack present on the ground across the European Alps, by using novel satellite data, complemented by weather information or snow depth estimates from a computer model. We found that both combinations improve the accuracy of our AI-based snow depth estimates, performing almost equally well. This helps us better monitor how much water is stored as snow, which is vital for drinking water, farming, and clean energy production in Europe.