Evaluating a prediction system for snow management

. The evaluation of snowpack models capable of accounting for snow management in ski resorts is a major step towards acceptance of such models in supporting the daily decision-making process of snow production managers. In the frame of the EU H2020 project PROSNOW, a service to enable real-time optimisation of grooming and snow-making in ski resorts was developed. We applied snow management strategies integrated in the snowpack simulations of AMUNDSEN, Crocus and SNOWPACK/Alpine3D for nine PROSNOW ski resorts located in the European Alps. We assessed the performance of 5 the snow simulations for ﬁve winter seasons (2015-2020) using both, ground-based data (GNSS measured snow depth) and space-borne snow maps derived from Copernicus Sentinel-2. Particular attention has been devoted to characterize the spatial performance of the simulated piste snow management at a resolution of 10 meters. The simulated results showed a high overall accuracy of more than 80 % compared to the Sentinel-2 data. Moreover, the correlation to the ground observation data was high. Potential sources for the overestimation of the snow depth by the simulations are mainly due to the impact of snow redistribution 10 by skiers or spontaneous local adaptions of the snow management, which were not reﬂected in the simulations. Subdividing each individual ski resort in differently-sized ski resort reference units (SRU) based on topography showed a slight decrease in mean deviation. Although this work shows plausible and robust results on the ski-slope scale by all three snowpack models, the accuracy of the results is mainly dependent on the detailed representation of the real-world snow management practices in the models. This calls for an assessment of impacts from meteorological station measurements and their interpolations in the 15 ski resorts as well as potential limitations in describing the snow cover, especially managed snow, by simulations.


Introduction
The Alpine ski industry plays a central economic role in many mountain regions and is important for regional development.
About 13.6 million people live in the European Alpine region with around 60 to 80 million tourists visiting every year. The ski 20 resorts generate a high turnover in the winter tourism destinations (Vanat, 2020). However, a multi-annual perspective shows a stagnation of skier visits and the growing economic pressure on resorts, aggravated by the threat of climate change and decrease in average snow conditions as well as the consequences of the Covid-19 pandemic since February 2020. The trend to early winter demand for perfect ski slopes is still increasing but climate change renders a reliable early ski slope preparation more and more challenging. This was particularly visible throughout most of the European Alps at the beginning of the snow seasons 25 2014/2015 and 2015/2016, with strong deficits in natural snow amounts and elevated temperature conditions (NOAA, 2014(NOAA, , 2015 hampering the possibility to produce machine made snow. Ski resorts throughout the world have become increasingly reliant on snow-making facilities to complement the natural snow cover. Over 90 percent of all large ski areas in the Alpine region use snow-making facilities. In terms of snow-making, Italy (90 %) is in the lead followed by Austria (70 %), Switzerland (45 %), France (35 %) and Germany (25 %) (Lalli et al., 2019). The whole workflow of ski resort management with modern 30 slope preparation and maintenance, snow storage and machine made snow, however, could increase in efficiency. Examples of potential savings are related to both i) an over-production, which leads to a delayed melt out in spring well after the closing of the season and ii) a too early snow making production in autumn when the conditions can rapidly change and lead to the complete melt of the produced snow, e.g. due to an early-season warm spell.
Beyond the time scale of weather forecasts, which are generally reliable for a time frame of a few days, ski resort managers 35 have to rely on various and scattered sources of information, hampering their ability to cope with highly variable meteorological conditions. In the frame of the funded European Union's Horizon 2020 project PROSNOW (Morin et al., 2018), a demonstrator of a meteorological and climate prediction and snow management system from a short-term forecast covering the first 4 days to several months ahead, specifically tailored to the needs of the ski industry, was developed. Besides and within this project, approaches for representing snow management in numerical snow cover models were developed Spandre 40 et al., 2016) and integrated in the existing snow cover models AMUNDSEN (Strasser, 2008;Strasser et al., 2011;Hanzer et al., 2016) SNOWPACK/Alpine3D (Lehning et al., 2006;Bartelt and Lehning, 2002) and Crocus (Vionnet et al., 2012;Lafaysse et al., 2017). Applying these snowpack models capable of representing the effects of snow management (grooming and snowmaking), snow depth on the slopes could be predicted in real time and short-term and seasonal forecast mode for various ski resorts. 45 To increase the adaptation capacity of the skiing industry, there is a great need to combine weather and climate forecasting, snow modelling and observations, and to promote existing products to demonstrate their value for professional decision making.
In this context, in-situ observations as well as optical and microwave remote sensing have proven to be mature technologies (Sirguey et al., 2009;Dumont et al., 2012;Bühler et al., 2015). For example, in terms of snow coverage, remotely sensed data can provide twofold key information to the models. On one side, they can be used to correctly initialize the state variables of the models (Monti et al., 2016). On the other side, they can be used to cross-compare the results of the simulations and evaluate them (Mary et al., 2013;Notarnicola et al., 2013a, b). Since remotely sensed data mainly provide binary information on snow presence or absence, a complete model evaluation needs an additional comparison with measured in-situ observations (Hanzer et al., 2016). Nowadays it is increasingly common to employ advanced snow depth monitoring systems such as Global Navigation Satellite System (GNSS) equipped grooming machines to track the snow depth on the ski slopes. Thereby, 55 the performances of the models can be evaluated by comparing model outputs with the measured snow depth values. This evaluation strategy has been shown by Hanzer et al. (2020) for a single point on a slope for various ski resorts. In this work, the performance of the snowpack models were cross-compared with several snow depth measurements spatially distributed with a resolution of 10 m over different ski slopes for two winter season i.e., 2016/17 (driest winter in the recent years) and 2017/18 (snow-rich winter throughout the entire Alps) .

60
The objective of this paper is to evaluate the accuracy of the piste snow management module refined and implemented in the framework of the H2020 PROSNOW project embedded within several snowpack models to simulate snow managements in general and for each individual PROSNOW pilot ski resort in high spatial resolution. In a first step, the results of the snowpack simulations were compared both with remotely sensed satellite snow-covered maps and with snow depth measurements spatially distributed along the ski slopes acquired with specific GNSS systems. The impact of elevation, slope and aspect as well as temporal aspects within a season or amongst different years were evaluated. In the second step, the simulation domain was spatially discretized into defined ski resort reference unit (SRU)  to reduce computational effort but still be able to achieve meaningful results. In this case, each ski resort was divided into a number of elevation bands ranging from 50 to 400 m.
2 Study sites, models and data 70

Pilot ski resorts
Within the PROSNOW project we focused on the following nine ski resorts, which are also all part of this study: Seefeld (cross-country part) and Obergurgl in Austria, La Plagne and Les Saisies in France, Garmisch Classic in Germany, Colfosco, San Vigilio, and Livigno in Italy, and Arosa-Lenzerheide in Switzerland. This selection of ski resorts represents a large diversity of geographical, climatical and snow-making practices and equipment. Figure 1 and Table 1 show the locations and key 75 characteristics of the pilot resorts.

Snowpack models
Snowpack simulations are performed with AMUNDSEN for the Austrian and the Italian resorts (Colfosco, Obergurgl, San Vigilio, and Seefeld), with Crocus for the French resorts (La Plagne and Les Saisies), and with SNOWPACK/Alpine3D for the remaining resorts in Switzerland, Germany, and Italy (Arosa-Lenzerheide, Garmisch Classic, and Livigno). We used for each 80 3 https://doi.org/10.5194/tc-2021-56 Preprint. Discussion started: 24 February 2021 c Author(s) 2021. CC BY 4.0 License. ski resort different settings for the parameters concerning wet-bulb temperature, snow depth threshold, timing and density of grooming. A detailed description of the functionality and parameters of the snow-making and grooming modules, which are used in all of the snowpack models are shown in the study by Hanzer et al. (2020) and the applied configuration for each ski resort in Table 1. The three snowpack models, AMUNDSEN, Crocus, SNOWPACK/Alpine3D, are well-established and have been widely applied in numerous studies throughout the past decades (Essery et al., 2020;Krinner et al., 2018).

85
All three models require spatial input data for the snow management simulations consisting of a digital elevation model (DEM) covering the study sites, the locations of the ski slopes, and the locations and types of the snow guns (lances or fans). The used snow management configurations for each ski resort are shown in Table 1. Meteorological forcing data for the simulations is based on measurements from automatic weather stations in close to or within the study sites and from the SAFRAN analysis for Crocus model runs (Vernay et al., 2019), consisting of at least hourly measurements of air temperature, 90 precipitation, relative humidity, wind speed, and radiation. The generated output data are rasterized snow depth files with a resolution of 10 m.
The models are equipped with a machine made snow production and grooming module, which can be used for the operational applications. A set of core parameters can be used for very detailed simulations of snow management practices in single ski resorts. They take into account snow demand, the meteorological conditions including information on wet-bulb temperature 95 and wind speed, and the ski resort infrastructure in terms of the amount of snow that can be produced in a given time step at a certain location within the resort. For the simulations it is assumed that for a given snow gun all of the produced snow is distributed immediately and evenly over a predefined slope section. Additionally, the grooming module allows to account for the distinct properties of groomed snow on ski slopes depending on the amount of snow present and a defined grooming schedule. It assumes that grooming has no effect on the distribution of snow, e.g. shifting of snow from one place to another, 100 but rather only compacts it .

Ski resort reference unit -SRU
Real time simulations with a very fine spatial resolution (i.e. 10 m) require a very high computational demand and such fine spatial resolutions are also often not necessary for the overall day-to-day resort management. Spatial clustering of slopes and slope sections is often sufficient for the snow managers working in an operational mode. Therefore, we additionally discretized 105 each ski resort in ski resort reference units (SRUs). In a post-processing step, we aggregated the initial 10 m pixel size to larger areas. We defined different SRU sizes of the individual pistes by slicing them into the following elevation step ranges: 50 m, 100 m, 200 m, 300 m and 400 m. This aggregation results in different amounts of SRUs for each considered elevation range. Figure 2 shows exemplarily a map for the western part of Arosa-Lenzerheide with the resulting elevation classes for the different SRU sizes. In addition, we evaluated the sensitivity of SRU size to the final accuracy. More details about the SRU 110 definition can be found in Supplementary Material A1 and Table B1.

Sentinel-2 data
The model results for all ski resorts were compared with remote sensing images; for this study, Sentinel-2 data (S2) were used. The processing of the S2 snow-covered maps was done in three main stages: i) calibration to Top of the Atmosphere (ToA) reflectances, ii) re-projection, resampling and co-registration with the model grid with a final resolution of 10 m, iii) 115 classification with a Support Vector Machine (SVM) classifier trained with an active learning procedure (Tuia et al., 2016).
In particular, the classification was devoted to both: i) detecting the clouds; and ii) detecting the snow presence. This was done by exploiting the most representative features for each of the two classification problems. In detail, we used all the spectral bands, the normalized difference snow index (NDSI) and the normalized difference vegetation index (NDVI) for the cloud classification. NDVI was calculated as normalized difference between NIR and red bands, whereas NDSI was calculated 120 between green and SWIR bands. Regarding the snow detection, we used the spectral bands, NDSI, NDVI and the illumination angle calculated from the solar zenith and the solar azimuth angle (Riaño et al., 2003).
Since the S2 snow maps are compared to the snow simulations, particular attention was devoted to obtain accurate results.
For this purpose, three main steps were performed to address the main problems related to snow classification from optical images, which are the detection of: i) particular cloud conditions; ii) the snow mixed pixels i.e., pixel in which other classes 125 than the snow contribute to the observed spectral response; and iii) the snow under the canopy of the forests.
First, we performed a visual analysis of all the S2 images for excluding the scenes presenting complex cloud conditions.
In particular, semitransparent clouds, which are thin, high-altitude clouds composed of ice crystals were detected from the S2 band acquired at 1.375 µm. Interestingly, semitransparent clouds might not be visible in other spectral bands but they alter the spectral signatures leading to unreliable results.

130
The SVM classifier was trained in a way that a pixel with at least 50 % of snow coverage is classified as snow. This means that for example during the snow making production at the beginning of the season a pixel in which a significant snow production is ongoing is classified as snow even though not all the pixel area is covered with snow. Additionally, we identify the shadowed areas from where the multi spectral sensor on board of S2 is not able to record sufficient energy for distinguishing between snow and snow free areas. This is happening when the sun is low at the horizon approximately from mid November to mid 135 February and the terrain is extremely steep. In all the other shadow cases, the SVM classifier was trained to detect the snow presence. Hence, the output of the procedure was a classified map with four classes, i.e. snow, snow-free, shadows and clouds.
Since the detection of snow under forest canopy is a challenging research topic from both the remote sensing and modeling point of view, we conservatively masked out forested areas for all the ski resorts, based on the landcover classification provided by OpenStreetMap (OSM) (resolution of 30 m) (Schultz et al., 2017) rasterized to a resolution of 10 m and manually refined 140 for all ski resorts in such a way that the ski slopes that were passing through forest but were visible at the resolution of S2 were considered for the evaluation. The masked layers include coniferous, deciduous and mixed forests and in some particular cases also scrubs and heath. After this screening all the snow maps were considered reliable and the scenes with a percentage lower than 50 % of cloud were retained as sufficient information for the cross-comparison. A1 and B1 in the Supplementary Material. The number of S2 scenes which were available for each ski resort within the winter periods of 2015 to 2020, is in general high and ranged between 62 and 190 per ski resort. The differences in numbers of available scenes was mainly affected by cloud coverage and the atmospheric condition to perform accurate classification.

GNSS snow depth data
More and more ski resorts are relying on spatially distributed snow depth measurements performed with a modern Global Nav-150 igation Satellite System (GNSS) technology for an efficient management of their slopes. This technique relies on differential GNSS signals and takes measurements without snow depth on the slopes as a reference into account. The sensors are installed on top of the groomers and thereafter snow depth can be tracked as a positive side effect whilst grooming the pistes. This technology ensures a snow depth measurement accuracy down to the centimetre level and in a spatial resolution of 1 m, which allows also to track snow redistributions with the groomers.

155
For our study, rasterized data were provided by Snowsat and Leica and were resampled to a resolution of 10 m using the average value in order to be directly comparable with the model outputs. The GNSS snow depth data were available for almost all pilot ski resorts including Arosa-Lenzerheide, Garmisch Classic, Livigno, Obergurgl, Seefeld, San Vigilio, Colfosco and Les Saisies. We considered for the analysis the measurements spanning from the 1st of December to the 31st of March with a daily temporal resolution, when GNSS data were available. The data have been preprocessed to eliminate outliers and to check 160 their consistency. Table 1 shows the available seasons of GNSS snow-depth measurements for all ski resorts.

Evaluation
In this section, we describe how we evaluate the snowpack simulations carried out for the PROSNOW ski resorts. This includes i) the evaluation of simulated snow depth and ii) the evaluation of snow-covered area. In detail, the simulated snow depth was compared with the GNSS-derived measurements over a number of ski slopes, whereas the snow-covered area is evaluated by 165 comparing the model snow-covered area with the S2 snow maps. The metrics used for assessing the agreement between the simulations and S2 snow maps are the confusion matrix and the snow persistence index defined below.
The evaluation analysis for both snow depth and snow-covered area was conducted by stratifying the data according to temporal and topographical constrains. Moreover, a differentiation was made between natural snow i.e., snow outside the pistes and managed snow i.e., snow inside the pistes. In the following subsections more details on how the evaluation metrics 170 were calculated will be presented.

Snow coverage -Snow persistence (SP) and Confusion matrix (CM)
The latitude and elevation of a ski resort has a big impact on the timing of snow accumulation and melt. Therefore, comparing snow patterns between regions is challenging despite the widespread application of remotely sensed methods for snow research.
The snow persistence (SP) is a snow metric that can be used to map snow zones globally (Macander et al., 2015;Wayand et al., 2018;Vionnet et al., 2020). It is calculated in this work as the number of snow-covered days divided by the number of valid Sentinel-2 observations for the whole period (5 years). The number of total observations can vary for each pixel since cloudy/masked pixels are not considered for the computation. Hence, the same valid dates are considered for the model. The resulting SP Index ranges from 0 to 1. This approach has been used in a wide range of climates (Richer et al., 2013;Moore et al., 2015;Saavedra et al., 2017), to identify transitions between rainfall and snow melt peak streamflow source regimes 180 (Kampf and Lefsky, 2016) and for predicting water yield (Saavedra et al., 2017;Hammond et al., 2018).
Two SP indices were extracted considering both S2 snow maps and model simulation. They were calculated pixelwise as the ratio between the number of snow-covered days derived by S2 or from the model, divided by the total number of S2 observations (snow or snow-free). The values of SP were always between 0, i.e. always snow-free dates, and 1, i.e. always snowcovered dates. If a S2 snow map pixel is classified as cloud the corresponding snow model output is masked out preserving the 185 one to one correspondence between the two SP indices.
Additionally to the SP index, for each ski resort a confusion matrix (CM) was computed to assess the quality of the S2 and snowpack simulations. We refer the confusion matrix to modelled vs. observed variables. The confusion matrix has the form indicated in Table 2: TP: true positive, i.e. both model and S2 labelled as snow; FP: false positive, i.e. model labelled as snow, S2 as snow free; FN: false negative, i.e. model labelled as snow free, S2 as snow; and TN: true negative, i.e. both model and 190 S2 labelled as snow free. Therefore, TP and TN indicate that model and S2 data match with each other whereas FP and FN indicates that they don't match.
We distinguish between natural snow and snow on the slopes. Furthermore, the analysis was split in three periods: beginning (B: October-November-December), middle (M: January-February) and end (E: March-April-May) of the season. A pixel can be either true (snow in S2 data and model; no snow in S2 data and model) or false (snow in S2 data and no snow in model; 195 snow in model and no snow in S2 data). With the accumulation of all pixels assigned to be true, an overall agreement OA (%) was calculated for each period and catchment:

OA = TP + TN TP + FP + TN + FN
(1) which describes how often the agreement between S2 and the simulations was correct.

200
The metric used for this assessment are the mean deviation (MD) and the root mean square deviation (RMSD) over time for each ski resort. Regarding the snow-covered area evaluation, the binarization of the simulated snow depth to snow-covered map was done by imposing a threshold of 0.05 m i.e., every value above this threshold was identified as snow, while on the contrary all the values below 0.05 m were identified as snow free. This threshold is in line with previous works (Notarnicola, 2020). In each defined SRU we analyzed the variations in terms of snow depth MD and RMSD. The MD was used to relate 205 between the modelled and measured snow depth on the slopes among years for the period October to May. Measured snow depth data from all available years of observations (see Table 1) were used in order to analyse the quality of the simulated snow management configuration for each ski resort. This allowed better projection of the uncertainty of the simulated snow management configuration based upon the measured snow depth. RMSD and MD per one time point are defined as: where N is the number of valid pixels for a given date, SD m is the GNSS snow depth measured and SD s is the snow depth simulated by the model. Hence, the metrics were calculated pixel based and only on those pixels where GNSS measurements were present, i.e. the number of considered pixels N can vary for each date. A negative (positive) MD value indicates an overestimation (underestimation) of the snow persistence of the PROSNOW models.

215
In this section, the simulated results for all ski resorts were compared with the S2 and with the GNSS measured snow depth data on the ski slopes. Model runs were performed for five winter seasons (2015-2020) from 1st of October until end of May.
The simulations were carried out for all ski resorts using the default snow management configurations accounting for both fan guns and lance guns as well as different temperature threshold and base-layer production targets, given in Table 1. The configurations were either assumed or provided by the responsible person of snow making of each ski resort.

Snow coverage
The S2 algorithm produced accurate snow maps with an overall accuracy above 80 %, both for high-alpine and lower-lying mountainous regions, and different stages of the season like season start, mid season and end of season. A first approach to assess the skills of the models using S2 maps as a reference was a confusion matrix. Table 3 presents the results for each ski resort. Analyses were carried out for solely regarding snow on the pistes and additionally also for the complete ski resorts 225 including the off-piste areas. Moreover, overall accuracy trends over time are shown for each ski resort in the supplementary material ( Figures B1 and A1). The overall accuracy was over 80 % and the accuracy was highest in the middle of the season, with almost 90 % or higher. The snow distribution was well simulated at the beginning and end of the season, with an agreement of over 79 %. However, a larger mismatch was observed for San Vigilio with 69 % at the end of the season. A higher agreement between simulation and the S2 data was observed when regarding the pistes only. Compared to the analysis of the entire resort 230 including natural snow, the overall accuracy of pistes only was up to 8 % higher. Only one resort (Garmisch Classic) showed a slightly lower accuracy of 8 % on pistes.
The snow coverage quality of the snowpack simulations using the snow management modules was further assessed using the S2 data and SP indices. The SP indices were calculated for the simulations and S2 data and the relative differences are shown in Figure 3. The green pixels indicated the applied forest mask to minimize the underestimation of snow detection by the S2 235 algorithm. Observed SP presented similar patterns for each ski resort showing that snow persistence patterns were primarily controlled by the elevation. High elevation corresponds to high SP values, whereas low SP values were found near the tree line and lower elevations. In addition, low SP values were also found near ridge lines, exposed to wind, and influenced by lateral snow redistribution and snow accumulation. These effects were not captured by the simulations. Further, the derived SP values were dependent upon the elevation and slope orientation, primarily due to the impact of solar radiation on simulated snow 240 ablation. The biggest difference in the SP index between the S2 data and model was found in steep slopes where the effect of snow gliding (e.g. avalanches) is stronger.
The overall accuracy was mainly impacted by the elevation and slope. An interesting analysis was represented by the trend of the overall accuracy over 100 m elevation classes, 5 degrees slope -and 45 degrees aspect classes (i.e., North, North-East, East, South-East, South, South-West, West, North-West). This analysis was carried out taking specifically only the snow on 245 the pistes into account and is presented in Figure 4 (left). The agreement of the simulations with the S2 data increased with increasing elevation by around 1 % per 100 m elevation class. The overall accuracy over 5 degrees slope classes is not affected by the slope steepness and the orientation of the slope towards the sun showed no influence on the accuracy.
A closer look at the SP index showed an elevation dependency, but a clear slope or aspect dependency is hard to detect.

Snow depth
The simulated default snow management configurations reproduced the actual conditions in all ski resorts well. Figure 5 shows 255 the RMSD and the MD between the modelled and GNSS measured snow depth over time for each ski resort and each season.
Especially at the beginning of the season, the RMSD values between the simulated and the GNSS-measured snow depth on the pistes was below 0.4 meters on average. However, the RMSD values slightly increased during the season affected by daily adaption of the snow management configurations due to the actual weather conditions. The complexity of the snow management configuration increased during the season leading to an increase in the RMSD values at the end of the season. In 260 general, the temporal evolution of the RMSD values shows almost no large peaks. The large peak at Lenzerheide on January 15th, 2018/2019, occurred due to a heavy snowfall period which led to an extraordinary avalanche situation. As a result, a large part of the ski area was closed and many slopes were no longer groomed. This led to the strong increase in the RMSD values in this season. In general, our models mainly overestimated the snow depth and the fluctuations increased during the season.
Regarding the degree of linear dependency of the simulated and measured data, the MD as shown in Figure 5 reaches in some 265 resorts more than 0.5 meters suggesting that more snow was produced compared to the model. For several resorts an increase in MD can be observed especially at the end of the seasons. towards to the south-west. This is primarily due to the impact of solar radiation on the snow ablation.

SRU discretization
Discretizing the ski resorts in coarser clusters allows minimizing the error in terms of RMSD due to averaging effects between the simulations and GNSS measured snow depth between 8 and 45 % independent of the season and cluster size (see Table   B1). Therefore, it was not possible to find an overall optimal SRU size regarding the RMSD values. We also computed the 280 MD of the errors calculated as difference between the measured snow depth and the modelled snow depth, shown in Figure   7. The original pixel snow depth is kept for the 10 m resolution, while a mean snow depth for each SRU is calculated for the different SRU altitudinal band classes. For the computation of the SRU snow depth, only those pixels with valid GNSS measurements were considered and taken into account (between 0 and 3.5 m). The MD and the standard deviation of the error are then calculated by considering all the available measurements over time and space. We propose here a differentiation for 285 the four months December, January, February and March in Figure 7. Interestingly, the mean and the deviation of the MD values differ for single resorts widely.
In addition, we tested the spatial variability within the pistes. For a visualization example presented in Figure 6 we chose a long piste ranging between 700 and 1700 m a.s.l. within the Garmisch Classic ski resort for a date with maximal piste coverage. Figure 6 reports the simulated snow depth and the GNSS measurements as well as the pixelwise difference, for the 290 original resolution at 10 m (on the left). The same variables averaged over a possible SRU discretization, in this specific case considering 100 m elevation bands, are also reported in the same figure on the right. Regarding the fine resolution of 10 m, it is obvious that the GNSS snow depth data is much more variable in general and as especially occurring in the upper part of this piste due to highly localized snow management than the simulations with the default configuration can provide. A good agreement was found especially in the lower and middle parts of this example slope with slight under and over estimations of 295 the simulated snow height. The picture would look similar for all other pistes and resorts (not shown).

Discussion
This study presents a new high-resolution evaluation of snowpack simulations including snow management modules for mountain ski resorts to assess the quality of the simulations. The simulated results showed a high overall accuracy of more than 80 % compared to the Sentinel-2 data and a root mean squared error to the GNSS measured snow depth below 0.6 m. The simulated 300 results for all ski resorts are plausible and robust on the ski slope scale.

Snow coverage
For every ski resort, a large number of S2 images classified with high accuracy were available to assess the quality of the simulations in terms of snow coverage. More than 62 S2 images at each ski resort and even more than 150 for two resorts were analysed. The specific machine learning algorithm to derive information with low uncertainty about the presence/absence of 305 snow from the Copernicus Sentinel-2 images allow the generation of snow maps across the Alps with a relative low manual effort. Additionally, the very detailed forest mask applied to the evaluation allowed us at the same time i) to avoid situations for which the information provided by S2 are insufficient to produce accurate results; and ii) to be able to extract information about the snow cover for the pistes crossing dense forests such as it is often the case for Garmisch. This allowed us to minimize the pixels loss due to canopy shading. The highest pixel losses with respect to the total pistes area were in Arosa-Lenzerheide 310 (5.7 %) and Seefeld (5.5 %), followed by Obergurgl (2.4 %), Livigno (0.2 %), and Colfosco (0.2 %). For the other ski resorts it was zero.
The underestimation in the overall accuracy at the beginning and the end of the seasons was due to the fact that sometimes the exact snowline in the S2-data was hard to detect. Some snowlines were obscured by shadows and they often did not appear as continuous lines, which may slightly bias our regional estimates especially in areas which span different land use. Also 315 white rock types or illuminated wet rocks can lead to brighter pixels in the S2 data and to an underestimation in the overall accuracy. Due to the similar spectral signal of these pixels to snow, the algorithm detects an ice-snow boundary and classifies these pixels as part of the snowlines. Since these patches were situated in ski resorts in rocky areas, these misclassified pixels introduced negative biases in the overall accuracy estimates, and they were filtered out manually.
By inter-comparing the model simulations and the S2 images we encountered some recurrent errors. As described in the 320 previous section, in average the accuracy of the simulations are high and the snow coverage simulated by the PROSNOW models are consistent with observations. However, wrong discretization and/or missing meteorological input and lack of snow managing/land use information were the main sources of errors. In particular, (1) ephemeral snow (i.e., snow that lasts few days either at the beginning or at the end of the season) are difficult to be simulated correctly by the models; (2) rain/snow transition e.g., rapid snow melt inside the catchment are hardly to be matched correctly by the models; (3) due to unknown 325 snow making strategies, which are then not incorporated in the PROSNOW models, snow making at the beginning of the season and delayed/anticipated snow melting at the end of the seasons are not correctly modelled over managed slopes; (4) the heterogeneous landscape at 10 m resolution plays a role in the snow accumulation and melting dynamics e.g., towns, lakes and roads are visible and change the snow distribution. This information is generally not addressed by the PROSNOW models. These are just confirmations of the expected limitations of the state of the art snow cover models. However, for the 330 first time a systematic and extensive evaluation at high model resolution was performed. The details of the analysis with all the different recurrent errors encountered for each ski resort is shown in Table C1 and can be used for future studies. Note that the simulations presented here do not benefit from snow depth measurements and water consumption of snow-making, which are used, upon availability of the relevant data, for real-time applications of the models for operation PROSNOW service provision, thereby limiting the impacts of some of the caveats identified above.

Snow depth
Using the extracted GNSS data which was derived by grooming the pistes on a daily mode as ground observation can be only accepted with some restrictions. There are several problems which might affect the quality of the data: (i) the digital elevation model (DEM) profile of all ski resorts is prone to change every year due to earthwork and adaption of the slopes and is always considered correctly in the extraction of the snow depth; (ii) the inclination of the groomer has a large impact on 340 the GNSS measured snow depth. For example if the calculated snow depth is 0.5 m, the effect of 30 % inclination would be 6 cm which means that the calculated snow depth would be 12 % higher than in reality. Furthermore, the work of the groomers is not only to measure the local snow depth but also to fill sinks, compensate humps or level out the snow production or redistribution by the skiers. This might lead to very small scale variances in the GNSS-derived snow depth both for pistes with and without technical snow production, whereas the model shows less pronounced variability. Therefore, especially the small-345 scale deviations between the simulated and GNSS-measured snow depth are not well applicable to reflect the model accuracy.
The comparison rather shows to which degree the default snow management configuration was applicable for the individual ski resorts. As each ski resort spontaneously adapts its snow management production due to changing weather and snow height conditions, it is not possible to consider the spontaneous snow management adjustments during the season in the simulations. It does not make a big difference between piste with and without technical snow production despite the models do not explicitely (2) snow redistribution (i.e south, south-west exposed pistes) are not considered in the simulations; (3) levering of the pistes to reduce the accident risk of the skiers; (4) snow redistribution by the skiers; and (5) systematic errors due to wrong snow making strategy. These errors are already known but it is too complex and not straightforward to consider this in the simulations at current state. In detail, slightly different recurrent errors are 360 encountered for each ski resort by inter-comparing the model simulations with GNSS measured snow depth, which is shown in Table C1.

SRU discretization
Scale and data aggregation have important effects on the simulations and interpretation of snow depth data, which is also true for snow on pistes. Studies in different research fields clearly demonstrate that spatial variability and statistics are dependent 365 on scale (Marceau, 1999;Wu, 1999;Wu et al., 2000;Perveen and James, 2010). Our evaluation showed that these scale dependencies in spatial variability and statistics appear complex with non-linear increases with increasing grid-cell size. However a simple relationship between variability and scale emerges upon closer inspection: Aggregating the 10 m simulated pixels to different SRUs sizes led to a decrease of the error compared to the GNSS measurements due to averaging effects of the high spatial variability of the GNSS snow depth data. The RMSD values decreased between 8 to 45 % regarding the SRU size using 370 an altitudinal band of 50 m, however this effect diminished by further coarsening.
Nevertheless, the identification of guiding principles for researchers to combine data and models at different spatial and temporal scales and to extrapolate information between scales still remains a challenge. By going from fine to coarse scales, aggregation and generalization set in. The rate of information loss is influenced by small-scale spatial snow production and grooming patterns. Heterogeneous snow production for instance lead to more information loss than aggregations at coarser 375 scales. The spatial small-scale effects of moving snow by the groomers or skiers for instance, disappears slowly with decreasing resolution and those that are dispersed are lost rapidly. This leads to an under-or overestimation of the simulated snow height.
Therefore, a methodology needs to be developed to find out how much the loss of information takes place. Multiscale analysis was necessary to show that variability for different aggregation types of the snow height is inherently different, and that for each SRU this can be different as shown in Figure 4.

380
However, we demonstrated a consistent evaluation procedure for all the PROSNOW ski resorts that can be useful for snowpack modelers and ski resort managers. An understanding of the nature of scaling effects is needed, when spatial or temporal scale is an independent variable. In landscapes with homogeneous snow depths, where snow measurements can be summed directly, such scale problems may not occur. However, in snow distributed landscapes like in ski resorts, snow measurements obtained at fine scales often cannot be summed directly to produce regional estimates. Therefore, reasonable measures are not 385 always given by weighted averages because heterogeneity in snow production and distribution may influence scaling processes in nonlinear ways. In such cases increasing the level of spatial heterogeneity may also increase the difficulty of extrapolating information across scales (Perveen and James, 2010).

Remote sensing and GNSS snow depth
The use of remote sensing and GNSS data allows to define a evaluation procedure for the snowpack models and helps to 390 improve the resource management of the ski resorts. The GNSS data provides means to collect accurate snow depth points in the field for precise correction of the simulated snow depths. Further, using snow depth measurements over an entire season together with snowpack simulations is a powerful tool in the long term. It allows to estimate the minimum snow depth required at various slope sections. This ensures that slopes are optimally prepared and groomed right up to the end of the season. The spatial remote sensing images are needed to improve the simulated snow covered area especially at the beginning or end of 395 the season, or for lower situated ski resorts where natural snow precipitation is low. Further, it allows to correct the simulated ablation process at the end of the season and the snow in process at the beginning of the season.
The combination of both techniques allows to evaluate and initialize the simulations: imagery can be used for primary digitization of the snow cover where GNSS can be used for in-situ observation of the snow height for the simulations. A detailed analysis of the differences between the two methods will allow us to make better decisions about when and how much slopes can be extracted. This allows us to improve the models. However, this is not easy to extract and was not within the scope of this paper but should be considered for further studies.

Conclusions and Outlook
The initiative for this study emerged within the H2020 PROSNOW project to evaluate the snow simulations over the nine 405 PROSNOW pilot ski resorts by comparing model outputs with local and remotely-sensed measurements in terms of snow coverage, persistence and snow depth. The three snowpack models AMUNDSEN, Crocus and SNOWPACK/Alpine3D include all piste management modules and were evaluated using both, ground-based data (GNSS measured snow depth) and spaceborne Copernicus Sentinel-2 derived snow maps (i.e., machine learning was exploited to derive information with low uncertainty about presence/absence of snow). The evaluations were performed for five winter seasons (2015-2020) from 1st of October until 410 end of May and have been performed in a stratified manner in order to assess the performance of the snow simulations under different conditions. Particular attention has been devoted to characterize the spatial performance of the snowpack models with integrated snow management moduls. Our presented results show high accuracy of the simulations representing the 'reality' well.
An inter-comparison of the three snowpack models applied to the same resort would be a logical next step from the model 415 development perspective. Differences in the simulated results of the three models for a given ski resort would be mainly due to the different implementations of the snow management configurations into each model and due to the different snowpack energy and mass balance approaches. Such effects should have to be disentangled and discussed accordingly to have a reasonable comparison. However, this is not straightforward and is out of scope for this paper, but should be considered for the future.
Nevertheless, this work showed that all three snowpack models applied for piste management reproduced plausible and robust 420 results on the ski slope scale and the overall accuracy of the results is mainly dependent on the degree to which the real-world snow management practices are integrated. Competing interests. The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. PROSNOW is a project aiming at producing an operational climate service in order to 440 transfer it as a commercial service.
Acknowledgements. We thank our project partners and pilot ski resorts for many constructive discussions and providing data to improve the manuscript. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 730203.

A1 Further information on: Ski resort reference unit -SRU
In our approach, the spatial representations of ski areas and of the interpolated meteorological fields as well as the simulated snowpack information differs in their saptial representation: the geometry type used for the ski slope is a vector-based polygon, whereas the input and output of the snowpack models are based on a discrete approach, using regular grids of points. The challenge for PROSNOW was to define an intermediary spatial object. This should be consistent with representations balanced 450 between the heterogeneity of meteorological conditions within ski slopes, the accuracy of snowpack models and the computation resources and data volume required for a daily update. Also it should take into account the localization of the snowmaking facilities. The SRU, standing for Ski resort Reference Unit aims at fulfilling all these requirement. It can be end-user defined including specific needs of the ski resort but it also can be processed automatically by chaining several operations. We stored the vectorial GIS and attributes data of all nine pilot ski resorts in a PostgreSQL 10,7 DBMS and the crossing operations be-455 tween rasters and vectors were performed with Python 2,7 with the GDAL/OGR along with NumPy. For automatic processing of the SRUs we considered the following steps: 1. The association of each snowgun with a single ski slope is based on the spatial relation of the nearest neighbor to the ski slope.
each snowgun. Considering that the mean surface covered by a single snowgun is approximately 1/3 ha, we applied a PLPG/SQL function to calculate the intersection between a slope and incremental 5 m buffer around the snowgun point until it reached at least 3,000 m 2 . This buffer was then crossed with the combined ASTER-SRTM DEM v1.1 made available by the Copernicus European organization and we kept the buffer's altitude bounds which were then aggregated at the slope level to cover a continuous surface.

465
3. Once the snowtype attribute ("grooming only" or "with snowmaking") is defined for every slope, the according areas were then divided in smaller parts based on the elevation resolution. The initial DEM topography was re-classified with respect to the targeted SRU resolution. A value in the according numeric series was assigned to a pixel which value is in the range of the target value approximately half the resolution (i.e. for a 50 m resolution, the value 300 will be assigned at all DEM pixel which value is more or equal to 275 m and less than 325 m, and for a 300 m resolution between 150 m 470 and 450 m. Contiguous pixels with the same values were merged and polygonized.
4. As small SRUs might occur by applying the above mentioned steps, e.g. at the beginning or ending of single pistes, we merged them in a post processing step with other small adjacent piste fragments having the same snowtype attribute.

Snow
Snow-Free OA (%)  Table C1. Inter-comparing the model simulations with Sentinel-2 data and GNSS measured snow depth for each ski resort.

Resort
Sentinel-2 GNSS Arosa-Lenzerheide Slight overestimation of the snow line at the beginning of the season.
Ephemeral snowfall generates high disagreement. Increasing error throughout the season with a overestimation of the simulation.

Colfosco
Systematic error at the end of the seasons leading to an overestimation of the snow depth Good agreement throughout the winter with slightly overestimation of the simulation. Especially at southwest exposed slopes encountered an overestimation of snow Garmisch-Classic Systematic errors at the beginning and end of the seasons lead to an overestimation of the snow coverage in these periods. However, it has to be noted that the snow season is shorter than in the other ski resorts and the area is very heterogeneous.
Good agreement throughout the winter with underestimation a the beginning and overestimation at end of the season of the simulations. Especially at northeast exposed slopes encountered an overestimation of snow in general.

La Plagne
Systematic error at the end of the seasons leading to an overestimation of the results. Especially at north exposed slopes encountered an overestimation of the snow depth.
Good agreement throughout the winter with slightly overestimation of the simulation.

Les Saisies
Systematic error at the end of the seasons leading to an overestimation of the snow depth. Especially at south exposed slopes encountered an overestimation of the snow.
No data available.