Evaluation of E3SM Land Model snow simulations over the Western United States

. Seasonal snow has crucial impacts on climate, ecosystems and humans, but it is vulnerable to global warming. The 10 land component (ELM) of the Energy Exascale Earth System Model (E3SM), mechanistically simulates snow processes from accumulation, canopy interception, compaction, snow aging to melt. Although high-quality field measurements, remote sensing snow products and data assimilation products with high spatio-temporal resolution are available, there has been no systematic evaluation of the snow properties and phenology in ELM. This study comprehensively evaluates ELM snow simulations over the western United States at 0.125° resolution during 2001-2019 using the Snow Telemetry (SNOTEL) in 15 situ networks, MODIS remote sensing products (i.e., MCD43 surface albedo product, the spatially and temporally complete (STC) Snow-Covered Area and Grain Size (MODSCAG) and MODIS Dust and Radiative Forcing in Snow (MODDRFS) products (STC-MODSCAG/STC-MODDRFS), and the Snow Property Inversion from Remote Sensing (SPIReS) product) and two data assimilation products of snow water equivalent and snow depth (i.e., University of Arizona (UA) and SNOw Data Assimilation System (SNODAS)). Overall the ELM simulations are consistent with the benchmarking datasets and 20 reproduce the spatio-temporal patterns, interannual variability and elevation gradients for different snow properties including snow cover fraction ( f sno ), surface albedo ( 𝛼 sur ) over snow cover regions, snow water equivalent (SWE) and snow

studies only evaluated a few snow variables in LSMs mostly at coarse spatial resolutions (Table A2), although more highresolution remote sensing observations and data assimilation products of snow variables (e.g., snow albedo ( sno), snow grain size (Ssno) and snow albedo reduction induced by LAPs in snow (Rsno)) have become available. The snow phenology in LSMs has rarely been evaluated explicitly and how LSMs capture the interannual variability of snow variables and how those 70 variables vary along an elevation gradient have not been well investigated.
A series of high-quality field snow measurements, remote sensing and data assimilation snow datasets/products with high spatio-temporal resolution are available over the WUS. The in situ Snow Telemetry (SNOTEL) stations widely distributed across the WUS provide long-term SWE field measurements (Serreze et al., 1999). Optical remote sensing data has been 75 widely used to map snow dynamics (Dietz et al., 2012;Dong, 2018). The Moderate Resolution Imaging Spectroradiometer (MODIS) reflectance data at 463 m spatial resolution have been used to retrieve multiple key snow-related variables including sur (Schaaf et al., 2002), fsno (Bair et al., 2021c;Painter et al., 2009), sno, Ssno, and Rsno (Bair et al., 2021c;Painter et al., 2012). These MODIS data accurately capture snow dynamics during accumulation and melt Wang et al., 2018) and the high daily temporal resolution of these datasets is helpful for capturing rapid snow variations. 80 Some available remote sensing snow phenology products (Chen et al., 2015;Metsämäki et al., 2018;Takala et al., 2009) adopt different optical or microwave satellite observations to extract snow phenology date and duration. Besides, they use different snow phenology definitions, and include different snow phenology metrics, which can affect their use as a reference. Alternatively, the same phenology extraction methods can be used to derive snow phenology metrics for both LSMs and MODIS daily fsno data, avoiding inconsistencies of definitions and extraction methods. Data assimilated SWE and 85 snow depth (Dsno) products are also available that integrate field measurements, remote sensing observations, and model simulations (Center, 2004;. These data assimilation products have high spatial resolution of <5 km and higher reliability over mountainous and forested regions due to the constraints of in situ networks .
These datasets provide good opportunity for comprehensively evaluating the accuracy of snow variables and snow phenology in LSMs. 90 The aim of this study is to systematically evaluate the high-resolution 0.125° ELM simulations of key snow variables and snow phenology over the WUS, using in situ, remote sensing and data assimilation snow products. Specifically, offline ELM simulations with new improvements related to snow processes over the WUS were conducted during 2001-2019. Field snow measurements, three MODIS remote sensing products, and two data assimilation snow products were collected as 95 benchmarking datasets for the ELM simulations (see Section 2.3 for details). All the ELM outputs and benchmarking datasets were regridded to an identical spatio-temporal resolution of 0.125° and daily. Snow properties variables including sur, fsno, sno, Ssno, Rsno, SWE and Dsno were used in the analysis. Multiple snow phenology metrics were derived from both ELM and remote sensing products using the same definitions and extraction methods (see Section 2.4 for details). The spatial patterns, temporal correlations, interannual variabilities, elevation gradients, and change with forest cover of snow properties and snow 100  (Golaz et al., 2019). 105 ELM uses a multi-layer scheme (up to 5 layers by default) to dynamically simulate various snow processes, e.g., snow accumulation, melting, aging (i.e., the evolution of snow grain size), compaction, metamorphism, aerosol deposition and redistribution, and canopy snow interception and unloading. Specifically, ELM uses the Snow, Ice, and Aerosol Radiative (SNICAR) model to calculate snow albedo and vertically-resolved absorption of solar radiation, considering the evolving snow grain size, solar zenith angles (SZAs), sky conditions, underlying background and snow impurities (e.g., black carbon (BC) 110 and dust) (Flanner et al., 2007). ELM uses the snow water equivalent (SWE) and standard deviation of elevation to estimate snow cover fraction (fsno). The hysteresis of snow accumulation and ablation is also accounted for in ELM (Swenson and Lawrence, 2012).
Compared to CLM4.5, some key updates related to snow processes have been included in ELM. First, the original SNICAR 115 model has been replaced by a hybrid model (SNICAR-AD) of SNICAR and delta-Eddington adding-doubling radiative transfer solver, which corrects the snow albedo bias for large SZAs and can better represent the shortwave radiative properties of snow (Dang et al., 2019). Second, compared to only external mixing in CLM4.5, both external mixing and internal mixing of hydrophilic BC-snow and dust-snow are now represented in ELM (Hao et al., 2022a;Wang et al., 2020). Third, the direct and diffuse irradiance under different atmospheric profiles and their dependence on SZA are included (Hao et al., 2022a). 120 Fourth, the effects of non-spherical snow grain shape on snow albedo are considered (Hao et al., 2022a). Fifth, a new parameterization of sub-grid topographic effects on solar radiation has been implemented in ELM to account for the impacts of macro-scale shadow, occlusion and multi-scattering between adjacent terrain on surface albedo (Hao et al., 2021;Hao et al., 2022b).

Model setup and experiment design 125
Selected for this study, the WUS has heterogeneous topography with diverse elevations ranging from 0 to above 3 km ( Figure   1a). The WUS includes three major mountain ranges: the Cascades Range, Sierra Nevada, and Rocky Mountains, which are characterized by frequent snow cover. The elevation data was acquired from the Shuttle Radar Topography Mission (SRTM) DEM dataset (Rabus et al., 2003). The forest cover data in 2010 shown in Figure 1b was acquired from the 30 m Landsat Vegetation Continuous Fields (VCF) tree cover datasets derived from the GFCC Surface Reflectance product (Sexton et al., 130 2013). Both the DEM and forest cover data were aggregated to 0.125° using the area-weighted average method. For analysis, elevations were divided into different intervals (see Figure 1c). Elevations less than 0.5 km are not included in the statistical analysis as snow cover is close to 0. The forest cover was divided into five levels (see Figure 1d). The area fractions of different intervals of elevation and forest cover are shown in Figure 1c and 1d, respectively.

135
ELM simulations at 0.125° spatial resolution were conducted over the WUS from 1979 to 2019 driven by hourly meteorological forcing data from the National Land Data Assimilation System phase 2 (NLDAS-2) with spatial resolution of 0.125° (Xia et al., 2012). Specifically, the prescribed satellite phenology (SP) mode was used with input of MODIS leaf area index data (Myneni et al., 2002). The climatological monthly aerosol deposition data (e.g., black carbon and dust) with a spatial resolution of 1.9°×2.5° from the Community Atmosphere Model version 5 coupled with chemistry (Lamarque et al., 2010) 140 was used, which was temporally and spatially downscaled to half-hourly and 0.125° using bilinear interpolation. For the snow albedo module, SNICAR-AD was configured with: 1) the SZA-dependence solar irradiance under the mid-latitude winter atmosphere, 2) spherical snow grain shape, 3) internal mixing of hydrophilic BC-snow, (4) external mixing of dust-snow, and (5) neglect of organic carbon due to its high uncertainties. The sub-grid topographic effects on solar radiation were included in the ELM configuration. The model was run at a half-hourly step. The first 31-year run from 1979 to 2000 was used to spin-145 up the model to reach equilibrium and then the remaining 19-year run (i.e. 2001-2019) was used in the analysis. The variables of interest were output at half-hourly, daily and monthly scales.

Benchmarking datasets
In situ Bias Correction and Quality Control (BCQC) SNOTEL daily SWE data from 2001-2019 (Table 1) were used as the 155 benchmarking dataset to evaluate the performance of ELM. SNOTEL stations, operated by the U.S. Department of Agriculture Natural Resources Conservation Service (NRCS) provide long-term, widely-distributed and high-quality field measurements of SWE across the WUS (https://www.nrcs.usda.gov/). BCQC SNOTEL eliminated data outliers and erroneous values, fixed the inconsistencies of different variables, and corrected the bias of the raw data Yan et al., 2018). Specifically, 788 SNOTEL sites in the WUS were included in the study (Figure 1a). 160 Three daily 463 m MODIS-based remote sensing products from 2001-2019 were used to evaluate the performance of ELM (Table 1). The first one is the MCD43A3 surface albedo version 6 product (named as MCD43 hereafter). The MCD43 product provides black-sky and white-sky surface albedo at local solar noon (Schaaf et al., 2002), which could well capture the snow effects on sur . This dataset represents the albedo of the entire MODIS pixel which could include vegetation 165 or soil if the observed pixel is not 100% snow cover, and thus it will underestimate snow albedo for fractionally covered pixels as vegetation and soil have darker broadband albedos. The second one is the spatially and temporally complete (STC) MODIS Snow-Covered Area and Grain size (MODSCAG) and MODIS Dust and Radiative Forcing in Snow (MODDRFS) product (we hereafter refer to as STC-MODSCAG/STC-MODDFRS). The third one is the Snow Property Inversion From Remote Sensing (SPIReS) product. These two products provide fsno, sno, Ssno, and Rsno at around 10:30 am local solar time and represent 170 sno (i.e. excluding soil and vegetation portions of the observed pixel). STC-MODSCAG first estimates fsno and Ssno based on the spectral unmixing and physically-based snow radiative transfer models (Painter et al., 2009)

. STC-MODDRFS then uses
Ssno to calculate the sno of the clean snow with difference between clean and dirty (observed) snow for computing Rsno (Painter et al., 2012). SPIReS adopts a physically-based approach without empirical assumptions to simultaneously estimate fsno, sno, Ssno, and Rsno (Bair et al., 2021c). Both STC-MODSCAG/STC-MODDRFS and SPIReS are interpolated and smoothed to 175 reduce the effects of data noise, cloud contamination and sun-sensor geometry (Bair et al., 2021c;Dozier et al., 2008;Rittger et al., 2020). Both of the fsno products show good performance with the basin-wide root mean square error (RMSE) values of 6.5% and 6.7% against airborne lidar datasets (Stillinger et al., 2022). Initial validation against field measurements for Ssno at a single site for the original MODSCAG shows a 51 µm mean absolute error for a clear sky day (Painter et al., 2009). The gap filled MODSCAG/MODDRFS at three sites in the WUS has an accuracy (RMSE) of 118 µm for Ssno and 0.0036 for Rsno (Bair 180 et al., 2019) considering both clear and cloud days. SPIReS has a sno RMSE of 4.6% against the 3-year field measurements at Mammoth Mountain, CA (Bair et al., 2021c), nearly identical to the reported accuracy of 4.8% RMSE for STC-MODDRFS agaist the field measurements at the same site (Bair et al., 2019). Note that there is an underestimation of fsno in the northern WUS region in winter occur because of a known issue in current versions of STC-MODSCAG (https://nsidc.org/reports/snowtoday?title=6). Specifically, MOD09GA surface reflectance processed to produce STC-MODSCAG at the Jet Propulsion 185 Laboratory (JPL) is not processed when SZA is larger than 67.5°. This issue is being resolved during the transfer of processing during 2022 to 2023 from JPL to the National Snow and Ice Data Center Distributed Active Archive. We conservatively excluded data north of 42° in latitude during the winter in our comparisons in Section 3.1.
Two data assimilation SWE and Dsno products from 2001-2019 were used to compare with ELM (Table 1). The first one is the 190 University of Arizona (UA) daily snow product version 1 with the spatial resolution of 4 km over the Conterminous US (Zeng et al., 2018). This product was generated by fully utilizing the field measurements from multiple in situ networks including SNOTEL constrained by the gridded precipitation and temperature data in the 4 km Parameter-elevation Regressions on Independent Slopes Model (PRISM). A series of algorithm robustness tests and independent accuracy evaluations against remote sensing and airborne lidar measurements showed that the UA product is reliable as a reference snowpack dataset (Zeng 195 et al., 2018). The second one is the SNOw Data Assimilation System (SNODAS) daily product with 1 km spatial resolution developed by the NOAA National Weather Service's National Operational Hydrologic Remote Sensing Center (Center, 2004).
SNODAS uses a physically consistent modeling and data assimilation framework to integrate physically-based model estimates and multi-source snow data from satellite remote sensing, airborne-based observations, and in situ measurements including SNOTEL. SNODAS has shown a similar performance as UA . The SNODAS product is available 200 from October, 2003 and thus only the data from 2004-2019 were used in the study. UA and SNODAS both assimilate the SNOTEL observations in their models directly, so better performance relative to those observations is expected, while the ELM simulations are not constrained by the SNOTEL data.

Snow phenology extraction and data processing
Time series of fsno from ELM and two remote sensing snow products (i.e., STC-MODSCAG and SPIReS) were used to 210 extract the snow phenology ( Figure 2). First, based on the observed seasonal cycle of snow cover over the WUS (Brutel-Vuilmet et al., 2013;Peng et al., 2013;, the snow accumulation and snowmelt seasons are defined as the periods from September to January and from February to August, respectively. Next, four snow timing dates and one duration metric were retrieved from ELM and remote sensing products that include: (1) snow accumulation onset date Accumulation_onset_date for year t is defined as the first continuous five days with fsno>0.05 during the snow accumulation season from September (year t-1) to January (year t), and End_date is defined as the last continuous five days with fsno>0.05 during the snowmelt season of the year t, to avoid the interference of ephemeral snow. Note that using different thresholds 220 (e.g., 0.00, 0.03, 0.05, 0.10, 0.15) of fsno to defining Accumulation_onset_date and End_date can lead to different date estimates but the same conclusions, which are not shown in the paper. Duration was calculated as the number of days between Accumulation_onset date and End_date. Depletion_onset_date and Midpoint_date were determined by fitting the fsno time series during the snowmelt season using the sigmoid function (Anttila et al., 2018;Böttcher et al., 2014;Kouki et al., 2019): 225 where DOY is day of year, and a, b, c and d are the fitted parameters. Specifically, the nonlinear Least Squares method was used to fit a sigmoid function. Following Anttila et al.(2018), Depletion_onset_date is defined as the date when the fitted sigmoid curve reaches 99% of its variation range, and Depletion_midpoint_date is defined as the date at the midpoint of the curve change ( Figure 2). To reduce the impacts of noise, the retrievals at the individual pixels for a specific year was deemed 230 as unsuccessful when (1) the fsno difference at the start and end date of snowmelt season is smaller than 0.05; and (2) for the sigmoid fitting, the coefficient of determination (R 2 ) between observed and fitted fsno is smaller than 0.95 and RMSE is larger than 0.2. Only the pixels with successful retrievals of snow timing metrics for at least 10 years were used in the subsequent analysis.
235 Figure 2: Time series of snow cover fraction (fsno) and sigmoid curve fitting at a typical pixel, represented by black and red lines, respectively. The blue lines indicate four phenology dates and one duration, and the shaded area shows the snow accumulation season.

240
MODIS data and ELM outputs were adjusted for temporal consistency and to unify the variable definitions. MCD43 only provides black-sky and white-sky albedo, and thus the ELM-derived ratio of diffuse to total solar radiation was used as a weighting factor to calculate sur for the blue sky. For ELM, the average values of sur from 11:30 am to 12:30 pm local solar time were calculated to match the time of MODIS MCD43 product, and those of fsno, sno, Ssno, and Rsno from 10:00 am to 11:00 am local solar time were calculated for ELM to match the time of STC-MODSCAG/STC-MODDRFS and SPIReS. 245 The snow timing metrics and snow variables in the remote sensing and data assimilation products (Table 1) were aggregated to 0.125° using the area-weighted average method. They were temporally upscaled to seasonal, annual and multi-year average scales. For a specific year, only the pixels with fsno>0 were used to calculate the regional average values for sur, fsno, sno, Ssno, and Rsno, SWE and Dsno using the area-weighted average method. 250

Evaluation methods
Using the field measurements, remote sensing products, and data assimilation products as the reference, the spatio-temporal distributions of ELM snow outputs were evaluated. For spatial correlation, multiple statistical metrics were calculated for the multi-year average seasonal ELM outputs: correlation coefficient (R), Bias, relative Bias (rBias, calculated as the ratio of Bias 255 to the average value), root mean square deviations (RMSD), and relative RMSD (rRMSD, calculated as the ratio of RMSD to the average value). This study mainly focused on winter (DJF) and spring (MAM) in the analysis, and there is little or no snow cover for the WUS in Summer (JJA) and Autumn (SON) in the ELM simulations ( Figure S1). For the temporal correlation, R between ELM and the reference datasets was calculated only for the grids where there are at least 10 snow covered days for 260 one year excluding highly ephemeral snow.
The long-term trends of snow variables over the whole WUS were detected using the non-parametric Mann-Kendall (MK) test. However, the MK test showed that there is no significant increasing or decreasing trend (p-value > 0.05) for all the snow variables, and thus the corresponding results are not included in the paper. The interannual variabilities (IAVs), defined as the 265 standard deviation of the annual values, were calculated to evaluate whether ELM can capture the interannual variations of snow processes. In addition, the distributions of snow variables along the elevation gradients and forest cover for winter and spring were also analyzed.

Snow cover fraction
The ELM simulated fsno has heterogeneous spatial patterns in the WUS for both winter and spring (Figure 3a-b). The regional average fsno is 0.41 and 0.15, respectively for winter and spring. Overall, ELM also shows similar spatial patterns with both STC-MODSCAG and SPIReS for all the seasons ( Figure S1). STC-MODSCAG underestimates fsno over the northern regions in winter due to the known issues ( Figure S1, see Section 2.3 for details). When excluding December and January with larger 275 SZAs, STC-MODSCAG shows similar spatial distribution as SPIReS for February ( Figure S2). In spring, compared to STC-MODSCAG, ELM underestimates fsno over the western mountains in spring (Figure 3d). Compared to SPIReS, ELM has an overestimation over most regions in winter but performs well in spring (Figure 3g-h). Overall ELM has a high spatial correlation to both STC-MODSCAG and SPIReS, with a higher relative accuracy in winter than spring (Table 2). For temporal correlation, ELM has a low correlation in the mountainous areas with both STC-MODSCAG and SPIReS in winter ( Figure  280 3e,i), but has a relatively high correlation with those data in spring (Figure 3f,j). The winter-spring contrast in skill is possibly due to the smaller change of fsno in winter than spring.  Figure S3). Considering the uncertainties of the remote sensing retrievals, the ELM regional average fsno is within the range of STC-MODSCAG and SPIReS ( Figure   5a-b and S4). 300 Figure 3: Spatial distributions of (a,b) snow cover fraction (fsno) in ELM and (c,d,g,h) the fsno difference between ELM and two remote sensing products (i.e., STC-MODSCAG and SPIReS) and (e,f,i,j) their temporal correlations (Rs) for different seasons: (a,c,e,g,i) winter and (b,d,f,h,j) spring. In all panels, regions with no snow cover are masked with white color. The area-weighted average values are labelled in each panel.    Table 2. Evaluation of snow properties in ELM against two remote sensing products (STC-MODSCAG/STC-MODDRFS and SPIReS) and two data assimilation products (UA and SNODAS) for winter and spring. Here, the snow properties include snow cover fraction (fsno), surface albedo ( sur), snow albedo ( sno), snow grain size (Ssno), snow albedo reduction (Rsno), snow water equivalent (SWE), and snow depth (Dsno). The statistical metrics were calculated using the data over the WUS, except that those against STC-MODSCAG/STC-MODDRFS in winter were calculated using the data over the WUS regions below 42° in latitude.

Surface albedo and snow albedo
Overall the ELM simulated sur over snow cover regions shows similar spatio-temporal distribution with MCD43 for both winter and spring ( Figure 6-7). Compared to MCD43, ELM overestimates sur over Sierra Nevada and Rocky Mountains in winter, possibly due to the bias in snow cover (Figure 3c-d). The mean biases of ELM are -0.01 and 0.00, respectively for 330 winter and spring. The spatial R values between ELM and MCD43 are 0.77 and 0.71, respectively for winter and spring (Table   2). ELM shows a low temporal correlation to MCD43 over most regions in winter, but has a relatively higher temporal correlation in spring especially over the mountain areas and northern regions (Figure 6e    For sno, ELM overall shows good consistencies with STC-MODDRFS and SPIReS over mountainous regions, but has an 355 underestimation over other regions (Figure 8). Against STC-MODDRFS, the mean biases of ELM are -0.08 for winter over the WUS regions below 42° in latitude and -0.11 for spring over the WUS. Against SPIReS, the mean biases of ELM are -0.13 and -0.08, respectively for winter and spring. The spatial R values between ELM and two remote sensing products are lower than 0.30 (Table 2). ELM shows a low temporal correlation to two remote sensing products over most regions, and has a relatively higher temporal correlation over the Rocky Mountains (Figure 8e-f). Larger inconsistencies between ELM and two 360 remote sensing products are founded in terms of interannual variations, elevation gradients and change with forest cover (Figure 9 and S5).   There are large differences in the magnitudes and spatio-temporal patterns of Ssno between ELM, STC-MODSCAG/SPIReS ( Figure 10 and 11). ELM has larger Ssno in spring than in winter (Figure 10a-b), with large negative biases over the western mountains and positive biases over the central and eastern regions compared to STC-MODSCAG with the mean biases of -71.6 μm for spring (Figure 10c-d). ELM has positive biases over most regions compared to SPIReS, with the mean bias of 380 93.9 μm and 31.6 μm, for winter and spring, respectively (Figure 10g-h). Ssno in ELM has a poor spatial correlation to the two MODIS products for both winter and spring (Table 2). ELM has varying temporal correlations with STC-MODSCAG and  There are also large spatial biases and low temporal correlations of Rsno between ELM, STC-MODDRFS and SPIReS ( Figure   11 and S4). In ELM, Rsno shows extremely high values in the northeastern corner for winter (Figure 11a), due to the large 395 aerosol deposition in the aerosol deposition data (see Section 2.2). Apart from the northeastern corner, ELM is more similar to SPIReS in winter (Figure 11c-g). For spring, ELM is more similar to STC-MODSCAG, and has large negative biases relative to SPIReS (Figure 11d-h). ELM has higher temporal correlations with both remote sensing products in winter than spring, and shows higher correlations with SPIReS than STC-MODDRFS in spring (Figure 11 e-f and i-j). For interannual variability, ELM is more identical to STC-MODSCAG in spring ( Figure S8a-b and S9a-b) than SPIRES. However, note that 400 ELM simulations in the study used climatological monthly aerosol deposition data, so they are not comparable to the remote sensing data in any specific year. In spring, Rsno in all the three datasets shows an increasing trend with elevation ( Figure S8d and S9d). All the three data show larger differences across different forest cover ( Figure S8e-f and S9e-f). Overall, Rsno is within the uncertainty ranges of STC-MODSCAG and SPIReS (Figure 5e-f and S4). 405 Figure 11: Spatial distributions of (a,b) snow albedo reduction (Rsno) in ELM and (c,d,g,h) the Rsno difference between ELM and two remote sensing products (i.e., STC-MODDRFS and SPIReS) and (e,f,i,j) their temporal correlations (Rs) for different seasons: (a,c,e,g,i) winter and (b,d,f,h,j) spring. In all panels, regions with no snow cover are masked with white color. The area-weighted average values are labelled in each panel. 410

Snow water equivalent and snow depth
ELM shows higher SWE values over the mountainous areas (Figure 12a-b), but also has larger underestimations over the mountainous areas, compared to both UA and SNODAS in both winter and spring (Figure 12c-d,g-h). Against UA and SNODAS, ELM has a mean bias of -20.7 mm (35.9%) and -20.4 mm (-35.5%), respectively in spring, while those in winter are -13.8 mm (-27.8%) and -10.2 mm (-22.2%), respectively. Overall ELM has a high spatial similarity with both UA and 415 Formatted: Not Superscript/ Subscript SNODAS, and ELM has higher spatial consistency with UA than SNODAS in spring (Table 2). For temporal correlation (Figure 12e-f and i-j), ELM has high mean R values of 0.64 and 0.65 for winter and spring, compared to UA, and the R values are 0.53 and 0.54, respectively compared to SNODAS. ELM captures the interannual variabilities and elevation gradients of SWE well, but some underestimations of the regional average values are observed (Figure 13a-d). In winter, ELM has similar IAV values to UA and SNODAS, but has a lower value of 11.7 mm compared to UA (16.7 mm) and SNODAS (18.1 mm) in 420 spring. Overall, ELM shows larger differences from UA and SNODAS, when there is a higher forest cover, especially for spring (Figure 13e-f). Dsno shows very similar results to SWE (Figure S10-S11).

Snow phenology
ELM well reproduces the snow phenology, compared to two remote sensing products (Figure 15-16). As expected, over mountainous areas, ELM shows earlier snow onset, later depletion and thus longer snow duration compared to flat and generally lower elevation areas (first column of Figure 15). Compared to STC-MODSCAG and SPIReS (second and third columns of Figure 15), ELM shows later Accumulation_onset_date over the whole WUS with a mean bias of +17.3 and +12.4 450 days, respectively, which may be caused by the bias in the meteorological forcing data of NLDAS-2 and the simple parameterizations of the partitioning of precipitation into rainfall or snowfall; and has later Depletion_onset_date, but earlier Midpoint_date and End_date. For instance, ELM melts off earlier with a mean bias of -35.5 and -26.8 days, respectively than STC-MODSCAG and SPIReS, suggesting that ELM has higher snowmelt rate. Thus ELM has a short snow duration with a mean bias of -52.9 and -39.5 days, respectively compared to the two remote sensing products. The large biases exist in the 455 western mountains for End_date (Figure 15k-l and n-o). Overall snow phenology in ELM has a high spatial correlation with that of the remote sensing products (Table 3). Although ELM overestimates Accumulation_onset_date and Depletion_onset_date and underestimates Midpoint_date, End_date and Duration, ELM well captures the IAVs of all five snow phenology metrics (first column of Figure 16), As the elevation increases, Accumulation_onset_date decreases but the other four metrics increases for all the three datasets (second column of Figure 16). ELM also has similar magnitudes and 460 distributions for each elevation interval compared to the remote sensing products, while the three data show larger and larger differences with the increase of forest cover (third column of Figure 16).

475
IAV values of different data are shown in (a,d,g,j,m). In panels (b-c,e-f,h-i,k-l,n-o), the white dots represent the average values. The evaluation results suggest an overall good performance of ELM in simulating snow properties, while some biases and uncertainties still exist, especially over mountainous areas with dense forest cover. Compared to the remote sensing products, ELM well reproduces the spatio-temporal pattern, interannual variabilities and elevation gradients of fsno and sur (Figure 3-6 The NLDAS-2 data used in the ELM simulations has large negative precipitation biases and high air temperature uncertainties over high-elevation terrain compared to both field measurements and PRISM over the WUS (Henn et al., 2018;O'Neill et al., 2021;Pan et al., 2003;Schreiner-McGraw and Ajami, 2022), which can partly explain the negative SWE bias in ELM. Besides, 490 a 0.125° grid may have high sub-grid variabilities of snow especially in mountainous areas (Meromy et al., 2013) and SNOTEL stations in mountains located on flat surface may not capture the sub-grid spatial variabilities (Toure et al., 2016). Overall ELM can well track the snow phenology, but shows a late start of snow accumulation in winter which is consistent with the underestimation of SWE and may be related to the precipitation and air temperature bias in the meteorological forcing data of NLDAS-2 and the partitioning of precipitation into rainfall or snowfall in ELM. An earlier snowmelt is also found in ELM, 495 and there are similar issues in other LSMs, e.g., CLM4 (Toure et al., 2018) and Noah with Multi-Parameterization (Noah-MP) Formatted Table   Deleted: 10 (Xiao et al., 2021). Note that in this study, we defined snow season and phenology based on fsno rather than SWE, and thus how ELM captures the date of peak snowpack, snowmelt timing, and snowmelt rate needs further investigation based on SWE.

500
There are still some uncertainties in the benchmarking datasets used in this study. First, the MCD43 product performs well in representing sur during snow cover periods, but may have poor performance for ephemeral snow due to its assumptions of stable land surface status within 16 days (Wang et al., 2012;Wang et al., 2014). Besides, frequent cloud cover and a lack of explicit representations of topographic effects can affect the accuracy of the MCD43 product over mountainous areas Hao et al., 2018a;Hao et al., 2018b). There are some inconsistencies between STC-MODSCAG and SPIReS ( Figure  505 3-4), due to the different algorithms and data processing (e.g., interpolation and filtering). Although the physically-based STC-MODSCAG and SPIReS provide higher quality unbiased fsno estimates than the MOD10A1 snow product based on empirical algorithms against field measurements across different forest cover, snow cover, snow climate and viewing angles (Bair et al., 2021c;Rittger et al., 2013;Stillinger et al., 2022), the issues of reflectance errors, one to many problems intrinsic to spectral unmixing, cloud contamination, topographic shadows, sun-sensor geometric effects, and the impacts of forest cover can still 510 affect their reliabilities (Bair et al., 2021b;Raleigh et al., 2013;Stillinger et al., 2022). These issues can also affect the accuracy of extracted snow phenology (Section 2.4). Uncertainties of Ssno and Rsno in STC-MODSCAG/STC-MODDRFS and SPIReS exist (Bair et al., 2019). In summary, the heterogeneity of snow within pixels, relatively low spectral resolution, and interference from clouds limits the diagnostic capabilities of snow properties from MODIS. Ongoing and upcoming hyperspectral remote sensing missions (e.g., the recently launched Environmental Mapping and Analysis Program 515 (https://www.enmap.org/) and NASA's Surface Biology and Geology (Cawse-Nicholson et al., 2021) will enhance the abilities of remote sensing to monitor snow properties. There are also some discrepancies between UA and SNODAS (Figure 11-12).
The uncertainties in the PRISM data over complex terrain (Henn et al., 2018) may degrade the performance of UA. Compared to ground survey data, SWE in SNODAS over alpine areas has degraded performance due to the neglect of wind redistribution of snow (Clow et al., 2012). Compared to GPS interferometric reflectometry snow depth data, SNODAS still needs to be 520 improved over complex terrain and areas with high vegetation heterogeneities (Boniface et al., 2015). The independent comparisons also have shown the underestimations and overestimations of SNODAS Dozier, 2011;Dozier et al., 2016). Developing reliable benchmarking datasets for advancing snow modeling is still challenging but necessary (Ménard et al., 2019).

525
There is significant room for improving simulations of snow processes in ELM, ranging from the input forcing data to parameter settings and model structure. Meteorological forcing data has been demonstrated to have large impacts on snow simulations (Günther et al., 2019). The NLDAS-2 forcing data was used to drive ELM in the study, which is rather coarse to represent the sub-grid heterogeneity of precipitation over mountainous areas (Tesfa et al., 2020). Although NLDAS-2 has many improvements compared to NLDAS-1 (Xia et al., 2012), there are still some spatio-temporal discontinuities in the 530 precipitation of NLDAS-2 (Ferguson and Mocko, 2017;Xia et al., 2019). Besides, there are still some documented systematic precipitation and air temperature biases in NLDAS-2 especially over mountainous areas (Henn et al., 2018;O'Neill et al., 2021;Pan et al., 2003). The 1.9°×2.5° climatological aerosol deposition data used in the ELM simulations is too coarse to 535 capture the fine-scale spatial variations of BC and dust, which limits the accuracy of simulated Rsno and thus sno. The model structures used in different LSMs have different complexities, assumptions and simplifications Magnusson et al., 2015). In ELM, some snow processes are modelled empirically, and some parameters were set empirically or from the literature, which may contain large uncertainties. For instance, in the ELM snow albedo model, spherical snow grain shape, internal mixing of BC-snow and external mixing of dust-snow are default settings, which may be oversimplified (Hao et al., 540 2022) and can potentially affect the accuracy of Rsno and sno. The large uncertainty of Ssno is relevant to the unrealistic snow aging representations in ELM (Qian et al., 2014), which can further affect sno. The bias of sno can further affect the accuracy of absorbed energy by snow and sur (contains the contributions from snow and non-snow vegetation/soil), and thus the change of SWE and Dsno. The uncertainties of SWE can further propagate to fsno, because ELM uses SWE to estimate fsno (Swenson and Lawrence, 2012). In the snow cover parameterization of ELM, snow accumulation ratio and snowmelt shape factor are 545 empirically set as fixed values without spatio-temporal variations (Swenson and Lawrence, 2012), which can also affect the accuracy of fsno. The snow cover over complex terrain was simply parameterized as a function of the standard deviation of elevation, which may explain the large biases of fsno (Figure 3) over mountainous areas (Swenson and Lawrence, 2012). All of these uncertainties contribute to the bias of snow phenology in ELM (Section 3.2). Besides, some important processes are missing in ELM, such as the snow redistribution and sublimation by blowing snow , and the interaction 550 between vegetation and snow, which possibly lead to the degraded performance of ELM (Section 3). Developing accurate forcing data, improving/choosing suitable snow models/parameterizations, and calibrating/optimizing model parameters are all important for accurate simulations of snow processes in LSMs.
Further studies are needed to conduct systematic diagnosis and attributions of ELM simulation biases and evaluate the ability 555 of ELM in capturing the long-term trends and climate effects of snow. Attributing the snow simulation biases to the specific parameterizations or processes is still challenging but necessary to identify and locate the major sources of errors. Because the snow processes are coupled and impacted by each other, further sensitivity analysis and numerical experiments varying factors one at a time are needed. An international coordinated project of the intercomparison of snow schemes in Earth system models, ESM-SNOWMIP, provides a good opportunity for ELM to identify crucial processes leading to large biases in simulated snow 560 and compare with other LSMs from local to global scales (Krinner et al., 2018;Menard et al., 2021). In this study, we found no significant increasing or decreasing trend of snow from 2001 to 2019 over the WUS for both ELM and other benchmarking datasets. However, 19 years are not long enough to characterize long-term trend of snow and analysis was not performed on discrete river basin or elevation subsets that may be experiencing change nor during the JJA time period. To reduce the impacts of the uncertainties from atmospheric forcing, this study focused on evaluating the offline ELM simulations forced by NLDAS-565 2, since errors in both simulated temperature and precipitation have been recognized as the main drivers of snowpack errors in E3SM (Brunke et al., 2021). However, snow-related land-atmosphere interactions are neglected in the land-only simulations. Deleted: In the SNICAR-AD snow albedo models of ELM, spherical snow grain shape, internal mixing of BC-snow and external mixing of dust-snow were set by default, which may be oversimplified (Hao et al., 2022). Further efforts are needed to calibrate these parameters using remote sensing or data assimilation products. The model structures used in different LSMs have different complexities, assumptions and simplifications Magnusson et al., 2015). Some snow processes are modelled empirically, e.g., Additional studies are required to evaluate E3SM's ability to capture the impacts of snow on regional climate by performing coupled E3SM simulations with active land and atmosphere model. 605

Conclusions
Snow over the WUS plays an important role in regional climate and hydrological and ecological systems as well as human society. This study systematically evaluated the snow properties (including sur, sno, fsno, Ssno, Rsno, SWE and Dsno) and snow phenology (including four snow dates and one snow duration) simulated by ELM using SNOTEL field measurements, MODIS remote sensing products and two data assimilation products. Overall, the ELM snow simulations agree well with the 610 benchmarking datasets in terms of spatio-temporal distributions, interannual variabilities and elevation gradients for different snow properties. However, ELM has large biases of fsno for dense forest cover and sur in the Rocky Mountains and Sierra Nevada, while underestimating SWE and Dsno, especially over mountainous areas with dense forest cover for both winter and spring. The ELM simulations shows large inconsistencies with the remote sensing retrievals of sno, Ssno and Rsno. Compared to SNOTEL, ELM has larger negative biases of SWE, probably because there are some systematic biases of precipitation and 615 air temperature in NLDAS-2. Besides, there is a large spatial-scale mismatch between point-scale field measurements and grid-level simulations, which can contribute to the large biases of ELM. There are also some inconsistencies of snow phenology between ELM and remote sensing products with ELM showing later snow onset, earlier depletion and shorter snow duration, consistent with the underestimation of SWE. This study documents the ELM performance in simulating snow processes and demonstrates the necessity for further improving the snow properties and snow phenology represented in LSMs. Further efforts 620 are needed to improve the accuracy of snow properties, especially Ssno and Rsno in both ELM simulations and remote sensing retrievals and resolve the early melt-off of snow in spring and underestimations of SWE in ELM especially over the complex terrain of the WUS. are available at https://doi.org/10.5281/zenodo.7194702. The SPIReS code is publicly available at https://github.com/edwardbair/SPIRES. UA and SNODAS data can be accessed at https://nsidc.org/data/nsidc-0719 and https://nsidc.org/data/g02158, respectively. BCQC SNOTEL data is available at https://www.pnnl.gov/data-products. Codes to process data, generate all results and produce all figures are archived at https://github.com/daleihao/snow_evaluation_ELM.

650
Author contributions DH conceived the study, designed the experiments, collected the evaluation data, conducted the simulations, analyzed the results, and drafted the original manuscript. GB conceived the study, designed the experiments, discussed the results, and edited the manuscript. EB, KR, and TS provided the remote sensing data and edited the manuscript. YG and LRL discussed the results and edited the manuscript. All authors contributed to reviewing and improving the manuscript. 655

Competing interests
A co-author is a member of the editorial board of The Cryosphere. The peer-review process was guided by an independent editor, and the authors have no other competing interests to declare. Figure S1: Spatial distributions of snow cover fraction (fsno) in ELM and two remote sensing products (i.e., STC-MODSCAG and SPIReS) for different seasons: (a) winter, (b) spring, (c) summer and (d) autumn. In all panels, regions with no snow cover are masked with white color.      Figure S8, except for the statistics over the WUS regions below 42° in latitude.