Validation of Pan-Arctic Soil Temperatures in Modern Reanalysis and Data Assimilation Systems

. Reanalysis products provide spatially homogeneous coverage for a variety of climate variables in regions where observational data are limited. However, very little validation of reanalysis soil temperatures in the Arctic has been performed to date, because widespread in situ reference observations have historically been unavailable there. Here we validate pan-Arctic soil temperatures from eight reanalysis and LDAS products, using a newly-assembled database of in situ data from diverse measurement networks across Eurasia and North America. We find that most products have soil temperatures that are biased 5 cold by 2-7 K across the Arctic, and that biases and RMSE are generally largest in the cold season. Monthly mean values from most products correlate well with in situ data (R > 0.9) in the warm season, but show lower correlations (r = 0.6 - 0.8), in many cases, over the cold season. Similarly, the magnitude of monthly variability in soil temperatures is well captured in summer, but overestimated by 20 % to 50 % for several products in winter. The suggestion is that soil temperatures in reanalysis products are subject to much higher uncertainty when the soil is frozen and/or when the ground is snow-covered. We also validate 10 the ensemble mean of all products, and find that when all seasons, and metrics are considered, the ensemble mean generally outperforms any individual product in terms of its correlation and variability, while maintaining relatively low biases. As such, we recommend the ensemble mean soil temperature product for a wide range of applications - such as the validation of soil temperatures in climate models, and to inform models that require soil temperature inputs, such as hydrological models, or for permafrost simulations.


Introduction
Soil temperature is an important control of many physical, hydrological, and land surface processes, as soils act as a reservoir for energy and moisture underground. It provides an important initial condition for numerical weather prediction, as energy and water fluxes from the land are important for convective processes (Dirmeyer et al., 2006;Kim and Wang, 2007;Siqueira et al., 2009). As soils react relatively slowly to variations in weather, soil temperature is also an important predictor of seasonal 20 and mid-term weather forecasts (Xue et al., 2012). Soils over large portions of the Arctic are perennially frozen (permafrost soil). Roughly 800 gigatonnes of carbon (GtC) is estimated to be stored in permafrost soils across the Northern Hemisphere (Hugelius et al., 2014); about twice the amount of carbon currently residing in the atmosphere (Tarnocai et al., 2009). Continued by CFSR and GLDAS-Noah. CFSR uses the Noah-LSM in a fully coupled mode to obtain a first-guess land-atmosphere simulation, before operating in a semi-coupled mode with GLDAS to obtain information about the state of the land surface (Saha et al., 2010). GLDAS, however, is run in an offline mode, utilizing meteorological forcing from Princeton University between 1948 to 2000 (Sheffield et al., 2006), and a combination of model and observational data from 2000 -onwards (Rui et al., 2018). 75 ERA-Interim, ERA5 and ERA5-Land use versions of the Tiled ECMWF Scheme for Surface Exchanges over Land (TES-SEL) land model (Viterbo, 1995;Viterbo and Betts, 1999). In the case of ERA-Interim, TESSEL is informed by empirical corrections from 2m (surface) air temperature and humidity (Dee et al., 2011). Meanwhile, ERA5 and ERA5-Land use an updated version of TESSEL, known as the Hydrology-Tiled ECMWF Scheme for Surface Exchanges over Land (HTESSEL) . In ERA5, a weak coupling exists between the land surface and atmosphere, while an advanced LDAS 80 that incorporates information regarding the near-surface air temperature, relative humidity, as well as snow cover (de Rosnay et al., 2014), along with satellite estimates of soil moisture and soil temperature from the top 1m of soil (de Rosnay et al., 2013). ERA5-Land, unlike ERA5, does not directly assimilate observational data. Instead, the ERA5 meteorology (such as air temperature, humidity and atmospheric pressure) is used as forcing information for HTESSEL; allowing it to be run at higher resolutions (Muñoz-Sabater et al., 2021). It includes an improved parameterization of soil thermal conductivity allow-85 ing for it to account for ice content in frozen soil; improvements to soil water balance conservation; and the ability to capture rain-on-snow events (Muñoz-Sabater et al., 2021).
Both GLDAS-CLSM and MERRA2 utilize the the Catchment Land Surface Model Koster et al., 2000). Though MERRA2 does not include a land surface analysis (Gelaro et al., 2017), CLSM is informed using an updated version of the Climate Prediction Center unified gauge-based analysis of global daily precipitation (CPCU) precipitation cor-  . In the case of GLDAS-CLSM, CLSM is run in an offline mode, in a similar configuration to GLDAS-Noah. Finally, JRA-55 uses the Simple Biosphere Model (SiB) (Onogi et al., 2007;Sato et al., 1988;Sellers et al., 1986) in an offline mode, forced by atmospheric data and data from land surface analyses that incorporate microwave satellite retrievals of snow cover (Kobayashi et al., 2015). 95

Observational Data
Owing to the lack of dense soil temperature monitoring networks in the Arctic, most of the observed soil temperature record is characterized by sparse measurements with inconsistencies in the temporal periods that they cover (Yi et al., 2019). Rather than limit our validation to a small geographic region in the permafrost zone, as several prior studies have done (Hu et al., 2019;Qin et al., 2020;Wu et al., 2018;Ma et al., 2021;Li et al., 2020c), we choose to make use of data from a variety of 100 sparse networks. Such an approach has been used to validate soil temperature and permafrost performance in ERA5-Land (Cao et al., 2020), and allows for the examination of larger geographic regions, as well as for the inclusion of a more diverse set of vegetation types across the continent Ma et al. (2021).
To the authors' knowledge, this is one of the first studies to compile a comprehensive set of in situ soil temperature measurements across the Eurasian and North American Arctic, from multiple diverse sparse networks. We incorporate data 105 from the Roshydromet Network in Russia (Sherstiukov, 2012), Nordicana series D (Nordicana) (Allard et al., 2020;CEN, 2020a, b, c, d, e, f, g), Global Terrestrial Network for Permafrost (GTN-P) (GTN-P, 2018), and Kropp et al. (2020) -in an attempt to provide a representative estimate of soil temperature across the circumpolar Arctic. Our validation data also includes sites from outside regions typically underlain by permafrost, in order to facilitate a comparison of the performance of reanalysis soil temperatures at high latitudes with their performance in regions outside the permafrost zone. This provides a unique 110 baseline upon which to perform a hemispheric wide assessment of soil temperature in reanalysis and LDAS systems.
The authors do acknowledge that there is a large discrepancy in the sampling of grid cells across North America and Eurasia, however we have done so in order to make use of all available data. Unfortunately, data availability is limited across much of the Canadian North, as it has been historically under-sampled relative to Eurasia (Metcalfe et al., 2018).
In order to compare with data from reanalysis and LDAS products, temperatures were averaged across two depth bins: a near 115 surface layer (0 cm to 30 cm), and soil temperatures at depth (30 cm to 300 cm). For each site, temperatures from all depths residing within a layer were averaged, producing an estimated layer averaged temperature for every time-step. Data from 51 North American stations, and 247 Eurasian station were used to estimate estimate monthly average temperatures in the near surface layer, while 38 (256) stations from North America (Eurasia) contributed to estimates of monthly average temperatures at depth.

120
Many of the in situ (station) sites reported measurements at hourly or daily frequency, however we chose to perform the analysis at monthly time scales, as soil temperatures do not vary much over shorter timescales. As such, once a layer average had been calculated for a site, a monthly average temperature was then estimated. Daily average temperatures (T day ) were calculated for each station with a timestep smaller than 1 day, using all available data, when there was at least one temperature • x 0.31  detection was performed on in situ data before monthly averaging, by removing datapoints that fell outside 3.5σ from the mean soil temperature of the dataset.
Since the station data often included days with missing observations, the sensitivity of the monthly averages to missing data was tested, by computing monthly averages in five ways: using all months with at least one valid day in a month, using all months with at least 25, 50, and 75 percent valid data, and finally using all months with no missing data in a month. It was 130 found that T soil was not substantially impacted by the inclusion or exclusion of months containing missing data.
In order to be considered as a validation location, the grid cell was required to include soil temperature data for all eight reanalysis/LDAS products, and be collocated with at least one in situ station. Duplicate stations across datasets were excluded.
In situ locations were only included if there was at least 2 years worth of in situ data, in order to properly assess the station's seasonal cycle. For grid cells containing multiple in situ stations, the value used in the comparison is a simple spatial average  Figure 1 shows the spatial standard deviation of monthly surface soil temperatures for grid cells with more 140 than two stations included. The number of stations included in each grid cell ranges between two to 12. The median standard deviation is generally ≤ 2 • C, though the temporal variation (spread) is quite large for several stations. This suggests that while differences in T soil at in situ stations within a grid cell are generally within a few degrees of one another, there are times when the differences can be quite large.

Validation Metrics
Reanalysis/LDAS and observational (station) soil temperature data were collocated with one another spatially and temporally. Grid-cell level soil temperatures from each product were compared against in situ soil temperatures using the following statistical metrics: bias (Eq. 1), root-mean-squared-error (RMSE) (Eq. 2), normalized standard deviation (SDV norm ) (Eq. 3 and Eq. 4), and the Pearson correlation (R) (Eq. 5). Statistical metrics were calculated as follows:

RM SE
temperature values, SDV Tp and SDV Ti are the standard deviations of the reanalysis product soil temperatures and in situ soil temperatures, respectively, while x refers to the T soil (of a dataset).
Metrics were calculated separately for each individual grid cell, and then averaged to calculate a pan-Arctic estimate. Estimates for the permafrost zone and the zone with little to no permafrost were also calculated by averaging together metrics from grid cells falling within a particular zone.

Binning of Datasets by Season and Mean Annual Air Temperature
Datasets were binned into a cold season and warm season using the Berkeley Earth Surface Temperature (BEST) 2m air temperature (T air ) for each grid cell. Cold season months are those where T air ≤ -2 • C, while the warm season refers to months with T air > -2 • C, where T air is the monthly mean air temperature. Sensitivity testing on the cold/warm season revealed no substantive impact on our conclusions using a threshold of 0 • C, -5 • C, and -10 • C. We also tested the impact of 170 using a different temperature dataset to perform the binning; the ERA5 2m air temperature, which resulted in similar findings.
Soil temperatures were also also separated based on the 1981-2010 BEST mean annual air temperature, in order to estimate continuous (≥ 90% permafrost), discontinuous (50 -89.9% permafrost), and regions with little to no permafrost (< 50% permafrost). Smith and Riseborough (2002) found that the continuous permafrost zone is roughly consistent with a MAAT between -6 • C to -8 • C, while the discontinuous permafrost zone boundary varies somewhat depending on soil organic matter 175 content. For regions with primarily inorganic (mineral) soil, the southern boundary of the discontinuous permafrost zone was roughly consistent with a MAAT of -2 • C, whereas for areas underlain by soils with a heavy organic matter content, the boundary was closer to the +1 • C MAAT isotherm. As soil organic matter content was not available for the reanalysis products, the following MAAT boundaries were used in this study as a general approximation of permafrost type across the study domain:

Warm Season
Most products show small to moderate negative (cold) biases over the warm season ( Figure 2); a finding similar to previous studies investigating reanalysis soil temperatures over the mid-latitudes and the Qinghai-Tibetan Plateau (Hu et al., 2017;Qin 190 et al., 2020;Yang et al., 2020). Warm season biases ( Figure 2, Panels C and D) tend to be slightly larger at depth for most products (by 1 • C -2 • C). JRA-55, however, displays substantially more negative biases near the surface -suggesting that it underestimates summer soil temperatures near the surface.
Over the warm season, most reanalysis products display correlations greater than 0.95 near the surface, and close to 0.  of Cao et al. (2020). At depth, half of the products show smaller cold season biases (relative to the warm season), while the biases are of a similar magnitude for the remaining products. Most products show a maximum bias when T soil is between -2 • C to -10 • C, and there is a tendency for biases to decrease or flip sign over the coldest temperatures ( Figure 5).
Cold season correlations are generally lower; particularly near the surface, where correlations are often 25% to 35% smaller than they are over the warm season ( Figure 3, panel A). In addition, there is less agreement between products, with a spread of 210 0.26 between the product with the lowest correlation (JRA-55), and the product most correlated with in situ (Ensemble Mean).
At depth, seasonal differences in correlations are smaller, though cold season correlations for most products are 5% to 10% lower than their warm season counterparts (Figure 3, panel B).
Several products overestimate the in situ standard deviation in the cold season -particularly over Eurasia, though ERA5-Land underestimates the cold season standard deviation by approximately 25% to 50% (Figure 3, panel A and B) -arising in 215 part due to the wintertime positive biases over large portions of the northern hemisphere (see Figure S2 and Cao et al. (2020) for further details). The increased variance in T soil in the cold season suggests that reanalysis products have a greater degree of temperature variability in the winter. Along the topmost row (near-surface) and leftmost column (depth) of Figure 4 are scatterplots of soil temperatures from each of the eight reanalysis products, relative to in situ data. There is a larger range of temperatures for a given observed soil temperature in the cold season (blue scatter) than there is over the warm season (red 220 scatter), both at depth, and near the surface. Second, there is a greater spread in standard deviations between products at colder   temperatures. The spread is largest over the coldest temperature ranges (Figure 6), which suggests that colder climate regimes are likely an important controlling factor on reanalysis performance. Snow has a low thermal conductivity that insulates the soil, decoupling the air temperature above from the soil below the snow pack Zhang (2005). Cao et al. (2020) note that winter warm biases in T soil in ERA5-Land could be partially explained by HTESSEL having a snow density that was biased low 225 (relative to observations), leading to a snowpack thermal conductivity that was too low, and an associated overestimate of the insulating effects of snow. Snow was also cited as a major controlling factor in soil temperature biases in ECMWF's Integrated Forecast System, which also uses the HTESSEL land surface model .
Most products show a maximum standard deviation over the coldest T soil , whereas JRA-55 shows a maximum standard deviation when the observed T soil are near freezing and shows a decline in standard deviation over the coldest temperatures 230 (Figure 6), particularly near the surface (panel A).

Influence of Frozen Soils
Over our domain, there are grid cells over regions with substantial permafrost cover, as well as grid cells that fall outside the region typically covered by permafrost. As there are differences in the physical processes governing soil temperatures across regions covered by permafrost, it is important to mention differences in performance between these two regions. Bias and 235 RMSE are typically 2 • C to 4 • C larger over the permafrost zone (in both seasons) (Fig. S2). The mean bias and RMSE are typically 1 • C to 3 • C smaller over North America, relative to the permafrost zone in Eurasia (see Figure S3); however with fewer grid cells over North America, the uncertainty is also larger -as evidenced by the larger error bars.
Correlations between the permafrost region and the zone with little to no permafrost are generally quite similar, rarely differing by more than 0.1 (not shown). Individual models are more likely to overestimate the near surface variance over the 240 zone with little to no permafrost, while the opposite is true at depth (not shown). In the permafrost zone, however, the spread in standard deviation, at depth, is about 2.75 times larger than it is over the zone with little to no permafrost (not shown).
Over North America, several products overestimate the observed standard deviation in the warm season, while there is good agreement between the reanalysis products and the observed standard deviation in Eurasia. Conversely, the cold season variance is more likely to be overestimated over Eurasia (not shown). It is likely that sampling variability in T soil is a large contributing 245 factor to the differences in variance between North America and Europe.
Several factors may explain the increased variability in soil temperatures over permafrost regions. First -snow cover is present for longer, and may persist for longer on slopes with a northern aspect or where it has accumulated around vegetation. Munkhjargal et al. (2020) observed an average difference in soil temperatures of 3 • C to 4 • C across the Qinghai-Tibetan plateau in response to aspect. Second, near surface soil temperature is particularly sensitive to vegetation cover; shrub height 250 and vegetation type were found to account for roughly half the variability in late winter and spring soil temperatures in the low Arctic (Grünberg et al., 2020). Moreover, in the discontinuous permafrost zone, the presence of permafrost is often ecosystem protected or ecosystem driven. Variability in vegetation and drainage, for example, may lead to localized portions of the landscape where soil temperatures are colder than in surrounding areas (Jorgenson et al., 2010). Thirdly, latent heat interactions in the active layer during spring can lead to long periods of time where the soil remains at or close to freezing (the zero-curtain 255 Figure 6. Reanalysis soil temperature standard deviation as a function of station soil temperature for a) the near surface (0 cm to 30 cm) layer, and b) at depth (30 cm to 300 cm). Station temperatures are binned into 4 • C intervals, beginning with the -32 • C to -28 • C bin, and ending with the 28 • C to 32 • C bin. The midpoint of each temperature bin is plotted along the x-axis. period). In regions where the active layer is deep (such as over the discontinuous permafrost zone), the zero curtain period is often longer and more pronounced  than it is over the continuous permafrost zone or regions with seasonally frozen soil. Many of the processes that control the zero-curtain effect, such as freeze-thaw parameterizations are relatively simplistic in many land models (Cao et al., 2020;Chen et al., 2015), and their coarse resolution would fail to capture local scale variations in the zero-curtain period.

Validation
Having validated the performance of the individual products, here we construct a soil temperature product based on the ensemble mean of the products, and investigate its performance. Taking the ensemble mean of all eight soil temperature products produces a soil temperature dataset that shows closer agreement with observed soil temperatures than most/all individual 265 products. The ensemble mean bias and RMSE are generally quite close in magnitude to the best performing products over all regions, seasons, and depths ( Figure 2). Moreover, by sequentially increasing the number of products included in the ensemble mean, we are able to demonstrate that greater number of products in the ensemble mean leads to a reduction in overall bias and RMSE (Figure 7, panels A and B). Interestingly, the number of products included appears to be of greater importance to the performance of the ensemble mean than which products are included, suggesting that most products share similar uncertainties 270 and signal-to-noise ratios, and resulting, to a certain degree, in the offsetting of biases or errors.
In addition, the ensemble mean maintains a high correlation (often the highest of all products) over all seasons and depths ( Figure 3), and analogous improvements to correlation are seen as a greater number of models are included (Figure 7). Finally, the ensemble mean standard deviation is close to the observed value over all seasons and depths (Figure 3), suggesting that the seasonal variability in T soil is similar to the variability seen in the station data. Thus, the ensemble mean soil temperature 275 dataset provides the best estimate of in situ temperatures for the broadest range of conditions. Values are ordered based on cold season RMSE (from smallest to largest). Note that the y-axis scale is from -8 • C to +10 • C (rather than -10 • C to +10 • C).   Having established that an ensemble mean of products provides the best choice for T soil in most situations, in this Section we characterize the pan-Arctic mean, variability and trends of the ensemble mean T soil product. Unlike in previous sections,

Climatology
where the air temperature of a grid cell was used to separate the cold and warm season, this was not possible for a discussion  While, not directly comparable to the climatology figure, cold season biases in the Ensemble Mean product are typically -4 • C to -5 • C over the permafrost zone, and -2 • C to -3 • C over much of the zone with little to no permafrost (Figure 9, panel A). There is little difference in the cold season biases at depth, except for being slightly smaller in western Eurasia (Figure 9, appear to show a small to moderate warm bias on the order of +1 • C to +3 • (Figure 9, panel B). At depth, the biases are somewhat larger than they are at the surface, particularly over Eurasia, but the pattern remains similar to the near surface 300 (Figure 9, panel D).

Variance and Trends (1981-2010)
Generally speaking, coastal regions show the greatest variability in T soil , with inland regions showing substantially smaller variation. Temperature variability is larger in DJF than it is for JJA (Figure 8, Panels C and D). A similar pattern is generally seen at depth, but with reduced variability (pattern correlation greater than 0.9 in DJF). In JJA, the pattern in variability is

Discussion and Conclusions
This study conducted a validation of pan-Arctic soil temperatures for eight reanalysis products, and validated a new ensemble mean pan-Arctic soil temperature dataset. Most reanalysis products show a negative (cold) soil temperature bias of several degrees across the Arctic; one that is most prominent during the cold season, and over the highest latitudes. There is also 320 greater disagreement between the reanalysis products' estimates of soil temperature variability and correlation during the cold season. When all depths and seasons are considered, the ensemble mean product performs better than any individual product.
All estimates of T soil are subject to uncertainty, and while we have not been able to present a complete quantitative assessment of these sources of uncertainty, we have made some progress in their characterization. In particular, we identify four main categories of uncertainty, and provide qualitative estimates of their importance: 325 1. Uncertainties in data assimilation and configurations of atmospheric model Many reanalysis products exhibit a positive air temperature bias near the surface over the Arctic (Beesley et al., 2000;Lindsay et al., 2014;Tjernström and Graversen, 2009); particularly in winter, in association with temperature inversions that are weaker than observed (Serreze et al., 2012;Tjernström and Graversen, 2009). Differences between products in tainty in atmospheric reanalyses (Boisvert et al., 2018;Taylor et al., 2018), along with the near surface Arctic energy budget (Graham et al., 2019). Reanalysis estimates of precipitation amount and phase are another important source of uncertainty, and are especially hard to constrain over the Arctic, due to a lack of observations (Boisvert et al., 2018).

Uncertainties in the land model configuration and/or parameterizations:
The methods used here did not allow for explicit study of the impacts of land model configuration on soil temperature.

335
Nevertheless, it is important to acknowledge that most land models fail to simulate important aspects of permafrost soils, such as the influence of phase change on thermal conductivity (Cao et al., 2020), or include relatively crude parameterizations of freeze-thaw processes (Chen et al., 2015). Moreover, many current generation land models fail to capture feedbacks that exist between vegetation and permafrost (Chadburn et al., 2015) -a factor particularly important in the discontinuous permafrost zone, where permafrost presence (or absence) is heavily influenced by vegetation (Jorgenson 340 et al., 2010). Land surface models, such as the Simple Biosphere Model (used in JRA-55), that use the force restore method to estimate soil temperature, are prone to overestimating diurnal soil temperature range (Gao et al., 2004;Kahan et al., 2006), as well as the seasonal cycle of soil temperatures (Luo et al., 2003). This is because they underestimate heat capacity, and overestimate temporal variation in ground heat flux compared to more complex land models (Hong and Kim, 2010). The force restore method assumes a strong diurnal forcing from above. In regions with substantial snow 345 cover, such as the Arctic, this assumption is likely violated (Tilley and Lynch, 1998) because snow cover leads to a decoupling of the surface forcing from the soil below, and may, in part, explain why JRA-55 displays a reduced standard deviation in temperature over the cold season relative to most other products.

Instrumental uncertainties
The approach used here assumes that uncertainties in the in situ data are negligible, and hence are the 'true' soil temper-350 ature in a particular grid cell. In situ soil temperature measurements, however, are taken as a point measurement, and as such they may not be representative of the average soil temperature over a grid cell. As this is hard to explicitly account for in linear metrics such as bias, RMSE, and correlation, it can result in non-negligible errors (Crow et al., 2012;Loew and Schlenz, 2011); particularly when a limited number of stations are contributing to the spatially averaged station temperature (as is the case over large parts of Eurasia). While the use of data from dense networks may improve the 355 spatial representativeness of validation data (Zeng et al., 2015), their drawback is that they are often limited in terms of the land cover types they represent (Zeng et al., 2015;Ma et al., 2021). ). This is particularly true in the Arctic where data from dense networks is extremely limited.
An estimate of the uncertainty of the validation data was calculated using grid cells where multiple stations were available. The uncertainty is represented by the standard deviation of soil temperatures between stations within a grid cell 360 for a particular timestep. While the median standard deviation of soil temperature between stations is generally ≤ 2 • C for eight of the nine grid cells, there are periods of time when the standard deviation is substantially larger than this ( Figure 1, Panel B); consistent with the findings of in situ studies examining soil temperature variability at the site level (Scharringa, 1976;Gubler et al., 2011;Gruber et al., 2018) and at regional scales (Kropp et al., 2021;Grünberg et al., 2020;Way and Lapalme, 2021;Zhang et al., 2021). Kropp et al. (2021) note that T soil can vary widely over small spatial 365 scales in response to vegetation canopy structure; on scales of tens of metres in many cases Gruber et al. (2018). This is particularly true in the cold season, as the presence or absence of vegetation has a strong control on snow accumulation.
By trapping snow, trees and shrubs lead to substantially warmer soil temperatures in their vicinity (Way and Lapalme, 2021). Moreover, the impact of snow cover on soil temperature is generally more pronounced over permafrost regions (regions of seasonal frost) .

370
Uncertainties can also arise from processes that occur within the soil itself; particularly in seasonally frozen and permafrost soils. For example, rapid temperature fluctuations are often seen during the melt period, as melt-water can induce strong, localized latent heat fluxes in the active layer -factors not easily captured in monthly mean temperatures over large areas (Hinkel and Outcalt, 1995). Frost heave can lead to seasonal shifts in the position of the soil temperature probes, resulting in seasonal discontinuities in soil temperature datasets between summer and winter (Streletskiy et al.,375 2017; Urban and Clow, 2017).

Sampling uncertainties
Uncertainties exist due to inconsistencies in data availability, both temporally and spatially. Some sites (particularly those in Siberia, for example), included 30 years or more of data, while others were limited to a data record of a few years; resulting in inconsistencies in the time period over which metrics were calculated for each grid cell. to alter the fundamental conclusions of the study.
It is possible that validation sites in the discontinuous permafrost zone were preferentially located in regions where permafrost is present. In parts of Alaska, and northern Quebec, for example, there are localized areas where warm biases exist (Figure 9). This may be expected in regions where permafrost presence is ecosystem protected, as reanalysis products are unable to account for local scale feedbacks of vegetation on permafrost presence. Given, however, that the discontinuous zone is a relatively small part of the validation dataset, and that warm biases in the discontinuous zone are generally limited to a few localized areas, the authors do not believe that selection bias has fundamentally influenced the overall conclusions.
Time-series did not have a consistent start or end date -meaning that the metrics are calculated using different climatologies across different grid cells. Thus, it is not possible to calculate a true uncertainty estimate. We've qualitatively acknowledged 400 the main sources of uncertainty, and here we show an approximate estimate of the magnitude of uncertainty associated with each component, using the standard deviation in soil temperature across time and as a function of station temperature. The median standard deviation of in situ soil temperatures, for grid cells with more than 2 stations, is generally between 0.5 • C and 2 • C (Figure 1, Panel B) -an order of magnitude smaller than the standard deviation of reanalysis soil temperatures for a given station soil temperature; particularly over the cold season ( Figure 6). Reanalysis air temperatures maintain a relatively 405 consistent standard deviation between 1.25 • C to 1.75 • C for most products, except over the coldest in situ temperatures (not shown). Unlike with soil temperature, the standard deviation of reanalysis 2m air temperature only increases modestly over the cold season; along with the spread in standard deviation between products (not shown). This would suggest that the largest degree of uncertainty in reanalysis soil temperatures over the Arctic is most likely contributed by differences in the land models between products, rather than from uncertainties in observed soil temperatures, or from differences in product air temperatures.

410
Future work should aim to investigate how differences in snow cover and snow density between the reanalysis products may influence biases in the individual products. On a related note, future studies should emphasize how differences in the land models of reanalysis products may account for the spread in soil temperatures. The ensemble mean data product has potential applications in several fields, including the validation of model soil temperatures, species distribution models and to inform models that require soil temperature inputs, such as hydrological models, or for permafrost simulations. A robust 415 ensemble mean can be computed with four products (Figure 7), which means a higher resolution ensemble mean data product could be created using a subset of higher resolution reanalysis products. For example, ERA-Interim, MERRA2, and JRA-55 have resolutions lower than 0.5 degrees. A 0.31 degree ensemble mean product, based on ERA5, ERA5-Land, CFSR, and GLDAS-Noah is being explored, using a similar methodology.