Evaluating permafrost physics in the CMIP6 models and their sensitivity to climate change

Permafrost is an ubiquitous phenomenon in the Arctic. Its future evolution is likely to control changes in northern high latitude hydrology and biogeochemistry. Here we evaluate the permafrost dynamics in the global models participating in the Coupled Model Intercomparison Project (present generation CMIP6; previous generation CMIP5) along with the the sensitivity of permafrost to climate change. Whilst the northern high latitude air temperatures are relatively well simulated by the climate models, they do introduce a bias into any subsequent model estimate of permafrost. Therefore evaluation metrics 5 are defined in relation to the air temperature. This paper shows that the climate, snow and permafrost physics of the CMIP6 multi-model ensemble is very similar to that of the CMIP5 multi-model ensemble. The main differences are that a small number of models have demonstrably better snow insulation in CMIP6 than in CMIP5 and a small number have a deeper soil profile. These changes lead to a small overall improvement in the representation of the permafrost extent. There is little improvement in the simulation of maximum summer thaw depth between CMIP5 and CMIP6. We suggest that more models should include 10 a better resolved and deeper soil profile as a first step towards addressing this. We use the annual mean thawed volume of the top 2m of the soil defined from the model soil profiles for the permafrost region to quantify changes in permafrost dynamics. The CMIP6 models project that the annual mean frozen volume in the top 2 m of the soil could decrease by 10-40 % / C of global mean surface air temperature increase.

communities (Larsen et al., 2014). The latest generation of the Coupled Model Intercomparison Project (CMIP6; Eyring et al. 25 (2016)) provides an opportunity to increase our understanding of these potential impacts under future climate change.
CMIP6 provides a coordinated set of Earth System Model simulations designed, in part, to understand how the earth system responds to forcing and to make projections for the future. Here we derive and apply a set of metrics to benchmark the ability of the coupled CMIP6 models to represent permafrost physical processes. Biases in the simulated permafrost arise from 30 (1) biases in the simulated surface climate and (2) biases in the underlying land surface model. Where possible, this paper isolates the land surface component from the surface climate, and focuses on the land surface component. Both Koven et al. (2013) and Slater et al. (2017) evaluated the previous generation of global climate models (CMIP5) and found that the spread of simulated present-day permafrost area within that ensemble is large and mainly caused by structural weaknesses in snow physics and soil hydrology within some of the models. Here we assess any improvements in the CMIP6 multi-model ensemble 35 over the CMIP5 multi-model ensemble. Koven et al. (2013) and Slater and Lawrence (2013) also found a wide variety of permafrost states projected by the CMIP5 multi-model ensemble in 2100. We evaluate whether the sensitivity of permafrost to climate change is different in this current generation of CMIP models.
Permafrost dynamics can be described by the mean annual ground temperature at the top of the permafrost (MAGT) and 40 the maximum thickness of the near surface seasonally thawed layer (the active layer or ALT). To first order and at large scale the presence of permafrost is controlled by the mean annual air temperature (MAAT). In general, if the MAAT is less than 0 • C, there is a chance of finding permafrost. This is modulated by the seasonal cycle of air temperature (Karjalainen et al., 2019), snow cover, topography, hydrology, soil properties and vegetation (Chadburn et al., 2017). In winter the snow cover insulates the soil from cold air temperatures, causing the soil to be warmer than the air (winter offset; Smith and Riseborough (2002)). 45 In summer any vegetation present should insulate the soil from warm air temperatures and cause the air to be warmer than the soil (summer offset; Smith and Riseborough (2002)). The thermal offset between the soil surface and the top of the permafrost is mainly due to the seasonal changes of the thermal conductivity between the soil surface and the top of the permafrost -the top of the permafrost tends to be slightly colder than the soil surface temperature (Smith and Riseborough, 2002). Figure 1 shows a schematic of this climate-permafrost relationship which was parameterised by (Smith and Riseborough, 2002;Obu 50 et al., 2019). In fact Obu et al. (2019) developed a large-scale and high resolution observations-based estimate of mean annual ground temperature and probability of permafrost using this framework. Figure 1. Schematic of the mean, minimum and maximum annual temperature profile from the surface boundary layer to below the bottom of the permafrost. MAAT is the mean annual air temperature, MAGST is the mean annual ground surface temperature at 0.2 m; MAGT is the temperature at the top of the permafrost; ALT is the seasonal thaw depth in any given year. The surface offset is the difference between the MAAT and the MAGST and the thermal offset is the difference between the MAGST and the MAGT. Dzaa is the depth of zero annual amplitude of ground temperature.
The summer thaw depth depends strongly on the incoming solar radiation as well as soil moisture, soil organic content and topography and responds to short term climate variations (Karjalainen et al., 2019). In particular, soils with a higher ice content thaw more slowly than those with lower ice content, resulting in a shallower maximum thaw depth or ALT. Gradual thaw will 55 occur as the global temperature increases, leading to an increase in both the ALT and the time over which the near surface soil is thawed. These two factors can be represented jointly by the annual mean thawed fraction of the soil (Harp et al., 2016), which can also be used as a proxy for the soil carbon exposure to decomposition. Abrupt thaw processes caused by the melting of excess ground ice will also occur with the landscape destabilising and collapsing (Turetsky et al., 2020). These thermokarst processes are not currently represented in Earth System Models and are not assessed here. 60 This paper evaluates the ability of the CMIP6 models to represent present-day permafrost dynamics in terms of the presence or absence of permafrost, the mean annual ground temperature (MAGT), the maximum active layer thickness (ALT) and the annual mean thawed fraction. This is accompanied by an analysis of the improvement in the models' structure when compared with the CMIP5 multi-model ensemble. Finally the simulated sensitivity to climate change of northern high latitude soils is 65 quantified in order to explore their potential fate in the future and any consequent climate impacts.

CMIP model data
Historical and future monthly mean data were retrieved for a subset of coupled climate models from the CMIP6 ; Table 1) and the CMIP5 (Taylor et al. (2012);  Taylor et al. (2012)) which combine scenarios of land use and emissions to give a range of future outcomes through to 2100. When available, RCP8.5 (high pathway), RCP4.5 (intermediate pathway) and RCP2.6 (peak and decline pathway) are used here. The CMIP6 projections (O'Neill et al., 2016) are based on scenarios that combine Shared Socioeconomic Pathways (SSPs) with updated RCP emission pathways. The most widely used scenarios are SSP5-8.5 (a fossil 75 fuel intensive development socio-economic pathway, updating RCP8.5), SSP3-7.0 (a "regional rivalry" SSP with unmitigated fossil fuel emissions at the medium to high end of the range), SSP2-4.5 (a "middle of the road" SSP with an emission scenario updating RCP4.5), and SSP1-2.6 (a sustainable pathway with low end emissions, updating RCP2.6).
Monthly diagnostics processed are surface air temperature (tas; equivalent to 2-m temperature), snow depth (snd), and ver-80 tically resolved soil temperatures (tsl) for latitudes greater than 20 • N for the first ensemble member of each model where available (i.e., simulation r1i1p1 or similar). Each model is left at its native grid. In addition, grid cells with exposed ice or glaciers at the start of the historical simulation are masked out and the land fractions in the models are accounted for in any area-based assessment of permafrost. Air temperature observations over land at 2 m were taken from the WATCH Forcing Data methodology applied to ERA-Interim (WFDEI) data set (Weedon et al., 2014). These were generated by applying monthly bias corrections from Climate Research Unit (CRU) (Mitchell and Jones, 2005) to the Era-Interim reanalysis data (ECMWF, 2009 (Brown and Brasnett, 2010). The analysis is performed using real-time, in-situ daily snow depth observa-95 tions, and optimal interpolation with a first-guess field generated from a simple snow accumulation and melt model driven with temperatures and precipitation from the Canadian forecast model. The analysed snow depths are available at approximately 24 km resolution for the period between 1998 and 2016 and were converted from daily to monthly means. It should be noted that snow depths exhibit high spatial variability and difficult to measure because of land surface heterogeneity. In addition there is little evaluation data available for the Arctic.

Permafrost extent
The International Permafrost Association (IPA) map of permafrost presence (Brown et al., 1998) gives a historical permafrost distribution compiled for the period between 1960 and 1990. It separates continuous (90-100%), discontinuous (50-90%), sporadic (10-50%), and isolated (<10%) permafrost. This distribution was generated from the original 1:10,000,000 paper map This CCI-PF analysis of permafrost extent is within the range but slightly lower than the estimate of Brown et al. (1998). We use a version of the CCI-PF which has been re-gridded to 0.5 • resolution.   Table 2. Permafrost areas from three of the available observational data sets defined both as an areal extent and as a fraction of the area where the observed MAAT is below the given threshold. area where the MAAT is less than 0 • C. Table 2 shows that this is the case for the Brown et al. (1998)

Site specific observations
The Circumpolar Active Layer Monitoring Network [CALM: Brown (1998)] is a network of over 100 sites at which ongoing measurements of the end of season thaw depth or the ALT are taken. Measurements are available from the early 1990s, when the network was formed, to present.

140
MAAT, MAGT, snow depth and MAGST (mean annual ground surface temperature at 0.2m) were available for a range of sites in Russia and Canada (Zhang et al., 2018). The data at Russian meteorological stations are from All-Russian Research

Evaluation metrics
These metrics are derived from both the models and the observations in a consistent manner.

Effective snow depth
Snow has a big impact on the soil temperatures and presence/absence of permafrost in the northern high latitudes (Wang et al., 2016;Zhang, 2005). Here we use the effective snow depth, S depth,eff (Slater et al., 2017) which describes the insulation of 155 snow over the cold period. S depth,eff is an integral value such that the mean snow depth in each month, m (S m in meters) is weighted by its duration: It is assumed that the snow can be present anytime from October (m = 1) to March (m = 6) with the maximum duration, M equal to 6 months. This weights early snowfall more than late snowfall as it will have a greater overall insulating value. The 160 insulation capacity of the snow changes little with snow depth when S depth,eff increases above ∼0.25 m (Slater et al., 2017), and seasons with an earlier snowfall will generally have a greater S depth,eff than seasons with a later snowfall.

Winter, summer and thermal offsets
The winter offset is defined as the difference between the mean soil temperature at 0.2 m and the mean air temperature for the period from December to February. This is expected to be positive with the soil temperature warmer than the air temperature.

165
The summer offset is defined in a similar manner for the period between June and August. This is expected to be slightly negative with the soil temperature cooler than the air temperature. The surface offset is the sum of the summer and winter offset, but is dominated by the winter offset. The thermal offset is the temperature difference between the annual mean soil temperature at 0.2m (mean annual ground surface temperature: MAGST) and the annual mean soil temperature at the top of the permafrost (mean annual ground temperature: MAGT). This is expected to be slightly negative with MAGT slightly colder than 170 MAGST.

Diagnosing permafrost in the model
In this paper the preferred method of defining permafrost is to diagnose the temperature at the depth of zero annual amplitude (D zaa ). D zaa is defined as the minimum soil depth where the variation in monthly mean temperatures within a year is less 175 than 0.1 • C. If the temperature at the D zaa is less than 0 • C for a period of 2 years or more there is assumed to be permafrost 8 in that grid cell. However, only 6 of the CMIP6 models have a soil profile deep enough to identify the D zaa (Table 1). In the remainder of the models, the maximum soil depth is less than the D zaa , so an alternative method of identifying the presence of permafrost is required. For these models permafrost is assumed to be present in grid cells where the 2-year mean soil temperature of the deepest model level is less than 0 • C. This definition was used by Slater and Lawrence (2013), who suggested that 180 if the mean soil temperature of the deepest model level is below 0 • C and assuming constant soil heat capacity, there is likely to be permafrost deeper in the soil profile. However, this definition does not explicitly recognise permafrost in the soil profile -in order to do that, the maximum soil temperature of the deepest model level should be less than 0 • C. The main issue with this latter method is that, if the soil profile does not extend deep enough, the deepest model level may fall in the ALT and the permafrost extent will be under-estimated.

185
Subgrid scale variability is not taken into account in this assessment -the models are assumed either to have permafrost or no permafrost in each grid cell. However, the observations are either very high resolution (CCI-PF is 1 km resolution in its original format) or supply a probability of permafrost for each grid cell (Brown et al., 1998;Chadburn et al., 2017). Therefore in order to compare the observed extent with those from the models, we assume that any grid cell where the observations have 190 50% permafrost should be identified by the models as having permafrost and any grid cells with < 50% will be identified as not having permafrost. The observed values of this threshold (PF 50 % ) are shown in Table 2 and are approximately equal to the permafrost extent (PF ex , also shown in Table 2), which is defined as the observed area of permafrost which takes into account the proportion of permafrost in each of the grid cells.

Thaw depth and associated metrics
The thawed depth from the surface is defined for each month using the depth-resolved monthly mean soil temperatures. The soil temperatures were interpolated between the centre of each model level and the thaw depth defined at the minimum depth where it reaches 0 • C. Some of the models have a very poorly resolved soil temperature profile which will introduce some biases into this estimate (Chadburn et al., 2015). In addition, taliks (unfrozen patches within the frozen part of the soil) will not 200 be identified using this method. The annual maximum active layer thickness (ALT in m) is defined as the maximum monthly thaw depth for that year.
Under increasing temperature both the ALT and the time the soil is thawed will increase via an earlier thaw and later freeze up.
Therefore, Harp et al. (2016) defined the annual mean thawed fraction ( D) for permafrost soils which can be expressed in units 205 of m 3 /m 3 : where t is the time in months; z is the depth in m; T is the soil temperature at time t and depth z; and z max is the maximum depth of the soil under consideration. Here we assume z max is 2 m which is relatively shallow but enables the models with shallower soil depths to be included consistently within the analysis. The annual mean frozen fraction ( F ) is the frozen component of the 210 soil and given by: One advantage of using D over ALT is that it enables taliks to be identified, although this will become more relevant when considering soils deeper than 2 m. In addition, D is a first order proxy for the soil carbon exposure to decomposition in any particular grid cell.

215
The annual thawed volume ( D tot in m 3 ) is the sum of the area-weighted values of z max D for each grid cell in the present day permafrost region. Any non-permafrost grid cells are masked. Similarly the annual frozen volume ( F tot in m 3 ) is the sum of the area-weighted z max F for each grid cell again defined for the present day permafrost region. For any future projections, if there is no longer freezing in a specific grid cell (which had permafrost in the present day), D is set to 1 and F is set to 0.

220
We can derive an observational-based estimate of present-day D tot using the available site specific data between D and MAGT described in Section 2.2.4 and shown in Figure 2. The CCI-PF dataset was then used in conjunction with this relationship to estimate D over the permafrost region. D is related non-linearly to the MAGT -the warmer the ground temperature the bigger the annual mean thawed fraction. Therefore a second order polynomial was fitted to the site-specific relationship be-225 tween D and the MAGT-green line in Figure 2. The dashed black lines show the relationship for the 95% confidence intervals.
These three curves were then used in conjunction with the CCI-PF data set to derive the mean and range of D for each grid cell with permafrost present. Assuming z max is 2 m and summing over the CCI-PF permafrost area gives a present-day D tot of 5.6 × 10 3 km 3 (range 2.1 -9.5 ×10 3 km 3 ). F tot is then 22.2 × 10 3 km 3 (range 18.4 -25.8 ×10 3 km 3 ).
230 Figure 2. The relationship between the MAGT and D for observed sites and years where both monthly thawdepths and MAGT are available (Zhang et al., 2018). The green solid line is the best fit and the black dashed lines are the 95% confidence intervals.

Results
In a global climate model the permafrost dynamics are affected by both the driving climate and by the parameterisations used to translate the meteorology into the presence or absence of permafrost namely the land surface module. Here we separate out these two factors and, where possible, identify the relative uncertainties introduced.   Figure 3 shows the differences between the observations and the CMIP6 models for relevant climate-related characteristics of the permafrost affected region (PF aff ) defined by the CCI-PF data ( Table 2). The horizontal grey lines on Figure 3 represent ±15% of the observed value. Absolute values for individual models and the observations are given in Table S1.1. These can also be compared with the CMIP5 multi-model ensemble ( Figure S2.1 and Table S2.2).

240
The MAAT is, to first order, the driver of the presence or absence of permafrost (Chadburn et al., 2017). The ability of the climate models to correctly simulate the northern high latitudes MAAT is assessed in the top two panels of Figure 3 (3a and 3b).
The CMIP6 models tend to be biased warm compared with the observations and the area where the land surface is less than 0 • C is biased low. However, in general the models fall within ±2 • C of the observed MAAT (-6.8 • C) and within ±4.0 × 10 6 km 2 of 245 the observed area where MAAT is less than 0 • C (24.4 ×10 6 km 2 ). These inter-model differences will be reflected in differences in any estimates of permafrost.  The CMIP6 multi-model ensemble can be compared with the CMIP5 multi-model ensemble ( Table 3)   The precipitation will also affect the presence or absence of permafrost -in particular any snowpack will insulate the soil.
The land surface scheme translates snowfall to snow lying on the surface and quantifies its insulating capacity. Therefore biases in both the snow amount and snow physics will influence the snow insulation. The snow amount can be represented by the S depth,eff . Figure 5 shows the multi-model ensemble median S depth,eff compared with the observations from the CMC snow depth analysis (Brown and Brasnett, 2010) for the time period 1998-2014 and Figure S1.2 shows S depth,eff for the individual 280 models. All grid cells with S depth,eff less than 2 cm are masked. S depth,eff for the Arctic is generally greater than 0.2 m in the multi-model ensemble mean. The observations have some regions at the northern tundra where the effective snow depth is slightly shallower than 0.2 m which are not reflected on the multi-model ensemble mean nor in the individual models. This results in a tendency for the models to slightly over-estimate the snow depth in the tundra. The snow region extends further south in the multi-model ensemble mean than in the observations which reflects some of the variability between models. The 285 large spatial variability in snow over the Arctic will not be well represented by either the models or the observations. Individual CMIP6 models (Figure S1.2 and Figure 3) that notably over-estimate the S depth,eff include BCC-CSM2-MR, EC-Earth3, FGOALS-f3-L, GISS-E2-1-G and IPSL-CM6A-LR. Only 16 % of the models have a mean S depth,eff within ±15% of the observations (Table 3). It should be noted that S depth,eff will be moderated by snow physics and in the case of snow the climate biases cannot be cleanly separated from the land surface biases.

Land surface module
The land surface modules translate the driving climate into the permafrost dynamics. In effect they quantify the offsets shown in Figure 1. Figure Figure S1.3 shows the variation is relatively large between the individual models.

300
For example, MPI-ESM1-2-HR has very little difference between the MAAT and the MAGT with all offsets of the order 1 • C or less, similarly ACCESS-ESM1-5 has a relatively small winter offset. In contrast a few models have very high winter offsets which reach over 10 • C at cold temperatures (UKESM1-0-LL, CAMS-CSM1-0, FGOALS-f3-L and TaiESM1). However, comparing the CMIP5 models with the CMIP6 models (Figures S1.3 and S2.5) suggests that there is a general improvement since CMIP5 when compared with the observations. 305 Figure 6. Multi-model ensemble spread of the median winter, median summer and median thermal offsets for the CMIP6 multi-model ensemble. Individual models for the CMIP6 multi-model ensemble are shown as lines and identified in Figure S1.3 and for the CMIP5 multimodel ensemble in Figure S2.5. The observed surface and thermal offsets summarised from the available point data (Zhang et al., 2018) are added in black. Although the observed surface offset is not directly comparable with the separate winter and summer offsets, these are shown for the models to illustrate their relative magnitudes. Figure 6 suggests that in order for a land surface module to accurately represent permafrost it needs to be able to represent the insulating ability of the lying snow. This is assessed in Figure 7 which shows the insulating capacity of the snow in terms of the difference between the winter air temperature and winter 0.2 m soil temperature. The offsets are a function of both MAAT (not 310 shown) and S depth,eff . A low MAAT and a high S depth,eff gives a bigger offset (see also Wang et al. (2016)). In Figure 7 only grid cells where the winter mean air temperatures are between -25 and -15 • C are shown, similarly for the observed sites. This ensures the comparisons are not biased by differences in air temperatures. The available models reflect the general increase in offset with increasing S depth,eff for the shallow snow and the saturation of this relationship for the deepest S depth,eff to varying degrees of accuracy. A few of the models (FGOALS-f3-L, TaiESM1 and UKESM1-0-LL) have relationships between offset 315 and S depth,eff which are indistinguishable from the observed relationship. The rest of the models have an offset at any given S depth,eff which is generally too small, suggesting that these models do not have enough snow insulation. The net impact of the snow offset needs to be interpreted carefully in combination with the S depth,eff in order to evaluate the impact of the snow insulation on permafrost dynamics. Figure 5 shows that Arctic snow depths are relatively shallow. Therefore, because there is a non-linear relationship between offset and S depth,eff , small differences in S depth,eff will have a big impact on the insulating 320 ability. The models typically tend to slightly over-estimate S depth,eff and slightly under-estimate the offset for any given value of S depth,eff .    winter offset than is observed for any given value of S depth,eff (Figure 7). However, it has a larger than observed S depth,eff which increases the insulation and ensures a good relationship between MAAT and MAGT in Figure 8. A comparison with the CMIP5 multi-model ensemble ( Figure S2.7) shows similar differences. It should be noted that this CCI-PF estimate of MAGT is a model-derived reanalysis and the observational uncertainties are likely under-estimated in the current analysis. served ALT is about 4.5 m. Any models with a maximum soil depth less than 4.5 m will be unable to represent this value (Table   1). This is the case for GISS-E2-1-G, where the depth of the middle of the bototm layer is 2.7 m and the ALT is constrained to 2.7 m at the warmer temperatures. Poor vertical discretisation of the soil such as the 3 layers in CanESM5 can introduce large variability into the derivation of ALT. Although the average number of soil layers and the average soil depth increases between 350 the CMIP5 and CMIP6 ensembles (CMIP6 : Table 1 and CMIP5: Table S2.1), this is not universally true.

Active layer thickness
About half of the models have relationships between ALT and MAAT that are comparable with the observations. These are mainly the models with deeper soils. Other models have very deep ALT for example, MPI-ESM1-2-HR and IPSL-CM6A-LR. These both have sufficiently deep soil profiles, but thaw too quickly in the summer, likely because these models do not 355 represent the latent heat of the water phase change. EC-Earth3 and UKESM1-0-LL have active layers around 2m irrespective of MAAT. In the case of UKESM1-0-LL, this is worse than the CMIP5 version of the model where ALT was dependent on MAAT, just with a smaller uncertainty range ( Figure S2.8). This is because there is a large increase in the MAGT between the CMIP versions (MAGT is -5.8 • C for CMIP5 and MAGT and -0.9 • C for CMIP6) caused by adding a multilayered snow scheme (Walters et al., 2019). The inclusion of organic soils and the addition of a moss layer improves the insulating capacity 360 and ability of the soil to hold water and will reduce this thaw depth (Chadburn et al., 2015).

Permafrost extent, annual thawed volume ( D tot ) and annual frozen volume ( F tot )
MAGT and ALT and their relationships with MAAT are important diagnostics for model physics. However, permafrost extent, annual thawed and frozen volume are more relevant when exploring the impacts of permafrost dynamics under future climate change. There are observational-based estimates of permafrost extent (Brown et al., 2003;Obu et al., 2019;Chadburn et al., 365 2017) available for evaluation and this paper introduces an observational-based estimate of annual thawed and frozen volume by extrapolating the site specific relationship between annual mean thawed fraction ( D) and MAGT to the larger scale using the CCI-PF data set. Figure 10 shows that the relationship between annual mean thawed fraction ( D) and MAGT for the available CALM and 370 GTNP data sets is relatively well constrained. As expected, D increases with increasing MAGT. Figure 10 also shows the ability of the models to replicate this relationship. At the warmer temperatures the models tend to show much more variability than the observations. In models such as EC-Earth3 and UKESM1-0-LL this is likely reflecting the sensitivity to the duration over which the soil is thawed -because in these models the ALT is very similar at all temperatures ( Figure 9). Overall the models follow a similar trend of increasing D with increasing MAGT. There are notable discrepancies for IPSL-CM6A-LR 375 and MPI-ESM1-2-HR which might be expected because these two models simulate very deep ALT. D can be converted to annual thawed and frozen volumes ( D tot and F tot ) using the monthly profiles of modelled soil temperature and compared with the observational-based estimate of D tot and F tot discussed in Section 2.3.4. Figure 11 summarises PF ex , D tot and F tot for each of the CMIP6 models. The top plot of Figure 11 shows a comparison of PF benchmark with the observations (empty black bars). All the models fall close to the observed values, as is expected from Table S1.2. PF ex (red bar) should be compared with PF benchmark and not the observations, and any differences between the two quantifies a bias added by the land surface model. Any differences here are again dominated by the snow insulation, which is a combination of the S depth,eff and the relationship between MAGST, MAAT and S depth,eff . When compared with the CMIP5 multi-model ensemble ( Figure S2.10), the spread of values is smaller, with the 385 differences between PF ex and PF benchmark significantly lower in some cases. Assessing Figure 11 in conjunction with Figure   S1.2 and Figure 7 suggests some improvement in snow insulation between the CMIP5 multi-model ensemble and the CMIP6 multi-model ensemble. Figure 11 also shows D tot and F tot for each CMIP6 model. These metrics are calculated for the present-day permafrost region defined by the model specific PF ex . Any differences between models are caused by a combination of differences in the 390

22
MAGT and differences in the relationships between D and MAGT. These lead to the considerable spread in D tot and F tot shown in Figure 11. There is a tendency for the models to over-estimate D tot and under-estimate F tot . This arises partly because of the higher likelihood of the larger values of D at the warmer air temperatures, i.e. in the discontinuous and sporadic permafrost ( Figure 10). The CMIP6 multi-model ensemble has a similar uncertainty to that of CMIP5 suggesting little improvement in the quantification of D tot and F tot between ensembles.

Evaluation metrics
This section summarises some basic evaluation metrics which can be applied to quantify the ability of the land surface modules to represent permafrost dynamics. Relevant climate-related metrics are summarised in Table 3 and shown in Table S1.1 for individual CMIP6 models and Table S2.2 for individual CMIP5 models. Table 4 shows a summary of the land-related evaluation metrics for the observations and the two different multi-model ensembles along with the percentage of each multi-model en-400 semble falling within the range with individual models shown in Table S1.2 and Table S2.3. These metrics should be relatively independent of climate and reflect the behaviour of the land surface module. Of particular note is the ALT, a key metric when simulating permafrost dynamics, which is much deeper than the observations for both CMIP5 and CMIP6.
There is better agreement with the observations for the climate-related metrics (Table 3) than for the land-surface related 405 metrics with only a very small number of models falling within the range of the observations. In addition, there is no consistency in model performance -no model performs well for every evaluation metric (Table S1.2 and Table S2.3). Overall, the percentage of the models which fall within the observed range is relatively low with the majority falling outside the range of the observations (Table 4). However, there are some improvements for all the metrics in the percentage of models that fall within the observed range between the CMIP5 multi-model ensemble and the CMIP6 multi-model ensemble.  The sensitivity of PF benchmark to increasing GSAT shows a loss of between 3.1 and 3.8 ×10 6 km 2 / • C (25 th to 75 th percentile; Table 5) in the CMIP6 multi-model ensemble. This derivation of PF benchmark uses the observed relationship between MAAT and the probability of permafrost from Chadburn et al. (2017) and assumes it is for an equilibrium state. Therefore this sensitivity is highly dependent on the Arctic amplification in the models. The variability in PF benchmark between the different 425 models and different multi-model ensembles is relatively small with no obvious outliers. The sensitivities fall to the lower end of the equilibrium sensitivity proposed by Chadburn et al. (2017) who present a loss of 4.0 +1.0 −1.1 × 10 6 km 2 / • C.
The sensitivity of PF ex to increasing GSAT is related to both the climate and the land surface module and has a wider range of values from 1.7 to 2.7 ×10 6 km 2 / • C for the 25 th to 75 th percentile (Table 5). This range is around 12 to 20 % / • C of the 430 present day permafrost and is comparable to the CMIP5 sensitivities. The loss of permafrost derived from PF ex is less than from PF benchmark . One reason for this is that PF ex includes the interactions of snow and soil thermal and hydrological dynamics. In addition PF ex represents a transient response to GSAT. Despite the shallow soil profile in the majority of the models, the methodology used to derive the PF ex means there will be some implicit time delay in the heat transfer from the surface. There are a few outliers in Figure 12b which have very different sensitivities of PF ex to increasing GSAT. These outliers include the 435 MIROC model which projects a low sensitivity of PF ex to GSAT in both the CMIP6 and CMIP5 multi-model ensembles (see also Figure S2.11). Although there is some improvement in the winter offset in MIROC between CMIP5 and CMIP6 ( Figure   S2.6 and 7), there is still too little snow insulation in the present day model because S depth,eff is relatively shallow. In addition the slope of the relationship between MAGT and MAAT is greater than 1. These factors mean MIROC has a 'permafrost-prone climate' (Slater and Lawrence (2013)).

440
The increase in mean thawed volume ( D tot ) or the decrease in mean frozen volume ( F tot ) is actually relatively consistent between the different models and ranges from 2.1 to 5.9 ×10 3 km 3 / o C (5 th to 95 th percentile; Table 5). This represents a mean loss of frozen volume of around 10-40% of the permafrost in the top 2m of the soil per degree increase in GSAT (5 th to 95 th percentile). Here MIROC is not an outlier -the sensitivity of D tot and F tot to GSAT in MIROC falls within the spread 445 26 of the other models and the relationship between D and MAGT is comparable with the observed relationship. This suggests that the summer thawing processes are well represented in MIROC despite some biases in the sensitivity of the mean deep soil temperature to temperature change. In contrast, the IPSL-CM6A-LR model has a much lower sensitivity of D tot and F tot to GSAT. This is because it does not represent the latent heat required for thawing and therefore has too deep an active layer.

Discussion and Conclusion
This paper examines the permafrost dynamics in both the CMIP6 multi-model ensemble and CMIP5 multi-model ensemble 455 using a wide range of metrics. As far as possible, the metrics were defined so as to identify the effect of biases in the climate separately to biases in the land surface module. Overall, the two multi-model ensembles are very similar in terms of climate, snow and permafrost physics and projected changes under future climate change. This paper does not attempt to document specific improvements to individual models in any detail -the CMIP5 and CMIP6 ensembles contains a slightly different set of models. However, it is apparent that the snow insulation is improved in a few of the models which results in overall less 460 variability in the permafrost extent (PF ex ) in the CMIP6 ensemble than in the CMIP5 ensemble. In general, the ability of the models to simulate of summer thaw depths is little improved between ensembles. One reason for this remains limitations caused by shallow and poorly resolved soil profiles.
Over the past few years there have been a lot of model developments that improved the representation of northern high to climate change. Many of these processes are yet to be included within the climate models.
In particular, excess ground ice which exists as ice lenses or wedges in permafrost soils is a key process that is not included in the current generation of CMIP models. Thawing of ice-rich permafrost ground will lead to landscape changes including subsidence, thaw slumps and active layer detachments and large-scale modification of the hydrological cycle (Liljedahl et al., 2016;Nitzbon et al., 2020). These ice-rich thermokarst landscapes are susceptible to abrupt changes and cover about 20 % of 475 the northern permafrost region (Olefeldt et al., 2016). Recent observations suggest that even very cold permafrost with near surface excess ice is highly vulnerable to rapid thermokarst development and degradation (Farquharson et al., 2019). The inclusion of these processes within the CMIP6 models will further perturb the hydrologcial cycle (e.g., Fraser et al., 2018) and result in additional permafrost degradation not yet quantified by current generation of climate models.

480
The CMIP6 models project a loss of permafrost under future climate change of between 1.7 and 2.7 ×10 6 km 2 / o C. A more impact relevant statistic is the decrease in annual mean frozen volume (3.0 to 5.3 ×10 3 km 3 / o C) or around 10-40 %/ o C. The projections presented here can be used to explore the consequences of permafrost degradation on the large scale hydrological and carbon cycles, for example, additional sea level rise (Zhang et al., 2000) and the additional loss of permafrost carbon (Turetsky et al., 2020;Burke et al., 2017b).
Author contributions. EJB and GK designed the analyses and EJB carried it out. YZ processed the relevant site observations. EJB prepared the manuscript with contributions from all co-authors.

Competing interests. No competing interests
490