Methodological data

Data sources used

The energy performance certificate database managed by LTK is transferred to the HCSO under OSAP data collection no. 2561. Each year, approximately 130–150 thousand dwellings are surveyed, and the detailed documentation prepared by certified experts is uploaded to the LTK system. At the end of 2023, the certification system was revised, and no direct conversion is possible between the old and the new frameworks. For this reason, we decided not to use certificates issued under the new system. To maximise the number of dwellings for which certified energy‑demand data are available, we also included certificates issued in the two years preceding the census, and thus relied on energy‑certificate data from the period 2020–2023. These data were linked to the addresses registered at the time of the census using the detailed address information available, and through these, to the census dwelling records.

As a first step, we assigned the address identifiers from the HCSO Address Register to the addresses contained in the transferred energy certificates. This procedure was successful in 67% of cases: of the certificates available for the four‑year period, almost 370 thousand were matched with an address identifier down to the sub‑unit level. Among these, approximately 35 thousand certificates proved to be duplicates, having been issued multiple times for the same dwelling during the period. After excluding older duplicates, we were able to use 335 thousand records.
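The exclusion of older duplicates can be illustrated with a minimal sketch. It assumes that, for each address identifier, the most recently issued certificate is the one retained; the field names (`address_id`, `issued`, `energy`) and the sample records are hypothetical and do not reflect the actual LTK data structure.

```python
from datetime import date

def keep_latest(certificates):
    """Keep only the most recently issued certificate for each address identifier."""
    latest = {}
    for cert in certificates:
        current = latest.get(cert["address_id"])
        if current is None or cert["issued"] > current["issued"]:
            latest[cert["address_id"]] = cert
    return list(latest.values())

# Hypothetical sample records; real certificates carry many more fields.
sample = [
    {"address_id": "A1", "issued": date(2020, 5, 4), "energy": 310.0},
    {"address_id": "A1", "issued": date(2023, 2, 1), "energy": 180.0},
    {"address_id": "B7", "issued": date(2021, 9, 15), "energy": 245.0},
]
deduplicated = keep_latest(sample)
```

In this toy example the 2020 certificate for address `A1` is dropped in favour of the 2023 one, so two records remain.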

The next step was linking the data to the 2022 census dwelling‑stock table, again using the address identifiers. Of the 335 thousand records, we successfully assigned an energy certificate to 279 thousand dwellings, representing 6.1% of the nearly 4.6 million dwellings. The territorial distribution of the matched records largely reflects the regional pattern of the census dwelling stock, although in Budapest the complexity of addresses in multi‑unit buildings (building, staircase, floor, door) somewhat reduced the matching rate compared with the simpler addressing of predominantly single‑family areas. Nevertheless, we achieved coverage exceeding 5% in every region.

Table 1
Characteristics of the data sources used and key linkage indicators
| Region | Number of dwellings (Census 2022) | Territorial distribution of dwellings, % | Number of energy certificates linked to dwelling records | Territorial distribution of linked certificates, % | Share of dwellings with a certificate, % |
|---|---|---|---|---|---|
| Budapest | 961 061 | 21.0 | 53 291 | 19.1 | 5.5 |
| Pest | 519 420 | 11.3 | 32 819 | 11.8 | 6.3 |
| Central Transdanubia | 474 371 | 10.4 | 29 652 | 10.6 | 6.3 |
| Western Transdanubia | 457 369 | 10.0 | 26 208 | 9.4 | 5.7 |
| Southern Transdanubia | 418 847 | 9.1 | 24 947 | 8.9 | 6.0 |
| Northern Hungary | 510 187 | 11.1 | 36 281 | 13.0 | 7.1 |
| Northern Great Plain | 625 854 | 13.7 | 41 853 | 15.0 | 6.7 |
| Southern Great Plain | 613 429 | 13.4 | 33 969 | 12.2 | 5.5 |
| Total | 4 580 538 | 100.0 | 279 020 | 100.0 | 6.1 |
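The last column of Table 1 follows directly from the two count columns. The short sketch below recomputes the share of dwellings with a certificate from the published counts; the variable and function names are our own and purely illustrative.

```python
# Dwelling counts and linked-certificate counts copied from Table 1.
table1 = {
    "Budapest": (961_061, 53_291),
    "Pest": (519_420, 32_819),
    "Central Transdanubia": (474_371, 29_652),
    "Western Transdanubia": (457_369, 26_208),
    "Southern Transdanubia": (418_847, 24_947),
    "Northern Hungary": (510_187, 36_281),
    "Northern Great Plain": (625_854, 41_853),
    "Southern Great Plain": (613_429, 33_969),
}

def share(part, whole):
    """Percentage share rounded to one decimal, as in the table."""
    return round(100 * part / whole, 1)

coverage = {region: share(linked, dwellings)
            for region, (dwellings, linked) in table1.items()}
total_dwellings = sum(dwellings for dwellings, _ in table1.values())
total_linked = sum(linked for _, linked in table1.values())
total_coverage = share(total_linked, total_dwellings)
```

The regional counts sum to the totals shown in the table, and the overall coverage comes to 6.1%.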

Regression analysis

Compared with the earlier analysis carried out in 2020, a key limitation was the absence of information on renovations undertaken on residential buildings, including energy‑saving improvements that respondents could report in the 2016 microcensus. As a result, the OLS regression based on census information was able to explain only around 60% of the variance in the observations, even after applying a 10% outlier filter. The variables used in the regression, together with their coefficients and significance levels, are presented primarily to illustrate the nature of the relationships between the explanatory variables and the specific primary energy consumption of dwellings, and to introduce the set of variables applied in subsequent analyses. Two separate models were estimated: one for detached houses and one for apartments in multi‑unit buildings. In both cases, the age of the building emerges as a fundamental determinant of a dwelling’s energy demand: the newer a residential building is, the lower its energy requirement tends to be, and this difference remains pronounced even between buildings constructed in 2016–2020 and those built in 2021–2022.
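To show how the coefficients in Table 2 are read, the sketch below adds up the contributions for one illustrative apartment. The coefficients are copied from the apartment‑building column of Table 2; the dwelling itself (price level 0.5, floor area 60, county seat in Western Transdanubia) is hypothetical, and every variable not listed is assumed to sit at its reference category or to be absent, so the result is an illustration rather than a complete fitted value.

```python
# Apartment-building model coefficients (B) copied from Table 2.
CONSTANT = 371.6
PRICE = -7.624        # per unit of the settlement-level specific dwelling price
FLOOR_AREA = -0.275   # per unit of floor area
REGION = {"Southern Great Plain": 0.0, "Western Transdanubia": -20.813}
SETTLEMENT = {"county seat": -25.989, "town": -16.919}
BUILT = {"before 1919": 0.0, "1981-2000": -69.075, "2021-2022": -167.467}

def fitted_value(price, floor_area, region, settlement, built):
    """Constant plus the contribution of each listed explanatory variable."""
    return (CONSTANT + PRICE * price + FLOOR_AREA * floor_area
            + REGION[region] + SETTLEMENT[settlement] + BUILT[built])

# Hypothetical dwelling, evaluated for two construction periods.
older = fitted_value(0.5, 60, "Western Transdanubia", "county seat", "1981-2000")
newer = fitted_value(0.5, 60, "Western Transdanubia", "county seat", "2021-2022")
```

With all other inputs held fixed, the two values differ by 98.392, which is simply the gap between the two construction‑period coefficients.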

Table 2
Coefficients and p‑values of the regression models
Dependent variable: specific primary energy consumption (as defined by the TNM regulation)
Apartment‑building model: R² = 0.597. Detached house model: R² = 0.616.
Explanatory variable, followed by B and p for the apartment‑building model and B and p for the detached house model
  Constant 371.6 0.000 457.062 0.000
Settlement‑level specific dwelling price, million HUF -7.624 0.000 -82.304 0.000
Floor area of the dwelling -0.275 0.000 -0.389 0.000
Region (reference category: Southern Great Plain)
  Southern Transdanubia -26.909 0.000 -4.769 0.000
  Northern Great Plain 2.625 0.000 2.257 0.001
  Northern Hungary -8.566 0.000 -0.940 0.192
  Central Transdanubia -10.509 0.000 -2.016 0.006
  Western Transdanubia -20.813 0.000 -10.907 0.000
Type of settlement
  Budapest -26.556 0.000 4.390 0.000
  county seat -25.989 0.000 -8.203 0.000
  town -16.919 0.000 -1.348 0.007
Year of construction (reference category: before 1919)
  1919–1945 -12.365 0.000 13.599 0.000
  1946–1960 -42.289 0.000 16.336 0.000
  1961–1980 -64.704 0.000 13.771 0.000
  1981–2000 -69.075 0.000 -43.050 0.000
  2001–2010 -121.199 0.000 -104.281 0.000
  2011–2015 -131.618 0.000 -128.983 0.000
  2016–2020 -152.124 0.000 -148.669 0.000
  2021–2022 -167.467 0.000 -168.851 0.000
Wall structure (reference category: brick)
  concrete wall -8.260 0.000 -3.210 0.008
  adobe, timber or other wall type -6.728 0.006 1.975 0.000
  panel wall -18.573 0.000
Number of dwellings (reference category: 4–12 units)
  13 units or more -0.817 0.127
Fuel type (reference category: piped natural gas)
  other -5.708 0.827 -19.338 0.140
  LPG -21.534 0.729 1.150 0.912
  coal 29.185 0.000 -116.043 0.055
  electricity 23.286 0.322 -29.646 0.106
Building height (reference category: single‑storey)
  2–3 storeys -42.493 0.000
  4 storeys -57.800 0.000
  5 storeys or more -61.886 0.000
Present in the dwelling/house
internet connection -4.004 0.000 -21.120 0.000
heat‑pump heating -12.240 0.000 11.529 0.000
air‑conditioning unit -2.630 0.000 -27.286 0.000
photovoltaic panel -17.342 0.000 -97.923 0.000
solar thermal collector 35.626 0.000 109.706 0.000
Heating and fuel type(s) (reference: room heating with piped gas)
electricity, room‑by‑room 67.436 0.000 16.254 0.377
wood, room‑by‑room 64.585 0.300 -22.693 0.000
coal, room‑by‑room 92.726 0.135
other fuel, room‑by‑room 41.782 0.111 -28.430 0.031
piped gas and electricity, room‑by‑room 7.306 0.000 3.822 0.062
piped gas and wood, room‑by‑room 18.800 0.013 -8.270 0.000
piped gas and other fuel, room‑by‑room -74.862 0.000 -9.531 0.359
electricity and wood, room‑by‑room 62.546 0.000 7.528 0.683
electricity and other fuel, room‑by‑room 30.641 0.133 44.448 0.006
coal and wood, room‑by‑room 86.724 0.173 -22.511 0.000
multiple other fuels, room‑by‑room 62.765 0.123 10.714 0.322
central boiler, electricity 47.788 0.000 25.745 0.161
central boiler, wood 89.694 0.152 19.234 0.000
central boiler, coal 18.545 0.476 144.609 0.018
central boiler, other fuel 3.475 0.011 -20.449 0.111
central boiler, piped gas and electricity 0.721 0.446
central boiler, piped gas and wood -11.632 0.027 7.001 0.000
central boiler, piped gas and other fuel -12.825 0.326 12.853 0.008
central boiler, electricity and wood 40.628 0.004 28.431 0.123
central boiler, electricity and other fuel 21.968 0.211 8.357 0.638
central boiler, coal and wood 96.928 0.145 31.579 0.000
central boiler, multiple other fuels -32.395 0.000 -11.662 0.132
district heating -32.617 0.000
 
Hot‑water supply
  boiler, water heater etc. -11.149 0.016 5.579 0.000
  district hot‑water network -33.375 0.000 -17.580 0.000
High‑rise panel building -5.296 0.000

Figure 1

The regression estimate reproduced the measured energy class in 34% of dwellings with known ratings, and in a further 37% it under‑ or overestimated the class by one category. The model performed particularly poorly for detached houses, while for apartments in multi‑unit buildings it missed by at most one category in 72% of cases.
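The two agreement rates quoted above (exact match and a miss by one category) can be computed as in the following sketch. It assumes the energy classes are coded as consecutive integer ranks; the ranks and the ten sample dwellings are invented for illustration and are not taken from the certificate data.

```python
def class_agreement(observed, estimated):
    """Return (share of exact matches, share missed by exactly one class)."""
    gaps = [abs(o - e) for o, e in zip(observed, estimated)]
    exact = sum(gap == 0 for gap in gaps) / len(gaps)
    off_by_one = sum(gap == 1 for gap in gaps) / len(gaps)
    return exact, off_by_one

# Hypothetical class ranks (0 = best class) for ten dwellings.
observed = [3, 5, 5, 7, 2, 6, 4, 8, 5, 6]
estimated = [3, 6, 5, 5, 2, 7, 6, 8, 4, 9]
exact, off_by_one = class_agreement(observed, estimated)
```

In this toy sample four of the ten classes are reproduced exactly and three are missed by one category.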

Random forest modelling

In the case of energy performance certificates, the main limitation of linear regression is that it can capture the relationship between the specific energy indicator and dwelling characteristics only in linear form. We know, for example, that exceptionally high‑quality new dwellings and exceptionally poor‑quality older ones perform far better or worse than an average dwelling, which means that the indicator to be estimated behaves non‑linearly. A method without such implicit assumptions is therefore required.

Random forest regression is a machine‑learning approach particularly well suited to this task. Instead of relying on a single decision tree, it builds hundreds or thousands of trees in parallel and averages their outputs, as if each member of an expert panel evaluated the building independently before forming a collective judgement. This approach is especially appropriate for energy‑efficiency rating systems where scoring follows conditional logic rather than simple linear coefficients. Through recursive binary splitting, the algorithm uncovers hidden scoring thresholds and thereby approximates the underlying evaluation rules without these having to be specified in advance.

This works well for energy‑efficiency problems because a building’s energy use is shaped by many interacting factors: its size, insulation, heating system and location, together with their complex interdependencies. Random forest can detect these hidden relationships without requiring a predefined mathematical formula. Moreover, the method tolerates strong correlations between variables (such as building age and wall type) and handles different data types (numerical, categorical or binary) without difficulty.
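The idea of recursive binary splitting and averaging over many trees can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not the model used in the analysis: the trees are grown on bootstrap samples only (a full random forest also samples the candidate variables at each split), and the data are synthetic, with hidden thresholds at construction years 1980 and 2010 that the trees have to discover. All names and settings are our own assumptions.

```python
import random
from statistics import fmean

def sse(values):
    """Sum of squared deviations from the mean."""
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows, targets):
    """Find the (feature, threshold) pair with the lowest summed squared error."""
    best, best_err = None, float("inf")
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows})[1:]:  # keeps both sides non-empty
            left = [y for r, y in zip(rows, targets) if r[j] < t]
            right = [y for r, y in zip(rows, targets) if r[j] >= t]
            err = sse(left) + sse(right)
            if err < best_err:
                best, best_err = (j, t), err
    return best

def grow(rows, targets, depth):
    """Recursive binary splitting; a leaf stores the mean target of its rows."""
    split = best_split(rows, targets) if depth > 0 and len(rows) >= 4 else None
    if split is None:
        return fmean(targets)
    j, t = split
    lo = [(r, y) for r, y in zip(rows, targets) if r[j] < t]
    hi = [(r, y) for r, y in zip(rows, targets) if r[j] >= t]
    return (j, t, grow(*zip(*lo), depth - 1), grow(*zip(*hi), depth - 1))

def predict_tree(node, row):
    """Follow the splits down to a leaf value."""
    while isinstance(node, tuple):
        j, t, lo, hi = node
        node = lo if row[j] < t else hi
    return node

def fit_forest(rows, targets, n_trees=25, depth=3, seed=0):
    """Grow each tree on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(grow([rows[i] for i in idx], [targets[i] for i in idx], depth))
    return forest

def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return fmean([predict_tree(tree, row) for tree in forest])

# Synthetic data: energy demand steps down at hidden construction-year thresholds.
rng = random.Random(42)
rows, targets = [], []
for _ in range(120):
    year = rng.randrange(1900, 2023)
    area = rng.choice([40, 60, 80, 100, 120])
    base = 300 if year < 1980 else 200 if year < 2010 else 100
    rows.append((year, area))
    targets.append(base + rng.uniform(-10, 10))

forest = fit_forest(rows, targets)
old_estimate = predict_forest(forest, (1950, 80))
new_estimate = predict_forest(forest, (2020, 80))
```

Because the step pattern is recovered from the data rather than specified in advance, the estimate for a 1950 building comes out far higher than that for a 2020 building, which a single straight line through construction year would only approximate.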

In our random forest model, we used the same dwelling characteristics and geographical variables as in the linear regression model. Because complex, high‑parameter models are prone to overfitting (the patterns in the data are captured with excessive precision, which limits generalisability), we applied cross‑validation. We created several partitions of the dataset, fitted a separate model to each, and compared these models based on their predictive performance and accuracy. The final model was then used to generate estimates for all dwellings with known energy certificates, and these estimates were compared with the observed values. The comparison was carried out both at the level of individual dwellings and across averages of different territorial units.

Figure 2 illustrates the relationship between observed and predicted energy demand across the population of individual dwellings. The horizontal axis shows the observed values and the vertical axis the estimated ones. The plot area is divided into cells whose colour indicates how many dwellings fall into each cell: lighter shades represent many dwellings, darker shades few (even a single one). For a large share of dwellings the observed and estimated values lie very close to each other, indicating good model performance. At the same time, a small number of dwellings show substantial dispersion around the line of equality, meaning that the model makes larger‑than‑average errors for these cases. Such deviations are typically greatest where the observed score is exceptionally high (i.e. the dwelling is extremely energy‑inefficient) and the model cannot fully capture this: even in these cases it predicts a high value, but not high enough.

Overall, with a dataset of this size and heterogeneity, such dispersion is to be expected, and the key criterion is that estimation errors remain small for the overwhelming majority of dwellings, which is indeed the case here.
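The cross‑validation step mentioned above can be sketched as follows. The sketch assumes a standard k‑fold scheme, which may differ from the partitioning actually used, and a one‑variable least‑squares line stands in for the random forest so that the example stays short; the function names and the synthetic data are illustrative only.

```python
import random
from statistics import fmean

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, predict, k=5):
    """Fit on k-1 folds, then return the mean absolute error on each held-out fold."""
    errors = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.append(fmean([abs(predict(model, xs[i]) - ys[i]) for i in fold]))
    return errors

def fit_line(xs, ys):
    """Stand-in model: least-squares line with a single predictor."""
    mx, my = fmean(xs), fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept

# Synthetic, noise-free data, so every held-out error should be close to zero.
xs = list(range(50))
ys = [400 - 3 * x for x in xs]
fold_errors = cross_validate(xs, ys, fit_line, predict_line)
```

Comparing the held‑out errors across folds shows whether a model generalises: if the error on unseen folds is much larger than on the training data, the model is overfitting.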

Figure 2

When the energy efficiency of individual dwellings is assessed not through the continuous indicator but through the calculated energy classes, the difference between the measured and estimated ÉKM‑based classifications becomes clearly visible. The random forest model predicted the categories with greater accuracy than the OLS model. It is also noticeable that both methods tended to assign slightly worse energy classes than the observed ones for a number of dwellings.

Figure 3

However, the purpose of the calculations is not to determine the energy demand of individual dwellings, but to produce accurate estimates for territorial units (counties, regions) or other groups. It is important to emphasise that the county variable is included as a predictor in the model, so large discrepancies would indicate model failure. This is not the case: for many county–dwelling‑type combinations the average deviation is below one unit, and in Table 3 only one difference (detached houses in Tolna, 11.5) reaches two digits.

Table 3
Observed and estimated county‑level averages by building type, ÉKM, 2022

| County | Detached house, certified value | Detached house, estimated value | Apartment building, certified value | Apartment building, estimated value |
|---|---|---|---|---|
| Bács-Kiskun | 311.1 | 316.7 | 190.7 | 190.9 |
| Baranya | 299.5 | 297.7 | 145.6 | 146.8 |
| Békés | 335.0 | 336.7 | 210.8 | 211.6 |
| Borsod-Abaúj-Zemplén | 327.7 | 330.8 | 180.9 | 180.3 |
| Csongrád-Csanád | 240.4 | 241.9 | 206.4 | 206.9 |
| Fejér | 308.9 | 305.7 | 182.0 | 181.1 |
| Budapest | 291.4 | 293.6 | 172.8 | 173.4 |
| Győr-Moson-Sopron | 262.0 | 263.2 | 166.6 | 167.5 |
| Hajdú-Bihar | 280.6 | 284.4 | 175.5 | 175.5 |
| Heves | 339.2 | 336.8 | 193.4 | 195.3 |
| Jász-Nagykun-Szolnok | 353.9 | 348.6 | 210.0 | 206.7 |
| Komárom-Esztergom | 294.8 | 294.2 | 194.2 | 190.1 |
| Nógrád | 359.7 | 358.7 | 218.0 | 221.2 |
| Pest | 262.6 | 260.3 | 184.5 | 185.4 |
| Somogy | 323.5 | 318.8 | 191.7 | 192.6 |
| Szabolcs-Szatmár-Bereg | 327.4 | 329.2 | 159.6 | 164.1 |
| Tolna | 311.9 | 323.4 | 170.2 | 173.1 |
| Vas | 290.9 | 298.7 | 191.5 | 192.3 |
| Veszprém | 296.2 | 295.7 | 174.6 | 177.5 |
| Zala | 309.7 | 306.6 | 173.2 | 175.8 |

Since county‑level information is included as a predictor in our model, accuracy at this level is not, in itself, evidence that the model can reliably predict territorial units it has not encountered before. Examining districts provides a more meaningful test of predictive performance, as district‑level information was not included among the model’s variables, and the algorithm therefore could not have learned the specific composition of each district. The distribution of differences between observed and estimated district‑level average scores shows that these differences span a much wider range than in the case of counties. At the same time, it is striking that for 75 districts the difference falls between –2 and +4 kWh/m²/a, which indicates particularly strong model performance, given that the energy‑efficiency score spans several hundred units. In the majority of districts—144 out of 198—the deviation lies between –9 and +14 kWh/m²/a, which also represents good performance. Only a few districts show larger discrepancies. These tend to be districts with low housing‑market activity, where few energy certificates are available, making the observed average values more uncertain. Such districts are typically rural, with below‑average housing stocks, located in less affluent regions. Our results illustrate why estimates for very small territorial units must be treated with caution: the smaller the unit, the greater the likelihood that it is characterised by unique local conditions that cannot be generalised from the full population, and the fewer observations are available—two factors that often reinforce each other.

Figure 4

Predictions for individual buildings exhibit substantial uncertainty, yet converge strikingly close to the true values when aggregated at the level of districts, counties or regions. This highlights one of the key strengths of random forest models in estimating energy efficiency. The behaviour stems from the model’s balanced error distribution: prediction errors do not systematically skew in one direction but disperse roughly symmetrically around the true values. When the predictions are aggregated at regional scales, positive and negative errors largely cancel each other out, producing averages that typically lie only 1–2 points from the observed values on a scale of several hundred. Random forests are thus highly effective at capturing macro‑level relationships between residential building characteristics and energy consumption, which makes the model particularly valuable for policy planning, regional energy‑efficiency assessment and trend analysis, even though predictions for individual buildings carry greater uncertainty.
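The cancellation of errors under aggregation can be demonstrated with a small simulation. It rests on stated assumptions rather than on the actual model output: every dwelling receives an independent, symmetric, zero‑mean prediction error with a standard deviation of 40 units, and dwellings are grouped into 20 hypothetical territorial units of 500 each. Under these assumptions the error of a group average has a standard deviation of about 40/√500 ≈ 1.8 units.

```python
import random
from statistics import fmean

rng = random.Random(1)
N_GROUPS, GROUP_SIZE, ERROR_SD = 20, 500, 40.0  # hypothetical settings

individual_abs_errors, group_abs_errors = [], []
for _ in range(N_GROUPS):
    true_values = [rng.uniform(100, 400) for _ in range(GROUP_SIZE)]
    # Symmetric, zero-mean prediction error for every dwelling.
    predictions = [v + rng.gauss(0, ERROR_SD) for v in true_values]
    individual_abs_errors += [abs(p - v) for p, v in zip(predictions, true_values)]
    # Error of the group average: positive and negative errors offset each other.
    group_abs_errors.append(abs(fmean(predictions) - fmean(true_values)))

mean_individual_error = fmean(individual_abs_errors)
mean_group_error = fmean(group_abs_errors)
```

The average absolute error of the group means is an order of magnitude smaller than the average absolute error for individual dwellings. The cancellation relies on errors being unbiased; any systematic under‑ or overestimation would carry through to the aggregates.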