Methodological data
Data sources used
The energy performance certificate database managed by LTK is transferred to the HCSO under OSAP data collection no. 2561. Each year, approximately 130–150 thousand dwellings are surveyed, and the detailed documentation prepared by certified experts is uploaded to the LTK system. At the end of 2023, the certification system was revised, and no direct conversion is possible between the old and the new frameworks. For this reason, we decided not to use certificates issued under the new system. To maximise the number of dwellings for which certified energy‑demand data are available, we also included certificates issued in the two years preceding the census, and thus relied on energy‑certificate data from the period 2020–2023. These data were linked to the addresses registered at the time of the census using the detailed address information available, and through these, to the census dwelling records.
As a first step, we assigned the address identifiers from the HCSO Address Register to the addresses contained in the transferred energy certificates. This procedure was successful in 67% of cases: of the certificates available for the four‑year period, almost 370 thousand were matched with an address identifier down to the sub‑unit level. Among these, approximately 35 thousand certificates proved to be duplicates, having been issued multiple times for the same dwelling during the period. After excluding older duplicates, we were able to use 335 thousand records.
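The de-duplication step can be sketched as follows. This is a minimal illustration, not the production procedure: the record layout, the identifiers and the values are hypothetical, and "older duplicate" is assumed to mean any certificate with an earlier issue date for the same address identifier.

```python
from datetime import date

# Hypothetical certificate records: (address_id, issue_date, energy_demand).
certificates = [
    ("A-001", date(2020, 5, 10), 310.0),
    ("A-001", date(2022, 3, 1), 250.0),   # re-issued for the same dwelling
    ("A-002", date(2021, 7, 15), 180.0),
    ("A-003", date(2023, 1, 20), 140.0),
    ("A-003", date(2020, 11, 2), 205.0),  # older duplicate
]

def keep_latest(records):
    """Return one record per address_id, keeping the most recent issue date."""
    latest = {}
    for rec in records:
        address_id, issued, _ = rec
        if address_id not in latest or issued > latest[address_id][1]:
            latest[address_id] = rec
    return sorted(latest.values())

deduplicated = keep_latest(certificates)
for address_id, issued, demand in deduplicated:
    print(address_id, issued.isoformat(), demand)
```

Keeping the most recent certificate means the retained value reflects the state of the dwelling closest to the end of the observation period.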
The next step was linking the data to the 2022 census dwelling‑stock table, again using the address identifiers. Of the 335 thousand records, we successfully assigned an energy certificate to 279 thousand dwellings, representing 6.1% of the nearly 4.6 million dwellings. The territorial distribution of the matched records largely reflects the regional pattern of the census dwelling stock, although in Budapest the complexity of addresses in multi‑unit buildings (building, staircase, floor, door) somewhat reduced the matching rate compared with the simpler addressing of predominantly single‑family areas. Nevertheless, we achieved coverage exceeding 5% in every region.
Table 1
Characteristics of the data sources used and key linkage indicators
| Region | Number of dwellings (Census 2022) | Territorial distribution of dwellings, % | Number of energy certificates linked to dwelling records | Territorial distribution of linked certificates, % | Share of dwellings with a certificate, % |
|---|---|---|---|---|---|
| Budapest | 961 061 | 21.0 | 53 291 | 19.1 | 5.5 |
| Pest | 519 420 | 11.3 | 32 819 | 11.8 | 6.3 |
| Central Transdanubia | 474 371 | 10.4 | 29 652 | 10.6 | 6.3 |
| Western Transdanubia | 457 369 | 10.0 | 26 208 | 9.4 | 5.7 |
| Southern Transdanubia | 418 847 | 9.1 | 24 947 | 8.9 | 6.0 |
| Northern Hungary | 510 187 | 11.1 | 36 281 | 13.0 | 7.1 |
| Northern Great Plain | 625 854 | 13.7 | 41 853 | 15.0 | 6.7 |
| Southern Great Plain | 613 429 | 13.4 | 33 969 | 12.2 | 5.5 |
| Total | 4 580 538 | 100.0 | 279 020 | 100.0 | 6.1 |
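The last column of Table 1 is the number of linked certificates divided by the census dwelling count. The short sketch below recomputes it for three rows, using figures copied from the table. The function name is illustrative.

```python
# Figures copied from Table 1: (dwellings in Census 2022, linked certificates).
table1 = {
    "Budapest": (961_061, 53_291),
    "Northern Hungary": (510_187, 36_281),
    "Total": (4_580_538, 279_020),
}

def coverage_percent(dwellings, linked):
    """Share of dwellings with a linked certificate, in percent, one decimal."""
    return round(100 * linked / dwellings, 1)

coverage = {region: coverage_percent(*counts) for region, counts in table1.items()}
print(coverage)
```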
Regression analysis
Compared with the earlier analysis carried out in 2020, a key limitation was the absence of information on renovations undertaken on residential buildings, including energy‑saving improvements that respondents could report in the 2016 microcensus. As a result, the OLS regression based on census information was able to explain only around 60% of the variance in the observations, even after applying a 10% outlier filter. The variables used in the regression, together with their coefficients and significance levels, are presented primarily to illustrate the nature of the relationships between the explanatory variables and the specific primary energy consumption of dwellings, and to introduce the set of variables applied in subsequent analyses. Two separate models were estimated: one for detached houses and one for apartments in multi‑unit buildings. In both cases, the age of the building emerges as a fundamental determinant of a dwelling’s energy demand: the newer a residential building is, the lower its energy requirement tends to be, and this difference remains pronounced even between buildings constructed in 2016–2020 and those built in 2021–2022.
Table 2
Coefficients and p‑values of the regression models
Dependent variable: specific primary energy consumption (as defined by the TNM regulation). Apartment‑building model: R² = 0.597; detached house model: R² = 0.616.

| Explanatory variable | Apartment‑building model, B | Apartment‑building model, p | Detached house model, B | Detached house model, p |
|---|---|---|---|---|
| Constant | 371.6 | 0.000 | 457.062 | 0.000 |
| Settlement‑level specific dwelling price, million HUF | -7.624 | 0.000 | -82.304 | 0.000 |
| Floor area of the dwelling | -0.275 | 0.000 | -0.389 | 0.000 |
| Region (reference category: Southern Great Plain) | | | | |
| Southern Transdanubia | -26.909 | 0.000 | -4.769 | 0.000 |
| Northern Great Plain | 2.625 | 0.000 | 2.257 | 0.001 |
| Northern Hungary | -8.566 | 0.000 | -0.940 | 0.192 |
| Central Transdanubia | -10.509 | 0.000 | -2.016 | 0.006 |
| Western Transdanubia | -20.813 | 0.000 | -10.907 | 0.000 |
| Type of settlement | | | | |
| Budapest | -26.556 | 0.000 | 4.390 | 0.000 |
| county seat | -25.989 | 0.000 | -8.203 | 0.000 |
| town | -16.919 | 0.000 | -1.348 | 0.007 |
| Year of construction (reference category: before 1919) | | | | |
| 1919–1945 | -12.365 | 0.000 | 13.599 | 0.000 |
| 1946–1960 | -42.289 | 0.000 | 16.336 | 0.000 |
| 1961–1980 | -64.704 | 0.000 | 13.771 | 0.000 |
| 1981–2000 | -69.075 | 0.000 | -43.050 | 0.000 |
| 2001–2010 | -121.199 | 0.000 | -104.281 | 0.000 |
| 2011–2015 | -131.618 | 0.000 | -128.983 | 0.000 |
| 2016–2020 | -152.124 | 0.000 | -148.669 | 0.000 |
| 2021–2022 | -167.467 | 0.000 | -168.851 | 0.000 |
| Wall structure (reference category: brick) | | | | |
| concrete wall | -8.260 | 0.000 | -3.210 | 0.008 |
| adobe, timber or other wall type | -6.728 | 0.006 | 1.975 | 0.000 |
| panel wall | -18.573 | 0.000 | | |
| Number of dwellings (reference category: 4–12 units) | | | | |
| 13 units or more | -0.817 | 0.127 | | |
| Fuel type (reference category: piped natural gas) | | | | |
| other | -5.708 | 0.827 | -19.338 | 0.140 |
| LPG | -21.534 | 0.729 | 1.150 | 0.912 |
| coal | 29.185 | 0.000 | -116.043 | 0.055 |
| electricity | 23.286 | 0.322 | -29.646 | 0.106 |
| Building height (reference category: single‑storey) | | | | |
| 2–3 storeys | -42.493 | 0.000 | | |
| 4 storeys | -57.800 | 0.000 | | |
| 5 storeys or more | -61.886 | 0.000 | | |
| Present in the dwelling/house | | | | |
| internet connection | -4.004 | 0.000 | -21.120 | 0.000 |
| heat‑pump heating | -12.240 | 0.000 | 11.529 | 0.000 |
| air‑conditioning unit | -2.630 | 0.000 | -27.286 | 0.000 |
| photovoltaic panel | -17.342 | 0.000 | -97.923 | 0.000 |
| solar thermal collector | 35.626 | 0.000 | 109.706 | 0.000 |
| Heating and fuel type(s) (reference: room heating with piped gas) | | | | |
| electricity, room‑by‑room | 67.436 | 0.000 | 16.254 | 0.377 |
| wood, room‑by‑room | 64.585 | 0.300 | -22.693 | 0.000 |
| coal, room‑by‑room | 92.726 | 0.135 | | |
| other fuel, room‑by‑room | 41.782 | 0.111 | -28.430 | 0.031 |
| piped gas and electricity, room‑by‑room | 7.306 | 0.000 | 3.822 | 0.062 |
| piped gas and wood, room‑by‑room | 18.800 | 0.013 | -8.270 | 0.000 |
| piped gas and other fuel, room‑by‑room | -74.862 | 0.000 | -9.531 | 0.359 |
| electricity and wood, room‑by‑room | 62.546 | 0.000 | 7.528 | 0.683 |
| electricity and other fuel, room‑by‑room | 30.641 | 0.133 | 44.448 | 0.006 |
| coal and wood, room‑by‑room | 86.724 | 0.173 | -22.511 | 0.000 |
| multiple other fuels, room‑by‑room | 62.765 | 0.123 | 10.714 | 0.322 |
| central boiler, electricity | 47.788 | 0.000 | 25.745 | 0.161 |
| central boiler, wood | 89.694 | 0.152 | 19.234 | 0.000 |
| central boiler, coal | 18.545 | 0.476 | 144.609 | 0.018 |
| central boiler, other fuel | 3.475 | 0.011 | -20.449 | 0.111 |
| central boiler, piped gas and electricity | 0.721 | 0.446 | | |
| central boiler, piped gas and wood | -11.632 | 0.027 | 7.001 | 0.000 |
| central boiler, piped gas and other fuel | -12.825 | 0.326 | 12.853 | 0.008 |
| central boiler, electricity and wood | 40.628 | 0.004 | 28.431 | 0.123 |
| central boiler, electricity and other fuel | 21.968 | 0.211 | 8.357 | 0.638 |
| central boiler, coal and wood | 96.928 | 0.145 | 31.579 | 0.000 |
| central boiler, multiple other fuels | -32.395 | 0.000 | -11.662 | 0.132 |
| district heating | -32.617 | 0.000 | | |
| Hot‑water supply | | | | |
| boiler, water heater etc. | -11.149 | 0.016 | 5.579 | 0.000 |
| district hot‑water network | -33.375 | 0.000 | -17.580 | 0.000 |
| High‑rise panel building | -5.296 | 0.000 | | |
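To show how the coefficients in Table 2 combine, the sketch below adds up a few apartment‑building coefficients for a hypothetical dwelling. It is a partial illustration only: all variables not listed (including the dwelling price) are left at zero or at their reference category, so the result is not a realistic estimate, and the floor area is assumed to be measured in m².

```python
# Selected apartment-building coefficients copied from Table 2.
CONSTANT = 371.6
COEFFICIENTS = {
    "floor_area": -0.275,                    # per unit of floor area (m² assumed)
    "region=Western Transdanubia": -20.813,  # reference: Southern Great Plain
    "settlement=town": -16.919,
    "built=1961-1980": -64.704,              # reference: before 1919
}

def partial_prediction(floor_area, active_dummies):
    """Constant + slope * floor area + the coefficient of every active category."""
    value = CONSTANT + COEFFICIENTS["floor_area"] * floor_area
    for name in active_dummies:
        value += COEFFICIENTS[name]
    return value

estimate = partial_prediction(
    60, ["region=Western Transdanubia", "settlement=town", "built=1961-1980"]
)
print(round(estimate, 3))
```

A category that is the reference (for example a building from before 1919) contributes nothing, which is why each coefficient reads as a difference from the reference category.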
Figure 1
The regression estimate reproduced the measured energy class in 34% of dwellings with known ratings, and in a further 37% it under‑ or overestimated the class by one category. The model performed particularly poorly for detached houses, while for apartments in multi‑unit buildings it missed by at most one category in 72% of cases.
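The two shares quoted above (exact match, and a miss by one category) can be computed as in the following sketch. The class ranks are hypothetical integers (a lower rank meaning a better class); the actual class labels and thresholds are not reproduced here.

```python
def class_agreement(observed, predicted):
    """Return (share of exact matches, share of misses by exactly one class)."""
    pairs = list(zip(observed, predicted))
    exact = sum(o == p for o, p in pairs)
    off_by_one = sum(abs(o - p) == 1 for o, p in pairs)
    return exact / len(pairs), off_by_one / len(pairs)

# Hypothetical class ranks for ten dwellings.
observed_classes = [3, 4, 4, 5, 6, 7, 7, 8, 9, 10]
predicted_classes = [3, 5, 4, 7, 6, 8, 7, 8, 6, 9]

exact_share, one_off_share = class_agreement(observed_classes, predicted_classes)
print(exact_share, one_off_share)
```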
Random forest modelling
In the case of energy performance certificates, the main limitation of linear regression is that it can capture the relationship between the specific energy indicator and dwelling characteristics only in linear form. We know, for example, that exceptionally high‑quality new dwellings and exceptionally poor‑quality older ones perform far better or worse than an average dwelling—meaning that the indicator to be estimated exhibits non‑linear behaviour. A method without such implicit assumptions is therefore required. Random forest regression is a machine‑learning approach particularly well suited to this task. Instead of relying on a single decision tree, it builds hundreds or thousands of trees in parallel and averages their outputs, as if each member of an expert panel evaluated the building independently before forming a collective judgement. This approach is especially appropriate for energy‑efficiency rating systems where scoring follows conditional logic rather than simple linear coefficients. Through recursive binary splitting, the algorithm naturally uncovers hidden scoring thresholds, effectively reconstructing the underlying evaluation system without explicit programming. This works well for energy‑efficiency problems because a building’s energy use is shaped by many interacting factors—its size, insulation, heating system, location and their complex interdependencies. Random forest can detect these hidden relationships without requiring a predefined mathematical formula. Moreover, the method tolerates strong correlations between variables (such as building age and wall type) and handles different data types—numerical, categorical or binary—without difficulty.
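The idea of averaging many trees fitted to resampled data can be sketched with one‑split regression trees ("stumps"). This is only the bootstrap‑aggregation core of the method, under simplifying assumptions: a real random forest grows deeper trees and also samples the candidate variables at each split, and the data below are simulated, not taken from the certificates.

```python
import random
from statistics import mean

def fit_stump(points):
    """Fit a one-split regression tree: pick the threshold with the lowest squared error."""
    best = None
    xs = sorted({x for x, _ in points})
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in points if x <= threshold]
        right = [y for x, y in points if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # degenerate sample with a single distinct x value
        overall = mean(y for _, y in points)
        return float("inf"), overall, overall
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_forest(points, n_trees, rng):
    """Fit each stump to a bootstrap sample (drawn with replacement)."""
    return [fit_stump([rng.choice(points) for _ in points]) for _ in range(n_trees)]

def predict_forest(forest, x):
    """Average the predictions of all stumps."""
    return mean(predict_stump(stump, x) for stump in forest)

# Simulated data: energy demand drops sharply for buildings built from 2000 on.
rng = random.Random(0)
data = [(year, (300.0 if year < 2000 else 150.0) + rng.gauss(0, 10))
        for year in range(1950, 2023)]

forest = fit_forest(data, n_trees=50, rng=rng)
old_building = predict_forest(forest, 1960)
new_building = predict_forest(forest, 2015)
print(round(old_building), round(new_building))
```

No threshold is supplied to the algorithm: each stump finds the break in the data on its own, which is the behaviour the paragraph above describes as uncovering hidden thresholds.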
In our random forest model, we used the same dwelling characteristics and geographical variables as in the linear regression model. Complex, high‑parameter models are prone to overfitting: they may capture the patterns of the training data with excessive precision and therefore generalise poorly. To guard against this, we applied cross‑validation. We created several partitions of the dataset, fitted a separate model to each, and compared these models based on their predictive performance and accuracy. The final model was then used to generate estimates for all dwellings with known energy certificates, and these estimates were compared with the observed values. The comparison was carried out both at the level of individual dwellings and across averages of different territorial units.

Figure 2 illustrates the relationship between observed and predicted energy demand across the population of individual dwellings. On the graph, the horizontal axis shows the observed values and the vertical axis the estimated ones. The plot area is divided into cells whose colour indicates how many dwellings fall into each cell: lighter shades represent many dwellings, darker shades few (even a single one). For a large share of dwellings the observed and estimated values lie very close to each other, indicating good model performance. At the same time, a small number of dwellings show substantial dispersion around the line of equality, meaning that the model makes larger‑than‑average errors for these cases. Such deviations are typically greatest where the observed score is exceptionally high (i.e. the dwelling is extremely energy‑inefficient), and the model cannot fully capture this. Even in these cases the model predicts a high value, but not high enough.
Overall, with a dataset of this size and heterogeneity, such dispersion is to be expected, and the key criterion is that estimation errors remain small for the overwhelming majority of dwellings—which is indeed the case here.
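The partitioning behind cross‑validation can be sketched as follows. The text does not state which scheme was used, so k‑fold splitting is an assumption here, and the function name and fold count are illustrative.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds of near-equal size
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold

splits = list(k_fold_indices(n_rows=10, k=5))
for train, test in splits:
    print(sorted(test), len(train))
```

Each row is held out exactly once, so every model is evaluated on data it did not see during fitting.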
Figure 2
When the energy efficiency of individual dwellings is assessed not through the continuous indicator but through the calculated energy classes, the difference between the measured and estimated ÉKM‑based classifications becomes clearly visible. The random forest model predicted the categories with greater accuracy than the OLS model. It is also noticeable that both methods tended to assign slightly worse energy classes than the observed ones for a number of dwellings.
Figure 3
However, the purpose of the calculations is not to determine the energy demand of individual dwellings, but to produce accurate estimates for territorial units (counties, regions) or other groups. It is important to emphasise that the county variable is included as a predictor in the model, so large discrepancies would indicate model failure. This is not the case: for many county–dwelling‑type combinations the average deviation is below one unit, and the largest difference in Table 3 (detached houses in Tolna) is 11.5.
Table 3
Observed and estimated county‑level averages by building type, ÉKM, 2022

| County | Detached house, certified value | Detached house, estimated value | Apartment building, certified value | Apartment building, estimated value |
|---|---|---|---|---|
| Bács-Kiskun | 311.1 | 316.7 | 190.7 | 190.9 |
| Baranya | 299.5 | 297.7 | 145.6 | 146.8 |
| Békés | 335.0 | 336.7 | 210.8 | 211.6 |
| Borsod-Abaúj-Zemplén | 327.7 | 330.8 | 180.9 | 180.3 |
| Csongrád-Csanád | 240.4 | 241.9 | 206.4 | 206.9 |
| Fejér | 308.9 | 305.7 | 182.0 | 181.1 |
| Budapest | 291.4 | 293.6 | 172.8 | 173.4 |
| Győr-Moson-Sopron | 262.0 | 263.2 | 166.6 | 167.5 |
| Hajdú-Bihar | 280.6 | 284.4 | 175.5 | 175.5 |
| Heves | 339.2 | 336.8 | 193.4 | 195.3 |
| Jász-Nagykun-Szolnok | 353.9 | 348.6 | 210.0 | 206.7 |
| Komárom-Esztergom | 294.8 | 294.2 | 194.2 | 190.1 |
| Nógrád | 359.7 | 358.7 | 218.0 | 221.2 |
| Pest | 262.6 | 260.3 | 184.5 | 185.4 |
| Somogy | 323.5 | 318.8 | 191.7 | 192.6 |
| Szabolcs-Szatmár-Bereg | 327.4 | 329.2 | 159.6 | 164.1 |
| Tolna | 311.9 | 323.4 | 170.2 | 173.1 |
| Vas | 290.9 | 298.7 | 191.5 | 192.3 |
| Veszprém | 296.2 | 295.7 | 174.6 | 177.5 |
| Zala | 309.7 | 306.6 | 173.2 | 175.8 |
Since county‑level information is included as a predictor in our model, accuracy at this level is not, in itself, evidence that the model can reliably predict territorial units it has not encountered before. Examining districts provides a more meaningful test of predictive performance, as district‑level information was not included among the model’s variables, and the algorithm therefore could not have learned the specific composition of each district. The distribution of differences between observed and estimated district‑level average scores shows that these differences span a much wider range than in the case of counties. At the same time, it is striking that for 75 districts the difference falls between –2 and +4 kWh/m²/a, which indicates particularly strong model performance, given that the energy‑efficiency score spans several hundred units. In the majority of districts—144 out of 198—the deviation lies between –9 and +14 kWh/m²/a, which also represents good performance. Only a few districts show larger discrepancies. These tend to be districts with low housing‑market activity, where few energy certificates are available, making the observed average values more uncertain. Such districts are typically rural, with below‑average housing stocks, located in less affluent regions. Our results illustrate why estimates for very small territorial units must be treated with caution: the smaller the unit, the greater the likelihood that it is characterised by unique local conditions that cannot be generalised from the full population, and the fewer observations are available—two factors that often reinforce each other.
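The district‑level check amounts to averaging observed and estimated values within each district and inspecting the differences. A minimal sketch follows, with hypothetical districts and values. The sign convention (estimated minus observed) is an assumption, and the band of –2 to +4 is taken from the text above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (district, observed value, estimated value), in kWh/m²/a.
rows = [
    ("D1", 300.0, 290.0), ("D1", 200.0, 214.0),
    ("D2", 180.0, 181.0), ("D2", 220.0, 216.0),
    ("D3", 350.0, 310.0), ("D3", 330.0, 320.0),
]

def district_differences(rows):
    """Estimated minus observed district average (sign convention assumed)."""
    groups = defaultdict(list)
    for district, observed, estimated in rows:
        groups[district].append((observed, estimated))
    return {
        district: mean(e for _, e in pairs) - mean(o for o, _ in pairs)
        for district, pairs in groups.items()
    }

diffs = district_differences(rows)
within_band = [d for d, diff in diffs.items() if -2 <= diff <= 4]
print(diffs, within_band)
```

With only two observations per district, as in this toy data, a single unusual dwelling can move the district average considerably, which mirrors the caution about small territorial units.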
Figure 4
The phenomenon whereby predictions for individual buildings exhibit substantial uncertainty, yet converge strikingly close to the true values when aggregated—whether at the level of districts, counties or regions—highlights one of the key strengths of random forest models in forecasting energy efficiency. This statistical behaviour stems from the model’s balanced error distribution, in which prediction errors do not systematically skew in one direction but instead disperse symmetrically around the true values. When these predictions are aggregated at regional scales, positive and negative errors effectively cancel each other out, producing averages that lie only 1–2 points from the observed values on a scale of several hundred—an impressive level of accuracy. This example demonstrates that random forests are highly effective at capturing macro‑level relationships between residential building characteristics and energy consumption, even though they exhibit greater uncertainty at the individual‑building level. These properties make the model particularly valuable for policy planning, regional energy‑efficiency assessment and trend analysis, even when predictions for individual buildings carry greater variability.
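The cancellation of errors under aggregation can be illustrated with a small simulation. The values below are simulated, and the error model (independent Gaussian errors centred on zero with a standard deviation of 40) is an assumption chosen for illustration, not an estimate of the model's actual error distribution.

```python
import random
from statistics import mean

rng = random.Random(42)

# Simulated "true" energy values for 2000 dwellings.
true_values = [rng.uniform(100, 400) for _ in range(2000)]
# Assumption: prediction errors are symmetric around zero (Gaussian, sd = 40).
predictions = [value + rng.gauss(0, 40) for value in true_values]

# Typical size of the error for a single dwelling.
mean_abs_individual_error = mean(abs(p - v) for p, v in zip(predictions, true_values))
# Error of the aggregate: difference between the two group averages.
aggregate_error = abs(mean(predictions) - mean(true_values))

print(round(mean_abs_individual_error, 1), round(aggregate_error, 2))
```

If the errors were systematically skewed in one direction, the aggregate error would not shrink in this way, which is why the symmetry of the error distribution matters for territorial estimates.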