Estimating the Primary Energy Demand of the Hungarian Housing Stock*

Methodological data

Data sources used

The energy performance certificate database managed by LTK is transferred to the HCSO under OSAP data collection no. 2561. Each year, approximately 130–150 thousand dwellings are surveyed, and the detailed documentation prepared by certified experts is uploaded to the LTK system. At the end of 2023, the certification system was revised, and no direct conversion is possible between the old and the new frameworks. For this reason, we decided not to use certificates issued under the new system. To maximise the number of dwellings for which certified energy‑demand data are available, we also included certificates issued in the two years preceding the census, and thus relied on energy‑certificate data from the period 2020–2023. These data were linked to the addresses registered at the time of the census using the detailed address information available, and through these, to the census dwelling records.

As a first step, we assigned the address identifiers from the HCSO Address Register to the addresses contained in the transferred energy certificates. This procedure was successful in 67% of cases: of the certificates available for the four‑year period, almost 370 thousand were matched with an address identifier down to the sub‑unit level. Among these, approximately 35 thousand certificates proved to be duplicates, having been issued multiple times for the same dwelling during the period. After excluding older duplicates, we were able to use 335 thousand records.
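The duplicate-removal step described above can be illustrated with a minimal sketch. The record layout, the address identifiers and the values are hypothetical and are not taken from the LTK database or the HCSO Address Register; the sketch only shows the rule of keeping the most recently issued certificate for each matched address.

```python
from datetime import date

# Hypothetical certificate records: (address_id, issue_date, primary_energy).
# address_id is None where no address identifier could be assigned.
certificates = [
    ("A-001", date(2020, 5, 12), 310.0),
    ("A-001", date(2023, 3, 1), 180.0),   # newer certificate for the same dwelling
    ("A-002", date(2021, 9, 30), 245.5),
    (None, date(2022, 1, 15), 290.0),     # unmatched address, cannot be linked
]

def latest_per_address(records):
    """Keep the most recently issued certificate for each matched address."""
    latest = {}
    for address_id, issued, value in records:
        if address_id is None:            # unmatched records are dropped
            continue
        current = latest.get(address_id)
        if current is None or issued > current[0]:
            latest[address_id] = (issued, value)
    return latest

deduplicated = latest_per_address(certificates)
```

In this toy input, two of the four records survive: the unmatched record is dropped and the older certificate for `A-001` is superseded by the newer one.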

The next step was linking the data to the 2022 census dwelling‑stock table, again using the address identifiers. Of the 335 thousand records, we successfully assigned an energy certificate to 279 thousand dwellings, representing 6.1% of the nearly 4.6 million dwellings. The territorial distribution of the matched records largely reflects the regional pattern of the census dwelling stock, although in Budapest the complexity of addresses in multi‑unit buildings (building, staircase, floor, door) somewhat reduced the matching rate compared with the simpler addressing of predominantly single‑family areas. Nevertheless, we achieved coverage exceeding 5% in every region.

Table 1

Characteristics of the data sources used and key linkage indicators

Region	Number of dwellings (Census 2022)	Territorial distribution of dwellings, %	Number of energy certificates linked to dwelling records	Territorial distribution of linked certificates, %	Share of dwellings with a certificate, %
Budapest	961 061	21.0	53 291	19.1	5.5
Pest	519 420	11.3	32 819	11.8	6.3
Central Transdanubia	474 371	10.4	29 652	10.6	6.3
Western Transdanubia	457 369	10.0	26 208	9.4	5.7
Southern Transdanubia	418 847	9.1	24 947	8.9	6.0
Northern Hungary	510 187	11.1	36 281	13.0	7.1
Northern Great Plain	625 854	13.7	41 853	15.0	6.7
Southern Great Plain	613 429	13.4	33 969	12.2	5.5
Total	4 580 538	100.0	279 020	100.0	6.1
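The coverage shares in the last column of Table 1 follow directly from the two count columns. The short calculation below recomputes them from the figures in the table; the variable names are illustrative only.

```python
# (number of dwellings, number of linked certificates), copied from Table 1.
table_1 = {
    "Budapest": (961_061, 53_291),
    "Pest": (519_420, 32_819),
    "Central Transdanubia": (474_371, 29_652),
    "Western Transdanubia": (457_369, 26_208),
    "Southern Transdanubia": (418_847, 24_947),
    "Northern Hungary": (510_187, 36_281),
    "Northern Great Plain": (625_854, 41_853),
    "Southern Great Plain": (613_429, 33_969),
}

total_dwellings = sum(dwellings for dwellings, _ in table_1.values())
total_linked = sum(linked for _, linked in table_1.values())

# Share of dwellings with a linked certificate, in percent, by region.
coverage = {
    region: round(100 * linked / dwellings, 1)
    for region, (dwellings, linked) in table_1.items()
}
overall_coverage = round(100 * total_linked / total_dwellings, 1)
```

The regional totals add up to 4 580 538 dwellings and 279 020 linked certificates, which gives an overall coverage of 6.1%.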

Regression analysis

Compared with the earlier analysis carried out in 2020, a key limitation was the absence of information on renovations undertaken on residential buildings, including energy‑saving improvements that respondents could report in the 2016 microcensus. As a result, the OLS regression based on census information was able to explain only around 60% of the variance in the observations, even after applying a 10% outlier filter. The variables used in the regression, together with their coefficients and significance levels, are presented primarily to illustrate the nature of the relationships between the explanatory variables and the specific primary energy consumption of dwellings, and to introduce the set of variables applied in subsequent analyses. Two separate models were estimated: one for detached houses and one for apartments in multi‑unit buildings. In both cases, the age of the building emerges as a fundamental determinant of a dwelling’s energy demand: the newer a residential building is, the lower its energy requirement tends to be, and this difference remains pronounced even between buildings constructed in 2016–2020 and those built in 2021–2022.

Table 2

Coefficients and p‑values of the regression models

Explanatory variables of the model Dependent variable: specific primary energy consumption (as defined by the TNM regulation)			Apartment‑building model R²= 0.597		Detached house model R²= 0.616
			B	p	B	p
		Constant	371.6	0.000	457.062	0.000
Settlement‑level specific dwelling price, million HUF			-7.624	0.000	-82.304	0.000
Floor area of the dwelling			-0.275	0.000	-0.389	0.000
Region (reference category: Southern Great Plain)
		Southern Transdanubia	-26.909	0.000	-4.769	0.000
		Northern Great Plain	2.625	0.000	2.257	0.001
		Northern Hungary	-8.566	0.000	-0.940	0.192
		Central Transdanubia	-10.509	0.000	-2.016	0.006
		Western Transdanubia	-20.813	0.000	-10.907	0.000
Type of settlement
		Budapest	-26.556	0.000	4.390	0.000
		county seat	-25.989	0.000	-8.203	0.000
		town	-16.919	0.000	-1.348	0.007
Year of construction (reference category: before 1919)
		1919–1945	-12.365	0.000	13.599	0.000
		1946–1960	-42.289	0.000	16.336	0.000
		1961–1980	-64.704	0.000	13.771	0.000
		1981–2000	-69.075	0.000	-43.050	0.000
		2001–2010	-121.199	0.000	-104.281	0.000
		2011–2015	-131.618	0.000	-128.983	0.000
		2016–2020	-152.124	0.000	-148.669	0.000
		2021–2022	-167.467	0.000	-168.851	0.000
Wall structure (reference category: brick)		concrete wall	-8.260	0.000	-3.210	0.008
		adobe, timber or other wall type	-6.728	0.006	1.975	0.000
		panel wall	-18.573	0.000
Number of dwellings (reference category: 4–12 units)
		13 units or more	-0.817	0.127
Fuel type (reference category: piped natural gas)
		other	-5.708	0.827	-19.338	0.140
		LPG	-21.534	0.729	1.150	0.912
		coal	29.185	0.000	-116.043	0.055
		electricity	23.286	0.322	-29.646	0.106
Building height (reference category: single‑storey)
		2–3 storeys	-42.493	0.000
		4 storeys	-57.800	0.000
		5 storeys or more	-61.886	0.000
Present in the dwelling/house
	internet connection		-4.004	0.000	-21.120	0.000
	heat‑pump heating		-12.240	0.000	11.529	0.000
	air‑conditioning unit		-2.630	0.000	-27.286	0.000
	photovoltaic panel		-17.342	0.000	-97.923	0.000
	solar thermal collector		35.626	0.000	109.706	0.000
Heating and fuel type(s) (reference: room heating with piped gas)
	electricity, room‑by‑room		67.436	0.000	16.254	0.377
	wood, room‑by‑room		64.585	0.300	-22.693	0.000
	coal, room‑by‑room				92.726	0.135
	other fuel, room‑by‑room		41.782	0.111	-28.430	0.031
	piped gas and electricity, room‑by‑room		7.306	0.000	3.822	0.062
	piped gas and wood, room‑by‑room		18.800	0.013	-8.270	0.000
	piped gas and other fuel, room‑by‑room		-74.862	0.000	-9.531	0.359
	electricity and wood, room‑by‑room		62.546	0.000	7.528	0.683
	electricity and other fuel, room‑by‑room		30.641	0.133	44.448	0.006
	coal and wood, room‑by‑room		86.724	0.173	-22.511	0.000
	multiple other fuels, room‑by‑room		62.765	0.123	10.714	0.322
	central boiler, electricity		47.788	0.000	25.745	0.161
	central boiler, wood		89.694	0.152	19.234	0.000
	central boiler, coal		18.545	0.476	144.609	0.018
	central boiler, other fuel		3.475	0.011	-20.449	0.111
	central boiler, piped gas and electricity				0.721	0.446
	central boiler, piped gas and wood		-11.632	0.027	7.001	0.000
	central boiler, piped gas and other fuel		-12.825	0.326	12.853	0.008
	central boiler, electricity and wood		40.628	0.004	28.431	0.123
	central boiler, electricity and other fuel		21.968	0.211	8.357	0.638
	central boiler, coal and wood		96.928	0.145	31.579	0.000
	central boiler, multiple other fuels		-32.395	0.000	-11.662	0.132
	district heating		-32.617	0.000

Hot‑water supply		boiler, water heater etc.	-11.149	0.016	5.579	0.000
		district hot‑water network	-33.375	0.000	-17.580	0.000
High‑rise panel building			-5.296	0.000
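To show how the coefficients in Table 2 combine into an estimate, the sketch below evaluates the apartment‑building model for one hypothetical dwelling. The dwelling itself, the price value of 0.5 and the variable names are assumptions made for illustration. All categorical variables not listed are assumed to be at their reference category, and all "present in the dwelling" indicators are assumed to be zero.

```python
# Apartment-building coefficients copied from Table 2.
CONSTANT = 371.6
COEFFICIENTS = {
    "price": -7.624,                        # settlement-level specific dwelling price
    "floor_area": -0.275,                   # floor area of the dwelling
    "region_western_transdanubia": -20.813,
    "settlement_town": -16.919,
    "built_1961_1980": -64.704,
    "height_2_3_storeys": -42.493,
}

# Hypothetical dwelling (illustrative values, not an observed record):
# a 60 m² flat in a 2-3 storey building from 1961-1980 in a Western
# Transdanubian town; every other variable is at its reference level.
dwelling = {
    "price": 0.5,
    "floor_area": 60.0,
    "region_western_transdanubia": 1,
    "settlement_town": 1,
    "built_1961_1980": 1,
    "height_2_3_storeys": 1,
}

# Linear prediction: constant plus the sum of coefficient * value.
estimate = CONSTANT + sum(COEFFICIENTS[name] * value for name, value in dwelling.items())
```

For these inputs the linear predictor evaluates to about 206.4. Each dummy variable shifts the estimate by a fixed amount regardless of the other characteristics, which is the additive structure that the random forest approach relaxes.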

Figure 1

The regression estimate reproduced the measured energy class in 34% of dwellings with known ratings, and in a further 37% it under‑ or overestimated the class by one category. The model performed particularly poorly for detached houses, while for apartments in multi‑unit buildings it missed by at most one category in 72% of cases.
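The two shares quoted above (exact class match, and a miss by one category) can be computed as in the following sketch. Energy classes are represented here as ordinal integers, which is an assumption made for illustration; the class lists are invented and do not come from the certificate data.

```python
def class_hit_rates(observed, predicted):
    """Return (share with exact class match, share off by exactly one class)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(observed)
    exact = sum(o == p for o, p in zip(observed, predicted))
    off_by_one = sum(abs(o - p) == 1 for o, p in zip(observed, predicted))
    return exact / n, off_by_one / n

# Hypothetical ordinal class codes (1 = best class, higher = worse).
observed_classes = [3, 5, 7, 4, 6, 2, 8, 5]
predicted_classes = [3, 6, 7, 6, 5, 2, 5, 5]

exact_share, adjacent_share = class_hit_rates(observed_classes, predicted_classes)
```

In this toy example half of the dwellings are classified exactly and a further quarter are off by one class.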

Random forest modelling

In the case of energy performance certificates, the main limitation of linear regression is that it can capture the relationship between the specific energy indicator and dwelling characteristics only in linear form. We know, for example, that exceptionally high‑quality new dwellings and exceptionally poor‑quality older ones perform far better or worse than an average dwelling—meaning that the indicator to be estimated exhibits non‑linear behaviour. A method without such implicit assumptions is therefore required. Random forest regression is a machine‑learning approach particularly well suited to this task. Instead of relying on a single decision tree, it builds hundreds or thousands of trees in parallel and averages their outputs, as if each member of an expert panel evaluated the building independently before forming a collective judgement. This approach is especially appropriate for energy‑efficiency rating systems where scoring follows conditional logic rather than simple linear coefficients. Through recursive binary splitting, the algorithm naturally uncovers hidden scoring thresholds, effectively reconstructing the underlying evaluation system without explicit programming. This works well for energy‑efficiency problems because a building’s energy use is shaped by many interacting factors—its size, insulation, heating system, location and their complex interdependencies. Random forest can detect these hidden relationships without requiring a predefined mathematical formula. Moreover, the method tolerates strong correlations between variables (such as building age and wall type) and handles different data types—numerical, categorical or binary—without difficulty.
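The mechanism described above (bootstrap resampling, recursive binary splitting on a random subset of features, and averaging over trees) can be sketched in a few dozen lines. This is a deliberately small, pure‑Python imitation of the idea, not the model used in the study and not a production library. The data are synthetic, with an artificial threshold at the construction year 1980, and all function names and parameter values are assumptions.

```python
import random

def avg(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared deviations from the mean."""
    m = avg(values)
    return sum((v - m) ** 2 for v in values)

def build_tree(rows, targets, rng, depth=0, max_depth=4, min_size=3):
    """Grow one regression tree by recursive binary splitting."""
    if depth >= max_depth or len(rows) < 2 * min_size:
        return avg(targets)  # leaf: mean target of the rows that reach it
    n_features = len(rows[0])
    # Each split only considers a random subset of the features.
    candidates = rng.sample(range(n_features), max(1, round(n_features ** 0.5)))
    best = None
    for f in candidates:
        for threshold in sorted({row[f] for row in rows}):
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            if len(left) < min_size or len(right) < min_size:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    if best is None:
        return avg(targets)
    _, f, threshold = best
    left = [(row, t) for row, t in zip(rows, targets) if row[f] <= threshold]
    right = [(row, t) for row, t in zip(rows, targets) if row[f] > threshold]
    return (
        f,
        threshold,
        build_tree([r for r, _ in left], [t for _, t in left], rng, depth + 1, max_depth, min_size),
        build_tree([r for r, _ in right], [t for _, t in right], rng, depth + 1, max_depth, min_size),
    )

def predict_tree(node, row):
    while isinstance(node, tuple):  # internal nodes are tuples, leaves are numbers
        f, threshold, left, right = node
        node = left if row[f] <= threshold else right
    return node

def fit_forest(rows, targets, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        forest.append(build_tree([rows[i] for i in idx], [targets[i] for i in idx], rng))
    return forest

def predict_forest(forest, row):
    return avg([predict_tree(tree, row) for tree in forest])

# Synthetic dwellings: (construction year, floor area, storeys).
rows = [(1940 + i, 40 + (i * 37) % 90, 1 + i % 4) for i in range(80)]
# Artificial step at 1980, a threshold that a straight line cannot represent.
targets = [300.0 if year < 1980 else 150.0 for year, _, _ in rows]

forest = fit_forest(rows, targets)
old_estimate = predict_forest(forest, (1955, 70, 2))
new_estimate = predict_forest(forest, (2010, 70, 2))
```

On this synthetic data the forest recovers the step: the estimate for the 1955 dwelling stays close to the higher level and the estimate for the 2010 dwelling close to the lower one, without the threshold having been specified in advance.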

In our random forest model, we used the same dwelling characteristics and geographical variables as in the linear regression model. Because complex, high‑parameter models are prone to overfitting (they may capture the patterns of the training data with excessive precision and therefore generalise poorly), we applied cross‑validation. We created several partitions of the dataset, fitted a separate model to each, and compared these models based on their predictive performance and accuracy. The final model was then used to generate estimates for all dwellings with known energy certificates, and these estimates were compared with the observed values. The comparison was carried out both at the level of individual dwellings and across averages of different territorial units.

Figure 2 illustrates the relationship between observed and predicted energy demand across the population of individual dwellings. On the graph, the horizontal axis shows the observed values and the vertical axis the estimated ones. The plot area is divided into cells whose colour indicates how many dwellings fall into each cell: lighter shades represent many dwellings, darker shades few (even a single one). For a large share of dwellings the observed and estimated values lie very close to each other, indicating good model performance. At the same time, a small number of dwellings show substantial dispersion around the line of equality, meaning that the model makes larger‑than‑average errors for these cases. Such deviations are typically greatest where the observed score is exceptionally high (i.e. the dwelling is extremely energy‑inefficient) and the model cannot fully capture this: even in these cases it predicts a high value, but not high enough. Overall, with a dataset of this size and heterogeneity, such dispersion is to be expected, and the key criterion is that estimation errors remain small for the overwhelming majority of dwellings, which is indeed the case here.
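The partition‑and‑compare procedure can be sketched as a k‑fold cross‑validation. The text does not specify the exact partitioning scheme, so the five‑fold split, the error measure (mean absolute error) and all names below are assumptions. A trivial stand‑in model that predicts the training mean is used so that the example runs on its own.

```python
import random

def k_fold_indices(n_rows, k=5, seed=0):
    """Split row indices 0..n_rows-1 into k shuffled, near-equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(rows, targets, fit, predict, k=5):
    """Return the mean absolute error on each held-out fold."""
    errors = []
    for fold in k_fold_indices(len(rows), k):
        held_out = set(fold)
        train = [i for i in range(len(rows)) if i not in held_out]
        model = fit([rows[i] for i in train], [targets[i] for i in train])
        fold_error = sum(abs(predict(model, rows[i]) - targets[i]) for i in fold) / len(fold)
        errors.append(fold_error)
    return errors

# Stand-in model: ignores the features and predicts the training mean.
def fit_mean(rows, targets):
    return sum(targets) / len(targets)

def predict_mean(model, row):
    return model

# Synthetic data: one feature (construction year) and a linear target.
rows = [[year] for year in range(1960, 2020)]
targets = [300.0 - 2.0 * (row[0] - 1960) for row in rows]

fold_errors = cross_validate(rows, targets, fit_mean, predict_mean, k=5)
```

Any pair of `fit` and `predict` functions with the same signatures could be substituted for the stand‑in model. Errors that are stable across folds suggest that the model generalises beyond the rows it was fitted on.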

Figure 2

When the energy efficiency of individual dwellings is assessed not through the continuous indicator but through the calculated energy classes, the difference between the measured and estimated ÉKM‑based classifications becomes clearly visible. The random forest model predicted the categories with greater accuracy than the OLS model. It is also noticeable that both methods tended to assign slightly worse energy classes than the observed ones for a number of dwellings.

Figure 3

However, the purpose of the calculations is not to determine the energy demand of individual dwellings, but to produce accurate estimates for territorial units (counties, regions) or other groups. It is important to emphasise that the county variable is included as a predictor in the model, so large discrepancies would indicate model failure. This is not the case: for many county–dwelling‑type combinations the average deviation stays below one unit, and the largest difference in Table 3 (11.5, for detached houses in Tolna) remains small relative to values of several hundred.

Table 3

Observed and estimated county‑level averages by building type, ÉKM, 2022

County	Detached house		Apartment building
County	certified value	estimated value	certified value	estimated value
Bács-Kiskun	311.1	316.7	190.7	190.9
Baranya	299.5	297.7	145.6	146.8
Békés	335.0	336.7	210.8	211.6
Borsod-Abaúj-Zemplén	327.7	330.8	180.9	180.3
Csongrád-Csanád	240.4	241.9	206.4	206.9
Fejér	308.9	305.7	182.0	181.1
Budapest	291.4	293.6	172.8	173.4
Győr-Moson-Sopron	262.0	263.2	166.6	167.5
Hajdú-Bihar	280.6	284.4	175.5	175.5
Heves	339.2	336.8	193.4	195.3
Jász-Nagykun-Szolnok	353.9	348.6	210.0	206.7
Komárom-Esztergom	294.8	294.2	194.2	190.1
Nógrád	359.7	358.7	218.0	221.2
Pest	262.6	260.3	184.5	185.4
Somogy	323.5	318.8	191.7	192.6
Szabolcs-Szatmár-Bereg	327.4	329.2	159.6	164.1
Tolna	311.9	323.4	170.2	173.1
Vas	290.9	298.7	191.5	192.3
Veszprém	296.2	295.7	174.6	177.5
Zala	309.7	306.6	173.2	175.8

Since county‑level information is included as a predictor in our model, accuracy at this level is not, in itself, evidence that the model can reliably predict territorial units it has not encountered before. Examining districts provides a more meaningful test of predictive performance, as district‑level information was not included among the model’s variables, and the algorithm therefore could not have learned the specific composition of each district. The distribution of differences between observed and estimated district‑level average scores shows that these differences span a much wider range than in the case of counties. At the same time, it is striking that for 75 districts the difference falls between –2 and +4 kWh/m²/a, which indicates particularly strong model performance, given that the energy‑efficiency score spans several hundred units. In the majority of districts—144 out of 198—the deviation lies between –9 and +14 kWh/m²/a, which also represents good performance. Only a few districts show larger discrepancies. These tend to be districts with low housing‑market activity, where few energy certificates are available, making the observed average values more uncertain. Such districts are typically rural, with below‑average housing stocks, located in less affluent regions. Our results illustrate why estimates for very small territorial units must be treated with caution: the smaller the unit, the greater the likelihood that it is characterised by unique local conditions that cannot be generalised from the full population, and the fewer observations are available—two factors that often reinforce each other.
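The district‑level check amounts to averaging observed and estimated values within each district and examining the distribution of the differences. The sketch below uses invented records. The sign convention (observed minus estimated) is an assumption, as the text does not state which way the difference is taken.

```python
from collections import defaultdict

def district_deviations(records):
    """Mean observed minus mean estimated value for each district."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # observed sum, estimated sum, count
    for district, observed, estimated in records:
        acc = sums[district]
        acc[0] += observed
        acc[1] += estimated
        acc[2] += 1
    return {d: (obs - est) / n for d, (obs, est, n) in sums.items()}

def count_within(deviations, lower, upper):
    """Number of districts whose deviation lies in [lower, upper]."""
    return sum(lower <= dev <= upper for dev in deviations.values())

# Hypothetical dwelling-level records: (district, observed, estimated).
records = [
    ("D1", 300.0, 290.0),
    ("D1", 200.0, 214.0),
    ("D2", 280.0, 260.0),
    ("D2", 320.0, 310.0),
    ("D3", 150.0, 149.0),
]

deviations = district_deviations(records)
n_close = count_within(deviations, -2.0, 4.0)
```

In the toy data, district D1 has individual errors of +10 and −14 that largely offset each other in the district average. District D2 has only positive errors, which do not offset, so it falls outside the narrow band.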

Figure 4

The phenomenon whereby predictions for individual buildings exhibit substantial uncertainty, yet converge strikingly close to the true values when aggregated—whether at the level of districts, counties or regions—highlights one of the key strengths of random forest models in forecasting energy efficiency. This statistical behaviour stems from the model’s balanced error distribution, in which prediction errors do not systematically skew in one direction but instead disperse symmetrically around the true values. When these predictions are aggregated at regional scales, positive and negative errors effectively cancel each other out, producing averages that lie only 1–2 points from the observed values on a scale of several hundred—an impressive level of accuracy. This example demonstrates that random forests are highly effective at capturing macro‑level relationships between residential building characteristics and energy consumption, even though they exhibit greater uncertainty at the individual‑building level. These properties make the model particularly valuable for policy planning, regional energy‑efficiency assessment and trend analysis, even when predictions for individual buildings carry greater variability.
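The cancellation effect can be illustrated with a small simulation. It assumes independent per‑dwelling errors that are symmetric around zero, with a hypothetical standard deviation of 50. These are idealised assumptions and not estimates from the study. Systematic errors, such as the under‑prediction of extremely high values noted earlier, would not cancel in this way.

```python
import random

rng = random.Random(42)

# Hypothetical per-dwelling prediction errors: independent, symmetric around
# zero, standard deviation 50 (illustrative value only).
errors = [rng.gauss(0.0, 50.0) for _ in range(10_000)]

# Typical size of the error for an individual dwelling.
mean_abs_individual = sum(abs(e) for e in errors) / len(errors)

# Error of the group average: positive and negative errors largely cancel.
aggregate_error = sum(errors) / len(errors)
```

Under these assumptions the standard deviation of the group mean is the individual standard deviation divided by the square root of the group size (here 50 / 100 = 0.5). This is why averages over large groups can be far more accurate than predictions for single dwellings.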