Agrimonia: a dataset on livestock, meteorology and air quality in the Lombardy region, Italy

Fassò, A., Rodeschini, J., Moro, A.F., Shaboviq, Q., Maranzano, P., Cameletti, M., Finazzi, F., Golini, N., Ignaccolo, R., Otto, P. (2023): Agrimonia: a dataset on livestock, meteorology and air quality in the Lombardy region, Italy. Nature Scientific Data (DOI)

Abstract

The air in the Lombardy region, Italy, is one of the most polluted in Europe because of limited air circulation and high emission levels. There is a large scientific consensus that the agricultural sector has a significant impact on air quality. To support studies quantifying the role of the agricultural and livestock sectors on the Lombardy air quality, this paper presents a harmonised dataset containing daily values of air quality, weather, emissions, livestock, and land and soil use in the years 2016–2021, for the Lombardy region. The daily scale is obtained by averaging hourly data and interpolating other variables. In fact, the pollutant data come from the European Environmental Agency and the Lombardy Regional Environment Protection Agency, weather and emissions data from the European Copernicus programme, livestock data from the Italian zootechnical registry, and land and soil use data from the CORINE Land Cover project. The resulting dataset is designed to be used as is by those using air quality data for research.

Background

The Lombardy region in Northern Italy suffers from some of Europe’s highest air pollution levels. A large share of particulate matter (PM10, PM2.5) is formed from ammonia emissions, of which more than 90% originate from agriculture and livestock. To understand these dynamics, the Agrimonia dataset provides harmonised, open-access spatiotemporal data for the years 2016–2021.

Key takeaways

Scope: Daily observations for 141 monitoring stations, covering air quality, weather, emissions, livestock, and land/soil use.
Integration: Combines multiple European and Italian sources (EEA, ARPA Lombardia, Copernicus, BDN, CORINE Land Cover) into a single dataset.
Accessibility: Data are available in CSV, RData, and MATLAB formats, ready for direct analysis.
Applications: Designed for research on air quality, agriculture’s role in pollution, sustainable development, and health impacts.
Validation: Includes metadata and uncertainty estimates, especially for missing data imputation via Kalman smoothing.
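
To make the idea of imputation with uncertainty concrete, here is a minimal sketch of Kalman smoothing on a single daily series with gaps. It assumes a simple local-level (random walk plus noise) model with hand-picked variances q and r; this model, the function name kalman_smooth and the numbers are illustrative assumptions, not the specification used to build Agrimonia.

```python
def kalman_smooth(y, q, r, m0=0.0, p0=1e6):
    """Local-level Kalman filter plus RTS smoother; None marks a missing day.

    Assumed model (illustrative only):
        state:       x_t = x_{t-1} + w_t,  w_t ~ N(0, q)
        observation: y_t = x_t + v_t,      v_t ~ N(0, r)
    Returns smoothed means and variances, one per time step.
    """
    m_pred, p_pred, m_filt, p_filt = [], [], [], []
    m, p = m0, p0  # diffuse prior: large p0 lets the first observation dominate
    for obs in y:
        # Prediction step: the random walk keeps the mean and adds variance q.
        mp, pp = m, p + q
        if obs is None:
            # No measurement: the filtered estimate is just the prediction.
            m, p = mp, pp
        else:
            gain = pp / (pp + r)
            m = mp + gain * (obs - mp)
            p = (1.0 - gain) * pp
        m_pred.append(mp)
        p_pred.append(pp)
        m_filt.append(m)
        p_filt.append(p)

    # Backward (Rauch-Tung-Striebel) pass: refine each estimate with later data.
    m_s, p_s = m_filt[:], p_filt[:]
    for t in range(len(y) - 2, -1, -1):
        g = p_filt[t] / p_pred[t + 1]
        m_s[t] = m_filt[t] + g * (m_s[t + 1] - m_pred[t + 1])
        p_s[t] = p_filt[t] + g * g * (p_s[t + 1] - p_pred[t + 1])
    return m_s, p_s


# Made-up daily concentrations with a two-day gap.
series = [10.0, 11.0, None, None, 12.0, 11.5]
means, variances = kalman_smooth(series, q=1.0, r=0.5)
for value, mean_t, var_t in zip(series, means, variances):
    print(value, round(mean_t, 3), round(var_t, 3))
```

In this sketch the smoothed variance is larger on the days without observations than on their observed neighbours, which is the kind of information an uncertainty column for imputed values can carry.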

Main idea

The dataset harmonises heterogeneous sources — from satellite reanalysis to ground monitoring and livestock registries — into a daily, station-level panel. This enables reproducible analyses of how livestock activity and meteorological conditions drive air quality patterns in Lombardy. By standardising formats and resolutions, Agrimonia lowers the barrier for environmental statisticians and data scientists to work with complex spatiotemporal data.
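
As a rough illustration of the harmonisation step, the sketch below averages hourly readings into daily means per station. The record layout, the station identifiers and the function name hourly_to_daily are assumptions made for this example and do not reproduce the Agrimonia processing code.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def hourly_to_daily(records):
    """Average (station, ISO timestamp, value) records into daily means.

    Missing hourly readings (value is None) are skipped, so a daily mean
    is based only on the hours that were actually observed.
    """
    buckets = defaultdict(list)
    for station, timestamp, value in records:
        if value is None:
            continue
        day = datetime.fromisoformat(timestamp).date()
        buckets[(station, day)].append(value)
    return {key: mean(values) for key, values in sorted(buckets.items())}


# Made-up hourly PM10 readings for two hypothetical stations.
hourly = [
    ("S1", "2016-01-01T00:00", 40.0),
    ("S1", "2016-01-01T01:00", 44.0),
    ("S1", "2016-01-01T02:00", None),
    ("S1", "2016-01-02T00:00", 30.0),
    ("S2", "2016-01-01T00:00", 20.0),
]
daily = hourly_to_daily(hourly)
for (station, day), value in daily.items():
    print(station, day.isoformat(), value)
```

The result is keyed by (station, day), which matches the daily, station-level panel layout described above.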

Applications

Air quality research: disentangling agricultural contributions to PM formation.
Health studies: linking pollution exposure to morbidity and mortality.
Policy evaluation: assessing the effectiveness of ammonia reduction measures.
Sustainability: studying land use, soil management, and ecosystem impacts.
Comparative studies: benchmarking Lombardy against other European regions.

Practical advice

Use the dataset’s metadata files for correct interpretation of variables, land-use classes, and imputation uncertainties.
Keep systematic missingness in mind (e.g. ammonia is measured by only a few sensors) and check the uncertainty columns before modelling.
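
A quick way to check missingness before modelling is to compute the share of empty cells per variable. The sketch below does this on a small in-memory CSV; the column names (station_id, date, pm25, nh3), the missing-value tokens and the values are hypothetical, so the names should be replaced by those documented in the dataset's metadata.

```python
import csv
import io

# Tiny in-memory stand-in for a dataset extract.
# Column names and values are made up for illustration.
SAMPLE = """station_id,date,pm25,nh3
S1,2016-01-01,35.2,
S1,2016-01-02,,
S2,2016-01-01,28.1,12.4
S2,2016-01-02,30.0,
"""


def missing_share(rows, columns, missing_tokens=("", "NA", "NaN")):
    """Return the fraction of rows in which each column is missing."""
    counts = {column: 0 for column in columns}
    total = 0
    for row in rows:
        total += 1
        for column in columns:
            if row[column].strip() in missing_tokens:
                counts[column] += 1
    if total == 0:
        return {}
    return {column: counts[column] / total for column in columns}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
shares = missing_share(rows, ["pm25", "nh3"])
for column, share in shares.items():
    print(f"{column}: {share:.0%} missing")
```

For a real file, the io.StringIO object would be replaced by an opened CSV file handle; the same function can also be applied per station to see where a variable is never measured.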

Data access

The full Agrimonia dataset (version 2.0.1) and metadata are openly available on Zenodo: https://doi.org/10.5281/zenodo.6620529. Source code for data processing is on GitHub: Agrimonia_Data repository.

Methodological follow-up: comparative modelling study

The Agrimonia dataset has also been used in a follow-up study (Otto et al., 2024), which systematically compared spatiotemporal models for PM2.5 concentrations in Lombardy (2016–2020). Three classes of models were investigated:

Hidden Dynamic Geostatistical Models (HDGM): state-space based models with latent processes, capturing spatiotemporal dependence explicitly and offering interpretable parameter estimates.
Generalised Additive Mixed Models (GAMM): regression-based models with spline terms and random effects, balancing flexibility and interpretability.
Random Forest Spatiotemporal Kriging (RFSTK): a hybrid machine learning approach combining random forests with residual kriging to exploit both nonlinear predictor-response relationships and spatial autocorrelation.

The comparison showed that all three approaches successfully captured spatiotemporal dependence. Cross-validation confirmed HDGM as the most robust choice, providing stable predictions and interpretable dynamics. RFSTK achieved high predictive accuracy but required careful treatment to avoid overfitting, while GAMM offered a transparent compromise between model structure and flexibility. This study highlights Agrimonia’s value not only as an environmental dataset but also as a benchmark for testing advanced statistical and machine learning models.
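
For readers who want to set up a comparable benchmark, the sketch below shows one possible validation scheme, leave-one-station-out cross-validation, with a deliberately naive baseline predictor. The scheme, the baseline and all numbers are assumptions for illustration; they are not taken from the study, whose cross-validation design is not detailed here.

```python
from math import sqrt
from statistics import mean

# (station, day) -> PM2.5 value; made-up numbers for three hypothetical stations.
obs = {
    ("S1", 1): 30.0, ("S2", 1): 34.0, ("S3", 1): 32.0,
    ("S1", 2): 20.0, ("S2", 2): 26.0, ("S3", 2): 23.0,
}


def predict_from_others(train, day):
    """Naive baseline: mean of the training stations on the same day."""
    return mean(value for (_, d), value in train.items() if d == day)


def leave_one_station_out_rmse(data, predictor):
    """Hold out each station in turn and score predictions at that station."""
    stations = sorted({station for station, _ in data})
    rmse = {}
    for held_out in stations:
        # Training set: every observation not belonging to the held-out station.
        train = {key: value for key, value in data.items() if key[0] != held_out}
        squared_errors = [
            (predictor(train, day) - value) ** 2
            for (station, day), value in data.items()
            if station == held_out
        ]
        rmse[held_out] = sqrt(mean(squared_errors))
    return rmse


scores = leave_one_station_out_rmse(obs, predict_from_others)
for station, score in scores.items():
    print(station, round(score, 3))
```

Any model that exposes the same predictor(train, day) interface could be plugged in instead of the baseline, which makes it possible to compare candidates on identical held-out stations.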

Otto, P., Moro, A.F., Rodeschini, J., Shaboviq, Q., Ignaccolo, R., Golini, N., Cameletti, M., Maranzano, P., Finazzi, F., Fassò, A. (2024): Spatiotemporal modelling of PM2.5 concentrations in Lombardy (Italy): a comparative study. Environmental and Ecological Statistics

Research interests

I am interested in statistical data science and statistical methodology for data in multidimensional spaces, e.g., geo-referenced data. Most of my papers share one underlying principle, Tobler's first law of geography: "everything is related to everything else, but near things are more related than distant things."