Phytoplankton blooms threaten aquatic ecosystems worldwide, with implications that go beyond their apparent ecological aspects. Management solutions are needed to control the appearance of phytoplankton blooms and alleviate their impacts. Such solutions are supported by scientific results, many of which derive from modeling approaches. Data-driven models are now routinely deployed for the short-term (days to weeks) forecasting of phytoplankton dynamics. Nonetheless, such data-oriented efforts are often plagued by two issues: the lack of sufficient data and the lack of interpretability. On the one hand, insufficient data result in overfitting, which produces poorly generalizable models that are unreliable when extrapolating. On the other hand, the lack of interpretability limits the contribution of such models to decision-making, since acting upon model predictions relies heavily on understanding the model hypothesis. These two challenges motivated the present work, which set out to investigate the suitability of multi-spectral satellite imagery as a source of phytoplankton-related data for the development of credible and accountable data-driven models. To this end, satellite-derived chlorophyll-a time series were first created using Sentinel-2 and Landsat 8 imagery and a physics-based modular inversion and processing system. Then, two machine learning algorithms, i.e., a Random Forest (RF) and a Gaussian Process (GP) regression algorithm, were trained to map hydrometeorological drivers to the satellite-derived chlorophyll-a time series. The two algorithms were benchmarked against each other and against a naïve alternative, i.e., the persistence method, in terms of accuracy, uncertainty, and interpretability in three cases: (a) the mesotrophic Mulargia reservoir in Italy, (b) the hypereutrophic Harsha Lake in the USA, and (c) Lake Hume in Australia, a reservoir facing an increasing number of algal bloom events over the last 10 years.
Results indicate that both machine learning models forecasted surface phytoplankton dynamics more accurately than their naïve alternative up to ten days ahead. It should be noted, though, that forecasting accuracy deteriorated with increasing forecasting windows, mostly due to the uncertainty of meteorological forecasts. When the machine learning methods were compared to each other, the RF-based models were marginally better than their GP counterparts; they produced slightly more accurate and more certain chlorophyll-a predictions. RF-based models were also preferable in terms of interpretability: their predictions unveiled specific patterns in hydrometeorological data that could explain phytoplankton dynamics in each case. By contrast, it remained obscure how the GP regression models made their chlorophyll-a predictions. More importantly, this work offers evidence that multi-spectral satellite data allow for the development of theory-guided, data-driven models for the forecasting of phytoplankton dynamics in lakes and reservoirs.
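
As an illustration of the naïve benchmark named above, the persistence method can be sketched as follows: the forecast for a given lead time is simply the most recent observed value. This is a minimal sketch, not the authors' implementation. The function names, the use of RMSE as the accuracy metric, the skill-score formulation, and the chlorophyll-a values are all assumptions made for illustration.

```python
import math


def persistence_forecast(series, horizon):
    """Pair each persistence forecast with the value observed `horizon` steps later.

    The forecast for step t + horizon is the observation at step t.
    """
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    return [(series[t], series[t + horizon]) for t in range(len(series) - horizon)]


def rmse(pairs):
    """Root-mean-square error over (forecast, observed) pairs."""
    return math.sqrt(sum((f - o) ** 2 for f, o in pairs) / len(pairs))


def skill_score(model_rmse, baseline_rmse):
    """Relative improvement over the baseline; positive means the model beats it."""
    return 1.0 - model_rmse / baseline_rmse


# Hypothetical chlorophyll-a observations (illustrative values, not from the study)
chl_a = [4.0, 5.0, 7.0, 6.0, 8.0, 9.0]

pairs = persistence_forecast(chl_a, horizon=2)
baseline = rmse(pairs)
```

Under these assumptions, a trained model such as an RF or GP regressor would be evaluated on the same forecast horizon, and `skill_score(model_rmse, baseline)` would summarize how much it improves on persistence.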


Phytoplankton, Aquatic Ecosystems, Management Solutions, Data-driven models, Short-term Forecasting, Overfitting, Interpretability, Model Predictions


Kandris, K., Romas, E., Tzimas, A., Pechlivanidis, I., Bauer, P., Joehnk, K., Bresciani, M., Giardino, C., Anstee, J., Schaeffer, B. A., and Dessena, M.-A.: Assessing the suitability of multi-spectral satellite data for the development of data-driven models of phytoplankton dynamics in lakes and reservoirs, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-2209, https://doi.org/10.5194/egusphere-egu22-2209