Maike Buchin, Anne Driemel, Koen van Greevenbroek, Ioannis Psarros, Dennis Rohde
20th International Workshop on Approximation and Online Algorithms (WAOA 2022), Potsdam, Germany.
We study variants of the mean problem under the p-Dynamic Time Warping (p-DTW) distance, a popular and robust distance measure for sequential data. In our setting we are given a set of finite point sequences over an arbitrary metric space and we want to compute a mean point sequence of given length that minimizes the sum of p-DTW distances, each raised to the qth power, between the input sequences and the mean sequence. In general, the problem is NP-hard and known not to be fixed-parameter tractable in the number of sequences. On the positive side, we show that restricting the length of the mean sequence significantly reduces the hardness of the problem. We give an exact algorithm running in polynomial time for constant-length means. We explore various approximation algorithms that provide a trade-off between the approximation factor and the running time. Our approximation algorithms have a running time with only linear dependency on the number of input sequences. In addition, we use our mean algorithms to obtain clustering algorithms with theoretical guarantees.
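A minimal sketch of the objects involved, under a common textbook definition of p-DTW (the paper's exact formulation may differ in details): the dynamic program for a p-DTW distance between two point sequences, and the sum-of-powered-distances objective that a length-restricted mean minimizes. The names `p_dtw` and `mean_objective` are illustrative, not taken from the paper.

```python
import numpy as np

def p_dtw(x, y, p=2):
    """p-DTW distance between point sequences x (n, d) and y (m, d):
    the p-norm of the pointwise distances along the cheapest warping path."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1]) ** p
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m] ** (1.0 / p)

def mean_objective(candidate, sequences, p=2, q=1):
    """Objective minimized by a mean of restricted length: the sum of
    p-DTW distances to the input sequences, each raised to the q-th power."""
    return sum(p_dtw(candidate, s, p) ** q for s in sequences)
```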
Alexandros Kalimeris, Ioannis Psarros, Giorgos Giannopoulos, Manolis Terrovitis, George Papastefanatos, Gregory Kotsis
IEEE Journal of Photovoltaics
Soiling is the accumulation of dirt on solar panels, which leads to a decreasing trend in solar energy yield and can cause substantial revenue losses. The effect of soiling can be reduced by washing the panels, which is, however, a procedure of non-negligible cost; moreover, soiling monitoring systems are often unreliable or very costly. We study the problem of estimating the soiling ratio in photovoltaic (PV) modules, i.e., the ratio of the actual power output to the power output that would be produced if the panels were clean. A key advantage of our algorithms is that they estimate soiling without needing to train on labelled data, i.e., periods of explicitly monitoring the soiling in each park, and without relying on generic analytical formulas that do not take into account the peculiarities of each installation. We consider as input a time series comprising a minimal set of measurements that are available to most PV park operators. Our experimental evaluation shows that we significantly outperform current state-of-the-art methods for estimating the soiling ratio.
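To make the estimated quantity concrete, here is a hypothetical sketch of the soiling ratio as defined above. The clean-panel reference power is assumed to be given; estimating it without labelled data or generic analytical formulas is precisely the paper's contribution and is not reproduced here.

```python
import numpy as np

def soiling_ratio(measured_power, clean_reference_power):
    """Soiling ratio: measured power output divided by the output the
    modules would produce if they were clean. Both arguments are arrays
    over the same timestamps; values are clipped to [0, 1] for readability."""
    return np.clip(measured_power / clean_reference_power, 0.0, 1.0)

def soiling_energy_loss(measured_power, clean_reference_power, dt_hours=1.0):
    """Energy lost to soiling over the period, assuming regular sampling."""
    return float(np.sum(clean_reference_power - measured_power) * dt_hours)
```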
Ioannis Psarros, Dennis Rohde
39th International Symposium on Computational Geometry (SoCG 2023)
Modern time series analysis requires the ability to handle datasets that are inherently high-dimensional; examples include applications in climatology, where measurements from numerous sensors must be taken into account, or inventory tracking of large shops, where the dimension is defined by the number of tracked items. The standard way to mitigate computational issues arising from the high dimensionality of the data is to apply a dimension reduction technique that preserves the structural properties of the ambient space. The dissimilarity between two time series is often measured by ``discrete'' notions of distance, e.g., the dynamic time warping or the discrete Fr\'echet distance. Since these distance functions are computed directly on the points of a time series, they are sensitive to different sampling rates or gaps. The continuous Fr\'echet distance offers a popular alternative which aims to alleviate this by taking into account all points on the polygonal curve obtained by linearly interpolating between any two consecutive points in a sequence. We study the ability of random projections \`a la Johnson and Lindenstrauss to preserve the continuous Fr\'echet distance of polygonal curves by effectively reducing the dimension. In particular, we show that one can reduce the dimension to $O(\epsilon^{-2} \log N)$, where $N$ is the total number of input points, while preserving the continuous Fr\'echet distance between any two of the determined polygonal curves within a factor of $1\pm \epsilon$. We conclude with applications to clustering.
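A minimal sketch of the kind of dimension reduction studied, assuming a plain Gaussian Johnson-Lindenstrauss matrix applied to the vertices of every curve. The target dimension follows the $O(\epsilon^{-2} \log N)$ bound with an illustrative constant; the guarantee on the continuous Fr\'echet distance is the paper's result, not something this snippet verifies.

```python
import numpy as np

def jl_project_curves(curves, eps, seed=None):
    """Project the vertices of polygonal curves with one shared Gaussian
    Johnson-Lindenstrauss matrix.

    curves: list of arrays of shape (n_i, d), the vertices of each curve.
    eps:    distortion parameter; target dimension k ~ eps^-2 * log N,
            where N is the total number of vertices (constant 8 is arbitrary).
    """
    rng = np.random.default_rng(seed)
    N = sum(len(c) for c in curves)
    d = curves[0].shape[1]
    k = max(1, int(np.ceil(8 * np.log(N) / eps ** 2)))
    G = rng.standard_normal((d, k)) / np.sqrt(k)   # scaled Gaussian projection
    return [c @ G for c in curves]                 # each curve mapped to (n_i, k)
```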
Dhaval Salwala, Seshu Tirupathi, Brian Quanz, Wesley Gifford, Stuart Siegel
INFORMS 2022
Terabytes of daily raw data generated by IoT sensors are indispensable for investigating time-series problems such as short-term forecasting of a target variable and failure prediction. Pure batch learning algorithms can be challenging with such high-frequency, high-volume data, as concept drift would require frequent retraining of the deployed models, leading to significant downtimes. Therefore, incremental models or coupled batch-incremental models are gaining importance for handling these problems. In this talk, we will present a distributed computing system that can scale to perform incremental learning on big data and efficiently carry out parameter search in big data applications, dynamically generating the most efficient incremental modelling pipelines with every stream of new incoming data, followed by synthetic and real-world use cases.
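As a rough illustration of the incremental side of such a pipeline (not the distributed system or parameter search described in the talk), the sketch below updates a model in place on each incoming mini-batch via scikit-learn's `partial_fit`, avoiding full retraining; the random data merely stands in for IoT sensor streams.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor()  # incremental learner, updated batch by batch

def on_new_batch(X_batch, y_batch):
    """Update the deployed model on a newly arrived mini-batch of readings."""
    model.partial_fit(X_batch, y_batch)

# hypothetical usage: synthetic sensor batches arriving over time
true_w = np.random.rand(8)
for _ in range(10):
    X = np.random.rand(256, 8)
    y = X @ true_w + 0.1 * np.random.randn(256)
    on_new_batch(X, y)
```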
D. Tsitsigkos, P. Bouros, K. Lampropoulos, N. Mamoulis, and M. Terrovitis
IEEE Transactions on Knowledge and Data Engineering (TKDE), to appear.
Non-point spatial objects (e.g., polygons, linestrings) are ubiquitous. We study the problem of indexing non-point objects in memory for range queries and spatial intersection joins. We propose a secondary partitioning technique for space-oriented partitioning indices (e.g., grids), which improves their performance significantly by avoiding the generation and elimination of duplicate results. Our approach is easy to implement and can be used by any space-partitioning index to significantly reduce the cost of range queries and intersection joins. In addition, the secondary partitions can be processed independently, which makes our method appropriate for distributed and parallel indexing. Experiments on real datasets confirm the advantage of our approach against alternative duplicate-elimination techniques and state-of-the-art data-oriented spatial indices. We also show that our partitioning technique, paired with optimized partition-to-partition join algorithms, typically reduces the cost of spatial joins by around 50%.
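For context on the duplicate problem that secondary partitioning sidesteps, the sketch below implements a uniform grid with the classic reference-point duplicate-avoidance rule, i.e., one of the alternative duplicate-elimination techniques the paper compares against; it is not the paper's class-based secondary partitioning scheme, and the tile size and function names are illustrative.

```python
from collections import defaultdict

CELL = 100.0  # hypothetical tile extent of the uniform grid

def tiles_of(rect):
    """All grid tiles overlapped by a rectangle (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = rect
    return [(i, j)
            for i in range(int(x1 // CELL), int(x2 // CELL) + 1)
            for j in range(int(y1 // CELL), int(y2 // CELL) + 1)]

def build_grid(objects):
    """Assign each object (oid, MBR) to every tile its MBR overlaps."""
    grid = defaultdict(list)
    for oid, mbr in objects:
        for t in tiles_of(mbr):
            grid[t].append((oid, mbr))
    return grid

def range_query(grid, query):
    """Report each intersecting object once, from the tile that contains the
    minimum-coordinate corner of the MBR/range intersection (reference point)."""
    qx1, qy1, qx2, qy2 = query
    results = []
    for t in tiles_of(query):
        for oid, (x1, y1, x2, y2) in grid.get(t, ()):
            if x1 <= qx2 and qx1 <= x2 and y1 <= qy2 and qy1 <= y2:
                rx, ry = max(x1, qx1), max(y1, qy1)
                if (int(rx // CELL), int(ry // CELL)) == t:
                    results.append(oid)
    return results
```

The reference-point check avoids reporting the same object from several tiles, but every replicated entry is still generated and inspected; the paper's secondary partitioning is designed to avoid producing such duplicates in the first place.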