MORE | Management of Real-time Energy Data - Publications


Approximating Length-Restricted Means under Dynamic Time Warping

Maike Buchin, Anne Driemel, Koen van Greevenbroek, Ioannis Psarros, Dennis Rohde

20th International Workshop on Approximation and Online Algorithms (WAOA 2022), Potsdam, Germany.

Abstract

We study variants of the mean problem under the p-Dynamic Time Warping (p-DTW) distance, a popular and robust distance measure for sequential data. In our setting we are given a set of finite point sequences over an arbitrary metric space and we want to compute a mean point sequence of given length that minimizes the sum of p-DTW distances, each raised to the qth power, between the input sequences and the mean sequence. In general, the problem is NP-hard and known not to be fixed-parameter tractable in the number of sequences. On the positive side, we show that restricting the length of the mean sequence significantly reduces the hardness of the problem. We give an exact algorithm running in polynomial time for constant-length means. We explore various approximation algorithms that provide a trade-off between the approximation factor and the running time. Our approximation algorithms have a running time with only linear dependency on the number of input sequences. In addition, we use our mean algorithms to obtain clustering algorithms with theoretical guarantees.
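For intuition, the following is a minimal sketch, not taken from the paper, of how a p-DTW distance over an arbitrary metric can be computed by dynamic programming; the names (`p_dtw`, `dist`, `p`) are illustrative. The mean objective studied above then sums such distances, each raised to the qth power, over all input sequences.

```python
def p_dtw(x, y, dist, p=1):
    """p-DTW between point sequences x and y over a ground metric `dist`:
    minimize the sum of p-th powers of ground distances along a warping
    path, then take the p-th root (a sketch, not the paper's algorithm)."""
    n, m = len(x), len(y)
    INF = float("inf")
    # dp[i][j] = cheapest cost of aligning x[:i] with y[:j]
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(x[i - 1], y[j - 1]) ** p
            dp[i][j] = cost + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[n][m] ** (1.0 / p)

# Example on one-dimensional sequences with the absolute-value metric:
print(p_dtw([1, 2, 3], [1, 2, 2, 4], dist=lambda a, b: abs(a - b), p=2))
```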

Tight Bounds for Approximate Near Neighbor Searching for Time Series under the Fréchet Distance

Karl Bringmann, Anne Driemel, André Nusser, Ioannis Psarros

ACM-SIAM Symposium on Discrete Algorithms (SODA) 2022.

Abstract

We study the c-approximate near neighbor problem under the continuous Fréchet distance: Given a set of n polygonal curves with m vertices, a radius δ > 0, and a parameter k ≤ m, we want to preprocess the curves into a data structure that, given a query curve q with k vertices, either returns an input curve with Fréchet distance at most c · δ to q, or returns that there exists no input curve with Fréchet distance at most δ to q. We focus on the case where the input and the queries are one-dimensional polygonal curves—also called time series—and we give a comprehensive analysis for this case. We obtain new upper bounds that provide different tradeoffs between approximation factor, preprocessing time, and query time.

Our data structures improve upon the state of the art in several ways. We show that for any 0 < ∊ ≤ 1 an approximation factor of (1 + ∊) can be achieved within the same asymptotic time bounds as the previously best result for (2 + ∊). Moreover, we show that an approximation factor of (2 + ∊) can be obtained by using preprocessing time and space O(nm), which is linear in the input size, and query time in O(1/∊)^(k+2), where the previously best result used preprocessing time in n · O(m/(k∊))^k and query time in O(1)^k. We complement our upper bounds with matching conditional lower bounds based on the Orthogonal Vectors Hypothesis. Interestingly, some of our lower bounds already hold for any super-constant value of k. This is achieved by proving hardness of a one-sided sparse version of the Orthogonal Vectors problem as an intermediate problem, which we believe to be of independent interest.
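As a point of reference only, the decision problem can always be answered by a linear scan; the sketch below uses the simpler discrete Fréchet distance on one-dimensional curves as a stand-in for the continuous distance studied in the paper, and all names are illustrative.

```python
def discrete_frechet(P, Q):
    """Discrete Fréchet distance between 1D vertex sequences P and Q,
    a simpler stand-in for the continuous distance studied above."""
    n, m = len(P), len(Q)
    dp = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = abs(P[i] - Q[j])  # ground distance for time series
            if i == 0 and j == 0:
                dp[i][j] = d
            elif i == 0:
                dp[i][j] = max(dp[i][j - 1], d)
            elif j == 0:
                dp[i][j] = max(dp[i - 1][j], d)
            else:
                dp[i][j] = max(min(dp[i - 1][j], dp[i][j - 1],
                                   dp[i - 1][j - 1]), d)
    return dp[n - 1][m - 1]

def near_neighbor_scan(curves, q, delta, c=1.0):
    """Brute-force near neighbor: return a curve within c * delta of q,
    or None if no curve is within c * delta (hence none within delta).
    The paper's data structures avoid exactly this linear scan."""
    for P in curves:
        if discrete_frechet(P, q) <= c * delta:
            return P
    return None
```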

Efficient and Distributed Temporal Pattern Mining

Nguyen Ho, Van Long Ho, Torben Bach Pedersen, Mai Vu

IEEE BigData, 2021.

Abstract

The widespread deployment of IoT systems in the real world today has enabled the generation and collection of an enormous amount of sensor time series. One of the important mining techniques to extract patterns from time series is temporal pattern mining (TPM). Unlike sequential pattern mining, TPM adds an additional temporal dimension, i.e., time intervals, into extracted patterns, making them more informative. However, adding this extra temporal dimension results in an additional exponential factor in the growth of the search space, and thus significantly increases the mining complexity. Current TPM approaches work sequentially and therefore cannot scale to large datasets. In this paper, we propose Distributed Hierarchical Pattern Graph TPM (DHPG-TPM), the first distributed solution that supports large-scale TPM using the leading distributed platform Apache Spark. Moreover, DHPG-TPM employs efficient data structures, a distributed bitmap and a distributed Hierarchical Pattern Graph, that are carefully designed to work efficiently in a distributed environment and enable fast computations of support and confidence. To address the exponential search space of TPM, we design effective distributed pruning techniques based on the Apriori principle and the transitivity property of temporal relations to reduce the search space while minimizing the communication overhead between the cluster nodes. We conduct extensive experiments on real-world and synthetic datasets, showing that DHPG-TPM outperforms the sequential baselines and scales to very large datasets.
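The distributed bitmap itself is not reproduced here; the following single-machine sketch only illustrates the underlying idea of bitmap-based support counting, with all names and values invented for the example.

```python
def to_bitmap(windows):
    """Encode the set of time windows in which an event occurs
    as an integer bitmap (one bit per window)."""
    bm = 0
    for w in windows:
        bm |= 1 << w
    return bm

# Two events observed over 8 time windows:
A = to_bitmap([0, 2, 3, 5, 7])
B = to_bitmap([2, 3, 4, 7])

support_A = bin(A).count("1")        # windows containing A
support_AB = bin(A & B).count("1")   # windows containing both A and B
confidence = support_AB / support_A
# Apriori-style pruning: if support_AB is below the minimum support
# threshold, no pattern extending {A, B} can be frequent either.
print(support_AB, confidence)
```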

Extreme-Scale Model-Based Time Series Management with ModelarDB (Invited Talk)

Torben Bach Pedersen

28th International Symposium on Temporal Representation and Reasoning (TIME), 2021.

Abstract

To monitor critical industrial devices such as wind turbines, high quality sensors sampled at a high frequency are increasingly used. Current technology does not handle these extreme-scale time series well [Søren Kejser Jensen et al., 2017], so only simple aggregates are traditionally stored, removing outliers and fluctuations that could indicate problems. As a remedy, we present a model-based approach for managing extreme-scale time series that approximates the time series values using mathematical functions (models) and stores only model coefficients rather than data values. Compression is done both for individual time series and for correlated groups of time series. The keynote will present concepts, techniques, and algorithms from model-based time series management and our implementation of these in the open source Time Series Management System (TSMS) ModelarDB [Søren Kejser Jensen et al., 2018; Søren Kejser Jensen et al., 2019; Søren Kejser Jensen et al., 2021]. Furthermore, it will present our experimental evaluation of ModelarDB on extreme-scale real-world time series, which shows that, compared to widely used Big Data formats, ModelarDB provides up to 14× faster ingestion due to high compression, 113× better compression due to its adaptability, 573× faster aggregation by using models, and close to linear scale-out scalability. ModelarDB is being commercialized by the spin-out company ModelarData.
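To illustrate the model-based idea (this is a simplification, not ModelarDB's implementation), the sketch below greedily fits constant models under a per-value error bound, in the spirit of the PMC-Mean model type, and stores only segment boundaries and one coefficient per segment.

```python
def compress_constant(values, error_bound):
    """Greedily grow segments that a single constant approximates with
    every value within +/- error_bound (a simplified constant model,
    in the spirit of PMC-Mean; not ModelarDB's actual code)."""
    segments = []                 # (start_index, length, coefficient)
    start, lo, hi = 0, values[0], values[0]
    for i in range(1, len(values)):
        v = values[i]
        if max(hi, v) - min(lo, v) <= 2 * error_bound:
            lo, hi = min(lo, v), max(hi, v)   # v fits the current constant
        else:
            segments.append((start, i - start, (lo + hi) / 2))
            start, lo, hi = i, v, v           # open a new segment at v
    segments.append((start, len(values) - start, (lo + hi) / 2))
    return segments

# Five raw values collapse to two (start, length, coefficient) triples:
print(compress_constant([10.0, 10.2, 9.9, 20.0, 20.1], error_bound=0.5))
```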

Efficient Temporal Pattern Mining in Big Time Series Using Mutual Information

Van Long Ho, Nguyen Ho, Torben Bach Pedersen

PVLDB 15(3), 2021.

Abstract

Very large time series are increasingly available from an ever wider range of IoT-enabled sensors deployed in different environments. Significant insights can be gained by mining temporal patterns from these time series. Unlike traditional pattern mining, temporal pattern mining (TPM) adds event time intervals into extracted patterns, making them more expressive at the expense of increased time and space complexities. Existing TPM methods either cannot scale to large datasets, or work only on pre-processed temporal events rather than on time series. This paper presents our Frequent Temporal Pattern Mining from Time Series (FTPMfTS) approach providing: (1) The end-to-end FTPMfTS process taking time series as input and producing frequent temporal patterns as output. (2) The efficient Hierarchical Temporal Pattern Graph Mining (HTPGM) algorithm that uses efficient data structures for fast support and confidence computation, and employs effective pruning techniques for significantly faster mining. (3) An approximate version of HTPGM that uses mutual information, a measure of data correlation, to prune unpromising time series from the search space. (4) An extensive experimental evaluation showing that HTPGM outperforms the baselines in runtime and memory consumption, and can scale to big datasets. The approximate HTPGM is up to two orders of magnitude faster and less memory consuming than the baselines, while retaining high accuracy.
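As a small illustration of the pruning signal (the function name and the symbolic encoding are assumptions made for the example), mutual information between two discretized series can be estimated from joint occurrence counts; series with near-zero mutual information to all others are unpromising and can be dropped early.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from two equally long,
    discretized (symbolic) time series."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (a, b), c in pxy.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi

# Perfectly dependent series share maximal information (~0.97 bits here);
# independent ones score near zero and would be pruned.
print(mutual_information("AABAB", "XXYXY"))
```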

Spatio-Temporal Graph Convolutional Network for Stochastic Traffic Speed Imputation

Carlos Enrique Muniz Cuza, Nguyen Ho, Eleni Tzirita Zacharatou, Torben Bach Pedersen, Bin Yang

ACM SIGSPATIAL’22

Abstract

The rapid increase of traffic data generated by different sensing systems opens many opportunities to improve transportation services. An important opportunity is to enable stochastic routing that computes the arrival time probabilities for each suggested route instead of only the expected travel time. However, traffic datasets typically have many missing values, which prevents the construction of stochastic speeds. To address this limitation, we propose the Stochastic Spatio-Temporal Graph Convolutional Network (SST-GCN) architecture that accurately imputes missing speed distributions in a road network. SST-GCN combines Temporal Convolutional Networks and Graph Convolutional Networks into a single framework to capture both spatial and temporal correlations between road segments and time intervals. Moreover, to cope with datasets with many missing values, we propose a novel self-adaptive context-aware diffusion process that regulates the propagated information around the network, avoiding the spread of false information. We extensively evaluate the effectiveness of SST-GCN on real-world datasets, showing that it achieves from 4.6% to 50% higher accuracy than state-of-the-art baselines using three different evaluation metrics. Furthermore, multiple ablation studies confirm our design choices and scalability to large road networks.
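The full SST-GCN architecture is not reproduced here; purely for intuition, the sketch below shows the standard graph-convolution building block that propagates road-segment features over a normalized adjacency matrix, with random weights and invented dimensions.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One standard graph-convolution step: propagate node (road-segment)
    features X over the symmetrically normalized adjacency with
    self-loops, A_hat = D^(-1/2) (A + I) D^(-1/2), then transform by W."""
    A_tilde = A + np.eye(A.shape[0])            # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
    A_hat = d_inv_sqrt @ A_tilde @ d_inv_sqrt
    return np.maximum(A_hat @ X @ W, 0.0)       # ReLU activation

# Four road segments in a chain, three input features, eight outputs:
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0],
              [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
X = np.random.rand(4, 3)
W = np.random.rand(3, 8)
print(gcn_layer(A, X, W).shape)                 # (4, 8)
```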

Data-driven soiling detection in PV modules

Alexandros Kalimeris, Ioannis Psarros, Giorgos Giannopoulos, Manolis Terrovitis, George Papastefanatos, Gregory Kotsis

IEEE Journal of Photovoltaics

Abstract

Soiling is the accumulation of dirt on solar panels, which leads to a decreasing trend in solar energy yield and may cause vast revenue losses. The effect of soiling can be reduced by washing the panels, which is, however, a procedure of non-negligible cost. Moreover, soiling monitoring systems are often unreliable or very costly. We study the problem of estimating the soiling ratio in photovoltaic (PV) modules, i.e., the ratio of the real power output to the power output that would be produced if the solar panels were clean. A key advantage of our algorithms is that they estimate soiling without needing to train on labelled data, i.e., periods of explicitly monitoring the soiling in each park, and without relying on generic analytical formulas which do not take into account the peculiarities of each installation. We consider as input a time series comprising a minimum set of measurements that are available to most PV park operators. Our experimental evaluation shows that we significantly outperform current state-of-the-art methods for estimating the soiling ratio.
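To make the estimated quantity concrete (this naive baseline is an assumption for illustration, not one of the paper's estimators), one can proxy the clean-panel output with a rolling high quantile of normalized power and take the ratio:

```python
import numpy as np

def naive_soiling_ratio(power, window=96 * 30):
    """Illustrative baseline only: use a rolling 95th percentile of the
    (already irradiance-normalized) power output as a proxy for the
    clean-panel yield, and estimate the soiling ratio as observed
    output divided by that proxy, clipped to [0, 1]."""
    power = np.asarray(power, dtype=float)
    clean = np.array([
        np.quantile(power[max(0, i - window):i + 1], 0.95)
        for i in range(len(power))
    ])
    return np.clip(power / np.maximum(clean, 1e-9), 0.0, 1.0)
```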

Random projections for curves in high dimensions

Ioannis Psarros, Dennis Rohde

39th International Symposium on Computational Geometry (SoCG 2023)

Abstract

Modern time series analysis requires the ability to handle datasets that are inherently high-dimensional; examples include applications in climatology, where measurements from numerous sensors must be taken into account, or inventory tracking of large shops, where the dimension is defined by the number of tracked items. The standard way to mitigate computational issues arising from the high dimensionality of the data is by applying some dimension reduction technique that preserves the structural properties of the ambient space. The dissimilarity between two time series is often measured by "discrete" notions of distance, e.g., the dynamic time warping or the discrete Fréchet distance. Since all these distance functions are computed directly on the points of a time series, they are sensitive to different sampling rates or gaps. The continuous Fréchet distance offers a popular alternative which aims to alleviate this by taking into account all points on the polygonal curve obtained by linearly interpolating between any two consecutive points in a sequence. We study the ability of random projections à la Johnson and Lindenstrauss to preserve the continuous Fréchet distance of polygonal curves by effectively reducing the dimension. In particular, we show that one can reduce the dimension to O(∊^-2 log N), where N is the total number of input points, while preserving the continuous Fréchet distance between any two determined polygonal curves within a factor of 1 ± ∊. We conclude with applications on clustering.
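A minimal sketch of the projection itself (the constant in the target dimension and all names are assumptions for the example): every vertex of every curve is mapped with one shared, scaled Gaussian matrix.

```python
import numpy as np

def jl_project(curves, eps, rng=np.random.default_rng(0)):
    """Project the vertices of all curves with one shared Gaussian matrix,
    Johnson-Lindenstrauss style; the target dimension is O(eps^-2 log N)
    for N total input points (the constant 8 is an arbitrary choice)."""
    N = sum(len(c) for c in curves)
    d = curves[0].shape[1]
    k = max(1, int(np.ceil(8 * np.log(N) / eps ** 2)))
    G = rng.normal(size=(d, k)) / np.sqrt(k)    # scaled Gaussian projection
    return [c @ G for c in curves]

# Two curves with 100 vertices each in dimension 10,000:
curves = [np.random.rand(100, 10_000) for _ in range(2)]
print(jl_project(curves, eps=0.25)[0].shape)
```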

Prequential Model Selection for Time Series Forecasting based on Saliency Maps

Shivani Tomar, Seshu Tirupathi, Dhaval Vinodbhai Salwala, Ivana Dusparic, Elizabeth Daly

7th Workshop on Real-time Stream Analytics, Stream Mining, CER/CEP & Stream Data Management in Big Data, at the 2022 IEEE International Conference on Big Data.

Abstract

Over the last few years, incremental machine learning for streaming data has gained significant attention due to the need to learn from a constantly evolving stream of data without the need to store it. The advent of big data has further fuelled research on developing systems that can cope with continuously changing data streams and tackle the challenges associated with historical data requirements. The problem of time series forecasting has been studied using varied approaches like neural networks, ensemble methods, decision trees and rules, and support vector machines, to name a few. However, neural network models have gained particular attention in dealing with changing data distributions due to their generalization abilities. In this paper, we propose a prequential framework named PS-PGSM, which incrementally trains the base models and computes Regions of Competence (ROC) online, followed by selection of the best forecaster for the task of time series forecasting using saliency maps. We build upon the state-of-the-art approach named OS-PGSM (Online Model Selection using Performance Gradient based Saliency Maps), in which model training and ROC computation are performed offline. Past research has demonstrated that a set of different models enables specialization for each model compared to a single forecasting model, which is particularly useful when predicting for an evolving time series sequence. Our approach uses saliency maps for prequential calculation of the ROC for each model to find the best forecaster based on the performance of each model. We evaluate the proposed approach against OS-PGSM, as well as against the previously best-performing model, by first conducting preliminary experiments on 10 real-world time series datasets and then using 2 real-world big datasets to showcase its applicability to big data. Experimental results not only validate the effectiveness of our approach for big data but also demonstrate superior performance in terms of prediction accuracy and computational time efficiency, while also handling concept drift.
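The saliency-map-based ROC computation is not reproduced here; the sketch below only illustrates the prequential (test-then-train) setting with a much cruder selection rule, recent average error, and every class and name in it is invented for the example.

```python
from collections import deque

class EWMAForecaster:
    """Tiny incremental forecaster: exponentially weighted moving average."""
    def __init__(self, alpha):
        self.alpha, self.level = alpha, None
    def predict(self):
        return 0.0 if self.level is None else self.level
    def learn(self, x):
        self.level = x if self.level is None else \
            self.alpha * x + (1 - self.alpha) * self.level

def prequential_select(models, stream, window=50):
    """Test-then-train loop: every model predicts the next value, is
    scored, then learns it; the reported forecast comes from the model
    with the lowest recent error (a crude stand-in for ROC selection)."""
    errors = [deque(maxlen=window) for _ in models]
    out = []
    for x in stream:
        preds = [m.predict() for m in models]
        best = min(range(len(models)),
                   key=lambda i: sum(errors[i]) / max(len(errors[i]), 1))
        out.append(preds[best])
        for i, m in enumerate(models):
            errors[i].append(abs(preds[i] - x))   # evaluate first...
            m.learn(x)                            # ...then train
    return out

print(prequential_select([EWMAForecaster(0.1), EWMAForecaster(0.9)],
                         [1, 2, 3, 10, 10, 10, 2, 2, 2]))
```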

Distributed Incremental Machine Learning for Big Time Series Data

Dhaval Salwala, Seshu Tirupathi, Brian Quanz, Wesley Gifford, Stuart Siegel, Vijay Ekambaram, Arindam Jati

2022 IEEE International Conference on Big Data Industry and Government Program

Abstract

Today’s highly instrumented systems generate large amounts of time series data from many different domains. In order to create meaningful insights from these data, techniques are needed to handle the collection, processing, and analysis at scale. The high frequency and volume of the generated data introduce several challenges, including data transformation, managing concept drift, the operational cost of model re-training and tracking, and scaling hyperparameter optimization. Incremental machine learning can provide a viable solution to handle these kinds of data. Further, distributed machine learning can be an efficient technique to improve performance, increase accuracy, and scale to larger input sizes. In this paper, we introduce a framework that combines the computational capabilities of Apache Spark and the workflow parallelization of Ray for distributed incremental learning. We conduct an empirical analysis of our framework for time series forecasting using the Walmart M5 dataset. The system can perform a parameter search on streaming data with concept drift, producing a robust pipeline that fits high-volume data effectively. The results are encouraging and substantiate system proficiency over traditional big data analysis approaches that exclusively use either offline or online training.

Machine Learning Platform for Extreme Scale Computing on Compressed IoT Data

Seshu Tirupathi, Dhaval Salwala, Giulio Zizzo, Ambrish Rawat, Mark Purcell, Søren Kejser Jensen, Christian Thomsen, Nguyen Ho, Carlos E. Muniz Cuza, Jonas Brusokas, Torben Bach Pedersen, Giorgos Alexiou, Giorgos Giannopoulos, Panagiotis Gidarakos, Alexandros Kalimeris, Stavros Maroulis, George Papastefanatos, Ioannis Psarros, Vassilis Stamatopoulos, Manolis Terrovitis

IWBDR-3 Workshop at the 2022 IEEE International Conference on Big Data.

Abstract

With the lowering costs of sensors, high-volume and high-velocity data are increasingly being generated and analyzed, especially in IoT domains like energy and smart homes. Consequently, applications that require accurate short-term forecasts and predictions are also steadily increasing. In this paper, we provide an overview of a novel end-to-end platform that provides efficient ingestion, compression, transfer, query processing, and machine learning-based analytics for high-frequency and high-volume time series from IoT. The performance of the platform is evaluated using a real-world dataset from RES installations. The results show the importance of high-frequency analytics and the surprisingly positive impact of error-bounded lossy compression on machine learning in the form of AutoML. For example, when detecting yaw misalignments in wind turbines, an improvement of 9% in accuracy was observed for AutoML models on lossy compressed data compared to the current industry standard of 10-minute aggregated data. Thus, these small-scale experiments show the potential of the platform, and larger pilots are planned.

Scalable Distributed Computing Systems for Incremental Machine Learning in Big Data Applications

Dhaval Salwala, Seshu Tirupathi, Brian Quanz, Wesley Gifford, Stuart Siegel

INFORMS 2022

Abstract

Terabytes of daily raw data generated by IoT sensors are indispensable for investigating time series problems like short-term forecasting of the target variable and failure prediction. Pure batch learning algorithms can be challenging with this high-frequency and high-volume data, as concept drifts would require frequent retraining of the deployed models, leading to significant downtimes. Therefore, incremental models or coupled batch-incremental models are gaining increasing importance to handle these problems. In this talk, we will present a distributed computing system that can scale to perform incremental learning for big data and efficiently perform a parameter search in big data applications to dynamically generate the most efficient incremental modelling pipelines with every stream of new incoming data, followed by synthetic and real-world use cases.

Two-layer Space-oriented Partitioning for Non-point Data

D. Tsitsigkos, P. Bouros, K. Lampropoulos, N. Mamoulis, and M. Terrovitis

IEEE Transactions on Knowledge and Data Engineering (TKDE), to appear.

Abstract

Non-point spatial objects (e.g., polygons, linestrings, etc.) are ubiquitous. We study the problem of indexing non-point objects in memory for range queries and spatial intersection joins. We propose a secondary partitioning technique for space-oriented partitioning indices (e.g., grids), which improves their performance significantly, by avoiding the generation and elimination of duplicate results. Our approach is easy to implement and can be used by any space-partitioning index to significantly reduce the cost of range queries and intersection joins. In addition, the secondary partitions can be processed independently, which makes our method appropriate for distributed and parallel indexing. Experiments on real datasets confirm the advantage of our approach against alternative duplicate elimination techniques and data-oriented state-of-the-art spatial indices. We also show that our partitioning technique, paired with optimized partition-to-partition join algorithms, typically reduces the cost of spatial joins by around 50%.
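For context on the duplicate problem that the secondary partitioning removes (this sketch shows the classic reference-point baseline, not the paper's technique, and every name in it is illustrative): when objects are replicated into all grid cells they overlap, a joining pair can be produced in several cells, so a canonical cell is chosen per pair.

```python
def cell_of(x, y, cell_size):
    """Grid cell containing point (x, y)."""
    return (int(x // cell_size), int(y // cell_size))

def report_pair(r, s, cell, cell_size):
    """Reference-point test: rectangles r, s = (x1, y1, x2, y2) are
    replicated into every grid cell they overlap, so an overlapping pair
    can be found in many cells; report it only in the cell holding the
    bottom-left corner of their intersection to avoid duplicates."""
    ix, iy = max(r[0], s[0]), max(r[1], s[1])   # intersection corner
    return cell_of(ix, iy, cell_size) == cell

# Both rectangles are replicated into cells (0,0), (1,0), and (2,0) for
# cell_size=10, but the pair is reported exactly once, in cell (0,0):
r, s = (5, 0, 25, 5), (8, 2, 22, 8)
print([c for c in [(0, 0), (1, 0), (2, 0)] if report_pair(r, s, c, 10)])
```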

ModelarDB: Integrated Model-Based Management of Time Series from Edge to Cloud

Søren Kejser Jensen, Christian Thomsen, Torben Bach Pedersen

Transactions on Large-Scale Data- and Knowledge-Centered Systems LIII

Abstract

To ensure critical infrastructure is operating as expected, high-quality sensors are increasingly installed. However, due to the enormous amounts of high-frequency time series they produce, it is impossible or infeasible to transfer or even store these time series in the cloud when using state-of-the-practice compression methods. Thus, simple aggregates, e.g., 1–10-minute averages, are stored instead of the raw time series. However, by only storing these simple aggregates, informative outliers and fluctuations are lost. Many Time Series Management Systems (TSMSs) have been proposed to efficiently manage time series, but they are generally designed for either the edge or the cloud. In this paper, we describe a new version of the open-source model-based TSMS ModelarDB. The system is designed to be modular, and the same binary can be efficiently deployed on the edge and in the cloud. It also supports continuously transferring high-frequency time series compressed using models from the edge to the cloud. We first provide an overview of ModelarDB, analyze the requirements and limitations of the edge, and evaluate existing query engines and data stores for use on the edge. Then, we describe how ModelarDB has been extended to efficiently manage time series on the edge, a novel file-based data store, how ModelarDB’s compression has been improved by not storing time series that can be derived from base time series, and how ModelarDB transfers high-frequency time series from the edge to the cloud. As the work that led to ModelarDB began in 2015, we also reflect on the lessons learned while developing it.

Mining Seasonal Temporal Patterns in Time Series

Van Long Ho, Nguyen Ho, Torben Bach Pedersen

2023 IEEE 39th International Conference on Data Engineering (ICDE)

Abstract

As IoT-enabled sensors become more pervasive, very large time series data are increasingly generated and made available for advanced data analytics. By mining temporal patterns from the available data, valuable insights can be extracted to support decision making. A useful type of pattern found in many real-world applications exhibits periodic occurrences and is thus called a seasonal temporal pattern (STP). Compared to regular patterns, mining seasonal temporal patterns is more challenging since traditional measures such as support and confidence do not capture the seasonality characteristics. Further, the anti-monotonicity property does not hold for STPs, resulting in an exponential search space. We propose the first solution for seasonal temporal pattern mining (STPM) from time series that can mine STPs at different data granularities. We design efficient data structures and use two pruning techniques for the STPM algorithm that downsize the search space and accelerate the mining process. Further, based on the mutual information measure, we propose an approximate version of STPM that only mines seasonal patterns on the promising time series. Finally, extensive experiments with real-world and synthetic datasets show that STPM outperforms the baseline in terms of runtime and memory usage, and can scale to large datasets. The approximate STPM is up to an order of magnitude faster and less memory-consuming than the baseline, while maintaining high accuracy.
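Purely to build intuition for why plain support misses seasonality (this toy measure is an assumption made up for the example and is not the paper's definition), one can group a pattern's occurrences by season and count the seasons in which it is frequent:

```python
from collections import Counter

def seasons_where_frequent(timestamps, season_of, min_per_season=3):
    """Toy illustration (not the paper's measure): count the seasons in
    which a pattern occurs at least min_per_season times. A pattern with
    low overall support can still occur reliably every December, which
    plain support/confidence over the whole series cannot distinguish."""
    per_season = Counter(season_of(t) for t in timestamps)
    return sum(1 for c in per_season.values() if c >= min_per_season)

# Occurrences as (year, month); use the month as the 'season':
ts = [(2020, 12), (2020, 12), (2020, 12),
      (2021, 12), (2021, 12), (2021, 12)]
print(seasons_where_frequent(ts, season_of=lambda t: t[1]))  # 1 (December)
```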