Ingeniería Informática
Browsing Ingeniería Informática by Subject "ANALISIS DE DATOS"

Journal article: An algebra for OLAP (2017)
Kuijpers, Bart; Vaisman, Alejandro Ariel
"Online Analytical Processing (OLAP) comprises tools and algorithms that allow querying multidimensional databases. It is based on the multidimensional model, where data can be seen as a cube in which each cell contains one or more measures that can be aggregated along dimensions. Despite the extensive corpus of work in the field, a standard language for OLAP is still needed, since there is no well-defined, accepted semantics for many of the usual OLAP operations. In this paper, we address this problem, and present a set of operations for manipulating a data cube. We clearly define the semantics of these operations, and prove that they can be composed, yielding a language powerful enough to express complex OLAP queries. We express these operations as a sequence of atomic transformations over a fixed multidimensional matrix, whose cells contain a sequence of measures. Each atomic transformation produces a new measure. When a sequence of transformations defines an OLAP operation, a flag is produced indicating which cells must be considered as input for the next operation. In this way, an elegant algebra is defined. Our main contribution, with respect to other similar efforts in the field, is that, for the first time, a formal proof of the correctness of the operations is given, thus providing a clear semantics for them.
We believe the present work will serve as a basis to build more solid practical tools for data analysis."

Final degree project: Análisis de datos de pacientes y consultantes con COVID-19 (2021-09-29)
Pingarilho, Pedro Remigio; Gómez, Fermín; Di Luca, Miguel; Gambini, Juliana

Journal article: Analytical queries on semantic trajectories using graph databases (2019-10)
Gómez, Leticia Irene; Kuijpers, Bart; Vaisman, Alejandro Ariel
"This article studies the analysis of moving object data collected by location-aware devices, such as GPS, using graph databases. Such raw trajectories can be transformed into so-called semantic trajectories, which are sequences of stops that occur at “places of interest.” Trajectory data analysis can be enriched if spatial and non-spatial contextual data associated with the moving objects are taken into account, and aggregation of trajectory data can reveal hidden patterns within such data. When trajectory data are stored in relational databases, there is an “impedance mismatch” between the representation and storage models. Graphs in which the nodes and edges are annotated with properties are gaining increasing interest to model a variety of networks. Therefore, this article proposes the use of graph databases (Neo4j in this case) to represent and store trajectory data, which can thus be analyzed at different aggregation levels using graph query languages (Cypher, for Neo4j).
Through a real-world public data case study, the article shows that trajectory queries are expressed more naturally on the graph-based representation than over the relational alternative, and perform better in many typical cases."

Journal article: Analyzing the quality of Twitter data streams (2020)
Arolfo, Franco A.; Cortes Rodriguez, Kevin; Vaisman, Alejandro Ariel
"There is a general belief that the quality of Twitter data streams is low and unpredictable, which makes it somewhat unreliable to base decisions on such data. The work presented here addresses this problem from a Data Quality (DQ) perspective, adapting the traditional methods used in relational databases, based on quality dimensions and metrics, to capture the characteristics of Twitter data streams in particular, and of Big Data in a more general sense. Therefore, as a first contribution, this paper re-defines the classic DQ dimensions and metrics for the scenario under study. Second, the paper introduces a software tool that allows capturing Twitter data streams in real time, computing their DQ and displaying the results through a wide variety of graphics. As a third contribution of this paper, using the aforementioned machinery, a thorough analysis of the DQ of Twitter streams is performed, based on four dimensions: Readability, Completeness, Usefulness, and Trustworthiness. These dimensions are studied for several different cases, namely unfiltered data streams, data streams filtered using a collection of keywords, and tweets classified by topic, with the DQ studied for each topic. Further, although it is well known that the number of geolocalized tweets is very low, the paper studies the DQ of tweets with respect to the place from where they are posted. Last but not least, the tool allows changing the weights of each quality dimension considered in the computation of the overall data quality of a tweet.
This allows defining weights that fit different analysis contexts and/or different user profiles. Interestingly, this study reveals that the quality of Twitter streams is higher than what would have been expected."

Final degree project: Calidad de datos contextual en Big Data: calidad de datos de Twitter (2020-04-24)
Cortés Rodríguez, Kevin Imanol; Vaisman, Alejandro Ariel
"Data quality plays an important role in every phase of analysis in Big Data processes. Data quality assessment, based on quality dimensions and metrics, must be adapted to capture the new characteristics that Big Data brings. This document examines that problem in depth, redefining the data quality dimensions and metrics for a Big Data scenario in which data arrives in real time in JSON format and is processed by several components to obtain data quality metrics. In particular, this project studies the concrete case of user messages on the social network Twitter. It also details the implementation of a new microservices-based architecture that continues the project 'Data quality in a big data context: about Twitter’s data quality', covering the path from the moment a tweet is processed until it reaches the user through the interface, along with all the improvements added to enhance the user experience."

Conference paper: Data quality in a big data context (2018)
Arolfo, Franco A.; Vaisman, Alejandro Ariel
"In each of the phases of a Big Data analysis process, data quality (DQ) plays a key role. Given the particular characteristics of the data at hand, the traditional DQ methods used for relational databases, based on quality dimensions and metrics, must be adapted and extended, in order to capture the new characteristics that Big Data introduces.
This paper dives into this problem, re-defining the DQ dimensions and metrics for a Big Data scenario, where data may arrive, for example, as unstructured documents in real time. This general scenario is instantiated to study the concrete case of Twitter feeds. Further, the paper also describes the implementation of a system that acquires tweets in real time, and computes the quality of each tweet, applying the quality metrics that are defined formally in the paper. The implementation includes a web user interface that allows filtering the tweets, for example, by keywords, and visualizing the quality of a data stream in many different ways. Experiments are performed and their results discussed."

Final degree project: Data quality in a big data context: about Twitter’s data quality (2018)
Arolfo, Franco A.; Vaisman, Alejandro Ariel
"In each of the phases of a Big Data analysis process, Data Quality (DQ) plays a key role. Given the particular characteristics of the data at hand, the traditional DQ methods, based on quality dimensions and metrics, must be adapted and extended, in order to capture the new characteristics that Big Data introduces. This paper dives into this problem, re-defining the DQ dimensions and metrics for a Big Data scenario, where the data arrives, in this particular case, as unstructured documents in real time, such as JSON objects. This general scenario is instantiated to study the concrete case of Twitter feeds. Further, the paper also describes the implementation of a system that acquires tweets in real time, and computes the quality of each tweet, applying the quality metrics that are defined formally in the paper. The implementation includes a web user interface that allows filtering the tweets, for example, by keywords, and visualizing the quality of a data stream in many different ways.
Experiments are performed and their results discussed."

Journal article: Mobility data warehouses (2019-04)
Vaisman, Alejandro Ariel; Zimányi, Esteban
"The interest in mobility data analysis has grown dramatically with the wide availability of devices that track the position of moving objects. Mobility analysis can be applied, for example, to analyze traffic flows. To support mobility analysis, trajectory data warehousing techniques can be used. Trajectory data warehouses typically include, as measures, segments of trajectories, linked to spatial and non-spatial contextual dimensions. This paper goes beyond this concept, by including, as measures, the trajectories of moving objects at any point in time. In this way, online analytical processing (OLAP) queries, typically including aggregation, can be combined with moving object queries, to express queries like “List the total number of trucks running at less than 2 km from each other more than 50% of its route in the province of Antwerp” in a concise and elegant way. Existing proposals for trajectory data warehouses do not support queries like this, since they are based on either the segmentation of the trajectories, or a pre-aggregation of measures. The solution presented here is implemented using MobilityDB, a moving object database that extends the PostgreSQL database with temporal data types, allowing seamless integration with relational spatial and non-spatial data. This integration leads to the concept of mobility data warehouses.
This paper discusses modeling and querying mobility data warehouses, providing a comprehensive collection of queries implemented using PostgreSQL and PostGIS as database backend, extended with the libraries provided by MobilityDB."

Conference paper: Modelling and querying star and snowflake warehouses using graph databases (2019)
Vaisman, Alejandro Ariel; Besteiro, María Florencia; Valverde Melito, Maximiliano Javier
"In current “Big Data” scenarios, graph databases are increasingly being used. Online Analytical Processing (OLAP) operations can expand the possibilities of graph analysis beyond the traditional graph-based computation. This paper studies graph databases as an alternative to implement star and snowflake schemas, the typical choices for data warehouse design. For this, the MusicBrainz database is used. A data warehouse for this database is designed, and implemented over a Postgres relational database. This warehouse is also represented as a graph, and implemented over the Neo4j graph database. A collection of typical OLAP queries is used to compare both implementations. The results reported here show that in ten out of thirteen queries tested, the graph implementation outperforms the relational one, in ratios that go from 1.3 to 26 times faster, and performs similarly to the relational implementation in the three remaining cases."

Journal article: Online analytical processing on graph data (2020)
Gómez, Leticia Irene; Kuijpers, Bart; Vaisman, Alejandro Ariel
"Online Analytical Processing (OLAP) comprises tools and algorithms that allow querying multidimensional databases. It is based on the multidimensional model, where data can be seen as a cube such that each cell contains one or more measures that can be aggregated along dimensions.
In a “Big Data” scenario, traditional data warehousing and OLAP operations are clearly not sufficient to address current data analysis requirements, for example, social network analysis. Furthermore, OLAP operations and models can expand the possibilities of graph analysis beyond the traditional graph-based computation. Nevertheless, there is not much work on the problem of taking OLAP analysis to the graph data model. This paper proposes a formal multidimensional model for graph analysis, that considers the basic graph data, and also background information in the form of dimension hierarchies. The graphs in this model are node- and edge-labelled directed multihypergraphs, called graphoids, which can be defined at several different levels of granularity using the dimensions associated with them. Operations analogous to the ones used in typical OLAP over cubes are defined over graphoids. The paper presents a formal definition of the graphoid model for OLAP, proves that the typical OLAP operations on cubes can be expressed over the graphoid model, and shows that the classic data cube model is a particular case of the graphoid data model. Finally, a case study supports the claim that, for many kinds of OLAP-like analysis on graphs, the graphoid model works better than the typical relational OLAP alternative, and for the classic OLAP queries, it remains competitive."

Conference paper: Performing OLAP over graph data: query language, implementation, and a case study (2017-08)
Gómez, Leticia Irene; Kuijpers, Bart; Vaisman, Alejandro Ariel
"In current Big Data scenarios, traditional data warehousing and Online Analytical Processing (OLAP) operations on cubes are clearly not sufficient to address the current data analysis requirements. Nevertheless, OLAP operations and models can expand the possibilities of graph analysis beyond the traditional graph-based computation.
In spite of this, there is not much work on the problem of taking OLAP analysis to the graph data model. In previous work we proposed a multidimensional (MD) data model for graph analysis, that considers not only the basic graph data, but background information in the form of dimension hierarchies as well. The graphs in our model are node- and edge-labelled directed multi-hypergraphs, called graphoids, defined at several different levels of granularity. In this paper we show how we implemented this proposal over the widely used Neo4j graph database, discuss implementation issues, and present a detailed case study to show how OLAP operations can be used on graphs."

Conference paper: Temporal SOLAP: query language, implementation, and a use case (2012)
Bisceglia, Pablo; Gómez, Leticia Irene; Vaisman, Alejandro Ariel
"The integration of Geographic Information Systems (GIS) and On-Line Analytical Processing (OLAP), denoted SOLAP, is aimed at exploring and analyzing spatial data. In real-world SOLAP applications, spatial and non-spatial data are subject to changes. In this paper we present a temporal query language for SOLAP, called TPiet-QL, supporting so-called discrete changes (for example, in land use or cadastral applications there are situations where parcels are merged or split). TPiet-QL allows expressing integrated GIS-OLAP queries in a scenario where spatial objects change across time. We also present a prototype implementation, and show how this application is used in a real-world scenario: the analysis of protected areas in Uruguay."

Journal article: Towards the Internet of water: Using graph databases for hydrological analysis on the Flemish river system (2021-07)
Bollen, Erik; Hendrix, Rik; Kuijpers, Bart; Vaisman, Alejandro Ariel
"The “Internet of Water” project will deploy 2,500 sensors along the Flemish river system, in Belgium. These sensors will be part of a monitoring system.
This will produce an enormous amount of data, on which prediction and analysis tasks can be performed. To represent, store, and query river data, relational databases are normally used. However, this choice introduces an “impedance mismatch” between the conceptual representation (typically a graph) and the storage model (relational tables). To solve this problem, this article proposes to use graph databases. The Flemish river system is presented as a use case and the Neo4j graph database and its high-level query language, Cypher, are used for storing and querying the data, respectively. A relational alternative is implemented over the PostgreSQL database. A collection of representative queries of interest for hydrologists is defined over both database implementations."

Journal article: User-centered road network traffic analysis with MobilityDB (2022)
Sakr, Mahmoud; Zimányi, Esteban; Vaisman, Alejandro Ariel; Bakli, Mohamed
"Performance indicators of road networks are a long-lasting topic of research. Existing schemes assess network properties such as the average speed on road segments and the queuing time at intersections. The increasing availability of user trajectories, collected mainly using mobile phones with a variety of applications, creates opportunities for developing user-centered performance indicators. Performing such an analysis on big trajectory data sets remains a challenge for the existing data management systems, because they lack support for spatiotemporal trajectory data. This article presents an end-to-end solution, based on MobilityDB, a novel moving object database system that extends PostgreSQL with spatiotemporal data types and functions. A new class of indicators is proposed, focused on the users' experience. The indicators address the network design, the traffic flow, and the driving comfort of the motorists.
Furthermore, these indicators are expressed as analytical MobilityDB queries over a big set of real vehicle trajectories."