Browsing by Author "Vaisman, Alejandro Ariel"
Showing 1 - 20 of 31

Conference Paper: Aggregation languages for moving object and places of interest (2008). Gómez, Leticia Irene; Kuijpers, Bart; Vaisman, Alejandro Ariel
"We address aggregate queries over GIS data and moving object data, where non-spatial information is stored in a data warehouse. We propose a formal data model and query language to express complex aggregate queries. Next, we study the compression of trajectory data, produced by moving objects, using the notions of stops and moves. We show that stops and moves are expressible in our query language and we consider a fragment of this language, consisting of regular expressions to talk about temporally ordered sequences of stops and moves. This fragment can be used not only for querying, but also for expressing data mining and pattern matching tasks over trajectory data."

Journal Article: An algebra for OLAP (2017). Kuijpers, Bart; Vaisman, Alejandro Ariel
"Online Analytical Processing (OLAP) comprises tools and algorithms that allow querying multidimensional databases. It is based on the multidimensional model, where data can be seen as a cube, where each cell contains one or more measures that can be aggregated along dimensions. Despite the extensive corpus of work in the field, a standard language for OLAP is still needed, since there is no well-defined, accepted semantics for many of the usual OLAP operations. In this paper, we address this problem, and present a set of operations for manipulating a data cube. We clearly define the semantics of these operations, and prove that they can be composed, yielding a language powerful enough to express complex OLAP queries. We express these operations as a sequence of atomic transformations over a fixed multidimensional matrix, whose cells contain a sequence of measures. Each atomic transformation produces a new measure.
When a sequence of transformations defines an OLAP operation, a flag is produced indicating which cells must be considered as input for the next operation. In this way, an elegant algebra is defined. Our main contribution, with respect to other similar efforts in the field, is that, for the first time, a formal proof of the correctness of the operations is given, thus providing a clear semantics for them. We believe the present work will serve as a basis to build more solid practical tools for data analysis."

Journal Article: Analytical queries on semantic trajectories using graph databases (2019-10). Gómez, Leticia Irene; Kuijpers, Bart; Vaisman, Alejandro Ariel
"This article studies the analysis of moving object data collected by location-aware devices, such as GPS, using graph databases. Such raw trajectories can be transformed into so-called semantic trajectories, which are sequences of stops that occur at “places of interest.” Trajectory data analysis can be enriched if spatial and non-spatial contextual data associated with the moving objects are taken into account, and aggregation of trajectory data can reveal hidden patterns within such data. When trajectory data are stored in relational databases, there is an “impedance mismatch” between the representation and storage models. Graphs in which the nodes and edges are annotated with properties are gaining increasing interest to model a variety of networks. Therefore, this article proposes the use of graph databases (Neo4j in this case) to represent and store trajectory data, which can thus be analyzed at different aggregation levels using graph query languages (Cypher, for Neo4j).
Through a real-world public data case study, the article shows that trajectory queries are expressed more naturally on the graph-based representation than over the relational alternative, and perform better in many typical cases."

Journal Article: Analyzing public transport in the city of Buenos Aires with MobilityDB (2022). Godfrid, Juan; Radnic, Pablo; Vaisman, Alejandro Ariel; Zimányi, Esteban
"The General Transit Feed Specification (GTFS) is a data format widely used to share data about public transportation schedules and associated geographic information. GTFS comes in two versions: GTFS Static describing the planned itineraries and GTFS Realtime describing the actual ones. MobilityDB is a novel and free open-source moving object database, developed as a PostgreSQL and PostGIS extension, that adds spatial and temporal data types along with a large number of functions that facilitate the analysis of mobility data. Loading GTFS data into MobilityDB is a quite complex task that, nevertheless, must be done in an ad-hoc fashion. This work describes how MobilityDB is used to analyze public transport mobility in the city of Buenos Aires, using both static and real-time GTFS data for the Buenos Aires public transportation system. Visualizations are also produced to enhance the analysis. To the authors’ knowledge, this is the first attempt to analyze GTFS data with a moving object database."

Journal Article: Analyzing the quality of Twitter data streams (2022). Arolfo, Franco; Cortés Rodriguez, Kevin; Vaisman, Alejandro Ariel
"There is a general belief that the quality of Twitter data streams is generally low and unpredictable, making it, in some way, unreliable to take decisions based on such data.
The work presented here addresses this problem from a Data Quality (DQ) perspective, adapting the traditional methods used in relational databases, based on quality dimensions and metrics, to capture the characteristics of Twitter data streams in particular, and of Big Data in a more general sense. Therefore, as a first contribution, this paper re-defines the classic DQ dimensions and metrics for the scenario under study. Second, the paper introduces a software tool that allows capturing Twitter data streams in real time, computing their DQ and displaying the results through a wide variety of graphics. As a third contribution of this paper, using the aforementioned machinery, a thorough analysis of the DQ of Twitter streams is performed, based on four dimensions: Readability, Completeness, Usefulness, and Trustworthiness. These dimensions are studied for several different cases, namely unfiltered data streams, data streams filtered using a collection of keywords, and classifying tweets referring to different topics, studying the DQ for each topic. Further, although it is well known that the number of geolocalized tweets is very low, the paper studies the DQ of tweets with respect to the place from where they are posted. Last but not least, the tool allows changing the weights of each quality dimension considered in the computation of the overall data quality of a tweet. This allows defining weights that fit different analysis contexts and/or different user profiles. Interestingly, this study reveals that the quality of Twitter streams is higher than what would have been expected."

Journal Article: Analyzing the quality of Twitter data streams (2020). Arolfo, Franco A.; Cortes Rodriguez, Kevin; Vaisman, Alejandro Ariel
"There is a general belief that the quality of Twitter data streams is generally low and unpredictable, making it, in some way, unreliable to take decisions based on such data.
The work presented here addresses this problem from a Data Quality (DQ) perspective, adapting the traditional methods used in relational databases, based on quality dimensions and metrics, to capture the characteristics of Twitter data streams in particular, and of Big Data in a more general sense. Therefore, as a first contribution, this paper re-defines the classic DQ dimensions and metrics for the scenario under study. Second, the paper introduces a software tool that allows capturing Twitter data streams in real time, computing their DQ and displaying the results through a wide variety of graphics. As a third contribution of this paper, using the aforementioned machinery, a thorough analysis of the DQ of Twitter streams is performed, based on four dimensions: Readability, Completeness, Usefulness, and Trustworthiness. These dimensions are studied for several different cases, namely unfiltered data streams, data streams filtered using a collection of keywords, and classifying tweets referring to different topics, studying the DQ for each topic. Further, although it is well known that the number of geolocalized tweets is very low, the paper studies the DQ of tweets with respect to the place from where they are posted. Last but not least, the tool allows changing the weights of each quality dimension considered in the computation of the overall data quality of a tweet. This allows defining weights that fit different analysis contexts and/or different user profiles. 
Interestingly, this study reveals that the quality of Twitter streams is higher than what would have been expected."

Journal Article: A data model and query language for spatio-temporal decision support (2010). Gómez, Leticia Irene; Kuijpers, Bart; Vaisman, Alejandro Ariel
"In recent years, applications aimed at exploring and analyzing spatial data have emerged, powered by the increasing need for software that integrates Geographic Information Systems (GIS) and On-Line Analytical Processing (OLAP). These applications have been called SOLAP (Spatial OLAP). In previous work, the authors have introduced Piet, a system based on a formal data model that integrates in a single framework GIS, OLAP (On-Line Analytical Processing), and Moving Object data. Real-world problems are inherently spatio-temporal. Thus, in this paper we present a data model that extends Piet, allowing tracking the history of spatial data in the GIS layers. We present a formal study of the two typical ways of introducing time into Piet: timestamping the thematic layers in the GIS, and timestamping the spatial objects in each layer. We denote these strategies snapshot-based and timestamp-based representations, respectively, following well-known terminology borrowed from temporal databases. We present and discuss the formal model for both alternatives. Based on the timestamp-based representation, we introduce a formal First-Order spatio-temporal query language, which we denote Lt, able to express spatio-temporal queries over GIS, OLAP, and trajectory data. Finally, we discuss implementation issues, the update operators that must be supported by the model, and sketch a temporal extension to Piet-QL, the SQL-like query language that supports Piet."

Conference Paper: Data quality in a big data context (2018). Arolfo, Franco A.; Vaisman, Alejandro Ariel
"In each of the phases of a Big Data analysis process, data quality (DQ) plays a key role.
Given the particular characteristics of the data at hand, the traditional DQ methods used for relational databases, based on quality dimensions and metrics, must be adapted and extended, in order to capture the new characteristics that Big Data introduces. This paper dives into this problem, re-defining the DQ dimensions and metrics for a Big Data scenario, where data may arrive, for example, as unstructured documents in real time. This general scenario is instantiated to study the concrete case of Twitter feeds. Further, the paper also describes the implementation of a system that acquires tweets in real time, and computes the quality of each tweet, applying the quality metrics that are defined formally in the paper. The implementation includes a web user interface that allows filtering the tweets, for example by keywords, and visualizing the quality of a data stream in many different ways. Experiments are performed and their results discussed."

Journal Article: Design and implementation of ETL processes using BPMN and relational algebra (2020-06-13). Awiti, Judith; Vaisman, Alejandro Ariel; Zimányi, Esteban
"Extraction, transformation, and loading (ETL) processes are used to extract data from internal and external sources of an organization, transform these data, and load them into a data warehouse. The Business Process Modeling and Notation (BPMN) has been proposed for expressing ETL processes at a conceptual level. A different approach is studied in this paper, where relational algebra (RA), extended with update operations, is used for specifying ETL processes. In this approach, data tasks in an ETL workflow can be automatically translated into SQL queries to be executed over a DBMS. To illustrate this study, the paper addresses the problem of updating Slowly Changing Dimensions (SCDs) with dependencies, that is, the case when updating a SCD table impacts on associated SCD tables.
Tackling this problem requires extending the classic RA with update operations. The paper also shows the implementation of a portion of the TPC-DI benchmark that results from both approaches. Thus, the paper presents three implementations: (a) An SQL implementation based on the extended RA-based specification of an ETL process expressed in BPMN4ETL; and (b) Two implementations of workflows that follow from BPMN4ETL, one that uses the Pentaho DI tool, and another one that uses Talend Open Studio for DI. Experiments over these implementations of the TPC-DI benchmark for different scale factors were carried out, and are described and discussed in the paper, showing that the extended RA approach results in more efficient processes than the ones produced by implementing the BPMN4ETL specification over the mentioned ETL tools. The reasons for this result are also discussed."

Journal Article: Efficient analytical queries on semantic web data cubes (2017-12). Etcheverry, Lorena; Vaisman, Alejandro Ariel
"The amount of multidimensional data published on the semantic web (SW) is constantly increasing, due to initiatives such as Open Data and Open Government Data, among other ones. Models, languages, and tools that allow obtaining valuable information efficiently are thus required. Multidimensional data are typically represented as data cubes, and exploited using Online Analytical Processing (OLAP) techniques. The RDF Data Cube Vocabulary, also denoted QB, is the current W3C standard to represent statistical data on the SW. Given that QB does not include key features needed for OLAP analysis, in previous work we have proposed an extension, denoted QB4OLAP, to overcome this problem without the need of modifying already published data. Once data cubes are appropriately represented on the SW, we need mechanisms to analyze them.
However, in the current state-of-the-art, writing efficient analytical queries over SW data cubes demands a deep knowledge of standards like RDF and SPARQL. These skills are unlikely to be found in typical analytical users. Further, OLAP languages like MDX are far from being easily understood by the final user. The lack of friendly tools to exploit multidimensional data on the SW is a barrier that needs to be broken to promote the publication of such data. This is the problem we address in this paper. Our approach is based on allowing analytical users to write queries using what they know best: OLAP operations over data cubes, without dealing with SW technicalities. For this, we devised CQL (standing for Cube Query Language), a simple, high-level query language that operates over data cubes. Taking advantage of structural metadata provided by QB4OLAP, we translate CQL queries into SPARQL ones. Then, we propose query improvement strategies to produce efficient SPARQL queries, adapting general-purpose SPARQL query optimization techniques. We evaluate our implementation using the Star-Schema benchmark, showing that our proposal outperforms others. The QB4OLAP toolkit, a web application that allows exploring and querying (using CQL) SW data cubes, completes our contributions."

Conference Paper: From conceptual to logical ETL design using BPMN and relational algebra (2019). Awiti, Judith; Vaisman, Alejandro Ariel; Zimányi, Esteban
"Extraction, transformation, and loading (ETL) processes are used to extract data from internal and external sources of an organization, transform these data, and load them into a data warehouse. The Business Process Modeling Notation (BPMN) has been proposed for expressing ETL processes at a conceptual level. This paper extends relational algebra (RA) with update operations for specifying ETL processes at a logical level. In this approach, data tasks can be automatically translated into SQL queries to be executed over a DBMS.
An extension of RA is presented, as well as a translation mechanism from BPMN to the RA specification. Throughout the paper, the TPC-DI benchmark is used for comparing both approaches. Experiments show the efficiency of the resulting ETL flow with respect to the Pentaho Data Integration tool."

Conference Paper: Indexing continuous paths in temporal graphs (2022). Kuijpers, Bart; Ribas, Ignacio; Soliani, Valeria; Vaisman, Alejandro Ariel
"Temporal property graph databases track the evolution over time of nodes, properties, and edges in graphs. Computing temporal paths in these graphs is hard. In this paper we focus on indexing Continuous Paths, defined as paths that exist continuously during a certain time interval. We propose an index structure called TGIndex where index nodes are defined as nodes in the graph database. Two different indexing strategies are studied. We show how the index is used for querying and also present different search strategies, which are compared and analyzed using a large synthetic graph."

Journal Article: Mapping spatiotemporal data to RDF: a SPARQL endpoint for Brussels (2019). Vaisman, Alejandro Ariel; Chentout, Kevin
"This paper describes how a platform for publishing and querying linked open data for the Brussels Capital region in Belgium is built. Data are provided as relational tables or XML documents and are mapped into the RDF data model using R2RML, a standard language that allows defining customized mappings from relational databases to RDF datasets. In this work, data are spatiotemporal in nature; therefore, R2RML must be adapted to allow producing spatiotemporal Linked Open Data. Data generated in this way are used to populate a SPARQL endpoint, where queries are submitted and the result can be displayed on a map. This endpoint is implemented using Strabon, a spatiotemporal RDF triple store built by extending the RDF store Sesame.
The first part of the paper describes how R2RML is adapted to allow producing spatial RDF data and to support XML data sources. These techniques are then used to map data about cultural events and public transport in Brussels into RDF. Spatial data are stored in the form of stRDF triples, the format required by Strabon. In addition, the endpoint is enriched with external data obtained from the Linked Open Data Cloud, from sites like DBpedia, Geonames, and LinkedGeoData, to provide context for analysis. The second part of the paper shows, through a comprehensive set of queries written in the spatial extension to SPARQL (stSPARQL), how the endpoint can be exploited."

Journal Article: Mobility data warehouses (2019-04). Vaisman, Alejandro Ariel; Zimányi, Esteban
"The interest in mobility data analysis has grown dramatically with the wide availability of devices that track the position of moving objects. Mobility analysis can be applied, for example, to analyze traffic flows. To support mobility analysis, trajectory data warehousing techniques can be used. Trajectory data warehouses typically include, as measures, segments of trajectories, linked to spatial and non-spatial contextual dimensions. This paper goes beyond this concept, by including, as measures, the trajectories of moving objects at any point in time. In this way, online analytical processing (OLAP) queries, typically including aggregation, can be combined with moving object queries, to express queries like “List the total number of trucks running at less than 2 km from each other more than 50% of its route in the province of Antwerp” in a concise and elegant way. Existing proposals for trajectory data warehouses do not support queries like this, since they are based on either the segmentation of the trajectories, or a pre-aggregation of measures.
The solution presented here is implemented using MobilityDB, a moving object database that extends the PostgreSQL database with temporal data types, allowing seamless integration with relational spatial and non-spatial data. This integration leads to the concept of mobility data warehouses. This paper discusses modeling and querying mobility data warehouses, providing a comprehensive collection of queries implemented using PostgreSQL and PostGIS as database backend, extended with the libraries provided by MobilityDB."

Journal Article: A model and query language for temporal graph databases (2021-09). Debrouvier, Ariel; Parodi, Eliseo; Perazzo, Matías; Soliani, Valeria; Vaisman, Alejandro Ariel
"Graph databases are becoming increasingly popular for modeling different kinds of networks for data analysis. They are built over the property graph data model, where nodes and edges are annotated with property-value pairs. Most existing work in the field is based on graphs where the temporal dimension is not considered. However, time is present in most real-world problems. Many different kinds of changes may occur in a graph as the world it represents evolves across time. For instance, edges, nodes, and properties can be added and/or deleted, and property values can be updated. This paper addresses the problem of modeling, storing, and querying temporal property graphs, allowing keeping the history of a graph database. This paper introduces a temporal graph data model, where nodes and relationships contain attributes (properties) timestamped with a validity interval. Graphs in this model can be heterogeneous, that is, relationships may be of different kinds. Associated with the model, a high-level graph query language, denoted T-GQL, is presented, together with a collection of algorithms for computing different kinds of temporal paths in a graph, capturing different temporal path semantics.
T-GQL can express queries like “Give me the friends of the friends of Mary, who lived in Brussels at the same time as her, and also give me the periods when this happened”. As a proof-of-concept, a Neo4j-based implementation of the above is also presented, and a client-side interface allows submitting queries in T-GQL to a Neo4j server. Finally, experiments were carried out over synthetic and real-world data sets, with a twofold goal: on the one hand, to show the plausibility of the approach; on the other hand, to analyze the factors that affect performance, like the length of the paths mentioned in the query, and the size of the graph."

Conference Paper: Modeling and querying sensor networks using temporal graph databases (2022). Kuijpers, Bart; Soliani, Valeria; Vaisman, Alejandro Ariel
"Transportation networks (e.g., river systems or road networks) equipped with sensors that collect data for several different purposes can be naturally modeled using graph databases. However, since networks can change over time, to represent these changes appropriately, a temporal graph data model is required. In this paper, we show that sensor-equipped transportation networks can be represented and queried using temporal graph databases and query languages. For this, we extend a recently introduced temporal graph data model and its high-level query language T-GQL to support time series in the nodes of the graph. We redefine temporal paths and study and implement a new kind of path, called Flow path. We take the Flanders’ river system as a use case."

Conference Paper: Modelling and querying star and snowflake warehouses using graph databases (2019). Vaisman, Alejandro Ariel; Besteiro, María Florencia; Valverde Melito, Maximiliano Javier
"In current “Big Data” scenarios, graph databases are increasingly being used. Online Analytical Processing (OLAP) operations can expand the possibilities of graph analysis beyond the traditional graph-based computation.
This paper studies graph databases as an alternative to implement star and snowflake schemas, the typical choices for data warehouse design. For this, the MusicBrainz database is used. A data warehouse for this database is designed, and implemented over a Postgres relational database. This warehouse is also represented as a graph, and implemented over the Neo4j graph database. A collection of typical OLAP queries is used to compare both implementations. The results reported here show that in ten out of thirteen queries tested, the graph implementation outperforms the relational one, in ratios that go from 1.3 to 26 times faster, and performs similarly to the relational implementation in the three remaining cases."

Journal Article: Online analytical processing on graph data (2020). Gómez, Leticia Irene; Kuijpers, Bart; Vaisman, Alejandro Ariel
"Online Analytical Processing (OLAP) comprises tools and algorithms that allow querying multidimensional databases. It is based on the multidimensional model, where data can be seen as a cube such that each cell contains one or more measures that can be aggregated along dimensions. In a “Big Data” scenario, traditional data warehousing and OLAP operations are clearly not sufficient to address current data analysis requirements, for example, social network analysis. Furthermore, OLAP operations and models can expand the possibilities of graph analysis beyond the traditional graph-based computation. Nevertheless, there is not much work on the problem of taking OLAP analysis to the graph data model. This paper proposes a formal multidimensional model for graph analysis that considers the basic graph data, and also background information in the form of dimension hierarchies. The graphs in this model are node- and edge-labelled directed multihypergraphs, called graphoids, which can be defined at several different levels of granularity using the dimensions associated with them.
Operations analogous to the ones used in typical OLAP over cubes are defined over graphoids. The paper presents a formal definition of the graphoid model for OLAP, proves that the typical OLAP operations on cubes can be expressed over the graphoid model, and shows that the classic data cube model is a particular case of the graphoid data model. Finally, a case study supports the claim that, for many kinds of OLAP-like analysis on graphs, the graphoid model works better than the typical relational OLAP alternative, and for the classic OLAP queries, it remains competitive."

Conference Paper: Performing OLAP over graph data: query language, implementation, and a case study (2017-08). Gómez, Leticia Irene; Kuijpers, Bart; Vaisman, Alejandro Ariel
"In current Big Data scenarios, traditional data warehousing and Online Analytical Processing (OLAP) operations on cubes are clearly not sufficient to address the current data analysis requirements. Nevertheless, OLAP operations and models can expand the possibilities of graph analysis beyond the traditional graph-based computation. In spite of this, there is not much work on the problem of taking OLAP analysis to the graph data model. In previous work we proposed a multidimensional (MD) data model for graph analysis that considers not only the basic graph data, but background information in the form of dimension hierarchies as well. The graphs in our model are node- and edge-labelled directed multi-hypergraphs, called graphoids, defined at several different levels of granularity.
In this paper we show how we implemented this proposal over the widely used Neo4j graph database, discuss implementation issues, and present a detailed case study to show how OLAP operations can be used on graphs."

Journal Article: Piet: a GIS-OLAP implementation (2007). Vaisman, Alejandro Ariel; Gómez, Leticia Irene; Kuijpers, Bart; Escribano, Ariel
"Data aggregation in Geographic Information Systems (GIS) is a desirable feature, although only marginally present in commercial systems, which also fail to provide integration between GIS and OLAP (On Line Analytical Processing). With this in mind, we have developed Piet, a system that makes use of a novel query processing technique: first, a process called sub-polygonization decomposes each thematic layer in a GIS into open convex polygons; then, another process computes and stores in a database the overlay of those layers for later use by a query processor. We describe the implementation of Piet, and provide experimental evidence that overlay precomputation can outperform GIS systems that employ indexing schemes based on R-trees."