Default Datasets - accarniel/FESTIval GitHub Wiki

FESTIval provides datasets to support the design of empirical experiments that evaluate spatial indices. These datasets are classified into two types:

datasets to be indexed: these datasets are based on real geographic data extracted from OpenStreetMap. They are related to Brazil and their descriptions can be found below.
datasets to form spatial queries: the goal of these datasets is to provide search objects for the definition of spatial queries (e.g., point queries, range queries). Here, we also provide the algorithms employed for generating these search objects, which can be either randomly generated or correlated with a dataset.

These datasets have been used by us in our experiments. The research papers resulting from our experiments can be accessed here.

Citing the Datasets

The datasets are specified in the paper below; please cite it if you use one or more of them:

Carniel, A. C.; Ciferri, R. R.; Ciferri, C. D. A. Spatial Datasets for Conducting Experimental Evaluations of Spatial Indices. In Proceedings of the Satellite Events of the 32nd Brazilian Symposium on Databases (SBBD) - Workshop Dataset Showcase, p. 286-295, 2017.

Datasets to be indexed

The majority of these datasets are extracted from OpenStreetMap by using the tool osm2pgsql. After extraction, the data are processed to form tables with specific characteristics and contexts. In addition, we also provide some datasets extracted from IBGE (in Portuguese), which is the official agency responsible for collecting statistical, geographic, cartographic, geodetic, and environmental information in Brazil.

We briefly describe the datasets that can be indexed as follows:

Dataset (name of the relational table)	Short Description	Download Link
brazil_buildings2016	Extracted from OpenStreetMap in May 2016 by using osm2pgsql version 0.91. It stores the buildings of Brazil, such as universities, hotels, schools, warehouses, stadiums, houses, churches, etc. They are represented by 534,926 regions.	zip file
brazil_buildings2017	Extracted from OpenStreetMap in January 2017 by using osm2pgsql version 0.91. It stores the buildings of Brazil, such as universities, hotels, schools, warehouses, stadiums, houses, churches, etc. They are represented by 1,486,557 regions.	zip file
brazil_buildings2017_v2	Extracted from OpenStreetMap in January 2017. Unlike brazil_buildings2017, however, it was extracted by using osm2pgsql version 0.93. It stores the buildings of Brazil, such as universities, hotels, schools, warehouses, stadiums, houses, churches, etc. They are represented by 1,485,866 regions.	zip file
brazil_highways2017	Extracted from OpenStreetMap in January 2017 by using osm2pgsql version 0.93. It stores the highways of Brazil, such as roads, footpaths, streets, cycleways, raceways, etc. They are represented by 2,644,432 lines.	zip file
brazil_points2017	Extracted from OpenStreetMap in January 2017 by using osm2pgsql version 0.93. It stores the locations of Brazil represented by points, such as toilets, telephones, banks, hydrants, etc. They are represented by 770,842 points.	zip file
brazil2017	Extracted from OpenStreetMap in January 2017 by using osm2pgsql version 0.93. It stores the union of all the regions, lines, and points that represent spatial phenomena, events, and man-made artifacts of Brazil. Number of spatial objects: 5,577,373.	zip file
brazil_cities	It stores the cities of Brazil in 2012. They are represented by 5,512 regions.	zip file
brazil_states	It stores the states of Brazil in 2012. They are represented by 27 regions.	zip file

Datasets to form spatial queries

These datasets are created by specific algorithms written in PL/pgSQL, the SQL procedural language of PostgreSQL. These algorithms generate points and rectangles, which can be random or correlated to a dataset. They populate two relational tables, which are described below.

Dataset (name of the relational table)	Short Description	Download Link
generated_point	It stores points to create point queries. Here, there are two types of points: random points and correlated points. Specifically, a correlated point is strongly associated with at least one spatial object of a dataset. For each dataset of the previous table, there is a set of 100 correlated points. For random points, it stores 100 points that are randomly generated by taking as basis the region that represents Brazil.	zip file
generated_rectangle	It stores rectangles to create range queries, such as intersection range queries and containment range queries. Here, there are two types of rectangles: random rectangles and correlated rectangles. While random rectangles are rectangles that simply intersect Brazil, there are two types of correlated rectangles. The first type consists of rectangles that intersect at least one spatial object of a given spatial dataset. The second type consists of rectangles that contain at least one spatial object of a given spatial dataset. Thus, for each dataset of the previous table, there is a set of 100 correlated rectangles for each type of correlation.	zip file
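
As a rough illustration of how these search objects might drive a point query, the sketch below joins generated_point against one of the indexed datasets. It assumes the column layout given in the table definitions later on this page (geom, is_correlated, dataset), assumes that brazil_buildings2017 keeps its geometries in a column named way (as in the examples on this page), and uses the plain PostGIS function ST_Intersects rather than any FESTIval index. Treat it as a starting point, not as the procedure used in our experiments.

```sql
-- For each correlated point generated for brazil_buildings2017,
-- count how many buildings the point intersects.
-- LEFT JOIN keeps points that match no building (count = 0).
SELECT p.id          AS point_id,
       count(b.way)  AS matching_buildings
FROM generated_point AS p
LEFT JOIN brazil_buildings2017 AS b
       ON ST_Intersects(b.way, p.geom)
WHERE p.is_correlated
  AND p.dataset = 'brazil_buildings2017'
GROUP BY p.id
ORDER BY p.id;
```

The same pattern should carry over to range queries by replacing generated_point with generated_rectangle.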

The generation of the points and rectangles of the aforementioned datasets (generated_point and generated_rectangle) is performed by the following algorithms implemented as PL/pgSQL functions:

Algorithm	Short Description	Link
generate_random_point	This function generates random points and has the following parameters: base_region (a geometry object) is the total extent used in the generation of points, and num (an integer) is the number of points that will be generated.	SQL code
generate_random_rect	This function generates random rectangles and has the following parameters: base_region (a geometry object) is the total extent used in the generation of rectangles, perc (a double) is the percentage value used to define the size of the generated rectangle in relation to base_region, and num (an integer) is the number of rectangles that will be generated.	SQL code
generate_correlated_point	This function generates correlated points and has the following parameters: base_region (a geometry object) is the total extent used in the generation of points, target_table and target_geo (text values) are the table and the geometric column used as basis to generate the correlated objects, and num (an integer) is the number of objects that will be generated.	SQL code
generate_correlated_qws2irqs and generate_correlated_qws2crqs	These functions generate correlated rectangles by considering the correlations of intersection and containment, respectively. They have the following parameters: base_region (a geometry object) is the total extent used in the generation of rectangles, perc (a double) is the percentage value used to define the size of the generated rectangle in relation to base_region, target_table and target_geo (text values) are the table and the geometric column used as basis to generate the correlated objects, and num (an integer) is the number of objects that will be generated.	SQL code

Definition of the datasets to form spatial queries

If you are interested in creating your own datasets by using the aforementioned scripts, you have to create the relational tables as follows:

create table generated_rectangle (id serial, geom geometry, percentage double precision, type text, is_correlated boolean, dataset text default 'ALL');

create table generated_point (id serial, geom geometry, is_correlated boolean, dataset text default 'ALL');
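
Once these tables are populated (either from the downloadable files or by the generation functions), a quick summary can help confirm what they contain. The queries below are a minimal sketch that relies only on the columns declared above; the values that actually appear in the type and percentage columns depend on how the tables were filled.

```sql
-- How many points exist per dataset, split into random and correlated.
SELECT dataset, is_correlated, count(*) AS num_points
FROM generated_point
GROUP BY dataset, is_correlated
ORDER BY dataset, is_correlated;

-- The same overview for rectangles, additionally grouped by the
-- type and percentage recorded for each rectangle.
SELECT dataset, type, percentage, is_correlated, count(*) AS num_rectangles
FROM generated_rectangle
GROUP BY dataset, type, percentage, is_correlated
ORDER BY dataset, type, percentage;
```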

Examples

In this query, we are generating 100 random points that intersect the minimum bounding rectangle of Brazil (stored in the relational table brazil_extent):

SELECT generate_random_point((SELECT geom FROM brazil_extent), 100);

Next, we are generating 100 random rectangles with 0.001% of the size of the minimum bounding rectangle of Brazil (stored in the relational table brazil_extent):

SELECT generate_random_rect((SELECT geom FROM brazil_extent), 0.001, 100);
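
To see how the perc argument relates to the rectangles that were produced, one option is to compare each rectangle with the extent it was generated from. The sketch below assumes that brazil_extent holds a single row and that "size" refers to area; both are assumptions, so the linked SQL code remains the authority on how perc is interpreted.

```sql
-- Ratio between the area of each random rectangle and the area of
-- the base extent, shown next to the stored percentage value.
SELECT r.id,
       r.percentage,
       ST_Area(r.geom) / ST_Area(e.geom) AS area_ratio
FROM generated_rectangle AS r
CROSS JOIN brazil_extent AS e
WHERE NOT r.is_correlated
ORDER BY r.id;
```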

After the execution of the two previous SQL queries, the text value 'ALL' is stored in the column dataset of the relational tables generated_point and generated_rectangle, indicating that the generated points and rectangles can be used for any dataset. This is done because the points and rectangles are random. However, you can change the value of dataset by issuing SQL UPDATE statements, or by passing an additional last argument to these functions. For instance, the following queries store the value 'brazil_buildings2016' in the column dataset of both tables:

SELECT generate_random_point((SELECT geom FROM brazil_extent), 100, 'brazil_buildings2016');

SELECT generate_random_rect((SELECT geom FROM brazil_extent), 0.001, 100, 'brazil_buildings2016');
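
For the alternative based on SQL UPDATE statements, a minimal sketch could look like the following. The filter on dataset = 'ALL' and on random objects is our assumption about which rows you would want to relabel; adjust the WHERE clause to your case.

```sql
BEGIN;

-- Relabel the random points currently marked as usable for any dataset.
UPDATE generated_point
SET dataset = 'brazil_buildings2016'
WHERE dataset = 'ALL'
  AND NOT is_correlated;

-- Check the result before making it permanent.
SELECT dataset, count(*) AS num_points
FROM generated_point
GROUP BY dataset;

COMMIT;  -- or ROLLBACK; to discard the change
```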

In the following query, we are generating 100 rectangles with 0.01% of the size of the minimum bounding rectangle of Brazil (stored in the relational table brazil_extent). In addition, we guarantee that each generated rectangle intersects at least one spatial object stored in the column way of the relational table brazil_buildings2017 (here, the column dataset takes the value of the third parameter):

SELECT generate_correlated_qws2irqs((SELECT geom FROM brazil_extent), 0.01, 'brazil_buildings2017', 'way', 100);

In the following query, we are generating 100 rectangles with 0.1% of the size of the minimum bounding rectangle of Brazil (stored in the relational table brazil_extent). In addition, we guarantee that each generated rectangle contains at least one spatial object stored in the column way of the relational table brazil_buildings2017 (here, the column dataset takes the value of the third parameter):

SELECT generate_correlated_qws2crqs((SELECT geom FROM brazil_extent), 0.1, 'brazil_buildings2017', 'way', 100);
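
Correlated points can presumably be generated in the same way. The call below is a sketch that assumes the arguments of generate_correlated_point follow the order in which its parameters are described on this page (base_region, target_table, target_geo, num); please check the linked SQL code before relying on it. The second query is an optional sanity check that treats "correlated" as "intersects at least one object", which is also an assumption.

```sql
-- Generate 100 points correlated to the buildings stored in the
-- column way of brazil_buildings2017 (argument order assumed).
SELECT generate_correlated_point((SELECT geom FROM brazil_extent), 'brazil_buildings2017', 'way', 100);

-- Count correlated points that do not intersect any building.
SELECT count(*) AS unmatched_points
FROM generated_point AS p
WHERE p.is_correlated
  AND p.dataset = 'brazil_buildings2017'
  AND NOT EXISTS (
        SELECT 1
        FROM brazil_buildings2017 AS b
        WHERE ST_Intersects(b.way, p.geom)
      );
```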

The table brazil_extent, used in our examples, specifies the political boundaries of Brazil and can be accessed here.

Please feel free to use these algorithms to generate more random and correlated search objects, including search objects for other datasets that are not included in FESTIval.
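
For a dataset of your own, you would first need a base region comparable to brazil_extent. The sketch below builds one from the bounding box of the data with the PostGIS functions ST_Extent, ST_SetSRID, and ST_SRID. The names my_dataset, my_extent, and the geometry column geom are hypothetical placeholders, and using the data's bounding box as the base region is our suggestion rather than a requirement of the functions.

```sql
-- Hypothetical table my_dataset with a geometry column geom.
-- ST_Extent returns a box without an SRID, so the SRID of the
-- source data is reapplied after casting the box to a geometry.
CREATE TABLE my_extent AS
SELECT ST_SetSRID(
         ST_Extent(geom)::geometry,
         (SELECT ST_SRID(geom) FROM my_dataset LIMIT 1)
       ) AS geom
FROM my_dataset;

-- Generate 100 rectangles that each intersect at least one object
-- of my_dataset, using the new extent as the base region.
SELECT generate_correlated_qws2irqs((SELECT geom FROM my_extent), 0.01, 'my_dataset', 'geom', 100);
```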