Lesson 5. Environmental layers: integration and management of spatial ancillary information
Animals move in and interact with complex environments that can be characterized by a set of spatial layers containing environmental data. Spatial databases can manage these different data sets in a unified framework, defining spatial and non-spatial relationships that simplify the analysis of the interaction between animals and their habitat. This simplifies a large set of analyses that can be performed directly in the database with no need for dedicated GIS or statistical software. Such an approach moves the information content managed in the database from a geographical space to an animal's ecological space. This more comprehensive database model of the animals' movement ecology reduces the distance between physical reality and the way data are structured in the database, filling the semantic gap between the scientist's view of biological systems and its implementation in the information system. This lesson shows how vector and raster layers can be included in the database and how you can handle them using (spatial) SQL. The database built so far is extended with environmental ancillary data sets.
Topic 1. Adding ancillary environmental layers
In traditional information systems for wildlife tracking data management, position data are stored in some file-based spatial format (e.g. shapefile). Through a multi-step process in a GIS environment, position data are associated with a set of environmental attributes in an analytical stage (e.g. intersection of GPS positions with vector and raster environmental layers). This process is usually time-consuming and prone to error, implies data replication, and often has to be repeated for every new analysis. It also generally involves different tools for vector and raster maps. An advanced data management system should achieve the same result with an efficient (and, if needed, automated) procedure, possibly performed as a real-time routine management task. To do so, the first step is to integrate both position data and spatial ancillary information on the environment in a unique framework. This is essential to exploring the animals' behavior and understanding the ecological relationships that can be revealed by tracking data. Spatial databases can manage these different data sets in a unified framework. This also affects performance, as databases are optimized to run simple processes on large data sets like the ones generated by GPS sensors.
In this exercise, you will see how to integrate a number of spatial features (see figure below).
- Points: meteorological stations (derived from MeteoTrentino)
- Linestrings: roads network (derived from OpenStreetMap)
- Polygons: administrative units (derived from ISTAT) and the study area.
- Rasters: land cover (source: Corine) and digital elevation models (source: SRTM, see also Jarvis A, Reuter HI, Nelson A, Guevara E (2008) Hole-filled seamless SRTM data V4. International Centre for Tropical Agriculture (CIAT)).
The data sets that are required, and those that are available, are specific to each species and study, so the goal of this example is to show a complete set of procedures that can be replicated and customized for different data sets. Once layers are integrated into the database, you are encouraged to visualize and explore them in a GIS environment (e.g. QGIS).
Once data are loaded into the database, you will extend the gps_data_animals table with the environmental attributes derived from the ancillary layers provided in the test data set. You will also modify the function tools.new_gps_data_animals to compute these values automatically. In addition, you are encouraged to develop your own (spatial) queries (e.g. detect how many times each animal crosses a road, calculate how many times two animals are in the same place at the same time).
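To give a taste of this kind of analysis, here is a minimal sketch of one such query: counting, for each animal, the GPS positions that fall within 100 meters of a road. It assumes the table main.gps_data_animals (with animals_id and geom columns) built in the previous lessons, and it can only be run once the env_data.roads layer below has been loaded; the 100-meter threshold is an arbitrary illustrative value.

```sql
SELECT
  gps.animals_id,
  count(*) AS locations_near_roads
FROM main.gps_data_animals gps
WHERE
  gps.geom IS NOT NULL AND
  -- EXISTS avoids double-counting a position close to more than one road;
  -- the geography cast makes the distance threshold be interpreted in meters
  EXISTS (
    SELECT 1
    FROM env_data.roads r
    WHERE ST_DWithin(gps.geom::geography, r.geom::geography, 100))
GROUP BY gps.animals_id
ORDER BY gps.animals_id;
```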
It is a good practice to store your environmental layers in a dedicated schema in order to keep a clear database structure. Let's create the schema env_data:
CREATE SCHEMA env_data AUTHORIZATION postgres;
GRANT USAGE ON SCHEMA env_data TO basic_user;
COMMENT ON SCHEMA env_data IS 'Schema that stores environmental ancillary information.';
ALTER DEFAULT PRIVILEGES IN SCHEMA env_data GRANT SELECT ON TABLES TO basic_user;
Now you can start importing the shapefiles of the (vector) environmental layers included in the test data set. One option is to use the drag-and-drop function of the DB Manager plugin in QGIS (via the QGIS Browser).
Alternatively, a standard solution for importing shapefiles (vector data) is the shp2pgsql tool. shp2pgsql is an external command-line tool, which cannot be run in a SQL interface as a regular SQL command can. The code below has to be run in a command-line interpreter (on Windows, this is called the Command Prompt or MS-DOS shell). You will see other examples of external tools that are run in the same way, and it is very important to understand the difference between these and SQL commands. In these exercises, this difference is indicated with a different graphic layout. Start with the meteorological stations:
"C:\Program Files\PostgreSQL\9.4\bin\shp2pgsql.exe" -s 4326 -I C:\tracking_db\data\env_data\vector\meteo_stations.shp env_data.meteo_stations | "C:\Program Files\PostgreSQL\9.4\bin\psql.exe" -p 5432 -d gps_tracking_db -U postgres -h localhost
Note that the path to shp2pgsql.exe and psql.exe (in this case C:\Program Files\PostgreSQL\9.4\bin) can differ according to the folder where you installed your version of PostgreSQL. If you connect to the database remotely, you also have to change the address of the server (-h option). In the parameters, you set the reference system (option -s) and create a spatial index for the new table (option -I). The output of shp2pgsql is a text file with the SQL that generates and populates the table env_data.meteo_stations. With the symbol '|' you 'pipe' (send directly) the SQL to the database (through the PostgreSQL interactive terminal psql), where it is automatically executed. You have to set the port (-p), the name of the database (-d), the user (-U) and the password, if requested. In this way, you complete the whole process with a single command. You can refer to the shp2pgsql documentation for more details. You might have to add the whole path to psql and shp2pgsql, depending on the folder where you installed PostgreSQL; you can easily verify the path by searching for these two files. You also have to check that the path to your shapefile (meteo_stations.shp) is properly defined.
You can repeat the same operation for the study area layer:
"C:\Program Files\PostgreSQL\9.4\bin\shp2pgsql.exe" -s 4326 -I C:\tracking_db\data\env_data\vector\study_area.shp env_data.study_area | "C:\Program Files\PostgreSQL\9.4\bin\psql.exe" -p 5432 -d gps_tracking_db -U postgres -h localhost
Next, for the roads layer:
"C:\Program Files\PostgreSQL\9.4\bin\shp2pgsql.exe" -s 4326 -I C:\tracking_db\data\env_data\vector\roads.shp env_data.roads | "C:\Program Files\PostgreSQL\9.4\bin\psql.exe" -p 5432 -d gps_tracking_db -U postgres -h localhost
And for the administrative boundaries:
"C:\Program Files\PostgreSQL\9.4\bin\shp2pgsql.exe" -s 4326 -I C:\tracking_db\data\env_data\vector\adm_boundaries.shp env_data.adm_boundaries | "C:\Program Files\PostgreSQL\9.4\bin\psql.exe" -p 5432 -d gps_tracking_db -U postgres -h localhost
Now the shapefiles are in the database as new tables (one table for each shapefile). You can visualize them through a GIS interface (e.g. QGIS). You can also retrieve a summary of the information from all vector layers available in the database with the following command:
SELECT * FROM geometry_columns;
The primary method to import a raster layer is the command-line tool raster2pgsql, the equivalent of shp2pgsql for raster files, which converts GDAL-supported rasters into SQL suitable for loading into PostGIS. It is also capable of loading entire folders of raster files.
GDAL (Geospatial Data Abstraction Library) is a free library for reading, writing and processing raster geospatial data formats. It includes many simple but very powerful and fast command-line tools for raster data translation and processing. The related OGR library provides similar capabilities for simple vector data features. GDAL is used by most open-source spatial tools and by a large number of commercial software programs as well. You will probably benefit in particular from the tools gdalinfo (get a layer's basic metadata), gdal_translate (change data format or data type, cut), and gdalwarp (mosaicking, reprojection and warping utility).
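For example, before loading the SRTM raster you could inspect its metadata with gdalinfo, run from the Command Prompt (this assumes GDAL is installed and its tools are on your path):

gdalinfo C:\tracking_db\data\env_data\raster\srtm_dem.tif

The output reports, among other things, the driver, the raster size, the coordinate reference system, the pixel resolution and the band data type, which is useful for choosing the right -s option in the loading step below.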
An interesting feature of raster2pgsql is its capability to store the rasters inside the database (in-db) or to keep them as files in the file system (out-db, with the raster2pgsql -R option). In the latter case, only the raster metadata are stored in the database, not the pixel values themselves. Loading out-db raster metadata is much faster than loading the rasters completely into the database. Most operations at the pixel level (e.g. ST_SummaryStats) will have equivalent performance with out-db and in-db rasters. Other functions, like ST_Tile, that involve only the metadata will be faster with out-db rasters. Another advantage of out-db rasters is that they remain accessible to external applications that cannot query databases with SQL. However, the administrator must make sure that the link between what is in the database (the path to the raster file in the file system) is not broken (e.g. by moving or renaming the files). On the other hand, only in-db rasters can be generated with CREATE TABLE and modified with UPDATE statements. Which is the best choice depends on the size of the data set and on considerations about performance and database management. A good practice is generally to load very large raster data sets as out-db and smaller ones as in-db, to save loading time and to avoid repeatedly backing up huge, static rasters.
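As a sketch, an out-db load of the same SRTM raster would differ from the in-db command used below only by the -R flag (the target table name env_data.srtm_dem_outdb here is hypothetical, chosen just to keep it distinct from the in-db version):

"C:\Program Files\PostgreSQL\9.4\bin\raster2pgsql.exe" -R -I -C -s 4326 -t 20x20 C:\tracking_db\data\env_data\raster\srtm_dem.tif env_data.srtm_dem_outdb | "C:\Program Files\PostgreSQL\9.4\bin\psql.exe" -p 5432 -d gps_tracking_db -U postgres -h localhost

With -R, only the path to the file and the tile metadata are registered in the table, so the original .tif must stay at that location for the layer to remain usable.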
The QGIS plugin Load Raster to PostGIS can also be used to import raster data with a graphical interface. An important parameter to set when importing raster layers is the number of tiles (-t option). Tiles are small subsets of the image and correspond to a physical record in the table. This approach dramatically decreases the time required to retrieve information. The recommended values for the tile option range from 20x20 to 100x100. Here is the code (to be run in the Command Prompt) to transform a raster (the digital elevation model derived from SRTM) into the SQL code that is then used to physically load the raster into the database (as you did with shp2pgsql for vectors):
"C:\Program Files\PostgreSQL\9.4\bin\raster2pgsql.exe" -I -M -C -s 4326 -t 20x20 C:\tracking_db\data\env_data\raster\srtm_dem.tif env_data.srtm_dem | "C:\Program Files\PostgreSQL\9.4\bin\psql.exe" -p 5432 -d gps_tracking_db -U postgres -h localhost
You can repeat the same process on the land cover layer:
"C:\Program Files\PostgreSQL\9.4\bin\raster2pgsql.exe" -I -M -C -s 3035 C:\tracking_db\data\env_data\raster\corine06.tif -t 20x20 env_data.corine_land_cover | "C:\Program Files\PostgreSQL\9.4\bin\psql.exe" -p 5432 -d gps_tracking_db -U postgres -h localhost
The reference system of the Corine Land Cover data set is not geographic coordinates (SRID 4326), but ETRS89/ETRS-LAEA (SRID 3035), an equal-area projection over Europe. This must be specified with the -s option and kept in mind when this layer is connected to other spatial layers stored in a different reference system. As with shp2pgsql.exe, the -I option will create a spatial index on the loaded tiles, speeding up many spatial operations, and the -C option will generate a set of constraints on the table, allowing it to be correctly listed in the raster_columns metadata table. The land cover raster identifies classes that are labeled by a code (an integer). To specify the meaning of the codes, you can add a table where they are described. In this example, the land cover layer is taken from the Corine project. Classes are described by a hierarchical legend over three nested levels. The legend is provided in the test data set in the file corine_legend.csv. You can import the legend table (first creating an empty table, and then loading the data):
CREATE TABLE env_data.corine_land_cover_legend(
  grid_code integer NOT NULL,
  clc_l3_code character(3),
  label1 character varying,
  label2 character varying,
  label3 character varying,
  CONSTRAINT corine_land_cover_legend_pkey PRIMARY KEY (grid_code));
COMMENT ON TABLE env_data.corine_land_cover_legend IS 'Legend of Corine land cover, associating the numeric code to the three nested levels.';
Then you load the data:
COPY env_data.corine_land_cover_legend FROM 'C:\tracking_db\data\env_data\raster\corine_legend.csv' WITH (FORMAT csv, HEADER, DELIMITER ';');
You can retrieve a summary of the information from all raster layers available in the database with the following command:
SELECT * FROM raster_columns;
To keep a well-documented database, add comments to describe all the spatial layers that you have added:
COMMENT ON TABLE env_data.adm_boundaries IS 'Layer (polygons) of administrative boundaries (comuni).';
COMMENT ON TABLE env_data.corine_land_cover IS 'Layer (raster) of land cover (from Corine project).';
COMMENT ON TABLE env_data.meteo_stations IS 'Layer (points) of meteo stations.';
COMMENT ON TABLE env_data.roads IS 'Layer (lines) of roads network.';
COMMENT ON TABLE env_data.srtm_dem IS 'Layer (raster) of digital elevation model (from SRTM project).';
COMMENT ON TABLE env_data.study_area IS 'Layer (polygons) of the boundaries of the study area.';
- Load the roads network layer, edit it by adding some additional roads (using OpenStreetMap or Google Earth as a reference), save the edits, and export the layer as a shapefile
Topic 2. Querying spatial environmental data
Now that the set of ancillary (spatial) information is loaded into the database, you can start playing with this information using spatial SQL queries. In fact, spatial SQL makes it possible to run queries that explicitly handle the spatial relationships among the different spatial tables that you have stored in the database. In the following examples, SQL statements will show you how to take advantage of PostGIS features to manage, explore and analyze spatial objects, with optimized performance and no need for specific GIS interfaces.
You start playing with your spatial data by asking for the name of the administrative unit (comune, Italian commune) in which the point at coordinates (11, 46) (longitude, latitude) is located. There are two commands that are used when it comes to intersection of spatial elements: ST_Intersects and ST_Intersection. The former returns true if two features intersect, while the latter returns the geometry produced by the intersection of the objects. In this case, ST_Intersects is used to select the right comune:
SELECT nome_com
FROM env_data.adm_boundaries
WHERE ST_Intersects(ST_SetSRID(ST_MakePoint(11,46), 4326), geom);
The result is Cavedine.
In the second example, you compute the distance (rounded to the meter) from the point at coordinates (11, 46) to all the meteorological stations (ordered by distance) in the table env_data.meteo_stations. This information could be used, for example, to derive the precipitation and temperature for a GPS position at the given acquisition time, weighting the measurement from each station according to its distance from the point. In this case, ST_Distance_Spheroid is used. Alternatively, you could use ST_Distance and cast your geometries to the geography data type.
SELECT
  station_id,
  ST_Distance_Spheroid(ST_SetSRID(ST_MakePoint(11,46), 4326), geom, 'SPHEROID["WGS 84",6378137,298.257223563]')::integer AS distance
FROM env_data.meteo_stations
ORDER BY distance;
The result is the list of meteorological stations with their distance (in meters) from the point, ordered from closest to farthest.
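The ST_Distance alternative mentioned above, with geometries cast to the geography data type (which computes distances on the spheroid by default), would give essentially the same ranking:

```sql
SELECT
  station_id,
  -- casting to geography makes ST_Distance return meters on the spheroid
  ST_Distance(ST_SetSRID(ST_MakePoint(11,46), 4326)::geography, geom::geography)::integer AS distance
FROM env_data.meteo_stations
ORDER BY distance;
```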
In the third example, you compute the distance to the closest road:
SELECT
  ST_Distance(ST_SetSRID(ST_MakePoint(11,46), 4326)::geography, geom::geography)::integer AS distance
FROM env_data.roads
ORDER BY distance
LIMIT 1;
The result is 1560.
For users, the data type (vector, raster) used to store spatial information is not so relevant when they query their data: queries should transparently use any kind of spatial data as input. Users can then focus on the environmental model instead of worrying about the data model. In the next example, you intersect a point with two raster layers (altitude and land cover) in the same way you do for vector layers. In the case of land cover, the point must first be projected into the Corine reference system (SRID 3035). In the raster layer, just the Corine code class (integer) is stored while the legend is stored in the table env_data.corine_land_cover_legend. In the query, the code class is joined to the legend table and the code description is returned. This is an example of integration of both spatial and non-spatial elements in the same query.
SELECT
  ST_Value(srtm_dem.rast, ST_SetSRID(ST_MakePoint(11,46), 4326)) AS altitude,
  ST_Value(corine_land_cover.rast, ST_Transform(ST_SetSRID(ST_MakePoint(11,46), 4326), 3035)) AS land_cover,
  label2,
  label3
FROM
  env_data.corine_land_cover,
  env_data.srtm_dem,
  env_data.corine_land_cover_legend
WHERE
  ST_Intersects(corine_land_cover.rast, ST_Transform(ST_SetSRID(ST_MakePoint(11,46), 4326), 3035)) AND
  ST_Intersects(srtm_dem.rast, ST_SetSRID(ST_MakePoint(11,46), 4326)) AND
  grid_code = ST_Value(corine_land_cover.rast, ST_Transform(ST_SetSRID(ST_MakePoint(11,46), 4326), 3035));
The result is the altitude (in meters), the land cover code, and the corresponding class descriptions (label2 and label3).
Now combine roads and administrative boundaries to compute how many meters of roads there are in each administrative unit. You first have to intersect the two layers (ST_Intersection), then compute the length (ST_Length) and summarize per administrative unit (sum() associated with GROUP BY clause).
SELECT
  nome_com,
  sum(ST_Length(ST_Intersection(roads.geom, adm_boundaries.geom)::geography))::integer AS total_length
FROM env_data.roads, env_data.adm_boundaries
WHERE ST_Intersects(roads.geom, adm_boundaries.geom)
GROUP BY nome_com
ORDER BY total_length DESC;
The result of the query is the total length of roads (in meters) for each administrative unit, in decreasing order.
The last examples are about the interaction between rasters and polygons. In this case, we compute some statistics (minimum, maximum, mean, and standard deviation) for the altitude within the study area:
SELECT
  (sum(ST_Area(((gv).geom)::geography)))/1000000 AS area,
  min((gv).val) AS alt_min,
  max((gv).val) AS alt_max,
  avg((gv).val) AS alt_avg,
  stddev((gv).val) AS alt_stddev
FROM
  (SELECT ST_Intersection(rast, geom) AS gv
   FROM env_data.srtm_dem, env_data.study_area
   WHERE ST_Intersects(rast, geom)) AS foo;
The result, from which it is possible to appreciate the large variability of altitude across the study area, is the area (in square kilometers) together with the minimum, maximum, mean, and standard deviation of the altitude.
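As an aside, ST_SummaryStats (mentioned in Topic 1) offers another route to the same kind of summary. The sketch below, assuming the same layers, clips each tile to the study area and then aggregates the per-tile summaries; note that the overall mean must be rebuilt as a count-weighted average (the per-tile standard deviations cannot be combined this simply, so they are omitted here):

```sql
SELECT
  min((stats).min) AS alt_min,
  max((stats).max) AS alt_max,
  -- weighted mean: total sum of pixel values over total pixel count
  sum((stats).sum) / sum((stats).count) AS alt_avg
FROM
  (SELECT ST_SummaryStats(ST_Clip(rast, geom)) AS stats
   FROM env_data.srtm_dem, env_data.study_area
   WHERE ST_Intersects(rast, geom)) AS tile_stats;
```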
You might also be interested in the number of pixels of each land cover type within the study area. As with the previous example, we first intersect the study area with the raster of interest, but in this case we need to reproject the study area polygon into the coordinate system of the Corine land cover raster (SRID: 3035). With the following query, you can see the dominance of mixed forests in the study area:
SELECT (pvc).value, SUM((pvc).count) AS total, label3
FROM
  (SELECT ST_ValueCount(rast) AS pvc
   FROM env_data.corine_land_cover, env_data.study_area
   WHERE ST_Intersects(rast, ST_Transform(geom, 3035))) AS cnts,
  env_data.corine_land_cover_legend
WHERE grid_code = (pvc).value
GROUP BY (pvc).value, label3
ORDER BY (pvc).value;
The result is:

| value | total | label3 |
|-------|-------|--------|
| 1 | 114 | Continuous urban fabric |
| 2 | 817 | Discontinuous urban fabric |
| 3 | 324 | Industrial or commercial units |
| 7 | 125 | Mineral extraction sites |
| 16 | 324 | Fruit trees and berry plantations |
| 19 | 237 | Annual crops associated with permanent crops |
| 20 | 1967 | Complex cultivation patterns |
| 21 | 2700 | Land principally occupied by agriculture |
| 27 | 586 | Moors and heathland |
| 32 | 611 | Sparsely vegetated areas |
The previous query can be modified to return the percentage of each class over the total number of pixels. This can be achieved using window functions, which are a valuable tool in many applications and worth learning:
SELECT
  (pvc).value,
  (SUM((pvc).count) * 100 / SUM(SUM((pvc).count)) OVER ())::numeric(4,2) AS total_perc,
  label3
FROM
  (SELECT ST_ValueCount(rast) AS pvc
   FROM env_data.corine_land_cover, env_data.study_area
   WHERE ST_Intersects(rast, ST_Transform(geom, 3035))) AS cnts,
  env_data.corine_land_cover_legend
WHERE grid_code = (pvc).value
GROUP BY (pvc).value, label3
ORDER BY (pvc).value;
The result is:

| value | total_perc | label3 |
|-------|------------|--------|
| 1 | 0.42 | Continuous urban fabric |
| 2 | 3.00 | Discontinuous urban fabric |
| 3 | 1.19 | Industrial or commercial units |
| 7 | 0.46 | Mineral extraction sites |
| 16 | 1.19 | Fruit trees and berry plantations |
| 19 | 0.87 | Annual crops associated with permanent crops |
| 20 | 7.23 | Complex cultivation patterns |
| 21 | 9.93 | Land principally occupied by agriculture |
| 27 | 2.15 | Moors and heathland |
| 32 | 2.25 | Sparsely vegetated areas |
- Find the administrative unit where each meteo station is located
- Find the land cover class where each meteo station is located
Summary exercise of Lesson 5
- Find the distance of each GPS position to the closest road
- Find the proportion of land cover classes used by the animals (i.e. where GPS positions are located)