Lesson 5. Environmental layers: integration and management of spatial ancillary information


Animals move in and interact with complex environments that can be characterized by a set of spatial layers containing environmental data. Spatial databases can manage these different data sets in a unified framework, defining spatial and non-spatial relationships that simplify the analysis of the interaction between animals and their habitat. A large set of analyses can thus be performed directly in the database, with no need for dedicated GIS or statistical software. Such an approach moves the information content managed in the database from a geographical space to the animals' ecological space. This more comprehensive database model of the animals' movement ecology reduces the distance between physical reality and the way data are structured in the database, filling the semantic gap between the scientist's view of biological systems and their representation in the information system. This lesson shows how vector and raster layers can be included in the database and how you can handle them using (spatial) SQL. The database built so far is extended with environmental ancillary data sets.

Topic 1. Adding ancillary environmental layers

Introduction

In traditional information systems for wildlife tracking data management, position data are stored in a file-based spatial format (e.g. shapefile). Through a multi-step process in a GIS environment, position data are associated with a set of environmental attributes at an analytical stage (e.g. intersection of GPS positions with vector and raster environmental layers). This process is usually time-consuming and prone to error, implies data replication, and often has to be repeated for every new analysis. It also generally involves different tools for vector and raster maps. An advanced data management system should achieve the same result with an efficient (and, if needed, automated) procedure, possibly performed as a real-time routine management task. To do so, the first step is to integrate both position data and spatial ancillary information on the environment in a single framework. This is essential for exploring the animals' behavior and understanding the ecological relationships that can be revealed by tracking data. Spatial databases can manage these different data sets in a unified framework. This also benefits performance, as databases are optimized to run simple processes on large data sets like the ones generated by GPS sensors.
In this exercise, you will see how to integrate a number of spatial features (see figure below).

  • Points: meteorological stations (derived from MeteoTrentino)
  • Linestrings: roads network (derived from OpenStreetMap)
  • Polygons: administrative units (derived from ISTAT) and the study area.
  • Rasters: land cover (source: Corine) and digital elevation models (source: SRTM, see also Jarvis A, Reuter HI, Nelson A, Guevara E (2008) Hole-filled seamless SRTM data V4. International Centre for Tropical Agriculture (CIAT)).

Figure: Environmental layers

Example

The required and available data sets are specific to each species and study, so the goal of this example is to present a complete set of procedures that can be replicated and customized for different data sets. Once layers are integrated into the database, you are encouraged to visualize and explore them in a GIS environment (e.g. QGIS).
Once data are loaded into the database, you will extend the gps_data_animals table with the environmental attributes derived from the ancillary layers provided in the test data set. You will also modify the function tools.new_gps_data_animals to compute these values automatically. In addition, you are encouraged to develop your own (spatial) queries (e.g. detect how many times each animal crosses a road, calculate how many times two animals are in the same place at the same time).
It is a good practice to store your environmental layers in a dedicated schema in order to keep a clear database structure. Let's create the schema env_data:

CREATE SCHEMA env_data
  AUTHORIZATION postgres;
GRANT USAGE ON SCHEMA env_data TO basic_user;
COMMENT ON SCHEMA env_data 
IS 'Schema that stores environmental ancillary information.';
ALTER DEFAULT PRIVILEGES IN SCHEMA env_data 
  GRANT SELECT ON TABLES TO basic_user;

Now you can start importing the shapefiles of the (vector) environmental layers included in the test data set. One option is to use the drag-and-drop functionality of the DB Manager plugin in QGIS (dragging layers from the QGIS Browser).

Alternatively, a standard solution for importing shapefiles (vector data) is the shp2pgsql tool. shp2pgsql is an external command-line tool that cannot be run in an SQL interface as a regular SQL command can. The code below has to be run in a command-line interpreter (on Windows, this is also called the Command Prompt or MS-DOS shell). You will see other examples of external tools that are run in the same way, and it is very important to understand the difference between these and SQL commands. In these exercises, this difference is indicated by a different graphic layout. Start with the meteorological stations:

"C:\Program Files\PostgreSQL\9.4\bin\shp2pgsql.exe" -s 4326 -I C:\tracking_db\data\env_data\vector\meteo_stations.shp env_data.meteo_stations | "C:\Program Files\PostgreSQL\9.4\bin\psql.exe" -p 5432 -d gps_tracking_db -U postgres -h localhost

Note that the paths to shp2pgsql.exe and psql.exe (in this case C:\Program Files\PostgreSQL\9.4\bin) can differ according to the folder where you installed your version of PostgreSQL; you can easily verify them by searching for these two files. If you connect to the database remotely, you also have to change the address of the server (-h option). In the parameters, set the reference system (option -s) and create a spatial index for the new table (option -I). The output of shp2pgsql is a text file with the SQL that generates and populates the table env_data.meteo_stations. With the symbol '|' you 'pipe' (send directly) the SQL to the database (through the PostgreSQL interactive terminal psql), where it is automatically executed. You have to set the port (-p), the name of the database (-d), the user (-U) and the password, if requested. In this way, you complete the whole process with a single command. You can refer to the shp2pgsql documentation for more details. Also check that the path to your shapefile (meteo_stations.shp) is properly defined.
You can repeat the same operation for the study area layer:

"C:\Program Files\PostgreSQL\9.4\bin\shp2pgsql.exe" -s 4326 -I C:\tracking_db\data\env_data\vector\study_area.shp env_data.study_area | "C:\Program Files\PostgreSQL\9.4\bin\psql.exe" -p 5432 -d gps_tracking_db -U postgres -h localhost

Next, import the roads layer:

"C:\Program Files\PostgreSQL\9.4\bin\shp2pgsql.exe" -s 4326 -I C:\tracking_db\data\env_data\vector\roads.shp env_data.roads | "C:\Program Files\PostgreSQL\9.4\bin\psql.exe" -p 5432 -d gps_tracking_db -U postgres -h localhost

And finally the administrative boundaries:

"C:\Program Files\PostgreSQL\9.4\bin\shp2pgsql.exe" -s 4326 -I C:\tracking_db\data\env_data\vector\adm_boundaries.shp env_data.adm_boundaries | "C:\Program Files\PostgreSQL\9.4\bin\psql.exe" -p 5432 -d gps_tracking_db -U postgres -h localhost

Now the shapefiles are in the database as new tables (one table for each shapefile). You can visualize them through a GIS interface (e.g. QGIS). You can also retrieve a summary of the information from all vector layers available in the database with the following command:

SELECT * FROM geometry_columns;
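
You can also run a quick sanity check on the import by counting the features in each new table; a minimal verification query against the layers created above:

-- feature counts of the imported vector layers
SELECT 'meteo_stations' AS layer, count(*) AS num_features FROM env_data.meteo_stations
UNION ALL
SELECT 'study_area', count(*) FROM env_data.study_area
UNION ALL
SELECT 'roads', count(*) FROM env_data.roads
UNION ALL
SELECT 'adm_boundaries', count(*) FROM env_data.adm_boundaries;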

The primary method to import a raster layer is the command-line tool raster2pgsql, the equivalent of shp2pgsql for raster files, which converts GDAL-supported rasters into SQL suitable for loading into PostGIS. It can also load entire folders of raster files.
GDAL (Geospatial Data Abstraction Library) is a (free) library for reading, writing and processing raster geospatial data formats. It offers a set of simple but very powerful and fast command-line tools for raster data translation and processing. The related OGR library provides similar capabilities for simple vector data features. GDAL is used by most open-source spatial tools and by a large number of commercial software programs as well. You will probably benefit in particular from the tools gdalinfo (report a layer's basic metadata), gdal_translate (change data format or data type, cut), and gdalwarp (mosaicking, reprojection and warping utility).
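
For example, to inspect the basic metadata of the SRTM raster before loading it, you can run gdalinfo in the Command Prompt (assuming the GDAL command-line tools are installed, e.g. with QGIS/OSGeo4W, and available on your PATH):

REM assumes GDAL tools are on the PATH; adapt the file path to your installation
gdalinfo C:\tracking_db\data\env_data\raster\srtm_dem.tif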

An interesting feature of raster2pgsql is its capability to store rasters inside the database (in-db) or to keep them as files in the file system (out-db, with the raster2pgsql -R option). In the latter case, only raster metadata are stored in the database, not the pixel values themselves. Loading out-db raster metadata is much faster than loading the rasters completely into the database. Most operations at the pixel-value level (e.g. ST_SummaryStats) will have equivalent performance with out- and in-db rasters. Other functions, like ST_Tile, involving only the metadata, will be faster with out-db rasters. Another advantage of out-db rasters is that they remain accessible to external applications that cannot query databases (with SQL). However, the administrator must make sure that the link between the database (which stores the path to the raster file in the file system) is not broken (e.g. by moving or renaming the files). On the other hand, only in-db rasters can be generated with CREATE TABLE and modified with UPDATE statements. The best choice depends on the size of the data set and on considerations about performance and database management. A good practice is generally to load very large raster data sets as out-db and smaller ones as in-db, to save loading time and to avoid repeatedly backing up huge, static rasters.
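
For illustration only (not needed for this exercise), a raster can be registered as out-db simply by adding the -R option to the raster2pgsql call; a minimal sketch, registering the test SRTM file into a hypothetical table env_data.srtm_dem_outdb:

REM illustration only: -R registers the file as out-db; env_data.srtm_dem_outdb is a hypothetical table name
"C:\Program Files\PostgreSQL\9.4\bin\raster2pgsql.exe" -R -I -C -s 4326 -t 20x20 C:\tracking_db\data\env_data\raster\srtm_dem.tif env_data.srtm_dem_outdb | "C:\Program Files\PostgreSQL\9.4\bin\psql.exe" -p 5432 -d gps_tracking_db -U postgres -h localhost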

The QGIS plugin Load Raster to PostGIS can also be used to import raster data with a graphical interface. An important parameter to set when importing raster layers is the number of tiles (-t option). Tiles are small subsets of the image and correspond to a physical record in the table. This approach dramatically decreases the time required to retrieve information. The recommended values for the tile option range from 20x20 to 100x100. Here is the code (to be run in the Command Prompt) to transform a raster (the digital elevation model derived from SRTM) into the SQL code that is then used to physically load the raster into the database (as you did with shp2pgsql for vectors):

"C:\Program Files\PostgreSQL\9.4\bin\raster2pgsql.exe" -I -M -C -s 4326 -t 20x20 C:\tracking_db\data\env_data\raster\srtm_dem.tif env_data.srtm_dem | "C:\Program Files\PostgreSQL\9.4\bin\psql.exe" -p 5432 -d gps_tracking_db -U postgres -h localhost

You can repeat the same process on the land cover layer:

"C:\Program Files\PostgreSQL\9.4\bin\raster2pgsql.exe" -I -M -C -s 3035 C:\tracking_db\data\env_data\raster\corine06.tif -t 20x20 env_data.corine_land_cover | "C:\Program Files\PostgreSQL\9.4\bin\psql.exe" -p 5432 -d gps_tracking_db -U postgres -h localhost

The reference system of the Corine Land Cover data set is not geographic coordinates (SRID 4326) but ETRS89/ETRS-LAEA (SRID 3035), an equal-area projection over Europe. This must be specified with the -s option and kept in mind when this layer is combined with other spatial layers stored in a different reference system. As with shp2pgsql.exe, the -I option creates a spatial index on the loaded tiles, speeding up many spatial operations, and the -C option generates a set of constraints on the table, allowing it to be correctly listed in the raster_columns metadata table. The land cover raster identifies classes that are labeled by a code (an integer). To specify the meaning of the codes, you can add a table where they are described. In this example, the land cover layer is taken from the Corine project, whose classes are described by a hierarchical legend with three nested levels. The legend is provided in the test data set in the file corine_legend.csv. You import the legend table in two steps, first creating an empty table:

CREATE TABLE env_data.corine_land_cover_legend(
  grid_code integer NOT NULL,
  clc_l3_code character(3),
  label1 character varying,
  label2 character varying,
  label3 character varying,
  CONSTRAINT corine_land_cover_legend_pkey 
    PRIMARY KEY (grid_code ));
COMMENT ON TABLE env_data.corine_land_cover_legend
IS 'Legend of Corine land cover, associating the numeric code to the three nested levels.';

Then you load the data:

COPY env_data.corine_land_cover_legend 
FROM 
  'C:\tracking_db\data\env_data\raster\corine_legend.csv' 
  WITH (FORMAT csv, HEADER, DELIMITER ';');
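
As a quick check that the legend was loaded correctly, you can inspect the first few rows (the exact output depends on the content of the CSV file):

-- quick sanity check on the imported legend
SELECT grid_code, clc_l3_code, label3
FROM env_data.corine_land_cover_legend
ORDER BY grid_code
LIMIT 5;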

You can retrieve a summary of the information from all raster layers available in the database with the following command:

SELECT * FROM raster_columns;
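
You can also inspect the tiling of a loaded raster by looking at the metadata of individual tiles; a minimal sketch using ST_MetaData on the DEM:

-- show position and size of the first few tiles of the DEM
SELECT rid, (md).upperleftx, (md).upperlefty, (md).width, (md).height
FROM (
  SELECT rid, ST_MetaData(rast) AS md 
  FROM env_data.srtm_dem) AS tiles
LIMIT 3;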

To keep a well-documented database, add comments to describe all the spatial layers that you have added:

COMMENT ON TABLE env_data.adm_boundaries 
IS 'Layer (polygons) of administrative boundaries (comuni).';
COMMENT ON TABLE env_data.corine_land_cover 
IS 'Layer (raster) of land cover (from Corine project).';
COMMENT ON TABLE env_data.meteo_stations 
IS 'Layer (points) of meteo stations.';
COMMENT ON TABLE env_data.roads 
IS 'Layer (lines) of roads network.';
COMMENT ON TABLE env_data.srtm_dem 
IS 'Layer (raster) of digital elevation model (from SRTM project).';
COMMENT ON TABLE env_data.study_area 
IS 'Layer (polygons) of the boundaries of the study area.';

Exercise

  1. Load the roads network layer into a GIS environment, edit it by adding some additional roads (using OpenStreetMap or Google Earth as a reference), save the edits and export the layer as a shapefile

Topic 2. Querying spatial environmental data

Introduction

Now that the set of ancillary (spatial) information is loaded into the database, you can start exploring it using spatial SQL queries. With spatial SQL it is possible to run queries that explicitly handle the spatial relationships among the different spatial tables stored in the database. In the following examples, SQL statements show how to take advantage of PostGIS features to manage, explore and analyze spatial objects, with optimized performance and no need for dedicated GIS interfaces.

Example

You start playing with your spatial data by asking for the name of the administrative unit (comune, Italian commune) in which the point at coordinates (11, 46) (longitude, latitude) is located. Two functions are commonly used for the intersection of spatial elements: ST_Intersects and ST_Intersection. The former returns true if two features intersect, while the latter returns the geometry produced by the intersection of the objects. In this case, ST_Intersects is used to select the right comune:

SELECT 
  nome_com
FROM 
  env_data.adm_boundaries 
WHERE 
  ST_Intersects((ST_SetSRID(ST_MakePoint(11,46), 4326)), geom);

The result is Cavedine.
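
To see the difference with ST_Intersection, which returns a geometry rather than a boolean, you can compute, for example, the area (in km2) of each comune that falls inside the study area; a minimal sketch based on the layers loaded above:

-- area of overlap between each comune and the study area (illustration of ST_Intersection)
SELECT 
  nome_com,
  (ST_Area(ST_Intersection(adm_boundaries.geom, study_area.geom)::geography)/1000000)::numeric(8,2) AS area_km2
FROM 
  env_data.adm_boundaries, 
  env_data.study_area
WHERE 
  ST_Intersects(adm_boundaries.geom, study_area.geom)
ORDER BY 
  area_km2 DESC;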

In the second example, you compute the distance (rounded to the meter) from the point at coordinates (11, 46) to all the meteorological stations (ordered by distance) in the table env_data.meteo_stations. This information could be used, for example, to derive the precipitation and temperature for a GPS position at the given acquisition time, weighting the measurement from each station according to the distance to the point. In this case, ST_Distance_Spheroid is used. Alternatively, you could use ST_Distance and cast your geometries as geography data types.

SELECT 
  station_id, ST_Distance_Spheroid((ST_SetSRID(ST_MakePoint(11,46), 4326)), geom, 'SPHEROID["WGS 84",6378137,298.257223563]')::integer AS distance
FROM 
  env_data.meteo_stations
ORDER BY 
  distance;

The result is

station_id distance
1 2224
2 4080
5 4569
4 10085
3 10374
6 18755
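
As mentioned above, the same distances can also be computed with ST_Distance by casting the geometries to the geography data type, which performs the calculation on the spheroid and gives essentially the same result:

-- alternative formulation using the geography data type
SELECT 
  station_id, 
  ST_Distance((ST_SetSRID(ST_MakePoint(11,46), 4326))::geography, geom::geography)::integer AS distance
FROM 
  env_data.meteo_stations
ORDER BY 
  distance;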

In the third example, you compute the distance to the closest road:

SELECT 
  ST_Distance((ST_SetSRID(ST_MakePoint(11,46), 4326))::geography, geom::geography)::integer AS distance
FROM 
  env_data.roads
ORDER BY 
  distance 
LIMIT 1;

The result is 1560.

For users, the data type (vector, raster) used to store spatial information is not so relevant when they query their data: queries should transparently use any kind of spatial data as input. Users can then focus on the environmental model instead of worrying about the data model. In the next example, you intersect a point with two raster layers (altitude and land cover) in the same way you do for vector layers. In the case of land cover, the point must first be projected into the Corine reference system (SRID 3035). The raster layer stores just the Corine class code (an integer), while the legend is stored in the table env_data.corine_land_cover_legend. In the query, the class code is joined to the legend table and the code description is returned. This is an example of integration of both spatial and non-spatial elements in the same query.

SELECT 
  ST_Value(srtm_dem.rast,
  (ST_SetSRID(ST_MakePoint(11,46), 4326))) AS altitude,
  ST_Value(corine_land_cover.rast,
  ST_Transform((ST_SetSRID(ST_MakePoint(11,46), 4326)), 3035)) AS land_cover, 
  label2, 
  label3
FROM 
  env_data.corine_land_cover, 
  env_data.srtm_dem, 
  env_data.corine_land_cover_legend
WHERE 
  ST_Intersects(
    corine_land_cover.rast,
    ST_Transform((ST_SetSRID(ST_MakePoint(11,46), 4326)), 3035)) AND
  ST_Intersects(srtm_dem.rast,(ST_SetSRID(ST_MakePoint(11,46), 4326))) AND
  grid_code = ST_Value(
    corine_land_cover.rast,
    ST_Transform((ST_SetSRID(ST_MakePoint(11,46), 4326)), 3035));

The result is

altitude land_cover label2 label3
956 24 Forests Coniferous forest

Now combine roads and administrative boundaries to compute how many meters of roads there are in each administrative unit. You first have to intersect the two layers (ST_Intersection), then compute the length (ST_Length) and summarize per administrative unit (sum() combined with a GROUP BY clause).

SELECT 
  nome_com, 
  sum(ST_Length(
    (ST_Intersection(roads.geom, adm_boundaries.geom))::geography))::integer AS total_length
FROM 
  env_data.roads, 
  env_data.adm_boundaries 
WHERE 
  ST_Intersects(roads.geom, adm_boundaries.geom)
GROUP BY 
  nome_com 
ORDER BY 
  total_length desc;

The result of the query is

nome_com total_length
Trento 24552
Lasino 15298
Garniga Terme 12653
Calavino 6185
Cavedine 5802
Cimone 5142
Padergnone 4510
Vezzano 1618
Aldeno 1367
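
As a possible extension of this query (a sketch, assuming one polygon per comune), you can normalize the total road length by the comune area to obtain a road density:

-- road density (meters of road per km2) per comune; assumes one polygon per comune
SELECT 
  nome_com, 
  (sum(ST_Length(ST_Intersection(roads.geom, adm_boundaries.geom)::geography)) /
    (ST_Area(adm_boundaries.geom::geography)/1000000))::integer AS meters_per_km2
FROM 
  env_data.roads, 
  env_data.adm_boundaries 
WHERE 
  ST_Intersects(roads.geom, adm_boundaries.geom)
GROUP BY 
  nome_com, adm_boundaries.geom
ORDER BY 
  meters_per_km2 DESC;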

The last examples are about the interaction between rasters and polygons. In this case, we compute some statistics (minimum, maximum, mean, and standard deviation) for the altitude within the study area:

SELECT 
  (sum(ST_Area(((gv).geom)::geography)))/1000000 area,
  min((gv).val) alt_min, 
  max((gv).val) alt_max,
  avg((gv).val) alt_avg,
  stddev((gv).val) alt_stddev
FROM
  (SELECT 
    ST_Intersection(rast, geom) AS gv
  FROM 
    env_data.srtm_dem,
    env_data.study_area 
  WHERE 
    ST_Intersects(rast, geom)
) AS foo;

The result, from which it is possible to appreciate the large variability of altitude across the study area, is

area alt_min alt_max alt_avg alt_stddev
199.018552456188 180 2133 879.286157704969 422.56622698974
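
If your PostGIS version is 2.2 or later, similar statistics can be computed more directly at the pixel level with ST_SummaryStatsAgg applied to the clipped raster tiles (a sketch, assuming band 1 and excluding NoData values):

-- pixel-level statistics on the DEM clipped to the study area (requires PostGIS 2.2+)
SELECT 
  (stats).min AS alt_min, 
  (stats).max AS alt_max, 
  (stats).mean AS alt_avg, 
  (stats).stddev AS alt_stddev
FROM (
  SELECT ST_SummaryStatsAgg(ST_Clip(rast, geom), 1, true) AS stats
  FROM env_data.srtm_dem, env_data.study_area
  WHERE ST_Intersects(rast, geom)) AS clipped;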

You might also be interested in the number of pixels of each land cover type within the study area. As with the previous example, we first intersect the study area with the raster of interest, but in this case we need to reproject the study area polygon into the coordinate system of the Corine land cover raster (SRID: 3035). With the following query, you can see the dominance of mixed forests in the study area:

SELECT (pvc).value, SUM((pvc).count) AS total, label3
FROM 
  (SELECT ST_ValueCount(rast) AS pvc
  FROM env_data.corine_land_cover, env_data.study_area
  WHERE ST_Intersects(rast, ST_Transform(geom, 3035))) AS cnts, 
  env_data.corine_land_cover_legend
WHERE grid_code = (pvc).value
GROUP BY (pvc).value, label3
ORDER BY (pvc).value;

The result is

value total label3
1 114 Continuous urban fabric
2 817 Discontinuous urban fabric
3 324 Industrial or commercial units
7 125 Mineral extraction sites
16 324 Fruit trees and berry plantations
18 760 Pastures
19 237 Annual crops associated with permanent crops
20 1967 Complex cultivation patterns
21 2700 Land principally occupied by agriculture
23 4473 Broad-leaved forest
24 2867 Coniferous forest
25 8762 Mixed forest
26 600 Natural grasslands
27 586 Moors and heathland
29 1524 Transitional woodland-shrub
31 188 Bare rocks
32 611 Sparsely vegetated areas
41 221 Water bodies

The previous query can be modified to return the percentage of each class over the total number of pixels. This can be achieved using window functions (a valuable tool in many applications and well worth learning):

SELECT 
  (pvc).value, 
  (SUM((pvc).count)*100/
    SUM(SUM((pvc).count)) over ()
  )::numeric(4,2) AS total_perc, label3
FROM 
  (SELECT ST_ValueCount(rast) AS pvc
  FROM env_data.corine_land_cover, env_data.study_area
  WHERE ST_Intersects(rast, ST_Transform(geom, 3035))) AS cnts, 
  env_data.corine_land_cover_legend
WHERE grid_code = (pvc).value
GROUP BY (pvc).value, label3
ORDER BY (pvc).value;

The result is

value total_perc label3
1 0.42 Continuous urban fabric
2 3.00 Discontinuous urban fabric
3 1.19 Industrial or commercial units
7 0.46 Mineral extraction sites
16 1.19 Fruit trees and berry plantations
18 2.79 Pastures
19 0.87 Annual crops associated with permanent crops
20 7.23 Complex cultivation patterns
21 9.93 Land principally occupied by agriculture
23 16.44 Broad-leaved forest
24 10.54 Coniferous forest
25 32.21 Mixed forest
26 2.21 Natural grasslands
27 2.15 Moors and heathland
29 5.60 Transitional woodland-shrub
31 0.69 Bare rocks
32 2.25 Sparsely vegetated areas
41 0.81 Water bodies

Exercise

  1. Find the administrative unit where each meteo station is located
  2. Find the land cover class where each meteo station is located

Summary exercise of Lesson 5

  1. Find the distance of each GPS position to the closest road
  2. Find the proportion of land cover classes used by the animals (i.e. where GPS positions are located)