Lesson 4. Spatial is not special: how to manage the locations data in a spatial database: PostGIS


A wildlife tracking data management system must include the capability to explicitly deal with the spatial component of movement data. GPS tracking data are sets of spatio-temporal objects (locations) and the spatial component must be properly handled. You will now extend the database adding spatial functionalities through the PostgreSQL spatial extension called PostGIS. PostGIS introduces the spatial data types (both vector and raster) and a large set of SQL spatial functions and tools, including spatial indexes. This possibility essentially allows you to build a GIS using the existing capabilities of relational databases. In this lesson, you will implement a system that automatically transforms the GPS coordinates generated by GPS sensors from a pair of numbers into spatial objects.

Topic 1. Transforming GPS Coordinates into a Spatial Object

Introduction

At the moment, your data are stored in the database and the GPS positions are linked to individuals. While time is properly managed, coordinates are still just two decimal numbers (longitude and latitude) and not spatial objects. It is therefore not possible to find the distance between two points, or the length of a trajectory, or the speed and angle of the step between two locations. In this chapter, you will learn how to add a spatial extension to your database and transform the coordinates into a spatial element (i.e. a point).
Until few years ago, the spatial information produced by GPS sensors was managed and analyzed using dedicated software (GIS) in file-based data formats (e.g. shapefile). Nowadays, the most advanced approaches in data management consider the spatial component of objects (e.g. a moving animal) as one of its many attributes: thus, while understanding the spatial nature of your data is essential to proper analysis, from a software perspective spatial is (less and less) not special. Spatial databases are the technical tool needed to implement this perspective. They integrate spatial data types (vector and raster) together with standard data types that store the objects' other (non-spatial) associated attributes. Spatial data types can be manipulated by SQL through additional commands and functions for the spatial domain. This possibility essentially allows you to build a GIS using the existing capabilities of relational databases. Moreover, while dedicated GIS software is usually focused on analyses and data visualization, providing a rich set of spatial operations, few are optimized for managing large spatial data sets (in particular, vector data) and complex data structures. Spatial databases, in turn, allow both advanced management and spatial operations that can be efficiently undertaken on a large set of elements. This combination of features is becoming essential, as with animal movement data sets the challenge is now on the extraction of synthetic information from very large data sets rather than on the extrapolation of new information (e.g. kernel home ranges from VHF data) from limited data sets with complex algorithms.

Example

The first step to do in order to spatially enable your database is to load the PostGIS extension, which can easily done with the following SQL command (many other extensions exist for PostgreSQL):

CREATE EXTENSION postgis;

Now you can use and exploit all the features offered by PostGIS in your database. The vector objects (points, lines, and polygons) are stored in a specific field of your tables as spatial data types. This field contains the structured list of vertexes, i.e. coordinates of the spatial object, and also includes its reference system. The PostGIS spatial (vectors) data types are not topological, although, if needed, PostGIS has a dedicated topological extension.

With PostGIS activated, you can create a field with geometry data type in your table (2D point feature with longitude/latitude WGS84 as reference system):

ALTER TABLE main.gps_data_animals 
  ADD COLUMN geom geometry(Point,4326);

You can create a spatial index:

CREATE INDEX gps_data_animals_geom_gist
  ON main.gps_data_animals
  USING gist (geom);

You can now populate it (excluding points that have no latitude/longitude):

UPDATE 
  main.gps_data_animals
SET 
  geom = ST_SetSRID(ST_MakePoint(longitude, latitude),4326)
WHERE 
  latitude IS NOT NULL AND longitude IS NOT NULL;

At this point, it is important to visualize the spatial content of your tables. PostgreSQL/PostGIS offers no tool for spatial data visualization, but this can be done by a number of client applications, in particular GIS desktop software like ESRI ArcGIS 10.x or QGIS. QGIS is a powerful and complete open source software. It offers all the functions needed to deal with spatial data. QGIS is the suggested GIS interface because it has many tools specifically for managing and visualizing PostGIS data. Especially remarkable is the tool DB Manager.
Now you can explore the GPS positions data set in QGIS (see figure below). The example is a view zoomed in on the study area rather than all points, because some outliers are located very far from Monte Bondone, affecting the default visualization. In the background you have the Google satellite layer loaded using the OpenLayer Plugin.

Figure: GPS locations

You can also use ArcGIS ESRI 10.x to visualize (but not natively edit, at least at the time of writing this text) your spatial data. Data can be accessed using “Query layers”. A query layer is a layer or stand-alone table that is defined by a SQL query. Query layers allow both spatial and non-spatial information stored in a (spatial) DBMS to be integrated into GIS projects within ArcMap. When working in ArcMap, you create query layers by defining a SQL query. The query is then run against the tables and views in a database, and the result set is added to ArcMap. Query layers behave like any other feature layer or stand-alone table, so they can be used to display data, used as input into a geoprocessing tool, or accessed using developer APIs. The query is executed every time the layer is displayed or used in ArcMap. This allows the latest information to be visible without making a copy or snapshot of the data and is especially useful when working with dynamic information that is frequently changing.

Exercise

  1. Find the distance of all locations of animal 2 to the point (11.0620855; 45.9878812)

Topic 2. Automating the Creation of Points from GPS Coordinates

Introduction

Working with massive data sets (i.e. many sensors at the same time) in near real time requires that routinely operations are done automatically to save time and to avoid errors of manual processing. Here you create a new function to update the geometry field as soon as a new record is uploaded.

Example

You can automate the population of the geometry column so that whenever a new GPS position is updated in the table main.gps_data_animals, the spatial geometry is also created. To do so, you need a trigger and its related function. Here is the SQL code to generate the function:

CREATE OR REPLACE FUNCTION tools.new_gps_data_animals()
RETURNS trigger AS
$BODY$
DECLARE 
thegeom geometry;
BEGIN

IF NEW.longitude IS NOT NULL AND NEW.latitude IS NOT NULL THEN
  thegeom = ST_SetSRID(ST_MakePoint(NEW.longitude, NEW.latitude),4326);
  NEW.geom = thegeom;
END IF;

RETURN NEW;
END;$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
COMMENT ON FUNCTION tools.new_gps_data_animals() 
IS 'When called by a trigger (insert_gps_locations) this function populates the field geom using the values from longitude and latitude fields.';

And here is the SQL code to generate the trigger:

CREATE TRIGGER insert_gps_location
  BEFORE INSERT
  ON main.gps_data_animals
  FOR EACH ROW
  EXECUTE PROCEDURE tools.new_gps_data_animals();

You can see the result by deleting all the records from the main.gps_data_animals table, e.g. for animal 2, and reloading them. As you have set an automatic procedure to synchronize main.gps_data_animals table with the information contained in the table main.gps_sensors_animals (supplementary code in Lesson 3), you can drop the animal 2 record from main.gps_sensors_animals and this will affect main.gps_data_animals in a cascade effect (note that it will not affect the original data in main.gps_data):

DELETE FROM 
  main.gps_sensors_animals 
WHERE 
  animals_id = 2;

There are now no rows for animal 2 in the table main.gps_data_animals. You can verify this by retrieving the number of locations per animal:

SELECT 
  animals_id, count(animals_id) 
FROM 
  main.gps_data_animals
GROUP BY 
  animals_id
ORDER BY 
  animals_id;

The result should be:

animals_id count
1 2114
3 2106
4 2869
5 2924

Note that animal 2 is not in the list. Now you reload the record in the main.gps_sensors_animals:

INSERT INTO main.gps_sensors_animals 
  (animals_id, gps_sensors_id, start_time, end_time, notes) 
VALUES 
  (2,1,'2005-03-20 16:03:14 +0','2006-05-27 17:00:00 +0','End of battery life. Sensor not recovered.');

You can see that records have been re-added to main.gps_data_animals by reloading the original data stored in main.gps_data, with the geometry field correctly and automatically populated (when longitude and latitude are not null):

SELECT 
  animals_id, count(animals_id) AS num_records, count(geom) AS num_records_valid 
FROM 
  main.gps_data_animals
GROUP BY 
  animals_id
ORDER BY 
  animals_id;

The result is

animals_id num_records num_records_valid
1 2114 1650
2 2624 2196
3 2106 1828
4 2869 2642
5 2924 2696

You can now play around with your spatial data set. For example, when you have a number of locations per animal, you can find the centroid of the area covered by the locations:

SELECT 
  animals_id, 
  ST_AsEWKT(
    ST_Centroid(
     ST_Collect(geom))) AS centroid 
FROM 
  main.gps_data_animals 
WHERE 
  geom IS NOT NULL 
GROUP BY 
  animals_id 
ORDER BY 
  animals_id;

The result is

animals_id centroid
1 SRID=4326;POINT(11.056405072 46.0065913348485)
2 SRID=4326;POINT(11.0388902698087 46.0118316898451)
3 SRID=4326;POINT(11.062054399453 46.0229784057986)
4 SRID=4326;POINT(11.0215063307722 46.0046905791446)
5 SRID=4326;POINT(11.0287071960312 46.0085975505935)

In this case you used the SQL command ST_Collect. This function returns a GEOMETRYCOLLECTION or a MULTI object from a set of geometries. The collect function is an 'aggregate' function in the terminology of PostgreSQL. That means that it operates on rows of data, in the same way the sum() and mean() functions do. ST_Collect and ST_Union are often interchangeable. ST_Collect is in general orders of magnitude faster than ST_Union because it does not try to dissolve boundaries. It merely rolls up single geometries into MULTI and MULTI or mixed geometry types into Geometry Collections. The contrary of ST_Collect is ST_Dump, which is a set-returning function.

Exercise

  1. Find the centroids all all the points for each month for animal 2

Topic 3. Creating Spatial Database Views

Views are queries permanently stored in the database. For users (and client applications), they work like normal tables, but their data are calculated at query time and not physically stored. Changing the data in a table alters the data shown in subsequent invocations of related views. Views are useful because they can represent a subset of the data contained in a table; can join and simplify multiple tables into a single virtual table; take very little space to store, as the database contains only the definition of a view (i.e. the SQL query), not a copy of all the data it presents; and provide extra security, limiting the degree of exposure of tables to the outer world. On the other hand, a view might take some time to return its data content. For complex computations that are often used, it is more convenient to store the information in a permanent table.

Example

You can create views where derived information is (virtually) stored. First, create a new schema where all the analysis can be accommodated and kept separated from the basic data:

CREATE SCHEMA analysis
  AUTHORIZATION postgres;
  GRANT USAGE ON SCHEMA analysis TO basic_user;
COMMENT ON SCHEMA analysis 
IS 'Schema that stores key layers for analysis.';
ALTER DEFAULT PRIVILEGES 
  IN SCHEMA analysis 
  GRANT SELECT ON TABLES 
  TO basic_user;

You can see below an example of a view in which just (spatially valid) positions of a single animal are included, created by joining the information with the animal and look-up tables.

CREATE VIEW analysis.view_gps_locations AS 
  SELECT 
    gps_data_animals.gps_data_animals_id, 
    gps_data_animals.animals_id,
    animals.name,
    gps_data_animals.acquisition_time at time zone 'UTC' AS time_utc, 
    animals.sex, 
    lu_age_class.age_class_description, 
    lu_species.species_description,
    gps_data_animals.geom
  FROM 
    main.gps_data_animals, 
    main.animals, 
    lu_tables.lu_age_class, 
    lu_tables.lu_species
  WHERE 
    gps_data_animals.animals_id = animals.animals_id AND
    animals.age_class_code = lu_age_class.age_class_code AND
    animals.species_code = lu_species.species_code AND 
    geom IS NOT NULL;
COMMENT ON VIEW analysis.view_gps_locations
IS 'GPS locations.';

Although the best way to visualize this view is in a GIS environment (in QGIS you might need to explicitly define the unique identifier of the view, i.e. gps_data_animals_id), you can query its non-spatial content with

SELECT 
  gps_data_animals_id AS id, 
  name AS animal,
  time_utc, 
  sex, 
  age_class_description AS age, 
  species_description AS species
FROM 
  analysis.view_gps_locations
LIMIT 5;

The result is something similar to

id animal time_utc sex age species
62 Agostino 2005-03-20 16:03:14 m adult roe deer
64 Agostino 2005-03-21 00:03:06 m adult roe deer
65 Agostino 2005-03-21 04:01:45 m adult roe deer
67 Agostino 2005-03-21 12:02:19 m adult roe deer
68 Agostino 2005-03-21 16:01:12 m adult roe deer

Now you create view with a different representation of your data sets. In this case you derive a trajectory from GPS points. You have to order locations per animal and per acquisition time; then you can group them (animal by animal) in a trajectory (stored as a view):

CREATE VIEW analysis.view_trajectories AS 
  SELECT 
    animals_id, 
    ST_MakeLine(geom)::geometry(LineString,4326) AS geom 
  FROM 
    (SELECT animals_id, geom, acquisition_time 
    FROM main.gps_data_animals 
    WHERE geom IS NOT NULL 
    ORDER BY 
    animals_id, acquisition_time) AS sel_subquery 
  GROUP BY 
    animals_id;
COMMENT ON VIEW analysis.view_trajectories
IS 'GPS locations – Trajectories.';

In the figure below you can see analysis.view_trajectories visualized in QGIS.

Figure: Trajectories

Lastly, create one more view to spatially summarize the GPS data set using convex hull polygons (or minimum convex polygons):

CREATE VIEW analysis.view_convex_hulls AS
  SELECT 
    animals_id,
    (ST_ConvexHull(ST_Collect(geom)))::geometry(Polygon,4326) AS geom
  FROM 
    main.gps_data_animals 
  WHERE 
    geom IS NOT NULL 
  GROUP BY 
    animals_id 
  ORDER BY 
    animals_id 
COMMENT ON VIEW analysis.view_convex_hulls
IS 'GPS locations - Minimum convex polygons.';

The result is represented in the figure below, where you can clearly see the effect of the outliers located far from the study area.

Figure: Convex hulls

This last view is correct only if the GPS positions are located in a relatively small area (e.g. less than 50 kilometers) because the minimum convex polygon of points in geographic coordinates cannot be calculated assuming that coordinates are related to Euclidean space. At the moment the function ST_ConvexHull does not support the GEOGRAPHY data type, so the correct way to proceed would be to project the GPS locations in a proper reference system, calculate the minimum convex polygon and then convert the result back to geographic coordinates. In the example, the error is negligible.

Exercise

  1. Create a view of all the points of female animals, visualize it in QGIS and export as shapefile

Supplementary code: UTM zone of a given point in geographic coordinates

Here you create a simple function to automatically find the UTM zone at defined coordinates:

CREATE OR REPLACE FUNCTION tools.srid_utm(longitude double precision, latitude double precision)
RETURNS integer AS
$BODY$
DECLARE
  srid integer;
  lon float;
  lat float;
BEGIN
  lat := latitude;
  lon := longitude;

IF ((lon > 360 or lon < -360) or (lat > 90 or lat < -90)) THEN 
  RAISE EXCEPTION 'Longitude and latitude is not in a valid format (-360 to 360; -90 to 90)';
ELSEIF (longitude < -180)THEN 
  lon := 360 + lon;
ELSEIF (longitude > 180)THEN 
  lon := 180 - lon;
END IF;

IF latitude >= 0 THEN 
  srid := 32600 + floor((lon+186)/6); 
ELSE
  srid := 32700 + floor((lon+186)/6); 
END IF;

RETURN srid;
END;
$BODY$
LANGUAGE plpgsql VOLATILE STRICT
COST 100;
COMMENT ON FUNCTION tools.srid_utm(double precision, double precision) 
IS 'Function that returns the SRID code of the UTM zone where a point (in geographic coordinates) is located. For polygons or line, it can be used giving ST_x(ST_Centroid(the_geom)) and ST_y(ST_Centroid(the_geom)) as parameters. This function is typically used be used with ST_Transform to project elements with no prior knowledge of their position.';

Here an example to see the SRID of the UTM zone of the point at coordinates (11.001,46.001):

SELECT TOOLS.SRID_UTM(11.001,46.001) AS UTM_zone;

The result is 32632 that corresponds to UTM 32 N WGS84.

You can use this function to project points when you do not know the UTM zone. You can test this functionality with the following code:

SELECT
  ST_AsEWKT(
    ST_Transform(
      ST_SetSRID(ST_MakePoint(31.001,16.001), 4326),
      TOOLS.SRID_UTM(31.001,16.001))
  ) AS projected_point;

If you want to allow the user basic_user to project spatial data, you have to grant permission on the table spatial_ref_sys:

GRANT SELECT ON TABLE spatial_ref_sys TO basic_user;

Summary exercise of Lesson

  1. Create a view with a convex hull for all the points of every month for animal 2 and visualize in QGIS to check if there is any spatial pattern
  2. Calculate the area of the monthly convex hulls of animal 2 and verify if there is any temporal pattern
  3. Repeat the above exercises for all the animals