Lesson 6. How to extract environmental information related to location data


The association of GPS positions with environmental attributes can be part of the preliminary processing before data analysis, where a set of procedures is created to intersect ancillary layers with GPS positions. Database tools like triggers and functions can be used for this purpose. The result is that positions are transformed from a simple pair of numbers (coordinates) to complex multi-dimensional (spatial) objects that define the individual and its habitat in time and space, including their interactions and dependencies. In an additional step, position data can also be joined to activity data to define an even more complete picture of the animal's behavior. Once this is implemented in the database framework, scientists and wildlife managers can deal with data in the same way they model the object of their study, as they can start their analyses from objects that represent the animals in their habitat (which previously was the result of a long and complex process). Moreover, users can directly query these objects using a simple and powerful language (SQL) that is close to their natural language. All these elements strengthen the opportunity provided by GPS data to move from mainly testing statistical hypotheses to focusing on biological hypotheses. Scientists can store, access, and manipulate their data in a simple and quick way, which allows them to formulate biological questions that previously were almost impossible to answer for technical reasons.
In this lesson, GPS data and ancillary information are connected with automated procedures. In a supplementary section, an example illustrates how to manage time series of environmental layers, which can introduce temporal variability in habitat conditions.

Topic 1. Associate environmental characteristics with GPS locations

Introduction

The goal of this exercise is to automatically transform position data from simple points to objects holding information about the habitat and conditions where the animals were located at a certain moment in time. We will use the points to automatically extract this information from other ecological layers by means of an SQL trigger.

Exercise

The first step is to add the new fields to the main.gps_data_animals table. We will add columns for the administrative unit to which the GPS position belongs, the code of the land cover class it is located in, the altitude from the digital elevation model (which can then be used as the third dimension of the point), the id of the closest meteorological station, and the distance to the closest road:

ALTER TABLE main.gps_data_animals 
  ADD COLUMN pro_com integer;
ALTER TABLE main.gps_data_animals 
  ADD COLUMN corine_land_cover_code integer;
ALTER TABLE main.gps_data_animals 
  ADD COLUMN altitude_srtm integer;
ALTER TABLE main.gps_data_animals 
  ADD COLUMN station_id integer;
ALTER TABLE main.gps_data_animals 
  ADD COLUMN roads_dist integer;

These are some common examples of environmental information that can be associated with GPS positions, and others can be implemented according to specific needs. It is important to keep in mind that these spatial relationships are implicitly determined by the coordinates of the elements involved; you do not necessarily have to store these values in a table, as you can compute them on the fly whenever you need them. Moreover, you might need different information according to the specific study (e.g. the land cover composition in an area of one kilometre around each GPS position instead of the value of the pixel where the point is located). Computing these spatial relationships on the fly can require significant time, so in some cases it is preferable to run the query just once and permanently store the most relevant parameters for your specific study (think about what you will most likely use often). Another advantage of making the relations explicit within tables is that you can then create indexes on the columns of these tables, which is not possible with on-the-fly sub-queries.
Making many small queries, and hence creating many tables and indexing them along the way, is generally more efficient in terms of processing time than trying to do everything in a single long and complex query. This is not necessarily true when the data set is small enough, as indexes are mostly efficient on large tables. Sometimes, the time necessary to write many SQL statements and the associated indexes exceeds the time necessary to execute them. In that case, it might be more efficient to write a single, long, and complex statement and forget about the indexes. This does not apply to the following trigger function, as all the ecological layers were well indexed at load time and it does not rely on intermediate sub-queries of those layers.
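As an illustration of the on-the-fly approach, here is a minimal sketch that counts locations per administrative unit directly from the geometries, without using any stored column (it assumes the adm_boundaries layer loaded in the previous lesson and the geometries already present in gps_data_animals):

SELECT 
  b.pro_com, 
  count(*) AS n_locations
FROM 
  main.gps_data_animals a, 
  env_data.adm_boundaries b
WHERE 
  a.geom IS NOT NULL AND 
  ST_Intersects(a.geom, b.geom)
GROUP BY 
  b.pro_com
ORDER BY 
  n_locations DESC;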
The next step is to implement the computation of these parameters inside the automated process of associating GPS positions with animals (from gps_data to gps_data_animals). To achieve this goal, you have to modify the trigger function tools.new_gps_data_animals. In fact, the function tools.new_gps_data_animals is activated whenever a new location is inserted into gps_data_animals (from gps_data). It adds new information (i.e. fills additional fields) to the incoming record (e.g. creates the geometry object from latitude and longitude values) before it is uploaded into the gps_data_animals table (in the code, NEW is used to reference the new record not yet inserted). The SQL code that does this is below. The drawback of this function is that it will slow down the import of a large set of positions at once (e.g. millions or more), but it has no practical impact when you manage a continuous data flow from sensors, even for a large number of sensors deployed at the same time.

CREATE OR REPLACE FUNCTION tools.new_gps_data_animals()
RETURNS trigger AS
$BODY$
DECLARE 
  thegeom geometry;
BEGIN

IF NEW.longitude IS NOT NULL AND NEW.latitude IS NOT NULL THEN
  thegeom = ST_SetSRID(ST_MakePoint(NEW.longitude, NEW.latitude), 4326);
  NEW.geom = thegeom;
  NEW.pro_com = 
    (SELECT pro_com::integer 
    FROM env_data.adm_boundaries 
    WHERE ST_Intersects(geom,thegeom)); 
  NEW.corine_land_cover_code = 
    (SELECT ST_Value(rast,ST_Transform(thegeom,3035)) 
    FROM env_data.corine_land_cover 
    WHERE ST_Intersects(ST_Transform(thegeom,3035), rast));
  NEW.altitude_srtm = 
    (SELECT ST_Value(rast,thegeom) 
    FROM env_data.srtm_dem 
    WHERE ST_Intersects(thegeom, rast));
  NEW.station_id = 
    (SELECT station_id::integer 
    FROM env_data.meteo_stations 
    ORDER BY ST_Distance_Spheroid(thegeom, geom, 'SPHEROID["WGS 84",6378137,298.257223563]') 
    LIMIT 1);
  NEW.roads_dist = 
    (SELECT ST_Distance(thegeom::geography, geom::geography)::integer 
    FROM env_data.roads 
    ORDER BY ST_distance(thegeom::geography, geom::geography) 
    LIMIT 1);
END IF;

RETURN NEW;
END;$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
ALTER FUNCTION tools.new_gps_data_animals()
OWNER TO postgres;
COMMENT ON FUNCTION tools.new_gps_data_animals() 
IS 'When called by the trigger insert_gps_positions (raised whenever a new position is uploaded into gps_data_animals) this function gets the longitude and latitude values and sets the geometry field accordingly, computing a set of derived environmental information calculated intersecting or relating the position with the environmental ancillary layers.';
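Before relying on the trigger, you can check the extraction logic manually on a sample point. A minimal sketch, using arbitrary test coordinates inside the study area and the same layers referenced in the function:

-- Administrative unit code for a hypothetical test point (11.1, 46.1)
SELECT pro_com 
FROM env_data.adm_boundaries 
WHERE ST_Intersects(geom, ST_SetSRID(ST_MakePoint(11.1, 46.1), 4326));

-- Altitude from the digital elevation model at the same test point
SELECT ST_Value(rast, ST_SetSRID(ST_MakePoint(11.1, 46.1), 4326)) AS altitude
FROM env_data.srtm_dem 
WHERE ST_Intersects(ST_SetSRID(ST_MakePoint(11.1, 46.1), 4326), rast);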

As the trigger function is run during location import, the function only works on locations that are imported after it was created, and not on previously imported locations. To see the effects, you have to add new positions or delete and reload the GPS positions stored in gps_data_animals. You can do this by saving the records in gps_sensors_animals in an external .csv file, and then deleting the records from the table (which also deletes the records in gps_data_animals in a cascading effect). When you reload them, the new function will be activated by the trigger that was just defined, and the new attributes will be calculated. You can perform these steps with the following commands.
First, check how many records you have per animal:

SELECT animals_id, count(animals_id) 
FROM main.gps_data_animals 
GROUP BY animals_id;

The result is

animals_id count
4 2869
5 2924
2 2624
1 2114
3 2106

Then, copy the content of the table main.gps_sensors_animals into an external file.

COPY 
  (SELECT animals_id, gps_sensors_id, start_time, end_time, notes 
FROM main.gps_sensors_animals)
TO 
  'c:/tracking_db/test/gps_sensors_animals.csv' 
  WITH (FORMAT csv, DELIMITER ';');

You then delete all the records in main.gps_sensors_animals, which will delete (in a cascade) all the records in main.gps_data_animals.

DELETE FROM main.gps_sensors_animals;

You can verify that there are now no records in main.gps_data_animals (the query should return 0 rows).

SELECT * FROM main.gps_data_animals;

The final step is to reload the .csv file into main.gps_sensors_animals. This will launch the trigger functions that recreate all the records in main.gps_data_animals, in which the fields related to environmental attributes are also automatically updated. Note that, due to the different triggers, which involve intensive computations, the query can take several minutes to execute (as an alternative, you can skip the reload and speed up the process by simply calculating the environmental attributes with an update query, as sketched after the next statement).

COPY main.gps_sensors_animals 
  (animals_id, gps_sensors_id, start_time, end_time, notes) 
FROM 
  'c:/tracking_db/test/gps_sensors_animals.csv' 
  WITH (FORMAT csv, DELIMITER ';');
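If you prefer the update-query alternative mentioned above, here is a minimal sketch that recomputes one of the environmental attributes (the administrative unit code) for the existing records; the other columns can be updated in the same way, reusing the expressions from the trigger function:

UPDATE main.gps_data_animals 
SET pro_com = 
  (SELECT pro_com::integer 
  FROM env_data.adm_boundaries 
  WHERE ST_Intersects(geom, main.gps_data_animals.geom))
WHERE geom IS NOT NULL;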

You can verify that all the fields were updated:

SELECT 
  gps_data_animals_id AS id, acquisition_time, pro_com, 
  corine_land_cover_code AS lc_code, altitude_srtm AS alt, 
  station_id AS meteo, roads_dist AS dist
FROM 
  main.gps_data_animals 
WHERE 
  geom IS NOT NULL
LIMIT 5;

The result is

id acquisition_time pro_com lc_code alt meteo dist
15275 2005-10-23 22:00:53+02 22091 18 1536 5 812
15276 2005-10-24 02:00:55+02 22091 18 1519 5 740
15277 2005-10-24 06:00:55+02 22091 18 1531 5 598
15280 2005-10-24 18:02:57+02 22091 23 1198 5 586
15281 2005-10-24 22:01:49+02 22091 25 1480 5 319

You can also check that all the records of every animal are in main.gps_data_animals:

SELECT animals_id, count(*) FROM main.gps_data_animals GROUP BY animals_id;

The result is

animals_id count
4 2869
5 2924
2 2624
1 2114
3 2106

As you can see, the whole process can take a few minutes, as you are calculating the environmental attributes for the whole data set at once. As discussed in the previous chapters, the use of triggers and indexes to automate the data flow and speed up analyses might imply processing times that are not sustainable when large data sets are imported at once. In this case, it might be preferable to update environmental attributes and calculate indexes at a later stage to speed up the import process. In this book, we assume that in the operational environment where the database is developed, the data flow is continuous, with large but still limited data sets imported at intervals. You can compare this processing time with what is generally required to achieve the same result in a classic GIS environment based on flat files (e.g. shapefiles, .tif). Do not forget that you can use these minutes for a coffee break while the database does the job for you, instead of clicking here and there in your favorite GIS application!

Exercise

  1. Retrieve the average altitude per month/year for all animals, ordered by animal, year, and month, and verify whether there is any temporal pattern.
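A minimal sketch of one possible solution, using the altitude_srtm column populated above:

SELECT 
  animals_id, 
  extract(year FROM acquisition_time) AS year, 
  extract(month FROM acquisition_time) AS month, 
  avg(altitude_srtm)::integer AS avg_altitude
FROM main.gps_data_animals
WHERE altitude_srtm IS NOT NULL
GROUP BY animals_id, extract(year FROM acquisition_time), extract(month FROM acquisition_time)
ORDER BY animals_id, year, month;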

Supplementary code: Tracking animals in a dynamic environment with remote sensing image time series

The advancement in movement ecology from a data perspective can reach its full potential only by combining the technology of animal tracking with the technology of other environmental sensing programmes. Ecology is fundamentally spatial, and animal ecology is obviously no exception. Any scientific question in animal ecology cannot overlook its spatial dimension, and in particular the dynamic interaction between individual animals or populations, and the environment in which the ecological processes occur. Movement provides the mechanistic link to explain this complex ecosystem interaction, as the movement path is dynamically determined by external factors, through their effect on the individual's state and the life-history characteristics of an animal. Therefore, most modelling approaches for animal movement include environmental factors as explanatory variables. This technically implies the intersection of animal locations with environmental layers, in order to extract the information about the environment that is embedded in spatial coordinates. It appears very clear at this stage, though, that animal locations are not only spatial, but are fully defined by spatial and temporal coordinates (as given by the acquisition time).
Logically, the same temporal definition also applies to environmental layers. Some characteristics of the landscape, such as land cover or road networks, can be considered static over a large period of time (on the order of several years), and these static environmental layers are commonly intersected with animal locations to infer habitat use and selection by animals. However, many characteristics actually relevant to wildlife, such as vegetation biomass or road traffic, are indeed subject to temporal variability (on the order of hours to weeks) in the landscape, and would be better represented by dynamic layers that correspond closely to the conditions actually encountered by an animal moving across the landscape. In this case, using static environmental layers directly limits the potential of wildlife-tracking data, reduces the power of inference of statistical models, and sometimes even introduces sources of bias.
Nowadays, satellite-based remote sensing can provide dynamic global coverage of medium-resolution images that can be used to compute a large number of environmental parameters very useful to wildlife studies. Through remote sensing, it is possible to acquire spatial time series which can then be linked to animal locations, fully exploiting the spatio-temporal nature of wildlife tracking data. Numerous satellite and other sensor networks can now provide information on resources on a monthly, weekly or even daily basis, which can be used as explanatory variables in statistical models or to parametrize Bayesian inferences or mechanistic models. The first set of environmental data made available as a time series was the Normalized Difference Vegetation Index (NDVI), but other examples include data sets on snow cover, ocean primary productivity, surface temperature, or salinity.
The main shortcoming of such remote-sensing layers is their relatively low spatial resolution, which does not match the current average accuracy of wildlife-tracking GPS locations (errors below 20 m), thus potentially leading to a spatial mismatch between the animal-based information and the environmental layers (note that the resolution can still be perfectly fine, depending on the overall spatial variability and the species and biological process under study). Yet, this is much more desirable than using static layers when the temporal variability is an essential component of the ecological inference. Higher-resolution images and new types of information (e.g. forest structure) are now provided by new types of sensors, such as lidar, radar, or hyper-spectral remote-sensing technology and the new Sentinel-2 (optical data). The new generation of satellites will probably require dedicated storage and analysis tools (e.g. Google Earth Engine) that can be related to the Big Data framework.
Here, we discuss the integration in the spatial database of one of the most used indexes for ecological productivity and phenology, i.e. NDVI, derived from MODIS images. The intersection of NDVI images with GPS locations requires a system that is able to handle large amounts of data and explicitly manage both spatial and temporal dimensions, which makes PostGIS an ideal candidate for the task.

The MODIS (Moderate Resolution Imaging Spectroradiometer) instrument operates on board NASA's Terra and Aqua spacecraft. The instrument views the entire Earth's surface every 1 to 2 days, capturing data in 36 spectral bands ranging in wavelength from 0.4 μm to 14.4 μm and at varying spatial resolutions (250 m, 500 m and 1 km). The global MODIS vegetation indices (product code MOD13Q1) are designed to provide consistent spatial and temporal comparisons of vegetation conditions. Red and near-infrared reflectances, centred at 645 nm and 858 nm respectively, are used to determine the daily vegetation indices, including the well-known NDVI. This index is calculated by contrasting intense chlorophyll pigment absorption in the red against the high reflectance of leaf mesophyll in the near infrared. It is a proxy of plant photosynthetic activity and has been found to be highly related to the green leaf area index (LAI) and to the fraction of photosynthetically active radiation absorbed by vegetation (FAPAR). Past studies have demonstrated the potential of using NDVI data to study vegetation dynamics. More recently, several applications have been developed using MODIS NDVI data, such as land-cover change detection, monitoring of forest phenophases, modelling of wheat yield, and other applications in forest and agricultural sciences. However, the utility of the MODIS NDVI data products is limited by the availability of high-quality data (e.g. cloud-free), and several processing steps are required before using the data: acquisition via web facilities, re-projection from the native sinusoidal projection to a standard latitude-longitude format, and possibly the mosaicking of two or more tiles into a single image. A number of processing techniques to 'smooth' the data and obtain a cleaned (cloud-free) time series of NDVI imagery have also been implemented. These kinds of processes are usually based on ancillary information on the data quality of each pixel, which is provided together with MODIS NDVI.
In the framework of the present project, NDVI data over the study area have been acquired from the Boku University portal as weekly smoothed data. In the next figure you can see an example of non-smoothed and smoothed temporal profiles.

Figure: Environmental layers

Raster time series are quite common for medium- and low-resolution data sets generated by satellites that record information at the same location on Earth at regular time intervals. In this case, each pixel has both a spatial and a temporal reference. In this exercise you integrate an NDVI data set of MODIS images covering the period 2005-2008 (spatial resolution of 1 km and temporal resolution of 7 days, divided into 4 tiles). In this example, you will use the env_data schema to store the raster time series, in order to keep it transparent to the user: all environmental data (static or dynamic) is in this schema. However, for larger amounts of data, it might be useful to store raster time series in a different schema (e.g. env_data_ts, where ts stands for time series) to support an easier and more efficient back-up procedure.
When you import a raster image using raster2pgsql, a new record is added to the target table for each raster, including one for each tile if the raster was tiled. At this point, each record does not consider time yet, and is thus simply a spatial object. To transform each record into a spatio-temporal object, you must add a field with the timestamp of the data acquisition, or, better, the time range covered by the data if it is related to a period. The time associated with each raster (and each tile) can usually be derived from the name of the file, where this information is typically embedded. In the case of these MODIS composites, this is the first day of the 7-day period associated with the image, encoded in the filename as MCD13Q1.AyyyyDDD.005.250m_7_days_NDVI.REFMIDw.tif (where yyyy is the year and DDD the day of the year, from 1 to 365). Each image is the result of clipping a mosaic of 4 images derived from Boku's grid tiles. Note also that values are scaled, i.e. converted to 1 byte (on a 0-250 scale) using the formula:

NDVI value = 0.0048 * digital value - 0.2

For example, a digital value of 150 corresponds to an NDVI of 0.0048 * 150 - 0.2 = 0.52. Values between 251 and 255 are used to flag invalid data.

With this information, you can now associate each image or tile with the correct time reference, that is, the 7-day period covered by each raster. This will make the spatio-temporal intersection with GPS positions possible by allowing direct comparisons with GPS timestamps.
To start, create an empty table to store the NDVI images, including a field for the temporal reference (of type date), and its index:

CREATE TABLE env_data.ndvi_modis(
  rid serial NOT NULL, 
  rast raster, 
  filename text,
  acquisition_date date,
  CONSTRAINT ndvi_modis_pkey
    PRIMARY KEY (rid));
CREATE INDEX ndvi_modis_wkb_rast_idx 
  ON env_data.ndvi_modis 
  USING GIST (ST_ConvexHull(rast));
COMMENT ON TABLE env_data.ndvi_modis
IS 'Table that stores values of smoothed MODIS NDVI (7-day periods).';

Now the trick is to use two arguments of the raster2pgsql command: -F, to include the raster file name as a column of the table (which will then be used by the trigger function), and -a, to append the data to an existing table instead of creating a new one. You can import all the rasters in a single operation by using the wildcard character '*' in the input filename. You can thus run the following command in the Command Prompt (warning: you might need to adjust the raster path according to your own setup):

"C:\Program Files\PostgreSQL\9.2\bin\raster2pgsql.exe" -a -C -F -M -s 4326 -t 40x40 C:\tracking_db\data\env_data\raster\raster_ts*.tif env_data.ndvi_modis | psql -p 5432 -d gps_tracking_db -U postgres

Each raster file embeds the acquisition period in its filename. For instance, MCD13Q1.A2005003.005.250m_7_days_NDVI.REFMIDw.tif is associated with the 3rd day of 2005 (3rd January). As you can see, the date is encoded in the 7 characters that follow the common prefix MCD13Q1.A (i.e. starting at the 10th character). This allows you to use the to_date function to extract the date from the filename (which was automatically stored in the filename field during the import). For instance, you can extract the starting date from the first raster imported:

SELECT 
  filename, 
  to_date(substring(filename FROM 10 FOR 7) , 'YYYYDDD') as date
FROM 
  env_data.ndvi_modis LIMIT 1;

The result is

filename date
MCD13Q1.A2005003.005.250m_7_days_NDVI.REFMIDw.tif 2005-01-03

In the case of more complex filenames with a variable number of characters, you could still retrieve the encoded date by extracting the relevant characters relative to some fixed pattern, located with the position function, as sketched below.
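A minimal sketch, assuming the date always follows the '.A' marker in the filename:

SELECT 
  filename, 
  to_date(substring(filename FROM position('.A' IN filename) + 2 FOR 7), 'YYYYDDD') AS date
FROM env_data.ndvi_modis 
LIMIT 1;

Let's now update the table by converting the filenames into dates according to the convention used in file naming: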

UPDATE env_data.ndvi_modis
SET acquisition_date = to_date(substring(filename FROM 10 FOR 7) , 'YYYYDDD');

As for any type of column, if the table contains a large number of rows (e.g. > 10,000), querying based on acquisition_date will be faster if you first index it (you can do this even if the table is not that big, as the PostgreSQL planner will determine whether the query will be faster by using the index or not):

CREATE INDEX ndvi_modis_acquisition_date_idx 
ON env_data.ndvi_modis (acquisition_date);

Now each tile (and therefore each pixel) has a spatial and a temporal component and thus can be queried according to both criteria.
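For instance, a minimal check of how many tiles cover a given week (using the date extracted in the example above):

SELECT count(*) 
FROM env_data.ndvi_modis 
WHERE acquisition_date = '2005-01-03';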
Based on this, you can now create a trigger and its associated function to automatically assign the appropriate date during the import of NDVI data (for future NDVI images). Note that the ndvi_acquisition_date_update function will be activated before an NDVI tile is loaded, so that the transaction is aborted if, for any reason, the acquisition_date cannot be computed, and only valid rows are inserted into the ndvi_modis table:

CREATE OR REPLACE FUNCTION tools.ndvi_acquisition_date_update()
RETURNS trigger AS
$BODY$
BEGIN
  NEW.acquisition_date = to_date(substring(new.filename FROM 10 FOR 7) , 'YYYYDDD');
RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
COMMENT ON FUNCTION tools.ndvi_acquisition_date_update() 
IS 'This function is called by a trigger whenever a new record is inserted into the MODIS NDVI time series table, in order to define the acquisition date. The acquisition_date value is derived from the original filename.';
CREATE TRIGGER update_ndvi_acquisition_range
BEFORE INSERT ON env_data.ndvi_modis
  FOR EACH ROW EXECUTE PROCEDURE tools.ndvi_acquisition_date_update();

Every time you add new NDVI rasters, the acquisition_date will then be updated appropriately.

To intersect a GPS position with this kind of data set, both temporal and spatial criteria must be defined. In the next example, you retrieve the MODIS NDVI value at the point (11.1, 46.1), using the PostGIS ST_Value function, for the whole year 2005:

SELECT 
  rid,
  acquisition_date,
  ST_Value(rast, ST_SetSRID(ST_MakePoint(11.1, 46.1), 4326)) * 0.0048 - 0.2
    AS ndvi
FROM env_data.ndvi_modis
WHERE ST_Intersects(ST_SetSRID(ST_MakePoint(11.1, 46.1), 4326), rast) AND
  EXTRACT(year FROM acquisition_date) = 2005
ORDER BY acquisition_date;

The result gives you the complete NDVI profile at this location for the year 2005.
You can now retrieve NDVI values at the coordinates of real animals. The value for a given day can be derived by interpolating the previous and the next NDVI image (the time series is smoothed and, given the low temporal variability of NDVI, you can assume that within a week the value changes linearly).

SELECT 
  gps_data_animals_id, 
  -- Linear interpolation between the previous (pre) and the next (post) weekly image,
  -- weighted by the number of days to/from the acquisition time, then rescaled to NDVI
  (trunc((ST_Value(pre.rast, geom) * (date_trunc('week', acquisition_time::date + 7)::date - acquisition_time::date)::integer +
  ST_Value(post.rast, geom) * (acquisition_time::date - date_trunc('week', acquisition_time::date)::date))::integer/7)) * 0.0048 - 0.2 AS ndvi
FROM  
  main.gps_data_animals, 
  env_data.ndvi_modis pre, 
  env_data.ndvi_modis post
WHERE
  ST_Intersects(geom, pre.rast) AND 
  ST_Intersects(geom, post.rast) AND 
  -- pre: image dated at the start of the week of the GPS fix; post: start of the following week
  date_trunc('week', acquisition_time::date)::date = pre.acquisition_date AND 
  date_trunc('week', acquisition_time::date + 7)::date = post.acquisition_date
LIMIT 10;

You can now create a new field to permanently store the NDVI value in the gps_data_animals table:

ALTER TABLE main.gps_data_animals 
ADD COLUMN ndvi_modis double precision;

Now, let's update the value of this column:

UPDATE 
  main.gps_data_animals
SET 
  ndvi_modis = (trunc((ST_Value(pre.rast, geom) * (date_trunc('week', acquisition_time::date + 7)::date - acquisition_time::date)::integer +
  ST_Value(post.rast, geom) * (acquisition_time::date - date_trunc('week', acquisition_time::date)::date))::integer/7)) * 0.0048 - 0.2
FROM  
  env_data.ndvi_modis pre, 
  env_data.ndvi_modis post 
WHERE
  geom IS NOT NULL AND 
  ST_Intersects(geom, pre.rast) AND 
  ST_Intersects(geom, post.rast) AND 
  date_trunc('week', acquisition_time::date)::date = pre.acquisition_date AND 
  date_trunc('week', acquisition_time::date + 7)::date = post.acquisition_date;

If you want, you can now try to use triggers to automate the process, as you did with the other environmental variables, by extending the trigger function new_gps_data_animals; a sketch of a possible extension follows.
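A minimal sketch of such an extension, to be added inside the IF block of tools.new_gps_data_animals (it reuses the interpolation logic above and assumes the ndvi_modis column and table created in this section):

  -- Sketch: weekly NDVI interpolated at the acquisition time of the incoming record
  NEW.ndvi_modis = 
    (SELECT (trunc((ST_Value(pre.rast, thegeom) * (date_trunc('week', NEW.acquisition_time::date + 7)::date - NEW.acquisition_time::date)::integer +
      ST_Value(post.rast, thegeom) * (NEW.acquisition_time::date - date_trunc('week', NEW.acquisition_time::date)::date))::integer/7)) * 0.0048 - 0.2
    FROM env_data.ndvi_modis pre, env_data.ndvi_modis post
    WHERE ST_Intersects(thegeom, pre.rast) AND ST_Intersects(thegeom, post.rast) AND
      date_trunc('week', NEW.acquisition_time::date)::date = pre.acquisition_date AND
      date_trunc('week', NEW.acquisition_time::date + 7)::date = post.acquisition_date);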

Summary exercise of Lesson 6

  1. Is the average altitude higher for males or females?
  2. What is the average monthly NDVI for each animal?
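A minimal sketch of a possible starting point for the second question, using the ndvi_modis column added above (the first question also needs information from the animals table defined in earlier lessons):

SELECT 
  animals_id, 
  extract(year FROM acquisition_time) AS year, 
  extract(month FROM acquisition_time) AS month, 
  avg(ndvi_modis) AS avg_ndvi
FROM main.gps_data_animals
WHERE ndvi_modis IS NOT NULL
GROUP BY animals_id, extract(year FROM acquisition_time), extract(month FROM acquisition_time)
ORDER BY animals_id, year, month;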