Lesson 1. Storing tracking data in an advanced database platform: PostgreSQL


The state-of-the-art technical tool for effectively and efficiently managing tracking data is the spatial relational database. Using databases to manage tracking data implies a considerable effort for those who are not already familiar with these tools, but this is necessary to be able to deal with the data coming from the new sensors. Moreover, the time spent learning databases will be largely repaid by the time saved in managing and processing the data. In this lesson, you are guided through setting up a new database and creating, in a dedicated schema, a table to accommodate the test GPS data sets.

This lesson describes how to upload into the database the raw GPS data coming from five sensors deployed on roe deer in the Italian Alps, and how to create additional database users. The reference software platform used is the open source PostgreSQL with its spatial extension PostGIS. The reference (graphical) interface used to work with the database is pgAdmin. All the examples provided (SQL code) and technical solutions proposed are tailored to this software, although most of the code can be easily adapted for other platforms.

Topic 1. Create a database and import your data in a table

Introduction

Once a tracking project starts and sensors are deployed on animals, data begin to arrive (usually in the form of text files containing the raw information recorded by the sensors). At this point, the data must be handled by researchers. In this and the following exercises, you will create a fully operational relational database for GPS tracking data. The first steps are the creation of a new database, of a new schema, and of a table to accommodate the data coming from the (GPS) sensors, which can then be imported. All the basic knowledge needed for these tasks was introduced in Part 1 of this tutorial.

Example

Assuming that you have PostgreSQL installed and running on your computer (or server), the first thing that you have to do to import your raw sensor data into the database is to connect to the database server and create a new database with the command CREATE DATABASE:

CREATE DATABASE gps_tracking_db
ENCODING = 'UTF8'
TEMPLATE = template0
LC_COLLATE = 'C'
LC_CTYPE = 'C';

You could create the database using just the first line of the code. The other lines are added to make sure that the database will use UTF8 as the encoding system and will not be based on any locale-specific setting regarding, e.g. alphabets, sorting, or number formatting. This is very important when you work in an international environment where different languages (and therefore character sets) can potentially be used. When you import text data with a different encoding, you have to specify the original encoding, otherwise special characters might be misinterpreted. Different encodings are a typical source of errors when data are moved from one system to another.
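
As a minimal sketch of how the client encoding could be declared before such an import (the LATIN1 encoding is purely an illustrative assumption, not a property of the test data sets), you could run something like:

-- Hypothetical example: declare that the file to be imported is Latin-1 encoded
SET client_encoding = 'LATIN1';
-- ... run the COPY ... FROM command on the Latin-1 file here ...
-- then restore the default client encoding
RESET client_encoding;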

Although not compulsory, it is very important to document the objects that you create in your database to enable other users (and probably yourself, later on) to understand its structure and content, much in the same way you do with the metadata of your data. The COMMENT command gives you this possibility; comments are stored in the database. In this case:

COMMENT ON DATABASE gps_tracking_db   
IS 'Next Generation Data Management in Movement Ecology Summer school: my database.'; 

By default, a database comes with the public schema; it is good practice, however, to use different schemas to store user data. For this reason you create a new SCHEMA called main:

CREATE SCHEMA main;
COMMENT ON SCHEMA main IS 'Schema that stores all the GPS tracking core data.'; 

Before importing the GPS data sets into the database, it is recommended that you examine the source data (usually .dbf, .csv, or .txt files) with a spreadsheet or a text editor to see what information is contained. Every GPS brand/model can produce different information, or at least organise this information in a different way, as unfortunately no consolidated standards exist yet. The idea is to import raw data (as they are when received from the sensors) into the database and then process them to transform data into information. Once you identify which attributes are stored in the original files (for example, GSM01438.csv), you can create the structure of a table with the same columns, with the correct database DATA TYPES. You can find the GPS data sets in .csv files included in the trackingDB_datasets.zip file with test data in the sub-folder /tracking_db/data/sensors_data.
The SQL code that generates, within the database, a table with the same structure as the source files (Vectronic GPS collars), here called main.gps_data (main is the name of the schema where the table will be created, while gps_data is the name of the table), is

CREATE TABLE main.gps_data 
( 
gps_data_id serial NOT NULL, 
gps_sensors_code character varying, 
line_no integer, 
utc_date date, 
utc_time time without time zone, 
lmt_date date, 
lmt_time time without time zone, 
ecef_x integer, 
ecef_y integer, 
ecef_z integer, 
latitude double precision, 
longitude double precision, 
height double precision, 
dop double precision, 
nav character varying(2), 
validated character varying(3), 
sats_used integer, 
ch01_sat_id integer, 
ch01_sat_cnr integer, 
ch02_sat_id integer, 
ch02_sat_cnr integer, 
ch03_sat_id integer, 
ch03_sat_cnr integer, 
ch04_sat_id integer, 
ch04_sat_cnr integer, 
ch05_sat_id integer, 
ch05_sat_cnr integer, 
ch06_sat_id integer, 
ch06_sat_cnr integer, 
ch07_sat_id integer, 
ch07_sat_cnr integer, 
ch08_sat_id integer, 
ch08_sat_cnr integer, 
ch09_sat_id integer, 
ch09_sat_cnr integer, 
ch10_sat_id integer, 
ch10_sat_cnr integer, 
ch11_sat_id integer, 
ch11_sat_cnr integer, 
ch12_sat_id integer, 
ch12_sat_cnr integer, 
main_vol double precision, 
bu_vol double precision, 
temp double precision, 
easting integer, 
northing integer, 
remarks character varying 
); 
COMMENT ON TABLE main.gps_data 
IS 'Table that stores raw data as they come from the sensors (plus the ID of the sensor).'; 

In a relational database, each table must have a primary key, that is, a field (or combination of fields) that uniquely identifies each record. In this case, we added a SERIAL id field managed by the database (an automatically generated sequence of integers) to be sure that it is unique. We set this field as the primary key of the table:

ALTER TABLE main.gps_data 
ADD CONSTRAINT gps_data_pkey PRIMARY KEY(gps_data_id); 

To keep track of database changes, it is useful to add another field where the timestamp of the insert of each record is automatically recorded (assigning a dynamic default value):

ALTER TABLE main.gps_data ADD COLUMN insert_timestamp timestamp with time zone; 
ALTER TABLE main.gps_data ALTER COLUMN insert_timestamp SET DEFAULT now(); 

Now we are ready to import our data sets. There are many ways to do so. The main one is to use the COPY command with the appropriate parameters:

COPY main.gps_data( 
gps_sensors_code, line_no, utc_date, utc_time, lmt_date, lmt_time, ecef_x, ecef_y, ecef_z, latitude, longitude, height, dop, nav, validated, sats_used, ch01_sat_id, ch01_sat_cnr, ch02_sat_id, ch02_sat_cnr, ch03_sat_id, ch03_sat_cnr, ch04_sat_id, ch04_sat_cnr, ch05_sat_id, ch05_sat_cnr, ch06_sat_id, ch06_sat_cnr, ch07_sat_id, ch07_sat_cnr, ch08_sat_id, ch08_sat_cnr, ch09_sat_id, ch09_sat_cnr, ch10_sat_id, ch10_sat_cnr, ch11_sat_id, ch11_sat_cnr, ch12_sat_id, ch12_sat_cnr, main_vol, bu_vol, temp, easting, northing, remarks) 
FROM 
'C:\tracking_db\data\sensors_data\GSM01438.csv' WITH CSV HEADER DELIMITER ';'; 

You might have to adapt the path to the file.

If PostgreSQL complains that a date is out of range, check the standard date format used by PostgreSQL (datestyle):

SHOW datestyle; 

If it is not ISO, DMY, then you have to set the date format (for the current session) as:

SET SESSION datestyle = "ISO, DMY"; 

This will change the datestyle for the current session only. If you want to change this setting permanently, you have to modify the datestyle option in the postgresql.conf file.
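
As an alternative to editing postgresql.conf, the setting can also be persisted at the database level; a minimal sketch (run as a user with sufficient privileges, taking effect for new sessions) is:

-- Persist the date format for all future connections to this database
ALTER DATABASE gps_tracking_db SET datestyle = 'ISO, DMY';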

In the case of .dbf files, you can use a tool that comes with pgAdmin (the Shapefile and DBF loader). In this case, you do not have to create the structure of the table before importing, but you lose control over the definition of data types (e.g. time will probably be stored as a text value). You can also use MS Access to link both source and destination tables and run an append query.

Exercise

Import into the database the raw GPS data from the test sensors:

  • GSM01508
  • GSM01511
  • GSM01512

Topic 2. Finalise the table of raw data: acquisition timestamps, indexes and permissions

Introduction

PostgreSQL (like most database systems) can deal with a large set of specific types of data through dedicated database data types and related functionalities, in addition to strings and numbers. In the next exercise, a new field is introduced to represent a timestamp with time zone, i.e. date, time, and time zone together (later on in this tutorial, another specific data type will be introduced to manage spatial data).

Moreover, a database offers many functionalities to improve performance. One of these is indexes: data structures that improve the speed of data retrieval operations on a database table at the cost of slower writes and additional storage space. Database indexes work in a similar way to a book's table of contents: you have to add an extra page and update it whenever new content is added, but searching for specific sections then becomes much faster. In the exercise, an index will be added to the raw locations table and an example of its effectiveness will be shown.

One of the main advantages of an advanced database management system like PostgreSQL is that the database can be accessed by a number of different users at the same time, keeping the data in a single version with proper management of concurrency. This ensures that the database maintains the ACID principles (atomicity, consistency, isolation, durability) in an efficient manner and that different permission levels can be set for each user. For example, you can have a single administrator that can change the database, a set of advanced users that can edit the content of the core tables and create their own objects (e.g. tables, functions) without changing the main database structure, and a set of users that can just read the data. In the exercise, a new group of database users and related permissions (roles and privileges) on the raw locations table will be added to illustrate some of the options to manage different classes of users.

Example

In the original GPS data files, no timestamp field is present. Although the table main.gps_data is designed to store data as they come from the sensors, it is convenient to have an additional field where date and time are combined and where the correct time zone is set (in this case UTC). To do so, you first add a field of data type timestamp with time zone. Then you fill it (with an UPDATE statement) from the date and time fields. In a later exercise, you will see how to automate this step using triggers.

ALTER TABLE main.gps_data 
  ADD COLUMN acquisition_time timestamp with time zone;
UPDATE main.gps_data 
  SET acquisition_time = (utc_date + utc_time) AT TIME ZONE 'UTC';

Now the table is ready. Next you can add some indexes. You have to decide which fields to create indexes on by considering what kinds of queries will be performed most often in the database. Here you add indexes on the acquisition_time and gps_sensors_code fields, which are probably two key attributes in the retrieval of data from this table:

CREATE INDEX acquisition_time_index
  ON main.gps_data
  USING btree (acquisition_time);
CREATE INDEX gps_sensors_code_index
  ON main.gps_data
  USING btree (gps_sensors_code);
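
To check whether one of these indexes is actually used, you can inspect the execution plan of a query; a minimal sketch (the exact plan and timings will depend on your data and your PostgreSQL version) is:

-- Show the execution plan (and actual run time) of a query filtered on an indexed field
EXPLAIN ANALYZE
SELECT gps_data_id, acquisition_time
FROM main.gps_data
WHERE gps_sensors_code = 'GSM01512';

If the planner considers it convenient, the plan will report an index scan (or bitmap index scan) on gps_sensors_code_index instead of a sequential scan of the whole table.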

As a simple example, you can now retrieve data using specific selection criteria. Let's retrieve data from the collar GSM01512 during the month of May (whatever the year), and order them by their acquisition time:

SELECT 
  gps_data_id AS id, gps_sensors_code AS sensor_id, 
  latitude, longitude, acquisition_time
FROM 
  main.gps_data
WHERE 
  gps_sensors_code = 'GSM01512' and EXTRACT(MONTH FROM acquisition_time) = 5
ORDER BY 
  acquisition_time
LIMIT 5;

The first records (LIMIT 5 returns just the first 5 records) of the result of this query are

  id   | sensor_id | latitude | longitude |    acquisition_time
-------+-----------+----------+-----------+------------------------
 11906 | GSM01512  | 46.00563 | 11.05291  | 2006-05-01 00:01:01+00
 11907 | GSM01512  | 46.00630 | 11.05352  | 2006-05-01 04:02:54+00
 11908 | GSM01512  | 46.00652 | 11.05326  | 2006-05-01 08:01:03+00
 11909 | GSM01512  | 46.00437 | 11.05536  | 2006-05-01 12:02:40+00
 11910 | GSM01512  | 46.00720 | 11.05297  | 2006-05-01 16:01:23+00

In most cases, your database will have multiple users. As an example, you create here a group of users called basic_users.

CREATE ROLE basic_users LOGIN
NOSUPERUSER INHERIT NOCREATEDB NOCREATEROLE NOREPLICATION;

You can then grant this group permission to SELECT data from the raw locations table. First, you have to give USAGE permission on the schema where the main.gps_data table is stored (with GRANT USAGE ON SCHEMA), then you grant SELECT permission on the table.

GRANT USAGE ON SCHEMA main TO basic_users;
GRANT SELECT ON main.gps_data TO basic_users;

Groups are very useful because you can associate multiple users with the same group and they will automatically inherit all the permissions of the group, so you do not have to assign permissions to each user individually. Permissions can be given to a whole group or to specific users.

You create a user that is part of the basic_users group (login ciccio_pasticcio, password pasticcio_ciccio).

CREATE ROLE ciccio_pasticcio LOGIN
  PASSWORD 'pasticcio_ciccio'
  NOSUPERUSER INHERIT NOCREATEDB NOCREATEROLE NOREPLICATION;
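
Note that this CREATE ROLE statement does not by itself make the new user a member of the basic_users group; one way to add the membership (a minimal sketch) is:

-- Add the login role to the basic_users group so that it inherits the group's permissions
GRANT basic_users TO ciccio_pasticcio;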

In general, every time you create a new database object, you have to set the privileges (for tables: ALL, INSERT, SELECT, DELETE, UPDATE) for groups/users. If you want to grant the same privileges to a group/user on all the tables in a specific schema, you can use GRANT ... ON ALL TABLES IN SCHEMA:

GRANT SELECT ON ALL TABLES 
  IN SCHEMA main 
  TO basic_users;

You can now reconnect to the database as the ciccio_pasticcio user and try first to visualize the data and then to change the values of a record. This latter operation will be forbidden because the user does not have this kind of privilege. In this way, the integrity of the database is safeguarded from inexperienced users who might remove all the data by running the wrong command.
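
A minimal sketch of this test, run after connecting as ciccio_pasticcio (the exact error message may vary), is:

-- Reading the data is allowed
SELECT gps_data_id, gps_sensors_code, acquisition_time
FROM main.gps_data
LIMIT 10;
-- Trying to modify a record is rejected with a "permission denied" error
UPDATE main.gps_data
SET remarks = 'test'
WHERE gps_data_id = 1;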

If you want to automatically grant permissions to specific groups/users to all new (i.e. that will be created in the future) objects in a schema, you can use ALTER DEFAULT PRIVILEGES:

ALTER DEFAULT PRIVILEGES 
  IN SCHEMA main 
  GRANT SELECT ON TABLES 
  TO basic_users;

From now on, users belonging to the basic_users group will have read access to all the tables that will be created in the main schema. By default, all objects created in the schema public are fully accessible to all users.

Setting a permission policy in a complex multi-user environment requires an appropriate definition of data access at different levels.

Exercise

  1. Create a query that returns the number of locations acquired in summer that are stored in the locations table
  2. Find the extreme coordinates (minimum and maximum longitude and latitude) in the data set
  3. Calculate the percentage of locations in the table that are related to each of the sensors

Topic 3. Export your data and backup the database

Introduction

There are different ways to export a table or the results of a query to an external file. One is to use the command COPY (TO). COPY TO (similarly to what happens with the command COPY FROM used to import data) writes the content of a table, or the result of a query, directly to a file. The file must be accessible by the PostgreSQL server (i.e. you have to check the permissions on the target folder for the user ID the PostgreSQL server runs as) and the name (path) must be specified from the viewpoint of the server. This means that files can be read or written only in folders 'visible' to the database server. If you connect to the database remotely and want to save data onto your local machine, you can use the command \COPY, which performs a frontend (client) copy. \COPY is not an SQL command and must be run from the PostgreSQL interactive terminal (psql). It runs an SQL COPY command, but instead of the server reading or writing the specified file, psql reads or writes the file and routes the data between the server and the local file system. This means that file accessibility and privileges are those of the local user, not of the server, and no SQL superuser privileges are required.
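
A minimal sketch of such a client-side export, typed (on a single line) at the psql prompt, is the following; the local output file name is just an example:

\copy (SELECT gps_data_id, gps_sensors_code, acquisition_time FROM main.gps_data) TO 'export_test_local.csv' CSV HEADER
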
Another possibility to export data is to use the pgAdmin interface: in the SQL console, select Query/Execute to file and the results will be saved to a local file instead of being displayed. Other database interfaces have similar tools.

A proper backup policy for a database is important for securing all your valuable data and the information that you have derived through data processing. In general, it is recommended to have frequent (scheduled) backups (e.g. once a day) for schemas that change often, and less frequent backups (e.g. once a week) for schemas (if any) that occupy a larger disk size and do not change often (e.g. ancillary environmental layers). PostgreSQL offers very good tools for database backup and recovery. The two main backup tools are:

  • pg_dump.exe: extracts a PostgreSQL database or part of the database into a script file or other archive file (pg_restore.exe is then used to restore the database);
  • pg_dumpall.exe: extracts a PostgreSQL database cluster (i.e. all the databases created inside the same installation of PostgreSQL) into a script file (e.g. including database setting, roles).

These are not SQL commands but executables that must be run from a command-line interpreter (on Windows, the default command-line interpreter is the program cmd.exe, also called Command Prompt). pgAdmin also offers a graphical interface for backing up and restoring the database. Moreover, it is also important to keep a file-based copy of the original raw data files, particularly those generated by the sensors.

Example

An example of data export for the whole main.gps_data table is

COPY (
  SELECT gps_data_id, gps_sensors_code, latitude, longitude, acquisition_time, insert_timestamp 
  FROM main.gps_data) 
TO 
  'C:\tracking_db\test\export_test1.csv' 
  WITH (FORMAT csv, HEADER, DELIMITER ';');

An example of a backup of the schema main and all its content (if you want to reuse this command, you must properly set the path to your pg_dump executable) is

C:\PostgreSQL\9.4\bin\pg_dump.exe --host localhost --port 5432 --username "postgres" --no-password --format plain --encoding UTF8 --verbose --file "C:\tracking_db\test\backup_db_20150706.sql" --schema "main" "gps_tracking_db"
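
Because this dump uses the plain format, it is a SQL script that can be restored with psql rather than pg_restore; a minimal sketch (assuming an empty target database, here called gps_tracking_db_restore, has already been created) is

C:\PostgreSQL\9.4\bin\psql.exe --host localhost --port 5432 --username "postgres" --dbname "gps_tracking_db_restore" --file "C:\tracking_db\test\backup_db_20150706.sql"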

Exercise

  1. Find and export to a .csv file the extreme coordinates (minimum and maximum longitude and latitude) and the duration of the deployment for each sensor
  2. Make a complete backup of the database built so far and restore it in a new database on your computer

Summary exercise of Lesson 1

  1. Find the number of days monitored by each GPS sensor
  2. Make a bar graph of the number of locations per sensor