Lesson 2. Managing and modelling information on animals, sensors and captures


GPS positions are used to describe animal movements and to derive a large set of information, for example about animals' behavior, social interactions, and environmental preferences. GPS data are related to (and must be integrated with) many other sources of information that together can be used to describe the complexity of movement ecology. In a database framework, this can only be achieved through proper data modelling, which depends on a clear definition of the biological context of a study. Data modelling becomes a key step when database systems manage a large set of connected data sets that grow in size and complexity: it permits easy updates and modifications, and the adaptation of the database structure to accommodate the changing goals, constraints, and spatial scales of studies. In this lesson, you will extend your database with three new tables to integrate ancillary information useful for interpreting GPS data: one for GPS sensors, one for animals, and one for captures.

Topic 1. The world in a database: database data models

A data model describes what types of data are stored and how they are organized. It can be seen as the conceptual representation of the real world in the database structures, which include data objects (i.e. tables) and their mutual relationships. In particular, data modelling becomes a key step when database systems grow in size and complexity, and user requirements become more sophisticated: it permits easy updates and modifications, and the adaptation of the database structure to accommodate the changing goals, constraints, and spatial scales of studies and the evolution of wildlife tracking systems. Without a rigorous data modelling approach, an information system might lose the flexibility to manage data efficiently in the long term, reducing its utility to that of a simple storage device for raw data and thus failing to address many of the necessary requirements.
To model data properly, you have to clearly state the biological context of your study. A logical way to proceed is to define (a) very basic questions about the sample unit, i.e. the individual animal, and (b) basic questions about data collection.
a) Typically, individuals are the sampling units of an ecological study based on wildlife tracking. Therefore, the first question to ask while modelling the data is: what basic biological information is needed to characterise individuals as part of a sample? Species, sex and age (or age class) at capture are the main factors that are relevant in all studies. Age classes typically depend on the species. Moreover, age class is not constant across all the GPS positions of an individual. The correct age class at any given moment can be derived from the age class at capture by defining rules that specify when an individual changes from one class to another (for roe deer, you might assume that on 1 April of every year each individual that was a fawn becomes a yearling, and each yearling becomes an adult). Other information used to characterise individuals can be specific to a study; for example, in a study on the spatial behaviour of translocated animals, 'resident' or 'translocated' is an essential piece of information linked to individual animals. All these elements should be described in specific tables.
b) A single individual becomes a 'studied unit' when it is fitted with a sensor, in this case to collect position information. First of all, GPS sensors should be described in a dedicated table containing the technical characteristics of each device (e.g. vendor, model). Capture time, or 'device-fitting' time, together with the time and a description of the end of the deployment (e.g. drop-off of the tag, death of the animal), are also essential to the model. The link between sensors and animals should be described in a table that states unequivocally when the data collected from a sensor 'become' (and cease to be) bio-logged data, i.e. the period during which they refer to an individual's behaviour. The start of the deployment usually coincides with the moment of capture, but the two are not the same thing. Indeed, the moment of capture can be the 'end' of one relationship between a sensor and an animal (i.e. when a device is taken off an animal) and at the same time the 'beginning' of another (i.e. when another device is fitted instead).

Thanks to the tables 'animals', 'sensors' and 'sensors to animals', and the relationships built among them, GPS data can be linked unequivocally to individuals, i.e. the sampling units (see Lesson 3).
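
The 'sensors to animals' table itself will be created in Lesson 3. As a preview, here is a minimal sketch of what such a deployment table might look like (table and column names here are assumptions and may differ from the actual lesson):

CREATE TABLE main.gps_sensors_animals(
  gps_sensors_animals_id serial PRIMARY KEY,
  animals_id integer NOT NULL,
  gps_sensors_id integer NOT NULL,
  start_time timestamp with time zone NOT NULL, -- deployment start, usually the capture
  end_time timestamp with time zone,            -- NULL while the sensor is still deployed
  notes character varying
);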

Some information related to animals can change over time; therefore, it must be marked with the reference time that it refers to. Examples of typical parameters assessed at capture are age and the test result for a disease. Translocation may also coincide with the capture/release time. If this information changes over time according to well-defined rules (e.g. the transition between age classes), its value can be dynamically calculated in the database at different moments in time (e.g. using database functions). In one of the next lessons, you will see an example of a function to calculate the age class from the information on the age class at capture and the acquisition time of GPS positions for roe deer.
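
To anticipate the idea with a minimal sketch (the function name is an assumption, and the implementation shown in the later lesson may differ), such a function could apply the roe deer rule described above, moving each individual up one class at every 1 April, assuming the coding used later in this lesson (1 = fawn, 2 = yearling, 3 = adult):

CREATE OR REPLACE FUNCTION main.age_class_at_time(
  age_class_at_capture integer,
  capture_time timestamp with time zone,
  acquisition_time timestamp with time zone)
RETURNS integer AS
$BODY$
  -- Dates from 1 April to 31 March share the same 'biological year';
  -- shifting both timestamps back by 3 months lets extract(year ...)
  -- return that year, so the difference counts 1 April boundaries crossed
  SELECT least(3, age_class_at_capture +
    (extract(year from acquisition_time - interval '3 months') -
     extract(year from capture_time - interval '3 months')))::integer;
$BODY$
LANGUAGE sql;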
The basic structure based on the elements animals, sensors, sensors to animals, and, of course, position data, can be extended to take into account the specific goals of each project, the complexity of the real-world problems faced, the technical environment, and the available data. Examples of data that can be integrated are the capture methodology, the handling procedure, the use of tranquilizers and so forth, which should be described in a 'captures' table linked to the specific individual (in the table 'animals'). Finally, data referring to individuals may come from several sources, e.g. several sensors or visual observations. In all these cases, the link between data and sampling units (individuals) should also be clearly stated by appropriate relationships. In complex projects, especially those that involve field data collection in parallel with remote tracking or multiple species/sensor types, databases can quickly get complex and grow to hundreds of tables.

The design of the database data model is an exercise that must be performed at the very initial stage of the creation of a database. Once the objects (and objectives) of a study are identified and described (see above), tools exist to graphically translate the conceptual model into connected tables, each of them representing a specific entity of the world. This process is not trivial and forces biologists to "formalize" their goals, data and scientific approach (which also helps to organize the whole data collection in a systematic and consistent way). For example, at the beginning of a tracking study, it is easy to assume that a tag (i.e. a collar) can be identified with the animal on which it is deployed, creating a single table. Later on, it happens very often that the same collar is reused on other animals, thus making a data model based on a single animal-collar table unsuitable. Changing a database at an advanced stage of development is very complicated and requires far more time than a carefully planned phase of data modelling at the start of the project.

A very popular graphical tool to represent a data model is the Entity-Relationship Diagram (ERD), which helps to show the relationships between the elements, concepts or events of a system and can be used as the foundation for a relational database.

The figure below illustrates the schema of the database structure created by the end of this and the next lessons.

Figure: Database schema

Topic 2. Extend the database: data on sensors and animals

Introduction

At the moment, there is a single table in your test database, and it represents raw data from GPS sensors. These data alone give very little information on what they represent. To take full advantage of these positional data sets, you have to join locations with other kinds of information that help to transform the raw numbers into the description of real-life objects (in this case, moving animals). When this contextual information is very limited, you can simply add some metadata, but when it is more structured and complex, you have to properly integrate it into the database as tables.
Now you can start to extend the database with new tables to represent sensors and animals. This process will continue throughout the following lessons in order to include a wide range of information that is relevant for wildlife tracking.

Example

In the sub-folders \tracking_db\data\animals and \tracking_db\data\sensors of the test data set you will find two files: animals.csv and gps_sensors.csv. Let's start with the data on GPS sensors. Once you have explored its content, you first have to create a table in the database with the same attributes as the .csv file, and then import the data into it. Here is the code for the table structure:

CREATE TABLE main.gps_sensors(
  gps_sensors_id integer,
  gps_sensors_code character varying NOT NULL,
  purchase_date date,
  frequency double precision,
  vendor character varying,
  model character varying,
  sim character varying,
  CONSTRAINT gps_sensors_pkey
    PRIMARY KEY (gps_sensors_id),
  CONSTRAINT gps_sensor_code_unique
    UNIQUE (gps_sensors_code)
);
COMMENT ON TABLE main.gps_sensors
IS 'GPS sensors catalog.';

The only field in the table that is not present in the original file is gps_sensors_id. This is an integer used as the primary key. You could also use gps_sensors_code as the primary key, but in many practical situations it is handy to use an integer field instead. In general, a good recommendation is to use the serial data type for the primary key, letting the database generate a unique integer every time a new record is inserted. In this exercise, we use a plain integer data type because the values of the gps_sensors_id field are pre-defined so that they can be correctly referenced in the exercises of the next lessons. A UNIQUE constraint is created on the field gps_sensors_code to be sure that the same sensor is not imported more than once.
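
For illustration, here is a minimal sketch of the serial approach (the table name and values here are hypothetical and not part of the exercise):

CREATE TABLE main.gps_sensors_example(
  gps_sensors_id serial PRIMARY KEY,
  gps_sensors_code character varying NOT NULL UNIQUE
);
-- gps_sensors_id is generated automatically at each insert
INSERT INTO main.gps_sensors_example (gps_sensors_code)
  VALUES ('GSM_TEST01');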
You add a field to keep track of the timestamp of record insertion, which can be very useful to monitor database activities:

ALTER TABLE main.gps_sensors 
  ADD COLUMN insert_timestamp timestamp with time zone DEFAULT now();

Now you can import data using the COPY command:

COPY main.gps_sensors(
  gps_sensors_id, gps_sensors_code, purchase_date, frequency, vendor, model, sim)
FROM 
  'C:\tracking_db\data\sensors\gps_sensors.csv' 
  WITH (FORMAT csv, DELIMITER ';');

At this stage, you have defined the list of GPS sensors that exist in your database. To be sure that you will never have GPS data that come from a GPS sensor that does not exist in the database, you apply a foreign key between main.gps_data and main.gps_sensors. Foreign keys physically translate the concept of relations among tables.

ALTER TABLE main.gps_data
  ADD CONSTRAINT gps_data_gps_sensors_fkey 
  FOREIGN KEY (gps_sensors_code)
  REFERENCES main.gps_sensors (gps_sensors_code) 
  MATCH SIMPLE ON UPDATE NO ACTION ON DELETE NO ACTION;

This setting means that in order to delete a record in main.gps_sensors, you first have to delete all the associated records in main.gps_data. From now on, before importing GPS data from a sensor, you first have to create the sensor's record in the main.gps_sensors table.
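
For example, an attempt to delete a sensor that still has GPS records attached would be rejected (the sensor code below is a hypothetical example):

DELETE FROM main.gps_sensors
  WHERE gps_sensors_code = 'GSM_EXAMPLE'; -- hypothetical code with associated GPS data
-- ERROR: update or delete on table "gps_sensors" violates foreign key
-- constraint "gps_data_gps_sensors_fkey" on table "gps_data"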
You can add other kinds of constraints to control the consistency of your database. As an example, here you check that the date of purchase is after 2000-01-01. If this condition is not met, the database will refuse to insert (or modify) the record and will return an error message.

ALTER TABLE main.gps_sensors
  ADD CONSTRAINT purchase_date_check 
  CHECK (purchase_date > '2000-01-01'::date);
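
For example, an insert with a purchase date before 2000 (hypothetical values below) would be rejected:

INSERT INTO main.gps_sensors
  (gps_sensors_id, gps_sensors_code, purchase_date)
  VALUES (99, 'GSM_TEST99', '1999-05-20'); -- hypothetical sensor
-- ERROR: new row for relation "gps_sensors" violates check
-- constraint "purchase_date_check"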

Exercise

  1. Visualize all the locations stored in main.gps_data, including the date when the sensor was purchased and the vendor name

Topic 3. Extend the database: data on captures

Introduction

The coordinates provided by GPS sensors describe in detail the spatial patterns of animal movements, which provide valuable information to biologists and wildlife managers. On the other hand, coordinates alone, with no information on what they represent (i.e. the object that is moving and its characteristics), can only partially address the main ecological questions that they could potentially answer. In this and the next lessons, we will try to build up a (database) framework to move from coordinates to more complex objects: individual animals, with their characteristics and interactions, moving in their environment. The first step is to describe the individuals that are monitored. In this exercise, we create a table to store basic information such as age, sex, species, and the name that researchers use to identify each individual (if any).

Example

Now you repeat the same process for data on animals. Analyzing the animals' source file (animals.csv), you can derive the fields of the new main.animals table:

CREATE TABLE main.animals(
  animals_id integer,
  animals_code character varying(20) NOT NULL,
  name character varying(40),
  sex character(1),
  age_class_code integer,
  species_code integer,
  note character varying,
  CONSTRAINT animals_pkey PRIMARY KEY (animals_id)
);
COMMENT ON TABLE main.animals
IS 'Animals catalog with the main information on individuals.';

As for main.gps_sensors, in your operational database you can use the serial data type for the animals_id field. Age class (at capture) and species are attributes that can only have predefined values. To enforce consistency in the database, in these cases you can use look up tables. Look up tables store the list and the description of all possible values referenced by specific fields in other tables, and thus constitute the definition of the valid domain. They are very common because they help to simplify a database structure and add flexibility compared to constraints defined on specific fields. It is recommended to keep look up tables in a separate schema to give the database a more readable and clear structure. Therefore, you create a lu_tables schema:

CREATE SCHEMA lu_tables;
  GRANT USAGE ON SCHEMA lu_tables TO basic_user;
COMMENT ON SCHEMA lu_tables
IS 'Schema that stores look up tables.';

You set as default that the user basic_user will be able to run SELECT queries on all the tables that will be created in this schema:

ALTER DEFAULT PRIVILEGES 
  IN SCHEMA lu_tables 
  GRANT SELECT ON TABLES 
  TO basic_user;

Now you create a look up table for species:

CREATE TABLE lu_tables.lu_species(
  species_code integer,
  species_description character varying,
  CONSTRAINT lu_species_pkey 
  PRIMARY KEY (species_code)
);
COMMENT ON TABLE lu_tables.lu_species
IS 'Look up table for species.';

You populate it with some values (only the roe deer code will be used in our test data set):

INSERT INTO lu_tables.lu_species 
  VALUES (1, 'roe deer');

INSERT INTO lu_tables.lu_species 
  VALUES (2, 'reindeer');

INSERT INTO lu_tables.lu_species 
  VALUES (3, 'moose');

You can do the same for age classes:

CREATE TABLE lu_tables.lu_age_class(
  age_class_code integer, 
  age_class_description character varying,
  CONSTRAINT lu_age_class_pkey
  PRIMARY KEY (age_class_code)
);
COMMENT ON TABLE lu_tables.lu_age_class
IS 'Look up table for age classes.';

You populate it with some values (these categories are based on roe deer; other species might need a different approach):

INSERT INTO lu_tables.lu_age_class 
  VALUES (1, 'fawn');

INSERT INTO lu_tables.lu_age_class 
  VALUES (2, 'yearling');

INSERT INTO lu_tables.lu_age_class 
  VALUES (3, 'adult');

At this stage, you can create the foreign keys between the main.animals table and the two look up tables. This will prevent having animals with species or age classes that are not listed in the look up tables, and it is an important tool to ensure data integrity.

ALTER TABLE main.animals
  ADD CONSTRAINT animals_lu_species 
  FOREIGN KEY (species_code)
  REFERENCES lu_tables.lu_species (species_code) 
  MATCH SIMPLE ON UPDATE NO ACTION ON DELETE NO ACTION;
ALTER TABLE main.animals
  ADD CONSTRAINT animals_lu_age_class 
  FOREIGN KEY (age_class_code)
  REFERENCES lu_tables.lu_age_class (age_class_code) 
  MATCH SIMPLE ON UPDATE NO ACTION ON DELETE NO ACTION;
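
As before, an insert that references a code not present in the look up tables (hypothetical values below) will be rejected:

INSERT INTO main.animals
  (animals_id, animals_code, species_code)
  VALUES (99, 'TEST01', 99); -- hypothetical animal; species code 99 is not in lu_species
-- ERROR: insert or update on table "animals" violates foreign key
-- constraint "animals_lu_species"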

For the sex of deer, you do not expect to have more than two possible values: female and male (stored in the database as 'f' and 'm' to simplify data input). In this case, instead of a look up table you can set a check on the field:

ALTER TABLE main.animals
  ADD CONSTRAINT sex_check 
  CHECK (sex = 'm' OR sex = 'f');

Whether it is better to use a look up table or a check must be evaluated case by case, mainly according to the number of admitted values and the possibility that you will want to add new values in the future.
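
For example, admitting a new species later only requires inserting a row into the look up table, while extending a check constraint requires dropping and recreating it (the new values below are hypothetical):

INSERT INTO lu_tables.lu_species 
  VALUES (4, 'red deer'); -- hypothetical new species

ALTER TABLE main.animals 
  DROP CONSTRAINT sex_check;
ALTER TABLE main.animals
  ADD CONSTRAINT sex_check 
  CHECK (sex IN ('m', 'f', 'u')); -- 'u' for unknown, a hypothetical extension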
You should also add a field to keep track of the timestamp of record insertion, in order to be able to reconstruct what happened in the database. More complex approaches to monitoring database activity are possible, from the creation of a log file that stores all the operations performed to a fully versioned database, each with its pros and cons.

ALTER TABLE main.animals 
  ADD COLUMN insert_timestamp timestamp with time zone DEFAULT now();
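
A minimal sketch of a slightly more elaborate approach (the column, function and trigger names here are assumptions) is to also track the time of the last modification of each record with a trigger:

ALTER TABLE main.animals 
  ADD COLUMN update_timestamp timestamp with time zone DEFAULT now();

CREATE OR REPLACE FUNCTION main.update_timestamp()
RETURNS trigger AS
$BODY$
BEGIN
  -- stamp the row with the time of the current modification
  NEW.update_timestamp = now();
  RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql;

CREATE TRIGGER update_timestamp
  BEFORE UPDATE ON main.animals
  FOR EACH ROW
  EXECUTE PROCEDURE main.update_timestamp();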

As a last step, you import the values from the file:

COPY main.animals(
  animals_id, animals_code, name, sex, age_class_code, species_code)
FROM 
  'C:\tracking_db\data\animals\animals.csv' 
  WITH (FORMAT csv, DELIMITER ';');

To test the result, you can retrieve the animals' data with the extended species and age class description:

SELECT
  animals.animals_id AS id, 
  animals.animals_code AS code, 
  animals.name, 
  lu_age_class.age_class_description AS age_class
FROM 
  lu_tables.lu_age_class, 
  main.animals
WHERE 
  lu_age_class.age_class_code = animals.age_class_code;

The result of the query is:

 id | code |    name    | age_class
----+------+------------+-----------
  1 | F09  | Daniela    | adult
  2 | M03  | Agostino   | adult
  3 | M06  | Sandro     | adult
  4 | F10  | Alessandra | adult
  5 | M10  | Decimo     | adult

Exercise

  1. Query the table main.animals including the information on the sex and the species (with the name of the species stored in the look up table)
  2. Modify the field vendor in the table main.gps_sensors to limit the possible values to: 'Vectronics', 'Sirtrack' and 'Lotek', then try to insert a vendor not in the list

Summary exercise of Lesson 2

  1. Design a general schema of a possible database extension (i.e. tables and their links) to include information on captures, and in particular on
  • date and time of the capture (note that the same animal can be captured more than once)
  • location of capture
  • capture method
  • person who captured the animal
  • whether the animal was collared (if so, include the code of the collar)
  • the name/id of the animal
  • body temperature (note that temperature can be measured more than once at the different stages of the capture)