getting started with sdm package

sdm is an object-oriented, reproducible and extensible R platform for species distribution modelling. The sdm package is designed to create a comprehensive modelling and simulation framework that: 1) provides a standardised and unified structure for handling species distributions data and modelling techniques (e.g. a unified interface is used to fit different models offered by different packages); 2) is able to support markedly different modelling approaches; 3) enables scientists to modify the existing methods, extend the framework by developing new methods or procedures, and share them to be reproduced by the other scientists; 4) handles spatial as well as temporal data for single or multiple species; 5) employs high performance computing solutions to speed up modelling and simulations, and finally; 6) uses flexible and easy-to-use GUI interface. For more information, check the published paper by Naimi and Araujo (2016) in the journal of Ecography.

This document provides a very quick demonstration on the package followed by some examples, that would be helpful to get a guick start with the package.

Installing sdm and all the required packages

sdm can simply be installed using the standard install.packages function as:

install.packages(‘sdm’)

Depending on the methods are selected through the modelling and using the package, several packages may be needed, and therefore, should be installed on your machine. A quick way to install all the required packages (to guarantee having full functionaliy of sdm), is to simply use the function installAll offered by the sdm package. You can simply call it without any argument:

installAll()

A brief overview:

There are three main functions provide the main steps of developing/using species distibution models. The three steps include data preparation, model fitting/ evaluation, and prediction. The functions used for these steps:

: to read data from different formats, and prepare a data object. Both species (single or multiple) and explanatory variables can be introduced to the function, as well as other information such as spatial coordinates, time, grouping variables, etc.
: to fit and evaluate species distribution models (multiple algorithms can be used)
: when models are fitted, they can be used to predict/project given a new dataset.

Example Dataset:

The package is followed by several datasets that are used in the help pages. We also use one of those examples here for our demonstration:

We use a shapefile containing presence=absence records for a species as spatial points (species.shp), and four raster datasets (in Ascii format) as explanatory variables (predictors). The files are in the sdm library, so we can directly read them from the library folder. We use rgdal to read raster data (it supports different common formats), and raster package to handle the raster data.

library(sdm)

library(raster)
library(rgdal)

file <- system.file("external/species.shp", package="sdm") # get the location of the species shapefile in the sdm package

# so, file is simply a filename (with path):
file

## [1] "/Library/Frameworks/R.framework/Versions/3.2/Resources/library/sdm/external/species.shp"

# read the species shapefile using the function shapefile:

species <- shapefile(file)

class(species) # it is a SpatialPointsDataFrame

## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"

plot(species)

# we can take a look at the head of attribute table in the species dataset:

head(species)

##   Occurrence
## 1          1
## 2          0
## 3          1
## 4          1
## 5          1
## 6          0

# you can see that there is a column has presence-absence records (name of column is important to know: Occurrence)

#--- we can plot presence and absence points separately with different colours:

plot(species[species$Occurrence == 1,],col='blue',pch=16)

points(species[species$Occurrence == 0,],col='red',pch=16)

##########
# Let's read predictor variables (raster datasets)
# We have four Ascii-Grids, so, let's just take the name of all files ending to '.asc' to be able to read them together. list.files function can be used to get a list of files in a given path:

path <- system.file("external", package="sdm") # path to the folder contains the data
lst <- list.files(path=path,pattern='asc$',full.names = T) # list the name of files in the specified path, match the pattern (means all files with a name ending to asc). We asked to generate full names (i.e., names with the path)

lst # this is the name of raster files we want to use as predictor variables

## [1] "/Library/Frameworks/R.framework/Versions/3.2/Resources/library/sdm/external/elevation.asc"    
## [2] "/Library/Frameworks/R.framework/Versions/3.2/Resources/library/sdm/external/precipitation.asc"
## [3] "/Library/Frameworks/R.framework/Versions/3.2/Resources/library/sdm/external/temperature.asc"  
## [4] "/Library/Frameworks/R.framework/Versions/3.2/Resources/library/sdm/external/vegetation.asc"

# stack is a function in the raster package, to read/create a multi-layers raster dataset
preds <- stack(lst) # making a raster object

preds # see the specification of the raster layers (e.g., cell size, extent, etc.)

## class       : RasterStack 
## dimensions  : 71, 124, 8804, 4  (nrow, ncol, ncell, nlayers)
## resolution  : 4219.223, 4219.223  (x, y)
## extent      : 100975.3, 624159, 3988830, 4288395  (xmin, xmax, ymin, ymax)
## coord. ref. : NA 
## names       : elevation, precipitation, temperature, vegetation

plot(preds)

plot(preds[[4]]) # only plot the 4th layer

plot(species,add=T) # let's add the species point on the previous plot

Data preparation

So far, we used other packages to just read the data we need to use in our study. Now, we can use the sdm package. First, we need to read and put data in the package that creates a data object. This is very simple and efficient using the function .

In this function, we can specify the train dataset (can be spatial points or simply a data.frame), and predictors (if available separately as a raster object). In addition, if there is an independent dataset available to be used for measuring model performance (evaluation/validation), we can provide it through the test arguement. A formula can also be used to specify the response and explanatory variables (and in case if you have data.frame as input which contains coordinates, you can specify the coordinates in the formula as well, + more information). If the formula is not specified, the function tries to detect the species and predictor variables.

library(sdm)

d <- sdmData(formula=Occurrence~., train=species, predictors=preds)

d

## class                                 : sdmdata 
## =========================================================== 
## number of species                     :  1 
## species names                         :  Occurrence 
## number of fearures                    :  4 
## fearure names                         :  elevation, precipitation, temperature, ... 
## type                                  :  Presence-Absence 
## has independet test data?             :  FALSE 
## number of records                     :  200 
## has Coordinates?                      :  TRUE

# we didn't really need the formula in this example, as it would be easy for the function to guess which dataset is species, and which are predictors. So, it could be like this:
d <- sdmData(train=species, predictors=preds)

d

## class                                 : sdmdata 
## =========================================================== 
## number of species                     :  1 
## species names                         :  Occurrence 
## number of fearures                    :  4 
## fearure names                         :  elevation, precipitation, temperature, ... 
## type                                  :  Presence-Absence 
## has independet test data?             :  FALSE 
## number of records                     :  200 
## has Coordinates?                      :  TRUE

# However, formula makes it so flexible to handle the variables, specifally if there are several other information (e.g., time). If you have multiple species, you can have their names in the left hand side (e.g., sp1+sp2+sp3~.)

# You may also want to take part of variables:
d <- sdmData(formula=Occurrence~precipitation+temperature, train=species, predictors=preds)

d

## class                                 : sdmdata 
## =========================================================== 
## number of species                     :  1 
## species names                         :  Occurrence 
## number of fearures                    :  2 
## fearure names                         :  precipitation, temperature 
## type                                  :  Presence-Absence 
## has independet test data?             :  FALSE 
## number of records                     :  200 
## has Coordinates?                      :  TRUE

d <- sdmData(formula= ~., train=species, predictors=preds)


#---

Model Fitting and Evaluation

When you create the data object (d in the above example), you would be able to fit the models. To do so, you are going to use the function . In this function, you can specify the variables and the type of features can be generated based on, through a formula. In addition, the name of methods can be specifies as well as settings.

# in the following example, we use 3 different methods to fit the models.

m1 <- sdm(Occurrence~.,data=d,methods=c('glm','gam','brt'))

m1

## class                                 : sdmModels 
## ======================================================== 
## number of species                     :  1 
## number of modelling methods           :  3 
## names of modelling methods            :  glm, gam, brt 
## ------------------------------------------
## model run success percentage (per species)  :
## ------------------------------------------
## method          Occurrence       
## ---------------------- 
## glm        :     100 %
## gam        :     100 %
## brt        :     100 %
## 
## ###################################################################
## model performance (per species), using training test dataset:
## --------------------------------------------------------------------
## 
##  ## species   :  Occurrence 
## ======================
## 
## methods    :     AUC     |     COR     |     TSS     |     Deviance 
## -------------------------------------------------------------------
## glm        :     0.88    |     0.7     |     0.69    |     0.83     
## gam        :     0.88    |     0.71    |     0.7     |     0.82     
## brt        :     0.89    |     0.72    |     0.69    |     0.92

# as you can see, a report is generates shows how many percent of models were successful, and their performance
#-------------------

# in the above example, the performance statistics were calculated based on the training dataset (the data that were used to fit the mdoel). It is a better idea to have an independent dataset (if so, we would specify in the test argument of sdmData). However, for most of cases, there is no such data available, therefore, we can split the dataset as an alternative solution. Splitting (partitioning) can be one time or several times (several replicates). There are also several methods to do that (i.e., subsampling, cross-validation, bootsrapping)

# Here we are going to fit 5 models and evaluate them through 2 runs of subsampling, each draw 30 percent of training data as test dataset:

m2 <- sdm(Occurrence~.,data=d,methods=c('rf','tree','fda','mars','svm'),replicatin='sub',test.percent=30,n=2)

m2

## class                                 : sdmModels 
## ======================================================== 
## number of species                     :  1 
## number of modelling methods           :  5 
## names of modelling methods            :  rf, cart, fda, mars, svm 
## replicate.methods (data partitioning) :  subsampling 
## number of replicates (each method)    :  2 
## toral number of replicates per model  :  2 (per species) 
## test percentage (in subsampling)      :  30 
## ------------------------------------------
## model run success percentage (per species)  :
## ------------------------------------------
## method          Occurrence       
## ---------------------- 
## rf         :     100 %
## cart       :     100 %
## fda        :     100 %
## mars       :     100 %
## svm        :     100 %
## 
## ###################################################################
## model Mean performance (per species), using test dataset (generated using partitioning):
## --------------------------------------------------------------------
## 
##  ## species   :  Occurrence 
## ======================
## 
## methods    :     AUC     |     COR     |     TSS     |     Deviance 
## -------------------------------------------------------------------
## rf         :     0.85    |     0.66    |     0.7     |     0.94     
## cart       :     0.83    |     0.6     |     0.63    |     1.36     
## fda        :     0.86    |     0.69    |     0.67    |     0.88     
## mars       :     0.82    |     0.58    |     0.65    |     1.29     
## svm        :     0.86    |     0.71    |     0.73    |     NaN

getModelInfo(m2) # info on runs including modelID, whether they are successfully fitted and evaluated, etc.

##    modelID    species method replication replicationID success training
## 1        1 Occurrence     rf subsampling             1    TRUE     TRUE
## 2        2 Occurrence     rf subsampling             2    TRUE     TRUE
## 3        3 Occurrence   cart subsampling             1    TRUE     TRUE
## 4        4 Occurrence   cart subsampling             2    TRUE     TRUE
## 5        5 Occurrence    fda subsampling             1    TRUE     TRUE
## 6        6 Occurrence    fda subsampling             2    TRUE     TRUE
## 7        7 Occurrence   mars subsampling             1    TRUE     TRUE
## 8        8 Occurrence   mars subsampling             2    TRUE     TRUE
## 9        9 Occurrence    svm subsampling             1    TRUE     TRUE
## 10      10 Occurrence    svm subsampling             2    TRUE     TRUE
##    test.dep test.indep
## 1      TRUE      FALSE
## 2      TRUE      FALSE
## 3      TRUE      FALSE
## 4      TRUE      FALSE
## 5      TRUE      FALSE
## 6      TRUE      FALSE
## 7      TRUE      FALSE
## 8      TRUE      FALSE
## 9      TRUE      FALSE
## 10     TRUE      FALSE

# We can generate the roc curve and compare the results for all models:
roc(m2)

# the plots can be smoothes:
roc(m2,smooth=T)

Prediction

We can use the output of fitting, to predict into the study area, or project into a new location or a new time.

The predict function can be used for this purpose:

# in the following, we just predict the habitat suitability into the whole study area
# since the newdata is a raster object, the output is also a raster object

p1 <- predict(m1,newdata=preds,filename='p1.img') # many commonly used raster format is supported (through the package rgdal)

plot(p1)

p2 <- predict(m2,newdata=preds,filename='p2.img')

p2

## class       : RasterBrick 
## dimensions  : 71, 124, 8804, 10  (nrow, ncol, ncell, nlayers)
## resolution  : 4219.223, 4219.223  (x, y)
## extent      : 100975.3, 624159, 3988830, 4288395  (xmin, xmax, ymin, ymax)
## coord. ref. : NA 
## data source : /Users/bnaimi/Dropbox/R_Books_Docs/r-gis.net/_sdm_rforge/sdm/pkg/sdm/vignettes/p2.img 
## names       : id_1.sp_1.m_rf.re_subs, id_2.sp_1.m_rf.re_subs, id_3.sp_1.m_cart.re_subs, id_4.sp_1.m_cart.re_subs, id_5.sp_1.m_fda.re_subs, id_6.sp_1.m_fda.re_subs, id_7.sp_1.m_mars.re_subs, id_8.sp_1.m_mars.re_subs, id_9.sp_1.m_svm.re_subs, id_10.sp_1.m_svm.re_subs 
## fullname    : id_1-species_Occurrence-method_rf-replication_subsampling, id_2-species_Occurrence-method_rf-replication_subsampling, id_3-species_Occurrence-method_cart-replication_subsampling, id_4-species_Occurrence-method_cart-replication_subsampling, id_5-species_Occurrence-method_fda-replication_subsampling, id_6-species_Occurrence-method_fda-replication_subsampling, id_7-species_Occurrence-method_mars-replication_subsampling, id_8-species_Occurrence-method_mars-replication_subsampling, id_9-species_Occurrence-method_svm-replication_subsampling, id_10-species_Occurrence-method_svm-replication_subsampling

nlayers(p2)

## [1] 10

plot(p2[[1:4]]) # plot the first 12 rasters

# we can take the mean of raster over different runs for each method and species:
p2m <- predict(m2,newdata=preds,filename='p2m.img',mean=T)

p2m

## class       : RasterBrick 
## dimensions  : 71, 124, 8804, 5  (nrow, ncol, ncell, nlayers)
## resolution  : 4219.223, 4219.223  (x, y)
## extent      : 100975.3, 624159, 3988830, 4288395  (xmin, xmax, ymin, ymax)
## coord. ref. : NA 
## data source : /Users/bnaimi/Dropbox/R_Books_Docs/r-gis.net/_sdm_rforge/sdm/pkg/sdm/vignettes/p2m.img 
## names       : sp_1.m_rf.re_subs, sp_1.m_cart.re_subs, sp_1.m_fda.re_subs, sp_1.m_mars.re_subs, sp_1.m_svm.re_subs 
## min values  :      1.915833e-02,        0.000000e+00,       1.772793e-03,        8.986686e-05,      -6.880102e-02 
## max values  :         0.9867000,           0.9677419,          0.9975556,           0.9979973,          1.1222804 
## fullname    : species_Occurrence-method_rf-replication (Mean)_subsampling, species_Occurrence-method_cart-replication (Mean)_subsampling, species_Occurrence-method_fda-replication (Mean)_subsampling, species_Occurrence-method_mars-replication (Mean)_subsampling, species_Occurrence-method_svm-replication (Mean)_subsampling

plot(p2m)

# full names of rasters:
getZ(p2m)

## [1] "species_Occurrence-method_rf-replication (Mean)_subsampling"  
## [2] "species_Occurrence-method_cart-replication (Mean)_subsampling"
## [3] "species_Occurrence-method_fda-replication (Mean)_subsampling" 
## [4] "species_Occurrence-method_mars-replication (Mean)_subsampling"
## [5] "species_Occurrence-method_svm-replication (Mean)_subsampling"