---
title: "An Overview of meteR: testing the Maximum Entropy Theory of Ecology"
author: "Andy Rominger, Cory Merow"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    fig_caption: yes
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{An Overview of meteR: testing the Maximum Entropy Theory of Ecology}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
<!--  rmarkdown::html_vignette % this doesn't seem to allow TOC
#another option for output
output: rmarkdown::html_vignette

output: 
  pdf_document:
    toc: true
    toc_depth: 3

vignette: >
  \VignetteIndexEntry{An Overview of meteR: testing the Maximum Entropy Theory of Ecology}
  \VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
  \usepackage[T1]{fontenc} 
  -->

<!-- For more details on using R Markdown see <http://rmarkdown.rstudio.com>. 
#TO DO
FIX latex
-ADD data refs
-how to handle plotting when just the state variables are specified? - make it so you can add this to the the plot via just a line?
-->

```{r setup, include=FALSE}
knitr::opts_chunk$set(cache=FALSE)
# note that setting this to true causes problems with R CMD CHECK
```

<!--========================================================================-->
#Introduction
The package `meteR` is designed to facilitate fitting the models for the Maximum Entropy Theory of Ecology (METE) from data. For an overview of METE, see Harte et al. (2008), Harte (2011), and Harte and Newman (2014). Note that throughout this tutorial we use the notation from these sources without extensive explanation (Figure 1). `meteR` can use data input in multiple formats and predict all the fundamental distributions and relationships of the theory. Our objective is to facilitate tests of METE with empirical data sets. 

![fig1](METE_Fig.png) 

Figure 1 was taken from Harte and Newman (2014) TREE 29: 384–389 and illustrates the derivation of, and connection between, different METE distributions and relationships.

<!--========================================================================-->
<!--========================================================================-->
<!--========================================================================-->

#Case Study 1 - Abundance and Power Distributions
We begin illustrating the capabilities of `meteR` with data from an arthropod community (Gruner 2007) collected via canopy fogging in Hawaii and including individual body mass measurements for each individual collected. These data are distributed as part of `meteR`. We will use it to illustrate the construction of distributions related to abundance and metabolic rate. Notably, these data do not contain any spatial information so we illustrate spatial predictions with a different data set in the next section.

```{r}
library(meteR)
data(arth)
dim(arth)
head(arth)
```
This data set illustrates one data format used by `meteR`; each row represents an individual, with an observation of its metabolic rate (note that we convert mass to metabolic rate using the usual relationship of Metabolic Scaling Theory such that metabolic rate $M \propto mass^{3/4}$). If multiple individuals of the same size from the same species are observed (and lumped into one record), these can be specified as well, i.e., by letting `arth$count` be something other than 1. For an example of such formatting see the data set `anbo`, included with  `meteR`, discussed in the next section. There are two main reasons to provide data to `meteR`: (1) `meteR` will calculate the state variables $N_0$ (number of individuals), $S_0$ (number of species), $E_0$ (total metabolic rate) and relevant summary statistics automatically; (2) these data are used by `meter` to compare against predictions. If data are not provided, the values for the state variables can be directly specified by the user (see Case Study 3).

Analysis begins by building the ecosystem structure function (ESF; $R(n,e)$) from which all non-spatial macroecological metrics can be derived. $R(n, e)$ describes the joint probability of observing a species with $n$ individuals and a randomly chosen member of that species having metabolic rate $e$. METE computes this distribution by maximizing information entropy relative to the constraints of $N_0/S_0$ and $E_0/S_0$ using the method of Lagrange multipliers. In `meteR` we achieve this as follows:
```{r}
esf1 <- meteESF(spp=arth$spp,
                abund=arth$count,
                power=arth$mass^(3/4),
                minE=min(arth$mass^(3/4)))
esf1
str(esf1)
```
Note that we use the terms 'power' and 'metabolic rate' interchangeably (units of power are energy/time and thus an energetic rate). Further, note that we specified the minimum value for the metabolic rate, but that the minimum observed value will be taken by default. Without loss of generality, metabolic rates are re-scaled by this minimum such that the minimum possible observable metabolic rate is 1. This is necessary for the underlying mathematics as discussed in Harte (2011). 

The returned object (of class `meteESF`) contains useful information. In addition to returning the inputs (in analogy, e.g., to common model fitting functions such as `lm`), it returns the Lagrange multipliers from entropy maximization as well as information about the fitting procedure.

<!--========================================================================-->
<!--========================================================================-->

## Species Abundance Distribution (SAD)
From the ESF, we can predict macroecological patterns. We begin with the species abundance distribution (SAD).
```{r}
sad1 <- sad(esf1)
sad1
```
The `S3` method `sad` for class `meteESF` extracts the SAD and provides density (d), probability (p) and quantile (q) functions, as well as random number generation (r) for the distribution (see these with `str(sad1)`). This specification follows the conventions used by statistical distributions in the `stats` package (e.g. see `?rnorm`). These distributions allow us to use the model in a number of ways. You can access these functions directly as follows; e.g. randomly generating samples from the fitted distribution or determining the quantiles.
```{r}
sad1$r(20)
sad1$q(seq(0,1,length=10))
```

`meteR` readily plots the SAD as either a rank-abundance distribution (`ptype=rad`) or a cumulative distribution (`ptype=cdf`) (and other predicted distributions):
```{r sad_plot,fig.width=6.5, fig.cap="Different ways of plotting the SAD"}
par(mfrow=c(1,2))
plot(sad1, ptype='rad', log='y')
plot(sad1, ptype='cdf', log='x')
```

`meteR` provides functions to assess model fit based on likelihood and residuals. First, we illustrate likelihood methods:

```{r, fig.width=5, fig.cap="Likelihood-based z-score."}
#== calculate the liklihood of the data, given the fitted model
logLik(sad1)
#== randomly generate 100 data sets from the fitted distribution and calculate
#== the z-score of the data w.r.t. these simulations
llz <- logLikZ(sad1, nrep=100, return.sim=TRUE)
llz$z
#== plot the distributions
plot(density(llz$sim, from=0),
     xlim=range(c(llz$sim,llz$z)),
     xlab='Scaled log(likelihood)',col='red')
#== add 95% quantile region
abline(v=quantile(llz$sim, 0.95), col='red')
#== add observed likelihood
abline(v=llz$z,lty=2)
legend('top',legend=c('data','simulated from METE'),
       col=c('black','red'), lty=c(2, 1), bg='white') 
```

Because the likelihood of the observed data falls within the 95\% quantile region of the simulated likelihoods, we have confidence that the model provides a good fit.  It should be noted that the z-score and associated simulation values are square-transformed such that the resulting null distribution approaches a Chi-squared distribution with d.f. = 1.  Specifically the z-score is $z = ((\text{logLik}_{obs} - mean(\text{logLik}_{sim})) / sd(\text{logLik}_{sim}))^2$.

Note that other utilities that extract liklihoods from model objects are also useful, including AIC, thus opening up all model comparison capabilities in `R` and its contributed packages:
```{r}
AIC(sad1)
```

Next, we illustrate methods relying on residuals to assess model fit. Residuals can be calculated either on the rank abundance (`type='rank'`) or cumulative (`type='cumulative'`) distribution and can be calculated as relative residuals ($(x_i - \hat{x})/\hat{x}$) or absolute ($x_i - \hat{x}$) by setting argument `relative` to `TRUE` or `FALSE`, respectively. Relative residuals can be thought of as the proportional difference between observed and expected abundance or probability. From the residuals, mean squared error can be calculated and used to evaluate model fit. Although both rank abundance and cumulative density options are available, we recommend only using rank abundance as cumulative density is constrained at upper and lower bounds and so its residuals will not be as reliable.

```{r}
#== calculate the residuals from the fitted distribution
head(residuals(sad1))
head(residuals(sad1, type='cumulative', relative=FALSE))
#== calculate the mean-squared error
mse(sad1, type='rank', relative=FALSE)
#== randomly generate 100 data sets from the fitted distribution and calculate
#== the z-score of the data w.r.t. these simulations
msez.rank <- mseZ(sad1, nrep=100, return.sim=TRUE, type='rank')
msez.rank$z
#== plot the distributions
plot(density(msez.rank$sim),
     xlim=c(0.05, 10),
     xlab='Scaled mean squared error',col='red', log='x')
#== add 95% quantile region
abline(v=quantile(msez.rank$sim, 0.95), col='red')
#== add observed likelihood
abline(v=msez.rank$z,lty=2)
legend('top',legend=c('data','simulated from METE'),
       col=c('black','red'), lty=c(2 ,1),bg='white') 
```

Here again we fail to reject METE as the model that generated the observed data (the observed MSE is below the 95\% quantile region). We can also see that the distribution of errors (which again uses the same square transformation as in `logLikZ`) is not as Chi-squared shaped as the distribution of likelihoods. Thus although residuals can be useful (specifically we can see whether, e.g., common species are more or less common than predicted by METE) we recommend using likelihood for evaluating model fit.

<!--========================================================================-->
<!--========================================================================-->

## Individual Power Distribution (IPD)

Similarly to the analyses illustrated above for the SAD, one can examine the individual power distribution (IPD; the distribution of metabolic rates, or power, over individuals in the community). We illustrate with an abbreviated version of the analyses above. First, fit and plot the IPD.

```{r fig.width=7, fig.cap="Different ways of plotting the IPD"}
ipd1 <- ipd(esf1)
ipd1
str(ipd1) # analogous structure to sad1 above
ipd1$r(8) # random number generation from fitted distribution

par(mfrow=c(1,2))
plot(ipd1, ptype='cdf', log='x')
plot(ipd1, ptype='rad', log='y')
```

Next, assess the fit of the IPD.

```{r fig.width=5, fig.cap="Likelihood-based z-score."}
head(residuals(ipd1))
logLik(ipd1)
logLikZ(ipd1, nrep=100)
llz <- logLikZ(ipd1, nrep=100, return.sim=TRUE)
plot(density(llz$sim),xlim=range(c(llz$sim,llz$obs)),xlab='log(likelihood)',col='red')
abline(v=llz$obs,lty=2)
llz$z
legend('top',legend=c('data','simulated\nfrom METE'),col=c('black','red'),
       lty=c(1,1),bty='n') 

```

Interestingly, although it appears that the data are unlikely to have been drawn from the METE distribution based on the z-score of the likelihood, the mean squared error suggests otherwise.

```{r fig.width=5, fig.cap="Mean squared error-based z-score."}
mse(ipd1, type='rank', relative=FALSE)
mseZ <- mseZ(ipd1, nrep=100, return.sim=TRUE)
mseZ$z
plot(density(mseZ$sim),xlim=range(c(mseZ$sim,mseZ$obs)),xlab='mse',col='red')
abline(v=mseZ$obs,lty=2)
legend('top',legend=c('data','simulated\nfrom METE'),col=c('black','red'),
       lty=c(1,1),bty='n') 
```

<!-- yeah i'm worried about this, especially cause the plots look like theory fits really well. will need to look into it more -->

METE predicts two other distributions relating to metabolic rates: the distribution of metabolic rates across individuals within a species with $n$ total individuals (the `sipd`, $\Theta(e | n)$ in the notation of Harte 2011) and the species level distribution of average metabolic rates (the `spd`, $\nu(e)$ in the notation of Harte 2011). 

We can obtain the SIPD with `S3` method `sipd` which requires either a species ID be specified or the total abundance of a hypothetical species:
```{r fig.width=5, fig.cap="Distribution of metabolic rates of individuals in a species with $n$ individuals"}
sipd1 <- sipd(esf1, sppID=27)
plot(sipd1, log='x',ylim=c(0,1))

## sipd based on total abundance of a hypothetical species
sipd2 <- sipd(esf1, n=25)
```
Note that this plot is not particularly informative in this case because there are only 3 unique observations of metabolic rate for this species. 

We can obtain the SPD with `S3` method `spd`:
```{r fig.width=5, fig.cap="Distribution of metabolic rates of individuals in a species with $n$ individuals"}
spd1 <- spd(esf1)
plot(spd1, log='x')
```

Because the distribution of within species metabolic rates depends on the total abundance of that species, it makes sense that there should be a relationship between the mean metabolic rate and the abundance of a species. This relationship has been widely studied (Damuth 1998; White et al. 2007) and if mean metabolic rate is independent of $n$ this is known as energy equivalence. To explore this relationship we introduce a new `S3` class `meteRelat` which acts like `meteDist` to house both observed and theoretical predictions. `meteRelat` objects represent a deterministic relationship, in contrast to the  probabilistic distributions represented by `meteDist` objects.  To obtain the relationship between abundance and mean metabolic rate we use the `ebar` function:
```{r fig.width=5, fig.cap="Relationship between abundance and mean metabolic rate"}
ebar1 <- ebar(esf1)
plot(ebar1)
```
<!--========================================================================-->
<!--========================================================================-->
<!--========================================================================-->

#Case Study 2 - Spatial distributions
Next, we illustrate spatial analyses with data from a survey of a desert annual grassland community at the Anza 
Borrego Preserve (Harte and Harte, unpub. data). The survey recorded every herbaceous plant within a 4x4m plot gridded to ever square meter (resulting in 16 total cells).  Thus spatial data take the form of the row and column identity where each plant was recorded. This dataset is distributed as part of `meteR`. We will use it to illustrate the construction of distributions related to abundance and area. Note that metabolic rate/body mass information are not available, but that we can still analyze many of METE's predictions related to abundance and spatial distribution. As shown in case study 1 above, we can build the ESF and investigate the SAD.

```{r}
data(anbo)
head(anbo)
esf2 <- meteESF(spp=anbo$spp,
             abund=anbo$count)
str(esf2)
```

Note that when metabolic rate data are absent, `meteR` assumes a very large value for the state variable E0, such that terms involving E0 can be safely ignored in distributions and relationships involving abundance following Harte (2011).

```{r,fig.width=7,fig.cap="Different ways of plotting the SAD for the anbo data."}
(sad2=sad(esf2))
par(mfrow=c(1,2))
plot(sad2, ptype='rad')
plot(sad2, ptype='cdf')
```

Analogous to the construction of the ESF above, we construct the Spatial Structure Function (SSF) (denoted $\Pi(n | n_0, A, A_0)$ in Harte 2011). $\Pi$ depends on the total abundance $n_0$ of the species of interest as well as the the total area of the study system $A_0$ and the scale for which the SSF is to be calculated ($A$). Again, in analogy to how the SAD is constructed from the ESF, we can obtain the Species Spatial Abundance Distribution (SSAD) from the SSF, but here must specify the species for which the SSF and SSAD are to be constructed as well as the spatial state variables (both the area of interest `A` and the total area `A0`).

```{r,fig.width=7,fig.cap="SSAD for the anbo data."}
## note we are calculating SSF for the species crcr
pi1 <- meteSSF(anbo$spp, 'crcr', anbo$count, row=anbo$row, col=anbo$column, A=1, A0=16)
pi1
plot(ssad(pi1))
```

Areas greater than the minimum can also be explored:
```{r,fig.width=7,fig.cap="SSAD for A=2."}
pi2 <- meteSSF(anbo$spp, 'crcr', anbo$count, row=anbo$row, col=anbo$column, A=2, A0=16)
pi2
plot(ssad(pi2)) # theory is not looking too good for this case
```

This is done by aggregating cells internally in `meteSSF`.

We will examine the case where spatial data are given in the form of `x` and `y` coordinates for each abundance measure by simulating such data.

```{r,fig.width=7,fig.cap="SSAD for simulated x,y anbo data."}
## jitter abundance records within each cell
anbo$x <- runif(nrow(anbo), 0, 1) + anbo$column
anbo$y <- runif(nrow(anbo), 0, 1) + anbo$row

pi3 <- meteSSF(anbo$spp, 'crcr', anbo$count, x=anbo$x, y=anbo$y, A=1, A0=16)
plot(ssad(pi3)) # the plot has naturally changed slightly due to the jittering
```

Next we predict the Species Area Relationship and Endemics Area Relationship (SAR and EAR). We begin by predicting the number of species at scales smaller than the total plot size. For this we need to combine the ESF and the SSF. Note the the EAR is obtained in all functions by specifying argument `EAR=TRUE`. Downscaling assumes that METE holds at the largest spatial scale and smaller scales are spatially nested subsamples of this larger scale.  As such this approach makes intuitive sense for small-scale plot data.  Conversely upscaling allows one to use plot-level data to predict species richness at scales much larger than the surveyed plot.  This is achieved by iteratively solving for the larger spatial scale that would yield the observed smaller scale under nested subsampling.  This iterative process can be carried out to indefinite scales.

The METE-predicted SAR can be obtained in two ways, one directly and one recalling the `meteDist` objects explored earlier.

```{r,fig.width=7,fig.cap="The SAR (black) and EAR (blue) for the anbo data."}
## first the direct method
anbo.esf <- meteESF(spp=anbo$spp, abund=anbo$count)
anbo.thr.downscale <- downscaleSAR(anbo.esf, 2^(seq(-3, 4, length=17)), 16)
anbo.thr.downscale
anbo.thr.downscaleEAR <- downscaleSAR(anbo.esf, 2^(seq(-3, 4, length=17)), 16, EAR=TRUE)

## upscaling
anbo.thr.upscale <- upscaleSAR(anbo.esf, 16, 2^6)

## plotting
plot(anbo.thr.downscale, xlim=c(1, 2^6), ylim=c(0, 35))
plot(anbo.thr.downscaleEAR, col='blue', add=TRUE)
plot(anbo.thr.upscale, col='red', add=TRUE)
```
In this example, there are 24 species distributed across 16 quadrats, hence the downscaled SAR (black) and EAR (blue) converge at 24 species at an area of 16. Endemics here must be interpreted as local endemics, thus all 24 species are ``endemic'' to the 16m$^2$ plot.  Upscale species richness is shown in red.  Note that for mathematical convenience (discussed in Harte 2011) we only consider doublings of area when computing upscaled species richness.  For intermediate areas one can simply interpolate.

To compare the mete prediction with data we use `meteSAR` to obtain a `meteRelat` object that stores both observed data and theoretical predictions:
```{r,fig.width=7,fig.cap="Comparing SAR with data."}
anbo.sar <- meteSAR(anbo$spp, anbo$count, anbo$row, anbo$col, Amin=1, A0=16)
anbo.sar
plot(anbo.sar, log='xy')
```

Note that we can also calculate the SAR or EAR using `x`, `y` data instead of rows and columns similarly to how we did this with the SSF:
```{r}
anbo.sar.sim <- meteSAR(anbo$spp, anbo$count, x=anbo$x, y=anbo$x, Amin=1, A0=16)
anbo.sar.sim
```

<!-- ANDY: is there an easy way to get confidence intervals on these predictions? would you have to bootstrap the data or is there a more elegant option? More generally, can you think of a generic way to get intervals on any of the predictions or meter? Maybe theres a slick way to use bootstrapping package and just run each of our functions in just a few lines?-->

<!-- the z-score simulation stuff can give confidence intervals, but for the relationships there's no straightforward way to do it.  we'd have to make probabilistic relationships, which is do-able, but best saved for later.  Justin Kitzes actually has a paper about probabilistic SAR that we can base our later attempts on-->

The empirical SAR (or EAR) can also be directly obtained from the data without jointly calculating the theoretical predictions:
```{r,fig.width=7}
## empirical SAR and EAR
anbo.obs.sar <- empiricalSAR(anbo$spp, anbo$count, anbo$row, anbo$col, Amin=1, A0=16)
plot(anbo.obs.sar)

anbo.obs.ear <- empiricalSAR(anbo$spp, anbo$count, anbo$row, anbo$col, Amin=1, A0=16, EAR=TRUE)
plot(anbo.obs.ear)
```


<!--========================================================================-->
<!--========================================================================-->
<!--========================================================================-->

# Case Study 3 - Specifying the state variables
`meteR` allows the user to manually specify the state variables, which can be useful to explore how METE predictions vary as a function of state variables without regard to specific datasets. 

```{r,fig.width=7}
esf3 <- meteESF(N0=4000, S0=50, E0=1e5, minE=1)
sad3 <- sad(esf3)
ipd3 <- ipd(esf3)
par(mfrow=c(1,2))
plot(sad3)
plot(ipd3)
```

Similarly, one can predict spatial distributions from the state variables. First, we downscale the species area relationship.

```{r,fig.width=7}
## theoretical SARs from state variables only
thr.downscale <- downscaleSAR(meteESF(S0=40, N0=400), 2^seq(-1, 4, by=0.25), 16)
thr.downscaleEAR <- downscaleSAR(meteESF(S0=40, N0=400), 2^seq(-1, 4, by=0.25), 16, EAR=TRUE)
plot(thr.downscale, ylim=c(0, 40), col='red')
plot(thr.downscaleEAR, add=TRUE, col='blue')
```

We can also upscale the SAR. 
```{r,fig.width=7}
thr.upscale <- upscaleSAR(meteESF(S0=40, N0=400), 2^(-1:4), 16)
```

<!--
For example, we could explore how the SAD and IPD vary as a function of N0.
```{r,fig.width=7}
#N0s=c(50,100,200,400,800,1600,3200,6400)
#esf4=sad4=ipd4=vector('list',8)
#for(i in 1:length(N0s)){
#  esf4[[i]]=meteESF(N0=N0s[i],S0=50,E0=1e5,minE=1)
#  sad4[[i]]=sad(esf4[[i]])
#  ipd4[[i]]=ipd(esf4[[i]])
#}

#par(mfrow=c(1,2))
#plot(sad4[[i]])
#for(i in 1:length(N0s)){ 

#}
#plot(sad3)
#plot(ipd3)
```
-->
<!--========================================================================-->

# References
* Damuth, J. D. 1998. Population ecology: common rules for animals and plants. Nature 395: 115-116.
* Gruner, D. S. 2007. Geological age, ecosystem development, and local resource constraints on arthropod community structure in the Hawaiian Islands. Biological Journal of the Linnean Society 90: 551-570.
* Harte, J. 2011. Maximum entropy and ecology: a theory of abundance, distribution, and energetics. Oxford University Press.
* Harte, J. et al. 2008. Maximum entropy and the state-variable approach to macroecology. Ecology 89: 2700-2711.
* Harte, J. and Newman, E. A. 2014. Maximum information entropy: a foundation for ecological theory. Trends in Ecology and Evolution 29: 384–389.
* White, E. P., et al. 2007. Relationships between body size and abundance in ecology. Trends in Ecology and Evolution 22: 323-330.