--- title: "REMA model equations" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{REMA model equations} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: markdown: wrap: 72 --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) knitr::opts_chunk$set(warning = FALSE, message = FALSE) # devtools::build_rmd(files = 'vignettes/rema_equations.Rmd') ``` # Base model structure for a single survey and stratum The base REMA model can be represented as a simple state-space random walk model with added noise. The observation model is comprised of an index of log-transformed annual survey biomass estimates $ln(B_t)$ with associated standard deviations $\sigma_{ln(B_t)}$, where $\sigma_{ln(B_t)}$ is approximated using the coefficient of variation of $B_t$ ($\sigma_{B_t}/B_t$) such that $$ \sigma_{ln(B_t)}=\sqrt{ln((\frac{\sigma_{B_t}}{B_t})^2+1)}. $$ The measurement or observation equation, which describes the relationship between the observed survey biomass $ln(B_t)$ and the latent state variable, population biomass $ln(\hat{B}_t)$, is expressed as $$ ln(B_t) = ln(\hat{B}_t) + \epsilon_{B}, \text{where } \epsilon_{B} \sim N(0, \sigma_{ln(B_t)}^2). $$ The state equation and associated process error variance $\sigma_{PE}^2$ is defined as $$ ln(\hat{B}_{t}) = ln(\hat{B}_{t-1})+\eta_{t-1}, \text{where } \eta_t \sim N(0, \sigma_{PE}^2), $$ where the initial $ln\hat{B}_{1}$ is constrained by the random walk process. In the base model, the process error variance $\sigma_{PE}^2$ is the only fixed effect parameter estimated, and the unobserved population biomass $ln(\hat{B}_t)$ is estimated as a series of random effects. The model is fit using marginal maximum likelihood estimation, where the marginal negative log-likelihood is obtained via the Laplace approximation (Skaug and Fournier 2006). A reproducible example of fitting the univariate (i.e., single stratum, single survey) version of the TMB model and comparing those results to ADMB is available in the [REMA basics](https://afsc-assessments.github.io/rema/articles/ex1_basics.html) vignette. ## Extending to multiple biomass survey strata The single survey, single stratum version of the model can be extended to include one or more additional strata $j$ from the same survey. The inclusion of multiple strata in the same model is advantageous in scenarios where the apportionment of biomass among areas is a desired result. This extension assumes no correlation in the observation errors among survey strata, though PE variance can be shared or estimated independently among strata. The ADMB and TMB versions of the REMA model utilize different methods for estimating variance of the total predicted biomass. Therefore, while strata-specific estimates of predicted biomass and associated confidence intervals will be close to identical between the ADMB and TMB versions of the model, the confidence intervals of the total predicted biomass will differ slightly. In the original ADMB code, the Marlow (1967) method is applied, such that the total variance $\sigma_J^2$ is approximated as $$ \sigma_J^2 = ln\Bigg(\frac{\sum{e^{2 \hat{B}_j + \sigma_{\hat{B}_j}^2} (e^{\sigma_{\hat{B}_j}^2 - 1})}} {(\sum{e^{\hat{B}_j + \sigma_{\hat{B}_j}^2/2}})^2} + 1\Bigg). $$ In the updated *rema* package, the total variance is estimated using the standard Delta method and can be replicated in ADMB using an `sdreport_number` and in TMB using the `ADREPORT` macro. As described in Monnahan et al. (2021), the exploration of methods for summing log-normal variables is a research topic that has potential impacts beyond the scope of this package. A reproducible example of fitting the multivariate version of the REMA model in TMB and a comparison of results to ADMB is available in the [REMA basics](https://afsc-assessments.github.io/rema/articles/ex1_basics.html) vignette. In this example, we also show how Akaike Information Criteria (AIC) can be used to explore the inclusion of strata-specific versus a single, shared process error variance. ## Addition of an auxiliary catch per unit effort (CPUE) survey In situations where an auxiliary biomass or catch per unit effort (CPUE) survey $I_{t}$ and associated variance $\sigma_{I_{t}}$ are available, an additional scaling parameter $q$ (estimated in log-space) can be estimated to facilitate the inclusion of the new information into biomass predictions. The predicted annual CPUE survey index $\hat{I}_{t}$ is calculated as $$ \hat{I}_{t} = e^{q} * e^{\hat{B}_{t}} $$ and the CPUE survey observations has an additional measurement equation and likelihood component similar to the biomass survey: $$ ln(I_{t}) = ln(\hat{I}_{t}) + \epsilon_{I}, \text{where } \epsilon_{I} \sim N(0, \sigma_{ln(I_t)}^2), $$ By default, *rema* estimates a single $q$ for each stratum; however, users can alternatively specify shared $q$ parameters among strata. If the strata definitions are not the same for t,he biomass and CPUE surveys (e.g. biomass is estimated at a finer geographic resolution that CPUE), the user can define the relationship between the two surveys' strata using the `q_options` argument to the [`prepare_rema_input()`](https://afsc-assessments.github.io/rema/reference/prepare_rema_input.html) function (see `q_options` and `pointer_biomass_cpue_strata)`. However, since the the auxiliary CPUE survey index is related to unobserved population biomass at the level of the biomass survey strata, the REMA model can only accommodate scenarios where the CPUE survey strata have a coarser resolution or are equivalent to the biomass survey strata. A reproducible example of fitting to the two-survey version of the REMA model is provided in the [Fitting to an additional CPUE survey vignette](https://afsc-assessments.github.io/rema/articles/ex2_cpue.html). This example also describes an error in the previously used ADMB version of the REMA model, and presents several alternative models that estimate additional observation error. These topics are described in more detail in the following two sections. ### The ADMB version of REMA Separability, as implemented in the `SEPARABLE_FUNCTION` in ADMB and through automatic detection in TMB, increases the computational efficiency of the Laplace approximation by breaking complex, multivariate integrals into a product of simpler, univariate integrals (Fournier et al. 2012). The use of separability is particularly relevant for state-space models as it relates to the efficiency of the Laplace approximation of the marginal negative log-likelihood and integration of the random effects from the joint negative log-likelihood. In the random effects implementation of ADMB, parameters defined inside the `PARAMETER_SECTION` of the template file cannot be used within the `SEPARABLE_FUNCTION` unless they are passed as arguments to the function (Skaug and Fournier 2013). In the ADMB version of the REMA model, the calculation of predicted CPUE [i.e. $ln(\hat{I}_{t})=ln(q)+ln(\hat{B})$] occurred outside rather than inside the `SEPARABLE_FUNCTION`, and as a result, violated this rule and affecting the accuracy of the Laplace approximation. We explored this error in a [simplified example](https://github.com/JaneSullivan-NOAA/separable_function) of the REMA model in both ADMB and TMB. We were able to reproduce results from the TMB version of REMA in ADMB by passing $ln(q)$ as an argument to the `SEPARABLE_FUNCTION` and performing the $ln(\hat{I}_{t})$ calculation inside the `SEPARABLE_FUNCTION`. When we used Markov Chain Monte Carlo (MCMC) methods for statistical inference instead of MLE, the results from both ADMB versions closely matched the correct version of REMA. Finally, a comparison of the individual negative log-likelihood (NLL) components, along with the joint and marginal NLLs, revealed that all were the same between the TMB model, correct "inside" ADMB version, and incorrect "outside" ADMB version, *except* the marginal NLL. Taken together, this analysis confirms there is a bug in the "outside" ADMB version, which affects the accuracy of the Laplace approximation and model results. ## Estimation of additional observation error Based on experience gained using alternative observed index estimates (e.g., relative CPUE indices), there appears to be cases where the estimates of observation error variances for the biomass and/or CPUE survey are too low (e.g., Echave et al. 2020). That is, there is a mismatch between biologically reasonable inter-annual variability and the precision of index estimates. In these instances, the model estimates of $(\sigma^2_{B_{t,j}}+\sigma^2_{I_{t,j}})/\sigma^2_{PE}$ may be lower than what should be expected based on an individual species life history traits. For example, if the ratio of observation to PE variation is low, model predictions of population biomass may exhibit high inter-annual variability. This behavior would be unexpected in low productivity species, which should exhibit low inter-annual variation in biomass (i.e. low PE variance), especially in situations when fishing exploitation is low. One approach to addressing this issue is to estimate additional observation error. This method is commonly implemented in Stock Synthesis, has been implemented in Alaskan crab stock assessments (Zheng and Siddeek 2020), and has been explored in several groundfish assessments as well. Using the biomass survey as an example, the extra estimated error ($\sigma_{\tau}$) is specified as: $$ \sigma_{ln(B_t)}=\sqrt{ln((\frac{\sigma_{B_t}}{B_t}+{\sigma_{\tau}})^2+1)}. $$ This approach is a new method for Tier 5 stock assessments and apportionment methods at the AFSC. A reproducible example with GOA shortraker is outlined in the [Fitting to an additional CPUE survey](https://afsc-assessments.github.io/rema/articles/ex2_cpue.html). In this example, the estimation of additional observation error for the biomass and CPUE survey resulted in a better fit by AIC than status quo approaches and has a more biologically realistic trend in predicted population biomass. ## Exploration of the Tweedie distribution for zero biomass observations The [Tweedie distributions](https://en.wikipedia.org/wiki/Tweedie_distribution) are a family of probability distributions that can be generalized to include the Gaussian, inverse Gaussian, gamma, Poisson, and compound Poisson-gamma distributions. The Tweedie distribution can be defined with three parameters, a mean ($\mu$), power parameter ($\rho$), and dispersion ($\phi$), where the relationship between the variance and these parameters is defined by $\sigma^2=\phi\mu^\rho$. When $\rho$ is bounded between 1 and 2, the Tweedie is a positive, continuous distribution that can handle zero values, thus allowing it to more naturally handle survey time series with one or more zero observations. Values of $\rho$ on these bounds are special cases of the Tweedie distribution, where $\rho$=1 is equivalent to a Poisson distribution and $\rho$=2 is a gamma distribution. In *rema*, the Tweedie $\rho$ is constrained between 1 and 2 using a logit-transformation. Similar to the normal distribution, observation error variances are treated as known for the Tweedie distribution in the REMA model. Using the biomass survey observation as an example, the dispersion of the biomass observation in strata $j$ in year $t$ is derived as $$ \phi_{B_{t,j}}=\frac{\sigma_{B_{t,j}}^2}{{(\hat{B}_{t,j})}^\rho}. $$ The result is that only one additional parameter ($\rho$) is estimated when applying this alternative distribution. The measurement equation for the Tweedie becomes $$ B_{t,j} = \hat{B}_{t,j} + \epsilon_{B}, \text{where } \epsilon_{B} \sim Tw_{\rho}(0, \phi_{B_{t,j}}). $$ Because $\sigma_{B_{t,j}}$ is undefined when $B_{t,j}$=0, the zeros are assumed to have a CV=1.5. This assumption can be explored by the user in `rema`, for example, if the user wanted to define a CV=2.0 for zeros, they would define `zeros = list(assumption = 'tweedie', options_tweedie = list(zeros_cv = 2.0))` within the `prepare_rema_input()` function. A reproducible example using the Tweedie distribution for observation error is outlined in the [Strategies for handling zero biomass observations](https://janesullivan-noaa.github.io/rema/articles/ex3_zeros.html) vignette. We use the non-shortspine thornyhead component of the Bering Sea/Aleutian Islands other rockfish stock, which is unique in that 13 of 38 bottom trawl survey biomass estimates are zeros. While the Tweedie performs well for this case, further exploration revealed that the Tweedie models can be very slow to run and often did not converge, especially in instances where the observation error variance estimates were low. It is possible that further development on estimating additional observation error variance (e.g. ${\tau}^2$ in the previous section) could alleviate these convergence errors. However, in the interim, the Tweedie distribution in *rema* should be considered experimental, and we do not recommend it as a viable alternative for tactical assessments at this time. ## Experimental One-Step Ahead (OSA) residuals The use of one-step ahead (OSA) residuals, also referred to as forecast residuals or prediction errors, is crucial for validation of state-space models like REMA (Thygesen et al. 2017). Instead of comparing the observed and expected values at the same time step, OSA residuals use forecasted values based on all previous observations (i.e. they exclude the observed value at the current time step from prediction). Traditional residuals (e.g. Pearson's residuals) are inappropriate for state-space models, because process error variance may be over-estimated in cases where the model is mispecified, thus leading to artificially small residuals (see Section 3 of Thygesen et al. 2017 for an example). Methods for calculating OSA residuals have been implemented in TMB through the `TMB::oneStepPredict()` function. While these methods are straight forward to implement in TMB, they are computationally demanding, and the validity (and accuracy) of the OSA residuals may vary by situation and method used. Methods to implement OSA residuals in *rema* are under development and should be considered experimental. Currently, OSA residuals are implemented using the `method = "cdf"` option in `TMB::oneStepPredict()`, which has the benefit of speeding up residual calculations by saving copies of the one-step cumulative density function at each data point and thus reducing the number of calculations at each function evaluation. While OSA residuals appear to be calculated correctly for some REMA models, occasionally `NaN` values for residuals are returned, especially in cases with small measurement errors. One potential cause of this error may lie in the initialization of the state process within the REMA model and will be explored further in future versions of this package. # References Echave, K. B., P. J. F. Hulson, and K. A. Siwicke. 2020. Assessment of the Thornyhead stock complex in the Gulf of Alaska. In: Stock assessment and fishery evaluation report for the groundfish resources of the Gulf of Alaska as projected for 2021. North Pacific Fishery Management Council, 605 W. 4th. Avenue, Suite 306, Anchorage, AK 99501. Fournier D. A., H. J. Skaug , J. Ancheta , J. Ianelli , A. Magnusson, M. N. Maunder , A. Nielsen, and J. Sibert. 2012. AD Model Builder: using automatic differentiation for statistical inference of highly parameterized complex nonlinear models, Optimization Methods and Software, 27:2, 33-249, DOI: 10.1080/10556788.2011.597854 Marlow, N. A. 1967. A normal limit theorem for power sums of independent normal random variables. Bell System Technical Journal. 46 (9): 2081--2089. . Methot, R. D. and Wetzel, C. R. 2013. Stock Synthesis: A biological and statistical framework for fish stock assessment and fishery management. Fisheries Research, 142: 86-99. Skaug H. and D. Fournier. 2013. Random effects in AD Model Builder: ADMB-RE User Guide. Skaug, H. J. and D. A. Fournier. 2006. Automatic approximation of the marginal likelihood in non-Gaussian hierarchical models. Computational Statistics & Data Analysis, 51(2): 699--709. Thygesen, U. H., Albertsen, C. M., Berg, C. W*.,* Kristensen, K., Nielsen, A. 2017. Validation of ecological state space models using the Laplace approximation. Environ Ecol Stat 24**,** 317--339. Zheng, J., and M. S. M. Siddeek. 2020. Bristol Bay red king crab stock assessment in fall 2020. In: Stock assessment and fishery evaluation report the king and tanner crab fisheries of the Bering Sea and Aleutian Islands regions. North Pacific Fishery Management Council, 605 W 4th Ave, Suite 306 Anchorage, AK 99501.