Profile likelihood

Approximation of confidence intervals using only gene specific data

What does it mean and when can I use it?

The profile likelihood is estimated by optimising the whole likelihood function, having fixed a parameter under consideration to a certain value. Parameter values for different genes influence each other via shared parameters, for example, the normalisation factors or variance of the negative binomial distribution.

If there are not many genes in the data, that could not be a prolblem. However, usually experimentalists doing RNA-seq experiments are interested in measuring parameters of as many genes as possible, which results in thousands of genes in data tables. Optimising likelihood function for such cases can take a lot of computation time. Hence it can be more efficeint to use approximate fitting by neglecting the influence of individual gene parameters on the shared parameters.

The workflow

Let us use the model of experiment used in the vignette for application of pulseR to spike-ins-free data:

library(pulseR)
attach(pulseRSpikeinsData)

formulas <- MeanFormulas(
  total      =  mu,
  labelled   =  mu * exp(-d * time),
  unlabelled =  mu * (1 - exp(-d * time))
)
formulaIndexes <- list(total_fraction = "total",
                       pull_down      = c("labelled", "unlabelled"))
pd <- PulseData(counts, conditions, formulas, formulaIndexes, 
                spikeins = spikeins)
opts <- setBoundaries(list(
  mu = c(.1, 1e6),
  d = c(.01, 2)
))
opts$ub$size <- 1e7
opts <- setTolerance(params = 1e-3,  options = opts)
initPars <- initParameters(par = list(size=100),
                           geneParams = c("mu", "d"),
                           pulseData  = pd,
                           options    = opts)
initPars$d[] <- .1
initPars$mu <- pd$counts[,1] + 1
fit <- fitModel(pulseData = pd, par = initPars, options = opts)

Construct the profile likelihood

geneIndex <- 7
parName <- "d"
interval <- c(0.8,1.3) * fit[[parName]][geneIndex]
numPoints <- 51
pl <- profileGene(parName, geneIndex, pd, fit, interval, 
                  options=opts, numPoints = 21)
str(pl)
## 'data.frame':    21 obs. of  2 variables:
##  $ d   : num  0.124 0.127 0.131 0.135 0.139 ...
##  $ logL: num  19.23 14.48 10.41 7.01 4.29 ...

We can see the result using the plotPL function:

plotPL(pl)

The red line corresponds to the 95% confidence level likelihood threshold. The 95%-confidence interval for the \(d\) parameter is defined from the crossing points of the threshold line and the profile likelihood curve.

Use the ciGene function in order to estimate this points numerically:

geneIndexes <- 1:10
CIs <- ciGene(parName, geneIndexes, pd, fit, opts)
CIs
##            [,1]      [,2]
##  [1,]        NA        NA
##  [2,]        NA        NA
##  [3,] 0.2342371 0.2437456
##  [4,] 0.1083609 0.1290361
##  [5,] 0.1968716 0.2190968
##  [6,] 0.1536246 0.1719114
##  [7,] 0.1435240 0.1619218
##  [8,] 0.2693011 0.2887113
##  [9,] 0.2757496 0.2946228
## [10,] 0.2386713 0.2568335

Estimation of confidence intervals using the whole likelihood

If the normalisation factors are not fixed, or changes in a single gene parameter result in observed difference in the shared or normalisation parameters values, it is necessary to use the whole likelihood, which is a function of all shared and individual parameters.

plTotal <- profile(paramPath = list("d",geneIndex),
              pd = pd, 
              par = fit,
              options = opts,
              interval = interval,
              numPoints = 20)

If the model includes the normalisation factors and we are interested in estimation of the CI for the normalisation factors, it is possible to run the same function in the same manner:

plTotal <- profile(paramPath = list("normFactors",2,1),
              pd = pd, 
              par = fit,
              options = opts,
              interval = c(1, 3))

However, it may result in long computation times, since it is equivalent to refitting the whole model several times.

sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.18.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] pulseR_1.0.3
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.19       roxygen2_6.1.0     crayon_1.3.4      
##  [4] assertthat_0.2.0   digest_0.6.18      rprojroot_1.3-2   
##  [7] commonmark_1.6     MASS_7.3-50        R6_2.3.0          
## [10] backports_1.1.2    magrittr_1.5       evaluate_0.12     
## [13] stringi_1.2.4      rlang_0.2.2        fs_1.2.6          
## [16] xml2_1.2.0         rmarkdown_1.10     pkgdown_1.1.0.9000
## [19] desc_1.2.0         tools_3.4.0        stringr_1.3.1     
## [22] yaml_2.2.0         compiler_3.4.0     memoise_1.1.0     
## [25] htmltools_0.3.6    knitr_1.20

Uvarovskii Alexey

2018-10-22

Approximation of confidence intervals using only gene specific data

What does it mean and when can I use it?

The workflow

Construct the profile likelihood

Estimation of confidence intervals using the whole likelihood

Contents