The workflow consists of defining the data object, setting fitting options and fitting itself:
library(pulseR)
...
# define the data
pd <- PulseData(...)
# set options
opts <- setTolerance(...)
opts <- setBoundaries(..., opts)
# define first parameter guess
initPars <- initParameters(...)
# fit the model
fit <- fitModel(pd, initPars, opts)
In the pulseR
package, we keep all inial data in the PulseData
structure (S3 class). When the information about read counts, conditions and the model is defined, one can create the object with a line as
library(pulseR)
attach(pulseRSpikeinsData)
pd <- PulseData(counts, conditions, formulas, formulaIndexes, spikeins)
The read counts must be delivered as an integer matrix. The rownames may describe the gene names and the column order corresponds to the sample order in the condition matrix (see below)
counts[c(1:5, 50:55), 1:4]
## total_fraction total_fraction total_fraction pull_down
## gene_1 1006 983 1005 1859
## gene_2 1984 1975 1992 3214
## gene_3 2914 2971 3049 4930
## gene_4 4106 4012 3839 7377
## gene_5 5083 5240 4915 8455
## gene_50 49745 50289 50003 80099
## gene_51 50713 52207 50233 94670
## gene_52 51829 51490 51292 87236
## gene_53 53077 52261 54400 96165
## gene_54 52626 52373 53856 102734
## gene_55 55412 55741 55124 89592
The information about samples must be described in the condition matrix. It is obligatory that the row order corresponds to the column order in the count matrix. Time or other sample-specific variable are defined here.
conditions
## fraction time
## sample_1 total_fraction 1
## sample_2 total_fraction 2
## sample_3 total_fraction 3
## sample_4 pull_down 1
## sample_5 pull_down 2
## sample_6 pull_down 3
## sample_7 total_fraction 1
## sample_8 total_fraction 2
## sample_9 total_fraction 3
## sample_10 pull_down 1
## sample_11 pull_down 2
## sample_12 pull_down 3
The design of the experiment defines how different fractions evolve with the time. Let us consider the chase experiment, when the labelled molecules are being only degraded and the unlabelled molecules are synthesised and degraded:
formulas <- MeanFormulas(
total = mu,
labelled = mu * exp(-d * time),
unlabelled = mu * (1 - exp(-d * time))
)
Although the expression level in the form of mean read count is easier for perception, it can be better to model it at the logarithmic scale. In this case, the mean read counts is exp(mu)
, and the degradation rate d
and logarithm of mean read count mu
have values of the same scale, i.e. unit changes in d * time
or mu
result in similar changes in the expected read number:
formulas <- MeanFormulas(
total = exp(mu),
labelled = exp(mu - d * time),
unlabelled = exp(mu) * (1 - exp(-d * time))
)
In the mathematical form, it looks like \[ \begin{align} \text{[total]} &= e^\mu\\ \text{[labelled]} &= e^{\mu - d\cdot t}\\ \text{[unlabelled]} &= e^{\mu}\left(1 - e^{-d\cdot t}\right) \end{align} \]
In this vignette we use the logarithm of the mean read expression!
It is important to note, that one should not forget to draw attention to to the correct scale of the parameter ranges, see below.
In pulseR, it is possible to model contamination of the fractions. This definition is passed as formulaIndexes
argument to the PulseData
function. Here we define that the pull-down fraction consists of the labelled and unlabelled molecules, and the total fraction is degenerates to a simple formula, which we defined under the total
list item in the formulas
:
formulaIndexes <- list(total_fraction = "total",
pull_down = c("labelled", "unlabelled"))
Samples can be normalised using spike-ins (this vignette), which should be included in the count table together with other genes. One group of sample must be assigned as a reference, namely, the total fraction, which includes all the spike-ins type, labelled and unlabelled, in the original proportion.
spikeins
## $refGroup
## [1] "total_fraction"
##
## $spikeLists
## $spikeLists$total_fraction
## $spikeLists$total_fraction[[1]]
## [1] "spikes 1" "spikes 2" "spikes 3" "spikes 4" "spikes 5"
## [6] "spikes 6" "spikes 7" "spikes 8" "spikes 9" "spikes 10"
##
##
## $spikeLists$pull_down
## $spikeLists$pull_down[[1]]
## [1] "spikes 1" "spikes 2" "spikes 3" "spikes 4" "spikes 5"
##
## $spikeLists$pull_down[[2]]
## [1] "spikes 6" "spikes 7" "spikes 8" "spikes 9" "spikes 10"
The fitting procedure uses as an input the PulseData
object, which we can create having defined all the information about the experimental system:
pd <- PulseData(counts, conditions, formulas, formulaIndexes,
spikeins = spikeins)
Alternatively, the relations between samples can be inferred during the fitting procedure (if no spike-ins counts are present in the data). There are two factors, which affects the expected read number in the model at a sample-wise level:
The way of sample grouping is defined in the group
argument of the PulseData
function. We assume that efficiency of the pull-down procedure is different between different time points (time
column in the condition matrix), but the same for the samples from the same time point:
pd <- PulseData(counts, conditions, formulas, formulaIndexes,
groups = ~ fraction + time)
The parameters can be estimated using the fitModel
procedure. It requires to provide the PulseData
object, the initial guess for the parameter values (see below) and the fitting options.
fit <- fitModel(pulseData = pd, par = initPars, options = opts)
The function will iteratively fit the parameters from the formulas, overdispersion parameter for the negative binomial distribution and, if defined, the normalisation factors, until the stopping criteria is met.
To fit the model, we use L-BFGS-B method implemented in the stats::optim
function. It requires values for the lower and upper boundaries for the parameter values.
As it was mentioned earlier, we model the mean expression level at the logarithmic scale, which must be reflected in the boundaries as well. One may simply add log
to the exprected read count range:
opts <- setBoundaries(list(
mu = log(c(.1, 1e6)),
d = c(.01, 2),
size = c(1, 1e7)
))
The fitModel
function performs fitting in iterations: it optimises gene-specific, shared, overdispersion and normalisation parameters at separate steps. Relative change between them is the stopping criteria:
opts <- setTolerance(params = 1e-3, options = opts)
The fitting routine needs an initial guess for the parameter values. The result of the fitting may depend on the model parametrisation and the initial guess.
The parameters, which are not specified in the par argument, in our case, both, the expression level and the degradation rate, will be assigned a random value within the boundaries defined in the fitting options list:
initPars <- initParameters(par = list(size=1e5),
geneParams = c("mu", "d"),
pulseData = pd,
options = opts)
However, since we use the total sample as a reference, we expect the read count in the “total_fraction” samples to be of close value to the optimal estimation of the expression level, i.e. exp(mu)
in our parametrisation. We can help the fitting procedure by providing these values as a first guess (do not forget about the logarithm):
initPars$mu <- log(pd$counts[,1] + .5)
The result of the fitModel
function is just a list with the parameter values, which are ordered according to the rows in the read count table:
fit <- fitModel(pulseData = pd, par = initPars, options = opts)
str(fit)
## List of 3
## $ size: num 5471
## $ mu : num [1:500] 6.92 7.62 8.02 8.3 8.54 ...
## $ d : num [1:500] 0.161 0.272 0.237 0.119 0.208 ...
We can use model prediction to see how good the model fits to the data (in the case of the simulated data all looks good):
pr <- predictExpression(fit, pd)
plot(pr$predictions, pd$counts, pch=16, cex=.5, log='xy',
xlab="prediction", ylab="experiment")
abline(0, 1, col = "tomato")
The data were simulated from known values in pulseRSpikeinsData$par
and we can compare fit to the true values:
par(mfrow=c(1,2))
plot(par$d, fit$d, pch=16, cex=1,
xlab="true", ylab="fit", main="d, degradation rate")
abline(0, 1, col = "tomato")
plot(par$mu, exp(fit$mu), pch=16, cex=1, log='xy',
xlab="true", ylab="fit", main="mu, expression level")
abline(0, 1, col = "tomato")
One can see, that there is a bias in the estimations: the normalisation affects the estimation by introducing a shift in the degradations rates. In this example, the normalisation and cross-contamination is derived on the basis of 5 labelled and 5 unlabelled spike-ins only.
For simulation the uniformly distributed parameters were used, which is also observed in the fitted values:
plot(exp(fit$mu), fit$d, pch=16, cex=.5, xlab="expression", ylab="degradation")
To generate the confidence intervals, one needs to provide the parameter name (e.g. “d”), indexes of the genes (in this example, only the first 150 genes), for which to compute the intervals and all the objects from the fitting procedure:
geneSubset <- 1:150
cis <- ciGene("d", geneSubset, pd, fit, opts)
## Create a result table and plot the CI together with the estimations
dat <- data.frame(
d = fit$d [geneSubset],
mu = fit$mu[geneSubset],
d.min = cis[,1],
d.max = cis[,2]
)
o <- order(dat$d)
i <- seq_along(dat$d)
plot(x = i, y = dat$d[o], type = "n",
cex.lab = 1.2,
xlab = "ranked by d",
ylab = "d, 1/hr"
)
lines(x = i, y = dat$d.max[o], col = "dodgerblue")
lines(x = i, y = dat$d.min[o], col = "dodgerblue")
points(x = i, y = dat$d[o], cex = .3)