Summary:
Given a codon data file, a tree and a codon substitution
model, PositiveSelection performs a series of analyses, along the lines
of the Yang, Goldman 2000 paper, to identify sites in the data which are under
selective pressure.
MF stands for "Multiple Files". This analysis applies the positive
selection approach to a collection of small datasets. Each dataset
alone does not contain enough data to infer sites under selective
pressure, but sharing parameters across datasets yields better
estimates. The idea for the Random Effects portion of this analysis
is due to Simon Frost.
Input:
- Number of sequences file: the file with
the number of sequences per data subset. Each line in this file should contain
a single number: the number of sequences in the corresponding data subset.
- Data file: a codon data file in any
recognizable format, with all the sequences. Any of the
predefined genetic code translation tables can be used to interpret
the data. The sequences will be partitioned into subsets as specified in the
number of sequences file.
- Trees file: the file with the trees for each data subset. Each line
should contain a single Newick format tree for the corresponding subset.
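The relationship between the number of sequences file and the data file can be illustrated with a short Python sketch. It is not part of the analysis itself. It assumes that sequences are assigned to subsets consecutively, in the order they appear in the data file, which the description above does not state explicitly. The function names and the `seq1`...`seq9` labels are hypothetical.

```python
from itertools import islice


def read_subset_sizes(lines):
    """Parse one integer per non-empty line (the 'number of sequences' file)."""
    sizes = []
    for line in lines:
        text = line.strip()
        if text:
            sizes.append(int(text))
    return sizes


def partition_sequences(sequences, sizes):
    """Split sequences into consecutive subsets of the given sizes.

    Assumption: subsets are taken in file order, one after another.
    """
    if sum(sizes) != len(sequences):
        raise ValueError("subset sizes do not add up to the number of sequences")
    remaining = iter(sequences)
    return [list(islice(remaining, n)) for n in sizes]


# Three subsets with 3, 2 and 4 sequences, respectively.
sizes = read_subset_sizes(["3", "2", "4"])
names = [f"seq{i}" for i in range(1, 10)]
subsets = partition_sequences(names, sizes)
```

Under the same assumption, the k-th line of the trees file would then describe the tree for `subsets[k]`.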
The user will be prompted to specify the extent
of parameter sharing:
- All: dN/dS ratio distribution
parameters, base frequencies and transversion/transition ratios (if applicable)
are shared for all subsets.
- dN/dS only: only the dN/dS ratio distribution
parameters are shared for all subsets.
The user will also be prompted for the cut-off Bayesian level for a site to be
considered under selective pressure (a number between 0 and 1), and for the number
of rate classes to use in discretizing continuous distributions.
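One common way to apply such a cut-off is an empirical Bayes calculation: for each site, compute the posterior probability of every rate class and flag the site if the posterior mass on classes with dN/dS greater than 1 reaches the cut-off. The Python sketch below illustrates that calculation only. Whether this analysis uses exactly this rule is an assumption, and all function names and numbers are hypothetical.

```python
def site_posteriors(site_likelihoods, weights):
    """Posterior probability of each rate class at each site.

    site_likelihoods[s][j] is the likelihood of site s given rate class j;
    weights[j] is the prior probability of rate class j.
    """
    posteriors = []
    for row in site_likelihoods:
        joint = [w * lik for w, lik in zip(weights, row)]
        total = sum(joint)
        posteriors.append([x / total for x in joint])
    return posteriors


def sites_under_selection(site_likelihoods, weights, rates, cutoff):
    """Indices of sites whose posterior P(dN/dS > 1) is at least the cut-off."""
    flagged = []
    for site, post in enumerate(site_posteriors(site_likelihoods, weights)):
        p_positive = sum(p for p, r in zip(post, rates) if r > 1.0)
        if p_positive >= cutoff:
            flagged.append(site)
    return flagged


# Illustrative numbers only: three rate classes, two sites.
rates = [0.1, 1.0, 3.0]
weights = [0.6, 0.3, 0.1]
site_likelihoods = [
    [0.020, 0.010, 0.001],  # site 0: best explained by the low-rate class
    [0.001, 0.002, 0.050],  # site 1: best explained by the high-rate class
]
flagged = sites_under_selection(site_likelihoods, weights, rates, cutoff=0.5)
```

A higher cut-off makes the test more conservative: fewer sites are reported, but with stronger posterior support.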
dN/dS variability across sites is modeled by the following 13 distributions.
- Single Rate: no rate variation.
- Neutral: rates are 0 or 1 with mixing parameter P.
- Selection: rates are 0 or 1 or W (estimated) with mixing parameters P1 and P2.
- Discrete: rates are R or R*M1 or R*M2 with mixing parameters P1 and P2.
- Freqs: rates are 0, 1/3, 2/3, 1, 3 with mixing parameters P1, P2, P3 and P4.
- Gamma: rates are sampled (by conditional mean) from a two parameter gamma
distribution.
- 2 Gamma: rates are sampled (by conditional mean) from a mixture of
a two parameter gamma and a mean 1 gamma.
- Beta: rates are sampled (by conditional mean) from a two parameter beta
distribution (thus the rates are all in [0,1]).
- Beta+w: rates are sampled (by conditional mean) from a mixture of a two parameter beta
distribution and the class with rate W.
- Beta & (Gamma+1): rates are sampled (by conditional mean) from a mixture of a two parameter beta
distribution and a two parameter gamma distribution shifted to [1,Infinity).
- Beta & (Normal>1): rates are sampled (by conditional mean) from a mixture of a two parameter beta
distribution and a two parameter normal distribution restricted to [1,Infinity).
- 0 & 2 (Normal>1): rates are sampled (by conditional mean) from a mixture of the zero rate class,
a two parameter normal and a mean 1 normal (restricted to [0,Infinity)).
- 3 Normal: rates are sampled (by conditional mean) from a mixture of a mean 0 normal,
a mean 1 normal, and a two-parameter normal, all restricted to [0,Infinity).
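Sampling "by conditional mean" can be pictured as follows: the continuous distribution is cut into bins, and each bin is represented by the mean of the distribution within that bin. The Python sketch below is a rough Monte Carlo approximation of this idea for a mean 1 gamma distribution. It is not the numerical method used by the analysis. Equal-probability bins, the shape value 0.5 and the function name are all assumptions made for illustration.

```python
import random


def conditional_mean_classes(samples, num_classes):
    """Approximate conditional-mean discretization from random draws.

    The sorted draws are split into num_classes bins of (nearly) equal
    size, i.e. equal probability, and each bin is replaced by its mean.
    """
    ordered = sorted(samples)
    n = len(ordered)
    rates = []
    for i in range(num_classes):
        lo = i * n // num_classes
        hi = (i + 1) * n // num_classes
        chunk = ordered[lo:hi]
        rates.append(sum(chunk) / len(chunk))
    return rates


rng = random.Random(0)  # fixed seed so the sketch is repeatable
shape = 0.5             # hypothetical shape parameter
# random.gammavariate takes (shape, scale); scale = 1/shape gives mean 1.
samples = [rng.gammavariate(shape, 1.0 / shape) for _ in range(100_000)]
rates = conditional_mean_classes(samples, num_classes=4)
```

Because every bin carries the same probability, the average of the class rates matches the mean of the draws, so the discretized distribution keeps (approximately) the mean of the continuous one.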
The remaining distributions are (R)andom (E)ffects models, which use
the following idea. Let c be a random dN/dS variable. In the usual
setting, c varies across sites, but for this analysis, we let c
vary across subsets. Thus, the likelihood of a complete data set will be:
    L = \prod_{k=1}^{N} \sum_{j=1}^{M} \Pr(c = c_j) \, L_k(c_j)
where N is the number of subsets, M is the number of values c
can take on, c_j is the j-th such value, and L_k(c_j), the conditional
likelihood of subset k, is computed using the same value of c for all sites.
- RE: Lognormal: rates are sampled (by conditional mean) from a lognormal distribution,
with variance parameter sigma.
- RE:Gamma: rates are sampled (by conditional mean) from a two parameter gamma
distribution.
- RE:Discrete: rates are R or R*M1 or R*M2 with mixing parameters P1 and P2.
The user may choose any combination of (or all 16) distributions
to run.
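For the random effects distributions, the likelihood of the complete data set combines the subsets as a product, over the N subsets, of a weighted sum over the M values of c. The Python sketch below shows one way to evaluate such a quantity in log space, given per-subset log-likelihoods that have already been computed for each value of c. The function names and the numbers are hypothetical, and the sketch assumes all class weights are strictly positive.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def random_effects_log_likelihood(subset_log_likelihoods, weights):
    """Log-likelihood of all subsets under a random effects model.

    subset_log_likelihoods[k][j] is the log-likelihood of subset k computed
    with the j-th value of c applied to every site in that subset;
    weights[j] is the probability of the j-th value (must be > 0).
    """
    total = 0.0
    for row in subset_log_likelihoods:
        terms = [math.log(w) + ll for w, ll in zip(weights, row)]
        total += log_sum_exp(terms)  # log of the mixture for this subset
    return total


# Two subsets (N = 2), two values of c (M = 2), illustrative numbers only.
weights = [0.5, 0.5]
subset_log_likelihoods = [
    [math.log(0.2), math.log(0.4)],
    [math.log(0.1), math.log(0.3)],
]
log_l = random_effects_log_likelihood(subset_log_likelihoods, weights)
```

Working in log space matters in practice, since real subset likelihoods are far too small to multiply directly without underflow.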
Models: MG94 or GY94 codon models,
with either 3 or 9 frequency parameters, can be selected for the analysis.
Output:
A summary table is output to the screen and a detailed report is spooled to a file
chosen by the user.
Result Processing Tools:
None are really applicable.