  HyPhy Documentation: Standard Analyses: MFPositiveSelection.bf

     Summary: Given a codon data file, a tree and a codon substitution model, PositiveSelection performs a series of analyses, along the lines of the Yang, Goldman 2000 paper, to identify sites in the data that are under selective pressure.

     MF means "Multiple Files". This analysis applies the positive selection approach to a collection of small datasets, none of which contains enough data on its own to infer sites under selective pressure; by sharing parameters across the datasets, better estimates can be obtained. The idea for the Random Effects portion of this analysis is due to Simon Frost.

     Input:

  1. Number of sequences file: the file with the number of sequences per data subset. Each line in this file should contain a single number: the number of sequences in the corresponding subset.
  2. Data file: a codon data file in any recognized format, containing all the sequences. Any of the predefined genetic code translation tables can be used to interpret the data. The sequences will be partitioned into subsets as specified in the number of sequences file.
  3. Trees file: the file with the trees for the data subsets. Each line should contain a single tree in Newick format for the corresponding subset.
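
    As an illustration of how the three input files relate, here is a minimal Python sketch (not part of HyPhy). It assumes that sequences are assigned to subsets consecutively, in the order they appear in the data file; the page does not state the ordering, so treat that as an assumption. The function names parse_counts and partition_sequences are hypothetical.

```python
def parse_counts(text):
    """Read one integer per non-empty line (the 'number of sequences' file)."""
    return [int(line) for line in text.splitlines() if line.strip()]


def partition_sequences(sequences, counts):
    """Split the sequences into consecutive subsets of the given sizes.

    Assumption: subset k takes the next counts[k] sequences in file order.
    """
    if sum(counts) != len(sequences):
        raise ValueError("counts do not add up to the number of sequences")
    subsets, start = [], 0
    for size in counts:
        subsets.append(sequences[start:start + size])
        start += size
    return subsets


counts = parse_counts("3\n2\n")
subsets = partition_sequences(["s1", "s2", "s3", "s4", "s5"], counts)
```

    With this layout, the trees file would need one Newick line per entry in counts, and the tree on line k would need as many leaves as subset k has sequences.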

    The user will be prompted to specify the extent of parameter sharing:
  • All: dN/dS ratio distribution parameters, base frequencies and transversion/transition ratios (if applicable) are shared across all subsets.
  • dN/dS only: only the dN/dS ratio distribution parameters are shared across all subsets.
    The user will also be prompted for the Bayesian cut-off level for a site to be considered under selective pressure (a number between 0 and 1), and for the number of rate classes to use when discretizing continuous distributions.
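
    The role of the cut-off can be sketched in Python as follows. This is an illustration of the usual empirical Bayes idea, not HyPhy's own code: for each site, the posterior probability of every rate class is computed from the class weights and the site's conditional likelihoods, and the site is reported when the posterior mass on classes with dN/dS > 1 reaches the cut-off. The function name selected_sites, its arguments, and the use of ">=" rather than ">" are assumptions made for the example.

```python
def selected_sites(site_class_liks, weights, rates, cutoff):
    """Return (site index, posterior) for sites with P(dN/dS > 1 | data) >= cutoff.

    site_class_liks[s][m] is the likelihood of site s given rate class m,
    weights[m] is the prior weight of class m, rates[m] its dN/dS value.
    """
    flagged = []
    for index, liks in enumerate(site_class_liks):
        joint = [w * lik for w, lik in zip(weights, liks)]
        total = sum(joint)
        # Posterior mass on the classes whose rate exceeds 1.
        posterior = sum(j for j, r in zip(joint, rates) if r > 1.0) / total
        if posterior >= cutoff:
            flagged.append((index, posterior))
    return flagged


flagged = selected_sites(
    site_class_liks=[[0.01, 0.02, 0.5], [0.4, 0.1, 0.01]],
    weights=[0.5, 0.3, 0.2],
    rates=[0.0, 1.0, 3.0],
    cutoff=0.9,
)
```

    In this made-up input only the first site is flagged, because almost all of its posterior mass falls on the class with rate 3.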

    dN/dS variability can be modeled by any of the following 16 distributions.

  1. Single Rate: no rate variation.
  2. Neutral: rates are 0 or 1 with mixing parameter P.
  3. Selection: rates are 0 or 1 or W (estimated) with mixing parameters P1 and P2.
  4. Discrete: rates are R or R*M1 or R*M2 with mixing parameters P1 and P2.
  5. Freqs: rates are 0, 1/3, 2/3, 1, 3 with mixing parameters P1, P2, P3 and P4.
  6. Gamma: rates are sampled (by conditional mean) from a two parameter gamma distribution.
  7. 2 Gamma: rates are sampled (by conditional mean) from a mixture of a two parameter gamma and a mean 1 gamma.
  8. Beta: rates are sampled (by conditional mean) from a two parameter beta distribution (thus the rates are all in [0,1]).
  9. Beta+w: rates are sampled (by conditional mean) from a mixture of a two parameter beta distribution and the class with rate W.
  10. Beta & (Gamma+1): rates are sampled (by conditional mean) from a mixture of a two parameter beta distribution and a two parameter gamma distribution shifted to [1,Infinity).
  11. Beta & (Normal>1): rates are sampled (by conditional mean) from a mixture of a two parameter beta distribution and a two parameter normal distribution restricted to [1,Infinity).
  12. 0 & 2 (Normal>1): rates are sampled (by conditional mean) from a mixture of the zero rate class, a two parameter normal and a mean 1 normal (restricted to [0,Infinity)).
  13. 3 Normal: rates are sampled (by conditional mean) from a mixture of a mean 0 normal, a mean 1 normal, and a two parameter normal, all restricted to [0,Infinity).
  14. RE: Lognormal: rates are sampled (by conditional mean) from a lognormal distribution, with variance parameter sigma.

    This and the following distributions make up the (R)andom (E)ffects model, which is based on the following idea. Let c be a random dN/dS variable. In the usual setting, c varies across sites, but for this analysis we let c vary across subsets. Thus, the likelihood of the complete data set is:

      L = \prod_{k=1}^{N} \sum_{m=1}^{M} \Pr(c = c_m) \, L_k(c_m)

    where N is the number of subsets, M is the number of values c can take on, Pr(c = c_m) is the probability of the m-th value, and L_k(c_m), the conditional likelihood of subset k, is computed using the same value c_m of c for all sites.

  15. RE:Gamma: rates are sampled (by conditional mean) from a two parameter gamma distribution.
  16. RE:Discrete: rates are R or R*M1 or R*M2 with mixing parameters P1 and P2.
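
    The Random Effects likelihood described under item 14 (a product over subsets of a weighted sum over the values of c) can be sketched in Python as below. This is an illustration under assumptions, not HyPhy's implementation: it takes the per-subset log-likelihoods for each value of c as already computed, and it works in log space to avoid underflow. The names random_effects_log_likelihood, subset_log_liks and weights are hypothetical.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    if peak == -math.inf:
        return -math.inf
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def random_effects_log_likelihood(subset_log_liks, weights):
    """Log-likelihood of all subsets when c varies across subsets.

    subset_log_liks[k][m] is the log-likelihood of subset k computed with
    the m-th value of c applied to every site; weights[m] is the
    probability of that value.
    """
    total = 0.0
    for per_class in subset_log_liks:
        terms = [math.log(w) + ll for w, ll in zip(weights, per_class) if w > 0]
        total += log_sum_exp(terms)
    return total


log_lik = random_effects_log_likelihood(
    subset_log_liks=[
        [math.log(0.2), math.log(0.4)],
        [math.log(0.1), math.log(0.3)],
    ],
    weights=[0.5, 0.5],
)
```

    The toy likelihoods here are far larger than real ones; with real data each entry would be a large negative log-likelihood, which is why the sum is taken in log space.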

    The user may choose any combination of these distributions (or all 16) to run.
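
    Several of the distributions above are described as sampled "by conditional mean". One common way to do this, sketched here in Python for the gamma case, is to cut the distribution into equally probable classes and represent each class by the mean of the distribution within it. Whether HyPhy uses equal-probability classes is an assumption of this sketch, and the helper names are hypothetical. Only the standard library is used, so the incomplete gamma function is evaluated by its power series and quantiles are found by bisection.

```python
import math


def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma P(a, x), by power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-16 * abs(total) and n < 10000:
        term *= x / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))


def gamma_quantile(p, shape, rate):
    """Value x with CDF(x) = p for a gamma(shape, rate), by bisection."""
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(shape, rate * hi) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(shape, rate * mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def discretize_gamma(shape, rate, classes):
    """Mean rate within each of `classes` equally probable gamma classes."""
    cuts = [0.0] + [gamma_quantile(i / classes, shape, rate)
                    for i in range(1, classes)]
    # E[X; X < q] = (shape / rate) * P(shape + 1, rate * q)
    upper = [reg_lower_gamma(shape + 1.0, rate * q) for q in cuts] + [1.0]
    mean = shape / rate
    return [classes * mean * (upper[i + 1] - upper[i]) for i in range(classes)]


rates = discretize_gamma(shape=0.5, rate=0.5, classes=4)
```

    Because each class has probability 1/classes, the plain average of the returned rates equals the mean of the gamma distribution (1 in this example, where shape equals rate).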

    Models: MG94 or GY94 codon models, with either 3 or 9 frequency parameters, can be selected for the analysis.

    Output: A summary table is printed to the screen and a detailed report is spooled to a file chosen by the user.

     Result Processing Tools: None are applicable to this analysis.

 
Sergei L. Kosakovsky Pond and Spencer V. Muse, 1997-2002