This walkthrough provides instructions for implementing the RERconverge package to identify genes whose evolutionary rates shift in association with change in a trait. For information on how to download and install RERconverge, see the wiki. Source code and a quick start guide are available on github.

Overview

The following document describes the steps necessary to perform a standard RERconverge analysis to identify genomic elements with convergent rates of evolution in phenotypically convergent species using a binary trait or a continuous trait.

Output is a the list of genomic elements with statistics that represent the strength and directon of the relationship between the genomic element's evolutionary rate and the phenotype. These statistics can be used to make inferences about the genomic elements' importances to the phenotype. Genomic elements that evolve more slowly for a given phenotype may be under increased evolutionary constraint because their function is important for the development of the convergent phenotype, for example. On the other hand, genomic elements that evolve more quickly for a given phenotype may either be under decreased evolutionary constraint due to loss of function or relatively decreased functional importance or, conversely, undergoing directional selection to increase or alter functionality. The ranked gene list can be further used as input for functional enrichment methodologies to find pathways or other functional groups under convergent evolutionary pressures.

Data Input Requirements and Formatting

The analysis requires two sources of data as input:

  1. Phylogenetic trees for every genomic element with branch lengths that represent element-specific evolutionary rates
    • Trees should be in Newick format with tip labels and no node labels
    • Tree topologies must all be subsets of the same master tree topology
  2. Species-labeled phenotype values
    • Species labels must match tree tip labels
    • For continuous traits:
      • a named numeric vector of trait values
    • For binary traits:
      • a vector of foreground species names OR
      • a Newick tree with branch lengths 0 for background branches and values between 0 and 1 for foreground branches (for a basic analysis, set all foreground branches to 1) OR
      • the user can specify foreground branches (which will be set to 1) using an interactive tool (see below)

When choosing a dataset to work with, consider the availability and accuracy of both genomic and phenotypic data, and be sure to select a valid convergent phenotype that is observed in multiple independent clades in your phylogeny, and at high and low levels for continuous traits.

For a more detailed description of data formatting requirements and examples, please see the relevant sections of the walkthrough.

Analysis Walkthrough

Installing and loading RERconverge

Note: Prior to running the vignette, be sure to follow all the steps for installation on the wiki, up to the "Install from Github" step.

In R, load the RERConverge library.

if (!require("RERconverge", character.only=T, quietly=T)) {
    require(devtools)
    install_github("nclark-lab/RERconverge", ref="master") 
    #"ref" can be modified to specify a particular branch
}
library(RERconverge)

This should also download all the files we will be working with to your computer, in the directory where your R library lives. If you'd like to visualize or work with any of these files separately, this is where you can find them:

rerpath = find.package('RERconverge') #If this errors, there is an issue with installation
print(rerpath)
## [1] "/Users/nathan/Library/R/3.5/library/RERconverge"

Reading in gene trees with readTrees

To run RERconverge, you will first need to supply a file containing gene trees for all genes to be included in your analysis. This is a tab delimited file with the following information on each line:

Gene_name Newick_tree

An example file is provided in extdata/subsetMammalGeneTrees.txt, which you can view in any text editor.

Now in R, read in the gene trees. The readTrees function takes quite a while to read in trees for all genes, so we will limit ourselves to the first 200 using max.read (this will still take a minute or so, so be patient):

toytreefile = "subsetMammalGeneTrees.txt" 
toyTrees=readTrees(paste(rerpath,"/extdata/",toytreefile,sep=""), max.read = 200)
## Read 500 items
## max is 62
## Rotating trees
## estimating master tree branch lengths from 32 genes
## Naming columns of paths matrix

First, the code tells us that there are 500 items, or gene trees, in the file. Since we have set max.read = 200, it will only read the first 200 of these. Then it says that the maximum number of tips in the gene trees is 62 and, later, it reports that it will use the 32 genes in this set that have data for all 62 species to estimate a master tree. The master tree will be used for subsequent analyses.

RERconverge is intended to be used on genome-scale datasets, containing a large number of gene trees with data present for all species. It thus has a minimum number of such gene trees required for readTrees to use to estimate a master tree; this is set with the minTreesAll option and is 20 by default. If your dataset is smaller, you may adjust this or supply your own master tree using the option masterTree (this should be a phylo object generated using ape's read.tree); however, we recommend interpreting results with caution in this case.

Estimating relative evolutionary rates (RER) with getAllResiduals

The next step is to estimate relative evolutionary rates, or RERs, for all branches in the tree for each gene. Intuitively, a gene's RER for a given branch represents how quickly or slowly the gene is evolving on that branch relative to its overall rate of evolution throughout the tree.

Briefly, RERs are calculated by normalizing branch lengths across all trees by the master branch lengths. Branch lengths are then corrected for the heteroskedastic relationship between average branch length and variance using weighted regression. For a more detailed description of how RERs are computed, see (Chikina, Robinson, and Clark 2016) and (Partha et al. 2017).

We will use the getAllResiduals function to calculate RERs. This uses the following input variables (all the options set here are also the defaults):

  • useSpecies: a vector that can be used to specify a subset of species to use in the analysis. Here we will use the species in our AdultWeightLog vector that will be used for continuous trait analysis. Note that these are also the same species used for binary trait analysis. These species should be a subset of species included in toyTrees$masterTree$tip.label.
  • transform: the method used to transform the raw data. By transforming the raw data, we reduce the heteroscedasticity (relationship between mean and variance) and the influence of outliers. Here we will use a square-root transform ("sqrt"), which has performed the best at reducing heteroskedasticity in our datasets. Also available are "none" (no transformation) and "log" (natural logarithm transformation).
  • weighted: whether to use a weighted regression to estimate RER. Weighting allows further correction for the relationship between mean and variance, which can be directly estimated from the data.
  • scale: whether to scale the individual branches of the gene trees to account for variance across trees. This scales the variance, though not the mean, of each branch length, using the R function scale.

The useSpecies input variable can be provided to most RERconverge functions. Excluding one or more species from this vector will exclude them from the analyses.

Here is the basic method, with the recommended settings:

data("logAdultWeightcm") 
mamRERw = getAllResiduals(toyTrees,useSpecies=names(logAdultWeightcm), 
                           transform = "sqrt", weighted = T, scale = T)

The first output of this function tells you that the cutoff is set to 3.2e-05. Any branches shorter than this will be excluded from the analysis. It then prints out i= 1 ... 200, showing the progress as it calculates relative evolutionary rates for sets of gene trees.

The plots generated by this function show the heteroscedasticity in the original data (on the left) and the data after transformation and weighted regression (on the right). The x-axis displays bins of branch lengths on the tree, and the y-axis is the (log-scaled) variance in these branch lengths across trees. As you can see by comparing the right plot to the left plot, transforming and performing a weighted regression reduces the relationship between the mean branch length (x-axis) and the variance in branch length (y-axis). You can alter values for transform, scale, and weighted to attempt to optimize heteroskedasticity correction.

If you wish to save this RER object for later, you can use R's saveRDS function. This will allow you to load it later with readRDS, using a different name, if you wish.

saveRDS(mamRERw, file="mamRERw.rds") 
newmamRERw = readRDS("mamRERw.rds")

Now that we have RERs, we can visualize these for any given gene using the plotRers function. Here is an example.

#make average and gene tree plots
noneutherians <- c("Platypus","Wallaby","Tasmanian_devil","Opossum")
par(mfrow=c(1,2))
avgtree=plotTreeHighlightBranches(toyTrees$masterTree, outgroup=noneutherians,
                                  hlspecies=c("Vole","Squirrel"), hlcols=c("blue","red"), 
                                  main="Average tree") #plot average tree
bend3tree=plotTreeHighlightBranches(toyTrees$trees$BEND3, outgroup=noneutherians, 
                                    hlspecies=c("Vole","Squirrel"), hlcols=c("blue","red"), 
                                    main="BEND3 tree") #plot individual gene tree