What is protein phylogeny?
A phylogeny is the evolutionary history of a group of organisms, and phylogenetics is the branch of biology that studies it. This is done by analyzing the similarities and differences between genetic or protein sequences and mapping the results onto a phylogenetic tree. Phylogenetic trees show how organisms are related through evolution, and they can be generated and analyzed in numerous ways.
How to construct a phylogenetic tree
The first step in building a phylogenetic tree is to obtain the genetic or protein sequences of each organism you wish to study. In this case, protein sequences from 13 species' homologues of the human version of C4-A, called "complement C4-A isoform 1 preproprotein", were obtained using Ensembl and HomoloGene, then saved in a plain text file. Next, the sequences were aligned using Clustal Omega and visualized. Sequence alignment is necessary to determine how conserved each protein sequence is with respect to the others, which a phylogenetic tree then shows more simply than the raw data alone. The entire sequence alignment, color-coded after alignment using Jalview, is shown below.
[Image: multiple_sequence_alignment.png]
By zooming in to specific areas of the aligned protein sequence, the similarities and differences between each species' sequence can be observed. The colored columns show the extent of protein sequence conservation across the different species. Conservation and quality scores are also provided, as well as a consensus protein sequence. This can be seen in Figure 2 below.
In addition to color-coding the aligned sequence data, Jalview can also convert this data into phylogenetic trees. Each tree combines a method for scoring the similarity between sequences (the BLOSUM matrix or percent identity) with a method for building the tree from those scores (average distance or neighbor joining). These methods are explained below.
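To make the idea of a consensus sequence concrete, here is a minimal sketch that takes the most common residue in each alignment column. It is a simple majority vote on a made-up toy alignment, not Jalview's own conservation or quality calculation; the function name `consensus` and the choice to ignore gaps when voting are assumptions made for illustration.

```python
from collections import Counter


def consensus(alignment):
    """Return (consensus string, fraction of sequences agreeing per column).

    `alignment` is a list of equal-length aligned sequences; '-' marks a gap.
    Gaps are ignored when voting (an assumption, not Jalview's exact rule).
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must all have the same length")

    residues = []
    fractions = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        if not counts:
            # Column made up entirely of gaps
            residues.append("-")
            fractions.append(0.0)
            continue
        residue, count = counts.most_common(1)[0]
        residues.append(residue)
        fractions.append(count / len(column))
    return "".join(residues), fractions


# Toy alignment (invented data, not the C4-A sequences)
toy_alignment = ["MKVL", "MRVL", "MKIL"]
consensus_seq, support = consensus(toy_alignment)
print(consensus_seq)  # MKVL
```

Columns where `support` is 1.0 are fully conserved in the toy data, while lower values mark positions where the sequences disagree.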
BLOSUM Matrix
The BLOSUM matrix is a scoring system used to compare sequences after alignment. The amino acids at each aligned position are compared and assigned a score based on how similar they are and how likely it is that the pairing occurred by chance. The sum of all the scores reflects the overall similarity between the sequences. For complement C4-A isoform 1 preproprotein and its homologous sequences, the BLOSUM62 matrix was used.
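The position-by-position scoring described above can be sketched as follows. The table holds only a small excerpt of BLOSUM62 (just the residue pairs needed for the toy peptides), and the peptides are invented. Skipping gap columns is an assumption made for simplicity; it is not necessarily how Jalview treats gaps.

```python
# Small excerpt of BLOSUM62: only the pairs needed for the toy example below.
BLOSUM62_EXCERPT = {
    ("I", "I"): 4, ("L", "L"): 4, ("D", "D"): 6, ("K", "K"): 5,
    ("I", "V"): 3, ("D", "E"): 2, ("K", "R"): 2,
}


def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]  # KeyError if the pair is not in the excerpt


def alignment_score(seq_a, seq_b):
    """Sum the substitution scores over all aligned, non-gap columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # assumption: gap columns are skipped
        total += pair_score(a, b)
    return total


identical = alignment_score("ILDK", "ILDK")  # 4 + 4 + 6 + 5
similar = alignment_score("ILDK", "VLER")    # 3 + 4 + 2 + 2
print(identical, similar)  # 19 11
```

The second pair shares only one identical residue with the first, yet it still scores well because every substitution swaps in a chemically similar amino acid.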
Percent Identity
This method also compares sequences after alignment. In contrast to the BLOSUM matrix, percent identity simply compares how identical the aligned sequences are without accounting for substitutions in which an amino acid is replaced with a similar amino acid.
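A minimal sketch of percent identity for two aligned sequences is shown below. The handling of gaps (columns where both sequences have a gap are skipped, and a gap opposite a residue counts as a mismatch) is an assumption, since tools differ on this point. The sequences are invented for illustration.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of compared alignment columns holding identical residues."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" and b == "-":
            continue  # assumption: columns that are gaps in both are ignored
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Toy example: 3 of the 6 columns match exactly
pid = percent_identity("MKV-LI", "MRVALL")
print(pid)  # 50.0
```

Note that the K/R and I/L columns count as plain mismatches here, even though a BLOSUM matrix would give them positive scores.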
Average Distance
The average distance method assumes that each organism's proteins diverged from a common ancestor, then uses the BLOSUM matrix or percent identity scores to determine how closely related the species are. Because the method assumes that all lineages have evolved at the same rate since the common ancestor, every organism is placed at the same total distance from the root of the tree.
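The clustering behind this method can be sketched with UPGMA, the usual algorithm behind "average distance" trees: repeatedly merge the two closest clusters and average their distances to everything else. The distances below are invented for illustration (they are not derived from the C4-A alignment), and the nested-tuple tree format is an arbitrary choice for this sketch.

```python
from itertools import combinations


def upgma(names, dist):
    """Build an average-distance (UPGMA) tree.

    `dist` maps frozenset({a, b}) -> distance between labels a and b.
    Returns nested tuples (left, right, height); leaves are plain labels.
    """
    # Each cluster key maps to (subtree, number of leaves)
    clusters = {name: (name, 1) for name in names}
    d = dict(dist)
    while len(clusters) > 1:
        # Find the closest pair of clusters
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        tree_a, n_a = clusters.pop(a)
        tree_b, n_b = clusters.pop(b)
        # Both members sit at the same height below the new node
        height = d[frozenset((a, b))] / 2.0
        new = (a, b)
        # Size-weighted average distance from the merged cluster to the rest
        for c in clusters:
            d[frozenset((new, c))] = (
                n_a * d[frozenset((a, c))] + n_b * d[frozenset((b, c))]
            ) / (n_a + n_b)
        clusters[new] = ((tree_a, tree_b, height), n_a + n_b)
    return next(iter(clusters.values()))[0]


# Invented pairwise distances for four species
species = ["human", "chimp", "macaque", "fly"]
toy_dist = {
    frozenset(("human", "chimp")): 2.0,
    frozenset(("human", "macaque")): 6.0,
    frozenset(("chimp", "macaque")): 6.0,
    frozenset(("human", "fly")): 20.0,
    frozenset(("chimp", "fly")): 20.0,
    frozenset(("macaque", "fly")): 20.0,
}
tree = upgma(species, toy_dist)
print(tree)
```

In the toy result the fly joins last, at a height of 10.0, so it sits on its own branch at the root and every leaf lies the same distance below the root.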
Neighbor Joining
This method is similar to the average distance method in that it uses the BLOSUM matrix or percent identity scores to determine relatedness between species and generates a tree, but neighbor joining does not assume that all lineages evolved at the same rate. Instead, it calculates an individual length for each branch. The longer the branch, the greater the sequence divergence between the species that diverged at the branch point.
Below are four trees, one for each combination of scoring method (BLOSUM62 or percent identity) and tree-building method (average distance or neighbor joining).
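The branch-length calculation can be illustrated with a compact sketch of the standard neighbor joining procedure. The five-taxon distance matrix is a generic textbook-style example with made-up labels, not data from the C4-A alignment. The internal node names (`node0`, `node1`, ...) are hypothetical and are assumed not to clash with any input label.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return an unrooted tree as a list of (node, neighbor, branch_length).

    `dist` maps frozenset({a, b}) -> distance between labels a and b.
    """
    nodes = list(names)
    d = dict(dist)

    def get(a, b):
        return 0.0 if a == b else d[frozenset((a, b))]

    edges = []
    counter = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Total distance from each node to all the others
        total = {a: sum(get(a, c) for c in nodes) for a in nodes}
        # Join the pair that minimizes the Q criterion
        a, b = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * get(*p) - total[p[0]] - total[p[1]],
        )
        d_ab = get(a, b)
        # Each joined node gets its own branch length
        len_a = 0.5 * d_ab + (total[a] - total[b]) / (2 * (n - 2))
        len_b = d_ab - len_a
        new = f"node{counter}"  # hypothetical internal node label
        counter += 1
        edges.append((a, new, len_a))
        edges.append((b, new, len_b))
        # Distances from the new internal node to the remaining nodes
        rest = [c for c in nodes if c not in (a, b)]
        for c in rest:
            d[frozenset((new, c))] = 0.5 * (get(a, c) + get(b, c) - d_ab)
        nodes = rest + [new]
    a, b = nodes
    edges.append((a, b, get(a, b)))
    return edges


# Illustrative five-taxon distance matrix
taxa = ["a", "b", "c", "d", "e"]
rows = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
toy_dist = {frozenset(pair): float(value) for pair, value in rows.items()}
edges = neighbor_joining(taxa, toy_dist)
for edge in edges:
    print(edge)
```

Unlike the average distance method, sister taxa here receive different branch lengths: `a` and `b` are joined first with lengths 2.0 and 3.0.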
Analysis
The four trees show many similarities as well as some differences. In both trees generated using the average distance method, the fruit fly is the outgroup, while in the neighbor joining trees the fruit fly C4-A homologue is shown to be more closely related to those of chickens, frogs, and zebrafish. In all four trees, the human sequence clustered near the other primates (chimpanzee, gorilla, and rhesus macaque). Many other methods for generating phylogenetic trees exist in addition to these four.
References
Phylogenetic tree image: https://www.fossiel.net/information/article.php?id=87&/Evolutie