
          In the present study, we first compiled sequence data for three proteins, the envelope glycoprotein, the hemagglutinin protein, and glycoprotein NB, from different strains of Ebola virus, Influenza virus A (strain A, H1N1), and Influenza virus B (strain B), respectively. For each protein, we computed the amino acid frequencies of every sequence (one per strain). We then performed hierarchical clustering on these frequencies, which displayed various levels of genetic relationship among the individual strains.
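As a rough illustration of this pipeline, the sketch below turns each sequence into a 20-element amino acid frequency vector and then merges strains step by step. The Euclidean distance, the average-linkage rule, the function names, and the toy sequences are all assumptions made for illustration; the study does not state which distance or linkage was used, so this should not be read as the exact procedure behind Figures 1 to 3.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_frequencies(sequence):
    """Return the relative frequency of each standard amino acid."""
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage(vectors):
    """Naive agglomerative clustering with average linkage (assumed rule).

    `vectors` maps a unique strain label to its frequency vector.
    Returns the merges in order as (members_a, members_b, distance).
    """
    clusters = [(label,) for label in vectors]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            total = sum(euclidean(vectors[x], vectors[y]) for x in a for y in b)
            d = total / (len(a) * len(b))
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(a + b)
        merges.append((a, b, d))
    return merges


# Toy sequences, made up for illustration only (not real viral data).
toy = {
    "strain_A": "MKTIIALSYIFCLVFA",
    "strain_B": "MKTIIALSYIFCLVLA",
    "strain_C": "GGGSSSPPPTTTNNNQ",
}
vectors = {name: aa_frequencies(seq) for name, seq in toy.items()}
merges = average_linkage(vectors)
for left, right, distance in merges:
    print(left, "+", right, "at distance", round(distance, 3))
```

The earliest merges correspond to the lowest, most exclusive levels of a dendrogram, and later merges to the higher levels. For real data sets, an established library routine would normally replace this quadratic-per-step loop.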

          The hierarchical clustering revealed both a location-dependent and a time-dependent pattern. Strains isolated in similar parts of the world clustered more closely with one another than with strains isolated in physically distant places. In particular, among the Ebola envelope glycoprotein sequences, the African strains clustered more closely with each other than with strains found outside of Africa (Figure 1). The Reston strains, all originating in the Philippines, likewise clustered tightly within their own subtype. A similar pattern appeared in the dendrograms for both the hemagglutinin protein and glycoprotein NB. Figure 2 shows the hemagglutinin strains separated into three broadly distinct clusters: top, middle, and bottom. The top cluster, which includes 9 different strains, consists mostly of strains isolated in North American nations, with the exception of the UK (Wilson-Smith-93) and Malaysia (Malaya-54). The middle cluster consists mostly of strains isolated in nations in the Eastern Hemisphere, namely China, the former Soviet Union/Russia, India, and New Zealand, with one exception each from the US (Memphis-96), Chile (83), and Brazil (78). The bottom cluster, like the top cluster, consists mostly of strains isolated in North American countries, with the exception of Australia (80) and the Netherlands (85). The strains from Mongolia and Taiwan sit at the same distance from the middle and bottom clusters on the dendrogram and were therefore not assigned to either. Finally, we analyzed the hierarchical clustering of glycoprotein NB from different strains of Influenza B (strain B). With one exception, USA: Memphis-89, a single cluster contained only strains isolated in nations in the Eastern Hemisphere (Figure 3). In addition, 3 strains of Influenza B from the USA (Lee, Maryland, and Oregon), although clustered separately at the lower (more exclusive) levels of the dendrogram, are connected at its highest levels. This implies a genetic relationship among these three strains that does not extend to the Eastern Hemisphere strains or to USA: Memphis-89. The geographical pattern evident in the Ebola and Influenza A strains therefore persists in the Influenza B strains.

          Time also played a major role in organizing the clusters at the lowest, most exclusive levels of our dendrograms. In the dendrogram for Ebola’s glycoprotein, all of the most exclusive pair clusters (clusters directly connecting two single strains) contained strains isolated within 3 years of each other. The same holds for a majority of the most exclusive pair clusters in the dendrograms for Influenza A’s hemagglutinin protein (6 out of 10 pair clusters) and Influenza B’s glycoprotein NB (2 out of 3 pair clusters). Widening this window to 6 years would capture all but one of the most exclusive pair clusters across the three dendrograms shown. Similarity in time of isolation is thus a major factor in how the hierarchical clustering organizes strains of a single viral protein, reflecting the evolutionary relationship that different strains of a single virus hold with one another through time.
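The tally described above can be expressed in a few lines. The sketch below counts how many pair clusters join two strains whose isolation years fall within a chosen window; the function name and the (label, year) pairs are hypothetical placeholders, not the strains shown in Figures 1 to 3.

```python
def pairs_within_gap(pairs, max_gap_years):
    """Count pair clusters whose two isolation years differ by at most max_gap_years.

    `pairs` is a list of ((label, year), (label, year)) tuples.
    Returns (number of pairs inside the window, total number of pairs).
    """
    hits = [p for p in pairs if abs(p[0][1] - p[1][1]) <= max_gap_years]
    return len(hits), len(pairs)


# Hypothetical pair clusters, for illustration only.
toy_pairs = [
    (("strain_A", 1976), ("strain_B", 1977)),  # gap of 1 year
    (("strain_C", 1989), ("strain_D", 1994)),  # gap of 5 years
    (("strain_E", 1954), ("strain_F", 1980)),  # gap of 26 years
]

print(pairs_within_gap(toy_pairs, 3))  # (1, 3)
print(pairs_within_gap(toy_pairs, 6))  # (2, 3)
```

Running the same count at several window sizes shows how sensitive the time-dependence claim is to the chosen threshold.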

          The time-dependence and location-dependence of the hierarchical clustering point to an underlying force driving the observed trends. To better understand this force, we must consider the importance of the viral proteins analyzed. For Ebola, Influenza A, and Influenza B to infect a host cell, the viral proteins on the surface of the virion must interact with a specific host protein to establish attachment. This attachment leads to the eventual entry of the virus into the host cell. The mechanism carries both an advantage and a disadvantage for the virus. On one hand, the virus exploits a naturally expressed host protein to infect the cell; on the other hand, it depends on the host cell presenting that specific protein in order to infect at all. This host-virus protein interface is the site of an ongoing battle for survival between host and virus, referred to as the evolutionary arms race (1). The evolutionary arms race is the concept of co-evolving genes in competing organisms, in which constant evolution is vital to survival (4). Moreover, the various strains of each virus introduced into the population indicate that the viruses are evolving rapidly in order to remain within the population. The continuous emergence of different viral strains of Ebola, Influenza A, and Influenza B reflects another important concept similar to the evolutionary arms race, the Red Queen hypothesis. The Red Queen hypothesis takes its name from the notable quote, “...it takes all the running you can do, to keep in the same place,” in Lewis Carroll’s Through the Looking Glass (5). In terms of evolutionary biology, the Red Queen hypothesis reflects how organisms must constantly adapt or evolve just to survive (6). The continuous emergence of different viral strains of Ebola, Influenza A, and Influenza B through the 20th century and into the 21st century indicates the rapid evolution of these viruses, which is consistent with both the evolutionary arms race and the Red Queen hypothesis. The element common to these two concepts is competition, or pressure from the competitor, which drives the evolution of the virus. The competitor is not the only driver, however; time and environmental conditions also play a role in the evolution of the virus. The temporal and geographical trends observed in the hierarchical clustering support the evolutionary arms race because these trends are observed in the landscape of the amino acid sequence. The clusters are therefore a reflection of the genetic evolution occurring in the viruses.

          Just as geologists study layers of sedimentary rock to reconstruct Earth’s structural history, hierarchical clustering uses the amino acid differences in each viral entry protein to reconstruct the virus’s evolutionary history. When geologists study layers of sedimentary rock, they know the location from which the rock was taken, and they know that deeper layers correspond to earlier time periods. Within each layer, they can analyze its composition and infer Earth’s conditions during that time period. In an analogous manner, the amino acid sequences provide a history of the protein at a certain time and location. What is missing for connecting the progression of the virus through time and space, however, is the boundary between “layers” at which these changes occur. Geologists can clearly see progression because of the order of the layers (i.e. deeper layers are earlier time periods, while more recent formations pile on top, with the top layers being the youngest). From the strains of viruses from different locations and times alone, the relationship these strains have with one another is unclear. The hierarchical cluster analysis provides the “layers,” or connections, needed to link the amino acid changes between strains. This analysis displays the time-dependence and location-dependence of the clusters and, ultimately, the evolution of the viral protein. In a sense, hierarchical clustering lets us see a time-lapse of a shifting protein landscape.

          With these ideas in mind, a possible next step is to investigate the historical forces that drove the evolution of these viral proteins from one cluster to another, with space and time as overarching factors. Elucidating and eliminating particular causes may help to limit a virus’s progression into newer forms, each of which would introduce a new cluster to the existing dendrogram. From a bioinformatics standpoint, a logical next step would be to verify that the differences we see between the clusters are statistically significant, that is, that each cluster does indicate a distinct “group”. From there, we could move on to a larger analysis that includes not just a single protein but tens, hundreds, or even thousands of proteins at a time, to determine what types of clusters emerge. Connecting the mountain of information we have in the present with the incidents of the past will empower us to learn more about the future and, with luck, perhaps even change its course.
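One possible way to approach the significance question is a label permutation test, sketched below. It asks whether strains sharing an external label, such as region of isolation, sit closer together in frequency space than would be expected if the labels were assigned at random. The function names, the mean within-group distance statistic, and the toy points are assumptions for illustration. Applying the test to cluster labels derived from the same distances would be circular, so this is only one of several checks that could be used.

```python
import random
from itertools import combinations
from math import dist


def mean_within_distance(points, labels):
    """Average distance over all pairs of points that share a label.

    Assumes at least one label occurs more than once.
    """
    ds = [dist(points[i], points[j])
          for i, j in combinations(range(len(points)), 2)
          if labels[i] == labels[j]]
    return sum(ds) / len(ds)


def permutation_p_value(points, labels, n_perm=2000, seed=0):
    """Fraction of random relabelings that group the points at least as tightly."""
    rng = random.Random(seed)
    observed = mean_within_distance(points, labels)
    shuffled = list(labels)
    as_tight = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mean_within_distance(points, shuffled) <= observed:
            as_tight += 1
    # Add-one correction keeps the estimate away from an exact zero.
    return (as_tight + 1) / (n_perm + 1)


# Toy 2-D points standing in for frequency vectors (illustration only).
toy_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
              (1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.1, 1.1)]
toy_labels = ["region_1"] * 4 + ["region_2"] * 4

p_value = permutation_p_value(toy_points, toy_labels)
print("permutation p-value:", round(p_value, 3))
```

A small p-value suggests that the labels are associated with distance in frequency space; with only a handful of strains per group, the number of distinct relabelings limits how small the p-value can get.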

Discussion 
