Saturday, February 7, 2015

Web Scraping is fun

OK, this post has nothing to do with my research. Wait, it probably has something to do with R, which I used for part of my data analysis. I had a lot of fun programming in R, so I did this small project. Check this out:

Another new post about influenza virus genome mutation simulation is coming soon...

Tuesday, November 19, 2013

Refreshed post: virus replication, dynamic modeling and within-host competition

My motivation for this post came from a great course I took, Ecology of Infectious Diseases. (I am also trying not to be a sluggard!) The theme of the course is: imagine you are a microparasite (pathogen) that is willing to take every chance to survive. This led me to some interesting thoughts on a paper by B. R. Levin et al. In the section titled "Within-Host Population Dynamics of Pathogen Proliferation", the authors write:

If the course of a microparasite infection in a vertebrate host were described without jargon, the process would be readily recognized as one of population dynamics and evolution. 

This perspective on pathogenesis and the immune response as ecological, population-dynamical, and evolutionary processes has been well recognized for some time (11). However, it has had little impact on contemporary research on the mechanisms of pathogenesis. Much of this research is qualitative rather than quantitative, and it can be described as a quest to characterize (genetically, biochemically, and physiologically) the interaction between infectious pathogens and the host's immune defenses. Although this research provides an indispensable basis for understanding pathogenesis and the host's response to infection, it tells only a part of the story. A complete account of the course of an infectious disease must include a quantitative description of the major forces that determine the abundance, diversity, and distribution of a pathogen population within an infected host and the immune defenses involved in its control.   

It is fascinating that the authors point out the same thing that occurred to me when I was learning human population genetics. Why wouldn't we study viruses from a population biology point of view?

For decades virologists have characterized the functions of a tremendous number of viral proteins and their interactions with host factors, reporting that "XXX protein interacts with XXX host factor, which is increased/decreased during infection". These facts are important, but they are only "a part of the story". That is why I trust the authors here: we often lose the global view in biomedical research.

Viruses could be a great subject for population genetics if we think of each viral genome as the haplotype of an individual. Some alleles in the viral genome confer "susceptibility" to the host, that is, a loss of pathogenicity or infectiousness. This sounds very familiar: it parallels the case where an allele in a human subpopulation is found to cause susceptibility to a pathogen.

Another way to think about viral population genetics is by analogy with mosaicism in an organism. Viruses with slightly different genomes function together as a single parasite, and the minute mutations carried by individual viral particles have a huge impact at the population level. This sounds like the quasispecies theory proposed a long time ago.

Friday, July 12, 2013

Latest version of BLAST 2.2.28: makeblastdb, sequence ID, and some other tips

This is what I did a few weeks ago when I used the BLAST package to analyze my Illumina 2x250 MiSeq data. The data set is a large collection of short reads, about 300 bp per read. Basically, all the sequences in the FASTA file look like this:


I also had a small collection of reference sequences at hand, which are annotated, published data (about 22 annotated genes in total). The goal was to assign every read an official gene name, based on its similarity to the reference sequences. The first step is to build a local database with makeblastdb:

makeblastdb -dbtype nucl -in database.fa -input_type fasta -out allreads

This builds a database containing all my reads. The funny thing is that after you build the database from a FASTA file, it is hard to retrieve the sequences again by the accession numbers assigned by BLAST. Peter Cock mentions the same problem in his blog. That's all because of the SUPER COMPLICATED NOMENCLATURE SYSTEM OF THE NCBI DATABASE...

This shows the original database:

x-10-22-15-188:cassieblast Sean$ blastdbcmd -db IGLV1 -entry all -outfmt "%f"

But when I tried to list the accession numbers/OID/GI in the database, it failed:

x-10-22-15-188:cassieblast Sean$ blastdbcmd -db IGLV1 -entry all -outfmt "OID: %o GI: %g ACC: %a IDENTIFIER: %i"
OID: 0 GI: N/A ACC: No ID available IDENTIFIER: No ID available
OID: 1 GI: N/A ACC: No ID available IDENTIFIER: No ID available
OID: 2 GI: N/A ACC: No ID available IDENTIFIER: No ID available
OID: 3 GI: N/A ACC: No ID available IDENTIFIER: No ID available

So I gave up on extracting the sequences with the BLAST commands. Instead, I used the reverse approach: I made all the reads the query and all the reference sequences the database, so that blastn matches each read to its best hit among the reference sequences, and the name of that hit is exactly the official name I want! However, the BLAST commands still have some shortcomings, so I wound up using FileMaker Pro to process the BLAST output, and life became much easier.
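The reverse approach can also be post-processed with a few lines of Python instead of a spreadsheet-style tool. The sketch below is only an illustration, not what I actually ran: it assumes the search was written in BLAST's tabular format (-outfmt 6, with its default twelve columns) and keeps the highest-bitscore reference for each read. The file names in the comments and the read and gene names in the example rows are made up.

```python
import csv
import io

# Assumed workflow (file names are hypothetical):
#   makeblastdb -dbtype nucl -in references.fa -out refs
#   blastn -query reads.fa -db refs -outfmt 6 -out hits.tsv

# Default column order of `-outfmt 6` tabular output.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle):
    """Return {read_id: (reference_name, bitscore)}, keeping the top-scoring hit per read."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        rec = dict(zip(COLUMNS, row))
        score = float(rec["bitscore"])
        read_id = rec["qseqid"]
        if read_id not in best or score > best[read_id][1]:
            best[read_id] = (rec["sseqid"], score)
    return best


# Made-up example rows standing in for a real hits.tsv file.
example = (
    "read_1\tgeneA\t99.3\t300\t2\t0\t1\t300\t1\t300\t1e-150\t540\n"
    "read_1\tgeneB\t91.0\t300\t27\t0\t1\t300\t1\t300\t1e-110\t405\n"
    "read_2\tgeneB\t100.0\t298\t0\t0\t1\t298\t3\t300\t1e-155\t551\n"
)
assignments = best_hits(io.StringIO(example))
for read_id, (gene, score) in sorted(assignments.items()):
    print(read_id, gene, score)
```

For a real run, the same function would take an open file handle on the tabular output in place of the in-memory example. Ties and weak hits are not handled here; a minimum identity or e-value cutoff would be a sensible addition.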

Sunday, April 7, 2013

Why is there genome instability? – Discovery of sleeping beauty

Transposable elements (TEs) are a large family of DNA elements widely present in the genomes of organisms. These elements do not all share exactly the same sequence, but they consist of components with a similar biological function: inserting into the genome and causing target site duplications. There are two classes of TEs in eukaryotic cells: retrotransposons and DNA transposons.

The first transposon, discovered by Barbara McClintock in the 1940s (work that earned her the Nobel Prize in 1983), is a naughty gene jumping around in the genome of maize. She found that insertions and deletions in the genome were related to changes in the color of corn kernels. This transposon belongs to the class II TEs, which are described as a DNA cut & paste system (or ctrl+x & ctrl+v system, geeky joke). Class II transposons are composed of an inverted repeat (IR) at each end and a transposase gene in the middle; the presence of a functional transposase determines whether the transposon is autonomous. This structure enables class II transposons to insert into the genome, changing the size of the genome and leading to genome instability.
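To make the "inverted repeat at each end" structure concrete, here is a small Python sketch. It is a toy illustration under simplifying assumptions: the repeats are taken to be perfect, the sequence is invented, and the function names are my own. Real terminal inverted repeats are often imperfect, so a real search would allow mismatches.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def has_terminal_inverted_repeats(element, ir_length):
    """True if the last ir_length bases are the reverse complement of the first ir_length bases."""
    element = element.upper()
    if ir_length <= 0 or 2 * ir_length > len(element):
        return False
    left = element[:ir_length]
    right = element[-ir_length:]
    return right == reverse_complement(left)


# Invented toy element: left IR + internal region + right IR.
toy_element = "CAGGGATG" + "AAACCCGGGTTTAC" + "CATCCCTG"
# Same ends repeated in the same orientation (a direct repeat, not an inverted one).
direct_repeat = "CAGGGATG" + "AAAC" + "CAGGGATG"

print(has_terminal_inverted_repeats(toy_element, 8))
print(has_terminal_inverted_repeats(direct_repeat, 8))
```

The second call is there to show the difference between an inverted repeat and a direct repeat: only the first sequence reads the same from either end once you switch strands.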

Class I transposons (retrotransposons) resemble retroviral genomes (I have talked about retrotransposons and viral oncology in a previous post). Interestingly, this structure is not limited to retroviruses: yeast cells carry transposons that resemble retroviruses. The Ty element includes TyA, which encodes a gag-like protein, and TyB, which is homologous to reverse transcriptase (RTase). The structure of the Ty element is shown as follows.

Indication of evolution

Comparing these transposons across different branches of species reveals the role of transposons in organism development and evolution. Because the unstable nature of TEs leads to mutagenesis, changes in genome size and eventually deleterious mutations, most biologists regard TEs as “selfish DNA parasites”. Selection is therefore “giving a harsh time” to the TEs, and many organisms have their own mechanisms to inhibit TE activity. In the human genome, most transposable elements were inactivated by mutation a long time ago. Very few transposons still keep their function as troublemakers. For example, the Alu element is the most common transposon in humans and has been associated with inherited diseases and cancer. The figure below shows the karyotype of a human female (XX). The green color labels hybridized Alu elements, which are widely distributed across the chromosomes.

The discovery of TEs also sheds light on the speciation of organisms at the molecular level. Some scientists claim that these TEs have a common ancestor and that transposons help exchange genetic information through horizontal gene transfer. Other researchers, however, believe that these TEs emerged independently multiple times. Still, apart from our understanding of how retroviral genomes integrate into host genomes, we are unable to tell how a transposon from one ancestral species was introduced into another species.

The wake up of sleeping beauty

Sleeping Beauty takes its name from its inactivated state. In 1997, however, Ivics et al. successfully woke up the sleeping beauty, a Tc1-like transposon found in salmonid fish. The researchers reconstructed an active transposon from the inactive fish element by mutagenesis. They mapped several inactivating mutations in the ORF of Sleeping Beauty, replaced them with functional amino acid residues, and finally woke up the beauty. The following shows how they performed this elegant experiment.

Ivics, Z., Hackett, P. B., Plasterk, R. H., & Izsvák, Z. (1997). Molecular Reconstruction of Sleeping Beauty, a Tc1-like Transposon from Fish, and Its Transposition in Human Cells. Cell, 91(4), 501-510.

Saturday, February 23, 2013

Notes for Genetics and Genomics Vol.1

I have been so stressed out these days because I am officially trapped in one of those “Kick My Ass” genetics and genomics courses. It is the first time I have realized that graduate school is going to be tough, thanks to a course titled “ADVANCED”. Anyway, I can make it fun by learning a couple of decent genetic findings, as well as many terms that I’ve never heard before. So I made up my mind to summarize what I have learned about genetics in this new post.

Dosage Compensation
It refers to a mechanism that balances the expression of X-linked genes between males and females. Dosage compensation varies between organisms, but it is common among animals with sex chromosomes. In humans and other mammals, males (XY) express X-linked genes at the normal level, while in females (XX) one of the two X chromosomes is inactivated. This was found by S. Ohno in 1959, who showed that the two X chromosomes of a female mammalian cell differ: one behaves like an autosome and the other is condensed and heterochromatic. In Drosophila, by contrast, the female sets the gene dosage, and males double the expression of genes on their single X chromosome. In the hermaphrodite of C. elegans, both X chromosomes are partially repressed. The following picture shows Xist RNA coating only one X chromosome, which marks that chromosome as the inactivated one.
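The dosage arithmetic behind these three strategies can be made concrete with a small Python sketch. The numbers are idealized assumptions for illustration only (one fully silenced X in female mammals, an exact twofold increase in Drosophila males, exact halving in C. elegans hermaphrodites), and the function and dictionary names are my own.

```python
# For each sex: (number of X copies, silenced copies, expression factor per active copy).
# All values are idealized assumptions, not measurements.
STRATEGIES = {
    "mammal": {"female": (2, 1, 1.0), "male": (1, 0, 1.0)},
    "drosophila": {"female": (2, 0, 1.0), "male": (1, 0, 2.0)},
    "c_elegans": {"hermaphrodite": (2, 0, 0.5), "male": (1, 0, 1.0)},
}


def x_linked_output(x_copies, silenced_copies, per_copy_factor):
    """Relative X-linked expression: active copies times output per active copy."""
    return (x_copies - silenced_copies) * per_copy_factor


def outputs_by_sex(strategy):
    """Map each sex to its relative X-linked expression under one strategy."""
    return {sex: x_linked_output(*params) for sex, params in strategy.items()}


for name, strategy in STRATEGIES.items():
    print(name, outputs_by_sex(strategy))
```

In each case the two sexes end up with the same total output, which is the whole point of dosage compensation. What differs is the route: silencing a copy, doubling the single copy, or halving both copies.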

How is the Y chromosome maintained in human evolution?
The Y chromosome is the most unique, funky and charismatic guy in the human male karyotype. It stands on its own, without a homolog to pair with. It is scared of the X chromosome, which may try hard to invade Y by homologous recombination. Y has to make its own way through spermatogenesis. During meiosis, palindrome structures in the male-specific region of Y (MSY) are maintained by gene conversion. These palindromes play a big role in protecting the Y chromosome from exchanging sequence with the X chromosome and from the subsequent loss of function.

Lange et al. investigated an unexpected consequence of maintaining palindromes in the MSY and its role in the genetic etiology of Turner syndrome. Their model is based on the isodicentric Y chromosome (idic Y) generated when recombination between sister chromatids of Y is resolved by the crossover pathway. The idic Y would then be lost during spermatogenesis in the father’s germline cells, giving a daughter with a 45,X karyotype, who is diagnosed with Turner syndrome.

This picture shows how idic Y is produced in their model of Turner syndrome

Ohno, S. (1969). Evolution of sex chromosomes in mammals. Annual Review of Genetics, 3(1), 495. doi:10.1146/
Lange, J. (2009). Isodicentric Y chromosomes and sex disorders as byproducts of homologous recombination that maintains palindromes. Cell, 138(5), 855. doi:10.1016/j.cell.2009.07.042