Another new post about influenza virus genome mutation simulation is coming soon...
Influenzer Notes
Full-time virus and half-time human being. Xi (Cassie) Guo's personal blog.
Saturday, February 7, 2015
Web Scraping is fun
OK, this post has nothing to do with my research. Wait, it probably does have something to do with R, which I used for part of my data analysis. I had a lot of fun programming in R, so I did this small project. Check it out:
Tuesday, November 19, 2013
Refreshed post: virus replication, dynamic modeling and within-host competition
My motivation for this post
came from a great course I took, Ecology of Infectious Diseases. (In addition, I
am trying not to be a sluggard!) The theme of the course is: imagine you are
a microparasite (pathogen) that is willing to take every chance to survive.
This led me to some interesting thoughts on a paper by B. R. Levin et al.
In the section titled "Within-Host Population Dynamics of Pathogen
Proliferation", the authors wrote:
If the course of a microparasite infection in a vertebrate host
were described without jargon, the process would be readily recognized as one
of population dynamics and evolution.
...
This
perspective on pathogenesis and the immune response as ecological,
population-dynamical, and evolutionary processes has been well recognized for
some time (11). However, it has had little impact on contemporary research on
the mechanisms of pathogenesis. Much of this
research is qualitative rather than quantitative, and it can be
described as a quest to characterize (genetically, biochemically, and
physiologically) the interaction between infectious pathogens and the host's
immune defenses. Although this research provides an indispensable basis for
understanding pathogenesis and the host's response to infection, it tells only
a part of the story. A complete account of the
course of an infectious disease must include a quantitative description of the
major forces that determine the abundance, diversity, and distribution of a
pathogen population within an infected host and the immune defenses involved in
its control.
It is fascinating that the
authors pointed out the same thing that occurred to me when I studied human
population genetics. Why wouldn't we study viruses from a population biology
point of view?
For decades virologists have
characterized the functions of a tremendous number of viral proteins and their
interactions with host factors, by saying that "XXX protein interacts with XXX
host factors, which is increased/decreased during infection". These facts are
important but only "parts of the story". For these reasons I would trust the
authors, as we often lose the global view in biomedical research.
Viruses could be a great subject
to study from a population genetics point of view, if we think of the genome of
every virus as the haplotype of an individual. Some alleles in the viral genome
confer "susceptibility" to the host, that is, loss of
pathogenicity/infectiousness. This sounds very familiar: it is like finding an
allele in humans that causes susceptibility to a pathogen in a human
subpopulation.
Another way to think about
viral population genetics is to compare it to the mosaicism of an organism.
Viruses with slightly different genomes function together as one parasite. The
minute mutations in each viral particle have a huge impact at the population
level. This sounds like the quasispecies theory proposed a long time ago.
Friday, July 12, 2013
Latest version of BLAST 2.2.28: makeblastdb, sequence ID, and some other tips
This is what I did a few weeks ago when I used the BLAST package (http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download) to analyze my Illumina 2x250 MiSeq data. It is a big collection of short reads, about 300 bp per read. Basically, all the sequences in the FASTA file look like this:
>1101:13541:2221
CTCCTGCTTCCAGGATCCTGGCCCCAGGCTGTGCTGACTCAGCCGCCCTCTGAGTCTGGGTCCCTGGGCCAGAGGGTCAC
CCTCTCCTGCACTGGGAGCAGCAGCAACATCGGGGGTGGTAACAGTGTGAACTGGTCCCAGCCGCTCCCAGGAAAGGTCC
CCAGATCCGTATTCACTTATGCCAATCTCATGGCTATTGCTGCCCCGGATCAGATCTCTGGCTTCAAGTCTGGTAGCTCA
GGCACCCTGACCATCACTGGGCTCCAGGCTGAGGATGACGCTGAGCATTACTGCACAGCCGGGGGTGACAGCCTCGATGG
CCCCACAGTGCCCCAGGCCAGGGGGCAAGTGAGACCAAAACC
I also had a small collection of reference sequences at hand, which are annotated and published data (about 22 annotated genes in total). The goal here is to assign every read an official gene name, based on the similarity of the read to the reference sequences. The first thing is to build a local database with makeblastdb:
makeblastdb -dbtype nucl -in database.fa -input_type fasta -out allreads
Now there is a database containing all my reads. The funny thing is that after you build the database from a FASTA file, it is hard to retrieve the sequences again by the accession numbers that BLAST assigns. Peter Cock mentioned the same problem in his blog (http://blastedbio.blogspot.com/2012/10/my-ids-not-good-enough-for-ncbi-blast.html). That's all because of the SUPER COMPLICATED NOMENCLATURE SYSTEM OF THE NCBI DATABASE...
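If the goal is only to pull one read back out by its name, one workaround is to skip the BLAST database and read the original FASTA file directly. The helper below is a minimal sketch, not part of BLAST: the function name fasta_get is made up for illustration, and it assumes the read name is the text after ">" up to the first space. (makeblastdb also has a -parse_seqids option intended to index sequence IDs; it was not tried here, so whether it accepts read names like these in version 2.2.28 is an open question.)

```shell
# fasta_get ID FILE (hypothetical helper, not a BLAST+ command):
# print the FASTA record whose header name equals ID.
# The name is taken as the text after ">" up to the first whitespace.
fasta_get() {
  awk -v id="$1" '
    /^>/ {
      split(substr($0, 2), h, " ")
      # Appending "" forces a string comparison, so "007" and "7" stay different.
      keep = (h[1] "" == id "")
    }
    keep   # print header and sequence lines while inside the wanted record
  ' "$2"
}
```

Assuming database.fa is the FASTA file that was given to makeblastdb, calling fasta_get '1101:13541:2221' database.fa should print that read's header and sequence lines.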
This shows the original database:
x-10-22-15-188:cassieblast Sean$ blastdbcmd -db IGLV1 -entry all -outfmt "%f"
>1101:13541:2221
CTCCTGCTTCCAGGATCCTGGCCCCAGGCTGTGCTGACTCAGCCGCCCTCTGAGTCTGGGTCCCTGGGCCAGAGGGTCAC
CCTCTCCTGCACTGGGAGCAGCAGCAACATCGGGGGTGGTAACAGTGTGAACTGGTCCCAGCCGCTCCCAGGAAAGGTCC
CCAGATCCGTATTCACTTATGCCAATCTCATGGCTATTGCTGCCCCGGATCAGATCTCTGGCTTCAAGTCTGGTAGCTCA
GGCACCCTGACCATCACTGGGCTCCAGGCTGAGGATGACGCTGAGCATTACTGCACAGCCGGGGGTGACAGCCTCGATGG
CCCCACAGTGCCCCAGGCCAGGGGGCAAGTGAGACCAAAACC
>1101:13653:2384
CTCCTGCTTCCAGGATCCTGGGCCCAGGCTGTGCTGACTCAGCCGCCCTCTGAGTCTGGGTCCCTGGGCCAGAGGGTCAC
CCTCTCCTGCACTGGGAGCAGCAGCAACATCGGGGGTGGTAACAGTGTGAACTGGTCCCAGCCGCTCCCAGGAAAGGTCC
CCAGATCCGCATTCACTTATGCCAATCTCATGGCTATTGCTGCCCCGGATCAGATCTCTGGCTTCAAGTCTGGCAGCTCA
GGCACCCTGACCATCACGGGGCTCCAGGCTGAGGATGACGCTGAGTATTACTGCACAGCCGGGGGTGACAGCCTCGATGG
CCCCACAGTGCCCCAGGCCAGGGGGCAAGTGAGACCAAAACC
But when I tried to look up the accession numbers/OID/GI in the database, I failed:
x-10-22-15-188:cassieblast Sean$ blastdbcmd -db IGLV1 -entry all -outfmt "OID: %o GI: %g ACC: %a IDENTIFIER: %i"
OID: 0 GI: N/A ACC: No ID available IDENTIFIER: No ID available
OID: 1 GI: N/A ACC: No ID available IDENTIFIER: No ID available
OID: 2 GI: N/A ACC: No ID available IDENTIFIER: No ID available
OID: 3 GI: N/A ACC: No ID available IDENTIFIER: No ID available
So I gave up on extracting the sequences with the BLAST commands. Instead, I used a reverse approach: I made all the reads the query and all the reference sequences the database. blastn then matches every read to its best hit among the reference sequences, and the name of that hit is exactly the official gene name I want! However, the BLAST commands still have some shortcomings, so I wound up using FileMaker Pro to process the BLAST output, and life became much easier.
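The reverse approach could be sketched along these lines. This is a sketch under assumptions, not the exact commands used for this post: the file names refs.fa, reads.fa and hits.tsv and the helper best_hit_per_read are made up for illustration, and the awk step simply stands in for the post-processing that was done in FileMaker Pro. It relies on the tabular output format (-outfmt 6), where column 1 is the query (read) name, column 2 the subject (reference) name, column 3 the percent identity and column 12 the bit score.

```shell
# Reverse approach: the reference genes become the database, the reads the query.
# refs.fa, reads.fa and hits.tsv are hypothetical file names; the BLAST step is
# skipped when BLAST+ or the input files are not available.
if command -v blastn >/dev/null 2>&1 && [ -f refs.fa ] && [ -f reads.fa ]; then
  makeblastdb -dbtype nucl -in refs.fa -input_type fasta -out refs
  # -outfmt 6 writes one tab-separated line per hit.
  blastn -query reads.fa -db refs -outfmt 6 -out hits.tsv
fi

# best_hit_per_read FILE (hypothetical helper): for each read (column 1) keep
# the reference (column 2) with the highest bit score (column 12), and print
# read name, reference name and percent identity, sorted by read name.
best_hit_per_read() {
  awk -F '\t' '
    !($1 in score) || $12 + 0 > score[$1] {
      score[$1] = $12 + 0; ref[$1] = $2; pident[$1] = $3
    }
    END { for (r in ref) print r "\t" ref[r] "\t" pident[r] }
  ' "$1" | sort
}
```

Running best_hit_per_read hits.tsv would then give one line per read with the reference gene it matched best. Picking the hit by bit score is a design choice made here; filtering on percent identity or alignment length first may be more appropriate for closely related genes.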
Sunday, April 7, 2013
Why there is genome instability? – Discovery of sleeping beauty
Transposable elements (TEs) are a large family of DNA
elements widely present in the genomes of organisms. These elements may not
share exactly the same sequence, but they have a similar biological function:
they insert themselves into the genome and cause target site duplications.
There are two classes of TEs in eukaryotic cells: retrotransposons and
DNA transposons.
The first transposon, discovered by Barbara McClintock in the 1940s (she won
the Nobel Prize for it in 1983), is a naughty gene jumping around in the maize
genome. She found that there were deletions and insertions in the genome that
are related to changes in the color of corn kernels. This transposon belongs to
the class II TEs, which are called the DNA cut & paste system (or the ctrl+x &
ctrl+v system, geeky joke). Class II transposons are composed of two inverted
repeats (IR), one at each end, and a transposase gene in the middle, which
determines the autonomy of the transposon. This structure enables class II
transposons to insert into the genome, changing the size of the genome and
leading to genome instability.
Class I transposons (retrotransposons) are related to retroviral genomes (I
have talked about retrotransposons and viral oncology in a previous post). It is
interesting to know that elements with this structure are found not only in
retroviruses: yeast cells also carry transposons that resemble retroviruses.
The Ty element includes TyA, a gag-like protein, and TyB, which is homologous
to reverse transcriptase (RTase). The structure of the Ty element is
shown as follows.
Indication of evolution
Comparing transposons from different branches of the tree of life
reveals the role of transposons in organism development and evolution.
Because the unstable nature of TEs leads to mutagenesis, changes in genome size
and eventually deleterious mutations, most biologists regard TEs as
“selfish DNA parasites”. Therefore, selection is “giving a harsh time” to
the TEs. Many organisms have special mechanisms to inhibit TE activity.
In the human genome, most transposable elements were inactivated by
mutation a long time ago. Very few transposons still keep their function of
being a troublemaker. For example, the Alu element is the most common
transposon in humans and has been associated with inherited diseases and
cancer. The figure below shows the karyotype of a human female
(XX). The green color labels hybridized Alu elements, which are widely
distributed across the chromosomes.
The discovery of TEs also shed light on the speciation of
organisms at the molecular level. Some scientists claim that these TEs have a
common ancestor, and that transposons help exchange genetic information through
horizontal gene transfer. However, other researchers believe that these TEs
emerged independently multiple times. Still, apart from our understanding
of how retroviral genomes integrate into the host genome, we are unable to tell
how a transposon from one ancestral species was introduced into another
species.
The awakening of Sleeping Beauty
The Sleeping Beauty transposon is named after its long inactive state.
In 1997, Ivics et al. successfully woke up Sleeping Beauty, a Tc1-like
transposon from salmonid fish. The researchers used a powerful approach to
reconstruct an active transposon from the inactive fish copies by mutagenesis.
They mapped several inactivating mutations in the ORF of Sleeping Beauty,
replaced them with functional amino acid residues, and finally woke up the
beauty. The following shows how they performed this elegant experiment.
Reference
Ivics, Z., Hackett, P. B., Plasterk, R. H., & Izsvák, Z.
(1997). Molecular Reconstruction of Sleeping Beauty, a
Tc1-like Transposon from Fish, and Its Transposition in Human
Cells. Cell, 91(4), 501-510.
http://en.wikipedia.org/wiki/Transposable_element
Saturday, February 23, 2013
Notes for Genetics and Genomics Vol.1
I have been so stressed out
these days because I am officially trapped in one of those “Kick My Ass”
genetics and genomics courses. It is the first time I have realized that
graduate school is going to be tough, because the course is titled “ADVANCED”.
Anyway, I can make it fun by learning a couple of decent genetic findings, as
well as many terms that I’ve never heard before. So I made up my mind to
summarize the things I learned about genetics in this new post.
Dosage Compensation
It refers to the
mechanism that balances the expression of X-linked genes between
males and females. Dosage compensation varies between organisms, but it is
found in many eukaryotes. In humans and other mammals, males (XY) express a
normal level of X-linked genes, while in females (XX) one of the X chromosomes
is inactivated. This was found by Ohno S in 1959, by showing two different
X chromosomes in mammalian cells: one looks like an autosome and the other is
condensed and heterochromatic. By contrast, in Drosophila it is the female
who sets the gene dosage, so the males double the expression of genes on
their single X chromosome. As for the hermaphrodite C. elegans, both X
chromosomes are partially repressed. The following picture shows that Xist RNA
coats only one X chromosome, which marks that chromosome as the inactivated one.
How is the Y-chromosome
maintained in human evolution?
The Y-chromosome is the most
unique, funky and charismatic guy in the human male karyotype. It stands on its
own, without a homolog to pair with. It is threatened by the X-chromosome, which
may try hard to invade Y by homologous recombination. Y has to make its own way
through spermatogenesis. During meiosis, palindrome structures in the
male-specific region of Y (MSY) are maintained by gene conversion. These
palindromes play a big role in protecting the Y-chromosome from exchanging
sequence with the X-chromosome and from the subsequent loss of function.
Lange et al. investigated
an unexpected consequence of maintaining palindromes in the MSY and the genetic
etiology of Turner syndrome. Their model is based on the isodicentric Y
chromosome (idic Y), which is generated when sister chromatids of Y recombine
through the crossover pathway. The idic Y would then be lost during
spermatogenesis in the father’s germline cells, giving a daughter with a 45,X
karyotype. The daughter is diagnosed with Turner syndrome.
This picture shows how idic Y is produced in their
model of Turner syndrome
References
Ohno, S. (1969).
Evolution of sex chromosomes in mammals. Annual Review of Genetics, 3(1),
495. doi:10.1146/annurev.ge.03.120169.002431
Lange, J., et al. (2009). Isodicentric Y chromosomes and
sex disorders as byproducts of homologous recombination that maintains
palindromes. Cell, 138(5), 855. doi:10.1016/j.cell.2009.07.042