Friday, July 12, 2013

Latest version of BLAST 2.2.28: makeblastdb, sequence ID, and some other tips

This is what I did for a few weeks ago when I used BLAST package (http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download) to analyze my Illumina 2x250 MiSeq data. It is a whole bunch of short reads collection, about 300bp per reads. Basically, all the sequences in FASTA look like this:

>1101:13541:2221
CTCCTGCTTCCAGGATCCTGGCCCCAGGCTGTGCTGACTCAGCCGCCCTCTGAGTCTGGGTCCCTGGGCCAGAGGGTCAC
CCTCTCCTGCACTGGGAGCAGCAGCAACATCGGGGGTGGTAACAGTGTGAACTGGTCCCAGCCGCTCCCAGGAAAGGTCC
CCAGATCCGTATTCACTTATGCCAATCTCATGGCTATTGCTGCCCCGGATCAGATCTCTGGCTTCAAGTCTGGTAGCTCA
GGCACCCTGACCATCACTGGGCTCCAGGCTGAGGATGACGCTGAGCATTACTGCACAGCCGGGGGTGACAGCCTCGATGG
CCCCACAGTGCCCCAGGCCAGGGGGCAAGTGAGACCAAAACC


I also had a small collection of reference sequences at hand, which are annotated and published data (about 22 annoated genes in total). The goal here is to assign all the reads an official gene name, based on the similarity of the reads compared with the reference sequences. The first thing is to build a local database by makeblastdb:

makeblastdb -dbtype nucl -in database.fa -input_type fasta -out allreads


and then there will be a database containing all my reads built in. The funny thing is after you build the database in fasta format, it is hard to retrieve the sequences again by its accession number assigned by the blast--------same problem mentioned by Peter Cock in his blog (http://blastedbio.blogspot.com/2012/10/my-ids-not-good-enough-for-ncbi-blast.html). Thats all because the SUPER COMPLICATED NOMENCLATURE SYSTEM OF NCBI DATABASE...

this shows the original database:

x-10-22-15-188:cassieblast Sean$ blastdbcmd -db IGLV1 -entry all -outfmt "%f"
>1101:13541:2221
CTCCTGCTTCCAGGATCCTGGCCCCAGGCTGTGCTGACTCAGCCGCCCTCTGAGTCTGGGTCCCTGGGCCAGAGGGTCAC
CCTCTCCTGCACTGGGAGCAGCAGCAACATCGGGGGTGGTAACAGTGTGAACTGGTCCCAGCCGCTCCCAGGAAAGGTCC
CCAGATCCGTATTCACTTATGCCAATCTCATGGCTATTGCTGCCCCGGATCAGATCTCTGGCTTCAAGTCTGGTAGCTCA
GGCACCCTGACCATCACTGGGCTCCAGGCTGAGGATGACGCTGAGCATTACTGCACAGCCGGGGGTGACAGCCTCGATGG
CCCCACAGTGCCCCAGGCCAGGGGGCAAGTGAGACCAAAACC
>1101:13653:2384
CTCCTGCTTCCAGGATCCTGGGCCCAGGCTGTGCTGACTCAGCCGCCCTCTGAGTCTGGGTCCCTGGGCCAGAGGGTCAC
CCTCTCCTGCACTGGGAGCAGCAGCAACATCGGGGGTGGTAACAGTGTGAACTGGTCCCAGCCGCTCCCAGGAAAGGTCC
CCAGATCCGCATTCACTTATGCCAATCTCATGGCTATTGCTGCCCCGGATCAGATCTCTGGCTTCAAGTCTGGCAGCTCA
GGCACCCTGACCATCACGGGGCTCCAGGCTGAGGATGACGCTGAGTATTACTGCACAGCCGGGGGTGACAGCCTCGATGG
CCCCACAGTGCCCCAGGCCAGGGGGCAAGTGAGACCAAAACC

And then if I want to find the accession numbers/OID/GI in the database, I failed:

x-10-22-15-188:cassieblast Sean$ blastdbcmd -db IGLV1 -entry all -outfmt "OID: %o GI: %g ACC: %a IDENTIFIER: %i"
OID: 0 GI: N/A ACC: No ID available IDENTIFIER: No ID available
OID: 1 GI: N/A ACC: No ID available IDENTIFIER: No ID available
OID: 2 GI: N/A ACC: No ID available IDENTIFIER: No ID available
OID: 3 GI: N/A ACC: No ID available IDENTIFIER: No ID available

so I give up on extracting the sequence by all the command in blast. Instead, I used a reverse approach, making all the reads as query and all the reference sequences as database, and blastn will automatically match all the reads with the perfect hit in the reference sequences, and then the hits is exactly the official name that I want! However, there are still some defects in the BLAST command, I winded up with using Filemaker pro to process the output of BLAST, and life is becoming much easier.

No comments:

Post a Comment