SmartBLAST processes your protein query to present a concise summary of the five best protein matches from well-studied reference species in the landmark database (described below). If possible, the matches will be from different organisms. If SmartBLAST cannot find five matches in the landmark database, it will use matches from the protein non-redundant (nr) database. SmartBLAST produces these results using a combination of an optimized BLASTP search, a new implementation of BLAST meant to find closely related matches, and a multiple alignment. Additionally, SmartBLAST presents Conserved Domain Database matches to your query. Additional matches to the nr database are presented lower in the report.
SmartBLAST is under active development and may change with little or no notice.
SmartBLAST accepts either a FASTA sequence or a protein accession/GI as input. SmartBLAST attempts to identify any FASTA input as an exact match to a known sequence. If it can identify a match, it will use the identifier for that sequence in the search. This allows SmartBLAST to display any available information (such as taxonomy) about the query. SmartBLAST accepts only one query at a time and only supports interactive searches through a web browser.
SmartBLAST uses a combination of BLAST searches and a multiple sequence alignment to produce its results. First, it searches your query against the landmark database (described below) with BLASTP and simultaneously searches the non-redundant (nr) protein database with an optimized version of BLAST targeted to closely related sequences. Second, SmartBLAST performs a multiple sequence alignment on six sequences (the query and five subject sequences) using the COBALT multiple sequence alignment program. The multiple sequence alignment and BLASTP serve different but complementary roles in this procedure. BLASTP identifies subject sequences similar to the query. It only calculates pairwise similarities between the query and individual subject sequences. The multiple alignment compares all six sequences to each other and produces an optimal alignment between all sequences. A multiple alignment is ideal for presentations, like a phylogenetic tree, that show the relationship among a set of sequences.
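The selection rule stated above (five matches, landmark first, different organisms where possible, nr as a fallback) can be illustrated with a minimal sketch. This is not SmartBLAST's actual implementation: the `Hit` record, its fields, and the function `pick_summary_hits` are hypothetical names introduced only for illustration, and ranking by a single numeric score is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one BLAST match (not a SmartBLAST data type)."""
    accession: str
    organism: str
    score: float
    source: str  # "landmark" or "nr"


def pick_summary_hits(landmark_hits, nr_hits, wanted=5):
    """Pick up to `wanted` hits: landmark first, distinct organisms preferred."""
    chosen = []
    seen_accessions = set()
    seen_organisms = set()

    def take(hits, distinct_organisms):
        # Walk the hits from best to worst score and keep those that qualify.
        for hit in sorted(hits, key=lambda h: h.score, reverse=True):
            if len(chosen) >= wanted:
                return
            if hit.accession in seen_accessions:
                continue
            if distinct_organisms and hit.organism in seen_organisms:
                continue
            chosen.append(hit)
            seen_accessions.add(hit.accession)
            seen_organisms.add(hit.organism)

    take(landmark_hits, True)    # best landmark hit from each organism
    take(landmark_hits, False)   # remaining landmark hits, if slots are left
    take(nr_hits, True)          # fall back to nr, new organisms first
    take(nr_hits, False)         # finally any remaining nr hits
    return chosen
```

In this sketch the nr hits are consulted only after the landmark hits are exhausted, which mirrors the fallback described in the text. How the real service balances the two searches is not specified here.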
The top ("Summary") portion of the SmartBLAST report presents your query and five matching sequences in a unique view combining a phylogenetic tree and a graphical overview. Figure 1 presents an example. On the left is the phylogenetic tree, and each leaf node shows the organism of the sequence represented. If a common name is available, then that is used. In the middle is a short description of the protein. On the right is a graphical overview. The matches are color-coded: matches from the landmark database are green, matches from the non-redundant protein database are blue, and your query is yellow. Holding your mouse over a sequence description or a graphical overview bar brings up a panel with information about that sequence. Figure 2 presents an example.
SmartBLAST uses the multiple sequence alignment for the phylogenetic tree and the graphical overview. The graphical overview represents the multiple alignment but also presents information about the BLAST search results. The presentation grays out a range of a subject sequence to indicate that BLAST did not align the query there. It represents a deletion according to the multiple alignment by leaving a range on a graphical overview bar white. Figure 1 presents an example. The query and three subject sequences have deletions (compared to the other sequences) as shown by the white portions of the bars. Four subject sequences extend further to the right than the query as shown by the gray on the right side. Neither BLASTP nor the multiple alignment aligned these regions to the query. All of the subject sequences are gray at the left end. BLASTP did not align the subject sequences to the query on this end, but the multiple alignment did.
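The gray and white conventions described above can be sketched as a per-column classification of one subject sequence. This is an illustrative sketch only: the function `overview_states`, its arguments, and the state labels are hypothetical, and it simplifies by treating every gap in the subject row, including gaps at the ends, as a deletion.

```python
def overview_states(aligned_subject, blast_start, blast_end):
    """Classify each multiple-alignment column for one subject sequence.

    aligned_subject: the subject's row in the multiple alignment, with '-'
        marking gap columns.
    blast_start, blast_end: 1-based, inclusive subject coordinates covered
        by the pairwise BLAST alignment to the query (assumed inputs).
    """
    states = []
    position = 0  # residue position within the ungapped subject sequence
    for symbol in aligned_subject:
        if symbol == "-":
            # Gap in the multiple alignment: drawn as white (a deletion).
            states.append("white")
            continue
        position += 1
        if blast_start <= position <= blast_end:
            # Residue covered by the BLAST alignment: drawn in the bar color.
            states.append("color")
        else:
            # Residue present but not aligned by BLAST: drawn gray.
            states.append("gray")
    return states
```

For example, a subject row `MK--LV` whose BLAST alignment covers residues 2 to 3 has gray ends, a colored core, and a white gap in the middle.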
Figure 1: Summary section of the SmartBLAST report. The query species for this search is highlighted in yellow. Green indicates matches from the landmark database (described below). The query for this search is NP_251514.1.
SmartBLAST also presents a concise representation of Conserved Domain matches for the query (above the graphical overview). Specifically, if both specific hits and superfamily matches are available, it will only show the specific hit. If there is a superfamily match but no specific hit, then that is shown. It does not show matches from multi-domain models. SmartBLAST presents a summary of the domains found at the top of the middle column.
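The display rule for domain matches can be written as a short filter. The sketch below is an assumption-laden illustration, not the actual SmartBLAST or CDD logic: the dictionary keys, the `hit_type` labels, and the function `domains_to_display` are hypothetical, and the rule is applied to the query as a whole, whereas the real service may apply it separately to each region of the query.

```python
def domains_to_display(hits):
    """Return the domain hits to show under the concise rule.

    hits: list of dicts with a 'name' and a 'hit_type' that is one of
        'specific', 'superfamily', or 'multi-domain' (hypothetical labels).
    """
    # Matches from multi-domain models are never shown.
    candidates = [h for h in hits if h["hit_type"] != "multi-domain"]

    # Specific hits take precedence over superfamily matches.
    specific = [h for h in candidates if h["hit_type"] == "specific"]
    if specific:
        return specific

    # Otherwise fall back to superfamily matches, if there are any.
    return [h for h in candidates if h["hit_type"] == "superfamily"]
```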
In order to present a concise report, SmartBLAST will not show any sequence more than once. If your query is identical to a sequence in the landmark or non-redundant set, then it will not be repeated as a match. The matches from the landmark database will also not be repeated as non-redundant matches.
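A minimal sketch of this no-repeats rule follows. It assumes that two records count as the same sequence when their residue strings are exactly equal; the function `remove_repeats` and the `(accession, sequence)` tuples are hypothetical and serve only to illustrate the order in which repeats are dropped.

```python
def remove_repeats(query_sequence, landmark_hits, nr_hits):
    """Drop hits that repeat the query or an earlier hit.

    landmark_hits, nr_hits: lists of (accession, sequence) tuples, in the
        order they would be reported.
    Returns the filtered landmark list and the filtered nr list.
    """
    seen = {query_sequence}  # the query itself is never repeated as a match

    def keep_new(hits):
        kept = []
        for accession, sequence in hits:
            if sequence in seen:
                continue  # already shown as the query or as an earlier hit
            seen.add(sequence)
            kept.append((accession, sequence))
        return kept

    # Landmark hits are filtered first, so an identical nr hit is the one
    # that gets dropped, as the text describes.
    kept_landmark = keep_new(landmark_hits)
    kept_nr = keep_new(nr_hits)
    return kept_landmark, kept_nr
```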
You may see your results as a multiple alignment using the "See full multiple alignment" link.
The rest of the SmartBLAST report is similar to a standard BLAST report, though there are two sections of Descriptions. The first section ("Best hits") lists the matches shown in the Summary area. The second section ("Additional BLAST Hits") lists other BLAST matches found by SmartBLAST. Alignments for all matches are available below the Descriptions.
Figure 2: Information panel for a sequence.
Hover over a description or an overview bar with your mouse to see more information about a sequence. The "Alignment" link takes you to the BLASTP alignment on the same page. The "Sequence" link takes you to the GenPept record. The "Identical proteins" link takes you to information about protein records with an identical sequence.
The landmark database includes proteomes from 27 genomes spanning a wide taxonomic range. This search set is produced using the best available genomic assemblies for each organism with the following procedure. First, the most recent representative assembly from each organism is identified. Second, all proteins annotated on each assembly are downloaded and compiled into the landmark BLAST database. The result is a taxonomically diverse non-redundant set of proteins supported by genomic assemblies.
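The two-step procedure above can be sketched as follows. The record layout (`organism`, `is_representative`, `release_date`, `proteins`) and the function `build_landmark_set` are hypothetical; how assemblies are actually designated as representative and how proteins are retrieved is not covered here.

```python
from datetime import date


def build_landmark_set(assemblies):
    """Compile proteins from the newest representative assembly per organism.

    assemblies: list of dicts with keys 'organism', 'is_representative',
        'release_date' (a datetime.date), and 'proteins' (a list).
    """
    latest = {}
    for assembly in assemblies:
        if not assembly["is_representative"]:
            continue  # step 1 considers representative assemblies only
        current = latest.get(assembly["organism"])
        if current is None or assembly["release_date"] > current["release_date"]:
            latest[assembly["organism"]] = assembly

    # Step 2: gather every annotated protein from the chosen assemblies.
    proteins = []
    for organism in sorted(latest):
        proteins.extend(latest[organism]["proteins"])
    return proteins
```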