Introduction

SmartBLAST processes your protein query to present a concise summary of the five best protein matches from well-studied reference species in the landmark database (described below). If possible, the matches will be from different organisms. If SmartBLAST cannot find five matches in the landmark database, it will use matches from the protein non-redundant (nr) database. SmartBLAST produces these results using a combination of an optimized BLASTP search, a new implementation of BLAST meant to find closely related matches, and a multiple alignment. SmartBLAST also presents Conserved Domain Database matches to your query. Additional matches to the nr database are presented lower in the report.

SmartBLAST is under active development and may change with little or no notice.

Your query

SmartBLAST accepts either a FASTA sequence or a protein accession/GI as input. SmartBLAST attempts to identify any FASTA input as an exact match to a known sequence. If it can identify a match, it will use the identifier for that sequence in the search. This allows SmartBLAST to display any available information (such as taxonomy) about the query. SmartBLAST accepts only one query at a time and only supports interactive searches through a web browser.

The alignment

SmartBLAST uses a combination of BLAST searches and a multiple sequence alignment to produce its results. First, it searches your query against the landmark database (described below) with BLASTP and simultaneously searches the non-redundant (nr) protein database with an optimized version of BLAST targeted to closely related sequences. Second, SmartBLAST performs a multiple sequence alignment on six sequences (the query and five subject sequences) using the COBALT multiple sequence alignment program. The multiple sequence alignment and BLASTP serve different but complementary roles in this procedure. BLASTP identifies subject sequences similar to the query. It only calculates pairwise similarities between the query and individual subject sequences. The multiple alignment compares all six sequences to each other and produces an optimal alignment between all sequences. A multiple alignment is ideal for presentations, like a phylogenetic tree, that show the relationship among a set of sequences.

The report

The top ("Summary") portion of the SmartBLAST report presents your query and five matching sequences in a unique view combining a phylogenetic tree and a graphical overview. Figure 1 presents an example. On the left is the phylogenetic tree; each leaf node shows the organism of the sequence it represents, using the common name if one is available. In the middle is a short description of the protein. On the right is a graphical overview. The matches are color-coded: matches from the landmark database are green, matches from the non-redundant protein database are blue, and your query is yellow. Holding your mouse over a sequence description or a graphical overview bar brings up a panel with information about that sequence. Figure 2 presents an example.

SmartBLAST uses the multiple sequence alignment for the phylogenetic tree and the graphical overview. The graphical overview represents the multiple alignment but also presents information about the BLAST search results. A gray range on a subject sequence's bar indicates that BLAST did not align the query there. A white range on a bar represents a deletion according to the multiple alignment. Figure 1 presents an example. The query and three subject sequences have deletions (compared to the other sequences), as shown by the white portions of the bars. Four subject sequences extend further to the right than the query, as shown by the gray on the right side. Neither BLASTP nor the multiple alignment aligned them to the query. All of the subject sequences are gray at the left end. BLASTP did not align the subject sequences to the query on this end, but the multiple alignment did.
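The gray and white conventions described above can be illustrated with a short Python sketch. It is not SmartBLAST's implementation, which is not documented here. The function name, the input layout (one row of the multiple alignment with "-" for gaps, plus the range of subject residues that BLAST aligned), and the label names are all assumptions made for illustration.

```python
def classify_columns(msa_row, blast_start, blast_end):
    """Label each multiple-alignment column for one subject sequence.

    msa_row: the subject's row in the multiple alignment, '-' marks a gap.
    blast_start, blast_end: 0-based, end-exclusive range of subject
        residues that BLAST aligned to the query (hypothetical input).
    Returns a list with one label per column:
        'white' -> gap in the multiple alignment (a deletion)
        'gray'  -> residue present, but outside the BLAST-aligned range
        'color' -> residue present and inside the BLAST-aligned range
    """
    labels = []
    residue_index = 0  # position within the ungapped subject sequence
    for ch in msa_row:
        if ch == "-":
            labels.append("white")
        else:
            if blast_start <= residue_index < blast_end:
                labels.append("color")
            else:
                labels.append("gray")
            residue_index += 1
    return labels


# Example call: a five-column row with one gap, BLAST covering residues 1-2.
labels = classify_columns("MK-LV", 1, 3)
```

In this sketch a bar that is gray at its left end corresponds to leading residues that sit in the multiple alignment but fall before `blast_start`.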

Figure 1: Summary section of the SmartBLAST report. The query species for this search is highlighted in yellow. Green indicates matches from the landmark database (see below). The query for this search is NP_251514.1.

SmartBLAST also presents a concise representation of Conserved Domain matches for the query (above the graphical overview). Specifically, if both specific hits and superfamily matches are available, it will only show the specific hit. If there is a superfamily match but no specific hit, then that is shown. It does not show matches from multi-domain models. SmartBLAST presents a summary of the domains found at the top of the middle column.
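The display rule for Conserved Domain matches can be sketched in Python as follows. This is an illustrative reading of the rule, not SmartBLAST's code. The record fields (`name`, `kind`, `superfamily`) are hypothetical, and the sketch assumes the rule is applied separately within each superfamily, which the text does not state.

```python
from collections import defaultdict


def select_domain_hits(hits):
    """Choose which Conserved Domain hits to display.

    hits: list of dicts with hypothetical keys
        'name'        -> label of the hit
        'kind'        -> 'specific', 'superfamily', or 'multi-domain'
        'superfamily' -> identifier of the superfamily the hit belongs to
    """
    groups = defaultdict(list)
    for hit in hits:
        if hit["kind"] == "multi-domain":
            continue  # multi-domain models are never shown
        groups[hit["superfamily"]].append(hit)

    shown = []
    for group in groups.values():
        specific = [h for h in group if h["kind"] == "specific"]
        # Prefer specific hits; fall back to the superfamily match.
        shown.extend(specific if specific else group)
    return shown


example_hits = [
    {"name": "A", "kind": "specific", "superfamily": "cl1"},
    {"name": "A_sf", "kind": "superfamily", "superfamily": "cl1"},
    {"name": "B_sf", "kind": "superfamily", "superfamily": "cl2"},
    {"name": "M", "kind": "multi-domain", "superfamily": "cl3"},
]
shown_names = [h["name"] for h in select_domain_hits(example_hits)]
```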

In order to present a concise report, SmartBLAST will not show any sequence more than once. If your query is identical to a sequence in the landmark or non-redundant set, then it will not be repeated as a match. The matches from the landmark database will also not be repeated as non-redundant matches.
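Taken together, the selection rules (five matches, landmark matches first, different organisms where possible, nr as a fallback, and no sequence shown twice) can be sketched in Python. This is a minimal sketch under stated assumptions, not SmartBLAST's actual procedure: the function name, the hit fields (`id`, `organism`), and the exact order in which the rules are applied are all hypothetical.

```python
def pick_best_hits(query_id, landmark_hits, nr_hits, wanted=5):
    """Pick up to `wanted` matches for the Summary view.

    landmark_hits, nr_hits: lists of dicts sorted best-first, each with
        hypothetical keys 'id' and 'organism'.
    query_id: identifier of the query, so an identical sequence is not
        repeated as a match.
    """
    chosen = []
    seen_ids = {query_id}
    seen_organisms = set()

    def take(hits, source, distinct_organisms):
        for hit in hits:
            if len(chosen) >= wanted:
                return
            if hit["id"] in seen_ids:
                continue  # never show a sequence more than once
            if distinct_organisms and hit["organism"] in seen_organisms:
                continue
            seen_ids.add(hit["id"])
            seen_organisms.add(hit["organism"])
            chosen.append({**hit, "source": source})

    take(landmark_hits, "landmark", True)   # one landmark hit per organism
    take(landmark_hits, "landmark", False)  # then any remaining landmark hits
    take(nr_hits, "nr", False)              # fall back to nr if still short
    return chosen


landmark = [
    {"id": "L1", "organism": "human"},
    {"id": "L2", "organism": "human"},
    {"id": "L3", "organism": "mouse"},
    {"id": "Q", "organism": "yeast"},
]
nr = [
    {"id": "L1", "organism": "human"},
    {"id": "N1", "organism": "rat"},
    {"id": "N2", "organism": "rat"},
    {"id": "N3", "organism": "dog"},
]
picked = pick_best_hits("Q", landmark, nr)
```

Note that `picked` is ordered by selection pass rather than by score; a real report would sort or lay out the matches separately.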

You may see your results as a multiple alignment using the "See full multiple alignment" link.

The rest of the SmartBLAST report is similar to a standard BLAST report, though there are two sections of Descriptions. The first section ("Best hits") lists the matches shown in the Summary area. The second section ("Additional BLAST Hits") lists other BLAST matches found by SmartBLAST. Alignments for all matches are available below the Descriptions.

Figure 2: Information panel for a sequence.

Hover over a description or an overview bar with your mouse to see more information about a sequence. The "Alignment" link takes you to the BLASTP alignment on the same page. The "Sequence" link takes you to the GenPept record. The "Identical proteins" link takes you to information about protein records with an identical sequence.

Landmark Database

The landmark database includes proteomes from 27 genomes spanning a wide taxonomic range. This search set is produced using the best available genomic assemblies for each organism with the following procedure. First, the most recent representative assembly from each organism is identified. Second, all proteins annotated on each assembly are downloaded and compiled into the landmark BLAST database. The result is a taxonomically diverse non-redundant set of proteins supported by genomic assemblies.
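The first step of this procedure, choosing the most recent representative assembly for each organism, can be sketched in Python. The record layout (`organism`, `accession`, `representative`, `released`) and the placeholder accessions are hypothetical. How NCBI actually flags and ranks assemblies is not described here, so treat this as an illustration of the idea only.

```python
from datetime import date


def latest_representative(assemblies):
    """Return {organism: assembly} keeping the newest representative one.

    assemblies: list of dicts with hypothetical keys 'organism',
        'accession', 'representative' (bool), and 'released' (date).
    """
    best = {}
    for asm in assemblies:
        if not asm["representative"]:
            continue  # only representative assemblies are considered
        current = best.get(asm["organism"])
        if current is None or asm["released"] > current["released"]:
            best[asm["organism"]] = asm
    return best


# Placeholder records, not real assembly data.
records = [
    {"organism": "org A", "accession": "ASM_A.1",
     "representative": True, "released": date(2010, 1, 1)},
    {"organism": "org A", "accession": "ASM_A.2",
     "representative": True, "released": date(2014, 6, 1)},
    {"organism": "org A", "accession": "ASM_A.3",
     "representative": False, "released": date(2015, 1, 1)},
    {"organism": "org B", "accession": "ASM_B.1",
     "representative": True, "released": date(2012, 3, 1)},
]
selected = latest_representative(records)
```

The second step, compiling all proteins annotated on each selected assembly into a BLAST database, is not shown.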

Organisms and Assemblies for the Reference set protein

#   Organism                                             Assembly          Superkingdom
1   Bacillus subtilis                                    GCF_000009045.1   Bacteria
2   Deinococcus radiodurans R1                           GCF_000008565.1   Bacteria
3   Escherichia coli str. K-12 substr. MG1655            GCF_000005845.2   Bacteria
4   Microcystis aeruginosa NIES-843                      GCF_000010625.1   Bacteria
5   Mycobacterium tuberculosis H37Rv                     GCF_000195955.2   Bacteria
6   Neisseria meningitidis                               GCF_000008805.1   Bacteria
7   Peptoclostridium difficile 630                       GCF_000009205.1   Bacteria
8   Pseudomonas aeruginosa                               GCF_000006765.1   Bacteria
9   Shewanella oneidensis MR-1                           GCF_000146165.2   Bacteria
10  Streptococcus pneumoniae R6                          GCF_000007045.1   Bacteria
11  Streptomyces coelicolor                              GCF_000203835.1   Bacteria
12  Synechocystis sp. PCC 1148                           GCF_000340785.1   Bacteria
13  Thermotoga maritima MSB8                             GCF_000008545.1   Bacteria
14  Methanothermobacter thermautotrophicus str. Delta H  GCF_000008645.1   Archaea
15  Sulfolobus acidocaldarius DSM 639                    GCF_000012285.1   Archaea
16  Arabidopsis thaliana (thale cress)                   GCF_000001735.3   Eukaryota
17  Caenorhabditis elegans                               GCF_000002985.6   Eukaryota
18  Danio rerio (zebrafish)                              GCF_000002035.5   Eukaryota
19  Dictyostelium discoideum AX4                         GCF_000004695.1   Eukaryota
20  Drosophila melanogaster (fruit fly)                  GCF_000001215.4   Eukaryota
21  Glycine max (soybean)                                GCF_000004515.3   Eukaryota
22  Homo sapiens (human)                                 GCF_000001405.30  Eukaryota
23  Leishmania donovani                                  GCF_000012285.1   Eukaryota
24  Mus musculus (house mouse)                           GCF_000001635.24  Eukaryota
25  Plasmodium falciparum 3D7                            GCF_000002765.31  Eukaryota
26  Saccharomyces cerevisiae (baker's yeast)             GCF_000146045.2   Eukaryota
27  Schizosaccharomyces pombe (fission yeast)            GCF_000002945.1   Eukaryota