BLAST Search Parameters¶
Limit by Organism¶
A BLAST search may be limited by organism. The entry field will suggest completions once a user starts typing. A checkbox will exclude rather than include the organism in the search.
Limit by Entrez Query¶
A BLAST search can be limited to the result of an Entrez query against the database chosen. This restricts the search to a subset of entries from that database fitting the requirement of the Entrez query. Terms normally accepted by Entrez nucleotide or protein searches are accepted here. Examples are given below.
- protease NOT hiv1[organism]¶
This will limit a BLAST search to all proteases, except those in HIV 1.
- 1000:2000[slen]¶
This limits the search to entries with lengths between 1000 and 2000 bases for nucleotide entries, or between 1000 and 2000 residues for protein entries.
- Mus musculus[organism] AND biomol_mrna[properties]¶
This limits the search to mouse mRNA entries in the database. For common organisms, one can also select from the pulldown menu.
- 10000:100000[mlwt]¶
This limits the search to protein sequences with a calculated molecular weight between 10 kDa and 100 kDa.
- src specimen voucher[properties]¶
This limits the search to entries that are annotated with a specimen_voucher qualifier on the source feature.
- all[filter] NOT environmental sample[filter] NOT metagenomes[orgn]¶
This excludes sequences from metagenome studies and uncultured sequences from anonymous environmental sample studies.
For help in constructing Entrez queries, please see the Writing Advanced Search Statements section of the Entrez Help document. It is important to know the content of a database and to apply Entrez terms accordingly. For example, biomol_mrna[prop] should not be applied to the htgs or chromosome databases, since they do not contain mRNA entries.
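The example queries above all follow the same pattern: a value, a bracketed field tag, and the Boolean operators AND and NOT. The following is a minimal sketch of assembling such strings programmatically. The helper names (`length_range`, `organism`, `all_of`, `excluding`) are hypothetical and are not part of any NCBI tool; the code only builds text that could be pasted into the Entrez Query field.

```python
def length_range(low: int, high: int) -> str:
    """Return a sequence-length range term such as '1000:2000[slen]'."""
    if low > high:
        raise ValueError("low must not exceed high")
    return f"{low}:{high}[slen]"


def organism(name: str) -> str:
    """Return an organism term such as 'Mus musculus[organism]'."""
    return f"{name}[organism]"


def all_of(*terms: str) -> str:
    """Join terms with AND so that every condition must hold."""
    return " AND ".join(terms)


def excluding(base: str, *unwanted: str) -> str:
    """Append one NOT clause per unwanted term."""
    return base + "".join(f" NOT {term}" for term in unwanted)


mouse_mrna = all_of(organism("Mus musculus"), "biomol_mrna[properties]")
print(mouse_mrna)  # Mus musculus[organism] AND biomol_mrna[properties]

non_hiv_proteases = excluding("protease", organism("hiv1"))
print(non_hiv_proteases)  # protease NOT hiv1[organism]

# Terms can be combined further, here mouse mRNA of a given length.
print(all_of(mouse_mrna, length_range(1000, 2000)))
```

Building the query from small pieces makes it easier to check each term against the content of the chosen database before combining them.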
Compositional adjustments¶
Amino acid substitution matrices may be adjusted in various ways to compensate for the amino acid compositions of the sequences being compared. The simplest adjustment is to scale all substitution scores by an analytically determined constant, while leaving the gap scores fixed; this procedure is called “composition-based statistics” (Schaffer et al., 2001). The resulting scaled scores yield more accurate E-values than standard, unscaled scores. A more sophisticated approach adjusts each score in a standard substitution matrix separately to compensate for the compositions of the two sequences being compared (Yu et al., 2003; Yu and Altschul, 2005; Altschul et al., 2005). Such “compositional score matrix adjustment” may be invoked only under certain specific conditions for which it has been empirically determined to be beneficial (Altschul et al., 2005); under all other conditions, composition-based statistics are used. Alternatively, compositional adjustment may be invoked universally.
[1] Schaffer, A.A., Aravind, L., Madden, T.L., Shavirin, S., Spouge, J.L., Wolf, Y.I., Koonin, E.V. and Altschul, S.F. (2001) “Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements,” Nucleic Acids Res. 29:2994-3005.
[2] Yu, Y.-K., Wootton, J.C. and Altschul, S.F. (2003) “The compositional adjustment of amino acid substitution matrices,” Proc. Natl. Acad. Sci. USA 100:15688-15693.
[3] Yu, Y.-K. and Altschul, S.F. (2005) “The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions,” Bioinformatics 21:902-911.
[4] Altschul, S.F., Wootton, J.C., Gertz, E.M., Agarwala, R., Morgulis, A., Schaffer, A.A. and Yu, Y.-K. (2005) “Protein database searches using compositionally adjusted substitution matrices,” FEBS J. 272:5101-5109.
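The simplest adjustment described in this section, scaling every substitution score by one constant while leaving gap scores fixed, can be illustrated with a toy sketch. This is not BLAST's implementation: the three-entry matrix, the factor of 1.2, and the function name `scale_substitution_scores` are assumptions made for illustration, and the real constant is determined analytically from the sequence compositions.

```python
def scale_substitution_scores(matrix, factor):
    """Multiply every substitution score by factor and round to an integer."""
    return {pair: round(score * factor) for pair, score in matrix.items()}


# Toy three-entry matrix and an arbitrary factor, both hypothetical.
toy_matrix = {("A", "A"): 4, ("A", "R"): -1, ("R", "R"): 5}
gap_costs = {"existence": 11, "extension": 1}

scaled = scale_substitution_scores(toy_matrix, 1.2)
print(scaled)     # {('A', 'A'): 5, ('A', 'R'): -1, ('R', 'R'): 6}
print(gap_costs)  # gap costs are deliberately left untouched
```

Compositional score matrix adjustment differs from this sketch in that each matrix entry receives its own adjustment rather than a shared factor.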
Filter¶
Filter (Low-complexity)
This function masks off segments of the query sequence that have low compositional complexity, as determined by the SEG program of Wootton and Federhen (Computers and Chemistry, 1993) or, for BLASTN, by the DUST program of Tatusov and Lipman. Filtering can eliminate statistically significant but biologically uninteresting reports from the BLAST output (e.g., hits against common acidic-, basic- or proline-rich regions), leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences.
Filtering is only applied to the query sequence (or its translation products), not to database sequences. Default filtering is DUST for BLASTN, SEG for other programs.
It is not unusual for SEG to mask nothing at all when applied to sequences in SWISS-PROT or RefSeq, so filtering should not be expected to always have an effect. Furthermore, in some cases, sequences are masked in their entirety, indicating that the statistical significance of any matches reported against the unfiltered query sequence should be suspect. A fully masked query will also cause a search error when the default settings are used.
Filter (Human repeats)
This option masks human repeats (LINEs, SINEs, plus retroviral repeats) and is useful for human sequences that may contain these repeats. Filtering for repeats can increase the speed of a search, especially with very long sequences (>100 kb) and against databases that contain a large number of repeats (htgs). This filter should be checked for genomic queries to prevent potential problems that may arise from the numerous and often spurious matches to those repeat elements.
Filter (Mask for lookup table only)
BLAST searches consist of two phases: finding hits based upon a lookup table, and then extending them. This option masks only for the purpose of constructing the lookup table used by BLAST, so that no hits are found based upon low-complexity sequence or repeats (if the repeat filter is checked). The BLAST extensions are performed without masking, so they can be extended through low-complexity sequence.
Mask Lower Case
With this option selected, you can cut and paste a FASTA sequence in upper-case characters and denote the areas you would like filtered with lower case. This allows you to customize what is filtered from the sequence during the comparison to the BLAST databases.
One can use different combinations of the above filter options to achieve optimal search results.
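The lower-case convention used by Mask Lower Case can be sketched as follows. This is an illustration of the idea only, not the code BLAST runs: the function names are hypothetical, and replacing masked residues with `N` is an assumption chosen to make the masked region visible in a nucleotide example.

```python
from typing import List, Tuple


def lowercase_intervals(sequence: str) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals, one per lower-case run."""
    intervals = []
    start = None
    for i, ch in enumerate(sequence):
        if ch.islower():
            if start is None:
                start = i  # a new lower-case run begins here
        elif start is not None:
            intervals.append((start, i))
            start = None
    if start is not None:  # run extends to the end of the sequence
        intervals.append((start, len(sequence)))
    return intervals


def apply_mask(sequence: str, mask_char: str = "N") -> str:
    """Replace lower-case residues with mask_char and keep the rest."""
    return "".join(mask_char if ch.islower() else ch for ch in sequence)


query = "ACGTacacacacGGCTA"
print(lowercase_intervals(query))  # [(4, 12)]
print(apply_mask(query))           # ACGTNNNNNNNNGGCTA
```

Marking a region in lower case before pasting the sequence gives the same kind of control as the automatic filters, but over regions chosen by the user.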
Word-size¶
BLAST is a heuristic that works by finding word-matches between the query and database sequences. One may think of this process as finding “hot-spots” that BLAST can then use to initiate extensions that might eventually lead to full-blown alignments. For nucleotide-nucleotide searches (i.e., “blastn”) an exact match of the entire word is required before an extension is initiated, so that one normally regulates the sensitivity and speed of the search by increasing or decreasing the word-size. For other BLAST searches non-exact word matches are taken into account based upon the similarity between words. The amount of similarity can be varied. The webpage allows the word-sizes 2, 3, and 6.
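The effect of word-size on exact-match seeding can be illustrated with a small sketch. It is a simplified stand-in for the nucleotide case only, assuming a plain dictionary as the lookup table; the function names and the two short sequences are hypothetical, and no extension step is performed.

```python
from collections import defaultdict


def build_lookup(query, word_size):
    """Map every word of length word_size in the query to its start positions."""
    table = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        table[query[i:i + word_size]].append(i)
    return table


def find_seeds(query, subject, word_size):
    """Return (query_pos, subject_pos) pairs where an exact word match occurs."""
    table = build_lookup(query, word_size)
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in table.get(subject[j:j + word_size], ()):
            seeds.append((i, j))
    return seeds


query = "ACGTACGGTC"
subject = "TTACGGTCAA"
for w in (4, 7, 8):
    # Prints 4 seeds for w=4, 1 seed for w=7, and 0 seeds for w=8.
    print(w, len(find_seeds(query, subject, w)))
```

The two sequences share an exact stretch of seven bases, so a word-size of 8 finds nothing at all, while a word-size of 4 produces several overlapping seeds. Smaller words give more sensitivity at the cost of more seeds to process.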
Expect¶
This setting specifies the statistical significance threshold for reporting matches against database sequences. The default value (10) means that 10 such matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). If the expect value (E-value) ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported.
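A rough sketch of how the threshold acts on a match is given below, using the Karlin-Altschul form E = K·m·n·exp(−λS) for a raw score S, query length m and database length n. The values of K, λ and the two lengths are hypothetical and chosen only for illustration; BLAST itself uses parameters that depend on the scoring system and applies length corrections that are omitted here.

```python
import math


def expect_value(score, query_len, db_len, k, lam):
    """Karlin-Altschul estimate E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)


def is_reported(e_value, expect_threshold=10.0):
    """A match is dropped when its E-value exceeds the EXPECT threshold."""
    return e_value <= expect_threshold


# Hypothetical parameters and sizes, for illustration only.
K, LAMBDA = 0.1, 0.3
QUERY_LEN, DB_LEN = 250, 5_000_000

for raw_score in (40, 60, 100):
    e = expect_value(raw_score, QUERY_LEN, DB_LEN, K, LAMBDA)
    # Columns: score, E-value, kept at EXPECT=10, kept at EXPECT=0.001
    print(raw_score, f"{e:.3g}", is_reported(e), is_reported(e, 0.001))
```

Under these assumed values the lowest score is rejected even at the default threshold, the middle score is kept at 10 but rejected at 0.001, and the highest score is kept at both.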
Reward and Penalty for Nucleotide Programs¶
Many nucleotide searches use a simple scoring system that consists of a “reward” for a match and a “penalty” for a mismatch. The (absolute) reward/penalty ratio should be increased as one looks at more divergent sequences. A ratio of 0.33 (1/-3) is appropriate for sequences that are about 99% conserved; a ratio of 0.5 (1/-2) is best for sequences that are 95% conserved; a ratio of about one (1/-1) is best for sequences that are 75% conserved [1].
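To make the reward/penalty scheme concrete, the sketch below scores an ungapped alignment of two equal-length sequences under each of the three settings mentioned above. The function name and the example sequences are hypothetical; gaps are ignored for simplicity.

```python
def ungapped_score(a, b, reward=1, penalty=-3):
    """Sum reward for each matching column and penalty for each mismatch."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(reward if x == y else penalty for x, y in zip(a, b))


seq_a = "ACGTACGTAC"
seq_b = "ACGTTCGTAC"  # differs from seq_a at one position

print(ungapped_score(seq_a, seq_b, 1, -3))  # 6
print(ungapped_score(seq_a, seq_b, 1, -2))  # 7
print(ungapped_score(seq_a, seq_b, 1, -1))  # 8
```

A single mismatch costs less as the ratio approaches one, which is why the higher ratios tolerate more divergent sequences.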
Matrix and Gap Costs¶
Matrix
A key element in evaluating the quality of a pairwise sequence alignment is the “substitution matrix”, which assigns a score for aligning any possible pair of residues. The matrix used in a BLAST search can be changed depending on the type of sequences you are searching. See more information on BLAST substitution matrices.
Gap Cost
The pull-down menu shows the Gap Costs available for the chosen Matrix. Only a limited number of options are offered for these parameters. Increasing the Gap Costs will result in alignments with fewer gaps.
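The influence of gap costs on the number of gaps can be seen with a short calculation, assuming the usual affine scheme in which a gap of length k costs the existence cost plus k times the extension cost. The values 11 and 1 are used here as illustrative settings, and the function name is hypothetical.

```python
def gap_cost(length, existence=11, extension=1):
    """Affine cost of one gap: existence + extension * length."""
    if length < 1:
        raise ValueError("a gap must have length at least 1")
    return existence + extension * length


one_long_gap = gap_cost(3)          # a single gap spanning 3 positions
three_short_gaps = 3 * gap_cost(1)  # three separate gaps of 1 position

print(one_long_gap)      # 14
print(three_short_gaps)  # 36
```

Because every new gap pays the existence cost again, raising that cost makes alignments with many separate gaps score worse than alignments with few.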
PSSM
PSI-BLAST can save the Position Specific Score Matrix constructed through iterations. The PSSM thus constructed can be used in searches against other databases with the same query by copying and pasting the encoded text into the PSSM field.
To save a PSSM file:
1. Run a protein BLAST search.
2. Check the PSI-BLAST box on the formatting page.
3. Click the “Format” button.
4. On the PSI-BLAST results page, click the “Run PSI-BLAST Iteration 2” button.
5. Select the Download link at the top of the page and download the PSSM to your computer.
To use the PSSM in a new protein BLAST search against other databases:
1. Open a new protein BLAST page.
2. Select PSI-BLAST as the Algorithm under “Program Selection” (this may already be set).
3. Select the “+” next to “Algorithm parameters” at the bottom of the search page.
4. Scroll to the “PSI/PHI/DELTA BLAST” section and use the “Choose File” button to upload the PSSM that you saved in step 5 above.
5. Select a different target database.
6. Click the “BLAST” button to start the search.
If the database is the same as when the PSSM was stored, you’ll reproduce the iteration on which you saved the PSSM; a different database will yield a different hit list.
PHI-BLAST Pattern¶
PHI-BLAST (Pattern-Hit Initiated BLAST) is a search program that combines matching of regular expressions with local alignments surrounding the match. Given a protein sequence S and a regular expression pattern P occurring in S, PHI-BLAST helps answer the question:
What other protein sequences both contain an occurrence of P and are homologous to S in the vicinity of the pattern occurrences?
PHI-BLAST may be preferable to just searching for pattern occurrences because it filters out those cases where the pattern occurrence is probably random and not indicative of homology.
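The pattern-occurrence half of this question can be sketched with an ordinary regular-expression search. The example below uses Python's `re` syntax as a stand-in rather than PHI-BLAST's own pattern syntax, and it covers only the first step: locating occurrences of P and extracting the surrounding region. The local alignment and statistical evaluation that PHI-BLAST performs around each occurrence are not modelled. The pattern, the sequences and the function names are all hypothetical.

```python
import re


def pattern_hits(pattern, sequences):
    """Return {name: [(start, end), ...]} for sequences containing the pattern."""
    compiled = re.compile(pattern)
    hits = {}
    for name, seq in sequences.items():
        spans = [m.span() for m in compiled.finditer(seq)]  # non-overlapping
        if spans:
            hits[name] = spans
    return hits


def neighbourhood(seq, span, flank=5):
    """Return the pattern occurrence plus up to flank residues on each side."""
    start, end = span
    return seq[max(0, start - flank):end + flank]


# Toy protein sequences, invented for this example.
sequences = {
    "seq1": "MAAGSPDAGKSTLLNA",
    "seq2": "MKLVVTGKAALE",
}
pattern = r"G.{4}GK[ST]"  # G, any four residues, then G-K-(S or T)

hits = pattern_hits(pattern, sequences)
print(hits)  # {'seq1': [(3, 11)]}
print(neighbourhood(sequences["seq1"], hits["seq1"][0], flank=2))  # AAGSPDAGKSTL
```

A plain pattern search like this would report every occurrence, including chance ones; the added alignment requirement is what lets PHI-BLAST discard occurrences that do not indicate homology.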