Frequently Asked Questions
Q: What happened to the Month database?
- 2007/06/30:2007/07/31[mdat] (mdat = modification date)
- 1 month[filter]
- 2 months[filter]
- 6 months[filter]
Q: What are the lower case grey letters in the query sequence in BLAST results?
Q: Submitting primers or other short sequences
Primer-BLAST designs primers that are specific to an input PCR template, using Primer3. It can also check user-supplied primers for specificity.
The "Search for short, nearly exact matches" nucleotide and protein pages no longer exist. Instead, the nucleotide and protein BLAST programs automatically detect short queries and adjust the search parameters accordingly. This adjustment occurs when the query, either nucleotide or amino acid, is 30 residues or shorter. The translating BLAST programs and searches on the genome BLAST pages do not have this automatic adjustment.
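The length check described above can be sketched in a few lines. This is only an illustration: the 30-residue threshold comes from the text, while the mapping to the stand-alone BLAST+ task names (blastn-short, blastp-short) is an assumption made here for illustration, not a description of what the web pages do internally.

```python
SHORT_QUERY_MAX_LENGTH = 30  # threshold stated in the FAQ text


def is_short_query(sequence: str) -> bool:
    """Return True when the query is short enough to trigger the adjustment."""
    residues = "".join(sequence.split())  # ignore whitespace and line breaks
    return 0 < len(residues) <= SHORT_QUERY_MAX_LENGTH


def suggested_task(sequence: str, molecule: str) -> str:
    """Pick a BLAST+ -task value (hypothetical mapping, for illustration only)."""
    if molecule not in ("nucleotide", "protein"):
        raise ValueError("molecule must be 'nucleotide' or 'protein'")
    if is_short_query(sequence):
        return "blastn-short" if molecule == "nucleotide" else "blastp-short"
    return "megablast" if molecule == "nucleotide" else "blastp"
```

For a 20-base primer, suggested_task(primer, "nucleotide") selects the short-query task; a 31-residue protein falls back to the regular task.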
Q: Default database for nucleotide-nucleotide searches
Q: Saving your search parameters
Q: How to limit a search to an organism or taxonomic group or exclude such groups
To search only sequences from an organism or taxonomic group, use the "Organism" text box. On the nucleotide blast pages, first click the radio button for "Others (nr etc.)". The "Organism" text box has an auto fill function. Begin to enter an organism common name (rat, bacteria, etc.), a genus or species (elegans, danio, etc.), or an NCBI taxonomy id; then select a name from the list.
To exclude a taxonomic group instead, check the "Exclude" checkbox to the right of the "Organism" box.
More taxonomic groups may be included or excluded with the "+" box further to the right of the "Organism" text box.
You can also use Entrez Query terms as before. Put those in the Entrez Query box just below the Organism field; for example, rattus norvegicus[organism] or simply, rat[orgn]. Also, see the FAQ, "How to limit a search to a subset of database sequences."
You can search for taxa in the Taxonomy Browser.
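For scripted searches, the Entrez Query string can be assembled programmatically. The helper below is a minimal sketch: the function names are hypothetical, and it only covers the [organism] field together with the OR and NOT operators, falling back to all[filter] when nothing is included.

```python
def organism_term(name: str) -> str:
    """Format one organism or taxon name as an Entrez [organism] term."""
    cleaned = " ".join(name.split())  # collapse stray whitespace
    if not cleaned:
        raise ValueError("organism name must not be empty")
    return f"{cleaned}[organism]"


def build_entrez_query(include=(), exclude=()) -> str:
    """Combine included and excluded taxa into a single Entrez Query string."""
    included = [organism_term(name) for name in include]
    excluded = [organism_term(name) for name in exclude]
    if not included:
        query = "all[filter]"  # nothing included: start from every record
    elif len(included) == 1:
        query = included[0]
    else:
        query = "(" + " OR ".join(included) + ")"
    for term in excluded:
        query += f" NOT {term}"
    return query
```

The resulting string is what you would paste into the Entrez Query box; for example, including "mammals" and excluding "primates" gives mammals[organism] NOT primates[organism].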
Q: How to exclude models (XM/XP accessions) and uncultured environmental sequences?
Q: How to limit a search to a subset of database sequences?
Q: How can I search a batch of sequences with BLAST?
The available batch BLAST options are:
- 1.) Standalone BLAST executables. These are command line programs which run BLAST searches against local, downloaded copies of the NCBI BLAST databases, or against custom databases formatted for BLAST. The programs will handle a single large file with multiple FASTA query sequences, or you can create a script to send multiple files one at a time. The executables are available for a wide variety of platforms, including Linux, Windows, and Mac OS X.
The stand-alone package can be downloaded from links at this page. A manual is available for stand-alone BLAST here.
- 2.) Cloud providers. The NCBI has an Amazon Machine Image (AMI) at Amazon Web Services. This AMI provides access to the stand-alone BLAST executables, and also has a network API similar to the URL API provided by the NCBI. A simple BLAST web page is also included. See the BLAST searches at a Cloud Provider page for details.
- 3.) Network BLAST client. The stand-alone executables can send searches to the BLAST server using the -remote flag. See the BLAST manual for details. This client uses NCBI compute resources and is considered a batch search. Searches will be run at lower priority than interactive searches from the NCBI BLAST web pages. Searches run at off-peak hours may have better throughput. Projects involving many searches should be run with stand-alone BLAST or through an instance at a cloud provider.
- 4.) Submit searches through the NCBI URL API. Documentation is available here. This service uses NCBI compute resources and is considered a batch search. Searches will be run at lower priority than interactive searches from the NCBI BLAST web pages. Searches run at off-peak hours may have better throughput. Projects involving many searches should be run with stand-alone BLAST or through an instance at a cloud provider.
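As a rough illustration of options 1 and 3, the sketch below assembles a command line for the stand-alone executables, with or without the -remote flag. The function name, file names, and the choice of tabular output (-outfmt 6) are assumptions for this example; consult the BLAST manual for the full option list.

```python
import shlex

SUPPORTED_PROGRAMS = {"blastn", "blastp", "blastx", "tblastn", "tblastx"}


def build_blast_command(program, query_fasta, database, output_path,
                        remote=False, evalue=10.0):
    """Return the argument list for one batch search with a multi-FASTA query file."""
    if program not in SUPPORTED_PROGRAMS:
        raise ValueError(f"unsupported program: {program}")
    command = [
        program,
        "-query", query_fasta,    # one file may hold many FASTA sequences
        "-db", database,          # local database name, or an NCBI name with -remote
        "-evalue", str(evalue),   # Expect threshold
        "-outfmt", "6",           # tabular output (a choice made for this sketch)
        "-out", output_path,
    ]
    if remote:
        command.append("-remote")  # send the search to the NCBI servers
    return command


def as_shell_line(command):
    """Render the argument list as a single shell-quoted line for logging."""
    return shlex.join(command)
```

The returned list can be passed to subprocess.run once the executables are installed; as_shell_line(build_blast_command("blastn", "queries.fasta", "nt", "hits.tsv", remote=True)) shows the equivalent shell command.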
Q: How to use BLAST to align two sequences without a database search
NCBI has a tool for aligning two sequences provided by the user. The tool is called BLAST 2 Sequences, and it uses the chosen BLAST algorithm to align the sequences as if they had been found in a database search. This can be helpful for observing differences between two sequences; however, it still performs local alignments, not global alignments.
Because BLAST 2 Sequences uses the size of the current nucleotide or protein nr database to calculate Expect values, you may need to significantly increase the Expect threshold in order to see shorter alignments. Also, the low complexity filter is on by default; this may be the cause of "missing" alignments.
If comparing very large sequences (on the order of hundreds of kilobases), you may need to specify a sequence sub range with the "from" and "to" boxes. Also, submitting the shorter of two sequences as Sequence 1 may help when the two queries are of very different lengths.
Q: What is the Expect (E) value?
The Expect value (E) is a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size. It decreases exponentially as the Score (S) of the match increases. Essentially, the E value describes the random background noise. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance.
The lower the E-value, or the closer it is to zero, the more "significant" the match is. However, keep in mind that virtually identical short alignments have relatively high E values. This is because the calculation of the E value takes into account the length of the query sequence. These high E values make sense because shorter sequences have a higher probability of occurring in the database purely by chance. For more details please see the calculations in the BLAST Course.
The Expect value can also be used as a convenient way to create a significance threshold for reporting results. You can change the Expect value threshold on most BLAST search pages. When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported.
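The exponential relationship between score and E value can be made concrete with the commonly cited Karlin-Altschul formula, E = K * m * n * exp(-lambda * S), where m is the query length and n is the database length. The sketch below is illustrative only: the K and lambda values are placeholder constants chosen for this example (real values depend on the scoring system), and the length corrections applied by the actual programs are ignored.

```python
import math

# Illustrative statistical parameters (assumptions for this sketch only).
LAMBDA = 0.318
K = 0.134


def expect_value(raw_score, query_length, database_length, lam=LAMBDA, k=K):
    """E = K * m * n * exp(-lambda * S), without edge-effect corrections."""
    return k * query_length * database_length * math.exp(-lam * raw_score)


def is_reported(e_value, threshold=10.0):
    """A hit is reported when its E value does not exceed the Expect threshold."""
    return e_value <= threshold
```

Two properties follow directly from the formula: adding a fixed amount to the score divides E by a constant factor, and doubling the database size doubles E for the same score, which is why the same alignment can fall above the threshold in a larger database.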
Q: What is "low-complexity" sequence?
Regions with low-complexity sequence have an unusual composition that can create problems in sequence similarity searching. For amino acid queries this compositional bias is determined by the SEG program (Wootton and Federhen, 1996). For nucleotide queries it is determined by the DustMasker program (Morgulis, et al., 2006).
Low-complexity sequence can often be recognized by visual inspection. For example, the protein sequence PPCDPPPPPKDKKKKDDGPP has low complexity and so does the nucleotide sequence AAATAAAAAAAATAAAAAAT. Filters are used to remove low-complexity sequence because it can cause artifactual hits.
In BLAST searches performed without a filter, high scoring hits may be reported only because of the presence of a low-complexity region. Most often, it is inappropriate to consider this type of match as the result of shared homology. Rather, it is as if the low-complexity region is "sticky" and is pulling out many sequences that are not truly related.
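To give a feel for what "unusual composition" means, the sketch below scores a sequence by its Shannon entropy, a simple stand-in measure. This is not the SEG or DustMasker algorithm; it is a swapped-in illustration, and the window size and cutoff passed to the second function are arbitrary choices made by the caller.

```python
import math
from collections import Counter


def shannon_entropy(sequence: str) -> float:
    """Entropy in bits per residue; low values indicate biased composition."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    total = len(seq)
    counts = Counter(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def low_entropy_windows(sequence: str, window: int, max_entropy: float):
    """Return (start, end) pairs, 0-based and end-exclusive, for low-entropy windows."""
    return [
        (start, start + window)
        for start in range(0, len(sequence) - window + 1)
        if shannon_entropy(sequence[start:start + window]) <= max_entropy
    ]
```

Applied to the two example sequences above, the entropy is well below that of a sequence with evenly mixed residues, which matches what visual inspection suggests.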
Q: How to filter out (organism-specific) interspersed repeats?
On the "blastn" (nucleotide-nucleotide) page there is an option to filter "Species-specific" repeats for a number of common organisms. This may be especially important if your query matches to the same or a related organism many times. To enable this, go to the "Algorithm parameters" section (at the bottom of the page), check "Species-specific repeats", and choose the proper organism.
ERROR: "No significant similarity found"
Below are common reasons that a BLAST search results in the "No significant similarity found" message.
- Short query sequences: Short alignments may have Expect values above the default threshold, which is 10 on most pages, and, therefore, are not displayed. Try increasing the Expect threshold (under 'Algorithm parameters'). Also, see the FAQ Submitting primers or other short sequences.
- Filtering: Some of the BLAST programs mask regions of low complexity by default. These regions are not allowed to initiate alignments, so if your query is largely low complexity, the filter may prevent all hits to the database. On the Basic BLAST pages, adjust the filter settings in the section 'Filters and Masking', under 'Algorithm parameters'. For a description of low complexity filters, see "What is low-complexity sequence?"
ERROR: An error has occurred on the server, Too many HSPs to save all
This error occurs when the search produces far too many high-scoring segment pairs (HSPs) for the BLAST servers to return the results. This is rare, because the results must amount to several hundred megabytes for it to happen. However, certain searches can generate a huge amount of data. Most typically, this error occurs when the default filters are turned off or when the query sequences contain repeat elements. If you get this error, you have several options, depending on your goals:
- 1.) Enable species-specific repeat filtering if applicable; see "How to filter out (organism-specific) interspersed repeats."
- 2.) If using tblastx, try blastx instead. The tblastx program is very CPU intensive because it translates not only the query in six reading frames but every database sequence as well. Often, using tblastx is a measure of last resort; a blastx search against a database of known proteins may provide what you need.
- 3.) Search a smaller database, such as refseq_rna. Larger databases contain more sequences, and for some queries this results in numerous "background" hits. If you want a database of known mRNAs (and their translations), then refseq_rna is a good choice.
- 4.) Break up large queries into smaller pieces and submit each piece in a separate search. A common cause of errors in BLAST is searching with a huge sequence, like a complete chromosome, against a large database like nr. Such a search is better run in portions than as one large, continuous sequence.
- 5.) Limit the database by taxonomy. Start with large groups, such as mammals, bacteria, etc. Any taxonomic node or tax id number that you can find in the Taxonomy browser can be used in the 'Organism' text box; see the BLAST FAQ, "How to limit a search to an organism or taxonomic group." Also see the Taxonomy browser.
- 6.) You may be hitting a large number of 'PREDICTED' or 'hypothetical protein' records. If you do not want these hits, use an Entrez Query such as: all[filter] NOT predicted[title].
- 7.) For megablast and blastn searches, try increasing the word size and/or decreasing the Expect threshold.
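Breaking up a large query, as suggested in the list above, is easy to script. The following is a minimal sketch; the function name is hypothetical, and the optional overlap between neighbouring pieces is an assumption added here so that alignments spanning a cut point are not lost, not something the FAQ prescribes.

```python
def split_query(sequence: str, chunk_size: int, overlap: int = 0):
    """Split a sequence into pieces, returning (start, end, piece) tuples.

    Coordinates are 1-based and inclusive, so hits in a piece can be mapped
    back onto the original sequence by adding (start - 1).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    pieces = []
    start = 0
    while start < len(sequence):
        end = min(start + chunk_size, len(sequence))
        pieces.append((start + 1, end, sequence[start:end]))
        if end == len(sequence):
            break  # the last piece reached the end of the sequence
        start += step
    return pieces
```

Each piece can then be written out as its own FASTA record, with the coordinates in the definition line, and submitted as a separate search.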
ERROR: An error has occurred on the server, [blastsrv4.REAL]:Error: CPU usage limit was exceeded, resulting in SIGXCPU (24).
This error occurs when your search is so large that the backend machines cannot complete it in the time allowed, which is about one hour of combined CPU time. It differs from the "Too many HSPs" error in that there are no results and the servers have essentially killed the process. However, the causes, such as large output or unfiltered queries, are similar.
If you get this error you have numerous options depending on your goals. See the BLAST FAQ, "ERROR: An error has occurred on the server, Too many HSPs to save all".
Q: Why do I get the message "ERROR:BLASTSetUpSearch: Unable to calculate Karlin-Altschul params, check query sequence"?
This will happen if your entire query sequence has been masked by low complexity filtering. You will need to turn filtering off to get hits. For further information on filtering, please read the BLAST FAQ, "What is low-complexity sequence?"
Q: Why some batch searches on the web may seem to take longer than expected
The NCBI WWW BLAST server is a shared resource, and it would be unfair for a few users to monopolize it. To prevent this, the server keeps track of how many queries are in the queue for each user and penalizes users with many queries in the queue. This is done by calculating a 'Time of Execution' (TOE). If a user has only one query in the queue, its TOE is set to the current time. As the user adds more queries, the TOE of each new query is set to the current time plus 60 seconds for every query already in the queue. For example, if a user sent in five requests one after the other without waiting for any to be worked on, the TOEs for the requests would be:
- 1st request: current time
- 2nd request: current time + 60 seconds
- 3rd request: current time + 120 seconds
- 4th request: current time + 180 seconds
- 5th request: current time + 240 seconds
The BLAST server works through requests in order from earliest to latest TOE. A query will be executed before its TOE if there are no other queries with an earlier TOE. Submitting searches during off-peak hours (8 pm to 8 am EST) may provide better throughput. Users with large numbers of queries should use stand-alone BLAST or services at a cloud provider. See here for details.
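The TOE rule can be reproduced with a few lines of arithmetic. The sketch below is a simplification based only on the description above: it assumes all requests arrive at the same instant, and the function names are hypothetical rather than taken from the server.

```python
from datetime import datetime, timedelta

PENALTY_PER_QUEUED_QUERY = timedelta(seconds=60)  # value given in the FAQ


def time_of_execution(now: datetime, queries_already_queued: int) -> datetime:
    """TOE = current time + 60 seconds for each query already in the user's queue."""
    if queries_already_queued < 0:
        raise ValueError("queue length cannot be negative")
    return now + queries_already_queued * PENALTY_PER_QUEUED_QUERY


def toe_schedule(now: datetime, request_count: int):
    """TOEs for a burst of requests submitted back to back at the same moment."""
    return [time_of_execution(now, position) for position in range(request_count)]
```

Calling toe_schedule with five requests yields offsets of 0, 60, 120, 180, and 240 seconds from the submission time, matching the example above.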