The query sequence(s) to be used for a BLAST search should be pasted in the 'Search' text area. BLAST accepts a number of different types of input and automatically determines the format of the input. To allow this feature, certain conventions are required with regard to the input of identifiers (e.g., accessions or gi's). These are described in 3) below. Accepted input types are FASTA, bare sequence, or sequence identifiers.
Accepted Input Formats
A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line (defline) is distinguished from the sequence data by a greater-than (">") symbol at the beginning. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is:
>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
Blank lines are not allowed in the middle of FASTA input.
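The FASTA conventions above can be sketched as a small parser. This is only an illustration, not the parser BLAST itself uses; the function name `parse_fasta` is hypothetical, and treating a blank line inside a record as an error is this sketch's reading of the rule on blank lines.

```python
def parse_fasta(text):
    """Return a list of (defline, sequence) pairs from FASTA-formatted text."""
    records = []
    defline, chunks, blank_seen = None, [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # Remember a blank line only once a record has started.
            blank_seen = defline is not None
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if defline is not None:
                records.append((defline, "".join(chunks)))
            defline, chunks, blank_seen = line[1:], [], False
        elif defline is None:
            raise ValueError("sequence data found before a '>' description line")
        elif blank_seen:
            # Assumption of this sketch: a blank line inside a record is an error.
            raise ValueError("blank line in the middle of a FASTA record")
        else:
            chunks.append(line)
    if defline is not None:
        records.append((defline, "".join(chunks)))
    return records


example = ">seq1 demo protein\nQIKDLLVSSS\nTDLDTTLVLV\n>seq2\nKMKILELPFA\n"
for name, sequence in parse_fasta(example):
    print(name, len(sequence))
```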
Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap of indeterminate length; and in amino acid sequences, U and * are acceptable letters (see below). Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue). The nucleic acid codes supported are:
A  adenosine
C  cytidine
G  guanine
T  thymidine
N  A/G/C/T (any)
U  uridine
K  G/T (keto)
S  G/C (strong)
Y  T/C (pyrimidine)
M  A/C (amino)
W  A/T (weak)
R  G/A (purine)
B  G/T/C
D  G/A/T
H  A/C/T
V  G/C/A
-  gap of indeterminate length
For those programs that use amino acid query sequences (BLASTP and TBLASTN), the accepted amino acid codes are:
A  alanine                 P  proline
B  aspartate/asparagine    Q  glutamine
C  cysteine                R  arginine
D  aspartate               S  serine
E  glutamate               T  threonine
F  phenylalanine           U  selenocysteine
G  glycine                 V  valine
H  histidine               W  tryptophan
I  isoleucine              Y  tyrosine
K  lysine                  Z  glutamate/glutamine
L  leucine                 X  any
M  methionine              *  translation stop
N  asparagine              -  gap of indeterminate length
NOTE:
¹ The degenerate nucleotide codes in red are treated as mismatches in nucleotide alignment. Too many such degenerate codes within an input nucleotide query will cause the BLAST webpage to reject the input. For protein queries, too many nucleotide-like codes (A, C, G, T, N) may also cause a similar rejection.
² In protein queries, U is replaced by X before the search, since U is not specified in any scoring matrix.
³ The BLAST webpage will not accept "-" in the query. To represent gaps, use a string of N or X instead.
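A minimal sketch of how a query could be prepared according to the code tables and notes above. The names `IUPAC_NUCLEOTIDE`, `normalize_query` and `degenerate_fraction` are hypothetical, and replacing each "-" with a single N or X is an assumption of this sketch; the web page does not publish the exact threshold at which it rejects degenerate input, so none is encoded here.

```python
# Nucleotide codes from the table above, mapped to the bases they stand for.
IUPAC_NUCLEOTIDE = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "N": "AGCT", "K": "GT", "S": "GC", "Y": "TC", "M": "AC",
    "W": "AT", "R": "GA", "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA",
}


def normalize_query(seq, protein=False):
    """Upper-case a query and apply the notes on U (protein) and '-'."""
    unknown = "X" if protein else "N"
    out = []
    for ch in seq:
        ch = ch.upper()              # lower case is mapped to upper case
        if ch == "-":
            ch = unknown             # the web page does not accept '-'
        elif protein and ch == "U":
            ch = "X"                 # U is not in any scoring matrix
        out.append(ch)
    return "".join(out)


def degenerate_fraction(nuc_seq):
    """Fraction of letters that are degenerate (stand for more than one base)."""
    if not nuc_seq:
        return 0.0
    degenerate = sum(1 for ch in nuc_seq if len(IUPAC_NUCLEOTIDE.get(ch, "")) > 1)
    return degenerate / len(nuc_seq)


print(normalize_query("acg-t"))
print(normalize_query("mku*", protein=True))
print(degenerate_fraction("ACGN"))
```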
- Bare Sequence
This may be just lines of sequence data, without the FASTA definition line, e.g.:
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
It can also be sequence interspersed with numbers and/or spaces, such as the sequence portion of a GenBank/GenPept flatfile report:
  1 qikdllvsss tdldttlvlv naiyfkgmwk tafnaedtre mpfhvtkqes kpvqmmcmnn
 61 sfnvatlpae kmkilelpfa sgdlsmlvll pdevsdleri ektinfeklt ewtnpntmek
121 rrvkvylpqm kieekynlts vlmalgmtdl fipsanltgi ssaeslkisq avhgafmels
181 edgiemagst gviedikhsp eseqfradhp flflikhnpt ntivyfgryw sp
Blank lines are not allowed in the middle of bare sequence input.
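For illustration, the sequence portion of a flatfile report can be reduced to a plain run of letters before pasting. The helper name `flatfile_to_bare_sequence` is hypothetical; the sketch simply drops digits and whitespace and upper-cases the rest.

```python
import re


def flatfile_to_bare_sequence(text):
    """Drop position numbers and whitespace from a GenBank/GenPept sequence block."""
    return re.sub(r"[\d\s]+", "", text).upper()


block = """
  1 qikdllvsss tdldttlvlv
 61 sfnvatlpae
"""
print(flatfile_to_bare_sequence(block))
```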
Normally these are simply accession, accession.version or gi's (e.g., p01013, AAA68881.1, 129295), but a bar-separated NCBI sequence identifier (e.g., gi|129295) will also be accepted. These NCBI sequence identifiers have a very specific syntax as described in https://ncbi.github.io/cxx-toolkit/pages/ch_demo#ch_demo.T5. The identifier may consist of only one token (i.e., word). Spaces between letters in the input will cause it to be treated as bare sequence (spaces before or after the identifier are allowed). Examples of illegal input are:
ACCESSION P01013
AAA68881. 1
gi| 129295
For the first example, "ACCESSION" must be removed; in the second example, there is a space before the version number of the accession; in the third example, there is a space after the bar ("|"). If more than one query is specified, each identifier should be on a separate line.
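The single-token rule can be checked mechanically. The sketch below uses a hypothetical helper, `could_be_identifier`, and tests only the necessary condition stated above (exactly one token, with surrounding spaces allowed); it does not check that the token is a real accession.

```python
def could_be_identifier(line):
    """True if the line consists of exactly one whitespace-delimited token."""
    return len(line.split()) == 1


for candidate in ["ACCESSION P01013", "AAA68881. 1", "gi| 129295", "  p01013  ", "gi|129295"]:
    print(repr(candidate), could_be_identifier(candidate))
```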
This function allows users to upload a text file containing queries formatted in FASTA format. The file can also contain sequence identifiers instead of FASTA sequences.
A segment of the query sequence can be used in BLAST searching. You can enter the range in the "From" and "To" boxes provided under "Query subrange" to specify the position of this segment. For example, to limit matches to the region from 24 to 200 of a query sequence, you would enter 24 in the "From" field and 200 in the "To" field. If one of the limits you enter is out of range, the intersection of the [From,To] and [1,length] intervals will be searched, where length is the length of the whole query sequence.
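The interval intersection described above can be written out as a short sketch. The function names are hypothetical, and returning `None` when the two intervals do not overlap is a choice made for this example, not documented BLAST behavior.

```python
def clamp_subrange(start, stop, length):
    """Intersect the 1-based inclusive range [start, stop] with [1, length]."""
    lo = max(start, 1)
    hi = min(stop, length)
    if lo > hi:
        return None  # the two intervals do not overlap
    return lo, hi


def extract_subrange(seq, start, stop):
    """Return the part of seq covered by the clamped 1-based range."""
    clamped = clamp_subrange(start, stop, len(seq))
    if clamped is None:
        return ""
    lo, hi = clamped
    return seq[lo - 1:hi]


print(clamp_subrange(24, 200, 150))  # the upper limit is cut back to the query length
```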
Genetic code to be used in blastx and tblastx translation of the query. See list of Genetic Codes in Taxonomy.
A BLAST search may be limited by organism. The entry field will suggest completions once a user starts typing. A checkbox will exclude rather than include the organism in the search.
A BLAST search can be limited to the result of an Entrez query against the database chosen. This restricts the search to a subset of entries from that database fitting the requirement of the Entrez query. Terms normally accepted by Entrez nucleotide or protein searches are accepted here. Examples are given below.
- protease NOT hiv1[organism]
This will limit a BLAST search to all proteases, except those in HIV 1.
This limits the search to entries with lengths between 1000 and 2000 bases for nucleotide entries, or 1000 and 2000 residues for protein entries.
- Mus musculus[organism] AND biomol_mrna[properties]
This limits the search to mouse mRNA entries in the database. For common organisms, one can also select from the pulldown menu.
This is another example, which limits the search to protein sequences with a calculated molecular weight between 10 kD and 100 kD.
- src specimen voucher[properties]
This limits the search to entries that are annotated with a /specimen_voucher qualifier on the source feature.
- all[filter] NOT environmental sample[filter] NOT metagenomes[orgn]
This excludes sequences from metagenome studies and uncultured sequences from anonymous environmental sample studies.
For help in constructing Entrez queries, please see the "Writing Advanced Search Statements" section of the Entrez Help document. Knowing the content of a database and applying the Entrez terms accordingly are important. For example, biomol_mrna[prop] should not be applied to the htgs or chromosome databases, since they do not contain mRNA entries.
Amino acid substitution matrices may be adjusted in various ways to compensate for the amino acid compositions of the sequences being compared. The simplest adjustment is to scale all substitution scores by an analytically determined constant, while leaving the gap scores fixed; this procedure is called "composition-based statistics" (Schaffer et al., 2001). The resulting scaled scores yield more accurate E-values than standard, unscaled scores. A more sophisticated approach adjusts each score in a standard substitution matrix separately to compensate for the compositions of the two sequences being compared (Yu et al., 2003; Yu and Altschul, 2005; Altschul et al., 2005). Such "compositional score matrix adjustment" may be invoked only under certain specific conditions for which it has been empirically determined to be beneficial (Altschul et al., 2005); under all other conditions, composition-based statistics are used. Alternatively, compositional adjustment may be invoked universally.
Schaffer, A.A., Aravind, L., Madden, T.L., Shavirin, S., Spouge, J.L., Wolf, Y.I., Koonin, E.V. and Altschul, S.F. (2001) "Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements," Nucleic Acids Res. 29:2994-3005.
Yu, Y.-K., Wootton, J.C. and Altschul, S.F. (2003) "The compositional adjustment of amino acid substitution matrices," Proc. Natl. Acad. Sci. USA 100:15688-15693.
Yu, Y.-K. and Altschul, S.F. (2005) "The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions," Bioinformatics 21:902-911.
Altschul, S.F., Wootton, J.C., Gertz, E.M., Agarwala, R., Morgulis, A., Schaffer, A.A. and Yu, Y.-K. (2005) "Protein database searches using compositionally adjusted substitution matrices," FEBS J 272(20):5101-9.
- Filter (Low-complexity)
This function masks off segments of the query sequence that have low compositional complexity, as determined by the SEG program of Wootton and Federhen (Computers and Chemistry, 1993) or, for BLASTN, by the DUST program of Tatusov and Lipman. Filtering can eliminate statistically significant but biologically uninteresting reports from the BLAST output (e.g., hits against common acidic-, basic- or proline-rich regions), leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences.
Filtering is only applied to the query sequence (or its translation products), not to database sequences. Default filtering is DUST for BLASTN, SEG for other programs.
It is not unusual for nothing at all to be masked by SEG when applied to sequences in SWISS-PROT or RefSeq, so filtering should not be expected to always yield an effect. Furthermore, in some cases, sequences are masked in their entirety, indicating that the statistical significance of any matches reported against the unfiltered query sequence should be suspect. This will also lead to a search error when the default setting is used.
- Filter (Human repeats)
This option masks human repeats (LINEs, SINEs, plus retroviral repeats) and is useful for human sequences that may contain these repeats. Filtering for repeats can increase the speed of a search, especially with very long sequences (>100 kb) and against databases which contain a large number of repeats (htgs). This filter should be checked for genomic queries to prevent potential problems that may arise from the numerous and often spurious matches to those repeat elements.
For more information please see "Why does my search timeout on the BLAST servers?" in the BLAST Frequently Asked Questions.
- Filter (Mask for lookup table only)
BLAST searches consist of two phases, finding hits based upon a lookup table and then extending them. This option masks only for purposes of constructing the lookup table used by BLAST so that no hits are found based upon low-complexity sequence or repeats (if repeat filter is checked). The BLAST extensions are performed without masking and so they can be extended through low-complexity sequence.
- Mask Lower Case
With this option selected you can cut and paste a FASTA sequence in upper case characters and denote areas you would like filtered with lower case. This allows you to customize what is filtered from the sequence during the comparison to the BLAST databases.
One can use different combinations of the above filter options to achieve optimal search results.
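As an illustration of the "Mask Lower Case" convention, the sketch below finds the lower-case stretches of a query and replaces them with the masking character. The helper names are hypothetical, and the code only mirrors the idea of lower-case masking; it does not reproduce SEG, DUST, or the repeat filter.

```python
def lowercase_intervals(seq):
    """Return 1-based inclusive (start, end) intervals of lower-case runs."""
    intervals = []
    start = None
    for pos, ch in enumerate(seq, start=1):
        if ch.islower():
            if start is None:
                start = pos
        elif start is not None:
            intervals.append((start, pos - 1))
            start = None
    if start is not None:
        intervals.append((start, len(seq)))
    return intervals


def apply_lowercase_mask(seq, protein=False):
    """Replace every lower-case letter with X (protein) or N (nucleotide)."""
    mask_char = "X" if protein else "N"
    return "".join(mask_char if ch.islower() else ch for ch in seq)


query = "ACGTacgtACgt"
print(lowercase_intervals(query))
print(apply_lowercase_mask(query))
```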
BLAST is a heuristic that works by finding word-matches between the query and database sequences. One may think of this process as finding "hot-spots" that BLAST can then use to initiate extensions that might eventually lead to full-blown alignments. For nucleotide-nucleotide searches (i.e., "blastn") an exact match of the entire word is required before an extension is initiated, so that one normally regulates the sensitivity and speed of the search by increasing or decreasing the word-size. For other BLAST searches non-exact word matches are taken into account based upon the similarity between words. The amount of similarity can be varied. The webpage allows the word-sizes 2, 3, and 6.
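The exact word-match step for nucleotide searches can be sketched as a lookup table built from the query and scanned with the subject. This is a simplified illustration with hypothetical function names; it omits the extension phase and the neighborhood (non-exact) words used for protein searches.

```python
from collections import defaultdict


def build_lookup_table(query, word_size):
    """Map every word of length word_size in the query to its start offsets."""
    table = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        table[query[i:i + word_size]].append(i)
    return table


def find_word_hits(query, subject, word_size):
    """Return (query_offset, subject_offset) pairs where an exact word matches."""
    table = build_lookup_table(query, word_size)
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in table.get(subject[j:j + word_size], ()):
            hits.append((i, j))
    return hits


# A smaller word size finds more seeds (more sensitive, but slower to extend).
print(find_word_hits("ACGTAC", "TTACGTT", 4))
print(find_word_hits("ACGTAC", "TTACGTT", 3))
```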
This setting specifies the statistical significance threshold for reporting matches against database sequences. The default value (10) means that 10 such matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). If the statistical significance ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported.
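Under the Karlin and Altschul model, the expected number of chance matches with raw score at least S is E = K * m * n * exp(-lambda * S), where m and n are the query and database lengths. The sketch below evaluates this formula and applies the reporting rule described above. The function names are hypothetical, and the values of lambda and K in the example are illustrative placeholders, not the parameters BLAST would use for a particular search; length corrections are ignored.

```python
import math


def expect_value(raw_score, query_len, db_len, lam, k):
    """Karlin-Altschul expected number of chance matches with score >= raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def is_reported(e_value, threshold=10.0):
    """A match is reported unless its E-value is greater than the threshold."""
    return e_value <= threshold


# Illustrative parameters only (assumed for this example).
for score in (40, 60, 80):
    e = expect_value(score, query_len=250, db_len=50_000_000, lam=0.318, k=0.134)
    print(score, e, is_reported(e))
```

Lowering the threshold passed to `is_reported` corresponds to a more stringent EXPECT setting.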
Many nucleotide searches use a simple scoring system that consists of a "reward" for a match and a "penalty" for a mismatch. The (absolute) reward/penalty ratio should be increased as one looks at more divergent sequences. A ratio of 0.33 (1/-3) is appropriate for sequences that are about 99% conserved; a ratio of 0.5 (1/-2) is best for sequences that are 95% conserved; a ratio of about one (1/-1) is best for sequences that are 75% conserved (States et al., 1991).
States DJ, Gish W, and Altschul SF (1991) METHODS: A companion to Methods in Enzymology 3:66-70.
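A rough way to see why the ratio matters is to compute the average score per aligned position at a given identity. This back-of-the-envelope sketch is only an illustration with a hypothetical function name; it is not the derivation used by States et al.

```python
def expected_score_per_base(identity, reward, penalty):
    """Average score per aligned position for a given fraction of identities."""
    return identity * reward + (1.0 - identity) * penalty


for reward, penalty in ((1, -3), (1, -2), (1, -1)):
    for identity in (0.99, 0.95, 0.75):
        score = expected_score_per_base(identity, reward, penalty)
        print(f"{reward}/{penalty} at {identity:.0%} identity: {score:+.2f}")
```

With 1/-3 scoring, an alignment at 75% identity gains nothing on average, so it cannot accumulate a significant score, whereas 1/-1 scoring still rewards it.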
A key element in evaluating the quality of a pairwise sequence alignment is the "substitution matrix", which assigns a score for aligning any possible pair of residues. The matrix used in a BLAST search can be changed depending on the type of sequences you are searching with (see the BLAST Frequently Asked Questions). See more information on BLAST substitution matrices.
- Gap Cost
The pull-down menu shows the Gap Costs available for the chosen Matrix. Only a limited number of options are available for these parameters. Increasing the Gap Costs will result in alignments with fewer gaps.
PSI-BLAST can save the Position Specific Score Matrix (PSSM) constructed through iterations. The PSSM thus constructed can be used in searches against other databases with the same query by copying and pasting the encoded text into the PSSM field.
To save a PSSM file:
- Run a protein BLAST search.
- Check the PSI-BLAST box on formatting page.
- Click the "Format" Button.
- On the PSI-BLAST results page, click the "Run PSI-BLAST Iteration 2" button.
- Now, on the Format page, select "PSSM" from the "Show" pull down menu.
- Click "Format" button.
- This will display text output with the ASCII-encoded PSSM. The "Save as..." option of the browser can be used to save this to a plain text file on your hard drive.
To use the PSSM in a new protein BLAST search against other databases:
- Copy the above PSSM from the browser
- Open a new protein BLAST page
- Paste the PSSM in the PSSM field in the page
- provide the SAME query in the search box
- select a different target database
- click "BLAST" button to start the search
PHI-BLAST (Pattern-Hit Initiated BLAST) is a search program that combines matching of regular expressions with local alignments surrounding the match. Given a protein sequence S and a regular expression pattern P occurring in S, PHI-BLAST helps answer the question:
What other protein sequences both contain an occurrence of P and are homologous to S in the vicinity of the pattern occurrences?
PHI-BLAST may be preferable to just searching for pattern occurrences because it filters out those cases where the pattern occurrence is probably random and not indicative of homology. See PHI-BLAST pattern syntax for details.
C. Result Format Options
An overview of the database sequences aligned to the query sequence is shown. The score of each alignment is indicated by one of five different colors, which divide the range of scores into five groups. Multiple segments of alignments to the same database sequence are connected by a thin grey line. Mousing over a hit sequence causes the definition and score to be shown in the window at the top; clicking on a hit sequence takes the user to the associated alignments.
Checking this option allows the BLAST formatter to parse out the annotated sequence features found in or around the hits and display them within the BLAST result. For custom query sequences, it will also translate the CDS using the CDS translation annotated on the matching database sequence as a guide. Mismatches in the translation are highlighted in pink. A representative example with CDS translation is given below.
>gi|46452254|gb|AY585334.1| Sus scrofa cystic fibrosis transmembrane conductance regulator (CFTR) mRNA, complete cds
Length=4449
 Score = 5453 bits (2751), Expect = 0.0
 Identities = 4036/4449 (90%), Gaps = 6/4449 (0%)
 Strand=Plus/Plus
CDS: Putative 1      1    M  Q  R  S  P  L  E  K  A  S  V  V  S  K  L  F  F  S  W  T
Query                133  ATGCAGAGGTCGCCTCTGGAAAAGGCCAGCGTTGTCTCCAAACnnnnnnnCAGCTGGACC  192
                          |||||||||||||||||||||||||||||| |  ||||||||||||||||||||||||||
Sbjct                1    ATGCAGAGGTCGCCTCTGGAAAAGGCCAGCATCTTCTCCAAACTTTTTTTCAGCTGGACC  60
CDS:cystic fibrosis  1    M  Q  R  S  P  L  E  K  A  S  I  F  S  K  L  F  F  S  W  T
CDS: Putative 1      21   R  P  I  L  R  K  G  Y  R  Q  R  L  E  L  S  D  I  Y  Q  I
Query                193  AGACCAATTTTGAGGAAAGGATACAGACAGCGCCTGGAATTGTCAGACATATACCAAATC  252
                          |||||||||||||| |||||||| |||||||||||||||||||||||||||||||| |||
Sbjct                61   AGACCAATTTTGAGAAAAGGATATAGACAGCGCCTGGAATTGTCAGACATATACCATATC  120
CDS:cystic fibrosis  21   R  P  I  L  R  K  G  Y  R  Q  R  L  E  L  S  D  I  Y  H  I
CDS: Putative 1      41   P  S  V  D  S  A  D  N  L  S  E  K  L  E  R  E  W  D  R  E
Query                253  CCTTCTGTTGATTCTGCTGACAATCTATCTGAAAAATTGGAAAGAGAATGGGATAGAGAG  312
                           |||||  ||| |||||||||||||| |||||||||||||||||||||||||| |||||
Sbjct                121  TCTTCTTCTGACTCTGCTGACAATCTGTCTGAAAAATTGGAAAGAGAATGGGACAGAGAA  180
CDS:cystic fibrosis  41   S  S  S  D  S  A  D  N  L  S  E  K  L  E  R  E  W  D  R  E
Two options determine how filter-masked regions are displayed.
- Masking Character
"X or N" displays the masked region as X for protein and N for nucleotide
"Lower Case" displays the masked region in lower case letters
- Masking Color
The masked region can be "highlighted" with grey or red colored fonts
This option restricts the number of short descriptions of matching sequences reported to the number specified. Default setting varies from page to page. See also EXPECT.
The database alignments are displayed as pairs of matches between the query and subject sequences. A middle line between the query and subject sequence displays the status of each letter. For protein alignments (e.g., BLASTP/BLASTX/TBLASTN), identities show the letter, conservative substitutions show a "+", and nothing is shown otherwise. For nucleotide alignments (e.g., BLASTN and megaBLAST), a "|" is shown for matches and nothing for mismatches. This is the default view.
- Pairwise with dots for identities
The database alignments are shown pairwise, anchored to (shown in relation to) the query sequence, with mismatches colored in red. Sbjct will be in red and bold font if a line in the alignment contains mismatches. See example below.
- Query-anchored with dots for identities
The database alignments are anchored to (shown in relation to) the query sequence. Identities are displayed as dots (.), with mismatches displayed as single letter abbreviations.
- Query-anchored with letters for identities
Identities are shown as single letter nucleotide abbreviations.
- Flat Query-anchored with dots for identities
The 'flat' display shows inserts as deletions on the query. Identities are displayed as dots (.), with mismatches displayed as single letter abbreviations.
- Flat Query-anchored with letters for identities
The 'flat' display shows inserts as deletions on the query. Identities are shown as single letter abbreviations.
>gi|21536448|ref|NM_002622.3| Homo sapiens prefoldin 1 (PFDN1), mRNA
Length=1296
 Score = 392 bits (212), Expect = 2e-107
 Identities = 220/223 (98%), Gaps = 3/223 (1%)
 Strand=Plus/Plus
Query  107  TCCTACCTGGAGCGAAG-GTTANAGGAAGCTGAGGACAACATCCGGGAGATGCTGATGGC  165
Sbjct  300  .................C....-.....................................  358
Query  166  ACGAAGGG-CCAGTAGGGAGCCTCTCTGGGAAGCTCTTCCTCCTGCCCCTCCCATTCCTG  224
Sbjct  359  ........C...................................................  418
Query  225  GTGGGGGCAGAGGAGTGTCTGCAGGGAAACAGCTTCTCCTCTGCCCCGATGGATGCTTTA  284
Sbjct  419  ............................................................  478
Query  285  TTTGGATGGCCTGGCAACATCACATTTTCTGCATCACCCTGAG  327
Sbjct  479  ...........................................  521
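The middle line of the default pairwise view and the dotted subject row of the "dots for identities" views can both be derived from two aligned rows of equal length. The sketch below uses hypothetical function names and covers nucleotide rows only; the "+" marks of protein alignments would need a substitution matrix and are not shown.

```python
def pairwise_midline(query_row, sbjct_row):
    """Nucleotide middle line: '|' for a match, a space otherwise."""
    if len(query_row) != len(sbjct_row):
        raise ValueError("aligned rows must have the same length")
    return "".join(
        "|" if q != "-" and q.upper() == s.upper() else " "
        for q, s in zip(query_row, sbjct_row)
    )


def dots_for_identities(query_row, sbjct_row):
    """Subject row with identities replaced by dots, as in the example above."""
    if len(query_row) != len(sbjct_row):
        raise ValueError("aligned rows must have the same length")
    return "".join(
        "." if q != "-" and q.upper() == s.upper() else s
        for q, s in zip(query_row, sbjct_row)
    )


query_row = "GAAG-GTTANAG"
sbjct_row = "GAAGCGTTA-AG"
print(query_row)
print(pairwise_midline(query_row, sbjct_row))
print(sbjct_row)
print(dots_for_identities(query_row, sbjct_row))
```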
The Download links allow downloads of the Text report, CSV, XML, ASN.1, or JSON.
The Position-Specific Iterated BLAST (PSI-BLAST) program performs iterative searches with a protein query, in which sequences found in one round of search are used to build a custom score model for the next round.
In PSI-BLAST the algorithm is not tied to a specific score matrix, such as BLOSUM62, which has been implemented using an AxA substitution matrix where A is the alphabet size. Instead, it uses a QxA matrix, where Q is the length of the query sequence. At each position the cost of a letter depends on the position with regard to the query and the letter in the subject sequence.
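To make the Q x A idea concrete, the sketch below scores a subject segment against a toy position-specific matrix stored as one row per query position. The function name, the reduced three-letter alphabet, and all scores are invented for illustration; a real PSSM covers the full amino acid alphabet and is derived from the hits of the previous iteration.

```python
def score_segment(pssm, segment):
    """Sum the position-specific scores of an ungapped subject segment."""
    if len(pssm) != len(segment):
        raise ValueError("segment length must equal the number of PSSM rows")
    return sum(row[letter] for row, letter in zip(pssm, segment))


# Toy Q x A matrix: Q = 3 query positions, A = {A, G, L} (made-up scores).
toy_pssm = [
    {"A": 4, "G": 0, "L": -2},
    {"A": -1, "G": 6, "L": -3},
    {"A": -2, "G": -3, "L": 5},
]

print(score_segment(toy_pssm, "AGL"))  # 4 + 6 + 5 = 15
print(score_segment(toy_pssm, "LGA"))  # -2 + 6 - 2 = 2
```

The same letter scores differently depending on where it falls relative to the query, which is what distinguishes this from a fixed A x A matrix.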
To run this search, the "Format for PSI-BLAST" checkbox must be checked.
This sets the statistical significance threshold for including a sequence in the model used by PSI-BLAST to create the PSSM on the next iteration.
This function is similar to the "Limit by Entrez Query terms" in the option section. The only difference is that it applies only to the identified hits. In other words, it is applied post-search and allows users to see only hits fitting the requirement of the Entrez query terms. The default is to format without input query terms and allow users to see all the hits.
This instructs BLAST formatter to display hits with Expect value within the specified range. Default value is 0 to Expect value setting. Lower bound goes to the first box, higher bound goes to the second box.
D. Rules for pattern syntax for PHI-BLAST
Web PHI-BLAST search requires a pattern along with a protein sequence containing the pattern. A simple example and how to use PHI-BLAST is available from this page.
The syntax for pattern specification in PHI-BLAST follows the conventions of PROSITE. When using the stand-alone program, it is permissible to have multiple patterns in a file separated by a blank line between patterns. When using the Web-page only one pattern is allowed per query.
|[ ]||means any one of the characters enclosed in the brackets, e.g., [LFYT] means one occurrence of L or F or Y or T|
|-||means nothing; used as a spacer to clearly separate each position|
|x||with nothing following means any residue|
|(n)||means the preceding residue is repeated n times|
|(m,n)||means the preceding residue is repeated between m and n times (n > m)|
|>||may occur only at the end of a pattern and means nothing; it may occur before a period|
|.||may be used at the end; means nothing|
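The syntax rules in the table can be illustrated by translating a pattern into a Python regular expression. This is a sketch under the rules listed above only, with a hypothetical function name; it is not the matcher used by PHI-BLAST and does not cover PROSITE features outside this table. The pattern used is the ER_TARGET pattern shown later in this section.

```python
import re

# One pattern element: a bracketed set or single letter, optionally followed
# by a repeat count such as (5) or (5,11).
ELEMENT = re.compile(r"(\[[A-Za-z]+\]|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")


def phi_pattern_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression string."""
    # Trailing '>' and '.' mean nothing; '-' is only a spacer.
    body = pattern.strip().rstrip(".>").replace("-", "")
    parts, pos = [], 0
    while pos < len(body):
        match = ELEMENT.match(body, pos)
        if match is None:
            raise ValueError(f"unsupported pattern syntax at position {pos}")
        residue, low, high = match.groups()
        residue = "." if residue in ("x", "X") else residue.upper()
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{%s}" % low
        else:
            repeat = "{%s,%s}" % (low, high)
        parts.append(residue + repeat)
        pos = match.end()
    return "".join(parts)


regex = phi_pattern_to_regex("[KRHQSA]-[DENQ]-E-L>.")
print(regex)
print(re.search(regex, "MKTAKDELQ"))
```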
When using the stand-alone program, the pattern should be stored in a pattern input file, with the first line starting with ID followed by 2 spaces and a text string giving the pattern a name. There should also be a line starting with PA followed by 2 spaces and then the pattern description.
All other PROSITE codes in the first two columns are allowed, but only the HI code, described below, is relevant to PHI-BLAST.
Here is an example from PROSITE:
ID CNMP_BINDING_2; PATTERN.
AC PS00889;
DT OCT-1993 (CREATED); OCT-1993 (DATA UPDATE); NOV-1995 (INFO UPDATE).
DE Cyclic nucleotide-binding domain signature 2.
NR /TOTAL=57(36); /POSITIVE=57(36); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR /FALSE_NEG=1; /PARTIAL=1;
CC /TAXO-RANGE=??EP?; /MAX-REPEAT=2;
The line starting with ID gives the pattern a name.
The lines starting with AC, DT, DE, NR, NR, and CC are relevant to PROSITE users, but irrelevant to PHI-BLAST. These lines are tolerated, but ignored by PHI-BLAST.
The line starting with PA describes the pattern, which can be explained as follows.
|Pattern Position||Pattern Syntax||Meaning|
|1||[LIVMF]||one of LIVMF|
|4||X||any one residue|
|5||[GAS]||one of GAS|
|6||[LIVM]||one of LIVM|
|7||X(5,11)||any residue, repeated 5 to 11 times|
|9||[STAQ]||one of STAQ|
|11||X||any one residue|
|12||[LIVMA]||one of LIVMA|
|13||X||any one residue|
|14||[STACV]||any one of STACV|
|Note: the total length of this motif/pattern is between 18 and 24 residues.|
In this case the pattern ends with a period. It can end with nothing after the last specifying symbol or any number of > signs or periods or combination thereof. Given below is another example, illustrating the use of an HI line.
ID  ER_TARGET; PATTERN.
PA  [KRHQSA]-[DENQ]-E-L>.
HI  (19 22)
HI  (201 204)
In this example, the HI lines specify that the pattern occurs twice, once from positions 19 through 22 in the sequence and once from positions 201 through 204 in the sequence. These specifications are relevant when stand-alone PHI-BLAST is used with the seedp option, in which the interesting occurrences of the pattern in the sequence are specified. In this case the HI lines specify which occurrence(s) of the pattern should be used to find good alignments.
In general, the seedp option is more useful than the standard patternp option ONLY when the pattern occurs K > 1 times in the sequence AND the user is interested in matching to J < K of those occurrences. Then using the HI lines enables the user to specify which occurrences are of interest.
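The ID, PA and HI lines of a pattern input file can be read with a few lines of code. The sketch below is an illustration with a hypothetical function name, not the reader used by stand-alone PHI-BLAST; it ignores every other two-letter code, as the text above says PHI-BLAST does.

```python
def parse_pattern_file(text):
    """Collect the name, pattern and HI ranges of each entry in a pattern file."""
    entries, current = [], None
    for line in text.splitlines():
        code, value = line[:2], line[2:].strip()
        if code == "ID":
            # Keep only the name in front of the first ';'.
            current = {"id": value.split(";")[0], "pattern": "", "hits": []}
            entries.append(current)
        elif current is None:
            continue  # nothing to attach this line to yet
        elif code == "PA":
            current["pattern"] += value  # a pattern may span several PA lines
        elif code == "HI":
            start, stop = value.strip("()").split()
            current["hits"].append((int(start), int(stop)))
        # AC, DT, DE, NR, CC and other codes are tolerated but ignored.
    return entries


example = """ID  ER_TARGET; PATTERN.
PA  [KRHQSA]-[DENQ]-E-L>.
HI  (19 22)
HI  (201 204)
"""
print(parse_pattern_file(example))
```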
This version of Mega BLAST is designed specifically for comparison of diverged sequences, especially sequences from different organisms, which have alignments with low degree of identity, where the original Mega BLAST is not very effective. The major difference is in the use of the 'discontiguous word' approach to finding initial offset pairs, from which the gapped extension is then performed.
Both Mega BLAST and all previous versions of nucleotide-nucleotide BLAST look for exact matches of certain length as the starting points for gapped alignments. When comparing less conserved sequences, i.e. when the expected share of identity between them is e.g. 80% and below, this traditional approach becomes much less productive than for the higher degree of conservation. Depending on the length of the exact match to start the alignments from, it either misses a lot of statistically significant alignments, or on the contrary finds too many short random alignments.
According to Ma et al. (2002), as well as our own probability simulations, the productivity of the word-finding algorithm is much higher if initial 'words' are based not on an exact match, but on a match of a certain set of nonconsecutive positions within longer segments of the sequences. This way fewer words are found overall, but more of them end up producing statistically significant alignments than in the case of contiguous words of the same length as, or even shorter than, the number of matched positions in the discontiguous word.
As an example, we can define a pattern (template) of 0s and 1s of length e.g. 21:
100101100101100101101
For each pair of offsets in the query and subject sequences that are being compared, we compare the 21 nucleotide segments in these sequences ending at these offsets, and require only those positions in those segments to match that correspond to the 1s in the above template.
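The template comparison described above can be sketched directly. The function names are hypothetical, and the brute-force scan over all offset pairs is for illustration only; a real implementation would index the query words rather than compare every pair of segments.

```python
TEMPLATE = "100101100101100101101"  # the length-21 example template from the text


def matches_template(query_segment, subject_segment, template=TEMPLATE):
    """True if the segments agree at every position where the template has a 1."""
    if not (len(query_segment) == len(subject_segment) == len(template)):
        raise ValueError("segments and template must have the same length")
    return all(
        q == s
        for q, s, t in zip(query_segment, subject_segment, template)
        if t == "1"
    )


def discontiguous_hits(query, subject, template=TEMPLATE):
    """Return 0-based (query_end, subject_end) offsets of template matches."""
    t = len(template)
    hits = []
    for i in range(len(query) - t + 1):
        for j in range(len(subject) - t + 1):
            if matches_template(query[i:i + t], subject[j:j + t], template):
                hits.append((i + t - 1, j + t - 1))
    return hits


query = "ACGTACGTACGTACGTACGTA"
subject = query[:1] + "G" + query[2:]  # mismatch at a position where the template has a 0
print(TEMPLATE.count("1"), "positions must match")
print(discontiguous_hits(query, subject))
```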
There are several advantages in using this approach. First, the conditional probabilities of finding word hits satisfying discontiguous templates, given the expected identity percentage in the alignments between two sequences, are higher than for contiguous words with the same number of positions required to match. If two word hits are required to initiate a gapped extension, the effect of the discontiguous word approach is even larger. In both cases higher sensitivity is achieved because there is less correlation between successive words as the database sequence is scanned across the query sequence. Second, when comparing coding sequences, the conservation of the third nucleotide in every codon is not essential, so there is no need to require it when matching initial words. This implies the advantage of using templates based on the '110' pattern, which are called 'coding'. Finally, to achieve even higher sensitivity, one might combine two different discontiguous word templates and require any one of them to match at a given position to qualify it for the initial word hit.
The following options specific to this approach are supported:
- Template length: 16, 18, 21.
- Word size (i.e. number of 1s in the template): 11, 12.
- Template type: coding, non-coding.
- Require two words for extension: yes/no.
The 'coding' templates are based on the 110 pattern, although more 0s are required for most of them, so some of the patterns become 010 or 100. These are the most effective for comparison of coding regions.
The non-coding templates attempt to minimize the correlation between successive words, when the database sequence is shifted by 4 positions against the query sequence. This means more 1s are concentrated at the ends of the template (at least 3 on each side).
When the option to require two words for extension is chosen, two word hits matching the template must be found within a distance of 50 nucleotides of one another.
Below are the exact discontiguous word template patterns for different combinations of word sizes and lengths:
W = 11, t = 16, coding:     1101101101101101
W = 11, t = 16, non-coding: 1110010110110111
W = 12, t = 16, coding:     1111101101101101
W = 12, t = 16, non-coding: 1110110110110111
W = 11, t = 18, coding:     101101100101101101
W = 11, t = 18, non-coding: 111010010110010111
W = 12, t = 18, coding:     101101101101101101
W = 12, t = 18, non-coding: 111010110010110111
W = 11, t = 21, coding:     100101100101100101101
W = 11, t = 21, non-coding: 111010010100010010111
W = 12, t = 21, coding:     100101101101100101101
W = 12, t = 21, non-coding: 111010010110010010111
Ma, B., Tromp, J., Li, M., "PatternHunter: faster and more sensitive homology search", Bioinformatics 2002 Mar;18(3):440-5