Rules for pattern syntax for PHI-BLAST

Web PHI-BLAST search requires a pattern along with a protein sequence containing the pattern.

The syntax for pattern specification in PHI-BLAST follows the conventions of PROSITE. When using the stand-alone program, a file may contain multiple patterns, with a blank line separating consecutive patterns. When using the Web page, only one pattern is allowed per query.

Accepted PHI-BLAST Pattern Vocabulary

Symbols                     Description
ABCDEFGHIKLMNPQRSTVWXYZU    Protein alphabet
ACGT                        DNA alphabet
[ ]                         any one of the characters enclosed in the brackets; e.g., [LFYT] means one occurrence of L or F or Y or T
-                           nothing; used as a spacer to clearly separate each position
x                           with nothing following, means any residue
(5)                         the preceding residue is repeated 5 times
(m,n)                       the preceding residue is repeated between m and n times (n > m)
>                           may occur only at the end of a pattern and means nothing; it may occur before a period
.                           may be used at the end; means nothing
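As a rough illustration of how these symbols combine, the following sketch translates a pattern into a Python regular expression. The function name `prosite_to_regex` is hypothetical and not part of PHI-BLAST. The sketch assumes that lowercase `x` is the wildcard, that uppercase letters are literal residues, and that repeat counts always follow the element they apply to.

```python
import re

# One pattern element: a bracketed set or a single symbol,
# optionally followed by a repeat count such as (5) or (5,11).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regex (illustrative only)."""
    # Trailing '>' and '.' mean nothing, and '-' is only a spacer.
    core = pattern.strip().rstrip(">.").replace("-", "")
    parts = []
    pos = 0
    while pos < len(core):
        match = _ELEMENT.match(core, pos)
        if match is None:
            raise ValueError(f"unsupported pattern text at offset {pos}: {core[pos:]!r}")
        symbol, low, high = match.groups()
        # Assumption: lowercase 'x' stands for any residue.
        regex = "." if symbol == "x" else symbol
        if low is not None:
            regex += f"{{{low},{high}}}" if high is not None else f"{{{low}}}"
        parts.append(regex)
        pos = match.end()
    return "".join(parts)


print(prosite_to_regex("[LFYT]-x(5)-G-x(2,4)>."))
```

A regular expression built this way can be used with `re.search` to check that a query sequence actually contains the pattern before submitting it.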

When using the stand-alone program, the pattern should be stored in a pattern input file, with the first line starting with ID followed by 2 spaces and a text string giving the pattern a name. There should also be a line starting with PA followed by 2 spaces and then the pattern description.

All other PROSITE codes in the first two columns are allowed, but only the HI code, described below, is relevant to PHI-BLAST.

Here is an example from PROSITE:

ID CNMP_BINDING_2; PATTERN.
AC PS00889;
DT OCT-1993 (CREATED); OCT-1993 (DATA UPDATE); NOV-1995 (INFO UPDATE).
DE Cyclic nucleotide-binding domain signature 2.
PA [LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV].
NR /RELEASE=32,49340;
NR /TOTAL=57(36); /POSITIVE=57(36); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR /FALSE_NEG=1; /PARTIAL=1;
CC /TAXO-RANGE=??EP?; /MAX-REPEAT=2;

The line starting with ID gives the pattern a name.

The lines starting with AC, DT, DE, NR, and CC are relevant to PROSITE users but irrelevant to PHI-BLAST. These lines are tolerated but ignored by PHI-BLAST.

The line starting with PA describes the pattern, which is explained position by position below.

Explanation of PROSITE example

Pattern Position    Pattern Syntax    Meaning
1                   [LIVMF]           one of LIVMF
2                   G                 G
3                   E                 E
4                   x                 any one residue
5                   [GAS]             one of GAS
6                   [LIVM]            one of LIVM
7                   x(5,11)           5 to 11 residues of any kind
8                   R                 R
9                   [STAQ]            one of STAQ
10                  A                 A
11                  x                 any one residue
12                  [LIVMA]           one of LIVMA
13                  x                 any one residue
14                  [STACV]           one of STACV

Note: the total length of this motif/pattern is between 18 and 24 residues.
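The length range given in the note can be checked by summing the minimum and maximum width of each pattern element. The sketch below does this; `pattern_length_range` is a hypothetical helper, and it assumes that an element without a repeat count occupies exactly one position.

```python
import re

# A bracketed set or single symbol, optionally followed by (n) or (m,n).
_ELEMENT = re.compile(r"(?:\[[A-Z]+\]|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def pattern_length_range(pattern: str) -> tuple[int, int]:
    """Return the (shortest, longest) number of residues a pattern can match."""
    # '-' is a spacer; trailing '>' and '.' carry no meaning.
    core = pattern.strip().rstrip(">.").replace("-", "")
    shortest = longest = 0
    pos = 0
    while pos < len(core):
        match = _ELEMENT.match(core, pos)
        if match is None:
            raise ValueError(f"unsupported pattern text at offset {pos}: {core[pos:]!r}")
        low, high = match.groups()
        min_width = int(low) if low is not None else 1
        max_width = int(high) if high is not None else min_width
        shortest += min_width
        longest += max_width
        pos = match.end()
    return shortest, longest


pa_line = "[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV]."
print(pattern_length_range(pa_line))
```

For the example pattern, the 13 single-residue positions contribute 13 residues and x(5,11) contributes 5 to 11 more, which gives the 18 to 24 range.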

In this case the pattern ends with a period. A pattern can also end with nothing after the last specifying symbol, or with any number of > signs or periods, or any combination thereof. Given below is another example, illustrating the use of HI lines.

ID ER_TARGET; PATTERN.
PA [KRHQSA]-[DENQ]-E-L>.
HI (19 22)
HI (201 204)

In this example, the HI lines specify that the pattern occurs twice, once from positions 19 through 22 in the sequence and once from positions 201 through 204 in the sequence. These specifications are relevant when stand-alone PHI-BLAST is used with the seedp option, in which the interesting occurrences of the pattern in the sequence are specified. In this case the HI lines specify which occurrence(s) of the pattern should be used to find good alignments.
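To make the file layout concrete, here is a minimal sketch of a reader for pattern input files. It is not PHI-BLAST's own parser: `parse_pattern_file` and the dictionary keys are hypothetical, the two-letter code is assumed to occupy the first two columns, records are assumed to be separated by blank lines, and each PA entry is assumed to fit on one line.

```python
import re

_HI_VALUE = re.compile(r"\((\d+)\s+(\d+)\)")


def parse_pattern_file(text: str) -> list[dict]:
    """Collect the ID, PA and HI lines of each pattern record (illustrative only)."""
    records = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None  # a blank line separates patterns
            continue
        code, value = line[:2], line[2:].strip()
        if current is None:
            current = {"id": None, "pattern": None, "hits": []}
            records.append(current)
        if code == "ID":
            current["id"] = value
        elif code == "PA":
            current["pattern"] = value
        elif code == "HI":
            match = _HI_VALUE.fullmatch(value)
            if match is None:
                raise ValueError(f"malformed HI line: {line!r}")
            current["hits"].append((int(match.group(1)), int(match.group(2))))
        # Any other code (AC, DT, DE, NR, CC, ...) is tolerated and ignored.
    return records


example = """ID  ER_TARGET; PATTERN.
PA  [KRHQSA]-[DENQ]-E-L>.
HI (19 22)
HI (201 204)
"""
for record in parse_pattern_file(example):
    print(record["id"], record["pattern"], record["hits"])
```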

In general, the seedp option is more useful than the standard patternp option ONLY when the pattern occurs K > 1 times in the sequence AND the user is interested in matching to J < K of those occurrences. Then using the HI lines enables the user to specify which occurrences are of interest.
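One way to decide whether HI lines are worth writing is to count the occurrences of the pattern in the query sequence first. The sketch below is an assumption-laden illustration rather than part of PHI-BLAST: `find_occurrences` and `format_hi_lines` are hypothetical helpers, the pattern is supplied as an already-translated regular expression, positions are assumed to be 1-based and inclusive (consistent with a 4-residue pattern spanning positions 19 through 22), and the HI line spacing simply mirrors the example shown earlier.

```python
import re


def find_occurrences(regex: str, sequence: str) -> list[tuple[int, int]]:
    """Return 1-based inclusive (start, end) positions of every match, overlaps included."""
    # A lookahead lets finditer report matches that overlap each other.
    lookahead = re.compile(f"(?=({regex}))")
    return [
        (m.start() + 1, m.start() + len(m.group(1)))
        for m in lookahead.finditer(sequence)
    ]


def format_hi_lines(occurrences: list[tuple[int, int]]) -> list[str]:
    """Format chosen occurrences as HI lines, mirroring the example layout."""
    return [f"HI ({start} {end})" for start, end in occurrences]


# Made-up sequence containing the ER_TARGET pattern twice (K = 2).
sequence = "MKDELAAAHDELG"
hits = find_occurrences("[KRHQSA][DENQ]EL", sequence)
print(hits)

# Keeping only the first occurrence (J = 1 < K) is the case where HI lines help.
chosen = hits[:1]
print(format_hi_lines(chosen))
```

If the pattern occurs only once, or if every occurrence is of interest, no selection is needed and the HI lines add nothing.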