2023 BLAST NEWS

Tue, 28 Nov 2023

BLAST+ 2.15.0 is here!

We have included two exciting new features in the latest BLAST+ release.

One will run searches faster for you. The other allows you to limit your search easily by organism.

Let’s talk about how this version of BLAST runs faster in some cases. If you run BLAST with multiple threads (more than one CPU), there are two ways that BLAST can divide the work among the threads. Which method works better depends (among other factors) upon the size of the database, the BLAST program, and the number of queries. For a search of a small database (e.g., Swiss-Prot, 364 MB) with many queries, picking the appropriate threading model can speed up the search by a factor of 2 to 10 without affecting your results, and that is what this change does. Read more about this feature and the two BLAST threading models here.

The second feature significantly simplifies limiting searches to a non-leaf taxonomic node (e.g., bacteria). To limit a search by taxonomy, use a taxID (the number that specifies a taxonomic node) for your search. Read more about this feature here.

The exciting part is that these two features can be used together to deliver more targeted and faster results: when you limit your search taxonomically, you are effectively searching a smaller database. BLAST figures this out, and if your new database size is small enough (often the case with taxonomic limits), switching threading methods means BLAST can work much faster. It also limits the results to what you asked for.
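As a minimal sketch of a taxonomy-limited search, the snippet below assembles a blastp command line that restricts hits to Bacteria (taxID 2) with the -taxids option. The database name (nr), the query file (query.faa), and the thread count are placeholders, and the helper build_blastp_cmd is hypothetical; it only prints the command so you can inspect it before running it.

```shell
#!/bin/sh
# Hypothetical helper: print a blastp command limited to one taxID.
# Arguments: database, query file, taxID, number of threads.
build_blastp_cmd() {
    printf 'blastp -db %s -query %s -taxids %s -num_threads %s -outfmt 6\n' \
        "$1" "$2" "$3" "$4"
}

# Placeholder values: nr database, query.faa, taxID 2 (Bacteria), 8 threads.
build_blastp_cmd nr query.faa 2 8
```

Copy the printed line into your shell to run the search. A multi-threaded search with a taxonomic limit is the situation described above, where BLAST+ 2.15.0 may switch threading methods on its own.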

Download BLAST+ 2.15.0

Check out all the BLAST release notes.

Questions or comments? Please write the BLAST help desk.

Thu, 24 Aug 2023

ClusteredNR database on BLAST+

The ClusteredNR database is now available for BLAST+.

Accessing cluster information from the experimental ClusteredNR database for BLAST+

This document shows how to retrieve cluster information from a local copy of the ClusteredNR database on your computer.

The partial results below are from a protein BLAST search against the ClusteredNR database with BLAST+. The accessions in the second column are matches to your query. These accessions correspond to the representative sequences for each cluster and they serve as the identifiers for each cluster. You can use these representative accessions to retrieve cluster information using the scripts described below, which are provided with the ClusteredNR database.

Protein BLAST results for the ClusteredNR database on BLAST+

$ blastp -db nr_cluster_seq -query query.faa -num_threads 8 -outfmt 6 | head -3

XP_013375972.1      XP_013375972.1  100.000 651     0       0       1       651     1       651  0.0      1354
XP_013375972.1      XP_010640962.1  89.555  651     66      2       1       651     1       649  0.0      1188
XP_013375972.1      KFO25476.1      88.786  651     71      2       1       651     113     761  0.0      1180
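To pass these hits to the scripts, you first need the representative accessions from the second column. A minimal sketch, assuming the default -outfmt 6 column order (query ID first, subject ID second); the function name unique_representatives is illustrative only.

```shell
#!/bin/sh
# Print each subject accession (column 2 of -outfmt 6) once, in first-seen order.
unique_representatives() {
    awk '!seen[$2]++ { print $2 }'
}

# Sample input: the three hits shown above (awk splits on tabs or spaces).
unique_representatives <<'EOF'
XP_013375972.1 XP_013375972.1 100.000 651 0 0 1 651 1 651 0.0 1354
XP_013375972.1 XP_010640962.1 89.555 651 66 2 1 651 1 649 0.0 1188
XP_013375972.1 KFO25476.1 88.786 651 71 2 1 651 113 761 0.0 1180
EOF
```

In practice you would pipe the blastp output straight into the function instead of using a here-document.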

Included scripts

  • get-cluster-members.sh - lists all member accessions for a cluster containing the provided representative accession.

  • count-cluster-members.sh - returns the size of the cluster for a provided representative sequence.

  • get-cluster-representatives-for-taxid.sh - lists the representative accessions from a given taxonomy identifier (taxID).

  • get-cluster-repr-for-accession.sh - returns the representative accession for the cluster containing the given accession.

These scripts allow you to retrieve all member accessions of a cluster based on the representative accession (i.e., the cluster identifier). They also allow you to retrieve the representative accession from any member accession. You can also retrieve extra fields for each member accession in a cluster, such as the title or the NCBI taxonomy ID (taxID), an integer identifying a specific taxonomic node.

The scripts use an SQLite3 database that must be present in the same directory. You should use the accession.version (e.g., XP_013375972.1) and not just the accession for these scripts to work correctly.
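Because a bare accession will not work, it can help to check the input before calling a script. A minimal sketch of such a check; has_version is a hypothetical helper, not part of the ClusteredNR package, and it only tests that the string ends in a dot followed by digits.

```shell
#!/bin/sh
# Return 0 if $1 looks like accession.version (e.g., XP_013375972.1), 1 otherwise.
has_version() {
    case "$1" in
        *.*) ;;                      # must contain a dot
        *) return 1 ;;
    esac
    suffix=${1##*.}                  # text after the last dot
    case "$suffix" in
        ''|*[!0-9]*) return 1 ;;     # empty, or contains a non-digit
    esac
    return 0
}

for acc in XP_013375972.1 XP_013375972; do
    if has_version "$acc"; then
        echo "$acc: ok"
    else
        echo "$acc: missing version suffix"
    fi
done
```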

Read the NCBI blog post on the new ClusteredNR database to learn about the value and basics of clustering.

Prerequisites

  • Linux or macOS operating system.

  • SQLite3 version 3.35.4 or more recent.

  • BLAST+ 2.13 or more recent. Read about installing it here.

  • Download the ClusteredNR database (nr_cluster_seq) here.
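The SQLite3 requirement can be checked from the command line. The sketch below is one way to do it, assuming that sqlite3 -version prints the version number as its first field; version_ge is a hypothetical helper that compares dotted numeric versions.

```shell
#!/bin/sh
# Return 0 if dotted version $1 is greater than or equal to $2.
# Only numeric fields are compared; missing fields count as 0.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

if command -v sqlite3 >/dev/null 2>&1; then
    installed=$(sqlite3 -version | awk '{ print $1 }')
    if version_ge "$installed" 3.35.4; then
        echo "SQLite $installed is recent enough"
    else
        echo "SQLite $installed is older than 3.35.4"
    fi
else
    echo "sqlite3 not found on PATH"
fi
```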

Usage examples

Note: Invoke any of the scripts with the -h option to see their usage instructions.

Retrieving cluster members, taxonomy information, and sequence titles

$ ./get-cluster-members.sh -a XP_013375972.1 -T -t

member_accession      member_taxid     member_title
--------------------  ---------------  --------------------------------------------------------------------------------
XP_013375972.1        34839            PREDICTED: rab proteins geranylgeranyltransferase component A 2 [Chinchilla lani
XP_005005664.1        10141            rab proteins geranylgeranyltransferase component A 2 [Cavia porcellus]
XP_012369165.1        10160            rab proteins geranylgeranyltransferase component A 2 [Octodon degus]
XP_021121136.1        10181            rab proteins geranylgeranyltransferase component A 2 [Heterocephalus glaber]

Getting the number of members for a cluster

$ ./count-cluster-members.sh -a XP_013375972.1

4
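To size up many clusters at once, the same script can be called in a loop. A minimal sketch, assuming a list of representative accessions (accession.version) arrives one per line on standard input; the wrapper name cluster_sizes and the file name representatives.txt are hypothetical.

```shell
#!/bin/sh
# Print "accession<TAB>cluster size" for each accession read from stdin.
# $1: command that counts members (defaults to the script shipped with the database).
cluster_sizes() {
    counter=${1:-./count-cluster-members.sh}
    while read -r acc; do
        [ -n "$acc" ] || continue            # skip blank lines
        # </dev/null keeps the script from consuming the accession list.
        printf '%s\t%s\n' "$acc" "$("$counter" -a "$acc" </dev/null)"
    done
}

# Usage (requires the ClusteredNR scripts and SQLite3 database in this directory):
#   cluster_sizes < representatives.txt
```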

Retrieving all representative accessions for an NCBI taxonomy ID (taxID)

Note: this includes all taxIDs below the chosen node. For example, a taxID for a genus will include all representatives for its species and subspecies.

$ ./get-cluster-representatives-for-taxid.sh -t 10141

representative
--------------------
XP_012997687.1
XP_013013097.1
XP_012999071.1
XP_005005726.1
XP_013008508.2
XP_013005496.1
XP_003461046.1
XP_012997197.2
XP_003465871.2
...

Listing the cluster representative accession and the protein title for any protein accession

$ ./get-cluster-repr-for-accession.sh -a XP_021121136.1 -T

representative        title
--------------------  ----------------------------------------------------------------------------
XP_013375972.1        rab proteins geranylgeranyltransferase component A 2 [Heterocephalus glaber]

Questions or comments? Please tell us what you think. Write the BLAST help desk.

Tue, 22 Aug 2023

Try BLAST+ 2.14.1 today!

We added the cleanup-blastdb-volumes.py script to remove unused BLAST database volumes. Read the documentation here.

We also switched the protocol from ftp to https to access BLAST databases for increased performance and reliability when downloading data from the NCBI with the update_blastdb.pl script.

We also fixed a few bugs related to downloading data from the NCBI and to mt_mode crashing blastn and blastx.

Check out the release notes.

Download BLAST+ 2.14.1

Questions or comments? Please write the BLAST help desk.

Thu, 22 Jun 2023

BLAST Quick Start guides!

Need some help getting started with BLAST?

Use the BLAST quick start guides to learn how to perform a BLAST search and understand your results.

These quick start guides for the search and result pages show you the minimal steps needed to perform a BLAST search and how to navigate your search results.

Take a little time to check out the BLAST search page guide and the result page guide.

Questions or comments? Please write the BLAST help desk.

Fri, 28 Apr 2023

BLAST+ 2.14.0 is here!

BLASTP, BLASTX, and TBLASTN are faster than before.

We have made BLAST searches for proteins and translated DNA (BLASTP, BLASTX, and TBLASTN) faster by improving support for initial long words. This improvement helps us speed up the fast modes (e.g., using -task blastp-fast on the command line).

In one example, a search of phage reads (ERR7959948) against swissprot using “-task blastx-fast” was four times faster than the default search (4 hours vs. 16 hours). This query (ERR7959948) has 2.2 million reads and 324.4 million bases.

We have also fixed a number of bugs and added some other improvements.

Check out the release notes.

Download BLAST+ 2.14.0.

Mon, 24 Apr 2023

Faster BLASTP and BLASTX searches on the web!

BLASTP and BLASTX performance has been improved on the web.

Improvements

We have improved the BLASTP (protein-protein) and BLASTX (DNA-protein) searches with better support for longer word sizes. With this change, searches against nr run about 20% faster, with larger speed improvements for smaller databases like ClusteredNR, UniProtKB/Swiss-Prot, or PDB. This new version of BLAST produces equivalent results, with the overwhelming majority of searches returning the same results as the previous version of BLASTP and BLASTX.

Questions or comments? Please write the BLAST help desk.

Check out BLASTP and BLASTX on the BLAST web service.

Tue, 21 Mar 2023

IgBLAST 1.21.0 is now available!

The improvements in this latest version are available to both command-line and web IgBLAST users.

Improvements

  • Added gaps to all _alignments_aa fields (such as sequence_alignment_aa) to reflect gaps in nucleotide sequence alignment.

  • Added the new AIRR format field: sequence_aa. This is the direct translation (no gaps) of a nucleotide sequence using the reading frame determined by the nucleotide alignment to its closest germline V gene.

  • Added the new AIRR format field: d_frame. This is the D gene frame that is in-frame with the J gene coding frame. IgBLAST offers built-in IGHD gene frame support for mouse as defined by Ichihara Y et al (European Journal of Immunology Volume 19, Issue 10 p. 1849-1854). Users can use their own custom D gene definition with IgBLAST depending on their needs.

Download IgBLAST here https://ftp.ncbi.nlm.nih.gov/blast/executables/igblast/release/LATEST.

Check out the IgBLAST GitHub page at https://ncbi.github.io/igblast/.

Mon, 09 Jan 2023

ElasticBLAST 1.0.0 is now available!

ElasticBLAST version 1.0.0 supports faster, cheaper disks at AWS and better supports Kubernetes on GCP!

ElasticBLAST versions prior to version 1.0.0 will stop working on GCP after January 31, 2023.

This is because older versions of ElasticBLAST rely on version 1.21 of Kubernetes, which will reach its end of life on the Google Kubernetes Engine on that date. Please upgrade your installation of ElasticBLAST to the latest version.

Improvements

Bug fixes

  • ElasticBLAST uses GCP’s recommended way of dealing with read/write persistent disk.

  • Long user names no longer cause errors in AWS.

  • Fixed error caused by APIs not being enabled in GCP.

Please check out this bioRxiv paper: ElasticBLAST: Accelerating Sequence Search via Cloud Computing.