Configuration variables

Cloud provider configuration

GCP project

Optional: GCP project ID to use.

Also supported via the environment variable: ELB_GCP_PROJECT. To see the default gcloud project you can run the command: gcloud config get project. To set the default project run the command: gcloud config set project <INSERT_YOUR_GCP_PROJECT_ID_HERE>.

[cloud-provider]
gcp-project = my-gcp-project

GCP region

Name of the GCP region to use.

Also supported via the environment variable: ELB_GCP_REGION.

[cloud-provider]
gcp-region = us-east4

GCP zone

Name of the GCP zone to use.

Also supported via the environment variable: ELB_GCP_ZONE.

[cloud-provider]
gcp-zone = us-east4-b
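
The three GCP settings above can come from the configuration file or from an environment variable. The following is a minimal Python sketch of such a lookup, not ElasticBLAST's own code: the helper name read_gcp_settings is hypothetical, and the lookup order (configuration file first, then environment variable) is an assumption made for illustration, since this page does not state which source takes precedence.

```python
import configparser
import os

# Environment variable names as documented above.
ENV_VARS = {
    "gcp-project": "ELB_GCP_PROJECT",
    "gcp-region": "ELB_GCP_REGION",
    "gcp-zone": "ELB_GCP_ZONE",
}


def read_gcp_settings(config_text, environ=None):
    """Return GCP project, region and zone from config text or the environment.

    Assumption: the config file is consulted first and the environment
    variable is used only as a fallback. The real precedence may differ.
    """
    environ = os.environ if environ is None else environ
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    settings = {}
    for key, env_name in ENV_VARS.items():
        value = parser.get("cloud-provider", key, fallback=None)
        if value is None:
            value = environ.get(env_name)
        settings[key] = value
    return settings


example = """
[cloud-provider]
gcp-region = us-east4
gcp-zone = us-east4-b
"""
settings = read_gcp_settings(example, {"ELB_GCP_PROJECT": "my-gcp-project"})
print(settings)
```

Here the project comes from the (simulated) environment, while the region and zone come from the configuration text.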

GCP network

Optional: GCP network name to use. If provided, the GCP subnetwork must also be provided.

  • Default: default

  • Values: String

  • Applies to: GCP

To see the available networks, you can run the command gcloud compute networks list.

[cloud-provider]
gcp-network = default
gcp-subnetwork = subnet-name

GCP sub-network

Optional: GCP sub-network name to use. If provided, the GCP network must also be provided.

  • Default: N/A

  • Values: String

  • Applies to: GCP

To see the available sub-networks for a given network, you can run the command gcloud compute networks subnets list --filter="<INSERT_NETWORK_NAME_HERE>".

[cloud-provider]
gcp-network = default
gcp-subnetwork = subnet-name

Google Kubernetes Engine (GKE) version

GKE version to use.

  • Default: 1.21

  • Values: String

  • Applies to: GCP

To see GKE versions available in GCP in a given zone, you can run the command gcloud container get-server-config --zone <INSERT_GCP_ZONE_HERE>.

[cloud-provider]
gke-version = 1.21

AWS Region

Name of the AWS region to use. Recommended value: us-east-1.

For background information about AWS regions, please see the AWS documentation.

[cloud-provider]
aws-region = us-east-1

AWS VPC

Optional: AWS VPC ID to use (must exist in the chosen AWS region) or keyword none.

  • Default:

    • For AWS Accounts that support EC2-VPC, the default VPC will be used.

    • For AWS accounts without default VPCs or if none is specified, a new VPC will be created with as many subnets as there are availability zones in the region.

  • Values: String

  • Applies to: AWS

AWS Subnet

Optional: A comma-separated list of AWS Subnet IDs to use; must exist in the chosen AWS region and AWS VPC.

  • Default:

    • For AWS Accounts that support EC2-VPC, the default subnets will be used.

    • For AWS accounts without default VPCs or if left unspecified, as many subnets as there are availability zones in the region will be created.

  • Values: String

  • Applies to: AWS

[cloud-provider]
aws-subnet = subnet-SOME-RANDOM-STRING

AWS Security Group

Optional: Name of the AWS security group to use; must exist in the chosen AWS region.

  • Default: None

  • Values: String

  • Applies to: AWS

[cloud-provider]
aws-security-group = sg-SOME-RANDOM-STRING

AWS Key Pair

Optional: Name of the AWS key pair to use to login to EC2 instances; must exist in the chosen AWS region.

  • Default: None

  • Values: String

  • Applies to: AWS

[cloud-provider]
aws-key-pair = my-aws-key-name

Cluster configuration

Cluster name

Name of the GKE cluster created or the AWS CloudFormation stack (and related resources).

The name may contain only lowercase alphanumerics and ‘-’, must start with a letter and end with an alphanumeric, and must be no longer than 40 characters.

Note: This name must be unique for each of your ElasticBLAST searches, otherwise this may lead to undefined behavior.

  • Default: elasticblast-${USER}-X, where X is the first 8 characters of hashing the value of the results URI.

  • Values: String

Also supported via the environment variable: ELB_CLUSTER_NAME.

[cluster]
name = my-cluster
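
The naming rules above can be checked before submitting a search. The following is a minimal Python sketch of those rules as stated on this page; the function name is_valid_cluster_name is hypothetical and the check ElasticBLAST performs internally may differ.

```python
import re

MAX_NAME_LENGTH = 40
# Starts with a lowercase letter, ends with a lowercase letter or digit,
# and contains only lowercase letters, digits and '-' in between.
NAME_PATTERN = re.compile(r"[a-z](?:[a-z0-9-]*[a-z0-9])?")


def is_valid_cluster_name(name):
    """Return True if name follows the cluster naming rules described above."""
    return len(name) <= MAX_NAME_LENGTH and NAME_PATTERN.fullmatch(name) is not None


for candidate in ("my-cluster", "My-cluster", "my-cluster-", "1cluster"):
    print(candidate, is_valid_cluster_name(candidate))
```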

Number of worker nodes

Specifies the maximum number of worker nodes of the configured machine type to use.

  • Default: 1

  • Values: Positive integer

[cluster]
num-nodes = 4

Use preemptible nodes

Use spot instances (AWS) or preemptible nodes (GCP) to run ElasticBLAST. This may lead to reduced costs, but longer runtimes.

Note: This is an experimental feature in AWS. Turning this on bids on instance prices up to the full price, which is almost guaranteed to save you money.

Note: Preemptible nodes are rebooted after 24 hours (by GCP). This is fine in most cases, as Kubernetes will restart the node and resubmit the search (i.e., batch) that was interrupted. Batches that have already been processed are not lost. A problem arises only if a single batch takes longer than 24 hours. We expect the overwhelming majority of ElasticBLAST searches to take at most several hours, so this should rarely be an issue.

  • Default: no

  • Values: Any string. Set to yes to enable.

Also supported via the environment variable: ELB_USE_PREEMPTIBLE.

[cluster]
use-preemptible = yes

Machine type

Type of GCP or AWS machine to start as worker node(s).

WARNING: ElasticBLAST will select a machine type for you with sufficient RAM to hold your database in memory if you search an NCBI-provided database or provide metadata for your custom database (see Create BLAST database metadata). This is the recommended way to use ElasticBLAST. Specifying the machine type overrides this feature, so you must make sure that your machine type has sufficient memory to hold your database.

NOTE: The machine’s available RAM should be large enough to contain the sequences in the database (one byte per residue or one byte per four bases) plus ~20%.

  • Default: n1-highmem-32 for GCP, m5.8xlarge for AWS.

  • The default machines have 32 cores and about 120GB of RAM.

  • These default values only apply if you use a custom database and do not provide metadata.

  • Values: String, see GCP machine types or AWS instance types accordingly.

[cluster]
machine-type = n1-standard-32
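
The RAM guideline in the note above (one byte per residue, or one byte per four bases, plus about 20%) can be turned into a quick estimate. The following Python sketch applies that rule of thumb; the function name estimate_min_ram_bytes is hypothetical, and the result is only a rough lower bound, not the calculation ElasticBLAST performs when it selects a machine type.

```python
def estimate_min_ram_bytes(num_letters, is_protein):
    """Estimate the RAM needed to hold a BLAST database's sequences.

    num_letters: total residues (protein) or bases (nucleotide) in the database.
    """
    # One byte per residue for protein, one byte per four bases for nucleotide
    # (rounded up to a whole byte).
    seq_bytes = num_letters if is_protein else -(-num_letters // 4)
    # Add 20% headroom, rounding up with integer arithmetic.
    return (seq_bytes * 120 + 99) // 100


# Hypothetical database sizes, for illustration only.
protein_ram = estimate_min_ram_bytes(100_000_000_000, is_protein=True)
nucleotide_ram = estimate_min_ram_bytes(100_000_000_000, is_protein=False)
print(protein_ram / 1e9, "GB for protein;", nucleotide_ram / 1e9, "GB for nucleotide")
```

Compare the estimate with the memory of the machine type you plan to set before overriding the automatic selection.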

Number of CPUs

Number of CPUs to use per BLAST execution in a Kubernetes or AWS Batch job.

Must be less than the number of CPUs for the chosen machine type.

  • Default: 16 or as many CPUs as are available on the selected machine type, whichever is smaller.

  • Values: Positive integer

[cluster]
num-cpus = 16

Persistent disk size

Size of the persistent disk attached to the cluster (GCP) or individual instances (AWS). This should be large enough to store the BLAST database, query sequence data and the BLAST results.

Format as <number> immediately followed by G for gigabytes or M for megabytes.

Note: ElasticBLAST uses pd-standard block storage by default. Per the GCP documentation on block storage, disks smaller than 1000G result in performance degradation for ElasticBLAST in GCP.

  • Default: 3000G for GCP, 1000G for AWS.

  • Values: String

[cluster]
pd-size = 3000G
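
The size format described above (a number immediately followed by G or M) can be checked with a short parser. This Python sketch is illustrative only: the function name parse_pd_size is hypothetical, and rejecting lowercase units or embedded spaces is an assumption based on the wording above rather than documented behavior.

```python
import re

# A run of digits immediately followed by G (gigabytes) or M (megabytes).
PD_SIZE_PATTERN = re.compile(r"(\d+)([GM])")


def parse_pd_size(value):
    """Split a pd-size string such as '3000G' into (number, unit)."""
    match = PD_SIZE_PATTERN.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"expected <number>G or <number>M, got {value!r}")
    return int(match.group(1)), match.group(2)


print(parse_pd_size("3000G"))
```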

Local SSD support

Note: This is an experimental feature in GCP. This limits local storage to 375GB.

Configure ElasticBLAST to use a single local SSD disk instead of a persistent disk to store BLAST database and query sequence batches.

Consider using this configuration setting if your disk quota is too small (e.g.: 500GB) and it impacts performance (see GCP documentation on block storage performance), but only if the BLAST database you are searching, your query sequence, and its results can fit into 375GB.

  • Default: None

  • Values: true or false

  • Applies to: GCP

[cluster]
exp-use-local-ssd = true

Cloud resource labels

Specifies the labels to attach to cloud resources created by ElasticBLAST.

  • Default: cluster-name={cluster_name},client-hostname={hostname},created={create_date},owner={username},project=elastic-blast,billingcode=elastic-blast,creator={username},program={blast_program},db={db},name={cluster_name},results={ELB_RESULTS}

  • Values: String of key-value pairs separated by commas. Keys must be all lowercase. Keys that overlap with the default labels are overridden with the values provided; all other key-value pairs are appended to the default set of labels.

[cluster]
labels = key1=value1,key2=value2
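
The override-and-append behavior described above can be pictured as a dictionary merge. The following Python sketch illustrates that description; the function names parse_labels and merge_labels and the sample default labels are hypothetical, and this is not ElasticBLAST's own implementation.

```python
def parse_labels(text):
    """Parse 'key1=value1,key2=value2' into a dict, enforcing lowercase keys."""
    labels = {}
    for pair in text.split(","):
        key, sep, value = pair.partition("=")
        key = key.strip()
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        if key != key.lower():
            raise ValueError(f"label key must be lowercase: {key!r}")
        labels[key] = value.strip()
    return labels


def merge_labels(defaults, user_text):
    """Override overlapping default labels and append the remaining ones."""
    merged = dict(defaults)
    merged.update(parse_labels(user_text))
    return merged


# Hypothetical subset of the default labels, for illustration only.
defaults = {"project": "elastic-blast", "owner": "alice"}
merged = merge_labels(defaults, "owner=bob,team=genomics")
print(merged)
```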

BLAST configuration options

BLAST program

BLAST program to run.

  • Default: None

  • Values: One of: blastp, blastn, blastx, tblastn, tblastx, psiblast, rpsblast, rpstblastn

[blast]
program = blastp

BLAST options

BLAST options to customize BLAST invocation.

Note: the default output format in ElasticBLAST is 11 (BLAST archive).

If you do not specify an output format (with -outfmt), you can use blast_formatter to format the results in any desired output format.

Below, we have specified “-outfmt 7” for the BLAST tabular format and requested blastp-fast mode.

[blast]
options = -task blastp-fast -outfmt 7
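
If you keep the default BLAST archive output, the results can be reformatted afterwards with blast_formatter. The following Python sketch only builds such a command line; the file names results.asn and results.tsv are hypothetical, and it assumes a result file that has already been downloaded to the local machine and uncompressed.

```python
import shlex


def blast_formatter_command(archive_path, outfmt, output_path):
    """Build a blast_formatter command that converts a BLAST archive file."""
    return [
        "blast_formatter",
        "-archive", archive_path,   # BLAST archive (output format 11) to read
        "-outfmt", str(outfmt),     # desired output format, e.g. 7 for tabular
        "-out", output_path,        # file to write the formatted results to
    ]


cmd = blast_formatter_command("results.asn", 7, "results.tsv")
print(shlex.join(cmd))
```

The printed command can be pasted into a shell, or the list can be passed to subprocess.run on a machine where BLAST+ is installed.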

BLAST database

BLAST database name to search.

To search a database provided in the cloud by the NCBI, simply use its name.

To search your own custom database, upload the database files to a cloud storage bucket and provide the bucket’s universal resource identifier (URI) plus the database name (see example and tip below). We also recommend that you include a metadata file for your database, which allows ElasticBLAST to better configure the memory requirements for your search. See Create BLAST database metadata for instructions on producing the metadata file.

  • Default: None

  • Values: String.

Sample BLAST database configuration:

[blast]
db = nr

Sample custom BLAST database configuration:

[blast]
db = gs://my-database-bucket/mydatabase

Tip: to upload your BLAST database to a cloud bucket, please refer to the cloud vendor documentation (AWS or GCP).

If you have BLAST+ available in your machine, you can run the command below to get a list of BLAST databases provided by NCBI:

When working on AWS:

update_blastdb.pl --source aws --showall pretty

When working on GCP:

update_blastdb.pl --source gcp --showall pretty

Batch length

Number of bases/residues per query batch.

NOTE: The appropriate value depends on the BLAST program being run.

  • Default: Auto-configured for supported programs.

  • Values: Positive integer

Also supported via the environment variable: ELB_BATCH_LEN.

[blast]
batch-len = 10000

BLAST_USAGE_REPORT

Controls the usage reporting via the environment variable BLAST_USAGE_REPORT.

For additional details, please see the BLAST+ privacy statement.

  • Default: true

  • Values: true or false

Input/output configuration options

Query sequence data

Query sequence data for BLAST.

Can be provided as a local path or as a GCS or AWS bucket URI to a file/tarball. Multiple files can be provided as a space-separated list or in “list files”. Any file with the file extension .query-list is considered a “list file”, where each line contains a local path or a cloud bucket URI.

  • Default: None

  • Values: String

[blast]
queries = /home/${USER}/blast-queries.tar.gz
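
The way a queries value expands into individual files can be sketched as follows. This Python example is an illustration of the description above, not ElasticBLAST's own code: the function name expand_queries is hypothetical, it assumes that list files are local, and skipping blank lines is an assumption.

```python
from pathlib import Path

LIST_FILE_SUFFIX = ".query-list"


def expand_queries(queries):
    """Expand a space-separated queries value into individual paths or URIs."""
    expanded = []
    for item in queries.split():
        if item.endswith(LIST_FILE_SUFFIX):
            # A list file: each non-empty line is a local path or bucket URI.
            for line in Path(item).read_text().splitlines():
                line = line.strip()
                if line:
                    expanded.append(line)
        else:
            expanded.append(item)
    return expanded


print(expand_queries("one.fa gs://my-bucket/two.fa"))
```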

Results

GCS or AWS S3 bucket URI in which to save the results from ElasticBLAST.

This value uniquely identifies a single ElasticBLAST search - please keep track of this.

Note: This bucket must exist prior to invoking ElasticBLAST and it must include the gs:// or s3:// prefix.

  • Default: None

  • Values: String

[blast]
results = ${YOUR_RESULTS_BUCKET}
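
The prefix requirement above can be checked before starting a search. The following Python sketch is illustrative; the function name results_cloud is hypothetical, and it does not verify that the bucket actually exists.

```python
def results_cloud(uri):
    """Return 'gcp' or 'aws' according to the prefix of a results URI."""
    prefixes = {"gs://": "gcp", "s3://": "aws"}
    for prefix, cloud in prefixes.items():
        # Require a non-empty bucket name after the prefix.
        if uri.startswith(prefix) and len(uri) > len(prefix):
            return cloud
    raise ValueError(f"results must start with gs:// or s3://, got {uri!r}")


# Hypothetical bucket name, for illustration only.
print(results_cloud("gs://my-results-bucket/run-1"))
```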

Log file

File name to save logging output. Can only be set via the command line argument --logfile.

  • Default: elastic-blast.log

  • Values: String

Log level

Sets the logging threshold. Can only be set via the command line argument --loglevel.

  • Default: DEBUG

  • Values: One of DEBUG, INFO, WARNING, ERROR, CRITICAL

Timeout configuration options

BLAST timeout

Timeout in minutes after which Kubernetes will terminate a single BLAST job (i.e., the job that corresponds to one of the query batches).

  • Default: 10080 (1 week)

  • Values: Positive integer

  • Applies to: GCP

[timeouts]
blast-k8s-job = 10080

BLASTDB initialization timeout

Timeout in minutes to wait for the persistent disk to be initialized with the selected BLAST database.

  • Default: 45

  • Values: Positive integer

  • Applies to: GCP

[timeouts]
init-pv = 45

Developer configuration options

ELB_DONT_DELETE_SETUP_JOBS

Set via an environment variable, applies to GCP only.

  • Default: Disabled

  • Values: Any string. Set to any value to enable.

  • Applies to: GCP

Do not delete the Kubernetes setup jobs after they complete.

ELB_PAUSE_AFTER_INIT_PV

Set via an environment variable, applies to GCP only.

  • Default: 120

  • Values: Positive integer.

  • Applies to: GCP

Time in seconds to wait after the persistent volume is initialized, to prevent mount errors in the BLAST Kubernetes jobs.