Configuration variables¶

Cloud provider configuration¶

`GCP project`¶

Optional: GCP project ID to use.

Default: Default gcloud project

Values: String, see Identifying projects

Applies to: GCP

Also supported via the environment variable: ELB_GCP_PROJECT. To see the default gcloud project you can run the command: gcloud config get project. To set the default project run the command: gcloud config set project <INSERT_YOUR_GCP_PROJECT_ID_HERE>.

[cloud-provider]
gcp-project = my-gcp-project

`GCP region`¶

Name of the GCP region to use.

Default: us-east4

Values: String, see GCP region/zone documentation

Applies to: GCP

Also supported via the environment variable: ELB_GCP_REGION.

[cloud-provider]
gcp-region = us-east4

`GCP zone`¶

Name of the GCP zone to use.

Default: us-east4-b

Values: String, see GCP region/zone documentation

Applies to: GCP

Also supported via the environment variable: ELB_GCP_ZONE.

[cloud-provider]
gcp-zone = us-east4-b

`GCP network`¶

Optional: GCP network name to use. If provided, the GCP subnetwork must also be provided.

Default: default

Values: String

Applies to: GCP

To see the available networks, you can run the command gcloud compute networks list.

[cloud-provider]
gcp-network = default
gcp-subnetwork = subnet-name

`GCP sub-network`¶

Optional: GCP sub-network name to use. If provided, the GCP network must also be provided.

Default: N/A

Values: String

Applies to: GCP

To see the available sub-networks for a given network, you can run the command gcloud compute networks subnets list --filter="<INSERT_NETWORK_NAME_HERE>".

[cloud-provider]
gcp-network = default
gcp-subnetwork = subnet-name

`Kubernetes version`¶

Kubernetes version to use; must be one of the supported versions in GKE. For additional details, please see the GKE release notes.

Default: The default kubernetes version from the regular GKE release channel.

Values: String. Examples: 1.25, 1.25.9, or 1.26.1-gke.1500. For additional details, please see the relevant GKE documentation

Applies to: GCP

To see kubernetes versions available in GCP in a given zone, you can run the command gcloud container get-server-config --zone <INSERT_GCP_ZONE_HERE>.

[cloud-provider]
gke-version = 1.25

`AWS Region`¶

Name of the AWS region to use. Recommended value: us-east-1.

Default: None for the configuration file interface, us-east-1 for the command line interface.

Values: String, any region that supports Batch, see AWS documentation for details

Applies to: AWS

For background information about AWS regions, please see the AWS documentation.

[cloud-provider]
aws-region = us-east-1

`AWS VPC`¶

Optional: AWS VPC ID to use (must exist in the chosen AWS region) or the keyword none.

Default:

For AWS Accounts that support EC2-VPC, the default VPC will be used.

For AWS accounts without default VPCs or if none is specified, a new VPC will be created with as many subnets as there are availability zones in the region.

Values: String

Applies to: AWS

`AWS Subnet`¶

Optional: A comma-separated list of AWS Subnet IDs to use; must exist in the chosen AWS region and AWS VPC.

Default:

For AWS Accounts that support EC2-VPC, the default subnets will be used.

For AWS accounts without default VPCs or if left unspecified, as many subnets as there are availability zones in the region will be created.

Values: String

Applies to: AWS

[cloud-provider]
aws-subnet = subnet-SOME-RANDOM-STRING

`AWS Security Group`¶

Optional: Name of the AWS security group to use; must exist in the chosen AWS region.

Default: None

Values: String

Applies to: AWS

[cloud-provider]
aws-security-group = sg-SOME-RANDOM-STRING

`AWS Key Pair`¶

Optional: Name of the AWS key pair to use to login to EC2 instances; must exist in the chosen AWS region.

Default: None

Values: String

Applies to: AWS

[cloud-provider]
aws-key-pair = my-aws-key-name

Cluster configuration¶

`Number of worker nodes`¶

Specifies the maximum number of worker nodes of the configured machine type to use.

Default: 1

Values: Positive integer

[cluster]
num-nodes = 4

`Use preemptible nodes`¶

Use spot instances and preemptible nodes to run ElasticBLAST. This can reduce costs substantially.

Note: Preemptible nodes are rebooted after 24 hours (by GCP). This is fine, as Kubernetes will restart the node and resubmit the search (i.e., batch) that was interrupted. Batches that have already been processed are not lost. ElasticBLAST batches take at most several hours.

Default: no

Values: Any string. Set to yes to enable.

Also supported via the environment variable: ELB_USE_PREEMPTIBLE.

[cluster]
use-preemptible = yes

`Machine type`¶

Type of GCP or AWS machine to start as worker node(s).

WARNING: ElasticBLAST will select a machine type with sufficient RAM to hold your database in memory if you search an NCBI-provided database or provide metadata for your custom database (see Create BLAST database metadata). This is the recommended way to use ElasticBLAST. Specifying the machine type overrides this feature, so you must ensure that your machine type has sufficient memory to hold your database.

NOTE: The machine’s available RAM should be large enough to contain the sequences in the database (one byte per residue or one byte per four bases) plus ~20%.

Default: n1-highmem-32 for GCP, m5.8xlarge for AWS.

The default machines have 32 cores and about 120GB of RAM.

These default values only apply if you use a custom database and do not provide metadata.

Values: String, see GCP machine types or AWS instance types accordingly.

[cluster]
machine-type = n1-standard-32
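
The sizing rule in the NOTE above (one byte per residue, or one byte per four bases, plus about 20%) can be turned into a quick estimate when you choose a machine type yourself. The following is a minimal sketch, not part of ElasticBLAST; the function name and the integer rounding are assumptions made for illustration.

```python
def estimate_min_ram_bytes(num_letters: int, is_protein: bool,
                           overhead_percent: int = 20) -> int:
    """Rough lower bound on RAM needed to hold a database's sequences.

    Hypothetical helper: applies the rule of thumb from the note above.
    num_letters is the total number of residues (protein) or bases
    (nucleotide) in the database.
    """
    if num_letters < 0:
        raise ValueError("num_letters must be non-negative")
    # Protein: one byte per residue. Nucleotide: one byte per four bases,
    # rounded up to a whole byte.
    sequence_bytes = num_letters if is_protein else -(-num_letters // 4)
    # Add the overhead margin, rounding up (integer ceiling division).
    return -(-sequence_bytes * (100 + overhead_percent) // 100)


# A protein database with 100 billion residues needs roughly 120 GB.
print(estimate_min_ram_bytes(100_000_000_000, is_protein=True))
```

Compare the result against the RAM of the machine type you intend to set; if the estimate is close to or above it, pick a larger machine type or let ElasticBLAST choose one.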

`Number of CPUs`¶

Number of CPUs to use per BLAST execution in a kubernetes or AWS Batch job.

Must be less than the number of CPUs for the chosen machine type.

Default: 16 or as many CPUs as are available on the selected machine type, whichever is smaller.

Values: Positive integer

[cluster]
num-cpus = 16

`Persistent disk size`¶

Size of the persistent disk attached to the cluster (GCP) or individual instances (AWS). This should be large enough to store the BLAST database, query sequence data and the BLAST results.

Format as <number> immediately followed by G for gigabytes, M for megabytes.

Note: ElasticBLAST uses pd-standard block storage by default. Per the GCP documentation on block storage, disks smaller than 1000G result in performance degradation for ElasticBLAST in GCP.

Default: 3000G for GCP, 1000G for AWS.

Values: String

[cluster]
pd-size = 3000G
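
The size format described above (a number immediately followed by G or M) can be checked before a value is written to the configuration file. This is an illustrative sketch only: the function name is hypothetical, and accepting decimal values and only uppercase unit letters are assumptions, not documented ElasticBLAST behavior.

```python
import re
from typing import Tuple

# A number (optionally with a decimal part) immediately followed by G or M.
_SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)([GM])")


def parse_size(value: str) -> Tuple[float, str]:
    """Split a size string such as '3000G' into (number, unit).

    Hypothetical helper; raises ValueError when the string does not
    follow the '<number>G' or '<number>M' pattern.
    """
    match = _SIZE_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"invalid size {value!r}: expected e.g. 3000G or 500M")
    number, unit = match.groups()
    return float(number), unit


print(parse_size("3000G"))  # (3000.0, 'G')
```

A value with a space between the number and the unit, such as "3000 G", is rejected by this sketch because the unit must follow the number immediately.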

`Local SSD support`¶

Note: This is an experimental feature in GCP. This limits local storage to 375GB.

Configure ElasticBLAST to use a single local SSD disk instead of a persistent disk to store BLAST database and query sequence batches.

Consider using this configuration setting if your disk quota is too small (e.g.: 500GB) and it impacts performance (see GCP documentation on block storage performance), but only if the BLAST database you are searching, your query sequence, and its results can fit into 375GB.

Default: None

Values: true or false

Applies to: GCP

[cluster]
exp-use-local-ssd = true

`Cloud resource labels`¶

Specifies the labels to attach to cloud resources created by ElasticBLAST.

Default: cluster-name={cluster_name},client-hostname={hostname},created={create_date},owner={username},project=elastic-blast,billingcode=elastic-blast,creator={username},program={blast_program},db={db},name={cluster_name},results={ELB_RESULTS}

Values: String of key-value pairs separated by commas. Keys must be all lowercase. Keys that overlap with the default labels are overridden with the values provided; otherwise, key-value pairs are appended to the default set of labels.

[cluster]
labels = key1=value1,key2=value2
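
The override-or-append rule for labels described above can be sketched as follows. This is not ElasticBLAST's code; the function name and the error handling are assumptions, and the sketch only illustrates how user-supplied pairs interact with the defaults.

```python
def merge_labels(default_labels: str, user_labels: str) -> str:
    """Merge comma-separated key=value labels (hypothetical helper).

    User-supplied keys that match a default key replace its value;
    all other user-supplied pairs are appended after the defaults.
    """
    merged = {}
    for spec in (default_labels, user_labels):
        for pair in spec.split(","):
            pair = pair.strip()
            if not pair:
                continue
            key, sep, value = pair.partition("=")
            if not sep:
                raise ValueError(f"label {pair!r} is not in key=value form")
            if key != key.lower():
                raise ValueError(f"label key {key!r} must be all lowercase")
            # Assigning to an existing key keeps its original position.
            merged[key] = value
    return ",".join(f"{key}={value}" for key, value in merged.items())


print(merge_labels("project=elastic-blast,owner=alice",
                   "owner=bob,team=genomics"))
# project=elastic-blast,owner=bob,team=genomics
```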

BLAST configuration options¶

`BLAST program`¶

BLAST program to run.

Default: None

Values: One of: blastp, blastn, blastx, tblastn, tblastx, psiblast, rpsblast, rpstblastn

[blast]
program = blastp

`BLAST options`¶

BLAST options to customize BLAST invocation.

Note: the default output format in ElasticBLAST is 11 (BLAST archive).

If you do not specify an output format (with -outfmt), you can use blast_formatter to format the results in any desired output format.

Below, we have specified “-outfmt 7” for the BLAST tabular format and requested blastp-fast mode.

Default: None

Values: String, see BLAST+ options

[blast]
options = -task blastp-fast -outfmt 7

`BLAST database`¶

BLAST database name to search.

To search a database provided in the cloud by the NCBI, simply use its name.

To search your own custom database, upload the database files to a cloud storage bucket and provide the bucket’s universal resource identifier (URI) plus the database name (see example and tip below). We also recommend that you include a metadata file for your database, which allows ElasticBLAST to better configure the memory requirements for your search. See Create BLAST database metadata for instructions on producing the metadata file.

Default: None

Values: String.

Sample BLAST database configuration¶

[blast]
db = nr

Sample custom BLAST database configuration¶

[blast]
db = gs://my-database-bucket/mydatabase

Tip: to upload your BLAST database to a cloud bucket, please refer to the cloud vendor documentation (AWS or GCP).

If you have BLAST+ available on your machine, you can run the command below to get a list of BLAST databases provided by NCBI:

When working on AWS¶

update_blastdb.pl --source aws --showall pretty

When working on GCP¶

update_blastdb.pl --source gcp --showall pretty

`Batch length`¶

Number of bases/residues per query batch.

NOTE: the appropriate value depends on the BLAST program.

Default: Auto-configured for supported programs.

Values: Positive integer

Also supported via the environment variable: ELB_BATCH_LEN.

[blast]
batch-len = 10000

`Memory request for BLAST search`¶

Minimum amount of RAM to allocate to a BLAST search.

Format as <number> immediately followed by G for gigabytes, M for megabytes.

Must be less than available RAM for the chosen machine type.

Default: Auto-configured based on database choice. Minimal value is 0.5G.

Values: String

Applies to: GCP

See also:

Motivation for memory requests and limits

Exceed a container’s memory limit

[blast]
mem-request = 95G

`Memory limit for BLAST search`¶

Maximum amount of RAM that a BLAST search can use.

Format as <number> immediately followed by G for gigabytes, M for megabytes.

Must be less than available RAM for the chosen machine type.

Default: Auto-configured based on database choice. Maximal value is 0.95 of the RAM available in the machine type.

Values: String

See also:

Motivation for memory requests and limits

Exceed a container’s memory limit

[blast]
mem-limit = 115G

`BLAST_USAGE_REPORT`¶

Controls the usage reporting via the environment variable BLAST_USAGE_REPORT.

For additional details, please see the BLAST+ privacy statement.

Default: true

Values: true or false

Input/output configuration options¶

`Query sequence data`¶

Query sequence data for BLAST.

Can be provided as a local path or a GCS or AWS bucket URI to a file/tarball. Multiple files can be provided as a space-separated list or in “list files”. Any file with the file extension .query-list is considered a “list file”, where each line contains a local path or a cloud bucket URI.

Default: None

Values: String

[blast]
queries = /home/${USER}/blast-queries.tar.gz
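
To make the “list file” convention above concrete, the sketch below expands a queries value into individual paths and URIs. It is illustrative only: the function name is hypothetical, the list file is assumed to be a readable local file, and skipping blank lines is an assumption.

```python
from pathlib import Path
from typing import List


def expand_queries(queries: str) -> List[str]:
    """Expand a space-separated queries value (hypothetical helper).

    Entries ending in .query-list are read line by line; every other
    entry is kept as-is.
    """
    expanded: List[str] = []
    for item in queries.split():
        if item.endswith(".query-list"):
            # Assumes the list file is local; each non-blank line is a
            # local path or a cloud bucket URI.
            for line in Path(item).read_text().splitlines():
                line = line.strip()
                if line:
                    expanded.append(line)
        else:
            expanded.append(item)
    return expanded
```

Because entries are separated by spaces, this sketch cannot handle paths that themselves contain spaces.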

`Results`¶

GCS or AWS S3 bucket URI in which to save the results from ElasticBLAST.

This value uniquely identifies a single ElasticBLAST search - please keep track of this.

Note: This bucket must exist prior to invoking ElasticBLAST and it must include the gs:// or s3:// prefix.

Default: None

Values: String

[blast]
results = ${YOUR_RESULTS_BUCKET}
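
The prefix requirement in the note above is easy to check before submitting a search. The sketch below is a hypothetical helper, not an ElasticBLAST function; it only tests the URI prefix and cannot confirm that the bucket actually exists.

```python
def has_valid_results_prefix(uri: str) -> bool:
    """Return True if uri starts with gs:// or s3:// and names something.

    Hypothetical helper: checks the prefix only, not bucket existence.
    """
    for prefix in ("gs://", "s3://"):
        if uri.startswith(prefix) and len(uri) > len(prefix):
            return True
    return False


print(has_valid_results_prefix("gs://my-results-bucket"))  # True
print(has_valid_results_prefix("my-results-bucket"))       # False
```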

`Log file`¶

File name to save logging output. Can only be set via the command line argument --logfile.

Default: elastic-blast.log

Values: String

`Log level`¶

Sets the logging threshold. Can only be set via the command line argument --loglevel.

Default: DEBUG

Values: One of DEBUG, INFO, WARNING, ERROR, CRITICAL

Timeout configuration options¶

`BLAST timeout`¶

Timeout in minutes after which Kubernetes will terminate a single BLAST job (i.e., the job that corresponds to one of the query batches).

Default: 10080 (1 week)

Values: Positive integer

Applies to: GCP

[timeouts]
blast-k8s-job = 10080

`BLASTDB initialization timeout`¶

Timeout in minutes to wait for the persistent disk to be initialized with the selected BLAST database.

Default: 45

Values: Positive integer

Applies to: GCP

[timeouts]
init-pv = 45

Developer configuration options¶

`Cluster name`¶

Name of the GKE cluster created or the AWS CloudFormation stack (and related resources).

The name may contain only lowercase alphanumerics and ‘-’, must start with a letter and end with an alphanumeric, and must be no longer than 40 characters.

Note: This name must be unique for each of your ElasticBLAST searches, otherwise this may lead to undefined behavior.

IMPORTANT: Please do not set this configuration variable unless you are intimately familiar with the internals of ElasticBLAST. This is NOT a way to reuse an existing GKE cluster to run ElasticBLAST.

Default: elasticblast-${USER}-X, where X is the first 8 characters of hashing the value of the results URI.

Values: String

Also supported via the environment variable: ELB_CLUSTER_NAME.

[cluster]
name = my-cluster
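
The naming rules above (lowercase alphanumerics and ‘-’, starting with a letter, ending with an alphanumeric, at most 40 characters) translate directly into a regular expression. The sketch below is illustrative; the function name is hypothetical and this is not necessarily how ElasticBLAST validates the value.

```python
import re

# Starts with a letter; may continue with lowercase alphanumerics or '-',
# in which case it must end with a lowercase alphanumeric.
_CLUSTER_NAME_RE = re.compile(r"[a-z](?:[a-z0-9-]*[a-z0-9])?")
MAX_CLUSTER_NAME_LENGTH = 40


def is_valid_cluster_name(name: str) -> bool:
    """Check a cluster name against the documented rules (hypothetical helper)."""
    return (len(name) <= MAX_CLUSTER_NAME_LENGTH
            and _CLUSTER_NAME_RE.fullmatch(name) is not None)


print(is_valid_cluster_name("my-cluster"))  # True
print(is_valid_cluster_name("My-cluster"))  # False
```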

`Minimal compressed query file size to split on client`¶

For single, compressed (i.e.: those with a file name ending in .gz) query files stored on the cloud, this configuration setting specifies the minimal file size (in bytes) to download the file and split on the local machine. Files larger than this threshold will be split in the cloud.

Default: 5 000 000

Values: Positive integer

[blast]
min-qsize-for-client-split-compressed = 10000000

`Minimal uncompressed query file size to split on client`¶

For single, uncompressed (i.e.: those with any file extension except .gz) query files stored on the cloud, this configuration setting specifies the minimal file size (in bytes) to download the file and split on the local machine. Files larger than this threshold will be split in the cloud.

Default: 20 000 000

Values: Positive integer

[blast]
min-qsize-for-client-split-uncompressed = 100000000
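
The two thresholds above can be read as a simple decision rule. The sketch below follows the statement that files larger than the threshold are split in the cloud; the function name is hypothetical, and treating a file exactly at the threshold as a client-side split is an assumption.

```python
DEFAULT_MIN_QSIZE_COMPRESSED = 5_000_000     # bytes, default for .gz files
DEFAULT_MIN_QSIZE_UNCOMPRESSED = 20_000_000  # bytes, default otherwise


def split_location(file_name: str, size_bytes: int,
                   compressed_threshold: int = DEFAULT_MIN_QSIZE_COMPRESSED,
                   uncompressed_threshold: int = DEFAULT_MIN_QSIZE_UNCOMPRESSED) -> str:
    """Return 'cloud' or 'client' for a single cloud-hosted query file.

    Hypothetical helper illustrating the documented thresholds.
    """
    is_compressed = file_name.endswith(".gz")
    threshold = compressed_threshold if is_compressed else uncompressed_threshold
    # Files larger than the threshold are split in the cloud.
    return "cloud" if size_bytes > threshold else "client"


print(split_location("queries.fa.gz", 6_000_000))  # cloud
print(split_location("queries.fa", 6_000_000))     # client
```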

`ELB_DONT_DELETE_SETUP_JOBS`¶

Set via an environment variable, applies to GCP only.

Default: Disabled

Values: Any string. Set to any value to enable.

Applies to: GCP

Do not delete the kubernetes setup jobs after they complete.

`ELB_PAUSE_AFTER_INIT_PV`¶

Set via an environment variable, applies to GCP only.

Default: 120

Values: Positive integer.

Applies to: GCP

Time in seconds to wait after persistent volume gets initialized to prevent mount errors on BLAST kubernetes jobs.
