Configuration variables¶
Cloud provider configuration¶
GCP project
¶
Optional: GCP project ID to use.
Default: The default gcloud project.
Values: String, see Identifying projects
Applies to: GCP
Also supported via the environment variable: ELB_GCP_PROJECT.
To see the default gcloud project, run the command: gcloud config get project
To set the default project, run the command: gcloud config set project <INSERT_YOUR_GCP_PROJECT_ID_HERE>
[cloud-provider]
gcp-project = my-gcp-project
GCP region
¶
Name of the GCP region to use.
Default: us-east4
Values: String, see GCP region/zone documentation
Applies to: GCP
Also supported via the environment variable: ELB_GCP_REGION.
[cloud-provider]
gcp-region = us-east4
GCP zone
¶
Name of the GCP zone to use.
Default: us-east4-b
Values: String, see GCP region/zone documentation
Applies to: GCP
Also supported via the environment variable: ELB_GCP_ZONE.
[cloud-provider]
gcp-zone = us-east4-b
GCP network
¶
Optional: GCP network name to use. If provided, the GCP subnetwork must also be provided.
Default: default
Values: String
Applies to: GCP
To see the available networks, run the command: gcloud compute networks list
[cloud-provider]
gcp-network = default
gcp-subnetwork = subnet-name
GCP sub-network
¶
Optional: GCP sub-network name to use. If provided, the GCP network must also be provided.
Default: N/A
Values: String
Applies to: GCP
To see the available sub-networks for a given network, run the command: gcloud compute networks subnets list --filter="<INSERT_NETWORK_NAME_HERE>"
[cloud-provider]
gcp-network = default
gcp-subnetwork = subnet-name
Kubernetes version
¶
Kubernetes version to use; must be one of the versions supported in GKE. For additional details, please see the GKE release notes.
Default: The default Kubernetes version from the regular GKE release channel.
Values: String. Examples: 1.25, 1.25.9, or 1.26.1-gke.1500. For additional details, please see the relevant GKE documentation.
Applies to: GCP
To see the Kubernetes versions available in GCP in a given zone, run the command: gcloud container get-server-config --zone <INSERT_GCP_ZONE_HERE>
[cloud-provider]
gke-version = 1.25
AWS Region
¶
Name of the AWS region to use. Recommended value: us-east-1.
Default: None for the configuration file interface, us-east-1 for the command line interface.
Values: String, any region that supports Batch, see AWS documentation for details
Applies to: AWS
For background information about AWS regions, please see the AWS documentation.
[cloud-provider]
aws-region = us-east-1
AWS VPC
¶
Optional: AWS VPC ID to use (must exist in the chosen AWS region) or the keyword none.
Default: For AWS accounts that support EC2-VPC, the default VPC will be used. For AWS accounts without default VPCs or if none is specified, a new VPC will be created with as many subnets as there are availability zones in the region.
Values: String
Applies to: AWS
AWS Subnet
¶
Optional: A comma-separated list of AWS Subnet IDs to use; must exist in the chosen AWS region and AWS VPC.
Default: For AWS accounts that support EC2-VPC, the default subnets will be used. For AWS accounts without default VPCs or if left unspecified, as many subnets as there are availability zones in the region will be created.
Values: String
Applies to: AWS
[cloud-provider]
aws-subnet = subnet-SOME-RANDOM-STRING
AWS Security Group
¶
Optional: Name of the AWS security group to use; must exist in the chosen AWS region.
Default: None
Values: String
Applies to: AWS
[cloud-provider]
aws-security-group = sg-SOME-RANDOM-STRING
AWS Key Pair
¶
Optional: Name of the AWS key pair to use to log in to EC2 instances; must exist in the chosen AWS region.
Default: None
Values: String
Applies to: AWS
[cloud-provider]
aws-key-pair = my-aws-key-name
Cluster configuration¶
Number of worker nodes
¶
Specifies the maximum number of worker nodes of the configured machine type to use.
Default: 1
Values: Positive integer
[cluster]
num-nodes = 4
Use preemptible nodes
¶
Use spot instances and preemptible nodes to run ElasticBLAST. This can reduce costs substantially.
Note: Preemptible nodes are rebooted after 24 hours (by GCP). This is fine, as Kubernetes will restart the node and resubmit the search (i.e., batch) that was interrupted. Batches that have already been processed are not lost. ElasticBLAST batches take at most several hours.
Default: no
Values: Any string. Set to yes to enable.
Also supported via the environment variable: ELB_USE_PREEMPTIBLE.
[cluster]
use-preemptible = yes
Machine type
¶
Type of GCP or AWS machine to start as worker node(s).
WARNING: ElasticBLAST will select a machine type for you with sufficient RAM to hold your database in memory if you search an NCBI-provided database or provide metadata for your custom database (see Create BLAST database metadata). This is the recommended way to use ElasticBLAST. Specifying the machine type will override this feature, and you need to be sure that your machine type has sufficient memory to hold your database.
NOTE: The machine’s available RAM should be large enough to contain the sequences in the database (one byte per residue or one byte per four bases) plus ~20%.
Default: n1-highmem-32 for GCP, m5.8xlarge for AWS. The default machines have 32 cores and about 120GB of RAM.
These default values only apply if you use a custom database and do not provide metadata.
Values: String, see GCP machine types or AWS instance types accordingly.
[cluster]
machine-type = n1-standard-32
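The sizing rule in the NOTE above can be turned into a quick back-of-the-envelope check. The sketch below is only an illustration: the function name estimate_min_ram_bytes and its parameters are hypothetical and not part of ElasticBLAST, and the 20% overhead is taken directly from the note.

```python
def estimate_min_ram_bytes(num_letters: int, is_protein: bool,
                           overhead: float = 0.20) -> float:
    """Estimate the RAM needed to hold a BLAST database in memory.

    Follows the rule of thumb from the note above: one byte per residue
    (protein) or one byte per four bases (nucleotide), plus ~20%.
    """
    if num_letters < 0:
        raise ValueError("num_letters must be non-negative")
    bytes_per_letter = 1.0 if is_protein else 0.25
    return num_letters * bytes_per_letter * (1.0 + overhead)


# A hypothetical protein database with 100 billion residues.
needed = estimate_min_ram_bytes(100_000_000_000, is_protein=True)
print(f"Estimated RAM needed: {needed / 1e9:.0f} GB")
```

Compare the result with the memory of the machine type you intend to set; if the estimate is larger, the database will not fit in RAM on that machine type.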
Number of CPUs
¶
Number of CPUs to use per BLAST execution in a Kubernetes or AWS Batch job.
Must be less than the number of CPUs for the chosen machine type.
Default: 16 or as many CPUs as are available on the selected machine type, whichever is smaller.
Values: Positive integer
[cluster]
num-cpus = 16
Persistent disk size
¶
Size of the persistent disk attached to the cluster (GCP) or individual instances (AWS). This should be large enough to store the BLAST database, query sequence data and the BLAST results.
Format as <number> immediately followed by G for gigabytes, M for megabytes.
Note: ElasticBLAST uses pd-standard block storage by default. Per the GCP documentation on block storage, disks smaller than 1000G result in performance degradation for ElasticBLAST in GCP.
Default: 3000G for GCP, 1000G for AWS.
Values: String
[cluster]
pd-size = 3000G
Local SSD support
¶
Note: This is an experimental feature in GCP. This limits local storage to 375GB.
Configure ElasticBLAST to use a single local SSD disk instead of a persistent disk to store BLAST database and query sequence batches.
Consider using this configuration setting if your disk quota is too small (e.g.: 500GB) and it impacts performance (see GCP documentation on block storage performance), but only if the BLAST database you are searching, your query sequence, and its results can fit into 375GB.
Default: None
Values: true or false
Applies to: GCP
[cluster]
exp-use-local-ssd = true
Cloud resource labels
¶
Specifies the labels to attach to cloud resources created by ElasticBLAST.
Default:
cluster-name={cluster_name},client-hostname={hostname},created={create_date},owner={username},project=elastic-blast,billingcode=elastic-blast,creator={username},program={blast_program},db={db},name={cluster_name},results={ELB_RESULTS}
Values: String of key-value pairs separated by commas. Keys must be all lowercase. Keys that overlap with the default labels are overridden with the values provided; all other key-value pairs are appended to the default set of labels.
[cluster]
labels = key1=value1,key2=value2
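The override-or-append behavior described above can be illustrated with a short sketch. This is not ElasticBLAST's own code: parse_labels and merge_labels are hypothetical helper names, and the default labels used in the example are a shortened, made-up subset.

```python
def parse_labels(spec: str) -> dict:
    """Parse 'key1=value1,key2=value2' into a dictionary."""
    labels = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"Malformed label: {item!r}")
        if key != key.lower():
            raise ValueError(f"Label keys must be lowercase: {key!r}")
        labels[key] = value
    return labels


def merge_labels(defaults: dict, user: dict) -> dict:
    """User-provided keys override defaults; new keys are appended."""
    merged = dict(defaults)
    merged.update(user)
    return merged


# Shortened, illustrative subset of the default labels.
defaults = {"project": "elastic-blast", "owner": "alice"}
user = parse_labels("project=my-study,key1=value1")
print(merge_labels(defaults, user))
```

Here the user-provided project label replaces the default one, while key1 is added alongside the remaining defaults.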
BLAST configuration options¶
BLAST program
¶
BLAST program to run.
Default: None
Values: One of: blastp, blastn, blastx, tblastn, tblastx, psiblast, rpsblast, rpstblastn
[blast]
program = blastp
BLAST options
¶
BLAST options to customize BLAST invocation.
Note: the default output format in ElasticBLAST is 11 (BLAST archive).
If you do not specify an output format (with -outfmt), you can use blast_formatter to format the results in any desired output format.
Below, we have specified “-outfmt 7” for the BLAST tabular format and requested blastp-fast mode.
Default: None
Values: String, see BLAST+ options
[blast]
options = -task blastp-fast -outfmt 7
BLAST database
¶
BLAST database name to search.
To search a database provided in the cloud by the NCBI, simply use its name.
To search your own custom database, upload the database files to a cloud storage bucket and provide the bucket’s universal resource identifier (URI) plus the database name (see example and tip below). We also recommend that you include a metadata file for your database, which allows ElasticBLAST to better configure the memory requirements for your search. See Create BLAST database metadata for instructions on producing the metadata file.
Default: None
Values: String.
[blast]
db = nr
[blast]
db = gs://my-database-bucket/mydatabase
Tip: to upload your BLAST database to a cloud bucket, please refer to the cloud vendor documentation (AWS or GCP).
If you have BLAST+ available on your machine, you can run one of the commands below to get a list of BLAST databases provided by NCBI:
update_blastdb.pl --source aws --showall pretty
update_blastdb.pl --source gcp --showall pretty
Batch length
¶
Number of bases/residues per query batch.
NOTE: the appropriate value depends on the BLAST program being run.
Default: Auto-configured for supported programs.
Values: Positive integer
Also supported via the environment variable: ELB_BATCH_LEN.
[blast]
batch-len = 10000
Memory request for BLAST search
¶
Minimum amount of RAM to allocate to a BLAST search.
Format as <number> immediately followed by G for gigabytes, M for megabytes.
Must be less than available RAM for the chosen machine type.
Default: Auto-configured based on database choice. Minimal value is 0.5G.
Values: String
Applies to: GCP
[blast]
mem-request = 95G
Memory limit for BLAST search
¶
Maximum amount of RAM that a BLAST search can use.
Format as <number> immediately followed by G for gigabytes, M for megabytes.
Must be less than available RAM for the chosen machine type.
Default: Auto-configured based on database choice. Maximal value is 0.95 of the RAM available in the machine type.
Values: String
[blast]
mem-limit = 115G
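The size format and the 0.95 ceiling described above can be checked before launching a search. The following is a minimal sketch under stated assumptions: parse_size_gb and check_mem_limit are hypothetical helpers, 1G is assumed to equal 1024M (the documentation does not say which convention applies), and the 128 GB machine is a made-up example.

```python
import re

# <number> immediately followed by G or M, e.g. "115G", "0.5G", "512M".
_SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)([GM])")


def parse_size_gb(text: str) -> float:
    """Convert a size string such as '115G' or '512M' to gigabytes."""
    match = _SIZE_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"Expected <number>G or <number>M, got {text!r}")
    number, unit = float(match.group(1)), match.group(2)
    # Assumption: 1G = 1024M.
    return number if unit == "G" else number / 1024.0


def check_mem_limit(mem_limit: str, machine_ram_gb: float) -> bool:
    """Return True if mem_limit is within 0.95 of the machine's RAM."""
    return parse_size_gb(mem_limit) <= 0.95 * machine_ram_gb


# Hypothetical machine with 128 GB of RAM.
print(check_mem_limit("115G", 128.0))
print(check_mem_limit("125G", 128.0))
```

The same parser applies to mem-request and pd-size, which use the same size format.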
BLAST_USAGE_REPORT
¶
Controls the usage reporting via the environment variable BLAST_USAGE_REPORT. For additional details, please see the BLAST+ privacy statement.
Default: true
Values: true or false
Input/output configuration options¶
Query sequence data
¶
Query sequence data for BLAST.
Can be provided as a local path or as a GCS or AWS bucket URI to a file/tarball. Multiple files can be provided as a space-separated list or in “list files”. Any file with the file extension .query-list is considered a “list file”, where each line contains a local path or a cloud bucket URI.
Default: None
Values: String
[blast]
queries = /home/${USER}/blast-queries.tar.gz
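When the queries are spread over many files, a “list file” can be generated rather than written by hand. The sketch below assumes nothing beyond the description above (one local path or bucket URI per line, file extension .query-list); write_query_list is a hypothetical helper and the paths and bucket names are made up.

```python
from pathlib import Path


def write_query_list(path, locations):
    """Write one query location (local path or bucket URI) per line."""
    path = Path(path)
    if path.suffix != ".query-list":
        raise ValueError("List files must use the .query-list extension")
    path.write_text("".join(f"{location}\n" for location in locations))
    return path


# Hypothetical query locations.
list_file = write_query_list(
    "my-queries.query-list",
    [
        "/home/user/queries/sample1.fa",
        "gs://my-query-bucket/sample2.fa.gz",
    ],
)
print(list_file.read_text(), end="")
```

The resulting file can then be given as the value of the queries setting.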
Results
¶
GCS or AWS S3 bucket URI in which to save the results from ElasticBLAST.
This value uniquely identifies a single ElasticBLAST search, so please keep track of it.
Note: This bucket must exist prior to invoking ElasticBLAST, and the URI must include the gs:// or s3:// prefix.
Default: None
Values: String
[blast]
results = ${YOUR_RESULTS_BUCKET}
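To show how the settings from the different sections fit together in one file, the sketch below assembles a small configuration from the examples on this page and reads it back with Python's standard configparser module. It assumes the file follows ordinary INI syntax; the bucket names are hypothetical, and the prefix check simply mirrors the note above rather than reproducing ElasticBLAST's own validation.

```python
import configparser

EXAMPLE_CONFIG = """\
[cloud-provider]
gcp-project = my-gcp-project
gcp-region = us-east4
gcp-zone = us-east4-b

[cluster]
num-nodes = 4
use-preemptible = yes

[blast]
program = blastp
db = nr
options = -task blastp-fast -outfmt 7
# Hypothetical bucket names
queries = gs://my-query-bucket/queries.fa
results = gs://my-results-bucket
"""

# Disable interpolation so values are read back literally.
config = configparser.ConfigParser(interpolation=None)
config.read_string(EXAMPLE_CONFIG)

num_nodes = config.getint("cluster", "num-nodes")
results_uri = config.get("blast", "results")

# The results URI must carry the gs:// or s3:// prefix.
if not results_uri.startswith(("gs://", "s3://")):
    raise ValueError("results must start with gs:// or s3://")

print(f"{num_nodes} nodes, results in {results_uri}")
```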
Log file
¶
File name to save logging output. Can only be set via the command line argument --logfile.
Default: elastic-blast.log
Values: String
Log level
¶
Sets the logging threshold. Can only be set via the command line argument --loglevel.
Default: DEBUG
Values: One of DEBUG, INFO, WARNING, ERROR, CRITICAL
Timeout configuration options¶
BLAST timeout
¶
Timeout in minutes after which Kubernetes will terminate a single BLAST job (i.e., the job that corresponds to one of the query batches).
Default: 10080 (1 week)
Values: Positive integer
Applies to: GCP
[timeouts]
blast-k8s-job = 10080
BLASTDB initialization timeout
¶
Timeout in minutes to wait for the persistent disk to be initialized with the selected BLAST database.
Default: 45
Values: Positive integer
Applies to: GCP
[timeouts]
init-pv = 45
Developer configuration options¶
Cluster name
¶
Name of the GKE cluster created or the AWS CloudFormation stack (and related resources).
The name may contain only lowercase alphanumerics and ‘-’, must start with a letter and end with an alphanumeric, and must be no longer than 40 characters.
Note: This name must be unique for each of your ElasticBLAST searches, otherwise this may lead to undefined behavior.
IMPORTANT: Please do not set this configuration variable unless you are intimately familiar with the internals of ElasticBLAST. This is NOT a way to reuse an existing GKE cluster to run ElasticBLAST.
Default: elasticblast-${USER}-X, where X is the first 8 characters of the hash of the results URI.
Values: String
Also supported via the environment variable: ELB_CLUSTER_NAME.
[cluster]
name = my-cluster
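The naming rules above (lowercase alphanumerics and ‘-’, starting with a letter, ending with an alphanumeric, at most 40 characters) can be checked with a short regular expression. This is an illustrative sketch only; is_valid_cluster_name is a hypothetical helper, not a function provided by ElasticBLAST.

```python
import re

# Starts with a lowercase letter; any further characters are lowercase
# alphanumerics or '-', and the last one must be alphanumeric.
_NAME_RE = re.compile(r"[a-z](?:[a-z0-9-]*[a-z0-9])?")
MAX_NAME_LENGTH = 40


def is_valid_cluster_name(name: str) -> bool:
    """Check a cluster name against the rules stated above."""
    return len(name) <= MAX_NAME_LENGTH and _NAME_RE.fullmatch(name) is not None


for candidate in ("my-cluster", "My-Cluster", "my-cluster-", "1cluster"):
    print(candidate, is_valid_cluster_name(candidate))
```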
Minimal compressed query file size to split on client
¶
For single, compressed (i.e.: those with a file name ending in .gz) query files stored on the cloud, this configuration setting specifies the minimal file size (in bytes) to download the file and split on the local machine. Files larger than this threshold will be split in the cloud.
Default: 5 000 000
Values: Positive integer
[blast]
min-qsize-for-client-split-compressed = 10000000
Minimal uncompressed query file size to split on client
¶
For single, uncompressed (i.e.: those with any file extension except .gz) query files stored on the cloud, this configuration setting specifies the minimal file size (in bytes) to download the file and split on the local machine. Files larger than this threshold will be split in the cloud.
Default: 20 000 000
Values: Positive integer
[blast]
min-qsize-for-client-split-uncompressed = 100000000
ELB_DONT_DELETE_SETUP_JOBS
¶
Set via an environment variable, applies to GCP only.
Default: Disabled
Values: Any string. Set to any value to enable.
Applies to: GCP
When enabled, the Kubernetes setup jobs are not deleted after they complete.
ELB_PAUSE_AFTER_INIT_PV
¶
Set via an environment variable, applies to GCP only.
Default: 120
Values: Positive integer.
Applies to: GCP
Time in seconds to wait after the persistent volume is initialized, to prevent mount errors in BLAST Kubernetes jobs.