nf-core/configs: czbiohub_aws
Chan Zuckerberg Biohub AWS Batch profile provided by nf-core/configs.
CZ Biohub Configuration
All nf-core pipelines have been successfully configured for use on AWS Batch at the Chan Zuckerberg Biohub.
To use, run the pipeline with -profile czbiohub_aws. This will download and launch the czbiohub_aws.config, which has been pre-configured with a setup suitable for AWS Batch. Using this profile, a Docker image containing all of the required software will be downloaded, and converted to a Singularity image before execution of the pipeline.
Ask Olga (olga.botvinnik@czbiohub.org) if you have any questions!
Run the pipeline from a small AWS EC2 Instance
The pipeline will monitor and submit jobs to AWS Batch on your behalf. To ensure that the pipeline is successful, it needs to be run from a computer that has a constant internet connection. Unfortunately for us, the Biohub has spotty WiFi, so even for short pipelines it is highly recommended to run them from AWS.
1. Start tmux
tmux is a “Terminal Multiplexer” that allows commands to continue running even when you have closed your laptop. Start a new tmux session with tmux new, and we’ll name this session nextflow.
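As a minimal sketch of starting the named session described above: the -s flag of tmux new sets the session name. The SESSION_NAME variable is only an illustration.

```shell
# Name of the tmux session used throughout this guide
SESSION_NAME="nextflow"

# Start a new tmux session; -s sets the session name
tmux new -s "$SESSION_NAME"
```

To leave the session running in the background, detach with Control-b followed by d.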
Now you can run pipelines with abandon!
2. Make a GitHub repo for your workflows (optional :)
To make it easy to share pipelines and commands with your teammates, it’s best to keep the code in a GitHub repository. One way is to store the commands in a Makefile, which can contain multiple nextflow run commands so that you don’t need to remember the S3 bucket or output directory for every single one. Makefiles are broadly used in the software community for running many complex commands. Makefiles can have a lot of dependencies and be confusing, so we’re only going to write simple Makefiles.
Human_Mouse_Zebrafish:
Merkin2012_AWS:
In this example, one would run the rnaseq rule and the nextflow command beneath it with make rnaseq. If one wanted to run a different command, e.g. human_mouse_zebrafish, they would specify that rule instead, for example make human_mouse_zebrafish.
Makefiles are a very useful way of storing longer commands with short mnemonic words.
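The idea can be sketched with a throwaway Makefile written from the shell. This is not the original example: the bucket name my-bucket, the output prefixes, and the choice of nf-core/rnaseq for both rules are hypothetical placeholders, and pipeline-specific input parameters are left out.

```shell
# Work in a scratch directory so no existing Makefile is overwritten
workdir=$(mktemp -d)

# Each rule is a name followed by a colon; the command on the next line
# must start with a tab (\t), not spaces.
# "my-bucket" and the output prefixes are hypothetical placeholders.
printf 'rnaseq:\n\tnextflow run nf-core/rnaseq -profile czbiohub_aws --outdir s3://my-bucket/rnaseq/\n' > "$workdir/Makefile"
printf 'human_mouse_zebrafish:\n\tnextflow run nf-core/rnaseq -profile czbiohub_aws --outdir s3://my-bucket/human_mouse_zebrafish/\n' >> "$workdir/Makefile"

# -n is a dry run: make prints the command stored under the rule
# without executing it
(cd "$workdir" && make -n rnaseq)
```

Dropping -n runs the stored command for real, so make rnaseq stands in for the full nextflow run line.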
Once you create a new repository (best to initialize it with a .gitignore, a license such as MIT, and a README), use git clone to copy it to your EC2 instance, for example a repository called kh-workflows.
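A sketch of the clone step, assuming the repository is hosted on GitHub over HTTPS. The organization name your-org is a hypothetical placeholder, since the account that owns the repository is not given here.

```shell
# Hypothetical GitHub organization or user name; replace with your own
GITHUB_ORG="your-org"
REPO_NAME="kh-workflows"
REPO_URL="https://github.com/${GITHUB_ORG}/${REPO_NAME}.git"

# Copy the repository onto the EC2 instance and move into it
git clone "$REPO_URL"
cd "$REPO_NAME"
```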
Now create and edit a Makefile, for example with nano Makefile. Write your rule name with a colon after it; the command on the next line must be indented with a tab, not spaces. Once you’re done, write the file and exit the program (the ^ shown in nano’s command list means “Control”), then add the file to git, commit it, and push it up to GitHub.
3. Run your workflow
Remember to specify -profile czbiohub_aws to grab the CZ Biohub-specific AWS configurations, and an --outdir pointing to an AWS S3 bucket so you don’t run out of space on your small EC2 instance.
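A run with both options might look like the sketch below. The pipeline nf-core/rnaseq is used only as an example, the S3 path is a hypothetical placeholder, and the pipeline's own input parameters are omitted.

```shell
# Hypothetical S3 location for results; replace with your own bucket and prefix
OUTDIR="s3://my-bucket/rnaseq-results/"

# -profile selects the CZ Biohub AWS Batch configuration;
# --outdir sends results to S3 instead of the instance's local disk
nextflow run nf-core/rnaseq \
    -profile czbiohub_aws \
    --outdir "$OUTDIR"
```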
4. If you lose connection, how do you restart the jobs?
If you close your laptop, get onto the train, or lose WiFi connection, you may lose connection to AWS and may need to restart the jobs. To reattach, use the command tmux attach and you should see your Nextflow output! To get the named session, pass its name to tmux attach with the -t flag.
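For the session named nextflow from step 1, reattaching could look like this (the SESSION_NAME variable is only an illustration):

```shell
SESSION_NAME="nextflow"

# List the sessions that are still running on this machine
tmux ls

# Reattach to the named session; -t selects the target session
tmux attach -t "$SESSION_NAME"
```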
To restart the jobs from where you left off, add the -resume flag to your nextflow command. It’s important that this command be re-run from the same directory, as there is a “hidden” .nextflow folder that contains all the metadata and information about previous runs.
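As a sketch, resuming means repeating the original command with -resume appended, from the directory where it was first launched. The pipeline name and S3 path below are the same hypothetical placeholders as before, not values from a real run.

```shell
# Hypothetical S3 location; use the same value as in the original run
OUTDIR="s3://my-bucket/rnaseq-results/"

# Same command as the original run, plus -resume so that
# already-completed tasks are reused instead of recomputed
nextflow run nf-core/rnaseq \
    -profile czbiohub_aws \
    --outdir "$OUTDIR" \
    -resume
```

Note that -resume has a single dash because it is a Nextflow option, while pipeline parameters such as --outdir use two.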
iGenomes specific configuration
A local copy of the iGenomes resource has been made available on s3://czbiohub-reference/igenomes (in the us-west-2 region), so you should be able to run the pipeline against any reference available in the igenomes.config specific to the nf-core pipeline.
You can do this by simply using the --genome <GENOME_ID> parameter.
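For instance, selecting the mouse reference might look like the sketch below. GRCm38 is assumed here to be one of the keys in the pipeline's igenomes.config, and nf-core/rnaseq is again only an example pipeline with its other parameters omitted.

```shell
# Genome key as listed in the pipeline's igenomes.config (assumed: GRCm38)
GENOME_ID="GRCm38"

nextflow run nf-core/rnaseq \
    -profile czbiohub_aws \
    --genome "$GENOME_ID"
```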
For Human and Mouse, we use GENCODE gene annotations. This doesn’t change how you would specify the genome name, only that pipelines run with the czbiohub_aws profile will use GENCODE annotations rather than the iGenomes ones.
High Priority Queue
If you would like to run with the High Priority queue, specify the highpriority config profile after czbiohub_aws. When applied after the main czbiohub_aws config, it overwrites the process queue identifier.
To use it, submit your run with -profile czbiohub_aws,highpriority.
Note that the order of config profiles here is important. For example, -profile highpriority,czbiohub_aws will not work.