1 Introduction
1.1 Prerequisites
Familiarity with the Linux shell, as well as with basic programming constructs such as for/while loops and if/else statements, is assumed. Familiarity with at least one scripting language such as R or Python will be beneficial, as will basic knowledge of git, virtual environments, and software containers.
1.2 What is Nextflow?
Nextflow is an open-source workflow manager consisting of a domain-specific language built as a superset of the Groovy programming language, which is itself a superset of Java. The purpose of Nextflow is to make it easier to coordinate the execution of complex data analysis pipelines and to facilitate the execution of such pipelines in cloud environments and high-performance clusters.
In practice, Nextflow allows you to declare how the output of some processes in a pipeline is fed as the input to other processes, leading to the production of a final result. In addition, Nextflow allows the specification of software requirements for each process using Conda environments, Docker containers, and a variety of other solutions.
Once a pipeline has been written in Nextflow, it can be easily scaled from a local laptop to a high-performance cluster or cloud environment without modifying the pipeline code, with Nextflow taking care of environment-specific commands (for example, submitting appropriate jobs to a batch scheduler). Moreover, pipeline execution is parallelised whenever the dependencies among processes allow it. Another motivation for using a workflow manager such as Nextflow is to increase the reproducibility of your data analysis. Finally, each process in Nextflow is executed in its own unique directory, with automatic staging of the required inputs, so there is no need to think about filename collisions and file locations when developing your pipeline.
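For instance, moving the same pipeline from a laptop to a SLURM cluster is typically just a matter of configuration rather than of editing the pipeline code. A minimal sketch of such a configuration change, assuming a hypothetical queue name, could look like this:

// hypothetical nextflow.config snippet: run every process through the SLURM scheduler
process {
    executor = 'slurm'     // the default executor is 'local' when this line is omitted
    queue    = 'standard'  // an assumed queue name on your cluster
}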
Nextflow is developed and maintained by Seqera Labs. The project was started in Cedric Notredame’s lab at the Centre for Genomic Regulation (CRG) in Barcelona, Spain. Nextflow was conceived with bioinformatics as its key use case, but it is domain-agnostic and can be used for any data-intensive workflow.
1.3 Learning objectives
After following this tutorial, learners will be able to use Nextflow autonomously for their own data analysis. They will understand fundamental Nextflow concepts such as processes, workflows, and channels. They will be able to write a configuration file to alter the resources allocated to a process and the software environment in which it executes. They will also be able to deploy a community-curated pipeline from nf-core.
1.4 Setup
Install Mamba, a faster alternative to Conda
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh
Close and re-open your shell. Then, create a mamba environment for installing Nextflow
mamba create -c bioconda -n nextflow nextflow
Now activate the environment that you just created
mamba activate nextflow
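You can check that the installation worked by printing the Nextflow version

nextflow -version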
Create an empty git repository in your GitHub account and clone it locally. I will pretend that this repository is called my_repo in this workshop; don’t forget to replace this with the name of your actual repository.
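For example, assuming the repository is hosted under your GitHub username, the clone command would look something like this (the URL is a placeholder)

git clone https://github.com/<your_username>/my_repo.git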
cd my_repo
You can use a text editor of your choice to write Nextflow code; for example, you may want to use Atom. Whatever editor you pick, I recommend installing its Nextflow language support (syntax highlighting) if one is available.
Create some essential files that we will be working on inside your new repository
touch main.nf
touch nextflow.config
1.5 Getting started
1.5.1 General anatomy of a Nextflow workflow
A Nextflow workflow is usually represented by a git repository containing all the files describing the workflow logic, configurations, and dependencies.
Text files containing Nextflow code usually have the .nf extension (e.g. my_file.nf). It does not matter how you name your files, as long as they contain valid Nextflow code, but for consistency the best practice is to call the file describing the workflow main.nf and to place it at the root of your repository.
The main.nf file can access code from other Nextflow scripts, called sub-workflows, which can have arbitrary names. The best practice is to place such additional Nextflow scripts under the folder workflows at the root of the repository; for example, an additional Nextflow script could be named workflows/preprocessing.nf. We will explore the usage of sub-workflows later, so don’t worry if you don’t understand their purpose yet.
The configurations for the pipeline need to be placed in a file called nextflow.config at the root of the repository. Note that, unlike with main.nf, this file name is mandatory for Nextflow to properly recognise the file. The nextflow.config file can contain, for example, resource requirements, execution settings, pipeline parameters, and metadata about your workflow. We will explore the usage of configuration files in depth in a later section.
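As a small preview, a hypothetical nextflow.config could look like the sketch below; the parameter name, resource values, and metadata are placeholders for illustration only.

// illustrative sketch of a nextflow.config; all names and values are placeholders
params.sample_sheet = '/path/to/sample_sheet.csv'   // a pipeline parameter

process {
    cpus   = 2        // default resources applied to every process
    memory = '4 GB'
}

manifest {
    author      = 'Your Name'
    description = 'An example workflow'
}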
Scripts (in any language, for example R or Python) that are used in the execution of specific processes in your pipeline should be placed in the folder bin. Scripts placed there are automatically recognised by Nextflow during task execution, without the need to specify their full path. Additional files needed for your workflow, such as a table containing metadata for your samples, are usually placed under the folder assets. However, this is not required, and many workflows do not have an assets folder.
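Putting these conventions together, a typical repository could be laid out as sketched below; only main.nf and nextflow.config are strictly expected, and the remaining file names are illustrative assumptions.

my_repo/
├── main.nf
├── nextflow.config
├── workflows/
│   └── preprocessing.nf
├── bin/
│   └── my_helper_script.py
└── assets/
    └── sample_metadata.csv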
Inputs to the workflow are usually defined in a so-called “sample sheet”, which should contain the (preferably absolute) file paths to the input files and any sample-specific parameters for task execution. The best practice is to have your sample sheet formatted as a CSV file with an appropriate header. Note that the sample sheet is not part of the pipeline itself; the absolute path to the sample sheet is usually provided to Nextflow either as a command-line parameter or in the nextflow.config file.
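For example, a hypothetical sample sheet for a paired-end sequencing workflow could look like this; the column names are an assumption of this example, not a Nextflow requirement.

sample,fastq_1,fastq_2
sample_A,/data/sample_A_R1.fastq.gz,/data/sample_A_R2.fastq.gz
sample_B,/data/sample_B_R1.fastq.gz,/data/sample_B_R2.fastq.gz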
To execute your workflow, you would run nextflow run main.nf at the root of your repository, assuming your workflow is in a file named main.nf. This prompts Nextflow to read the main.nf file and any sub-workflows, extract the workflow logic from them, load the configurations in nextflow.config, and coordinate the execution of the different steps in your pipeline to produce the final result.
1.5.2 Nextflow domain-specific language versions
Before starting, it is important to know that Nextflow as it was initially developed is referred to as Domain-Specific Language 1 (DSL1). A major change to the Nextflow syntax was later introduced by its developers, and the new syntax is referred to as DSL2. In this workshop, we will be using exclusively the new DSL2 syntax.
For this reason, you need to add the following line at the top of your main.nf file:

nextflow.enable.dsl = 2
So if in the future you find yourself looking at a Nextflow workflow written very differently from what you are used to, it is probably written in DSL1 rather than DSL2.
1.5.3 Core concepts
A Nextflow workflow is composed of a set of processes, each executing a single step of the analysis, which are coordinated by one or more workflows.
So for example there may be a process called align_fastq that takes as input a fastq file and a reference genome, and uses the software bwa-mem2 to align the reads, producing an aligned cram file as output.
A process defines, as a minimum, the command to be executed. Outputs are typically defined, but they may be omitted if the process is run only for its side effects. Inputs are usually also defined, but there may be processes that do not need any input and instead perform a static computation. Many additional properties can be specified for a process, such as the software it needs, or how much memory or time is required for its execution. We will explore processes and their definition in more detail in a later section.
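To make this concrete, here is a rough sketch of what the align_fastq process mentioned above could look like; the conda packages, resource values, and the exact command are assumptions for illustration, not working pipeline code.

// illustrative sketch only: packages, resources, and the command are assumptions
process align_fastq {
    conda 'bioconda::bwa-mem2 bioconda::samtools'   // per-process software environment
    cpus 8
    memory '16 GB'
    time '4h'

    input:
    path fastq
    path reference   // reference genome (index files omitted for brevity)

    output:
    path "aligned.cram"

    script:
    """
    bwa-mem2 mem -t ${task.cpus} ${reference} ${fastq} \\
        | samtools view -C -T ${reference} -o aligned.cram -
    """
}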
If processes determine what is to be executed and how, workflows instead determine the logic of execution and how different processes communicate with each other.
So for example there may be a workflow called SNP_CALLING that takes a fastq file as input, uses the process align_fastq to obtain an aligned cram file, and then gives that cram file as input to another process called gatk_call, which uses it to create a vcf file containing genetic variants.
Processes communicate with each other using so-called “channels”, which are unidirectional First-In-First-Out (FIFO) queues.
This means that a channel is populated with a set of elements (for example, files) produced by a process in the order in which they are produced.
Then, these elements are consumed one by one by another process in the same order.
Several ways to manipulate, merge, and split channels exist, and we will explore them later.
So for example the workflow that I described above may define an input channel containing the input fastq files, and a channel connecting the output of align_fastq to the input of gatk_call. Note that channels are unidirectional: communication happens only in one direction, from outputs to inputs.
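Sketched in Nextflow, the SNP_CALLING workflow described above could look roughly like the following; the gatk_call process and the channel names are assumptions, and both processes are assumed to be defined elsewhere in the script.

// illustrative sketch: align_fastq and gatk_call are assumed to be defined above
workflow SNP_CALLING {
    take:
    fastq_ch        // channel with the input fastq files
    reference_ch    // channel holding the reference genome

    main:
    align_fastq(fastq_ch, reference_ch)   // fastq -> aligned cram
    gatk_call(align_fastq.out)            // cram  -> vcf with genetic variants

    emit:
    gatk_call.out
}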
1.5.4 Your first workflow
Now that you know the basics of what a Nextflow workflow is, we will write a simple workflow and execute it. First of all, open in a text editor the file main.nf that you created before at the root of your repository, and add the following to it
// this is a comment
process say_hello { // comments can be written anywhere
    output:
    path "hello_world.txt"

    script:
    """
    echo "This is the EBI predoc course" > hello_world.txt
    """
}

workflow {
    say_hello()
    say_hello.out.view()
}
If you want to check how your main.nf should look at this stage, here it is in full

nextflow.enable.dsl = 2

// this is a comment
process say_hello { // comments can be written anywhere
    output:
    path "hello_world.txt"

    script:
    """
    echo "This is the EBI predoc course" > hello_world.txt
    """
}

workflow {
    say_hello()
    say_hello.out.view()
}
Now open the file nextflow.config and add the following line

// you can also put comments in nextflow.config
workDir = "../nextflow_workdir"
Now run the workflow by executing the following from the root of your repository
nextflow run main.nf
You should see a bunch of information about your workflow appear in your terminal, followed by the name of the file we just created, hello_world.txt. Note that the full path to the file will be printed in your terminal, which will look something like /home/myname/nextflow_workdir/00/fjdanurew9gihwepw1455idusodhfweioru/hello_world.txt.
Let’s now examine the code step by step:
- In nextflow.config we declared our working directory to be ../nextflow_workdir with the keyword workDir. The working directory is where Nextflow stores your files during intermediate processing.
- Lines starting with // define a comment in Groovy, and they are ignored.
- The keyword process defines a process, with the statement that follows defining the process name. So process say_hello creates a process named say_hello.
- Curly braces enclose a block of code, so process say_hello {<some code>} is understood by Nextflow as defining <some code> to belong to the process named say_hello.
- The keyword output: when used inside a process defines the expected outputs that the process should produce. If a declared output is absent at the end of the process execution, the execution fails.
- Several types of outputs can be defined, and path is one of them. The keyword path declares that what follows it is a file.
- Output files are named in the output block after the path qualifier. The name should be placed inside a string. Strings are enclosed in quotes in Groovy ("this is a string").
- The script: keyword defines the actual command to be executed by the process. The command should be provided as a string. Since commands can be long, a multi-line string is used here. In Groovy, multi-line strings are enclosed in three quotes:
  """
  this is a multi-line string
  it is very long
  """
- What is declared in the script: block is executed in the shell by default, so we should write bash code there.
  - The command that we wrote, echo "This is the EBI predoc course" > hello_world.txt, creates a file called hello_world.txt containing the string "This is the EBI predoc course".
- The keyword workflow defines a workflow, which we left unnamed in this case. The unnamed workflow in your main.nf file is the one that Nextflow executes automatically.
  - The workflow that we defined executes the process say_hello.
  - The output of the process say_hello is captured by say_hello.out, which is a channel. In this case, it contains just the file hello_world.txt.
  - The operator view() simply echoes the content of the channel to the terminal. In this case, it prints the path to hello_world.txt in our shell.
So to put everything together: when you ran nextflow run main.nf, Nextflow read the main.nf file, found the unnamed workflow, and executed it.
Some additional notes:
- It is possible to write path("a_file.txt") or path "a_file.txt"; they are equivalent statements.
- There is no need to think about where we are creating our files: under the hood, Nextflow creates a private directory for each execution and takes care of moving files around as needed by other processes.
- It is common practice to chain different operators on a channel.
  - It is possible to chain operators on separate lines.
  - The following is equivalent to say_hello.out.view():
    say_hello.out
        .view()
  - Note that spaces and tab characters are just for visual clarity and are not required.
- Explore the working directory ../nextflow_workdir. It contains a directory with a strange alphanumeric name like 00/fjdanurew9gihwepw1455idusodhfweioru (NOTE: yours will have a different name). One such directory is created by Nextflow for the execution of each task (each time a process is executed on a set of inputs, that execution is defined as a task). A short example of poking around in there follows right after this list.
  - Inside the directory with the strange name, you will see the file that we just created, hello_world.txt.
  - Read the content of the file with cat hello_world.txt. You should see This is the EBI predoc course in your terminal.
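Besides hello_world.txt, each task directory also contains hidden files that Nextflow uses to run and track the task. For example (the hash directory below is a placeholder; use the one printed for your own run)

cd ../nextflow_workdir/00/fjdanurew9gihwepw1455idusodhfweioru
ls -a             # shows hidden files such as .command.sh, .command.log and .exitcode
cat .command.sh   # the exact script that Nextflow executed for this task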