Welcome to the haddock-runner docs

The haddock-runner is an effort to reduce code duplication and to streamline the execution of HADDOCK benchmarks.

It is a standalone program, freely available at https://github.com/haddocking/haddock-runner.

It is designed to be used with the production-ready HADDOCK2.4, the pre-release HADDOCK2.5 and the experimental (unpublished) HADDOCK3 versions.

When running a benchmark, users/developers may be interested in the following (in no specific order):

  • The quality of the docking results when using different parameters
  • Comparing the results of different versions
  • The time it takes to run HADDOCK on a set of targets

Have a look at the sections below for more information on how to use it.

Getting help

If you encounter any issues or have any questions, please open an issue on the GitHub repository, contact us at bonvinlab.support@uu.nl or join the BioExcel forum and post your question there.

How does it work?

The execution of the haddock-runner consists of a few steps:

  1. Set up the benchmark

    • Copy the target structures to the location where the HADDOCK run will be executed
  2. Set up the HADDOCK run

    • For HADDOCK2.4, write the run.param file and execute the haddock2.4 program once to set up the folder structure
    • For HADDOCK3, write the run.toml file
  3. Distribute several HADDOCK runs in an HPC-friendly manner


The final goal of haddock-runner is to automate these steps, additionally giving the user the possibility of setting up various scenarios.

A scenario is a set of parameters that will be used to run HADDOCK. For example, a user may want to run HADDOCK against a set of targets with different sampling values, different restraints, different parameters, etc.
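As a rough illustration of how targets and scenarios combine, the sketch below lists one run per target/scenario pair. The TARGET_scenario naming mirrors the run names that appear in the restart log later in this document; the function name list_runs is hypothetical and not part of haddock-runner.

```shell
#!/bin/bash
# Sketch only: enumerate every target/scenario combination.
# list_runs is a hypothetical helper, not a haddock-runner command.
list_runs() {
  targets="$1"    # space-separated target names, e.g. "1A2K 1GGR"
  scenarios="$2"  # space-separated scenario names
  for t in $targets; do
    for s in $scenarios; do
      echo "${t}_${s}"
    done
  done
}

# Two targets and two scenarios give four runs.
list_runs "1A2K 1GGR" "true-interface center-of-mass"
```

The number of runs grows as targets × scenarios, which is worth keeping in mind when choosing max_concurrent and estimating disk usage.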

Installation

The tool is designed for users/students/developers who are familiar with HADDOCK and command-line scripting and who have access to an HPC infrastructure.

If this is the first time you are using HADDOCK, please first familiarize yourself with the software by running the basic HADDOCK2.4 or HADDOCK3 tutorials.

This tool is not meant to be used by end-users who want to run a single target, or a small set of targets; for that purpose we recommend instead using the HADDOCK2.4 web server.

VERY IMPORTANT: You need to have HADDOCK installed on your system. This is not covered in this documentation. Please refer to the HADDOCK2.4 installation instructions or HADDOCK3.0 repository for more information.

haddock-runner is a standalone open-source software licensed under Apache 2.0 and freely available from the following repository: github.com/haddocking/haddock-runner.

To use it, simply download the latest binary from the releases page:

$ wget https://github.com/haddocking/haddock-runner/releases/download/v1.10.0/haddock-runner_1.10.0_linux_386.tar.gz
$ tar -zxvf haddock-runner_1.10.0_linux_386.tar.gz
$ ./haddock-runner -version
haddock-runner version v1.10.0

Alternatively, you can build the latest version from source (you probably don't need to do that). Make sure Go is installed and run the following commands:

$ git clone https://github.com/haddocking/haddock-runner.git
$ cd haddock-runner
$ go build -o haddock-runner
$ ./haddock-runner -version
haddock-runner version v1.10.0

Usage

This chapter will go over the steps needed to use haddock-runner.

  1. Writing the list of target input files, input.list
  2. Writing a run-haddock.sh script
  3. Preparing the configuration file, benchmark.yaml
  4. Running haddock-runner

Writing an input.list file

The input list is a flat text file with the paths of the targets:

# input.list
/home/rodrigo/projects/haddock-benchmark/data/complex1_r_u.pdb
/home/rodrigo/projects/haddock-benchmark/data/complex1_l_u.pdb
/home/rodrigo/projects/haddock-benchmark/data/complex1_ti.tbl
#
# comments are allowed, use them to organize your file
#
/home/rodrigo/projects/haddock-benchmark/data/complex2_r_u.pdb
/home/rodrigo/projects/haddock-benchmark/data/complex2_l_u.pdb
/home/rodrigo/projects/haddock-benchmark/data/complex2_ti.tbl
/home/rodrigo/projects/haddock-benchmark/data/complex2_ligand.top
/home/rodrigo/projects/haddock-benchmark/data/complex2_ligand.param

Note that this file must follow the pattern:

path/to/the/structure/NAME_receptor_suffix.pdb
path/to/the/structure/NAME_ligand_suffix.pdb

In the above example, complex1 and complex2 thus correspond to NAME, identifying the complex being modelled. Each PDB file (indicated by the .pdb extension) has a suffix; this is extremely important as it is used to organize the data. For example, the file complex1_r_u.pdb is the receptor of the target complex1 and complex1_l_u.pdb is the ligand of the same target.

In this example the suffixes are:

  • receptor_suffix: _r_u
  • ligand_suffix: _l_u

These suffixes are defined in the benchmark.yaml file; see the General section below for more details.

The same logic applies to the restraint files. In the example above, the pattern for the ambiguous restraints can be defined as ambig: "ti", so the file complex1_ti.tbl will be used as the ambiguous restraints for the target complex1, complex2_ti.tbl for the target complex2, etc. See the HADDOCK3.0 subsection of the Scenario section for information specific to the definition of restraints when setting up a HADDOCK3.0 run.

HADDOCK supports many modified amino acids/bases/glycans/ions (check the full list). However, if your target molecule is not present in this library, you can provide its topology and parameter files following the same logic: topology: "_ligand.top" and param: "_ligand.param" will use the files complex2_ligand.top and complex2_ligand.param for the target complex2.

IMPORTANT: For ensembles, provide each model individually and append a number to the suffix, for example: complex1_l_u_1.pdb, complex1_l_u_2.pdb, etc.
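The naming pattern can be checked with a small script before running a benchmark. The sketch below derives NAME and the role (receptor or ligand) from a PDB file name, allowing an optional trailing _<number> for ensemble members. It is only an illustration: classify_entry is a hypothetical helper, the suffix values are those of the example above, and the way haddock-runner itself matches files may differ.

```shell
#!/bin/bash
# Sketch only: classify an input.list entry by its suffix.
# Suffix values are taken from the example above; adapt them to your setup.
receptor_suffix="_r_u"
ligand_suffix="_l_u"

classify_entry() {
  file=$(basename "$1")
  # Only PDB files carry a receptor/ligand suffix in this sketch.
  case "$file" in
    *.pdb) ;;
    *) echo "$file other"; return ;;
  esac
  # Drop the extension and an optional trailing _<number> (ensemble member).
  stem=$(printf '%s\n' "$file" | sed -E 's/\.pdb$//; s/_[0-9]+$//')
  case "$stem" in
    *"$receptor_suffix")
      name=${stem%"$receptor_suffix"}
      echo "$name receptor" ;;
    *"$ligand_suffix")
      name=${stem%"$ligand_suffix"}
      echo "$name ligand" ;;
    *)
      echo "$file other" ;;
  esac
}

classify_entry ./example/1GGR/1GGR_l_u_3.pdb   # prints "1GGR ligand"
```

Entries that print "other" for a .pdb file most likely do not follow the NAME plus suffix convention and deserve a second look.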

See below a full example of the input.list file:

# -------------------------------- #
# 1A2K
./example/1A2K/1A2K_r_u.pdb
./example/1A2K/1A2K_l_u.pdb
./example/1A2K/1A2K_ligand.top
./example/1A2K/1A2K_ligand.param
./example/1A2K/1A2K_ti.tbl
./example/1A2K/1A2K_unambig.tbl
# 1GGR
./example/1GGR/1GGR_r_u.pdb
./example/1GGR/1GGR_l_u_1.pdb
./example/1GGR/1GGR_l_u_2.pdb
./example/1GGR/1GGR_l_u_3.pdb
./example/1GGR/1GGR_l_u_4.pdb
./example/1GGR/1GGR_l_u_5.pdb
./example/1GGR/1GGR_ti.tbl
# 1PPE
./example/1PPE/1PPE_l_u.pdb
./example/1PPE/1PPE_r_u.pdb
./example/1PPE/1PPE_ti.tbl
./example/1PPE/1PPE_hb.tbl
./example/1PPE/1PPE_unambig.tbl
# 2OOB
./example/2OOB/2OOB_l_u.pdb
./example/2OOB/2OOB_r_u.pdb
./example/2OOB/2OOB_ti.tbl
./example/2OOB/2OOB_hb.tbl
# -------------------------------- #
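Since the list is just one path per line plus optional comments, it is easy to verify that every listed file exists before launching anything. The following is a minimal sketch under that assumption; check_input_list is a hypothetical helper and not part of haddock-runner. Relative paths are resolved against the current directory.

```shell
#!/bin/bash
# Sketch only: report listed files that do not exist.
# Returns success (0) only when every listed path is a regular file.
check_input_list() {
  list="$1"
  missing=0
  while IFS= read -r line || [ -n "$line" ]; do
    # Skip blank lines and comment lines starting with '#'.
    case "$line" in
      ''|'#'*) continue ;;
    esac
    if [ ! -f "$line" ]; then
      echo "missing: $line" >&2
      missing=$((missing + 1))
    fi
  done < "$list"
  [ "$missing" -eq 0 ]
}
```

A possible use is `check_input_list input.list && echo "all files found"`, run from the directory that the relative paths refer to.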

Writing a run-haddock.sh script

The run-haddock.sh script is a bash script that will be executed by haddock-runner for each target. The purpose of this script is to provide an "adapter" to account for different HADDOCK versions and/or different Python versions and even different operating systems and configurations on your cluster.

This script should contain all the commands necessary to run HADDOCK and it must be customized for your installation, for example:

haddock24.sh

#!/bin/bash
#===============================================================================
# HADDOCK2.4 runs on python2.7, which is EOL.
# This script is a workaround to run HADDOCK with a custom python2 installation

## With pyenv
export PYENV_ROOT="$HOME/.pyenv"
command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"
pyenv shell 2.7.18

## With Anaconda
# source $HOME/miniconda3/etc/profile.d/conda.sh
# conda create -n haddock24_env python=2.7
# conda activate haddock24_env

#===============================================================================
# Configure HADDOCK2.4
export HADDOCK="$HOME/repos/haddock24"
export HADDOCKTOOLS="$HADDOCK/tools"
export PYTHONPATH="${PYTHONPATH}:$HADDOCK"

python "$HADDOCK/Haddock/RunHaddock.py"
#===============================================================================

haddock3.sh

#!/bin/bash
#===============================================================================
HADDOCK3_DIR="$HOME/repos/haddock3"

# Activate the virtual environment
source "$HADDOCK3_DIR/venv/bin/activate" || exit

### Or if installed with conda
## source $HOME/miniconda3/etc/profile.d/conda.sh
## conda activate haddock3

# Mind the "$@" at the end, this is necessary to pass the arguments to the script
haddock3 "$@"
#===============================================================================
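The reason the quoted "$@" matters in the haddock3.sh adapter can be shown with plain shell functions. The sketch below is independent of HADDOCK; the function names and the example arguments are made up for illustration.

```shell
#!/bin/bash
# Sketch only: compare quoted "$@" with unquoted $* when forwarding arguments.
count_args() { echo "$#"; }

# "$@" forwards every argument exactly as received.
forward_quoted() { count_args "$@"; }

# Unquoted $* re-splits arguments on whitespace, breaking paths with spaces.
forward_unquoted() { count_args $*; }

forward_quoted "my run/run.toml" extra     # prints 2
forward_unquoted "my run/run.toml" extra   # prints 3
```

With "$@", whatever haddock-runner passes to the script reaches the haddock3 command unchanged, even if a path contains spaces.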

Writing a benchmark.yaml file

The benchmark.yaml file is a configuration file in YAML format that will be used by haddock-runner to run the benchmark. The whole idea is that one configuration file can define multiple scenarios, each scenario being a set of parameters that will be used to run HADDOCK.

This file should be the replicable part of the benchmark, i.e. the part that you want to share with others. It should contain all the information needed to run the benchmark, alongside the input list.

This file is divided into three main sections: general, slurm and scenarios.

General section

Here you must define the following parameters:

  • executable: Path to the run-haddock.sh script (see the run-haddock.sh section above for more details)
  • max_concurrent: Maximum number of jobs that can be executed at a given time
  • haddock_dir: Path to the HADDOCK installation; this is used to validate the parameters of the scenarios section
  • receptor_suffix: The suffix used to identify the receptor files
  • ligand_suffix: The suffix used to identify the ligand files
  • shape_suffix: The suffix used to identify the shape files
  • input_list: The path to the input list (see the input.list section above for more details)
  • work_dir: The path where the results will be stored

See below an example:

general:
  executable: /workspaces/haddock-runner/example/haddock3.sh
  max_concurrent: 4
  haddock_dir: /opt/haddock3
  receptor_suffix: _r_u
  ligand_suffix: _l_u
  input_list: /workspaces/haddock-runner/example/input_list.txt
  work_dir: /workspaces/haddock-runner/bm-goes-here
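Most problems with this section come from paths that do not exist on the machine where the benchmark runs. A quick pre-flight check could look like the sketch below; check_general is a hypothetical helper, and the three arguments are the values you would put in executable, haddock_dir and input_list. haddock-runner performs its own validation, which this does not replace.

```shell
#!/bin/bash
# Sketch only: sanity-check the paths used in the general section.
check_general() {
  executable="$1"
  haddock_dir="$2"
  input_list="$3"
  status=0
  if [ ! -x "$executable" ]; then
    echo "executable is missing or not executable: $executable" >&2
    status=1
  fi
  if [ ! -d "$haddock_dir" ]; then
    echo "haddock_dir is not a directory: $haddock_dir" >&2
    status=1
  fi
  if [ ! -f "$input_list" ]; then
    echo "input_list not found: $input_list" >&2
    status=1
  fi
  return "$status"
}
```

For the example above this would be called as `check_general /workspaces/haddock-runner/example/haddock3.sh /opt/haddock3 /workspaces/haddock-runner/example/input_list.txt`.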

Slurm section

This section is optional but highly recommended! For these parameters to take effect you must be running the benchmark in an HPC environment. They are used internally by the runner to compose the .job file. You can define the following parameters; if left blank, SLURM will pick up the default values:

  • partition: The name of the partition to be used
  • cpus_per_task: Number of CPUs per task
  • ntasks_per_node: Number of tasks per node
  • nodes: Number of nodes
  • time: Maximum time for the job
  • account: Account to be used
  • mail_user: Email to be notified when the job starts and ends

See below an example:

slurm:
  partition: short # use the short partition
  cpus_per_task: 8 # use 8 cores per task
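Each of these parameters has a same-named sbatch long option (with dashes instead of underscores). The sketch below shows that mapping as #SBATCH directives; it is an illustration only, sbatch_directives is a hypothetical helper, and the exact content of the .job file written by haddock-runner may differ.

```shell
#!/bin/bash
# Sketch only: turn key=value pairs into #SBATCH directive lines.
sbatch_directives() {
  for pair in "$@"; do
    key=${pair%%=*}
    value=${pair#*=}
    # Skip parameters left blank so that SLURM falls back to its defaults.
    [ -n "$value" ] || continue
    # benchmark.yaml uses underscores; sbatch long options use dashes.
    flag=$(printf '%s' "$key" | tr '_' '-')
    echo "#SBATCH --${flag}=${value}"
  done
}

sbatch_directives partition=short cpus_per_task=8 time=
# prints:
# #SBATCH --partition=short
# #SBATCH --cpus-per-task=8
```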

Scenario section

Here you must define the scenarios that you want to run; these are slightly different for HADDOCK2.4 and HADDOCK3.0.

HADDOCK2.4

For HADDOCK2.4 you must define the following:

  • name: the name of the scenario
  • parameters: the parameters to be used in the scenario
    • run_cns: parameters that will be used in the run.cns file
    • restraints: patterns used to identify the restraints files
      • ambig: pattern used to identify the ambiguous restraints file
      • unambig: pattern used to identify the unambiguous restraints file
      • hbonds: pattern used to identify the hydrogen bonds restraints file
    • custom_toppar: patterns used to identify the custom topology files
      • topology: pattern used to identify the topology file
      • param: pattern used to identify the parameter file

See below an example:

# HADDOCK2.4
scenarios:
  - name: true-interface
    parameters:
      run_cns:
        noecv: false
        structures_0: 1000
        structures_1: 200
        waterrefine: 200
      restraints:
        ambig: ti

HADDOCK3.0

Note: HADDOCK3.0 is still under development and is not meant to be used for production runs! Please use HADDOCK2.4 instead. For information about the available modules, please refer to the HADDOCK3 tutorial and the documentation.

For HADDOCK3.0 you must define the following:

  • name: the name of the scenario
  • parameters: the parameters to be used in the scenario
    • general: general parameters; those are the ones defined in the "top" section of the run.toml script
    • modules: this subsection is related to the parameters of each module in HADDOCK3.0
      • order: the order of the modules to be used in HADDOCK3.0
      • <module-name>: parameters for the module

See below an example:

# HADDOCK3.0
scenarios:
  - name: true-interface
    parameters:
      general:
        mode: local
        ncores: 4

      modules:
        order: [topoaa, rigidbody, seletop, flexref, emref]
        topoaa:
          autohis: true
        rigidbody:
          ambig_fname: _ti.tbl
        seletop:
          select: 200
        flexref:
        emref:

Running haddock-runner

Assuming the input list and the configuration .yaml file have been properly set up, you can run the benchmark simply with:

haddock-runner my-benchmark-config-file.yaml

haddock-runner will read the input file, create the working directory, copy the input files to a data/ directory and start the benchmark. Make sure you have enough space on your disk to store the input files and the results.

Restarting a benchmark

In v1.7.0 we introduced the possibility to restart a benchmark. This is useful when you want to continue a benchmark that was interrupted for some reason. To restart a benchmark you must have the benchmark.yaml file and the input.list file used in the original benchmark. The benchmark.yaml file must have the work_dir parameter set to the directory where the original benchmark was run.

Just run it again without the need of any special flags or parameters:

haddock-runner my-benchmark-config-file.yaml

haddock-runner should automagically detect which runs are completed and which are not. It does this by searching the log produced by HADDOCK (both v2 and v3) for keywords and assigning a status to each run at runtime.

I1116 13:28:09.721754   58085 main.go:192] ############################################
W1116 13:28:09.721797   58085 main.go:207] +++ 2OOB_true-interface is INCOMPLETE - restarting +++
I1116 13:28:09.721810   58085 main.go:204] 1GGR_center-of-mass - DONE - skipping
I1116 13:28:09.721823   58085 main.go:204] 1A2K_random-restraints - DONE - skipping
W1116 13:28:09.721988   58085 main.go:207] +++ 1GGR_true-interface is INCOMPLETE - restarting +++
I1116 13:28:09.721999   58085 main.go:204] 1GGR_random-restraints - DONE - skipping
I1116 13:28:09.722030   58085 main.go:204] 1A2K_center-of-mass - DONE - skipping
I1116 13:28:09.722010   58085 main.go:204] 1PPE_random-restraints - DONE - skipping
W1116 13:28:09.722072   58085 main.go:207] +++ 1PPE_true-interface is INCOMPLETE - restarting +++
W1116 13:28:09.722087   58085 main.go:207] +++ 1A2K_true-interface is INCOMPLETE - restarting +++
I1116 13:28:09.722165   58085 main.go:204] 1PPE_center-of-mass - DONE - skipping
I1116 13:28:09.722041   58085 main.go:204] 2OOB_center-of-mass - DONE - skipping
I1116 13:28:09.722483   58085 main.go:204] 2OOB_random-restraints - DONE - skipping
I1116 13:28:57.531951   58085 main.go:226] 2OOB_true-interface - DONE in 47.81 seconds
I1116 13:29:46.939726   58085 main.go:226] 1GGR_true-interface - DONE in 97.22 seconds
I1116 13:29:56.830500   58085 main.go:226] 1PPE_true-interface - DONE in 107.11 seconds
I1116 13:30:40.741859   58085 main.go:226] 1A2K_true-interface - DONE in 151.02 seconds
I1116 13:30:40.741907   58085 main.go:235] ############################################
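If this output is saved to a file, the runs that were restarted can be pulled out with a one-line sed filter. The sketch below relies only on the message format shown in the log above; incomplete_runs is a hypothetical helper, and the log file name is up to you.

```shell
#!/bin/bash
# Sketch only: print the names of runs flagged as INCOMPLETE in a saved log.
incomplete_runs() {
  sed -n 's/.*+++ \(.*\) is INCOMPLETE - restarting +++.*/\1/p' "$1"
}
```

For the log above, `incomplete_runs saved.log` would list the four true-interface runs, one per line.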

To make sure the results are consistent, haddock-runner creates a checksum of both the configuration YAML file and the input list, and shows a warning if either has changed. This ensures that the parameters and input have not changed mid-execution.
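The idea behind such a check can be sketched in a few lines of shell. This is not how haddock-runner implements it: the use of sha256sum, the record file and the function name inputs_changed are all assumptions made for illustration.

```shell
#!/bin/bash
# Sketch only: detect whether a set of files changed since the last call.
# Usage: inputs_changed RECORD_FILE FILE...
# Returns 0 (and warns) if the files differ from the recorded checksum,
# otherwise records the current checksum and returns 1.
inputs_changed() {
  record="$1"
  shift
  # One digest over the concatenated files (sha256sum is an assumption here).
  current=$(cat "$@" | sha256sum | cut -d ' ' -f 1)
  if [ -f "$record" ] && [ "$(cat "$record")" != "$current" ]; then
    echo "WARNING: configuration or input list changed since the last run" >&2
    return 0
  fi
  printf '%s\n' "$current" > "$record"
  return 1
}
```

A possible use is `inputs_changed .bm-checksum benchmark.yaml input.list`, called once before the first run and again before a restart.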

Examples

Here are full examples of the benchmark.yaml file for both HADDOCK2.4 and HADDOCK3.0.

HADDOCK2.4

general:
  executable: /workspaces/haddock-runner/haddock24.sh
  max_concurrent: 2
  haddock_dir: /Users/rodrigo/repos/haddock
  receptor_suffix: _r_u
  ligand_suffix: _l_u
  input_list: /workspaces/haddock-runner/example/input_list.txt
  work_dir: /workspaces/haddock-runner/bm-goes-here

scenarios:
  - name: true-interface
    parameters:
      run_cns:
        noecv: false
        structures_0: 1000
        structures_1: 200
        waterrefine: 200
      restraints:
        ambig: ambig
        unambig: restraint-bodies
        hbonds: hbonds
      custom_toppar:
        topology: _ligand.top
        param: _ligand.param

  - name: center-of-mass
    parameters:
      run_cns:
        cmrest: true
        structures_0: 10000
        structures_1: 400
        waterrefine: 400
        anastruc_1: 400
      custom_toppar:
        topology: _ligand.top
        param: _ligand.param

  - name: random-restraints
    parameters:
      run_cns:
        ranair: true
        structures_0: 10000
        structures_1: 400
        waterrefine: 400
        anastruc_1: 400
      custom_toppar:
        topology: _ligand.top
        param: _ligand.param

  #-----------------------------------------------

HADDOCK3.0

general:
  executable: /workspaces/haddock-runner/example/haddock3.sh
  max_concurrent: 4
  haddock_dir: /opt/haddock3
  receptor_suffix: _r_u
  ligand_suffix: _l_u
  input_list: /workspaces/haddock-runner/example/input_list.txt
  work_dir: /workspaces/haddock-runner/bm-goes-here

scenarios:
  - name: true-interface
    parameters:
      general:
        mode: local
        ncores: 8

      modules:
        order:
          [topoaa, rigidbody, seletop, flexref, emref, clustfcc, seletopclusts]
        topoaa:
          autohis: true
        rigidbody:
          ambig_fname: _ambig.tbl
          unambig_fname: _restraint-bodies.tbl
          ligand_param_fname: _ligand.param
          ligand_top_fname: _ligand.top
        seletop:
          select: 200
        flexref:
          ambig_fname: _ambig.tbl
          unambig_fname: _restraint-bodies.tbl
          ligand_param_fname: _ligand.param
          ligand_top_fname: _ligand.top
        emref:
          ambig_fname: _ambig.tbl
        clustfcc:
        seletopclusts:

  - name: center-of-mass
    parameters:
      general:
        mode: local
        ncores: 8

      modules:
        order:
          [topoaa, rigidbody, seletop, flexref, emref, clustfcc, seletopclusts]
        topoaa:
          autohis: true
        rigidbody:
          sampling: 10000
          cmrest: true
          ligand_param_fname: _ligand.param
          ligand_top_fname: _ligand.top
        seletop:
          select: 400
        flexref:
          ligand_param_fname: _ligand.param
          ligand_top_fname: _ligand.top
        emref:
        clustfcc:
        seletopclusts:

  - name: random-restraints
    parameters:
      general:
        mode: local
        ncores: 8

      modules:
        order:
          [topoaa, rigidbody, seletop, flexref, emref, clustfcc, seletopclusts]
        topoaa:
          autohis: true
        rigidbody:
          sampling: 10000
          ranair: true
          ligand_param_fname: _ligand.param
          ligand_top_fname: _ligand.top
        seletop:
          select: 400
        flexref:
          contactairs: true
          ligand_param_fname: _ligand.param
          ligand_top_fname: _ligand.top
        emref:
          contactairs: true
          ligand_param_fname: _ligand.param
          ligand_top_fname: _ligand.top
        clustfcc:
        seletopclusts:

  #-----------------------------------------------

Setting up BM5

The Protein-Protein docking benchmark v5 (Vreven, 2015), namely BM5, is a large set of non-redundant, high-quality structures (check the full set).

The BonvinLab provides a HADDOCK-ready sub-version of the BM5 which can be easily used as input for haddock-runner. This version is available in the following repository: github.com/haddocking/BM5-clean. Below we will go over step-by-step instructions on how to use it as input.

Create a working directory

Create a working directory and change to it:

mkdir -p ~/projects/benchmarking && cd ~/projects/benchmarking

Download the BM5-clean and create a bm5-input.list file

Clone the repository and check out a version. Note that it is always recommended to use a specific version for reproducibility, as the main branch might change.

As previously mentioned, the BM5-clean repository is already an organized sub-version, so it is very simple to create the bm5-input.list file with a few bash commands:

git clone https://github.com/haddocking/BM5-clean.git ~/projects/benchmarking/BM5-clean && \
  cd ~/projects/benchmarking/BM5-clean && \
  git checkout v1.1 && \
  ls ~/projects/benchmarking/BM5-clean/HADDOCK-ready/**/*.{pdb,tbl} | grep -v "ana_scripts\|matched\|cg" | sort > bm5-input.list && \
  cp bm5-input.list ~/projects/benchmarking/ && \
  cd ~/projects/benchmarking

Prepare a haddock3.sh script

See below an example of a haddock3.sh script that can be used to run HADDOCK3.0 locally:

#!/bin/bash
source /opt/conda/etc/profile.d/conda.sh
conda activate env
haddock3 "$@"

Make sure this script is executable:

chmod +x ~/projects/benchmarking/haddock3.sh

Prepare the bm5.yaml configuration file

Below is a template for the bm5.yaml configuration file using haddock3; keep in mind that this must be adapted to your specific setup!

general:
  executable: /home/dev/projects/benchmarking/haddock3.sh
  max_concurrent: 100
  haddock_dir: /opt/haddock3
  receptor_suffix: _r_u
  ligand_suffix: _l_u
  input_list: /home/dev/projects/benchmarking/bm5-input.list
  work_dir: /home/dev/projects/benchmarking/my-benchmarking

slurm:
  cpus_per_task: 8

scenarios:
  - name: true-interface
    parameters:
      general:
        mode: local
        ncores: 8

      modules:
        order: [topoaa, rigidbody, seletop, flexref, emref, caprieval]
        topoaa:
          autohis: true
        rigidbody:
          sampling: 1000
          ambig_fname: _ti.tbl
          unambig_fname: _unambig.tbl
          ligand_top_fname: _ligand.top
          ligand_param_fname: _ligand.param
        seletop:
          select: 200
        flexref:
          ambig_fname: _ti.tbl
          unambig_fname: _unambig.tbl
          ligand_top_fname: _ligand.top
          ligand_param_fname: _ligand.param
        emref: ~
        caprieval:
          reference_fname: _ref.pdb

  - name: center-of-mass
    parameters:
      general:
        mode: local
        ncores: 8

      modules:
        order: [topoaa, rigidbody, seletop, caprieval]
        topoaa:
          autohis: true
        rigidbody:
          sampling: 10000
          cmrest: true
        seletop:
          select: 400
        caprieval:
          reference_fname: _ref.pdb

  - name: random-restraints
    parameters:
      general:
        mode: local
        ncores: 8

      modules:
        order: [topoaa, rigidbody, seletop, caprieval]
        topoaa:
          autohis: true
        rigidbody:
          sampling: 10000
          ranair: true
        seletop:
          select: 400
        caprieval:
          reference_fname: _ref.pdb

Run the benchmarking

Finally, run the benchmark with the following command:

haddock-runner bm5.yaml

Development

The code repository contains a DevContainer configuration that can be used to set up a development environment; have a look at Developing inside a Container for more information.

The only caveat is that a cns binary must be in the .devcontainer path before building the container.

The development container comes pre-configured with Go, Haddock3 and Slurm.

This is the recommended way to develop the tool, as it will ensure that the development environment is consistent across different platforms with minimal setup.