Welcome to the haddock-runner docs
The haddock-runner is an effort to reduce code duplication and to streamline the execution of HADDOCK benchmarks.
It is a standalone program, freely available at https://github.com/haddocking/haddock-runner.
It is designed to be used with the production-ready HADDOCK2.4, the pre-release HADDOCK2.5 and the experimental (unpublished) HADDOCK3 versions.
When running a benchmark, users/developers may be interested in the following (in no specific order):
- The quality of the docking results when using different parameters
- Comparing the results of different versions
- The time it takes to run HADDOCK on a set of targets
Have a look at the menu on the left for more information on how to use it.
Getting help
If you encounter any issues or have any questions, please open an issue on the GitHub repository, contact us at bonvinlab.support@uu.nl or join the BioExcel forum and post your question there.
How does it work?
The execution of the haddock-runner consists of a few steps:
- Set up the benchmark
  - Copy the target structures to the location where the HADDOCK run will be executed
- Set up the HADDOCK run
  - For HADDOCK2.4, writing the run.param file and executing the haddock2.4 program once to set up the folder structure
  - For HADDOCK3, writing the run.toml
- Distribute several HADDOCK runs in an HPC-friendly manner
The final goal of haddock-runner is to automate these steps, additionally giving the user the possibility of setting up various scenarios.
A scenario is a set of parameters that will be used to run HADDOCK. For example, a user may want to run HADDOCK against a set of targets with different sampling values, different restraints, different parameters, etc.
Installation
The tool is designed for users/students/developers who are familiar with HADDOCK and command-line scripting and who have access to an HPC infrastructure.
If this is the first time you are using HADDOCK, please first familiarize yourself with the software by running the basic HADDOCK2.4 or HADDOCK3 tutorials.
This tool is not meant to be used by end-users who want to run a single target, or a small set of targets; for that purpose we recommend instead using the HADDOCK2.4 web server.
VERY IMPORTANT: You need to have HADDOCK installed on your system. This is not covered in this documentation. Please refer to the HADDOCK2.4 installation instructions or HADDOCK3.0 repository for more information.
haddock-runner is a standalone open-source software licensed under Apache 2.0 and freely available from the following repository: github.com/haddocking/haddock-runner.
To use it, simply download the latest binary from the releases page:
$ wget https://github.com/haddocking/haddock-runner/releases/download/v1.10.0/haddock-runner_1.10.0_linux_386.tar.gz
$ tar -zxvf haddock-runner_1.10.0_linux_386.tar.gz
$ ./haddock-runner -version
haddock-runner version v1.10.0
Alternatively, you can build the latest version from source (you probably don't need to do that). Make sure go is installed and run the following commands:
$ git clone https://github.com/haddocking/haddock-runner.git
$ cd haddock-runner
$ go build -o haddock-runner
$ ./haddock-runner -version
haddock-runner version v1.10.0
Usage
This chapter will go over the steps needed to use haddock-runner.
- Writing the input file list of the targets, input.list
- Writing a run-haddock.sh script
- Preparing the configuration file, benchmark.yaml
- Running haddock-runner
Writing an input.list file
The input list is a flat text file with the paths of the targets:
# input.list
/home/rodrigo/projects/haddock-benchmark/data/complex1_r_u.pdb
/home/rodrigo/projects/haddock-benchmark/data/complex1_l_u.pdb
/home/rodrigo/projects/haddock-benchmark/data/complex1_ti.tbl
#
# comments are allowed, use it to organize your file
#
/home/rodrigo/projects/haddock-benchmark/data/complex2_r_u.pdb
/home/rodrigo/projects/haddock-benchmark/data/complex2_l_u.pdb
/home/rodrigo/projects/haddock-benchmark/data/complex2_ti.tbl
/home/rodrigo/projects/haddock-benchmark/data/complex2_ligand.top
/home/rodrigo/projects/haddock-benchmark/data/complex2_ligand.param
Note that this file must follow the pattern:
path/to/the/structure/NAME_receptor_suffix.pdb
path/to/the/structure/NAME_ligand_suffix.pdb
In the above example, complex1 and complex2 thus correspond to NAME, identifying the complex which is modelled.
Each PDB file (indicated by the .pdb extension) has a suffix. This is extremely important, as it will be used to organize the data. For example, the file complex1_r_u.pdb is the receptor of the target complex1 and complex1_l_u.pdb is the ligand of the same target.
In this example the suffixes are:
receptor_suffix: _r_u
ligand_suffix: _l_u
These suffixes are defined in the benchmark.yaml file, see here for more details.
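To make the naming pattern concrete, here is a minimal sketch of how a target NAME can be recovered from a file path and a configured suffix. The function name target_name is hypothetical and this is only an illustration of the convention, not how haddock-runner parses names internally. Ensemble members with a trailing number (described further below) are not handled here.

```shell
#!/bin/bash
# Hypothetical helper: strip the directory, the .pdb extension and the
# configured suffix from a path, leaving the target NAME.
target_name() {
  local path="$1" suffix="$2"
  local base
  base="$(basename "$path" .pdb)"   # e.g. complex1_r_u
  echo "${base%"$suffix"}"          # e.g. complex1
}

target_name /data/complex1_r_u.pdb _r_u   # prints: complex1
target_name /data/complex2_l_u.pdb _l_u   # prints: complex2
```

If the suffix does not match the end of the file name, the name is returned unchanged, which is a quick way to spot files that do not follow the pattern.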
The same logic applies to the restraints files. In the example above, the pattern for the ambiguous restraint can be defined as ambig: "ti", so the file complex1_ti.tbl will be used as the ambiguous restraint for the target complex1, complex2_ti.tbl for the target complex2, etc. See section 3.2.2 for information specific to the definition of restraints when setting up a HADDOCK3.0 run.
HADDOCK supports many modified amino acids/bases/glycans/ions (check the full list). However, if your target molecule is not present in this library, you can also provide it following the same logic: topology: "_ligand.top" and param: "_ligand.param" will use the files complex2_ligand.top and complex2_ligand.param for the target complex2.
IMPORTANT: For ensembles, provide each model individually and append a number to the suffix, for example: complex1_l_u_1.pdb, complex1_l_u_2.pdb, etc.
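If your ensemble is stored as a single multi-model PDB file, the individual numbered files can be produced with a few lines of shell. The sketch below assumes the models are delimited by standard MODEL/ENDMDL records. The function name split_ensemble is hypothetical and it is not part of haddock-runner.

```shell
#!/bin/bash
# Hypothetical helper: write each MODEL/ENDMDL block of a multi-model PDB
# to its own file, named <prefix>_1.pdb, <prefix>_2.pdb, ...
split_ensemble() {
  local ensemble="$1" prefix="$2"
  awk -v prefix="$prefix" '
    /^MODEL/  { n++; out = prefix "_" n ".pdb"; next }        # start a new model
    /^ENDMDL/ { print "END" > out; close(out); out = ""; next } # finish it
    out != "" { print > out }                                 # lines inside a model
  ' "$ensemble"
}

# Usage (paths are placeholders):
# split_ensemble complex1_ligand_ensemble.pdb complex1_l_u
```

Each resulting file can then be listed on its own line in the input.list.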
See below a full example of the input.list file:
# -------------------------------- #
# 1A2K
./example/1A2K/1A2K_r_u.pdb
./example/1A2K/1A2K_l_u.pdb
./example/1A2K/1A2K_ligand.top
./example/1A2K/1A2K_ligand.param
./example/1A2K/1A2K_ti.tbl
./example/1A2K/1A2K_unambig.tbl
# 1GGR
./example/1GGR/1GGR_r_u.pdb
./example/1GGR/1GGR_l_u_1.pdb
./example/1GGR/1GGR_l_u_2.pdb
./example/1GGR/1GGR_l_u_3.pdb
./example/1GGR/1GGR_l_u_4.pdb
./example/1GGR/1GGR_l_u_5.pdb
./example/1GGR/1GGR_ti.tbl
# 1PPE
./example/1PPE/1PPE_l_u.pdb
./example/1PPE/1PPE_r_u.pdb
./example/1PPE/1PPE_ti.tbl
./example/1PPE/1PPE_hb.tbl
./example/1PPE/1PPE_unambig.tbl
# 2OOB
./example/2OOB/2OOB_l_u.pdb
./example/2OOB/2OOB_r_u.pdb
./example/2OOB/2OOB_ti.tbl
./example/2OOB/2OOB_hb.tbl
# -------------------------------- #
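Before starting a benchmark it can be worth checking that every path in the list actually exists. The following is a minimal sketch of such a check. The function name missing_inputs is hypothetical and haddock-runner may perform its own validation. Note that relative paths, as in the example above, are resolved against the directory from which the check is run.

```shell
#!/bin/bash
# Hypothetical helper: print every path listed in an input list that does not
# exist on disk. Comment lines (starting with #) and blank lines are skipped.
missing_inputs() {
  local list="$1" line
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      ''|'#'*) continue ;;          # ignore blanks and comments
    esac
    [ -f "$line" ] || echo "$line"  # report paths that are not regular files
  done < "$list"
}

# Usage:
# missing_inputs input.list
```

An empty output means all listed files were found.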
Writing a run-haddock.sh script
The run-haddock.sh script is a bash script that will be executed by haddock-runner for each target. The purpose of this script is to provide an "adapter" to account for different HADDOCK versions, different python versions and even different operating systems and configurations on your cluster.
This script should contain all the commands necessary to run HADDOCK and it must be customized for your installation, for example:
haddock24.sh
#!/bin/bash
#===============================================================================
# HADDOCK2.4 runs on python2.7, which is EOL.
# This script is a workaround to run HADDOCK with a custom python2 installation
## With pyenv
export PYENV_ROOT="$HOME/.pyenv"
command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"
pyenv shell 2.7.18
## With Anaconda
# source $HOME/miniconda3/etc/profile.d/conda.sh
# conda create -n haddock24_env python=2.7
# conda activate haddock24_env
#===============================================================================
# Configure HADDOCK2.4
export HADDOCK="$HOME/repos/haddock24"
export HADDOCKTOOLS="$HADDOCK/tools"
export PYTHONPATH="${PYTHONPATH}:$HADDOCK"
python "$HADDOCK/Haddock/RunHaddock.py"
#===============================================================================
haddock3.sh
#!/bin/bash
#===============================================================================
HADDOCK3_DIR="$HOME/repos/haddock3"
# Activate the virtual environment
source "$HADDOCK3_DIR/venv/bin/activate" || exit
### Or if installed with conda
## source $HOME/miniconda3/etc/profile.d/conda.sh
## conda activate haddock3
# Mind the "$@" at the end, this is necessary to pass the arguments to the script
haddock3 "$@"
#===============================================================================
Writing a benchmark.yaml file
The benchmark.yaml file is a configuration file in YAML format that will be used by haddock-runner to run the benchmark. The whole idea is that one configuration file can define multiple scenarios, each scenario being a set of parameters that will be used to run HADDOCK.
This file should be the replicable part of the benchmark, i.e. the part that you want to share with others. It should contain all the information needed to run the benchmark, alongside the input list.
This file is divided into 3 main sections: general, slurm and scenarios.
General section
Here you must define the following parameters:
- executable: Path to the run-haddock.sh script (see here for more details)
- max_concurrent: Maximum number of jobs that can be executed at a given time
- haddock_dir: Path to the HADDOCK installation; this is used to validate the parameters of the scenarios section
- receptor_suffix: The suffix used to identify the receptor files
- ligand_suffix: This will be used to identify the ligand files
- shape_suffix: This will be used to identify the shape files
- input_list: The path to the input list (see here for more details)
- work_dir: The path where the results will be stored
See below an example:
general:
  executable: /workspaces/haddock-runner/example/haddock3.sh
  max_concurrent: 4
  haddock_dir: /opt/haddock3
  receptor_suffix: _r_u
  ligand_suffix: _l_u
  input_list: /workspaces/haddock-runner/example/input_list.txt
  work_dir: /workspaces/haddock-runner/bm-goes-here
Slurm section
This section is optional but highly recommended! For these to take effect you must be running the benchmark in an HPC environment. These will be used internally by the runner to compose the .job file. Here you can define the following parameters; if left blank, SLURM will pick up the default values:
- partition: The name of the partition to be used
- cpus_per_task: Number of CPUs per task
- ntasks_per_node: Number of tasks per node
- nodes: Number of nodes
- time: Maximum time for the job
- account: Account to be used
- mail_user: Email to be notified when the job starts and ends
See below an example:
slurm:
  partition: short # use the short partition
  cpus_per_task: 8 # use 8 cores per task
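For orientation, the parameters above correspond to standard sbatch directives. The sketch below prints the directives that would match the example values. It is an illustration only: the function name slurm_header is hypothetical, and the exact contents of the .job file composed by haddock-runner are not shown in this documentation and may differ.

```shell
#!/bin/bash
# Illustration only: print sbatch directives for the given values.
# Empty values are skipped, so SLURM would fall back to its defaults.
slurm_header() {
  local partition="$1" cpus_per_task="$2"
  echo "#!/bin/bash"
  [ -n "$partition" ] && echo "#SBATCH --partition=$partition"
  [ -n "$cpus_per_task" ] && echo "#SBATCH --cpus-per-task=$cpus_per_task"
  return 0
}

slurm_header short 8
# prints:
# #!/bin/bash
# #SBATCH --partition=short
# #SBATCH --cpus-per-task=8
```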
Scenario section
Here you must define the scenarios that you want to run. These are slightly different for HADDOCK2.4 and HADDOCK3.0.
HADDOCK2.4
For HADDOCK2.4 you must define the following:
- name: the name of the scenario
- parameters: the parameters to be used in the scenario
  - run_cns: parameters that will be used in the run.cns file
  - restraints: patterns used to identify the restraints files
    - ambig: pattern used to identify the ambiguous restraints file
    - unambig: pattern used to identify the unambiguous restraints file
    - hbonds: pattern used to identify the hydrogen bonds restraints file
  - custom_toppar: patterns used to identify the custom topology files
    - topology: pattern used to identify the topology file
    - param: pattern used to identify the parameter file
# HADDOCK2.4
scenarios:
  - name: true-interface
    parameters:
      run_cns:
        noecv: false
        structures_0: 1000
        structures_1: 200
        waterrefine: 200
      restraints:
        ambig: ti
HADDOCK3.0
Note: HADDOCK3.0 is still under development and is not meant to be used for production runs! Please use HADDOCK2.4 instead. For information about the available modules, please refer to the HADDOCK3 tutorial and the documentation.
For HADDOCK3.0 you must define the following:
- name: the name of the scenario
- parameters: the parameters to be used in the scenario
  - general: general parameters; those are the ones defined in the "top" section of the run.toml script
  - modules: this subsection is related to the parameters of each module in HADDOCK3.0
    - order: the order of the modules to be used in HADDOCK3.0
    - <module-name>: parameters for the module
# HADDOCK3.0
scenarios:
  - name: true-interface
    parameters:
      general:
        mode: local
        ncores: 4
      modules:
        order: [topoaa, rigidbody, seletop, flexref, emref]
        topoaa:
          autohis: true
        rigidbody:
          ambig_fname: _ti.tbl
        seletop:
          select: 200
        flexref:
        emref:
Running haddock-runner
Assuming the input list and the .yaml configuration file have been properly set up, you can run the benchmark by executing haddock-runner simply with:
haddock-runner my-benchmark-config-file.yaml
haddock-runner will read the input file, create the working directory, copy the input files to a data/ directory and start the benchmark. Make sure you have enough space on your disk to store the input files and the results.
Restarting a benchmark
In v1.7.0 we introduced the possibility to restart a benchmark. This is useful when you want to continue a benchmark that was interrupted for some reason. To restart a benchmark you must have the benchmark.yaml file and the input.list file used in the original benchmark. The benchmark.yaml file must have the work_dir parameter set to the directory where the original benchmark was run.
Just run it again without the need of any special flags or parameters:
haddock-runner my-benchmark-config-file.yaml
haddock-runner should automagically detect which runs are completed and which are not. It does this by searching the log produced by HADDOCK (both v2 and v3) and, based on keywords, assigning a status to each run at runtime.
I1116 13:28:09.721754 58085 main.go:192] ############################################
W1116 13:28:09.721797 58085 main.go:207] +++ 2OOB_true-interface is INCOMPLETE - restarting +++
I1116 13:28:09.721810 58085 main.go:204] 1GGR_center-of-mass - DONE - skipping
I1116 13:28:09.721823 58085 main.go:204] 1A2K_random-restraints - DONE - skipping
W1116 13:28:09.721988 58085 main.go:207] +++ 1GGR_true-interface is INCOMPLETE - restarting +++
I1116 13:28:09.721999 58085 main.go:204] 1GGR_random-restraints - DONE - skipping
I1116 13:28:09.722030 58085 main.go:204] 1A2K_center-of-mass - DONE - skipping
I1116 13:28:09.722010 58085 main.go:204] 1PPE_random-restraints - DONE - skipping
W1116 13:28:09.722072 58085 main.go:207] +++ 1PPE_true-interface is INCOMPLETE - restarting +++
W1116 13:28:09.722087 58085 main.go:207] +++ 1A2K_true-interface is INCOMPLETE - restarting +++
I1116 13:28:09.722165 58085 main.go:204] 1PPE_center-of-mass - DONE - skipping
I1116 13:28:09.722041 58085 main.go:204] 2OOB_center-of-mass - DONE - skipping
I1116 13:28:09.722483 58085 main.go:204] 2OOB_random-restraints - DONE - skipping
I1116 13:28:57.531951 58085 main.go:226] 2OOB_true-interface - DONE in 47.81 seconds
I1116 13:29:46.939726 58085 main.go:226] 1GGR_true-interface - DONE in 97.22 seconds
I1116 13:29:56.830500 58085 main.go:226] 1PPE_true-interface - DONE in 107.11 seconds
I1116 13:30:40.741859 58085 main.go:226] 1A2K_true-interface - DONE in 151.02 seconds
I1116 13:30:40.741907 58085 main.go:235] ############################################
To make sure the results are consistent, it will create a checksum of both the configuration yaml and the input txt, and show you a warning if either has changed. This ensures that parameters and input have not changed mid-execution.
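If you want an independent record of what a benchmark was run with, the same idea can be reproduced by hand. The sketch below is a manual check under stated assumptions: it uses sha256sum from coreutils, the function name benchmark_fingerprint is hypothetical, and the hash and storage location used internally by haddock-runner are not documented here and may differ.

```shell
#!/bin/bash
# Hypothetical helper: print one combined SHA-256 fingerprint for a
# configuration file and an input list.
benchmark_fingerprint() {
  cat "$1" "$2" | sha256sum | cut -d ' ' -f 1
}

# Usage (file names are placeholders):
# benchmark_fingerprint benchmark.yaml input.list > fingerprint.txt
# ... later, before restarting ...
# [ "$(benchmark_fingerprint benchmark.yaml input.list)" = "$(cat fingerprint.txt)" ] \
#   || echo "configuration or input list changed since the original run"
```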
Examples
Here is a full example of the benchmark.yaml file for both HADDOCK2.4 and HADDOCK3.0.
HADDOCK2.4
general:
  executable: /workspaces/haddock-runner/haddock24.sh
  max_concurrent: 2
  haddock_dir: /Users/rodrigo/repos/haddock
  receptor_suffix: _r_u
  ligand_suffix: _l_u
  input_list: /workspaces/haddock-runner/example/input_list.txt
  work_dir: /workspaces/haddock-runner/bm-goes-here
scenarios:
  - name: true-interface
    parameters:
      run_cns:
        noecv: false
        structures_0: 1000
        structures_1: 200
        waterrefine: 200
      restraints:
        ambig: ambig
        unambig: restraint-bodies
        hbonds: hbonds
      custom_toppar:
        topology: _ligand.top
        param: _ligand.param
  - name: center-of-mass
    parameters:
      run_cns:
        cmrest: true
        structures_0: 10000
        structures_1: 400
        waterrefine: 400
        anastruc_1: 400
      custom_toppar:
        topology: _ligand.top
        param: _ligand.param
  - name: random-restraints
    parameters:
      run_cns:
        ranair: true
        structures_0: 10000
        structures_1: 400
        waterrefine: 400
        anastruc_1: 400
      custom_toppar:
        topology: _ligand.top
        param: _ligand.param
#-----------------------------------------------
HADDOCK3.0
general:
  executable: /workspaces/haddock-runner/example/haddock3.sh
  max_concurrent: 4
  haddock_dir: /opt/haddock3
  receptor_suffix: _r_u
  ligand_suffix: _l_u
  input_list: /workspaces/haddock-runner/example/input_list.txt
  work_dir: /workspaces/haddock-runner/bm-goes-here
scenarios:
  - name: true-interface
    parameters:
      general:
        mode: local
        ncores: 8
      modules:
        order:
          [topoaa, rigidbody, seletop, flexref, emref, clustfcc, seletopclusts]
        topoaa:
          autohis: true
        rigidbody:
          ambig_fname: _ambig.tbl
          unambig_fname: _restraint-bodies.tbl
          ligand_param_fname: _ligand.param
          ligand_top_fname: _ligand.top
        seletop:
          select: 200
        flexref:
          ambig_fname: _ambig.tbl
          unambig_fname: _restraint-bodies.tbl
          ligand_param_fname: _ligand.param
          ligand_top_fname: _ligand.top
        emref:
          ambig_fname: _ambig
        clustfcc:
        seletopclusts:
  - name: center-of-mass
    parameters:
      general:
        mode: local
        ncores: 8
      modules:
        order:
          [topoaa, rigidbody, seletop, flexref, emref, clustfcc, seletopclusts]
        topoaa:
          autohis: true
        rigidbody:
          sampling: 10000
          cmrest: true
          ligand_param_fname: _ligand.param
          ligand_top_fname: _ligand.top
        seletop:
          select: 400
        flexref:
          ligand_param_fname: _ligand.param
          ligand_top_fname: _ligand.top
        emref:
        clustfcc:
        seletopclusts:
  - name: random-restraints
    parameters:
      general:
        mode: local
        ncores: 8
      modules:
        order:
          [topoaa, rigidbody, seletop, flexref, emref, clustfcc, seletopclusts]
        topoaa:
          autohis: true
        rigidbody:
          sampling: 10000
          ranair: true
          ligand_param_fname: _ligand.param
          ligand_top_fname: _ligand.top
        seletop:
          select: 400
        flexref:
          contactairs: true
          ligand_param_fname: _ligand.param
          ligand_top_fname: _ligand.top
        emref:
          contactairs: true
          ligand_param_fname: _ligand.param
          ligand_top_fname: _ligand.top
        clustfcc:
        seletopclusts:
#-----------------------------------------------
Setting up BM5
The Protein-Protein docking benchmark v5 (Vreven, 2015), namely BM5, is a large set of non-redundant high-quality structures; check here for the full set.
The BonvinLab provides a HADDOCK-ready sub-version of the BM5 which can be easily used as input for haddock-runner. This version is available in the following repository: github.com/haddocking/BM5-clean. Below we will go over step-by-step instructions on how to use it as input.
Create a working directory
Create a working directory and change to it:
mkdir -p ~/projects/benchmarking && cd ~/projects/benchmarking
Download the BM5-clean and create a bm5-input.list file
Clone the repository and check out a version. Note that it's always recommended to use a specific version for reproducibility, as the main branch might change.
As previously mentioned, the BM5-clean repository is already an organized sub-version, thus it's very simple to create the bm5-input.list file with a few bash commands:
git clone https://github.com/haddocking/BM5-clean.git ~/projects/benchmarking/BM5-clean && \
cd ~/projects/benchmarking/BM5-clean && \
git checkout v1.1 && \
ls ~/projects/benchmarking/BM5-clean/HADDOCK-ready/**/*.{pdb,tbl} | grep -v "ana_scripts\|matched\|cg" | sort > bm5-input.list && \
cp bm5-input.list ~/projects/benchmarking/ && \
cd ~/projects/benchmarking
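A quick sanity check on the generated list is to count how many targets it contains. The sketch below counts one receptor file per target and assumes the _r_u receptor suffix used throughout this documentation. The function name count_targets is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count the receptor files in an input list, ignoring
# comment lines. With one receptor per target this equals the target count.
count_targets() {
  local list="$1" suffix="$2"
  grep -v '^#' "$list" | grep -c -- "${suffix}\.pdb\$"
}

# Usage:
# count_targets ~/projects/benchmarking/bm5-input.list _r_u
```

If the number is lower than expected, check that the glob and the grep filter in the command above matched your checkout.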
Prepare a haddock3.sh script
See below an example of a haddock3.sh script that can be used to run HADDOCK3.0 locally:
#!/bin/bash
source /opt/conda/etc/profile.d/conda.sh
conda activate env
haddock3 "$@"
Make sure this script is executable:
chmod +x ~/projects/benchmarking/haddock3.sh
Prepare the bm5.yaml configuration file
Below is a template for the bm5.yaml configuration file using haddock3; keep in mind that this must be adapted to your specific setup!
general:
  executable: /home/dev/projects/benchmarking/haddock3.sh
  max_concurrent: 100
  haddock_dir: /opt/haddock3
  receptor_suffix: _r_u
  ligand_suffix: _l_u
  input_list: /home/dev/projects/benchmarking/bm5-input.list
  work_dir: /home/dev/projects/benchmarking/my-benchmarking
slurm:
  cpus_per_task: 8
scenarios:
  - name: true-interface
    parameters:
      general:
        mode: local
        ncores: 8
      modules:
        order: [topoaa, rigidbody, seletop, flexref, emref, caprieval]
        topoaa:
          autohis: true
        rigidbody:
          sampling: 1000
          ambig_fname: _ti.tbl
          unambig_fname: _unambig.tbl
          ligand_top_fname: _ligand.top
          ligand_param_fname: _ligand.param
        seletop:
          select: 200
        flexref:
          ambig_fname: _ti.tbl
          unambig_fname: _unambig.tbl
          ligand_top_fname: _ligand.top
          ligand_param_fname: _ligand.param
        emref: ~
        caprieval:
          reference_fname: _ref.pdb
  - name: center-of-mass
    parameters:
      general:
        mode: local
        ncores: 8
      modules:
        order: [topoaa, rigidbody, seletop, caprieval]
        topoaa:
          autohis: true
        rigidbody:
          sampling: 10000
          cmrest: true
        seletop:
          select: 400
        caprieval:
          reference_fname: _ref.pdb
  - name: random-restraints
    parameters:
      general:
        mode: local
        ncores: 8
      modules:
        order: [topoaa, rigidbody, seletop, caprieval]
        topoaa:
          autohis: true
        rigidbody:
          sampling: 10000
          ranair: true
        seletop:
          select: 400
        caprieval:
          reference_fname: _ref.pdb
Run the benchmarking
Finally, run the benchmark with the following command:
haddock-runner bm5.yaml
Development
The code repository contains a DevContainer configuration that can be used to set up a development environment; have a look at Developing inside a Container for more information.
The only caveat is that a cns binary must be in the .devcontainer path before building the container.
The development container comes pre-configured with Go, Haddock3 and Slurm.
This is the recommended way to develop the tool, as it will ensure that the development environment is consistent across different platforms with minimal setup.