PDB preprocessing

Process input PDB files to ensure compatibility with HADDOCK3.

This module checks and modifies PDB files for compatibility with HADDOCK3. There are three types of checks/modifications:

1) Performed on each PDB line by line, in the same fashion as pdb-tools. In fact, this step mostly uses the pdb-tools package.
2) Performed on each PDB as a whole.
3) Performed on all PDBs together.

Main functions

process_pdbs()
read_additional_residues()

Corrections performed on 1)

The following actions are performed sequentially over all PDBs:

from pdb-tools: pdb_keepcoord
from pdb-tools: pdb_tidy with strict=True
from pdb-tools: pdb_element
from pdb-tools: pdb_selaltloc
from pdb-tools: pdb_occ with occupancy=1.00
replace MSE with MET
replace HSD with HIS
replace HSE with HIS
replace HID with HIS
replace HIE with HIS
add charges to ions. See add_charges_to_ions().
convert ATOM to HETATM for those atoms that should be HETATM. Considers the additional residues provided by the user. See convert_ATOM_to_HETATM().
convert HETATM to ATOM for those atoms that should be ATOM. See convert_HETATM_to_ATOM().
from pdb-tools: pdb_fixinsert, with option_list=[].
remove unsupported HETATM. Considers residues provided by the user.
remove unsupported ATOM. Considers residues provided by the user.
from pdb-tools: pdb_reatom, start from 1.
from pdb-tools: pdb_tidy with strict=True
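
Because each step in the list above consumes lines and produces lines, the steps can be chained so that the output of one feeds the next. The following is a minimal sketch of that pattern only. `keep_coordinates`, `renumber_atoms`, and `run_steps` are hypothetical stand-ins written for this illustration; they are not the pdb-tools functions and not necessarily how this module wires the steps together.

```python
from functools import reduce
from typing import Callable, Iterable, Iterator, Sequence

Step = Callable[[Iterable[str]], Iterator[str]]


def keep_coordinates(lines: Iterable[str]) -> Iterator[str]:
    """Hypothetical filtering step: keep coordinate-related records only."""
    for line in lines:
        if line.startswith(("ATOM", "HETATM", "TER", "END")):
            yield line


def renumber_atoms(lines: Iterable[str], start: int = 1) -> Iterator[str]:
    """Hypothetical renumbering step: atom serials count up from `start`."""
    serial = start
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            # The atom serial number occupies columns 7-11.
            line = line[:6] + f"{serial:>5}" + line[11:]
            serial += 1
        yield line


def run_steps(lines: Iterable[str], steps: Sequence[Step]) -> list[str]:
    """Feed the output of each step into the next one, in order."""
    return list(reduce(lambda acc, step: step(acc), steps, lines))


example = [
    "REMARK example input",
    "ATOM      7  N   ALA A   1",
    "HETATM   12 ZN    ZN A   2",
    "END",
]
processed = run_steps(example, [keep_coordinates, renumber_atoms])
```

The order of the steps matters: here the REMARK line is dropped before renumbering, and the two remaining atoms receive serials 1 and 2.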

Corrections performed on 2)

The following actions are performed sequentially for each PDB:

models_should_have_the_same_labels()
solve_no_chainID_no_segID()
homogenize_chains()

Read the documentation of the above functions for details on what they do.

Corrections performed on 3)

The following actions are performed to all PDBs together:

correct_equal_chain_segids()

Read the documentation of the above function for details on what it does.

When it happens

The PDB processing step is performed by default when reading the input molecules and copying them to the data/ folder inside the run directory. When PDBs are processed, a copy of the original input PDBs is also stored in the data/ folder.

To deactivate this initial PDB processing, set skip_preprocess = True in the general parameters of the configuration file.

Additional information

If you are a developer and want to read more about the history of this preprocessing module, visit:

https://github.com/haddocking/haddock3/projects/16

exception haddock.gear.preprocessing.ModelsDifferError

Bases: HaddockError

MODELS of the PDB differ in atom labels.

haddock.gear.preprocessing.add_charges_to_ions(fhandler: Iterable[str]) → Generator[str, None, None]

Add charges to ions according to HADDOCK3 specifications.

Check if charge is correctly defined in residue name. If so, yield the line with correct residue name and charge at the end.
Check if charge is correctly defined in atom name.
Create charge from element. This might need manual editing in case the atom has an unconventional charge.

Parameters: fhandler (file handler, list, or list-like) – Lines of the PDB file. This function consumes the lines in a for loop; keep that in mind if you pass a generator.
Yields: line (str) – Line-by-line: modified ion lines and any other line.

haddock.gear.preprocessing.convert_ATOM_to_HETATM(fhandler: Iterable[str], *, record: str = 'ATOM', other_record: str = 'HETATM', residues: Container[str] = {'A2G', 'ABE', 'ACD', 'ACN', 'ACT', 'ADN', 'ADP', 'ADY', 'AG', 'AG1', 'AL', 'AL3', 'AMN', 'AMP', 'AR', 'AS', 'ATP', 'AU', 'AU1', 'AU3', 'BDP', 'BDY', 'BEN', 'BGC', 'BMA', 'BR', 'BR1', 'BUT', 'CA', 'CA2', 'CD', 'CD2', 'CHE', 'CIT', 'CL', 'CL1', 'CO', 'CO2', 'CO3', 'COH', 'COM', 'CR', 'CR2', 'CR3', 'CS', 'CS1', 'CU', 'CU1', 'CU2', 'CYA', 'DFO', 'DME', 'DMS', 'DOD', 'EOL', 'ETA', 'ETH', 'F', 'F1', 'FAD', 'FCA', 'FCB', 'FE', 'FE2', 'FE3', 'FLC', 'FUC', 'FUL', 'GAL', 'GDP', 'GLA', 'GLC', 'GMP', 'GTP', 'GXL', 'HEB', 'HEC', 'HG', 'HG1', 'HG2', 'HO', 'HO3', 'HOH', 'I', 'I1', 'IMI', 'IR', 'IR3', 'K', 'K1', 'KR', 'LI1', 'MAG', 'MAN', 'MER', 'MG', 'MG2', 'MIY', 'MMA', 'MN', 'MN2', 'MN3', 'MO', 'MO3', 'NA', 'NA1', 'NAD', 'NAG', 'NAP', 'NDG', 'NDP', 'NGA', 'NI', 'NI2', 'O2', 'OS', 'OS4', 'PB', 'PB2', 'PHN', 'PO4', 'PT', 'PT2', 'RAM', 'SIA', 'SIB', 'SO4', 'SR', 'SR2', 'THS', 'TIP', 'TIP3', 'U3', 'U4', 'URE', 'V', 'V2', 'V3', 'WAT', 'WO4', 'XE', 'XYP', 'XYS', 'YB', 'YB2', 'YB3', 'ZN', 'ZN2'}) → Generator[str, None, None]

Convert ATOM to HETATM for HADDOCK3 supported HETATM.

See also

haddock.core.supported_molecules.supported_HETATM

haddock.gear.preprocessing.convert_HETATM_to_ATOM(fhandler: Iterable[str], *, record: str = 'HETATM', other_record: str = 'ATOM ', residues: Container[str] = {'A', 'ACE', 'ALA', 'ALY', 'ARG', 'ASH', 'ASN', 'ASP', 'C', 'CFE', 'CHX', 'CIR', 'CSP', 'CTN', 'CYC', 'CYF', 'CYM', 'CYS', 'DA', 'DC', 'DDZ', 'DG', 'DJ', 'DT', 'DUM', 'G', 'GLH', 'GLN', 'GLU', 'GLY', 'HIS', 'HLY', 'HY3', 'HYP', 'ILE', 'LEU', 'LYS', 'M3L', 'MET', 'MLY', 'MLZ', 'MSE', 'NEP', 'NME', 'PCA', 'PHE', 'PNS', 'PRO', 'PTR', 'QSR', 'SEC', 'SEP', 'SER', 'SHA', 'THR', 'TOP', 'TRP', 'TYP', 'TYR', 'TYS', 'U', 'VAL'}) → Generator[str, None, None]

Convert HETATM to ATOM for HADDOCK3 supported ATOM.

See also

haddock.core.supported_molecules.supported_ATOM

haddock.gear.preprocessing.convert_record(fhandler: Iterable[str], record: str, other_record: str, residues: Container[str]) → Generator[str, None, None]

Convert one record to another for specified residues.

For example, replace ATOM by HETATM for specific residues.

Parameters:

fhandler (list-like) – Contains lines of file.
record (str) – The PDB RECORD to match; for example, ATOM or HETATM.
other_record (str) – The PDB RECORD to replace with; for example, ATOM or HETATM.
residues (list, tuple, or set) – List of residues to replace the record.
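
The parameters above can be illustrated with a minimal sketch, assuming the standard PDB column layout (record name in columns 1-6, residue name in columns 18-20). `convert_record_sketch` is a hypothetical name used for this illustration and is not the library's implementation.

```python
from typing import Container, Iterable, Iterator


def convert_record_sketch(
    lines: Iterable[str],
    record: str,
    other_record: str,
    residues: Container[str],
) -> Iterator[str]:
    """Swap `record` for `other_record` on lines whose residue is in `residues`."""
    for line in lines:
        # The residue name lives in columns 18-20 of a coordinate line.
        if line.startswith(record) and line[17:20].strip() in residues:
            # The record field is six characters wide, hence the padding.
            yield other_record.ljust(6) + line[6:]
        else:
            yield line


lines = [
    "ATOM      1 ZN    ZN A   1",
    "ATOM      2  CA  ALA A   2",
]
converted = list(convert_record_sketch(lines, "ATOM", "HETATM", {"ZN"}))
```

Only the zinc line changes its record; the alanine line passes through untouched because ALA is not in the given set.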

haddock.gear.preprocessing.correct_equal_chain_segids(structures: list[list[str]]) → list[list[str]]

Correct for repeated chainID in the input PDB files.

Repeated chain IDs are replaced by an upper case character ([A-Z]) in order.

Parameters: structures (list of lists of str) – The input data.
Returns: list of lists of str – The new structures.
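
One way to picture this correction is the sketch below. It rests on assumptions that the text above does not confirm: each structure carries a single chain ID, a clashing structure receives the first upper-case letter that no input uses, and only the chain ID column is rewritten. `chain_of`, `set_chain`, and `fix_repeated_chains` are hypothetical helpers, and the actual function may pick replacement letters differently.

```python
import string


def chain_of(lines: list[str]) -> str:
    """Return the chain ID (column 22) of the first coordinate line."""
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            return line[21]
    return " "


def set_chain(lines: list[str], chain: str) -> list[str]:
    """Write `chain` into column 22 of every coordinate line."""
    return [
        line[:21] + chain + line[22:]
        if line.startswith(("ATOM", "HETATM"))
        else line
        for line in lines
    ]


def fix_repeated_chains(structures: list[list[str]]) -> list[list[str]]:
    """Give each structure a distinct chain ID, keeping the original when free."""
    # Letters not used by any input are kept as spare IDs for duplicates.
    # Assumption: there are fewer clashes than spare letters.
    taken = {chain_of(s) for s in structures}
    spare = (c for c in string.ascii_uppercase if c not in taken)
    seen: set[str] = set()
    result = []
    for lines in structures:
        chain = chain_of(lines)
        if chain in seen:
            chain = next(spare)
            lines = set_chain(lines, chain)
        seen.add(chain)
        result.append(lines)
    return result


first = ["ATOM      1  CA  ALA A   1"]
second = ["ATOM      1  CA  GLY A   1"]
third = ["ATOM      1  CA  SER B   1"]
fixed = fix_repeated_chains([first, second, third])
```

In this example the second structure clashes with the first on chain A and is moved to chain C, since B is already used by the third structure.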

haddock.gear.preprocessing.homogenize_chains(lines: list[str]) → list[str]

Homogenize chainIDs within the same PDB.

If there are multiple chain identifiers in the PDB file, make all of them equal to the first one.

ChainIDs are copied to segIDs afterwards.

Returns: list – The modified lines.
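
The behaviour described above can be sketched as follows, assuming the standard PDB columns (chain ID in column 22, segID in columns 73-76) and coordinate lines of at least 22 characters. `homogenize_chains_sketch` is a hypothetical illustration, not the library code.

```python
COORD = ("ATOM", "HETATM")


def homogenize_chains_sketch(lines: list[str]) -> list[str]:
    """Set every chain ID to the first one found, then mirror it into the segID."""
    first_chain = next(
        (line[21] for line in lines if line.startswith(COORD)), None
    )
    if first_chain is None:
        return list(lines)  # no coordinate lines, nothing to homogenize

    new_lines = []
    for line in lines:
        if line.startswith(COORD):
            # Pad short lines so the segID field (columns 73-76) exists.
            line = line.ljust(76)
            line = (
                line[:21] + first_chain      # chain ID, column 22
                + line[22:72]
                + first_chain.ljust(4)       # segID, columns 73-76
                + line[76:]
            )
        new_lines.append(line)
    return new_lines


example = [
    "ATOM      1  CA  ALA A   1",
    "ATOM      2  CA  GLY B   2",
    "END",
]
homogenized = homogenize_chains_sketch(example)
```

Both atoms end up on chain A with segID A, and the END line is returned unchanged.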

haddock.gear.preprocessing.models_should_have_the_same_labels(lines: Iterable[str]) → Iterable[str]

Confirm models have the same labels.

In an ensemble of structures, where the PDB file has multiple MODELS, all models should have the same labels; hence the same number and type of atoms.

Parameters: lines (list of str) – List containing the lines of the PDB file. Must NOT be a generator.
Returns: list – The original lines in case no errors are found.
Raises: ModelsDifferError – In case MODELS differ. Reports which models differ.
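
A check of this kind can be sketched as below. The sketch assumes that "labels" means the atom name through insertion code fields (columns 13-27) and that every model is compared against the first one; both are assumptions made for the illustration. `ModelsDifferSketchError`, `split_models`, and `check_models` are hypothetical names.

```python
class ModelsDifferSketchError(Exception):
    """Hypothetical stand-in for ModelsDifferError."""


def split_models(lines: list[str]) -> dict[str, list[str]]:
    """Collect the atom labels (columns 13-27) of each MODEL, keyed by number."""
    models: dict[str, list[str]] = {}
    current = None
    for line in lines:
        if line.startswith("MODEL"):
            current = line[5:].strip()
            models[current] = []
        elif line.startswith("ENDMDL"):
            current = None
        elif current is not None and line.startswith(("ATOM", "HETATM")):
            # Atom name, altloc, residue name, chain, residue number, icode.
            models[current].append(line[12:27])
    return models


def check_models(lines: list[str]) -> list[str]:
    """Return `lines` unchanged, or raise if any model differs from the first."""
    models = split_models(lines)
    if models:
        reference_id, reference = next(iter(models.items()))
        differing = [m for m, labels in models.items() if labels != reference]
        if differing:
            raise ModelsDifferSketchError(
                f"models {differing} differ from model {reference_id}"
            )
    return lines


good = [
    "MODEL        1",
    "ATOM      1  CA  ALA A   1",
    "ENDMDL",
    "MODEL        2",
    "ATOM      1  CA  ALA A   1",
    "ENDMDL",
]
# Same ensemble, but the second model has a CB atom where the first has CA.
bad = good[:4] + ["ATOM      1  CB  ALA A   1", "ENDMDL"]
checked = check_models(good)
```

Calling `check_models(bad)` raises the error, because the lines are read twice (once to collect labels, once to return them), which is also why a generator would not work as input here.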

haddock.gear.preprocessing.process_pdbs(*inputdata: Iterable[str] | str | Path, dry: bool = False, user_supported_residues: Iterable[str] | None = None) → list[list[str]]

Process PDB file contents for compatibility with HADDOCK3.

Parameters:

inputdata (list of (str, path, list of str [lines], file handler)) – A flat list in which each item can be one of:
- a file object
- a path to a file
- a string representing a path
- a list or tuple of lines
These types can be mixed in the input list.

Files are read into lists of lines. Line separators are stripped.

Do not provide nested lists, that is, lists containing paths inside other lists.
dry (bool) – Perform a dry run, that is, change nothing and only report.
user_supported_residues (list, tuple, or set) – The additional residues that are allowed.

Returns:

list of (list of str) – The corrected (processed) PDB content in the same order as inputdata.
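
The accepted input types can be illustrated with a small sketch of the reading step alone. It assumes that plain strings are always treated as paths and that anything else is an iterable of lines; `read_to_lines` and `read_all` are hypothetical helpers and do not perform any of the corrections that process_pdbs() applies.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Union

PdbInput = Union[str, Path, Iterable[str]]


def read_to_lines(item: PdbInput) -> list[str]:
    """Turn one input item into a list of lines without line separators."""
    if isinstance(item, (str, Path)):
        # Strings are interpreted as paths to files.
        return Path(item).read_text().splitlines()
    # File objects and lists or tuples of lines are both iterables of str.
    return [line.rstrip("\r\n") for line in item]


def read_all(*inputdata: PdbInput) -> list[list[str]]:
    """Read every item, keeping the order in which the items were given."""
    return [read_to_lines(item) for item in inputdata]


with tempfile.TemporaryDirectory() as tmp:
    pdb_path = Path(tmp, "example.pdb")
    pdb_path.write_text("ATOM      1  CA  ALA A   1\nEND\n")
    with open(pdb_path) as handle:
        from_path, from_str, from_lines, from_file = read_all(
            pdb_path,
            str(pdb_path),
            ["ATOM      1  CA  ALA A   1\n", "END\n"],
            handle,
        )
```

All four inputs yield the same two lines, and the results come back in the order the inputs were passed, mirroring the ordering guarantee stated for the return value.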

haddock.gear.preprocessing.remove_unsupported_atom(lines: Iterable[str], *, haddock3_defined: set[str] | None = {'A', 'ACE', 'ALA', 'ALY', 'ARG', 'ASH', 'ASN', 'ASP', 'C', 'CFE', 'CHX', 'CIR', 'CSP', 'CTN', 'CYC', 'CYF', 'CYM', 'CYS', 'DA', 'DC', 'DDZ', 'DG', 'DJ', 'DT', 'DUM', 'G', 'GLH', 'GLN', 'GLU', 'GLY', 'HIS', 'HLY', 'HY3', 'HYP', 'ILE', 'LEU', 'LYS', 'M3L', 'MET', 'MLY', 'MLZ', 'MSE', 'NEP', 'NME', 'PCA', 'PHE', 'PNS', 'PRO', 'PTR', 'QSR', 'SEC', 'SEP', 'SER', 'SHA', 'THR', 'TOP', 'TRP', 'TYP', 'TYR', 'TYS', 'U', 'VAL'}, user_defined: set[str] | None = None, line_startswith: str | tuple[str, ...] = 'ATOM') → Generator[str, None, None]

Remove unsupported molecules in ATOM lines.

Uses remove_unsupported_molecules() by populating its haddock3_defined and line_startswith parameters.

See also

remove_unsupported_hetatm()

haddock.gear.preprocessing.remove_unsupported_hetatm(lines: Iterable[str], *, haddock3_defined: set[str] | None = {'A2G', 'ABE', 'ACD', 'ACN', 'ACT', 'ADN', 'ADP', 'ADY', 'AG', 'AG1', 'AL', 'AL3', 'AMN', 'AMP', 'AR', 'AS', 'ATP', 'AU', 'AU1', 'AU3', 'BDP', 'BDY', 'BEN', 'BGC', 'BMA', 'BR', 'BR1', 'BUT', 'CA', 'CA2', 'CD', 'CD2', 'CHE', 'CIT', 'CL', 'CL1', 'CO', 'CO2', 'CO3', 'COH', 'COM', 'CR', 'CR2', 'CR3', 'CS', 'CS1', 'CU', 'CU1', 'CU2', 'CYA', 'DFO', 'DME', 'DMS', 'DOD', 'EOL', 'ETA', 'ETH', 'F', 'F1', 'FAD', 'FCA', 'FCB', 'FE', 'FE2', 'FE3', 'FLC', 'FUC', 'FUL', 'GAL', 'GDP', 'GLA', 'GLC', 'GMP', 'GTP', 'GXL', 'HEB', 'HEC', 'HG', 'HG1', 'HG2', 'HO', 'HO3', 'HOH', 'I', 'I1', 'IMI', 'IR', 'IR3', 'K', 'K1', 'KR', 'LI1', 'MAG', 'MAN', 'MER', 'MG', 'MG2', 'MIY', 'MMA', 'MN', 'MN2', 'MN3', 'MO', 'MO3', 'NA', 'NA1', 'NAD', 'NAG', 'NAP', 'NDG', 'NDP', 'NGA', 'NI', 'NI2', 'O2', 'OS', 'OS4', 'PB', 'PB2', 'PHN', 'PO4', 'PT', 'PT2', 'RAM', 'SIA', 'SIB', 'SO4', 'SR', 'SR2', 'THS', 'TIP', 'TIP3', 'U3', 'U4', 'URE', 'V', 'V2', 'V3', 'WAT', 'WO4', 'XE', 'XYP', 'XYS', 'YB', 'YB2', 'YB3', 'ZN', 'ZN2'}, user_defined: set[str] | None = None, line_startswith: str | tuple[str, ...] = 'HETATM') → Generator[str, None, None]

Remove unsupported molecules in HETATM lines.

Uses remove_unsupported_molecules() by populating its haddock3_defined and line_startswith parameters.

See also

remove_unsupported_atom()

haddock.gear.preprocessing.remove_unsupported_molecules(lines: Iterable[str], haddock3_defined: set[str] | None = None, user_defined: set[str] | None = None, line_startswith: str | tuple[str, ...] = ('ATOM', 'HETATM')) → Generator[str, None, None]

Remove HADDOCK3 unsupported molecules.

This function is abstract and you need to provide the set of residues supported by HADDOCK3. See parameters.

Residues not provided in haddock3_defined and user_defined are removed from the PDB lines.

Other lines are yielded unmodified.

Parameters:

lines (list or list-like) – Lines of the PDB file. This function consumes the lines in a for loop; keep that in mind if you pass a generator.
haddock3_defined (set) – Set of residues supported by HADDOCK3. Defaults to None.
user_defined (set) – An additional set of allowed residues given by the user. Defaults to None.
line_startswith (tuple) – The lines to consider. Defaults to ("ATOM", "HETATM").

Yields:

line (str) – Line-by-line. Lines for residues not supported are not yielded.

See also

Other functions use this function to create context.

remove_unsupported_atom()
remove_unsupported_hetatm()
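
The filtering described for remove_unsupported_molecules() can be sketched as below, assuming the residue name is read from columns 18-20 and that the two sets are simply merged. `remove_unsupported_sketch` is a hypothetical name, and `XXX` and `LIG` are made-up residue names used only for the example.

```python
from typing import Iterable, Iterator, Optional


def remove_unsupported_sketch(
    lines: Iterable[str],
    haddock3_defined: Optional[set[str]] = None,
    user_defined: Optional[set[str]] = None,
    line_startswith: tuple[str, ...] = ("ATOM", "HETATM"),
) -> Iterator[str]:
    """Yield every line except considered lines of residues outside the sets."""
    allowed = (haddock3_defined or set()) | (user_defined or set())
    for line in lines:
        considered = line.startswith(line_startswith)
        if considered and line[17:20].strip() not in allowed:
            continue  # unsupported residue: drop the line
        yield line


lines = [
    "REMARK not a coordinate line",
    "ATOM      1  CA  ALA A   1",
    "HETATM    2  C1  XXX A   2",
    "HETATM    3  C1  LIG A   3",
]
kept = list(
    remove_unsupported_sketch(
        lines, haddock3_defined={"ALA"}, user_defined={"LIG"}
    )
)
```

The REMARK line is yielded untouched, ALA and LIG are kept because they appear in one of the two sets, and the XXX line is dropped.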

haddock.gear.preprocessing.replace_HETATM_to_ATOM(fhandler: Iterable[str], res: str) → Generator[str, None, None]

Replace the HETATM record with ATOM for residue res.

Do not alter other lines.

Parameters:

fhandler (file handler or list of lines) – List-like of file lines. Consumed in a for loop.
res (str) – Residue name to match for the substitution.

Yields:

str – Yield line-by-line.

haddock.gear.preprocessing.replace_HID_to_HIS(fhandler: Iterable[str], *, resin: str = 'HID', resout: str = 'HIS') → Generator[str, None, None]

Replace HID with HIS.

See also

replace_residue()

haddock.gear.preprocessing.replace_HIE_to_HIS(fhandler: Iterable[str], *, resin: str = 'HIE', resout: str = 'HIS') → Generator[str, None, None]

Replace HIE with HIS.

See also

replace_residue()

haddock.gear.preprocessing.replace_HSD_to_HIS(fhandler: Iterable[str], *, resin: str = 'HSD', resout: str = 'HIS') → Generator[str, None, None]

Replace HSD with HIS.

See also

replace_residue()

haddock.gear.preprocessing.replace_HSE_to_HIS(fhandler: Iterable[str], *, resin: str = 'HSE', resout: str = 'HIS') → Generator[str, None, None]

Replace HSE with HIS.

See also

replace_residue()

haddock.gear.preprocessing.replace_MSE_to_MET(fhandler: Iterable[str], *, resin: str = 'MSE', resout: str = 'MET') → Generator[str, None, None]

Replace MSE with MET.

See also

replace_residue()

haddock.gear.preprocessing.replace_residue(fhandler: Iterable[str], resin: str, resout: str) → Generator[str, None, None]

Replace a residue with another and change HETATM to ATOM if needed.

Do not alter other lines.

Parameters:

fhandler (file handler or list of lines) – List-like of file lines. Consumed in a for loop.
resin (str) – Residue name to match for the substitution.
resout (str) – Name of the new residue. Renames resin to resout.

Yields:

str – Yield line-by-line.

See also

replace_HETATM_to_ATOM()
pdb_rplresname from pdb-tools
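
As a rough illustration of the renaming, the sketch below rewrites the residue name in columns 18-20 and forces the record to ATOM on matching lines. It assumes three-letter residue names and standard PDB columns; `replace_residue_sketch` is a hypothetical name and not the library's implementation.

```python
from typing import Iterable, Iterator


def replace_residue_sketch(
    lines: Iterable[str], resin: str, resout: str
) -> Iterator[str]:
    """Rename residue `resin` to `resout`, writing matching lines as ATOM."""
    for line in lines:
        if line.startswith(("ATOM", "HETATM")) and line[17:20] == resin:
            # Keep serial and atom name (columns 7-17) and everything after
            # the residue name; only the record and residue fields change.
            yield "ATOM  " + line[6:17] + resout.rjust(3) + line[20:]
        else:
            yield line


lines = [
    "HETATM    1  N   MSE A   1",
    "ATOM      2  CA  ALA A   2",
]
renamed = list(replace_residue_sketch(lines, "MSE", "MET"))
```

The selenomethionine line becomes an ATOM line for MET, while the alanine line is yielded as it came in.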

haddock.gear.preprocessing.solve_no_chainID_no_segID(lines: Iterable[str]) → Iterable[str]

Solve inconsistencies with chainID and segID.

If the segID is missing, copy the chainID over the segID, and vice versa. If neither is present, assign an upper-case character, starting from A; a character is not reused until the alphabet is exhausted. If chainIDs and segIDs differ, copy the chainIDs over the segIDs.

Parameters: lines (list of str) – The lines of a PDB file.
Returns: list – With new lines. Or the input ones if no modification was made.
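
The four cases described above can be sketched as follows. The sketch assumes standard PDB columns (chain ID in column 22, segID in columns 73-76), decides once per file whether chain IDs or segIDs are present, and uses the first character of the segID when only the segID exists. `solve_ids_sketch` and `_fallback_letters` are hypothetical names, and the sketch always returns padded lines rather than the untouched input.

```python
import string
from itertools import cycle

COORD = ("ATOM", "HETATM")
# Shared across calls so a letter repeats only after the alphabet is exhausted.
_fallback_letters = cycle(string.ascii_uppercase)


def solve_ids_sketch(lines: list[str]) -> list[str]:
    """Fill in missing chain IDs or segIDs; chain IDs win when both exist."""
    # Pad coordinate lines so the segID field (columns 73-76) can be sliced.
    padded = [ln.ljust(76) if ln.startswith(COORD) else ln for ln in lines]
    has_chain = any(ln[21].strip() for ln in padded if ln.startswith(COORD))
    has_seg = any(ln[72:76].strip() for ln in padded if ln.startswith(COORD))
    # A fallback letter is drawn only when neither identifier is present.
    letter = None if (has_chain or has_seg) else next(_fallback_letters)

    result = []
    for ln in padded:
        if ln.startswith(COORD):
            if letter is not None:
                chain = letter
            elif has_chain:
                chain = ln[21]
            else:
                # Only the segID exists: its first character becomes the chain.
                chain = ln[72:76].strip()[:1] or " "
            ln = ln[:21] + chain + ln[22:72] + chain.ljust(4) + ln[76:]
        result.append(ln)
    return result


no_ids = ["ATOM      1  CA  ALA     1"]
chain_only = ["ATOM      1  CA  ALA B   1"]
first = solve_ids_sketch(no_ids)       # no IDs at all: gets the first letter
second = solve_ids_sketch(no_ids)      # next call: gets the following letter
kept = solve_ids_sketch(chain_only)    # chain ID present: copied to the segID
```

Keeping the letter iterator at module level is what lets consecutive files without any identifier receive different letters.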