PDB preprocessing

Process input PDB files to ensure compatibility with HADDOCK3.

This module checks and modifies PDB files for compatibility with HADDOCK3. There are three types of checks/modifications:

  1. Performed on each PDB line by line, in the same fashion as pdb-tools. In fact, this step mostly uses the pdb-tools package.

  2. Performed on each PDB as a whole.

  3. Performed on all PDBs together.

Main functions

Corrections performed on 1)

The following actions are performed sequentially over all PDBs:

  1. from pdb-tools: pdb_keepcoord

  2. from pdb-tools: pdb_tidy with strict=True

  3. from pdb-tools: pdb_element

  4. from pdb-tools: pdb_selaltloc

  5. from pdb-tools: pdb_occ with occupancy=1.00

  6. replace MSE to MET

  7. replace HSD to HIS

  8. replace HSE to HIS

  9. replace HID to HIS

  10. replace HIE to HIS

  11. add_charges_to_ions, see add_charges_to_ions()

  12. convert ATOM to HETATM for those atoms that should be HETATM. Considers the additional residues provided by the user. See convert_ATOM_to_HETATM().

  13. convert HETATM to ATOM for those atoms that should be ATOM.

  14. from pdb-tools: pdb_fixinsert, with option_list=[].

  15. remove unsupported HETATM. Considers residues provided by the user.

  16. remove unsupported ATOM. Considers residues provided by the user.

  17. from pdb-tools: pdb_reatom, start from 1.

  18. from pdb-tools: pdb_tidy with strict=True
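The sequential, line-by-line design described above can be illustrated with a small sketch. It is not the module's implementation: the two step functions and `run_line_steps` are hypothetical stand-ins written for this example, and they do not call pdb-tools. The sketch only shows how generator steps can be chained so that each one consumes the output of the previous one.

```python
from functools import reduce
from typing import Callable, Iterable, Iterator

# A step takes an iterable of PDB lines and yields (possibly modified) lines.
Step = Callable[[Iterable[str]], Iterator[str]]


def keep_coordinate_lines(lines: Iterable[str]) -> Iterator[str]:
    """Hypothetical filtering step: keep only ATOM and HETATM records."""
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            yield line


def rename_mse_to_met(lines: Iterable[str]) -> Iterator[str]:
    """Hypothetical renaming step (residue name sits in columns 18-20)."""
    for line in lines:
        if line.startswith(("ATOM", "HETATM")) and line[17:20] == "MSE":
            line = line[:17] + "MET" + line[20:]
        yield line


def run_line_steps(lines: Iterable[str], steps: list[Step]) -> list[str]:
    """Apply each step to the output of the previous one, in order."""
    stream = reduce(lambda acc, step: step(acc), steps, lines)
    return list(stream)
```

Because every step is a generator, no intermediate list is built between steps; lines flow through the whole chain one at a time.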

Corrections performed on 2)

The following actions are performed sequentially for each PDB:

Read the documentation of the above functions for details on what they do.

Corrections performed on 3)

The following actions are performed to all PDBs together:

Read the documentation of the above functions for details on what they do.

When it happens

The PDB processing step is performed by default when reading the input molecules and copying them to the data/ folder inside the run directory. When PDBs are processed, a copy of the original input PDBs is also stored in the data/ folder.

To deactivate this initial PDB processing, set skip_preprocess = True in the general parameters of the configuration file.

Additional information

If you are a developer and want to read more about the history of this preprocessing module, visit:

https://github.com/haddocking/haddock3/projects/16

exception haddock.gear.preprocessing.ModelsDifferError

Bases: HaddockError

MODELS of the PDB differ in atom labels.

haddock.gear.preprocessing.add_charges_to_ions(fhandler: Iterable[str]) → Generator[str, None, None]

Add charges to ions according to HADDOCK3 specifications.

  1. Check if charge is correctly defined in residue name. If so, yield the line with correct residue name and charge at the end.

  2. Check if charge is correctly defined in atom name.

  3. Create charge from element. This might need manual editing in case the atom has an unconventional charge.

Parameters:

fhandler (file handler, list, or list-like) – Lines of the PDB file. This function consumes lines over a for loop; keep this in mind if you pass a generator.

Yields:

line (str) – Line-by-line: modified ion lines and any other line.
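The note about generators applies to every line-by-line function in this module. As a minimal illustration, the sketch below uses a hypothetical `passthrough` step (not part of the module) to show that a list can be processed repeatedly, while a generator is exhausted after the first pass.

```python
from typing import Iterable, Iterator


def passthrough(fhandler: Iterable[str]) -> Iterator[str]:
    """Hypothetical line-by-line step that yields every line unchanged."""
    for line in fhandler:
        yield line


lines = ["HETATM line 1", "HETATM line 2"]

# A list can be iterated as many times as needed.
first_pass = list(passthrough(lines))
second_pass = list(passthrough(lines))

# A generator is exhausted after the first pass.
gen = (line for line in lines)
first_from_gen = list(passthrough(gen))
second_from_gen = list(passthrough(gen))  # empty: nothing left to consume
```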

haddock.gear.preprocessing.convert_ATOM_to_HETATM(fhandler: Iterable[str], *, record: str = 'ATOM', other_record: str = 'HETATM', residues: Container[str] = {'A2G', 'ABE', 'ACD', 'ACN', 'ACT', 'ADY', 'AG', 'AG1', 'AL', 'AL3', 'AMN', 'AR', 'AS', 'AU', 'AU1', 'AU3', 'BDP', 'BDY', 'BEN', 'BGC', 'BMA', 'BR', 'BR1', 'BUT', 'CA', 'CA2', 'CD', 'CD2', 'CHE', 'CL', 'CL1', 'CO', 'CO2', 'CO3', 'COH', 'COM', 'CR', 'CR2', 'CR3', 'CS', 'CS1', 'CU', 'CU1', 'CU2', 'CYA', 'DFO', 'DME', 'DMS', 'DOD', 'EOL', 'ETA', 'ETH', 'F', 'F1', 'FCA', 'FCB', 'FE', 'FE2', 'FE3', 'FUC', 'FUL', 'GAL', 'GLA', 'GLC', 'GXL', 'HEB', 'HEC', 'HG', 'HG1', 'HG2', 'HO', 'HO3', 'HOH', 'I', 'I1', 'IMI', 'IR', 'IR3', 'K', 'K1', 'KR', 'LI1', 'MAG', 'MAN', 'MER', 'MG', 'MG2', 'MIY', 'MMA', 'MN', 'MN2', 'MN3', 'MO', 'MO3', 'NA', 'NA1', 'NAG', 'NDG', 'NGA', 'NI', 'NI2', 'O2', 'OS', 'OS4', 'PB', 'PB2', 'PHN', 'PO4', 'PT', 'PT2', 'RAM', 'SIA', 'SIB', 'SO4', 'SR', 'SR2', 'THS', 'TIP', 'TIP3', 'U3', 'U4', 'URE', 'V', 'V2', 'V3', 'WAT', 'WO4', 'XE', 'XYP', 'XYS', 'YB', 'YB2', 'YB3', 'ZN', 'ZN2'}) → Generator[str, None, None]

Convert ATOM to HETATM for HADDOCK3 supported HETATM.

haddock.gear.preprocessing.convert_HETATM_to_ATOM(fhandler: Iterable[str], *, record: str = 'HETATM', other_record: str = 'ATOM  ', residues: Container[str] = {'A', 'ACE', 'ALA', 'ALY', 'ARG', 'ASH', 'ASN', 'ASP', 'C', 'CFE', 'CHX', 'CSP', 'CTN', 'CYC', 'CYF', 'CYM', 'CYS', 'DA', 'DC', 'DDZ', 'DG', 'DJ', 'DT', 'DUM', 'G', 'GLH', 'GLN', 'GLU', 'GLY', 'HIS', 'HLY', 'HY3', 'HYP', 'ILE', 'LEU', 'LYS', 'M3L', 'MET', 'MLY', 'MLZ', 'MSE', 'NEP', 'NME', 'PHE', 'PNS', 'PRO', 'PTR', 'QSR', 'SEC', 'SEP', 'SER', 'SHA', 'THR', 'TOP', 'TRP', 'TYP', 'TYR', 'TYS', 'U', 'VAL'}) → Generator[str, None, None]

Convert HETATM to ATOM for HADDOCK3 supported ATOM.

haddock.gear.preprocessing.convert_record(fhandler: Iterable[str], record: str, other_record: str, residues: Container[str]) → Generator[str, None, None]

Convert one record to another for specified residues.

For example, replace ATOM by HETATM for specific residues.

Parameters:
  • fhandler (list-like) – Contains lines of file.

  • record (str) – The PDB RECORD to match; for example, ATOM or HETATM.

  • other_record (str) – The PDB RECORD to replace with; for example, ATOM or HETATM.

  • residues (list, tuple, or set) – List of residues to replace the record.
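A record conversion of this kind can be sketched as follows. This is an illustration under stated assumptions, not the module's code: `convert_record_sketch` is a hypothetical name, the residue name is read from columns 18-20 of the PDB line, and the replacement record is padded to the six-column record field.

```python
from typing import Container, Iterable, Iterator


def convert_record_sketch(
    lines: Iterable[str],
    record: str,
    other_record: str,
    residues: Container[str],
) -> Iterator[str]:
    """Swap the record name on lines whose residue name is in `residues`."""
    for line in lines:
        resname = line[17:20].strip()
        if line.startswith(record) and resname in residues:
            # The record field spans six columns; pad the replacement to fit.
            yield other_record.ljust(6) + line[6:]
        else:
            yield line
```

Padding with `ljust(6)` keeps every other column in place, which is consistent with the 'ATOM  ' default (with two trailing spaces) shown in the convert_HETATM_to_ATOM signature.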

haddock.gear.preprocessing.correct_equal_chain_segids(structures: list[list[str]]) → list[list[str]]

Correct for repeated chainID in the input PDB files.

Repeated chain IDs are replaced by an upper case character ([A-Z]) in order.

Parameters:

structures (list of lists of str) – The input data.

Returns:

list of lists of str – The new structures.
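One way to read this behaviour is sketched below. The sketch makes several assumptions that the description does not confirm: each structure carries a single chain ID (read from column 22 of its first coordinate line), a repeated ID is replaced by the first upper-case letter not used so far, and the new ID is also written, left-justified, to the segID field (columns 73-76). The function and helper names are hypothetical.

```python
import string

COORD = ("ATOM", "HETATM")


def _set_chain(line: str, chain: str) -> str:
    """Write `chain` to the chainID column and, left-justified, to the segID field."""
    line = line.ljust(76)
    return line[:21] + chain + line[22:72] + chain.ljust(4) + line[76:]


def correct_repeated_chains_sketch(structures: list[list[str]]) -> list[list[str]]:
    used: set[str] = set()
    result = []
    for lines in structures:
        # Chain ID of the first coordinate line, or None if there is none.
        chain = next(
            (ln.ljust(22)[21] for ln in lines if ln.startswith(COORD)), None
        )
        if chain is not None and chain in used:
            # First unused upper-case letter; fails if all 26 are taken.
            chain = next(c for c in string.ascii_uppercase if c not in used)
            lines = [
                _set_chain(ln, chain) if ln.startswith(COORD) else ln
                for ln in lines
            ]
        if chain is not None:
            used.add(chain)
        result.append(lines)
    return result
```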

haddock.gear.preprocessing.homogenize_chains(lines: list[str]) → list[str]

Homogenize chainIDs within the same PDB.

If there are multiple chain identifiers in the PDB file, make them all equal to the first one.

ChainIDs are copied to segIDs afterwards.

Returns:

list – The modified lines.
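The behaviour described above can be sketched as follows. This is an illustration, not the module's code; `homogenize_chains_sketch` is a hypothetical name, and the sketch assumes the standard PDB layout (chainID in column 22, segID in columns 73-76) and a left-justified segID.

```python
def homogenize_chains_sketch(lines: list[str]) -> list[str]:
    """Set every chainID to the first one found, then copy it to the segID."""
    coord = ("ATOM", "HETATM")
    chains = [ln[21] for ln in lines if ln.startswith(coord) and len(ln) > 21]
    if not chains:
        return lines  # nothing to homogenize

    first = chains[0]
    new_lines = []
    for ln in lines:
        if ln.startswith(coord):
            ln = ln.ljust(76)  # make sure the segID columns exist
            ln = ln[:21] + first + ln[22:72] + first.ljust(4) + ln[76:]
        new_lines.append(ln)
    return new_lines
```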

haddock.gear.preprocessing.models_should_have_the_same_labels(lines: Iterable[str]) → Iterable[str]

Confirm models have the same labels.

In an ensemble of structures, where the PDB file has multiple MODELS, all models should have the same labels; hence the same number and type of atoms.

Parameters:

lines (list of strings) – List containing the lines of the PDB file. Must NOT be a generator.

Returns:

list – The original lines in case no errors are found.

Raises:

ModelsDifferError – In case MODELS differ. Reports on which models differ.
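A check of this kind can be sketched as below. The sketch is hedged on two points: what counts as a "label" is an assumption (here, columns 13-27 of each coordinate line: atom name, altloc, residue name, chain, residue number, and insertion code), and `ModelsDifferSketchError` is a hypothetical stand-in for ModelsDifferError so that the example runs without HADDOCK3 installed.

```python
class ModelsDifferSketchError(Exception):
    """Hypothetical stand-in for ModelsDifferError."""


def check_models_sketch(lines: list[str]) -> list[str]:
    """Return `lines` unchanged if every MODEL has the same atom labels."""
    models: dict[str, list[str]] = {}
    current = None
    for line in lines:
        if line.startswith("MODEL"):
            current = line[5:].strip()
            models[current] = []
        elif line.startswith("ENDMDL"):
            current = None
        elif current is not None and line.startswith(("ATOM", "HETATM")):
            # Assumed label: atom name through insertion code.
            models[current].append(line[12:27])

    if models:
        reference_id, reference = next(iter(models.items()))
        differing = [mid for mid, labels in models.items() if labels != reference]
        if differing:
            raise ModelsDifferSketchError(
                f"models {differing} differ from model {reference_id}"
            )
    return lines
```

The input is iterated only once here, but it is also returned as-is, which is why a list (and not a generator) is needed: a consumed generator would be returned empty.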

haddock.gear.preprocessing.process_pdbs(*inputdata: Iterable[str] | str | Path, dry: bool = False, user_supported_residues: Iterable[str] | None = None) → list[list[str]]

Process PDB file contents for compatibility with HADDOCK3.

Parameters:
  • inputdata (list of (str, path, list of str [lines], file handler)) – A flat list where in each index it can contain:

    • file objects

    • paths to files

    • strings representing paths

    • lists or tuples of lines

    The above types can be mixed in the input list.

    Files are read to lines in a list. Line separators are stripped.

    Do not provide nested lists, that is, lists of paths inside the input list.

  • dry (bool) – Perform a dry run. That is, do not change anything and only report.

  • user_supported_residues (list, tuple, or set) – The new residues that are allowed.

Returns:

list of (list of str) – The corrected (processed) PDB content in the same order as inputdata.
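The mixed input types listed above could be normalized to lists of lines along the following lines. This is only a sketch of the idea, with hypothetical names (`read_lines_sketch`, `normalize_inputs_sketch`); it does not reproduce how process_pdbs reads its inputs.

```python
from pathlib import Path
from typing import Iterable, TextIO, Union

InputItem = Union[str, Path, TextIO, Iterable[str]]


def read_lines_sketch(item: InputItem) -> list[str]:
    """Turn one input item into a list of lines without line separators."""
    if isinstance(item, (str, Path)):
        # Strings are treated as paths, not as file contents.
        text_lines = Path(item).read_text().splitlines()
    elif hasattr(item, "read"):
        # Open file object.
        text_lines = item.read().splitlines()
    else:
        # List or tuple of lines.
        text_lines = list(item)
    return [line.rstrip("\r\n") for line in text_lines]


def normalize_inputs_sketch(*inputdata: InputItem) -> list[list[str]]:
    """Read every input item, preserving the input order."""
    return [read_lines_sketch(item) for item in inputdata]
```

Checking for `str` and `Path` first matters because a string is itself iterable; without that branch a path string would be split into single characters.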

haddock.gear.preprocessing.remove_unsupported_atom(lines: Iterable[str], *, haddock3_defined: set[str] | None = {'A', 'ACE', 'ALA', 'ALY', 'ARG', 'ASH', 'ASN', 'ASP', 'C', 'CFE', 'CHX', 'CSP', 'CTN', 'CYC', 'CYF', 'CYM', 'CYS', 'DA', 'DC', 'DDZ', 'DG', 'DJ', 'DT', 'DUM', 'G', 'GLH', 'GLN', 'GLU', 'GLY', 'HIS', 'HLY', 'HY3', 'HYP', 'ILE', 'LEU', 'LYS', 'M3L', 'MET', 'MLY', 'MLZ', 'MSE', 'NEP', 'NME', 'PHE', 'PNS', 'PRO', 'PTR', 'QSR', 'SEC', 'SEP', 'SER', 'SHA', 'THR', 'TOP', 'TRP', 'TYP', 'TYR', 'TYS', 'U', 'VAL'}, user_defined: set[str] | None = None, line_startswith: str | tuple[str, ...] = 'ATOM') → Generator[str, None, None]

Remove unsupported molecules in ATOM lines.

Uses remove_unsupported_molecules() by populating its haddock3_defined and line_startswith parameters.

haddock.gear.preprocessing.remove_unsupported_hetatm(lines: Iterable[str], *, haddock3_defined: set[str] | None = {'A2G', 'ABE', 'ACD', 'ACN', 'ACT', 'ADY', 'AG', 'AG1', 'AL', 'AL3', 'AMN', 'AR', 'AS', 'AU', 'AU1', 'AU3', 'BDP', 'BDY', 'BEN', 'BGC', 'BMA', 'BR', 'BR1', 'BUT', 'CA', 'CA2', 'CD', 'CD2', 'CHE', 'CL', 'CL1', 'CO', 'CO2', 'CO3', 'COH', 'COM', 'CR', 'CR2', 'CR3', 'CS', 'CS1', 'CU', 'CU1', 'CU2', 'CYA', 'DFO', 'DME', 'DMS', 'DOD', 'EOL', 'ETA', 'ETH', 'F', 'F1', 'FCA', 'FCB', 'FE', 'FE2', 'FE3', 'FUC', 'FUL', 'GAL', 'GLA', 'GLC', 'GXL', 'HEB', 'HEC', 'HG', 'HG1', 'HG2', 'HO', 'HO3', 'HOH', 'I', 'I1', 'IMI', 'IR', 'IR3', 'K', 'K1', 'KR', 'LI1', 'MAG', 'MAN', 'MER', 'MG', 'MG2', 'MIY', 'MMA', 'MN', 'MN2', 'MN3', 'MO', 'MO3', 'NA', 'NA1', 'NAG', 'NDG', 'NGA', 'NI', 'NI2', 'O2', 'OS', 'OS4', 'PB', 'PB2', 'PHN', 'PO4', 'PT', 'PT2', 'RAM', 'SIA', 'SIB', 'SO4', 'SR', 'SR2', 'THS', 'TIP', 'TIP3', 'U3', 'U4', 'URE', 'V', 'V2', 'V3', 'WAT', 'WO4', 'XE', 'XYP', 'XYS', 'YB', 'YB2', 'YB3', 'ZN', 'ZN2'}, user_defined: set[str] | None = None, line_startswith: str | tuple[str, ...] = 'HETATM') → Generator[str, None, None]

Remove unsupported molecules in HETATM lines.

Uses remove_unsupported_molecules() by populating its haddock3_defined and line_startswith parameters.

haddock.gear.preprocessing.remove_unsupported_molecules(lines: Iterable[str], haddock3_defined: set[str] | None = None, user_defined: set[str] | None = None, line_startswith: str | tuple[str, ...] = ('ATOM', 'HETATM')) → Generator[str, None, None]

Remove HADDOCK3 unsupported molecules.

This function is abstract and you need to provide the set of residues supported by HADDOCK3. See parameters.

Residues not provided in haddock3_defined and user_defined are removed from the PDB lines.

Other lines are yielded unmodified.

Parameters:
  • lines (list or list-like) – Lines of the PDB file. This function consumes lines over a for loop; keep this in mind if you pass a generator.

  • haddock3_defined (set) – Set of residues supported by HADDOCK3. Defaults to None.

  • user_defined (set) – An additional set of allowed residues given by the user. Defaults to None.

  • line_startswith (tuple) – The lines to consider. Defaults to ("ATOM", "HETATM").

Yields:

line (str) – Line-by-line. Lines for residues not supported are not yielded.

See also

Other functions use this function to create context.
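The filtering logic described above can be sketched in a few lines. The function name `remove_unsupported_sketch` is hypothetical and the residue name is assumed to sit in columns 18-20; the parameters mirror the documented ones but the body is an illustration, not the module's code.

```python
from typing import Iterable, Iterator, Optional, Union


def remove_unsupported_sketch(
    lines: Iterable[str],
    haddock3_defined: Optional[set[str]] = None,
    user_defined: Optional[set[str]] = None,
    line_startswith: Union[str, tuple[str, ...]] = ("ATOM", "HETATM"),
) -> Iterator[str]:
    """Yield lines, dropping considered records whose residue is not allowed."""
    allowed = (haddock3_defined or set()) | (user_defined or set())
    for line in lines:
        if line.startswith(line_startswith):
            if line[17:20].strip() in allowed:
                yield line
            # Unsupported residues are dropped silently in this sketch.
        else:
            # Lines that are not considered pass through unmodified.
            yield line
```

With `line_startswith="HETATM"`, ATOM lines are not inspected at all and pass through, which matches how the ATOM and HETATM variants above differ only in their defaults.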

haddock.gear.preprocessing.replace_HETATM_to_ATOM(fhandler: Iterable[str], res: str) → Generator[str, None, None]

Replace record HETATM to ATOM for res.

Do not alter other lines.

Parameters:
  • fhandler (file handler or list of lines) – List-like of file lines. Consumes over a for loop.

  • res (str) – Residue name to match for the substitution.

Yields:

str – Yield line-by-line.

haddock.gear.preprocessing.replace_HID_to_HIS(fhandler: Iterable[str], *, resin: str = 'HID', resout: str = 'HIS') → Generator[str, None, None]

Replace HID to HIS.

haddock.gear.preprocessing.replace_HIE_to_HIS(fhandler: Iterable[str], *, resin: str = 'HIE', resout: str = 'HIS') → Generator[str, None, None]

Replace HIE to HIS.

haddock.gear.preprocessing.replace_HSD_to_HIS(fhandler: Iterable[str], *, resin: str = 'HSD', resout: str = 'HIS') → Generator[str, None, None]

Replace HSD to HIS.

haddock.gear.preprocessing.replace_HSE_to_HIS(fhandler: Iterable[str], *, resin: str = 'HSE', resout: str = 'HIS') → Generator[str, None, None]

Replace HSE to HIS.

haddock.gear.preprocessing.replace_MSE_to_MET(fhandler: Iterable[str], *, resin: str = 'MSE', resout: str = 'MET') → Generator[str, None, None]

Replace MSE to MET.

haddock.gear.preprocessing.replace_residue(fhandler: Iterable[str], resin: str, resout: str) → Generator[str, None, None]

Replace one residue with another and change HETATM to ATOM if needed.

Do not alter other lines.

Parameters:
  • fhandler (file handler or list of lines) – List-like of file lines. Consumes over a for loop.

  • resin (str) – Residue name to match for the substitution.

  • resout (str) – Name of the new residue. Renames resin to resout.

Yields:

str – Yield line-by-line.

See also

haddock.gear.preprocessing.solve_no_chainID_no_segID(lines: Iterable[str]) → Iterable[str]

Solve inconsistencies with chainID and segID.

If segID is nonexistent, copy chainID over segID, and vice versa. If neither is present, add an upper-case character starting from A. This character is not repeated until the alphabet is exhausted. If chainIDs and segIDs differ, copy chainIDs over segIDs.

Parameters:

lines (list of str) – The lines of a PDB file.

Returns:

list – With new lines. Or the input ones if no modification was made.
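The rules above can be sketched as follows. Several details are assumptions made for illustration and are not confirmed by the description: the rules are applied per coordinate line, only the first character of a segID is copied to the single-column chainID, the segID is written left-justified, and the letter counter is a module-level cycle shared across calls. `solve_ids_sketch` is a hypothetical name.

```python
import string
from itertools import cycle
from typing import Iterable

# Shared across calls so that successive PDBs without IDs get A, B, C, ...
_letters = cycle(string.ascii_uppercase)


def solve_ids_sketch(lines: Iterable[str]) -> list[str]:
    """Make chainID (column 22) and segID (columns 73-76) consistent."""
    letter = None  # drawn at most once per call, only if needed
    out = []
    for ln in lines:
        if ln.startswith(("ATOM", "HETATM")):
            ln = ln.ljust(76)  # make sure the segID columns exist
            chain = ln[21].strip()
            seg = ln[72:76].strip()
            if chain:
                # chainID wins, also when a different segID is present.
                seg = chain
            elif seg:
                # Only segID present: chainID is one column wide.
                chain = seg[0]
            else:
                if letter is None:
                    letter = next(_letters)
                chain = seg = letter
            ln = ln[:21] + chain + ln[22:72] + seg.ljust(4) + ln[76:]
        out.append(ln)
    return out
```

Unlike the documented function, this sketch always returns a new list and pads coordinate lines to 76 columns even when no ID changes.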