Proposed fix for issue #47 #48

Open
alex-selvita wants to merge 1 commit into jvkersch:main from alex-selvita:patch-1

@alex-selvita alex-selvita commented Oct 27, 2024

Added support for residues from the extended IUPAC set and for UNK (unknown) residues when converting 3-letter residue codes to the 1-letter sequence.

Fixes #47

Add support for SEC (Selenocysteine) and UNK (unknown) residue encodings
Comment thread tmtools/io.py

import numpy as np

protein_letters_3to1['UNK'] = 'X'

Owner

@alkorolyov-selvita Thanks for the fix! This does the job but has the side-effect of modifying the IUPACData dictionary, which "belongs to" BioPython (so that if this dictionary is used elsewhere in a user's code, the modification will also be visible). I would propose having an explicit fix: e.g. check for residue.resname == 'UNK' on the line below, and then insert X or the appropriate 1-letter code in the sequence:

seq.append(protein_letters_3to1[residue.resname])
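
A minimal sketch of the explicit check suggested above, assuming residue names arrive as uppercase 3-letter strings. `STANDARD_3TO1` is a small stand-in for BioPython's `protein_letters_3to1` so the snippet runs without BioPython, and `resname_to_letter` and `sequence_from_resnames` are hypothetical helpers, not functions from tmtools:

```python
# Stand-in for Bio.PDB.Polypeptide.protein_letters_3to1
# (a small subset, for illustration only).
STANDARD_3TO1 = {"ALA": "A", "CYS": "C", "GLY": "G", "SER": "S"}


def resname_to_letter(resname):
    """Map a 3-letter residue name to its 1-letter code (hypothetical helper)."""
    if resname == "UNK":
        # Handle unknown residues explicitly instead of adding a key
        # to a dictionary that belongs to another library.
        return "X"
    return STANDARD_3TO1[resname]


def sequence_from_resnames(resnames):
    """Build a 1-letter sequence string from an iterable of residue names."""
    return "".join(resname_to_letter(name) for name in resnames)


print(sequence_from_resnames(["ALA", "UNK", "GLY"]))  # prints AXG
```

The lookup dictionary is never written to, so any other code that reads the same mapping sees it unchanged.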

@jvkersch jvkersch left a comment

Owner

@alkorolyov-selvita Thanks for the fix. I've left one comment; could you also add a test to exercise the fix?

@alex-selvita

Author

Hello @jvkersch, thanks for your comments. Sure, I will update the code and try to add some basic tests for the new changes.

@jvkersch

Owner

@alkorolyov-selvita Did you have time to look into this? I think there are two issues, which are easy to address.

The first is that (apparently) PDB uses all-uppercase residue names (e.g. 'ALA'), whereas the keys in protein_letters_3to1_extended are capitalized with only the first letter in uppercase (e.g. 'Ala').

>>> from Bio.PDB.Polypeptide import protein_letters_3to1
>>> protein_letters_3to1
{'ALA': 'A', 'CYS': 'C', 'ASP': 'D', 'GLU': 'E', 'PHE': 'F', 'GLY': 'G', 'HIS': 'H', 'ILE': 'I', 'LYS': 'K', 'LEU': 'L', 'MET': 'M', 'ASN': 'N', 'PRO': 'P', 'GLN': 'Q', 'ARG': 'R', 'SER': 'S', 'THR': 'T', 'VAL': 'V', 'TRP': 'W', 'TYR': 'Y'}
>>> from Bio.Data.IUPACData import protein_letters_3to1_extended
>>> protein_letters_3to1_extended
{'Ala': 'A', 'Cys': 'C', 'Asp': 'D', 'Glu': 'E', 'Phe': 'F', 'Gly': 'G', 'His': 'H', 'Ile': 'I', 'Lys': 'K', 'Leu': 'L', 'Met': 'M', 'Asn': 'N', 'Pro': 'P', 'Gln': 'Q', 'Arg': 'R', 'Ser': 'S', 'Thr': 'T', 'Val': 'V', 'Trp': 'W', 'Tyr': 'Y', 'Asx': 'B', 'Xaa': 'X', 'Glx': 'Z', 'Xle': 'J', 'Sec': 'U', 'Pyl': 'O'}

The second is that the protein_letters_3to1_extended dictionary should not be modified by this code. There are a few workarounds: one is to create a copy of protein_letters_3to1_extended internal to this library and modify that. That would also allow you to do something about the capitalization:

from Bio.Data.IUPACData import protein_letters_3to1_extended

_PROTEIN_LETTERS_3TO1_INTERNAL = {k.upper(): v for (k, v) in protein_letters_3to1_extended.items()}
_PROTEIN_LETTERS_3TO1_INTERNAL["UNK"] = "X"

@alex-selvita

Author

Hello @jvkersch,

Sorry for the late reply; I was struggling a bit to properly configure the dev environment. For the moment I have just reimplemented get_residue_data() locally to include the necessary residue codes. I will try to get back to this issue later.

@jvkersch

Owner

@alkorolyov-selvita Let me know if you have any questions about setting up a dev environment. Personally, I use an empty Conda environment and then I install the package in editable mode (pip install -e . -v). My editor picks up the environment, and the only thing I have to do manually is recompile the C++ code when it changes (but this should not affect you).

Development

Successfully merging this pull request may close these issues.

Lack of extended and unknown residue support in get_residue_data()

2 participants