alignment¶
This package does most of the project's heavy lifting: the alignment process happens here. The input is a spectrum, and the output is a set of alignments.
alignment_utils¶
- src.alignment.alignment_utils.__get_surrounding_amino_acids(parent_sequence: str, sequence: str, count: int) → list¶
Get the amino acids that surround a sequence. For each occurrence, return up to count amino acids from the left side and up to count from the right side.
- Parameters
parent_sequence (str) – protein sequence to pull from
sequence (str) – subsequence we are looking for
count (int) – the number of surrounding amino acids to get per side
- Returns
tuples of (left amino acids, right amino acids) from all occurrences in the parent sequence
- Return type
list
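The lookup described above can be sketched as follows. This is a minimal illustration and not the project's implementation: the function name `surrounding_amino_acids` is hypothetical, and returning shorter flanks at the ends of the parent sequence is an assumption.

```python
def surrounding_amino_acids(parent_sequence: str, sequence: str, count: int) -> list:
    """Return (left, right) flanking residues for every occurrence of `sequence`."""
    flanks = []
    start = parent_sequence.find(sequence)
    while start != -1:
        end = start + len(sequence)
        # Clamp at 0 so an occurrence near the start yields a shorter left flank.
        left = parent_sequence[max(0, start - count):start]
        # Slicing past the end of a string is safe and yields a shorter right flank.
        right = parent_sequence[end:end + count]
        flanks.append((left, right))
        # Advance by one position so overlapping occurrences are found too.
        start = parent_sequence.find(sequence, start + 1)
    return flanks
```

For a parent sequence 'MABCDEFABCXY' and subsequence 'ABC' with count 2, the sketch finds two occurrences and returns one (left, right) tuple for each.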
- src.alignment.alignment_utils.__add_amino_acids(spectrum: src.objects.Spectrum, sequence: str, db: src.objects.Database, gap: int = 3, tolerance: float = 1.0) → list¶
Try to add amino acids so that the calculated precursor mass comes as close as possible to the observed precursor mass
- Parameters
spectrum (Spectrum) – observed spectrum
sequence (str) – the alignment to add amino acids to
db (Database) – holds the source proteins
gap (int) – number of amino acids allowed to be added to each side (default is 3)
tolerance (float) – the tolerance (in Daltons) to allow when comparing precursor masses. (default is 1)
- Returns
all perturbed str sequences with a calculated precursor mass that is within the tolerance of the observed precursor mass
- Return type
list
- src.alignment.alignment_utils.__remove_amino_acids(spectrum: src.objects.Spectrum, sequence: str, gap: int = 3, tolerance: float = 1) → list¶
Remove up to gap amino acids to try to match the precursor mass
- Parameters
spectrum (Spectrum) – observed spectrum
sequence (str) – the alignment to remove amino acids from
gap (int) – number of amino acids allowed to be removed from each side (default is 3)
tolerance (float) – the tolerance (in Daltons) to allow when comparing precursor masses. (default is 1)
- Returns
all perturbed str sequences with a calculated precursor mass that is within the tolerance of the observed precursor mass
- Return type
list
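The removal step can be sketched as below. This is an illustration under stated assumptions, not the project's code: the names `peptide_mass` and `remove_amino_acids_sketch` are hypothetical, the observed precursor is passed as a neutral mass (a plain float) instead of a Spectrum object, and the comparison uses neutral monoisotopic masses and not m/z values.

```python
# Standard monoisotopic residue masses in Daltons.
RESIDUE_MASS = {
    'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
    'V': 99.06841, 'T': 101.04768, 'C': 103.00919, 'L': 113.08406,
    'I': 113.08406, 'N': 114.04293, 'D': 115.02694, 'Q': 128.05858,
    'K': 128.09496, 'E': 129.04259, 'M': 131.04049, 'H': 137.05891,
    'F': 147.06841, 'R': 156.10111, 'Y': 163.06333, 'W': 186.07931,
}
WATER_MASS = 18.010565


def peptide_mass(sequence: str) -> float:
    """Neutral monoisotopic mass of a peptide: residues plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS


def remove_amino_acids_sketch(observed_mass: float, sequence: str,
                              gap: int = 3, tolerance: float = 1.0) -> list:
    """Trim 0..gap residues from each side and keep candidates within tolerance."""
    matches = []
    for left in range(gap + 1):
        for right in range(gap + 1):
            if left == 0 and right == 0:
                continue  # nothing removed, so this is just the input
            if left + right >= len(sequence):
                continue  # would remove the whole sequence
            candidate = sequence[left:len(sequence) - right]
            close_enough = abs(peptide_mass(candidate) - observed_mass) <= tolerance
            if close_enough and candidate not in matches:
                matches.append(candidate)
    return matches
```

Treating gap as a per-side limit follows the parameter description above; a limit on the total number of removed residues would only change the loop bounds.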
- src.alignment.alignment_utils.align_overlaps(seq1: str, seq2: str) → str¶
Attempt to align two string sequences. It will look at the right side of seq1 and left side of seq2 to overlap the two strings. If no overlap is found, seq2 is appended to seq1
- Parameters
seq1 (str) – the left sequence
seq2 (str) – the right sequence
- Returns
the aligned sequence
- Return type
str
- Example
>>> align_overlaps('ABCD', 'CDEF')
'ABCDEF'
- Example
>>> align_overlaps('ABCD', 'EFGH')
'ABCD-EFGH'
- src.alignment.alignment_utils.match_precursor(spectrum: src.objects.Spectrum, sequence: str, db: src.objects.Database, gap: int = 3, tolerance: float = 1) → list¶
Try to fill in the gaps of an alignment. This focuses primarily on the gap left by the difference in precursor mass. If closing that difference would take more than gap amino acids, an empty list is returned
- Parameters
spectrum (Spectrum) – observed spectrum
sequence (str) – the attempted alignment
db (Database) – source of proteins
gap (int) – the maximum number of amino acids to add to or remove from the original sequence. (default is 3)
tolerance (float) – the error (in Daltons) allowed when trying to match a calculated precursor to the observed precursor mass. (default is 1)
- Returns
sequences with a precursor within the tolerance of the observed precursor
- Return type
list
- src.alignment.alignment_utils.get_parents(seq: str, db: src.objects.Database, ion: Optional[str] = None) -> (<class 'list'>, <class 'list'>)¶
Get the parents of a sequence. If the sequence is a hybrid sequence, the second entry of the tuple holds a list of proteins for the right contributor; otherwise the second entry is None.
- Parameters
seq (str) – the sequence to look for the parents of
db (Database) – the source proteins
ion (str) – if left as None, look for the full string. Otherwise, search recursively by taking residues off one side of the kmer, depending on the ion type. (default is None)
- Returns
if hybrid, (left source proteins, right source proteins) else, (source proteins, None)
- Return type
(list, list)
- Example
>>> # non hybrid peptide
>>> get_parents('ABCDE', db, None)
([protein1, protein2], None)
- Example
>>> # hybrid peptide
>>> get_parents('ABC(DE)FGH', db, None)
([protein1, protein2], [protein3])
- src.alignment.alignment_utils.extend_non_hybrid(seq: str, spectrum: src.objects.Spectrum, ion: str, db: src.objects.Database) → list¶
Extend a non hybrid sequence to try and match the predicted length. b ion kmers will be extended to the right, and y ion kmers to the left
- Parameters
seq (str) – sequence to be extended
spectrum (Spectrum) – observed spectrum
ion (str) – ion type. Either ‘b’ or ‘y’
db (Database) – source of proteins
- Returns
all possible extensions of the initial sequence
- Return type
list
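The direction rule (b ion kmers grow to the right, y ion kmers grow to the left) can be sketched as below. The names are hypothetical, and so are two simplifications: a single parent protein string stands in for the Database, and the predicted length is passed in as `target_length` because this reference does not state how it is derived from the spectrum.

```python
def extend_kmer_sketch(seq: str, parent_sequence: str, ion: str,
                       target_length: int) -> list:
    """Extend `seq` along `parent_sequence` until it reaches `target_length`."""
    if ion not in ('b', 'y'):
        raise ValueError("ion must be 'b' or 'y'")
    # Never shrink the kmer if the target is shorter than the kmer itself.
    target_length = max(target_length, len(seq))
    extensions = []
    start = parent_sequence.find(seq)
    while start != -1:
        end = start + len(seq)
        if ion == 'b':
            # b ions anchor the left end, so new residues come from the right.
            extended = parent_sequence[start:start + target_length]
        else:
            # y ions anchor the right end, so new residues come from the left.
            extended = parent_sequence[max(0, end - target_length):end]
        if extended not in extensions:
            extensions.append(extended)
        start = parent_sequence.find(seq, start + 1)
    return extensions
```

When the parent protein ends before the target length is reached, the sketch returns the shorter extension instead of discarding it; the real function may handle that case differently.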
alignment¶
- src.alignment.alignment.same_protein_alignment(seq1: str, seq2: str, parent_sequence: str) -> (<class 'str'>, <class 'str'>)¶
Attempt to create a non-hybrid alignment from two sequences from the same protein. If the two sequences do not directly overlap but are close enough and from the same protein, make the alignment. If not, create a hybrid alignment from the two input halves. If one completely overlaps the other, use the larger sequence as the alignment.
- Parameters
seq1 (str) – left sequence
seq2 (str) – right sequence
parent_sequence (str) – parent sequence of seq1 and seq2
- Returns
if hybrid, (sequence without special characters, sequence with special hybrid characters), else (sequence, None)
- Return type
(str, str or None)
- Example
>>> same_protein_alignment('ABC', 'CDE', 'ABCDEFG')
('ABCDE', None)
- Example
>>> same_protein_alignment('ABC', 'FGH', 'ABCDEFHI')
('ABCFGH', 'ABC-FGH')
- src.alignment.alignment.extend_base_kmers(b_kmers: list, y_kmers: list, spectrum: src.objects.Spectrum, db: src.objects.Database) → list¶
Extend all the base b and y ion matched k-mers to the predicted length to try and find a non-hybrid alignment
- Parameters
b_kmers (list) – kmers from b ion masses
y_kmers (list) – kmers from y ion masses
spectrum (Spectrum) – observed spectrum
db (Database) – source proteins
- Returns
extended ion kmers (strings)
- Return type
list
- src.alignment.alignment.refine_alignments(spectrum: src.objects.Spectrum, db: src.objects.Database, alignments: list, precursor_tolerance: int = 10, DEV: bool = False, truth: Optional[dict] = None, fall_off: Optional[dict] = None) → list¶
Refine the rough alignments. This includes precursor matching and the removal or replacement of ambiguous hybrids
- Parameters
spectrum (Spectrum) – observed spectrum in question
db (Database) – Holds all the source sequences
alignments (list) – tuples of (‘nonhybrid_sequence’, None or ‘hybrid_sequence’) alignments
precursor_tolerance (int) – the parts per million error allowed when trying to match precursor masses. (default is 10)
DEV (bool) – set to True if truth is a valid dictionary and fall off detection is desired (default is False)
truth (dict) – a set of id keyed spectra with the desired spectra. A better description of what this looks like can be seen in the param.py file. If left None, the program will continue normally (default is None)
fall_off (dict) – only works if the truth param is set to a dictionary. This is a dictionary (if using multiprocessing, needs to be process safe) where, if a sequence loses the desired sequence, a key value pair of spectrum id, DevFallOffEntry object are added to it. (default is None)
- Returns
tuples of refined alignments (‘nonhybrid_sequence’, None or ‘hybrid_sequence’)
- Return type
list
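The parts per million check behind precursor_tolerance can be sketched as below. The function name `within_ppm` is hypothetical, and measuring the error relative to the calculated mass (and not the observed one) is an assumption.

```python
def within_ppm(observed: float, calculated: float, ppm_tolerance: float = 10) -> bool:
    """True if `observed` lies within `ppm_tolerance` ppm of `calculated`."""
    ppm_error = abs(observed - calculated) / calculated * 1e6
    return ppm_error <= ppm_tolerance
```

At a mass of 1000 Da, a 10 ppm tolerance corresponds to an absolute window of 0.01 Da, so a 0.005 Da difference passes and a 0.02 Da difference does not.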
- src.alignment.alignment.align_b_y(b_kmers: list, y_kmers: list, spectrum: src.objects.Spectrum, db: src.objects.Database) → list¶
Try to connect all b and y k-mers into either hybrid or non-hybrid string alignments.
- Parameters
b_kmers (list) – kmers from b ion masses
y_kmers (list) – kmers from y ion masses
spectrum (Spectrum) – observed spectrum
db (Database) – source proteins
- Returns
tuples of alignments. If hybrid, (sequence, sequence with special hybrid characters), otherwise (sequence, None)
- Return type
list
- src.alignment.alignment.attempt_alignment(spectrum: src.objects.Spectrum, db: src.objects.Database, b_hits: list, y_hits: list, n: int = 3, ppm_tolerance: int = 20, precursor_tolerance: int = 10, digest_type: str = '', DEBUG: bool = False, is_last: bool = False, truth: Optional[bool] = None, fall_off: Optional[bool] = None) → src.objects.Alignments¶
Create an alignment for the input spectrum given an initial set of b and y ion based kmers
- Parameters
spectrum (Spectrum) – observed spectrum in question
db (Database) – Holds all the source sequences
b_hits (list) – all k-mers found from the b-ion search
y_hits (list) – all k-mers found from the y-ion search
ppm_tolerance (int) – the parts per million error allowed when trying to match masses. (default is 20)
precursor_tolerance (int) – the parts per million error allowed when trying to match precursor masses. (default is 10)
n (int) – the number of alignments to save. (default is 3)
digest_type (str) – the digest performed on the sample (default is ‘’)
truth (dict) – a set of id keyed spectra with the desired spectra. A better description of what this looks like can be seen in the param.py file. If left None, the program will continue normally (default is None)
fall_off (dict) – only works if the truth param is set to a dictionary. This is a dictionary (if using multiprocessing, needs to be process safe) where, if a sequence loses the desired sequence, a key value pair of spectrum id, DevFallOffEntry object are added to it. (default is None)
is_last (bool) – Only works if DEV is set to true in params. If set to true, timing evaluations are done. (default is False)
- Returns
attempted alignments
- Return type
Alignments
hybrid_alignment¶
- src.alignment.hybrid_alignment.__replace_ambiguous_hybrid(hybrid: tuple, db: src.objects.Database, observed: src.objects.Spectrum) -> (<class 'str'>, <class 'str'>)¶
Attempt to replace a hybrid with a sequence from the database.
- Parameters
hybrid (tuple) – tuple of (nonhybrid sequence, hybrid sequence)
db (Database) – source proteins
observed (Spectrum) – observed spectrum
- Returns
the input tuple if no replacement is found, otherwise (nonhybrid sequence, None)
- Return type
tuple
- src.alignment.hybrid_alignment.replace_ambiguous_hybrids(hybrid_alignments: list, db: src.objects.Database, observed: src.objects.Spectrum) → list¶
Remove any ambiguous hybrid alignments that can be explained by non hybrid sequences. The returned list has the sequences or their replacements in the same order that they were in on entry.
Amino acids L and I are swapped and tried in the search due to the ambiguity in their mass
- Parameters
hybrid_alignments (list) – tuples of attempted hybrid alignments of (non hybrid sequence, hybrid sequence)
db (Database) – source proteins
observed (Spectrum) – observed spectrum
- Returns
If no replacements are found, the input. Otherwise a tuple of (non hybrid, None) is inserted in its position
- Return type
list
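The replacement rule can be sketched as below. This is an illustration only. It swaps in a simpler technique than the one described: instead of trying each L/I substitution, it maps every I to L on both sides before searching, which treats the two residues as interchangeable in a single pass. The names are hypothetical, a plain dict of protein name to sequence stands in for the Database, the observed spectrum is ignored, and returning the database spelling of the match is an assumption.

```python
from typing import Optional


def find_nonhybrid_source(sequence: str, proteins: dict) -> Optional[str]:
    """Return the database spelling of `sequence` if any protein contains it,
    treating L and I as interchangeable; otherwise return None."""
    normalized = sequence.replace('I', 'L')
    for protein_sequence in proteins.values():
        position = protein_sequence.replace('I', 'L').find(normalized)
        if position != -1:
            # Normalizing does not change lengths, so the index maps back directly.
            return protein_sequence[position:position + len(sequence)]
    return None


def replace_ambiguous_hybrids_sketch(hybrid_alignments: list, proteins: dict) -> list:
    """Replace each hybrid that a single protein can explain, preserving order."""
    refined = []
    for nonhybrid, hybrid in hybrid_alignments:
        found = find_nonhybrid_source(nonhybrid, proteins)
        refined.append((found, None) if found is not None else (nonhybrid, hybrid))
    return refined
```

Because each result is appended in input order, the output keeps the positions described above whether or not a replacement was found.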
- src.alignment.hybrid_alignment.hybrid_alignment(seq1: str, seq2: str) -> (<class 'str'>, <class 'str'>)¶
Create a hybrid alignment from 2 sequences. If an overlap between these two sequences is found, () are placed around the ambiguous section. If there is no overlap, then seq2 is appended to seq1 with a - at the junction
- Parameters
seq1 (str) – left sequence
seq2 (str) – right sequence
- Returns
hybrid sequence without special characters, hybrid with special characters
- Return type
tuple
- Example
>>> hybrid_alignment('ABCDE', 'DEFGH')
('ABCDEFGH', 'ABC(DE)FGH')
- Example
>>> hybrid_alignment('ABCD', 'EFGH')
('ABCDEFGH', 'ABCD-EFGH')
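The junction rule described for hybrid_alignment can be sketched as below. The function name is hypothetical, and preferring the longest suffix/prefix overlap when several are possible is an assumption; the source only states that the ambiguous section is wrapped in parentheses.

```python
def hybrid_alignment_sketch(seq1: str, seq2: str) -> tuple:
    """Join two sequences, marking an overlap with () or a plain junction with -."""
    # Find the longest suffix of seq1 that is also a prefix of seq2.
    overlap = 0
    for size in range(min(len(seq1), len(seq2)), 0, -1):
        if seq1.endswith(seq2[:size]):
            overlap = size
            break
    if overlap == 0:
        # No shared residues: concatenate and mark the junction.
        return seq1 + seq2, seq1 + '-' + seq2
    plain = seq1 + seq2[overlap:]
    marked = seq1[:-overlap] + '(' + seq1[-overlap:] + ')' + seq2[overlap:]
    return plain, marked
```

Run on the two documented inputs, the sketch reproduces the documented outputs.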