alignment¶

This package does the heaviest lifting in the project: it runs the alignment process. The input is a spectrum, and the output is a set of alignments.

alignment_utils¶

src.alignment.alignment_utils.__get_surrounding_amino_acids(parent_sequence: str, sequence: str, count: int) → list¶

Get the amino acids that surround a sequence. For each occurrence, return up to count amino acids from the left side and up to count from the right side.

Parameters

parent_sequence (str) – protein sequence to pull from
sequence (str) – subsequence we are looking for
count (int) – the number of surrounding amino acids to get per side

Returns

tuples of (left amino acids, right amino acids) from all occurrences in the parent sequence

Return type

list
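The lookup described above can be sketched as follows. This is a minimal illustration rather than the project's implementation: the name surrounding_amino_acids is hypothetical, and stepping one position at a time (so that overlapping occurrences are also reported) is an assumption.

```python
def surrounding_amino_acids(parent_sequence: str, sequence: str, count: int) -> list:
    """Return (left, right) flanking residues for every occurrence of sequence."""
    flanks = []
    start = parent_sequence.find(sequence)
    while start != -1:
        end = start + len(sequence)
        # Slices are clipped at the protein ends, so a flank may be shorter than count.
        left = parent_sequence[max(0, start - count):start]
        right = parent_sequence[end:end + count]
        flanks.append((left, right))
        # Advance by one position so overlapping occurrences are also found (assumption).
        start = parent_sequence.find(sequence, start + 1)
    return flanks


print(surrounding_amino_acids("MKABCDEFABCQ", "ABC", 2))
```

Near either end of the parent sequence the returned flank is simply shorter than count.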

src.alignment.alignment_utils.__add_amino_acids(spectrum: src.objects.Spectrum, sequence: str, db: src.objects.Database, gap: int = 3, tolerance: float = 1.0) → list¶

Try to add amino acids so that the calculated precursor mass comes as close as possible to the observed precursor mass

Parameters

spectrum (Spectrum) – observed spectrum
sequence (str) – the alignment to add amino acids to
db (Database) – holds the source proteins
gap (int) – number of amino acids allowed to be added to each side (default is 3)
tolerance (float) – the tolerance (in Daltons) to allow when comparing precursor masses. (default is 1)

Returns

all perturbed str sequences with a calculated precursor mass that is within the tolerance of the observed precursor mass

Return type

list

src.alignment.alignment_utils.__remove_amino_acids(spectrum: src.objects.Spectrum, sequence: str, gap: int = 3, tolerance: float = 1) → list¶

Remove up to gap amino acids to try to match the precursor mass

Parameters

spectrum (Spectrum) – observed spectrum
sequence (str) – the alignment to remove amino acids from
gap (int) – number of amino acids allowed to be removed from each side (default is 3)
tolerance (float) – the tolerance (in Daltons) to allow when comparing precursor masses. (default is 1)

Returns

all perturbed str sequences with a calculated precursor mass that is within the tolerance of the observed precursor mass

Return type

list
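As a rough illustration of the trimming idea, the sketch below removes residues from either end and keeps every result whose calculated precursor falls within the tolerance. The names precursor_mz and trim_to_precursor are hypothetical. It also assumes that the observed precursor and charge are passed as plain numbers rather than read from a Spectrum, and that the comparison is made on m/z values. The mass table covers only the residues used in the example.

```python
# Monoisotopic residue masses in Daltons for a handful of amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203,
    "P": 97.05276, "V": 99.06841, "T": 101.04768,
}
WATER = 18.01056
PROTON = 1.00728


def precursor_mz(sequence: str, charge: int) -> float:
    """Calculated precursor m/z of a peptide at the given charge."""
    neutral = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    return (neutral + charge * PROTON) / charge


def trim_to_precursor(sequence: str, observed_mz: float, charge: int,
                      gap: int = 3, tolerance: float = 1.0) -> list:
    """Trim up to gap residues per side; keep sequences within tolerance."""
    matches = []
    for left in range(gap + 1):
        for right in range(gap + 1):
            # Skip the untrimmed sequence and trims that would leave nothing.
            if left + right == 0 or left + right >= len(sequence):
                continue
            trimmed = sequence[left:len(sequence) - right]
            if abs(precursor_mz(trimmed, charge) - observed_mz) <= tolerance:
                matches.append(trimmed)
    return matches


observed = precursor_mz("ASPV", 1)
print(trim_to_precursor("GASPVT", observed, 1))
```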

src.alignment.alignment_utils.align_overlaps(seq1: str, seq2: str) → str¶

Attempt to align two string sequences. It will look at the right side of seq1 and left side of seq2 to overlap the two strings. If no overlap is found, seq2 is appended to seq1

Parameters

seq1 (str) – the left sequence
seq2 (str) – the right sequence

Returns

the sequence formed by overlapping seq1 and seq2

Return type

str

Example

>>> align_overlaps('ABCD', 'CDEF')
'ABCDEF'

Example

>>> align_overlaps('ABCD', 'EFGH')
'ABCD-EFGH'
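One way to get the behavior in the two examples is a longest-suffix/prefix search. This is only a sketch under assumptions: overlap_join is a hypothetical name, and the '-' separator for the no-overlap case follows the second example rather than the prose description.

```python
def overlap_join(seq1: str, seq2: str) -> str:
    """Merge seq2 onto seq1 using the longest suffix/prefix overlap."""
    # Try the longest possible overlap first and shrink until one matches.
    for size in range(min(len(seq1), len(seq2)), 0, -1):
        if seq1.endswith(seq2[:size]):
            return seq1 + seq2[size:]
    # No overlap: join with '-' as in the second example (assumption).
    return seq1 + "-" + seq2


print(overlap_join("ABCD", "CDEF"))
print(overlap_join("ABCD", "EFGH"))
```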

src.alignment.alignment_utils.match_precursor(spectrum: src.objects.Spectrum, sequence: str, db: src.objects.Database, gap: int = 3, tolerance: float = 1) → list¶

Try to fill in the gaps of an alignment, primarily the gap left by the difference in precursor mass. If the difference amounts to more than gap amino acids, an empty list is returned

Parameters

spectrum (Spectrum) – observed spectrum
sequence (str) – the attempted alignment
db (Database) – source of proteins
gap (int) – the maximum number of amino acids to add to or remove from the original sequence. (default is 3)
tolerance (float) – the error (in Daltons) allowed when trying to match a calculated precursor to the observed precursor mass. (default is 1)

Returns

sequences with a precursor within the tolerance of the observed precursor

Return type

list

src.alignment.alignment_utils.get_parents(seq: str, db: src.objects.Database, ion: Optional[str] = None) -> (<class 'list'>, <class 'list'>)¶

Get the parents of a sequence. If the sequence is a hybrid sequence, the second entry of the tuple holds a list of proteins for the right contributor; otherwise the second entry is None.

Parameters

seq (str) – the sequence to look for the parents of
db (Database) – the source proteins
ion (str) – if left as None, look for the full string. Otherwise try to recursively look by taking off sides of the kmer depending on the ion type. (default is None)

Returns

if hybrid, (left source proteins, right source proteins) else, (source proteins, None)

Return type

(list, list)

Example

>>> # non hybrid peptide
>>> get_parents('ABCDE', db, None)
([protein1, protein2], None)

Example

>>> # hybrid peptide
>>> get_parents('ABC(DE)FGH', db, None)
([protein1, protein2], [protein3])
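The parent lookup for the ion=None case might look like the sketch below. Several things here are assumptions made for illustration: the database is replaced by a plain dict of protein name to sequence, the names split_hybrid and find_parents are hypothetical, and the ambiguous section in parentheses is attributed to both the left and the right contributor.

```python
def split_hybrid(seq: str):
    """Split a hybrid string into (left, right); right is None if not hybrid."""
    if "-" in seq:
        left, right = seq.split("-", 1)
        return left, right
    if "(" in seq and ")" in seq:
        head, rest = seq.split("(", 1)
        shared, tail = rest.split(")", 1)
        # Assumption: the ambiguous section counts toward both sides.
        return head + shared, shared + tail
    return seq, None


def find_parents(seq: str, proteins: dict):
    """Return (left parents, right parents or None) by substring search."""
    left, right = split_hybrid(seq)
    left_parents = [name for name, s in proteins.items() if left in s]
    if right is None:
        return left_parents, None
    right_parents = [name for name, s in proteins.items() if right in s]
    return left_parents, right_parents


proteins = {"p1": "XXABCDEYY", "p2": "ZZDEFGHZZ", "p3": "QQABCDEQQ"}
print(find_parents("ABCDE", proteins))
print(find_parents("ABC(DE)FGH", proteins))
```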

src.alignment.alignment_utils.extend_non_hybrid(seq: str, spectrum: src.objects.Spectrum, ion: str, db: src.objects.Database) → list¶

Extend a non hybrid sequence to try and match the predicted length. b ion kmers will be extended to the right, and y ion kmers to the left

Parameters

seq (str) – sequence to be extended
spectrum (Spectrum) – observed spectrum
ion (str) – ion type. Either ‘b’ or ‘y’
db (Database) – source of proteins

Returns

all possible extensions of the initial sequence

Return type

list
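The direction rule (b ion kmers grow to the right, y ion kmers to the left) can be sketched as below. This is not the project's code: extend_kmer is a hypothetical name, the target length is passed in directly instead of being predicted from the spectrum, and the database is replaced by a plain list of protein sequences.

```python
def extend_kmer(seq: str, ion: str, target_length: int, proteins: list) -> list:
    """Extend seq to target_length using the residues of its parent proteins."""
    if len(seq) >= target_length:
        return [seq]
    extensions = []
    for parent in proteins:
        start = parent.find(seq)
        while start != -1:
            if ion == "b":
                # b ions anchor the left end, so grow to the right.
                extended = parent[start:start + target_length]
            else:
                # y ions anchor the right end, so grow to the left.
                end = start + len(seq)
                extended = parent[max(0, end - target_length):end]
            if extended not in extensions:
                extensions.append(extended)
            start = parent.find(seq, start + 1)
    return extensions


print(extend_kmer("ABC", "b", 5, ["MKABCDEFGH", "ABCXYZ"]))
print(extend_kmer("FGH", "y", 5, ["MKABCDEFGH"]))
```

An extension that runs into the end of a protein comes back shorter than the target length in this sketch.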

alignment¶

src.alignment.alignment.same_protein_alignment(seq1: str, seq2: str, parent_sequence: str) -> (<class 'str'>, <class 'str'>)¶

Attempt to create a non-hybrid alignment from two sequences from the same protein. If the two sequences do not directly overlap but are close enough and from the same protein, make the alignment. If not, create a hybrid alignment from the two input halves. If one completely overlaps the other, use the larger sequence as the alignment.

Parameters

seq1 (str) – left sequence
seq2 (str) – right sequence
parent_sequence (str) – parent sequence of seq1 and seq2

Returns

if hybrid sequence, (sequence without special characters, sequence with special hybrid characters); otherwise (sequence, None)

Return type

(str, str or None)

Example

>>> same_protein_alignment('ABC', 'CDE', 'ABCDEFG')
('ABCDE', None)

Example

>>> same_protein_alignment('ABC', 'FGH', 'ABCDEFHI')
('ABCFGH', 'ABC-FGH')

src.alignment.alignment.extend_base_kmers(b_kmers: list, y_kmers: list, spectrum: src.objects.Spectrum, db: src.objects.Database) → list¶

Extend all the base b and y ion matched k-mers to the predicted length to try and find a non-hybrid alignment

Parameters

b_kmers (list) – kmers from b ion masses
y_kmers (list) – kmers from y ion masses
spectrum (Spectrum) – observed spectrum
db (Database) – source proteins

Returns

extended ion kmers (strings)

Return type

list

src.alignment.alignment.refine_alignments(spectrum: src.objects.Spectrum, db: src.objects.Database, alignments: list, precursor_tolerance: int = 10, DEV: bool = False, truth: Optional[dict] = None, fall_off: Optional[dict] = None) → list¶

Refine the rough alignments. This includes precursor matching and removing or replacing ambiguous hybrids

Parameters

spectrum (Spectrum) – observed spectrum in question
db (Database) – Holds all the source sequences
alignments (list) – tuples of (‘nonhybrid_sequence’, None or ‘hybrid_sequence’) alignments
precursor_tolerance (int) – the parts per million error allowed when trying to match precursor masses. (default is 10)
DEV (bool) – set to True if truth is a valid dictionary and fall off detection is desired (default is False)
truth (dict) – a set of id keyed spectra with the desired spectra. A better description of what this looks like can be seen in the param.py file. If left None, the program will continue normally (default is None)
fall_off (dict) – only works if the truth param is set to a dictionary. This is a dictionary (if using multiprocessing, needs to be process safe) where, if a sequence loses the desired sequence, a key value pair of spectrum id, DevFallOffEntry object are added to it. (default is None)

Returns

tuples of refined alignments (‘nonhybrid_sequence’, None or ‘hybrid_sequence’)

Return type

list
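The precursor tolerance here is given in parts per million, while the alignment_utils functions take a tolerance in Daltons. Converting between the two is simple arithmetic, sketched below. The helper names are hypothetical, and taking the ppm window relative to the calculated mass is an assumption.

```python
def ppm_to_da(mass: float, ppm: float) -> float:
    """Width in Daltons of a ppm tolerance at the given mass."""
    return mass * ppm / 1_000_000


def within_ppm(observed: float, calculated: float, ppm: float) -> bool:
    """True if observed lies within ppm parts per million of calculated."""
    return abs(observed - calculated) <= ppm_to_da(calculated, ppm)


# At a mass of 1000 Da, 10 ppm corresponds to about 0.01 Da.
print(ppm_to_da(1000.0, 10))
print(within_ppm(1000.005, 1000.0, 10))
print(within_ppm(1000.02, 1000.0, 10))
```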

src.alignment.alignment.align_b_y(b_kmers: list, y_kmers: list, spectrum: src.objects.Spectrum, db: src.objects.Database) → list¶

Try to connect all b and y k-mers to make either hybrid or non-hybrid string alignments from them.

Parameters

b_kmers (list) – kmers from b ion masses
y_kmers (list) – kmers from y ion masses
spectrum (Spectrum) – observed spectrum
db (Database) – source proteins

Returns

tuples of alignments. If hybrid, (sequence, sequence with special hybrid characters), otherwise (sequence, None)

Return type

list

src.alignment.alignment.attempt_alignment(spectrum: src.objects.Spectrum, db: src.objects.Database, b_hits: list, y_hits: list, n: int = 3, ppm_tolerance: int = 20, precursor_tolerance: int = 10, digest_type: str = '', DEBUG: bool = False, is_last: bool = False, truth: Optional[bool] = None, fall_off: Optional[bool] = None) → src.objects.Alignments ¶

Create an alignment for the input spectrum given an initial set of b and y ion based kmers

Parameters

spectrum (Spectrum) – observed spectrum in question
db (Database) – Holds all the source sequences
b_hits (list) – all k-mers found from the b-ion search
y_hits (list) – all k-mers found from the y-ion search
ppm_tolerance (int) – the parts per million error allowed when trying to match masses. (default is 20)
precursor_tolerance (int) – the parts per million error allowed when trying to match precursor masses. (default is 10)
n (int) – the number of alignments to save. (default is 3)
digest_type (str) – the digest performed on the sample (default is ‘’)
truth (dict) – a set of id keyed spectra with the desired spectra. A better description of what this looks like can be seen in the param.py file. If left None, the program will continue normally (default is None)
fall_off (dict) – only works if the truth param is set to a dictionary. This is a dictionary (if using multiprocessing, needs to be process safe) where, if a sequence loses the desired sequence, a key value pair of spectrum id, DevFallOffEntry object are added to it. (default is None)
is_last (bool) – Only works if DEV is set to true in params. If set to true, timing evaluations are done. (default is False)

Returns

attempted alignments

Return type

Alignments

hybrid_alignment¶

src.alignment.hybrid_alignment.__replace_ambiguous_hybrid(hybrid: tuple, db: src.objects.Database, observed: src.objects.Spectrum) -> (<class 'str'>, <class 'str'>)¶

Attempt to replace a hybrid with a sequence from the database.

Parameters

hybrid (tuple) – tuple of (nonhybrid sequence, hybrid sequence)
db (Database) – source proteins
observed (Spectrum) – observed spectrum

Returns

input or (nonhybrid, None)

Return type

tuple

src.alignment.hybrid_alignment.replace_ambiguous_hybrids(hybrid_alignments: list, db: src.objects.Database, observed: src.objects.Spectrum) → list¶

Remove any ambiguous hybrid alignments that can be explained by non hybrid sequences. The returned list has the sequences or their replacements in the same order that they were in on entry.

Amino acids L and I are swapped and tried in the search due to the ambiguity in their mass

Parameters

hybrid_alignments (list) – tuples of attempted hybrid alignments of (non hybrid sequence, hybrid sequence)
db (Database) – source proteins
observed (Spectrum) – observed spectrum

Returns

If no replacements are found, the input. Otherwise a tuple of (non hybrid, None) is inserted in its position

Return type

list
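The L/I swap can be illustrated by enumerating every variant of a sequence in which each L or I may be either residue, then checking whether any variant occurs in a source protein. The sketch below makes several assumptions: the function names are hypothetical, the database is a plain list of protein sequences, the observed spectrum is ignored, and the matching variant (not the original spelling) is returned.

```python
from itertools import product


def leucine_isoleucine_variants(seq: str) -> list:
    """All spellings of seq where each L or I may be either residue."""
    choices = [("L", "I") if aa in ("L", "I") else (aa,) for aa in seq]
    return ["".join(combo) for combo in product(*choices)]


def replace_if_explained(alignment: tuple, proteins: list) -> tuple:
    """Swap a hybrid for (non hybrid, None) if a protein contains a variant."""
    plain, hybrid = alignment
    if hybrid is None:
        return alignment
    for variant in leucine_isoleucine_variants(plain):
        if any(variant in protein for protein in proteins):
            return (variant, None)
    return alignment


proteins = ["MKAILDEQ"]
print(replace_if_explained(("ALLDE", "AL-LDE"), proteins))
print(replace_if_explained(("WWWW", "WW-WW"), proteins))
```

The number of variants doubles with every L or I in the sequence, so this brute-force enumeration only suits short peptides.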

src.alignment.hybrid_alignment.hybrid_alignment(seq1: str, seq2: str) -> (<class 'str'>, <class 'str'>)¶

Create a hybrid alignment from 2 sequences. If an overlap between these two sequences is found, () are placed around the ambiguous section. If there is no overlap, then seq2 is appended to seq1 with a - at the junction

Parameters

seq1 (str) – left sequence
seq2 (str) – right sequence

Returns

hybrid sequence without special characters, hybrid with special characters

Return type

tuple

Example

>>> hybrid_alignment('ABCDE', 'DEFGH')
('ABCDEFGH', 'ABC(DE)FGH')

Example

>>> hybrid_alignment('ABCD', 'EFGH')
('ABCDEFGH', 'ABCD-EFGH')
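A minimal sketch that reproduces both examples is shown below. The name hybrid_join is hypothetical, and using the longest suffix/prefix overlap as the ambiguous section is an assumption.

```python
def hybrid_join(seq1: str, seq2: str) -> tuple:
    """Return (plain sequence, sequence with hybrid markers)."""
    # Look for the longest suffix of seq1 that is also a prefix of seq2.
    for size in range(min(len(seq1), len(seq2)), 0, -1):
        if seq1.endswith(seq2[:size]):
            plain = seq1 + seq2[size:]
            # Wrap the ambiguous overlap in parentheses.
            marked = seq1[:-size] + "(" + seq2[:size] + ")" + seq2[size:]
            return plain, marked
    # No overlap: mark the junction with '-'.
    return seq1 + seq2, seq1 + "-" + seq2


print(hybrid_join("ABCDE", "DEFGH"))
print(hybrid_join("ABCD", "EFGH"))
```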
