alignment

This module does most of the heavy lifting in the project: the alignment process happens here. The input is a spectrum, and the output is a set of alignments.

alignment_utils

src.alignment.alignment_utils.__get_surrounding_amino_acids(parent_sequence: str, sequence: str, count: int) -> list

Get the amino acids that surround a sequence. For each occurrence, return the count amino acids to the left and the count amino acids to the right as a (left, right) tuple.

Parameters
  • parent_sequence (str) – protein sequence to pull from

  • sequence (str) – subsequence we are looking for

  • count (int) – the number of surrounding amino acids to get per side

Returns

tuples of (left amino acids, right amino acids) from all occurrences in the parent sequence

Return type

list
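
A minimal sketch of how the surrounding residues could be collected, assuming a plain str.find scan that also reports overlapping occurrences. The name surrounding_amino_acids is hypothetical, and this is not necessarily how the module implements it.

```python
def surrounding_amino_acids(parent_sequence: str, sequence: str, count: int) -> list:
    """Hypothetical sketch: (left, right) flanks for every occurrence of sequence."""
    flanks = []
    start = parent_sequence.find(sequence)
    while start != -1:
        end = start + len(sequence)
        # Slices are clipped at the protein ends, so a flank may be shorter than count.
        left = parent_sequence[max(0, start - count):start]
        right = parent_sequence[end:end + count]
        flanks.append((left, right))
        # Advance by one position so overlapping occurrences are found too.
        start = parent_sequence.find(sequence, start + 1)
    return flanks


flanks = surrounding_amino_acids("MKALDEFALDQ", "ALD", 2)
# flanks == [("MK", "EF"), ("EF", "Q")]
```

Near the end of the parent sequence the right flank holds fewer than count residues, as the second tuple shows.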


src.alignment.alignment_utils.__add_amino_acids(spectrum: src.objects.Spectrum, sequence: str, db: src.objects.Database, gap: int = 3, tolerance: float = 1.0) -> list

Try to add amino acids so that the calculated precursor mass comes as close as possible to the observed precursor mass

Parameters
  • spectrum (Spectrum) – observed spectrum

  • sequence (str) – the alignment to add amino acids to

  • db (Database) – holds the source proteins

  • gap (int) – number of amino acids allowed to be added to each side (default is 3)

  • tolerance (float) – the tolerance (in Daltons) to allow when comparing precursor masses. (default is 1)

Returns

all perturbed str sequences with a calculated precursor mass that is within the tolerance of the observed precursor mass

Return type

list


src.alignment.alignment_utils.__remove_amino_acids(spectrum: src.objects.Spectrum, sequence: str, gap: int = 3, tolerance: float = 1) -> list

Remove up to gap amino acids to try to match the precursor mass

Parameters
  • spectrum (Spectrum) – observed spectrum

  • sequence (str) – the alignment to remove amino acids from

  • gap (int) – number of amino acids allowed to be removed from each side (default is 3)

  • tolerance (float) – the tolerance (in Daltons) to allow when comparing precursor masses. (default is 1)

Returns

all perturbed str sequences with a calculated precursor mass that is within the tolerance of the observed precursor mass

Return type

list
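
The trimming idea can be sketched as follows. This is an illustration under stated assumptions, not the module's code: the observed precursor m/z and charge are passed in directly (the Spectrum attributes are not documented here), the comparison is made on m/z values, and the helper names precursor_mz and trim_to_precursor are hypothetical. The residue masses are standard monoisotopic values.

```python
# Standard monoisotopic residue masses in Daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def precursor_mz(sequence: str, charge: int) -> float:
    """Calculated precursor m/z of a peptide at the given charge (hypothetical helper)."""
    neutral = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    return (neutral + charge * PROTON) / charge


def trim_to_precursor(sequence: str, observed_mz: float, charge: int,
                      gap: int = 3, tolerance: float = 1.0) -> list:
    """Hypothetical sketch: try removing 0..gap residues from each side."""
    matches = []
    for left in range(gap + 1):
        for right in range(gap + 1):
            if left == 0 and right == 0:
                continue  # nothing removed, so this is the original sequence
            trimmed = sequence[left:len(sequence) - right]
            if trimmed and abs(precursor_mz(trimmed, charge) - observed_mz) <= tolerance:
                matches.append(trimmed)
    return matches


target = precursor_mz("PEPTIDE", 2)
candidates = trim_to_precursor("APEPTIDEK", target, 2)
# candidates == ["PEPTIDE"]
```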


src.alignment.alignment_utils.align_overlaps(seq1: str, seq2: str) -> str

Attempt to align two string sequences by overlapping the right side of seq1 with the left side of seq2. If no overlap is found, seq2 is appended to seq1; the second example shows the two joined with a -.

Parameters
  • seq1 (str) – the left sequence

  • seq2 (str) – the right sequence

Returns

the aligned sequence built from seq1 and seq2

Return type

str

Example

>>> align_overlaps('ABCD', 'CDEF')
'ABCDEF'

Example

>>> align_overlaps('ABCD', 'EFGH')
'ABCD-EFGH'
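
One way to reproduce the two examples is a longest-suffix/prefix search. The sketch below is an assumption about the approach, and the name align_overlaps_sketch is hypothetical.

```python
def align_overlaps_sketch(seq1: str, seq2: str) -> str:
    """Hypothetical sketch: merge seq1 and seq2 on their longest suffix/prefix overlap."""
    # Try the longest possible overlap first and shrink until one matches.
    for size in range(min(len(seq1), len(seq2)), 0, -1):
        if seq1.endswith(seq2[:size]):
            return seq1 + seq2[size:]
    # No overlap: join with '-' as in the second example.
    return f"{seq1}-{seq2}"


overlapped = align_overlaps_sketch("ABCD", "CDEF")  # 'ABCDEF'
joined = align_overlaps_sketch("ABCD", "EFGH")      # 'ABCD-EFGH'
```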

src.alignment.alignment_utils.match_precursor(spectrum: src.objects.Spectrum, sequence: str, db: src.objects.Database, gap: int = 3, tolerance: float = 1) -> list

Try to fill in the gaps of an alignment. This focuses on closing the difference between the calculated and the observed precursor mass. If that difference amounts to more than gap amino acids, an empty list is returned.

Parameters
  • spectrum (Spectrum) – observed spectrum

  • sequence (str) – the attempted alignment

  • db (Database) – source of proteins

  • gap (int) – the maximum number of amino acids that may be added to or removed from the original sequence. (default is 3)

  • tolerance (float) – the error (in Daltons) allowed when trying to match a calculated precursor to the observed precursor mass. (default is 1)

Returns

sequences with a precursor within the tolerance of the observed precursor

Return type

list


src.alignment.alignment_utils.get_parents(seq: str, db: src.objects.Database, ion: Optional[str] = None) -> (list, list)

Get the parents of a sequence. If the sequence is a hybrid sequence, the second entry of the tuple holds a list of proteins for the right contributor; otherwise the second entry is None.

Parameters
  • seq (str) – the sequence to look for the parents of

  • db (Database) – the source proteins

  • ion (str) – if left as None, look for the full string. Otherwise try to recursively look by taking off sides of the kmer depending on the ion type. (default is None)

Returns

if hybrid, (left source proteins, right source proteins); otherwise (source proteins, None)

Return type

(list, list)

Example

>>> # non hybrid peptide
>>> get_parents('ABCDE', db, None)
([protein1, protein2], None)

Example

>>> # hybrid peptide
>>> get_parents('ABC(DE)FGH', db, None)
([protein1, protein2], [protein3])
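
The hybrid notation can be split into a left and a right contributor before the lookup. The sketch below assumes that a parenthesised section may belong to either side, and it uses a plain dict of name to protein sequence in place of the Database object. Both helper names are hypothetical.

```python
def split_hybrid(seq: str):
    """Hypothetical sketch: return (left, right) contributors, or (seq, None) if not hybrid."""
    if "-" in seq:
        left, right = seq.split("-", 1)
        return left, right
    if "(" in seq and ")" in seq:
        # The ambiguous section is kept on both sides.
        left = seq[:seq.index(")")].replace("(", "")
        right = seq[seq.index("(") + 1:].replace(")", "")
        return left, right
    return seq, None


def parents_sketch(seq: str, proteins: dict):
    """Hypothetical sketch: names of proteins that contain each contributor."""
    left, right = split_hybrid(seq)
    left_parents = [name for name, sequence in proteins.items() if left in sequence]
    if right is None:
        return left_parents, None
    right_parents = [name for name, sequence in proteins.items() if right in sequence]
    return left_parents, right_parents


proteins = {"protein1": "XXABCDEYY", "protein2": "ZZDEFGHZZ"}
hybrid_parents = parents_sketch("ABC(DE)FGH", proteins)  # (['protein1'], ['protein2'])
plain_parents = parents_sketch("ABCDE", proteins)        # (['protein1'], None)
```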

src.alignment.alignment_utils.extend_non_hybrid(seq: str, spectrum: src.objects.Spectrum, ion: str, db: src.objects.Database) -> list

Extend a non hybrid sequence to try and match the predicted length. b ion kmers will be extended to the right, and y ion kmers to the left

Parameters
  • seq (str) – sequence to be extended

  • spectrum (Spectrum) – observed spectrum

  • ion (str) – ion type. Either ‘b’ or ‘y’

  • db (Database) – source of proteins

Returns

all possible extensions of the initial sequence

Return type

list
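
A rough sketch of the directional extension, under two assumptions: the predicted length is passed in as target_length (how it is derived from the spectrum is not described here), and a plain parent string stands in for the Database lookup. The name extend_kmer is hypothetical.

```python
def extend_kmer(seq: str, parent_sequence: str, ion: str, target_length: int) -> list:
    """Hypothetical sketch: grow seq to target_length using its parent sequence.

    Assumes target_length >= len(seq).
    """
    extensions = []
    start = parent_sequence.find(seq)
    while start != -1:
        if ion == "b":
            # b ion kmers keep their left edge and grow to the right.
            extended = parent_sequence[start:start + target_length]
        else:
            # y ion kmers keep their right edge and grow to the left.
            end = start + len(seq)
            extended = parent_sequence[max(0, end - target_length):end]
        extensions.append(extended)
        start = parent_sequence.find(seq, start + 1)
    return extensions


b_extended = extend_kmer("CDE", "ABCDEFGH", "b", 5)  # ['CDEFG']
y_extended = extend_kmer("CDE", "ABCDEFGH", "y", 5)  # ['ABCDE']
```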

alignment

src.alignment.alignment.same_protein_alignment(seq1: str, seq2: str, parent_sequence: str) -> (str, str)

Attempt to create a non-hybrid alignment from two sequences from the same protein. If the two sequences do not directly overlap but are close enough and from the same protein, make the alignment. If not, create a hybrid alignment from the two input halves. If one completely overlaps the other, use the larger sequence as the alignment.

Parameters
  • seq1 (str) – left sequence

  • seq2 (str) – right sequence

  • parent_sequence (str) – parent sequence of seq1 and seq2

Returns

if hybrid, (sequence without special characters, sequence with special hybrid characters); otherwise (sequence, None)

Return type

(str, str or None)

Example

>>> same_protein_alignment('ABC', 'CDE', 'ABCDEFG')
('ABCDE', None)

Example

>>> same_protein_alignment('ABC', 'FGH', 'ABCDEFHI')
('ABCFGH', 'ABC-FGH')

src.alignment.alignment.extend_base_kmers(b_kmers: list, y_kmers: list, spectrum: src.objects.Spectrum, db: src.objects.Database) -> list

Extend all the base b and y ion matched k-mers to the predicted length to try and find a non-hybrid alignment

Parameters
  • b_kmers (list) – kmers from b ion masses

  • y_kmers (list) – kmers from y ion masses

  • spectrum (Spectrum) – observed spectrum

  • db (Database) – source proteins

Returns

extended ion kmers (strings)

Return type

list


src.alignment.alignment.refine_alignments(spectrum: src.objects.Spectrum, db: src.objects.Database, alignments: list, precursor_tolerance: int = 10, DEV: bool = False, truth: Optional[dict] = None, fall_off: Optional[dict] = None) -> list

Refine the rough alignments made. This includes precursor matching and ambiguous hybrid removals/replacements

Parameters
  • spectrum (Spectrum) – observed spectrum in question

  • db (Database) – Holds all the source sequences

  • alignments (list) – tuples of (‘nonhybrid_sequence’, None or ‘hybrid_sequence’) alignments

  • precursor_tolerance (int) – the parts per million error allowed when trying to match precursor masses. (default is 10)

  • DEV (bool) – set to True if truth is a valid dictionary and fall off detection is desired (default is False)

  • truth (dict) – a dictionary of the desired spectra, keyed by id. The param.py file describes its layout in more detail. If left as None, the program continues normally (default is None)

  • fall_off (dict) – only works if the truth param is set to a dictionary. This is a dictionary (process safe if multiprocessing is used) to which a key-value pair of (spectrum id, DevFallOffEntry object) is added whenever the desired sequence is lost. (default is None)

Returns

tuples of refined alignments (‘nonhybrid_sequence’, None or ‘hybrid_sequence’)

Return type

list
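
Note that precursor_tolerance here is given in parts per million, whereas the tolerance in match_precursor is in Daltons. A small sketch of the conversion follows; the helper names are hypothetical, and using the calculated mass as the reference is an assumption.

```python
def ppm_to_daltons(mass: float, ppm: float) -> float:
    """Absolute error window (Da) that corresponds to a ppm tolerance at this mass."""
    return mass * ppm / 1_000_000


def within_ppm(observed: float, calculated: float, ppm: float = 10) -> bool:
    """Hypothetical check: is observed within ppm of the calculated mass?"""
    return abs(observed - calculated) <= ppm_to_daltons(calculated, ppm)


window = ppm_to_daltons(1000.0, 10)    # about 0.01 Da at mass 1000
close = within_ppm(1000.005, 1000.0)   # True
far = within_ppm(1000.02, 1000.0)      # False
```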


src.alignment.alignment.align_b_y(b_kmers: list, y_kmers: list, spectrum: src.objects.Spectrum, db: src.objects.Database) -> list

Try to connect all b and y k-mers into either hybrid or non-hybrid string alignments.

Parameters
  • b_kmers (list) – kmers from b ion masses

  • y_kmers (list) – kmers from y ion masses

  • spectrum (Spectrum) – observed spectrum

  • db (Database) – source proteins

Returns

tuples of alignments. If hybrid, (sequence, sequence with special hybrid characters), otherwise (sequence, None)

Return type

list


src.alignment.alignment.attempt_alignment(spectrum: src.objects.Spectrum, db: src.objects.Database, b_hits: list, y_hits: list, n: int = 3, ppm_tolerance: int = 20, precursor_tolerance: int = 10, digest_type: str = '', DEBUG: bool = False, is_last: bool = False, truth: Optional[bool] = None, fall_off: Optional[bool] = None) -> src.objects.Alignments

Create an alignment for the input spectrum given an initial set of b and y ion based kmers

Parameters
  • spectrum (Spectrum) – observed spectrum in question

  • db (Database) – Holds all the source sequences

  • b_hits (list) – all k-mers found from the b-ion search

  • y_hits (list) – all k-mers found from the y-ion search

  • ppm_tolerance (int) – the parts per million error allowed when trying to match masses. (default is 20)

  • precursor_tolerance (int) – the parts per million error allowed when trying to match precursor masses. (default is 10)

  • n (int) – the number of alignments to save. (default is 3)

  • digest_type (str) – the digest performed on the sample (default is ‘’)

  • truth (dict) – a dictionary of the desired spectra, keyed by id. The param.py file describes its layout in more detail. If left as None, the program continues normally (default is None)

  • fall_off (dict) – only works if the truth param is set to a dictionary. This is a dictionary (process safe if multiprocessing is used) to which a key-value pair of (spectrum id, DevFallOffEntry object) is added whenever the desired sequence is lost. (default is None)

  • is_last (bool) – Only works if DEV is set to true in params. If set to true, timing evaluations are done. (default is False)

Returns

attempted alignments

Return type

Alignments

hybrid_alignment

src.alignment.hybrid_alignment.__replace_ambiguous_hybrid(hybrid: tuple, db: src.objects.Database, observed: src.objects.Spectrum) -> (str, str)

Attempt to replace a hybrid with a sequence from the database.

Parameters
  • hybrid (tuple) – tuple of (nonhybrid sequence, hybrid sequence)

  • db (Database) – source proteins

  • observed (Spectrum) – observed spectrum

Returns

the input hybrid if no replacement is found, otherwise (nonhybrid, None)

Return type

tuple


src.alignment.hybrid_alignment.replace_ambiguous_hybrids(hybrid_alignments: list, db: src.objects.Database, observed: src.objects.Spectrum) -> list

Remove any ambiguous hybrid alignments that can be explained by non hybrid sequences. The returned list has the sequences or their replacements in the same order that they were in on entry.

Amino acids L and I are swapped and tried in the search because they have the same mass and cannot be told apart.

Parameters
  • hybrid_alignments (list) – tuples of attempted hybrid alignments of (non hybrid sequence, hybrid sequence)

  • db (Database) – source proteins

  • observed (Spectrum) – observed spectrum

Returns

the input list, in which every hybrid that has a replacement is swapped in place for a tuple of (non hybrid, None). If no replacements are found, the input is returned unchanged.

Return type

list
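
The L/I swap can be sketched by enumerating every leucine/isoleucine variant of a sequence and searching the source proteins for each one. This assumes a plain list of protein strings rather than the Database object, and both function names are hypothetical.

```python
from itertools import product


def leucine_isoleucine_variants(sequence: str) -> list:
    """Every sequence obtained by swapping L and I at any combination of positions."""
    choices = [("L", "I") if aa in "LI" else (aa,) for aa in sequence]
    return ["".join(combo) for combo in product(*choices)]


def find_non_hybrid(sequence: str, proteins: list):
    """Hypothetical sketch: first L/I variant found inside a source protein, else None."""
    for variant in leucine_isoleucine_variants(sequence):
        if any(variant in protein for protein in proteins):
            return variant
    return None


variants = leucine_isoleucine_variants("ALI")           # ['ALL', 'ALI', 'AIL', 'AII']
replacement = find_non_hybrid("ALDE", ["MKAIDEF"])      # 'AIDE'
```

The number of variants doubles with every L or I in the sequence, so a real implementation would likely need a cheaper search for long peptides.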


src.alignment.hybrid_alignment.hybrid_alignment(seq1: str, seq2: str) -> (str, str)

Create a hybrid alignment from two sequences. If an overlap between the two sequences is found, () are placed around the ambiguous section. If there is no overlap, seq2 is appended to seq1 with a - at the junction.

Parameters
  • seq1 (str) – left sequence

  • seq2 (str) – right sequence

Returns

hybrid sequence without special characters, hybrid with special characters

Return type

tuple

Example

>>> hybrid_alignment('ABCDE', 'DEFGH')
('ABCDEFGH', 'ABC(DE)FGH')

Example

>>> hybrid_alignment('ABCD', 'EFGH')
('ABCDEFGH', 'ABCD-EFGH')
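
Both examples can be reproduced with a longest-suffix/prefix search that marks the shared section. The sketch below is an assumption about the approach, and the name hybrid_alignment_sketch is hypothetical.

```python
def hybrid_alignment_sketch(seq1: str, seq2: str):
    """Hypothetical sketch: return (plain hybrid, hybrid with special characters)."""
    # Look for the longest suffix of seq1 that is also a prefix of seq2.
    for size in range(min(len(seq1), len(seq2)), 0, -1):
        overlap = seq2[:size]
        if seq1.endswith(overlap):
            plain = seq1 + seq2[size:]
            marked = f"{seq1[:-size]}({overlap}){seq2[size:]}"
            return plain, marked
    # No overlap: mark the junction with '-'.
    return seq1 + seq2, f"{seq1}-{seq2}"


with_overlap = hybrid_alignment_sketch("ABCDE", "DEFGH")  # ('ABCDEFGH', 'ABC(DE)FGH')
no_overlap = hybrid_alignment_sketch("ABCD", "EFGH")      # ('ABCDEFGH', 'ABCD-EFGH')
```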