preprocessing¶

The preprocessing module covers spectra loading and filtering, and reduces the search space over the database by tagging each observed mass with the k-mers that have a matching b or y ion mass. The merge search performs this matching in linear time. The current implementation is not well optimized: the practical limit appears to be approximately 300 proteins when fitting comfortably in 32 GB of RAM. Future work on preprocessing should try to reduce the time and space needed to tag these k-mers, possibly forgoing this approach altogether.

spectra_filtering¶

src.preprocessing.spectra_filtering.relative_abundance_filtering(masses: list, abundances: list, percentage: float) -> (<class 'list'>, <class 'list'>)¶

Take all peaks from the spectrum whose abundance is at least percentage of the total abundance. It is assumed that the masses and abundances lists share ordering.

Parameters

masses (list) – m/z values
abundances (list) – abundance values for the m/z values. The abundance at entry i corresponds to the m/z value at entry i
percentage (float) – the minimum percentage of the total abundance a peak must have to pass the filter. Values are in the range [0, 1). A relatively realistic value is .005 (.5%)

Returns

filtered masses, filtered abundances

Return type

(list, list)
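The filter described above can be sketched as follows. This is a minimal illustration written from the description on this page, not the library's implementation; the function name `relative_abundance_filter_sketch` and the sample data are hypothetical.

```python
def relative_abundance_filter_sketch(masses, abundances, percentage):
    """Keep peaks whose abundance is at least `percentage` of the total.

    Assumes masses[i] and abundances[i] describe the same peak.
    """
    total = sum(abundances)
    min_abundance = total * percentage  # absolute cutoff derived from the fraction

    # Walk both lists in step so the pairing between m/z and abundance is kept
    kept = [(m, a) for m, a in zip(masses, abundances) if a >= min_abundance]

    filtered_masses = [m for m, _ in kept]
    filtered_abundances = [a for _, a in kept]
    return filtered_masses, filtered_abundances


# Hypothetical spectrum: total abundance is 1000, so the cutoff at 0.5% is 5
masses = [100.0, 200.0, 300.0, 400.0]
abundances = [1.0, 50.0, 149.0, 800.0]
result = relative_abundance_filter_sketch(masses, abundances, 0.005)
print(result)
```

With this sample data only the first peak falls below the cutoff, so the other three peaks are returned in their original order.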

src.preprocessing.spectra_filtering.peak_filtering(masses: list, abundances: list, num_peaks: int) -> (<class 'list'>, <class 'list'>)¶

Take the most abundant peaks and return the masses, sorted, together with their abundances. It is assumed that the masses and abundances lists share ordering.

Parameters

masses (list) – m/z values
abundances (list) – abundance values for the m/z values. The abundance at entry i corresponds to the m/z value at entry i
num_peaks (int) – the number of most abundant peaks to keep

Returns

filtered masses, filtered abundances

Return type

(list, list)
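A minimal sketch of this top-N selection is shown below. It is written from the description above rather than taken from the library; the name `peak_filter_sketch` and the sample values are hypothetical, and tie-breaking between equally abundant peaks is an assumption (earlier peaks win).

```python
def peak_filter_sketch(masses, abundances, num_peaks):
    """Keep the `num_peaks` most abundant peaks, returned in ascending m/z order."""
    # Rank peak positions by abundance, highest first, and keep the top num_peaks
    ranked = sorted(range(len(abundances)), key=lambda i: abundances[i], reverse=True)
    top = ranked[:num_peaks]

    # Re-order the surviving positions by mass so the output is mass-sorted
    top.sort(key=lambda i: masses[i])

    return [masses[i] for i in top], [abundances[i] for i in top]


# Hypothetical unsorted spectrum
masses = [300.0, 100.0, 200.0, 400.0]
abundances = [5.0, 50.0, 1.0, 20.0]
top_two = peak_filter_sketch(masses, abundances, 2)
print(top_two)
```

Selecting by index keeps each mass paired with its own abundance even though the two lists are sorted on different keys.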

merge_search¶

src.preprocessing.merge_search.merge(mz_s: Iterable, indices: Iterable, kmers: Iterable, boundaries: Iterable) → collections.defaultdict¶

Perform a linear-time search for m/z values that fall within the given boundaries, producing a mapping from each m/z boundary [lower_bound, upper_bound] to its matching k-mers

Parameters

mz_s (Iterable) – m/z values to look through
indices (Iterable) – index mappings from m/z values to k-mers. To get from an m/z value to its k-mers: for the m/z value at index i, take every position j in the range indices[i-1] to indices[i]; the k-mers for that m/z value are kmers[j] for each such j.
kmers (Iterable) – the k-mers associated with the mz_s, in the ranges given by indices as described in the indices param
boundaries (Iterable) – lower and upper bounds for a mass, each given as a list like [lower_bound, upper_bound]

Returns

mapping from a string (using src.utils.hashable_boundaries) of <lower_bound>-<upper_bound> to a list of k-mers

Return type

defaultdict
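The two-pointer walk behind a linear merge search can be sketched as below. This is an illustration under stated assumptions, not the library's code: `mz_s` and `boundaries` are assumed to be sorted ascending, boundaries are assumed not to overlap, the range for an m/z value is taken as the half-open slice from indices[i-1] (or 0 when i is 0) up to indices[i], and `boundary_key` is a hypothetical stand-in for src.utils.hashable_boundaries.

```python
from collections import defaultdict


def boundary_key(boundary):
    # Hypothetical stand-in for src.utils.hashable_boundaries
    return f"{boundary[0]}-{boundary[1]}"


def merge_sketch(mz_s, indices, kmers, boundaries):
    """Map each boundary to the k-mers whose m/z falls inside it, in one pass."""
    matched = defaultdict(list)
    mz_i, b_i = 0, 0

    while mz_i < len(mz_s) and b_i < len(boundaries):
        lower, upper = boundaries[b_i]
        mz = mz_s[mz_i]

        if mz < lower:
            mz_i += 1   # m/z lies before this boundary: advance the m/z pointer
        elif mz > upper:
            b_i += 1    # m/z lies past this boundary: advance the boundary pointer
        else:
            # m/z is inside the boundary: collect its slice of the k-mer list
            start = indices[mz_i - 1] if mz_i > 0 else 0
            end = indices[mz_i]
            matched[boundary_key(boundaries[b_i])].extend(kmers[start:end])
            mz_i += 1

    return matched


# Hypothetical data: three masses owning 2, 1 and 2 k-mers respectively
mz_s = [100.0, 150.0, 200.0]
indices = [2, 3, 5]
kmers = ["AB", "BA", "CD", "EF", "FE"]
boundaries = [[99.9, 100.1], [199.9, 200.1]]
matches = merge_sketch(mz_s, indices, kmers, boundaries)
print(dict(matches))
```

Because each step advances one of the two pointers, the walk visits every m/z value and every boundary at most once, which is where the linear running time comes from.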

src.preprocessing.merge_search.make_database_set(proteins: list, max_len: int) -> (<class 'array.array'>, <class 'array.array'>, <class 'list'>, <class 'array.array'>, <class 'array.array'>, <class 'list'>, <class 'dict'>)¶

Create parallel lists of (masses, index_maps, kmers) for the merge sort operation, where index_maps maps the masses to a range of positions in the kmers list

Parameters

proteins (list) – protein entries of shape (name, entry) where entry has a .sequence attribute
max_len (int) – max k-mer length

Returns

b ion masses created from the protein set, mapping from b ion masses to the kmers associated (same length as b ion masses), kmers associated with b ion masses, y ion masses created from the protein set, mapping from y ion masses to the kmers associated (same length as y ion masses), kmers associated with y ion masses, mapping of k-mers to source proteins

Return type

(array, array, list, array, array, list, dict)
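To make the parallel-list layout concrete, the sketch below groups k-mers by mass and records, for each mass, where its k-mers end in the flat k-mer list. It is an assumption-laden illustration, not the library's code: the function name `build_parallel_lists` is hypothetical, the masses are supplied as precomputed placeholder numbers rather than calculated from sequences, and the end-exclusive index convention is assumed.

```python
from array import array
from collections import defaultdict


def build_parallel_lists(kmer_masses):
    """Build (masses, index_maps, kmers) from (kmer, mass) pairs.

    index_maps[i] is the end position (exclusive) in `kmers` of the k-mers
    that share masses[i]; the start is index_maps[i-1], or 0 when i is 0.
    """
    by_mass = defaultdict(list)
    for kmer, mass in kmer_masses:
        by_mass[mass].append(kmer)

    masses = array("d")      # sorted, unique masses
    index_maps = array("i")  # same length as masses
    kmers = []               # flat list of k-mers, grouped by mass

    for mass in sorted(by_mass):
        kmers.extend(by_mass[mass])
        masses.append(mass)
        index_maps.append(len(kmers))

    return masses, index_maps, kmers


# Placeholder masses for illustration only; not real ion masses
pairs = [("AG", 129.07), ("GA", 129.07), ("S", 88.04)]
db_masses, db_index_maps, db_kmers = build_parallel_lists(pairs)
print(list(db_masses), list(db_index_maps), db_kmers)
```

Storing the masses and index maps in compact arrays and sharing one flat k-mer list avoids holding a separate Python list per mass, which matters given the memory limits noted at the top of this page.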

src.preprocessing.merge_search.match_masses(spectra_boundaries: list, db: src.objects.Database, max_pep_len: int = 30) -> (<class 'dict'>, <class 'dict'>, <class 'src.objects.Database'>)¶

Take in a list of boundaries from observed spectra and return a b and y dictionary that maps boundaries -> kmers

Parameters

spectra_boundaries (list) – boundaries as lists as [lower_bound, upper_bound]
db (Database) – source proteins
max_pep_len (int) – maximum peptide length in k-mer prefetching

Returns

mapping of b ion masses to k-mers, mapping of y ion masses to k-mers, updated database

Return type

(dict, dict, Database)

preprocessing_utils¶

src.preprocessing.preprocessing_utils.load_spectra(spectra_files: list, ppm_tol: int, peak_filter: int = 0, relative_abundance_filter: float = 0.0) -> (<class 'list'>, <class 'list'>, <class 'dict'>)¶

Load all the spectra files into memory and merge all spectra into one massive list for reduction of the search space

Parameters

spectra_files (list) – full string paths to the spectra files
ppm_tol (int) – parts per million mass error allowed for making boundaries
peak_filter (int) – the number of most abundant peaks to keep. If left as 0, relative_abundance_filter is used instead. (default is 0)
relative_abundance_filter (float) – the percentage of the total abundance a peak must make up in order to pass the filter. Values should be in the range [0, 1). A realistic value is .005 (.5%). If peak_filter is non-zero, that value is used instead. (default is 0.0)

Returns

Spectra objects from file, overlapped boundaries of [lower_bound, upper_bound], mapping from a m/z value to the index of the boundaries that the m/z fits in

Return type

(list, list, dict)
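The boundary construction can be sketched as follows: each m/z value gets a window of plus or minus ppm_tol parts per million, overlapping windows are merged, and each m/z value is mapped to the index of the merged boundary that contains it. This is a hypothetical illustration of that idea, not the library's implementation; the function name `ppm_boundaries_sketch` and the sample values are assumptions.

```python
def ppm_boundaries_sketch(mz_values, ppm_tol):
    """Return (merged boundaries, mapping from m/z value to boundary index)."""
    ordered = sorted(mz_values)

    # One [lower, upper] window per m/z value, scaled by the ppm tolerance
    windows = []
    for mz in ordered:
        tol = mz * ppm_tol / 1_000_000
        windows.append([mz - tol, mz + tol])

    # Merge windows that overlap; lower bounds are ascending because
    # the m/z values were sorted first
    merged = []
    for lower, upper in windows:
        if merged and lower <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], upper)
        else:
            merged.append([lower, upper])

    # Each m/z value lies inside exactly one merged boundary
    mz_to_boundary = {}
    b_i = 0
    for mz in ordered:
        while merged[b_i][1] < mz:
            b_i += 1
        mz_to_boundary[mz] = b_i

    return merged, mz_to_boundary


# Hypothetical peaks: the first two are close enough to share a boundary at 20 ppm
boundaries, mz_map = ppm_boundaries_sketch([100.0, 100.001, 200.0], 20)
print(boundaries)
print(mz_map)
```

Merging overlapping windows keeps the boundary list sorted and non-overlapping, which is the shape a single linear pass over the database masses relies on.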
