preprocessing¶
The preprocessing module covers spectra loading and filtering, and reduces the search space over the database by tagging observed masses with the k-mers whose b or y ion mass matches. The merge search performs this matching in linear time. It is not yet well optimized: the practical limit appears to be approximately 300 proteins for the data to fit comfortably in 32 GB of RAM. Future work on preprocessing should try to reduce the time and space needed to tag these k-mers, possibly forgoing this approach altogether.
spectra_filtering¶
- src.preprocessing.spectra_filtering.relative_abundance_filtering(masses: list, abundances: list, percentage: float) -> (<class 'list'>, <class 'list'>)¶
Take all peaks from the spectrum whose abundance is at least percentage of the total abundance. It is assumed that the masses and abundances lists share ordering.
- Parameters
masses (list) – m/z values
abundances (list) – abundance values for the m/z values. The abundance at entry i corresponds to the m/z value at entry i
percentage (float) – the minimum percentage of the total abundance a peak must have to pass the filter. Values are in the range [0, 1). A relatively realistic value is .005 (.5%)
- Returns
filtered masses, filtered abundances
- Return type
(list, list)
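The filter described above can be sketched as follows. This is a minimal illustration of the documented behavior, not the module's actual source; the function name `relative_abundance_filter_sketch` is hypothetical, and the inclusive `>=` comparison is an assumption drawn from the phrase "at least".

```python
def relative_abundance_filter_sketch(masses, abundances, percentage):
    """Keep peaks whose abundance is at least `percentage` of the total.

    Hypothetical re-implementation for illustration only.
    """
    # Threshold is a fraction of the summed abundance of the whole spectrum.
    threshold = sum(abundances) * percentage

    # masses and abundances are parallel lists, so filter them as pairs.
    kept = [(m, a) for m, a in zip(masses, abundances) if a >= threshold]

    kept_masses = [m for m, _ in kept]
    kept_abundances = [a for _, a in kept]
    return kept_masses, kept_abundances


# Total abundance is 1000, so a 0.5% cutoff drops peaks below 5.
filtered = relative_abundance_filter_sketch(
    [100.0, 200.0, 300.0], [1, 50, 949], 0.005
)
```

Because the threshold scales with the total abundance, the same percentage keeps a different number of peaks in each spectrum.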
- src.preprocessing.spectra_filtering.peak_filtering(masses: list, abundances: list, num_peaks: int) -> (<class 'list'>, <class 'list'>)¶
Take the num_peaks most abundant peaks and return the sorted masses with their abundances. It is assumed that the masses and abundances lists share ordering.
- Parameters
masses (list) – m/z values
abundances (list) – abundance values for the m/z values. The abundance at entry i corresponds to the m/z value at entry i
num_peaks (int) – the number of most abundant peaks to keep
- Returns
filtered masses, filtered abundances
- Return type
(list, list)
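A top-N peak filter of this kind might look like the sketch below. The name `peak_filter_sketch` is hypothetical, and sorting the result by ascending mass is an assumption about what "sorted masses" means here.

```python
def peak_filter_sketch(masses, abundances, num_peaks):
    """Keep the `num_peaks` most abundant peaks, returned in mass order.

    Hypothetical re-implementation for illustration only.
    """
    # Rank (mass, abundance) pairs by abundance, highest first, and truncate.
    top = sorted(zip(masses, abundances), key=lambda p: p[1], reverse=True)
    top = top[:num_peaks]

    # Restore ascending mass order (assumed meaning of "sorted masses").
    top.sort(key=lambda p: p[0])

    return [m for m, _ in top], [a for _, a in top]


top_masses, top_abundances = peak_filter_sketch(
    [300.0, 100.0, 200.0, 400.0], [5, 20, 1, 10], 2
)
```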
merge_search¶
- src.preprocessing.merge_search.merge(mz_s: Iterable, indices: Iterable, kmers: Iterable, boundaries: Iterable) → collections.defaultdict¶
Perform a linear search of observed mz values that fit into mappings to get a mapping from mz boundaries [lower_bound, upper_bound] to a set of k-mers
- Parameters
mz_s (Iterable) – mz values to look through
indices (Iterable) – index mappings from mz values to kmers. The steps to get from an m/z value to a set of k-mers are as follows: for an m/z value at index i, take the positions in the range indices[i-1] to indices[i], and call each position j in J. The k-mers we want are then kmers[j] for each j in J.
kmers (Iterable) – the k-mers associated with the mz_s in the range of indices as described in the indices param
boundaries (Iterable) – lower and upper bounds for a mass, each as a list like [lower_bound, upper_bound]
- Returns
mapping from a string (using src.utils.hashable_boundaries) of <lower_bound>-<upper_bound> to a list of k-mers
- Return type
defaultdict
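The linear search and the indices lookup can be illustrated with a two-pointer sketch. Several things here are assumptions rather than documented behavior: that mz_s and boundaries are both sorted ascending, that boundaries do not overlap, that indices[i] is an exclusive end position, and that the range for i = 0 starts at 0. The helper `boundary_key` is a hypothetical stand-in for src.utils.hashable_boundaries.

```python
from collections import defaultdict


def boundary_key(boundary):
    # Hypothetical stand-in for src.utils.hashable_boundaries.
    return f"{boundary[0]}-{boundary[1]}"


def merge_sketch(mz_s, indices, kmers, boundaries):
    """Two-pointer walk over sorted masses and sorted boundaries."""
    matched = defaultdict(list)
    mz_i, b_i = 0, 0

    while mz_i < len(mz_s) and b_i < len(boundaries):
        lower, upper = boundaries[b_i]
        mz = mz_s[mz_i]

        if mz < lower:
            mz_i += 1  # mass is below this boundary: try the next mass
        elif mz > upper:
            b_i += 1  # mass is above this boundary: try the next boundary
        else:
            # k-mers for this mass live in kmers[indices[i-1]:indices[i]].
            start = indices[mz_i - 1] if mz_i > 0 else 0
            matched[boundary_key(boundaries[b_i])].extend(
                kmers[start:indices[mz_i]]
            )
            mz_i += 1

    return matched


result = merge_sketch(
    [100.0, 150.0, 200.0],
    [2, 3, 5],
    ["AB", "BA", "C", "DE", "ED"],
    [[99.5, 100.5], [199.0, 201.0]],
)
```

Each loop iteration advances one of the two pointers, so the walk takes time linear in the combined length of mz_s and boundaries, plus the number of k-mers copied.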
- src.preprocessing.merge_search.make_database_set(proteins: list, max_len: int) -> (<class 'array.array'>, <class 'array.array'>, <class 'list'>, <class 'array.array'>, <class 'array.array'>, <class 'list'>, <class 'dict'>)¶
Create parallel lists of (masses, index_maps, kmers) for the merge search operation, where index_maps map the masses to a range of positions in the kmers list
- Parameters
proteins (list) – protein entries of shape (name, entry) where entry has a .sequence attribute
max_len (int) – max k-mer length
- Returns
b ion masses created from the protein set, mapping from b ion masses to the kmers associated (same length as b ion masses), kmers associated with b ion masses, y ion masses created from the protein set, mapping from y ion masses to the kmers associated (same length as y ion masses), kmers associated with y ion masses, mapping of k-mers to source proteins
- Return type
(array, array, list, array, array, list, dict)
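To make the parallel-list layout concrete, the sketch below builds (masses, index_map, kmers) for b ions only from plain sequence strings. It is an illustration under assumptions, not the module's code: the names `b_ion_mass` and `build_parallel_lists` are hypothetical, the residue table covers only three amino acids, ions are treated as singly charged, masses are rounded to four decimals for grouping, and plain lists are used where the real function returns arrays.

```python
from collections import defaultdict

# Monoisotopic residue masses for three amino acids (deliberately tiny table).
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203}
PROTON = 1.00728  # proton mass, assuming singly charged ions


def b_ion_mass(kmer):
    # Singly charged b ion: sum of residue masses plus one proton.
    return round(sum(RESIDUE_MASS[aa] for aa in kmer) + PROTON, 4)


def build_parallel_lists(sequences, max_len):
    """Group k-mers by mass, then flatten into three parallel lists."""
    by_mass = defaultdict(set)
    for seq in sequences:
        for start in range(len(seq)):
            longest = min(start + max_len, len(seq))
            for end in range(start + 1, longest + 1):
                kmer = seq[start:end]
                by_mass[b_ion_mass(kmer)].add(kmer)

    masses, index_map, kmers = [], [], []
    for mass in sorted(by_mass):
        kmers.extend(sorted(by_mass[mass]))
        masses.append(mass)
        # Exclusive end position of this mass's k-mers in the kmers list.
        index_map.append(len(kmers))
    return masses, index_map, kmers


masses, index_map, kmers = build_parallel_lists(["GAG"], 2)
```

In this layout, k-mers that share a mass (here "AG" and "GA") sit next to each other in kmers, so a single mass entry points at all of them through its index range.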
- src.preprocessing.merge_search.match_masses(spectra_boundaries: list, db: src.objects.Database, max_pep_len: int = 30) -> (<class 'dict'>, <class 'dict'>, <class 'src.objects.Database'>)¶
Take in a list of boundaries from observed spectra and return a b and y dictionary that maps boundaries -> kmers
- Parameters
spectra_boundaries (list) – boundaries as lists as [lower_bound, upper_bound]
db (Database) – source proteins
max_pep_len (int) – maximum peptide length in k-mer prefetching
- Returns
mapping of b ion masses to k-mers, mapping of y ion masses to k-mers, updated database
- Return type
(dict, dict, Database)
preprocessing_utils¶
- src.preprocessing.preprocessing_utils.load_spectra(spectra_files: list, ppm_tol: int, peak_filter: int = 0, relative_abundance_filter: float = 0.0) -> (<class 'list'>, <class 'list'>, <class 'dict'>)¶
Load all the spectra files into memory and merge all spectra into one massive list for reduction of the search space
- Parameters
spectra_files (list) – full string paths to the spectra files
ppm_tol (int) – parts per million mass error allowed for making boundaries
peak_filter (int) – the number of most abundant peaks to keep. If left as 0, relative_abundance_filter is used instead. (default is 0)
relative_abundance_filter (float) – the percentage of the total abundance a peak must make up in order to pass the filter. Values should be in the range [0, 1). A realistic value is .005 (.5%). If peak_filter is non-zero, that value is used instead. (default is 0.0)
- Returns
Spectra objects from file, overlapped boundaries of [lower_bound, upper_bound], mapping from an m/z value to the index of the boundary that the m/z fits in
- Return type
(list, list, dict)
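The boundary construction can be sketched as below. This is an illustration under assumptions: the name `ppm_boundaries_sketch` is hypothetical, the tolerance is taken to be symmetric (mz × ppm_tol / 1,000,000 on each side), and "overlapped boundaries" is read as merging any intervals that overlap into one.

```python
def ppm_boundaries_sketch(mz_values, ppm_tol):
    """Build merged [lower, upper] boundaries and an m/z -> index mapping."""
    boundaries = []
    mz_to_index = {}

    # Ascending m/z gives ascending lower bounds, so one pass can merge.
    for mz in sorted(mz_values):
        tol = mz * ppm_tol / 1_000_000
        lower, upper = mz - tol, mz + tol

        if boundaries and lower <= boundaries[-1][1]:
            # Overlaps the previous boundary: extend it instead of adding.
            boundaries[-1][1] = max(boundaries[-1][1], upper)
        else:
            boundaries.append([lower, upper])

        mz_to_index[mz] = len(boundaries) - 1

    return boundaries, mz_to_index


bounds, index_of = ppm_boundaries_sketch([100.0, 100.001, 200.0], 20)
```

With a 20 ppm tolerance, the peaks at 100.0 and 100.001 fall into one merged boundary, so downstream matching has fewer intervals to search.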