preprocessing

The preprocessing module covers spectra loading and filtering. It also reduces the search space over the database by tagging each observed mass with the k-mers that have a matching b or y ion mass. This matching is done by the merge search, which runs in linear time. The current implementation is not well optimized: the practical limit appears to be approximately 300 proteins when fitting comfortably in 32 GB of RAM. Future work on preprocessing should try to reduce the time and space needed to tag these k-mers, possibly forgoing this approach altogether.
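To illustrate the idea of a linear-time merge search, the following is a minimal sketch and not the module's actual implementation. It assumes two inputs that are already sorted: non-overlapping [lower_bound, upper_bound] boundaries built from the observed masses, and (ion mass, k-mer) pairs from the database. The function name merge_search_sketch, the input shapes, and the masses in the usage note are all hypothetical.

```python
def merge_search_sketch(boundaries, kmer_masses):
    """Tag each boundary with the k-mers whose ion mass falls inside it.

    boundaries:  sorted, non-overlapping [lower, upper] pairs (assumed shape)
    kmer_masses: (mass, kmer) tuples sorted by mass (assumed shape)
    Returns a dict mapping boundary index -> list of matching k-mers.
    """
    matches = {}
    b_i, k_i = 0, 0
    # Each step advances one of the two pointers, so the walk is
    # O(len(boundaries) + len(kmer_masses)).
    while b_i < len(boundaries) and k_i < len(kmer_masses):
        lower, upper = boundaries[b_i]
        mass, kmer = kmer_masses[k_i]
        if mass < lower:
            k_i += 1  # k-mer mass is below this boundary, try the next k-mer
        elif mass > upper:
            b_i += 1  # k-mer mass is above this boundary, try the next boundary
        else:
            matches.setdefault(b_i, []).append(kmer)
            k_i += 1  # keep the boundary, more k-mers may fall inside it
    return matches
```

With illustrative (not physically meaningful) masses, merge_search_sketch([[99.9, 100.1], [200.0, 200.2]], [(50.0, "A"), (100.0, "PEP"), (100.05, "TID"), (150.0, "X"), (200.1, "ES")]) tags boundary 0 with "PEP" and "TID" and boundary 1 with "ES".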

spectra_filtering

src.preprocessing.spectra_filtering.relative_abundance_filtering(masses: list, abundances: list, percentage: float) -> (<class 'list'>, <class 'list'>)

Take all peaks from the spectrum whose abundance is at least percentage of the total abundance. It is assumed that the masses and abundances lists share ordering.

Parameters
  • masses (list) – m/z values

  • abundances (list) – abundance values for the m/z values. Abundance at entry i corresponds to the m/z value at entry i

  • percentage (float) – the minimum fraction of the total abundance a peak must have to pass the filter. Values are in the range [0, 1). A relatively realistic value is 0.005 (0.5%)

Returns

filtered masses, filtered abundances

Return type

(list, list)
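The filtering rule described for relative_abundance_filtering can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: the name relative_abundance_filter_sketch is hypothetical, and the sketch assumes that surviving peaks keep their original order, which the documentation does not state.

```python
def relative_abundance_filter_sketch(masses, abundances, percentage):
    """Keep peaks whose abundance is at least `percentage` of the total.

    Assumes masses[i] and abundances[i] describe the same peak.
    """
    total = sum(abundances)
    min_abundance = total * percentage  # absolute cutoff for this spectrum
    kept = [(m, a) for m, a in zip(masses, abundances) if a >= min_abundance]
    filtered_masses = [m for m, _ in kept]
    filtered_abundances = [a for _, a in kept]
    return filtered_masses, filtered_abundances
```

For example, with abundances [1.0, 50.0, 949.0] the total is 1000.0, so a percentage of 0.005 gives a cutoff of about 5.0 and only the first peak is dropped.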


src.preprocessing.spectra_filtering.peak_filtering(masses: list, abundances: list, num_peaks: int) -> (<class 'list'>, <class 'list'>)

Take the num_peaks most abundant peaks and return their masses, sorted, with the corresponding abundances. It is assumed that the masses and abundances lists share ordering.

Parameters
  • masses (list) – m/z values

  • abundances (list) – abundance values for the m/z values. Abundance at entry i corresponds to the m/z value at entry i

  • num_peaks (int) – the number of most abundant peaks to keep

Returns

filtered masses, filtered abundances

Return type

(list, list)
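A top-N peak filter of the kind described for peak_filtering can be sketched like this. The name peak_filter_sketch is hypothetical, and the sketch assumes that "sorted masses" means ascending m/z order with each abundance kept next to its mass; the actual implementation may differ.

```python
def peak_filter_sketch(masses, abundances, num_peaks):
    """Keep the `num_peaks` most abundant peaks, returned in ascending m/z order.

    Assumes masses[i] and abundances[i] describe the same peak.
    """
    # Rank (mass, abundance) pairs by abundance, highest first, and keep the top N.
    top = sorted(zip(masses, abundances), key=lambda p: p[1], reverse=True)[:num_peaks]
    # Re-sort the survivors by mass so the output is ordered by m/z.
    top.sort(key=lambda p: p[0])
    return [m for m, _ in top], [a for _, a in top]
```

For masses [300.0, 100.0, 200.0, 400.0] with abundances [10.0, 50.0, 5.0, 30.0] and num_peaks=2, the sketch returns ([100.0, 400.0], [50.0, 30.0]).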

preprocessing_utils

src.preprocessing.preprocessing_utils.load_spectra(spectra_files: list, ppm_tol: int, peak_filter: int = 0, relative_abundance_filter: float = 0.0) -> (<class 'list'>, <class 'list'>, <class 'dict'>)

Load all the spectra files into memory and merge all spectra into one large list for use in reducing the search space.

Parameters
  • spectra_files (list) – full string paths to the spectra files

  • ppm_tol (int) – parts per million mass error allowed for making boundaries

  • peak_filter (int) – the number of most abundant peaks to keep. If left as 0, relative_abundance_filter is used instead. (default is 0)

  • relative_abundance_filter (float) – the fraction of the total abundance a peak must make up in order to pass the filter. Value should be in the range [0, 1). A realistic value is 0.005 (0.5%). If peak_filter is non-zero, peak_filter is used instead. (default is 0.0)

Returns

Spectra objects from file, overlapped boundaries of [lower_bound, upper_bound], mapping from an m/z value to the index of the boundary that the m/z value fits in

Return type

(list, list, dict)
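One way to picture the second and third return values is the sketch below. It is an assumption about how ppm-based boundaries could be built and merged, not the behavior of load_spectra itself: the names ppm_to_da and make_boundaries_sketch are hypothetical, and the sketch assumes a symmetric window of mass * ppm_tol / 1,000,000 around each m/z value, with overlapping windows merged into one boundary.

```python
def ppm_to_da(mass, ppm_tol):
    """Convert a parts-per-million tolerance into an absolute mass window."""
    return mass * ppm_tol / 1_000_000


def make_boundaries_sketch(mz_values, ppm_tol):
    """Build merged [lower, upper] boundaries and an m/z -> boundary index map."""
    boundaries = []
    mz_to_boundary = {}
    # Sorting by m/z also sorts the windows by lower bound, so a single pass
    # is enough to merge overlapping windows.
    for mz in sorted(set(mz_values)):
        tol = ppm_to_da(mz, ppm_tol)
        lower, upper = mz - tol, mz + tol
        if boundaries and lower <= boundaries[-1][1]:
            # Overlaps the previous boundary: extend it instead of adding a new one.
            boundaries[-1][1] = max(boundaries[-1][1], upper)
        else:
            boundaries.append([lower, upper])
        mz_to_boundary[mz] = len(boundaries) - 1
    return boundaries, mz_to_boundary
```

With m/z values [100.0, 100.001, 500.0] and ppm_tol=20, the first two windows overlap and share boundary index 0, while 500.0 gets its own boundary at index 1.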