RAPDORData#

class RAPDOR.datastructures.RAPDORData(df: pandas.DataFrame, design: pandas.DataFrame, logbase: int = None, min_replicates: int = 2, control: str = 'Control', measure_type: str = 'Protein', measure: str = 'Intensities')#

The RAPDORData class stores results and provides functions for analysis.

df#

the dataframe that stores intensities and additional columns per protein.

Type:

pd.DataFrame

logbase#

the base of the logarithm if the intensities in df are log-transformed, else None.

Type:

int

design#

dataframe containing information about the intensity columns in df

Type:

pd.DataFrame

array#

The non-normalized intensities from the df intensity columns.

Type:

np.ndarray

min_replicates#

minimum number of replicates required to calculate scores

Type:

int

internal_design_matrix#

dataframe where fraction columns are stored as a list instead of separate columns

Type:

pd.DataFrame

norm_array#

An array containing normalized values that add up to 1.

Type:

Union[None, np.ndarray]

distances#

An array of size num_proteins x num_samples x num_samples that stores the distance between samples. If no distance was calculated it is None.

Type:

Union[None, np.ndarray]

permutation_sufficient_samples#

True if there are at least 5 samples per condition, else False.

Type:

bool

score_columns#

list of strings that are used as column names for scores that can be calculated via this object.

Type:

List[str]

control#

Name of the level of treatment that should be used as control.

Type:

str

methods#

List of supported distance functions

Type:

List[str]

Examples

An instance of the RAPDORData class is obtained via the following code. Make sure your csv files are correctly formatted as described in the Data Preparation Tutorial.

>>> import pandas as pd
>>> from RAPDOR.datastructures import RAPDORData
>>> df = pd.read_csv("../testData/testFile.tsv", sep="\t")
>>> design = pd.read_csv("../testData/testDesign.tsv", sep="\t")
>>> rapdor = RAPDORData(df, design, logbase=2)
__init__(df: pandas.DataFrame, design: pandas.DataFrame, logbase: int = None, min_replicates: int = 2, control: str = 'Control', measure_type: str = 'Protein', measure: str = 'Intensities')#

Methods

__init__(df, design[, logbase, ...])

calc_all_anosim_value()

Calculates ANOSIM R for each protein and stores it in df

calc_all_permanova_f()

Calculates PERMANOVA F for each protein and stores it in df

calc_all_scores()

Calculates ANOSIM R, shift direction, peak positions and Mean Sample Distance.

calc_anosim_p_value(permutations, threads[, ...])

Calculates ANOSIM p-value via shuffling and stores it in df.

calc_distance_stats()

Calculates the mean distance and variance of this distance inside the same treatment group.

calc_distances([method])

Calculates between sample distances.

calc_distribution_features()

Calculates features used in a bubble plot

calc_mean_distance()

Calculates the distance between the means of the two treatment groups.

calc_permanova_p_value(permutations, threads)

Calculates PERMANOVA p-value via shuffling and stores it in df.

determine_peaks([beta])

Determines the Mean Distance, Peak Positions and shift direction.

determine_strongest_shift()

Determines the position of the strongest shift

export_csv(file[, sep])

Exports the extra_df to a file.

from_file(json_file)

Creates a class instance from a JSON file.

from_files(intensities, design[, logbase, sep])

Constructor to generate instance from files instead of pandas dataframes.

from_json(json_string)

Creates class instance from JSON string.

normalize_and_get_distances(method[, ...])

Normalizes the array and calculates sample distances.

normalize_array_with_kernel([kernel_size, eps])

Normalizes the array and sets norm_array attribute.

pca()

Performs PCA on the normalized array.

rank_table(values, ascending)

Ranks the df

remove_clusters()

run_preprocessing([method, kernel, impute, ...])

Normalizes the array, imputes missing values if needed and calculates sample distances.

to_json(file)

Exports the object to JSON

to_jsons()

Encodes this object as a JSON string

Attributes

extra_df

Returns a DataFrame slice containing all columns of self.df that are not part of the intensity columns

methods

raw_lfc

Calculates the log2 fold change of the raw intensity means.

score_columns

calc_all_anosim_value()#

Calculates ANOSIM R for each protein and stores it in df

calc_all_permanova_f()#

Calculates PERMANOVA F for each protein and stores it in df

calc_all_scores()#

Calculates ANOSIM R, shift direction, peak positions and Mean Sample Distance.

calc_anosim_p_value(permutations: int, threads: int, seed: int = 0, mode: str = 'local', callback=None)#

Calculates ANOSIM p-value via shuffling and stores it in df. Adjusts for multiple testing.

Parameters:
  • permutations (int) – number of permutations used to calculate p-value. Set to -1 to use all possible distinct permutations

  • threads (int) – number of threads used for calculation

  • seed (int) – seed for random permutation

  • mode (str) – either local or global. Global uses distribution of R value of all proteins as background. Local uses protein specific distribution.

  • callback (Callable) – A callback function that receives the progress in the form of a percent string e.g. “50”. This can be used in combination with a progress bar.

Returns:
  • p-values (np.ndarray) – FDR-corrected p-values for each protein

  • distribution (np.ndarray) – distribution of R values used to calculate p-values
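The difference between the local and global modes can be illustrated with a minimal sketch. It assumes the permuted R values are already available as one list per protein. The helper names local_p_values and global_p_values are hypothetical and not part of RAPDOR, the exact counting rule used by the library may differ, and the multiple-testing adjustment is left out.

```python
def local_p_values(observed, permuted):
    """Compare each protein's statistic with its own permutation values."""
    return [
        sum(value >= obs for value in perm) / len(perm)
        for obs, perm in zip(observed, permuted)
    ]


def global_p_values(observed, permuted):
    """Compare each protein's statistic with the pooled values of all proteins."""
    background = [value for perm in permuted for value in perm]
    return [
        sum(value >= obs for value in background) / len(background)
        for obs in observed
    ]


# Made-up numbers: two proteins, four permutations each.
observed = [0.9, 0.2]
permuted = [[0.1, 0.95, 0.3, 0.2], [0.0, 0.1, 0.25, 0.3]]

local = local_p_values(observed, permuted)
pooled = global_p_values(observed, permuted)
```

The pooled background contains far more values than a single protein's distribution, so the global mode can resolve smaller p-values with the same number of permutations, at the cost of treating all proteins as sharing one null distribution.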

calc_distance_stats()#

Calculates the mean distance and variance of this distance inside the same treatment group.

calc_distances(method: str = None)#

Calculates between sample distances.

Parameters:

method (str) – One of the values from methods. The method used for sample distance calculation.

Raises:

ValueError – If the method string is not supported or symmetric-kl-divergence is used without adding an epsilon to the protein intensities
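As an illustration of what a between-sample distance looks like, the following sketch computes a Jensen-Shannon distance for every pair of normalized sample profiles of one protein. It is written in plain Python rather than NumPy to stay dependency-free; the function names and the base-2 logarithm are assumptions and do not reflect the RAPDOR implementation.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; zero entries of p contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence between two distributions."""
    mixture = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


def pairwise_distances(samples, distance):
    """Return a num_samples x num_samples matrix for one protein."""
    return [[distance(a, b) for b in samples] for a in samples]


# Three made-up sample profiles over four fractions, each summing to 1.
samples = [
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.1, 0.2, 0.6],
]
matrix = pairwise_distances(samples, jensen_shannon_distance)
```

Because each profile is compared with the mixture of the two, the Jensen-Shannon distance stays finite when a fraction contains a zero. A plain KL divergence does not, which is the reason an epsilon is needed for the symmetric KL divergence mentioned above.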

calc_distribution_features()#

Calculates features used in a bubble plot

Sets the features in current_embedding that can be used to plot a bubble plot of the data

calc_mean_distance()#

Calculates the distance between the means of the two treatment groups.

calc_permanova_p_value(permutations: int, threads: int, seed: int = 0, mode: str = 'local')#

Calculates PERMANOVA p-value via shuffling and stores it in df. Adjusts for multiple testing.

Parameters:
  • permutations (int) – number of permutations used to calculate p-value

  • threads (int) – number of threads used for calculation

  • seed (int) – seed for random permutation

  • mode (str) – either local or global. Global uses distribution of pseudo F value of all proteins as background. Local uses protein specific distribution.

Returns:
  • p-values (np.ndarray) – FDR-corrected p-values for each protein

  • distribution (np.ndarray) – distribution of pseudo F values used to calculate p-values

determine_peaks(beta: float = 1000)#

Determines the Mean Distance, Peak Positions and shift direction.

The Peaks are determined the following way:

  1. Calculate the mean of the norm_array per group (RNase & Control)

  2. Calculate the mixture distribution of the mean distributions.

  3. Calculate $D$ which is either:
    • Relative position-wise entropy of both groups to the mixture distribution if distance method is KL-Divergence or Jensen-Shannon

    • position-wise Euclidean distance of both groups to the mixture distribution if distance method is Euclidean-Distance

  4. Apply a soft-argmax to this using beta hyperparameter to find the relative position shift
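The four steps above can be sketched for a single protein as follows. This is a minimal illustration, not the library code: it assumes the relative-entropy variant of $D$, an equal-weight mixture, and already averaged group profiles, and the function names are hypothetical.

```python
import math


def positionwise_relative_entropy(p, mixture):
    """Per-fraction contribution p_i * log(p_i / m_i) to the relative entropy."""
    return [pi * math.log(pi / mi) if pi > 0 else 0.0 for pi, mi in zip(p, mixture)]


def soft_argmax(values, beta=1000.0):
    """Softmax-weighted average index; approaches argmax as beta grows."""
    largest = max(values)
    # Subtracting the maximum keeps the exponentials numerically stable.
    weights = [math.exp(beta * (v - largest)) for v in values]
    return sum(i * w for i, w in enumerate(weights)) / sum(weights)


# Step 1: made-up mean profiles per group over four fractions.
control_mean = [0.1, 0.6, 0.2, 0.1]
rnase_mean = [0.1, 0.1, 0.2, 0.6]

# Step 2: mixture distribution of the two mean distributions.
mixture = [0.5 * (c + r) for c, r in zip(control_mean, rnase_mean)]

# Step 3: position-wise relative entropy of each group to the mixture.
d_control = positionwise_relative_entropy(control_mean, mixture)
d_rnase = positionwise_relative_entropy(rnase_mean, mixture)

# Step 4: soft-argmax gives a (fractional) peak position per group.
control_peak = soft_argmax(d_control, beta=1000)
rnase_peak = soft_argmax(d_rnase, beta=1000)
```

With a large beta the result is practically the index of the largest entry. A smaller beta spreads the weight over neighbouring fractions and yields positions between fractions.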

determine_strongest_shift()#

Determines the position of the strongest shift

export_csv(file: str, sep: str = ',')#

Exports the extra_df to a file.

Parameters:
  • file (str) – Path to file where dataframe should be exported to.

  • sep (str) – separator to use.

property extra_df#

Returns a DataFrame slice containing all columns of self.df that are not part of the intensity columns

Returns:

slice of df

Return type:

pd.DataFrame

classmethod from_file(json_file)#

Creates a class instance from a JSON file.

Returns:

the RAPDORData stored in the file.

Return type:

RAPDORData

classmethod from_files(intensities: str, design: str, logbase: int = None, sep: str = ',')#

Constructor to generate instance from files instead of pandas dataframes.

Parameters:
  • intensities (str) – Path to the intensities File

  • design (str) – Path to the design file

  • logbase (Union[None, int]) – Logbase if intensities in the intensity file are log transformed

  • sep (str) – separator used in the intensities and design files. Must be the same for both.

Returns: RAPDORData

classmethod from_json(json_string)#

Creates class instance from JSON string.

Parameters:

json_string – string representation of the RAPDORData

Returns:

the RAPDORData stored in the string

Return type:

RAPDORData

normalize_and_get_distances(method: str, kernel: int = 0, eps: float = 0)#

Normalizes the array and calculates sample distances.

Parameters:
  • method (str) – One of the values from methods. The method used for sample distance calculation.

  • kernel (int) – Averaging kernel size. This kernel is applied to the fractions.

  • eps (float) – epsilon added to the intensities to overcome problems with zeros.

normalize_array_with_kernel(kernel_size: int = 0, eps: float = 0)#

Normalizes the array and sets norm_array attribute.

Parameters:
  • kernel_size (int) – Averaging kernel size. This kernel is applied to the fractions.

  • eps (float) – epsilon added to the intensities to overcome problems with zeros.
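The effect of the two parameters can be sketched on a single profile. The version below is an assumption about the general technique, not the RAPDOR implementation: it uses a moving average without padding (so the profile gets shorter), adds eps after smoothing, and then rescales the values so they sum to 1. The function name normalize_profile is hypothetical.

```python
def normalize_profile(intensities, kernel_size=0, eps=0.0):
    """Smooth one intensity profile, add eps, and rescale it to sum to 1."""
    values = [float(v) for v in intensities]
    if kernel_size > 1:
        # Moving average over neighbouring fractions, no padding at the edges.
        values = [
            sum(values[i:i + kernel_size]) / kernel_size
            for i in range(len(values) - kernel_size + 1)
        ]
    values = [v + eps for v in values]
    total = sum(values)
    if total == 0:
        raise ValueError("profile sums to zero; consider eps > 0")
    return [v / total for v in values]


smoothed = normalize_profile([1, 2, 3, 4], kernel_size=3)
with_eps = normalize_profile([0, 0], eps=1.0)
```

Here the windows (1, 2, 3) and (2, 3, 4) average to 2 and 3, so the normalized profile has two entries instead of four.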

pca()#

Performs PCA on the normalized array.

Results are stored in df as PC1 and PC2. The explained variance is stored in pca_var.

rank_table(values, ascending)#

Ranks the df

This can be useful if you don't have a sufficient number of samples and thus can't calculate a p-value. The ranking scheme can be set via the function parameters.

Parameters:
  • values (List[str]) – which columns to use for ranking

  • ascending (List[bool]) – a boolean list indicating whether the column at the same index in values should be sorted ascending.
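How the values and ascending lists interact can be shown with a small stand-alone sketch of a multi-column ranking. It does not call RAPDOR; the function rank_rows and the column names "ANOSIM R" and "Mean Distance" are placeholders chosen for illustration, and the way the library resolves ties may differ.

```python
def rank_rows(rows, values, ascending):
    """Return a 1-based rank per row, sorting by several columns in priority order."""
    order = list(range(len(rows)))
    # Python's sort is stable, so sorting by the least important column first
    # and the most important column last yields a multi-column ordering.
    for column, asc in reversed(list(zip(values, ascending))):
        order.sort(key=lambda i: rows[i][column], reverse=not asc)
    ranks = [0] * len(rows)
    for rank, row_index in enumerate(order, start=1):
        ranks[row_index] = rank
    return ranks


# Made-up scores for three proteins.
rows = [
    {"ANOSIM R": 0.5, "Mean Distance": 0.2},
    {"ANOSIM R": 1.0, "Mean Distance": 0.1},
    {"ANOSIM R": 1.0, "Mean Distance": 0.4},
]
ranks = rank_rows(rows, values=["ANOSIM R", "Mean Distance"], ascending=[False, False])
```

The second and third protein tie on the first column, so the second column decides between them and the third protein, which has the larger distance, is ranked first.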

property raw_lfc#

Calculates the log2 fold change of the raw intensity means.

Returns:

array containing the log2 fold change of the raw intensities

Return type:

np.ndarray
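For reference, a log2 fold change of intensity means can be computed as in the following sketch. It assumes untransformed intensities and a treatment-over-control ratio; the direction used by raw_lfc and the function name below are assumptions.

```python
import math
from statistics import mean


def raw_log2_fold_change(treatment, control):
    """log2 of the ratio between the mean treatment and mean control intensity."""
    return math.log2(mean(treatment) / mean(control))


# Made-up raw intensities of one protein in two replicates per group.
lfc = raw_log2_fold_change(treatment=[8.0, 8.0], control=[2.0, 2.0])
```

A value of 2 means the mean intensity in the treatment group is four times the mean intensity in the control group.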

run_preprocessing(method: str = None, kernel: int = 0, impute: bool = False, impute_perc: float = 0.5, impute_nn: int = 10, impute_quantile: float = 0.95, eps: float = 0)#

Normalizes the array, imputes missing values if needed and calculates sample distances.

Parameters:
  • method (str) – One of the values from methods. The method used for sample distance calculation.

  • kernel (int) – Averaging kernel size. This kernel is applied to the fractions.

  • impute (bool) – Whether to impute missing values using KNN

  • impute_perc (float) – Maximum percentage of missing values per fraction and condition used for imputation.

  • impute_nn (int) – Number of nearest neighbors considered during data imputation.

  • impute_quantile (float) – Only imputes values for the quantile with the lowest mean distance between replicates. The others are assumed to be too noisy, so data imputation might cause problems.

  • eps (float) – epsilon added to the intensities to overcome problems with zeros.

to_json(file: str)#

Exports the object to JSON

Parameters:

file (str) – Path to the file where the JSON encoded object should be stored.

to_jsons()#

Encodes this object as a JSON string

Returns:

JSON string representation of this object

Return type:

str