RAPDORData#

class RAPDOR.datastructures.RAPDORData(df: pandas.DataFrame, design: pandas.DataFrame, logbase: int = None, min_replicates: int = 2, control: str = 'Control', measure_type: str = 'Protein', measure: str = 'Intensities')#

The RAPDORData class stores results and provides functions for analysis.

df#

the dataframe that stores intensities and additional columns per protein.

Type:: pd.DataFrame

logbase#

the logbase if intensities in df are log transformed. Else None.

Type:: int

design#

dataframe containing information about the intensity columns in df

Type:: pd.DataFrame

array#

The non-normalized intensities from the df intensity columns.

Type:: np.ndarray

min_replicates#

minimum number of replicates required to calculate scores

Type:: int

internal_design_matrix#

dataframe where fraction columns are stored as a list instead of separate columns

Type:: pd.DataFrame

norm_array#

An array containing normalized values that add up to 1.

Type:: Union[None, np.ndarray]

distances#

An array of size num_proteins x num_samples x num_samples that stores the distance between samples. If no distance was calculated it is None.

Type:: Union[None, np.ndarray]

permutation_sufficient_samples#

True if there are at least 5 samples per condition, else False.

Type:: bool

score_columns#

list of strings that are used as column names for scores that can be calculated via this object.

Type:: List[str]

control#

Name of the level of treatment that should be used as control.

Type:: str

methods#

List of supported distance functions

Type:: List[str]

Examples

An instance of the RAPDORData class is obtained via the following code. Make sure your input files are correctly formatted as described in the Data Preparation Tutorial.

>>> df = pd.read_csv("../testData/testFile.tsv", sep="\t")
>>> design = pd.read_csv("../testData/testDesign.tsv", sep="\t")
>>> rapdor = RAPDORData(df, design, logbase=2)

__init__(df: pandas.DataFrame, design: pandas.DataFrame, logbase: int = None, min_replicates: int = 2, control: str = 'Control', measure_type: str = 'Protein', measure: str = 'Intensities')#

Methods

`__init__`(df, design[, logbase, ...])
`calc_all_anosim_value`()	Calculates ANOSIM R for each protein and stores it in `df`
`calc_all_permanova_f`()	Calculates PERMANOVA F for each protein and stores it in `df`
`calc_all_scores`()	Calculates ANOSIM R, shift direction, peak positions and Mean Sample Distance.
`calc_anosim_p_value`(permutations, threads[, ...])	Calculates ANOSIM p-value via shuffling and stores it in `df`.
`calc_distance_stats`()	Calculates the mean distance and variance of this distance inside the same treatment group.
`calc_distances`([method])	Calculates between sample distances.
`calc_distribution_features`()	Calculates features used in a bubble plot
`calc_mean_distance`()	Calculates the distance between the means of the two treatment groups.
`calc_permanova_p_value`(permutations, threads)	Calculates PERMANOVA p-value via shuffling and stores it in `df`.
`determine_peaks`([beta])	Determines the Mean Distance, Peak Positions and shift direction.
`determine_strongest_shift`()	Determines the position of the strongest shift
`export_csv`(file[, sep])	Exports the `extra_df` to a file.
`from_file`(json_file)	Creates a class instance from a JSON file.
`from_files`(intensities, design[, logbase, sep])	Constructor to generate instance from files instead of pandas dataframes.
`from_json`(json_string)	Creates class instance from JSON string.
`normalize_and_get_distances`(method[, ...])	Normalizes the array and calculates sample distances.
`normalize_array_with_kernel`([kernel_size, eps])	Normalizes the array and sets norm_array attribute.
`pca`()	Performs PCA on the normalized array.
`rank_table`(values, ascending)	Ranks the `df`
`remove_clusters`()
`run_preprocessing`([method, kernel, impute, ...])	Normalizes the array, imputes missing values if needed and calculates sample distances.
`to_json`(file)	Exports the object to JSON
`to_jsons`()	Encodes this object as a JSON string

Attributes

`extra_df`	Returns a slice of self.df containing all columns that are not intensity columns.
`methods`
`raw_lfc`	Calculates the log2 fold change of the raw intensity means.
`score_columns`

calc_all_anosim_value()#: Calculates ANOSIM R for each protein and stores it in df
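As a reading aid, the ANOSIM R statistic for a single protein can be sketched as below. This is a minimal standard-library illustration of the textbook definition, not RAPDOR's implementation; the helper names `average_ranks` and `anosim_r` and the toy distance matrix are hypothetical.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def anosim_r(distances, labels):
    """ANOSIM R from a square distance matrix and one group label per sample."""
    pairs = list(combinations(range(len(labels)), 2))
    ranks = average_ranks([distances[i][j] for i, j in pairs])
    between = [r for r, (i, j) in zip(ranks, pairs) if labels[i] != labels[j]]
    within = [r for r, (i, j) in zip(ranks, pairs) if labels[i] == labels[j]]
    mean_between = sum(between) / len(between)
    mean_within = sum(within) / len(within)
    # R = (mean between-group rank - mean within-group rank) / (M / 2),
    # where M is the number of sample pairs.
    return (mean_between - mean_within) / (len(pairs) / 2)


# Toy data (hypothetical): two Control and two RNase samples of one protein.
toy_distances = [
    [0, 1, 3, 4],
    [1, 0, 5, 6],
    [3, 5, 0, 2],
    [4, 6, 2, 0],
]
toy_labels = ["Control", "Control", "RNase", "RNase"]
r_value = anosim_r(toy_distances, toy_labels)
```

In the toy matrix every between-group distance exceeds every within-group distance, which is the case in which R reaches its maximum of 1.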

calc_all_permanova_f()#: Calculates PERMANOVA F for each protein and stores it in df
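The PERMANOVA pseudo F for a single protein can be sketched from a distance matrix as follows. This follows the standard distance-based sums of squares and is only an illustration under that assumption; the function name `permanova_pseudo_f` and the toy points are hypothetical and not part of RAPDOR.

```python
def permanova_pseudo_f(distances, labels):
    """Pseudo F from a square distance matrix and one group label per sample."""
    n = len(labels)
    groups = sorted(set(labels))
    # Total sum of squares: all pairwise squared distances divided by n.
    ss_total = sum(
        distances[i][j] ** 2 for i in range(n) for j in range(i + 1, n)
    ) / n
    # Within-group sum of squares: same formula, restricted to each group.
    ss_within = 0.0
    for group in groups:
        members = [i for i in range(n) if labels[i] == group]
        ss_within += sum(
            distances[i][j] ** 2
            for pos, i in enumerate(members)
            for j in members[pos + 1:]
        ) / len(members)
    ss_between = ss_total - ss_within
    return (ss_between / (len(groups) - 1)) / (ss_within / (n - len(groups)))


# Toy data (hypothetical): four samples placed on a line, two per group.
points = [0.0, 2.0, 10.0, 12.0]
toy_distances = [[abs(a - b) for b in points] for a in points]
toy_labels = ["Control", "Control", "RNase", "RNase"]
pseudo_f = permanova_pseudo_f(toy_distances, toy_labels)
```

For Euclidean distances on one-dimensional points, this value coincides with the classical one-way ANOVA F statistic, which makes the toy case easy to check by hand.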

calc_all_scores()#: Calculates ANOSIM R, shift direction, peak positions and Mean Sample Distance.

calc_anosim_p_value(permutations: int, threads: int, seed: int = 0, mode: str = 'local', callback=None)#

Calculates ANOSIM p-value via shuffling and stores it in df. Adjusts for multiple testing.

Parameters:

permutations (int) – number of permutations used to calculate p-value. Set to -1 to use all possible distinct permutations
threads (int) – number of threads used for calculation
seed (int) – seed for random permutation
mode (str) – either local or global. Global uses distribution of R value of all proteins as background. Local uses protein specific distribution.
callback (Callable) – A callback function that receives the progress in the form of a percent string e.g. “50”. This can be used in combination with a progress bar.

Returns:

p-values (np.ndarray) – FDR-corrected p-values for each protein
distribution (np.ndarray) – distribution of R values used to calculate p-values
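The idea of a p-value obtained via shuffling can be sketched for one protein as below. It is a simplified illustration, not RAPDOR's code: the statistic `mean_between_minus_within` is a stand-in for ANOSIM R, the function names are hypothetical, the exhaustive branch assumes exactly two groups, and threading, the local/global modes and the multiple-testing correction are left out.

```python
import random
from itertools import combinations


def mean_between_minus_within(distances, labels):
    """Stand-in statistic: mean between-group minus mean within-group distance."""
    between, within = [], []
    for i, j in combinations(range(len(labels)), 2):
        (within if labels[i] == labels[j] else between).append(distances[i][j])
    return sum(between) / len(between) - sum(within) / len(within)


def permutation_p_value(distances, labels, statistic, permutations, seed=0):
    """Return (p-value, null distribution) for one distance matrix."""
    observed = statistic(distances, labels)
    n = len(labels)
    if permutations == -1:
        # Enumerate every distinct label assignment (two groups assumed).
        first = labels[0]
        other = next(label for label in labels if label != first)
        assignments = [
            [first if i in chosen else other for i in range(n)]
            for chosen in combinations(range(n), labels.count(first))
        ]
    else:
        # Draw random label shuffles from a seeded generator.
        rng = random.Random(seed)
        assignments = []
        for _ in range(permutations):
            shuffled = list(labels)
            rng.shuffle(shuffled)
            assignments.append(shuffled)
    null = [statistic(distances, assignment) for assignment in assignments]
    # Fraction of shuffled statistics at least as large as the observed one.
    p_value = sum(value >= observed for value in null) / len(null)
    return p_value, null


# Toy data (hypothetical): four samples placed on a line, two per group.
points = [0.0, 2.0, 10.0, 12.0]
toy_distances = [[abs(a - b) for b in points] for a in points]
toy_labels = ["Control", "Control", "RNase", "RNase"]
p_exact, null_exact = permutation_p_value(
    toy_distances, toy_labels, mean_between_minus_within, permutations=-1
)
```

With two samples per group only six distinct label assignments exist, so the exhaustive p-value in this toy case cannot drop below 2/6. Small designs therefore limit how small a permutation p-value can become.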

calc_distance_stats()#: Calculates the mean distance and variance of this distance inside the same treatment group.

calc_distances(method: str = None)#

Calculates between sample distances.

Parameters:: method (str) – One of the values from methods. The method used for sample distance calculation.
Raises:: ValueError – If the method string is not supported or symmetric-kl-divergence is used without adding an epsilon to the protein intensities
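The following sketch illustrates why a symmetric KL divergence needs an epsilon when profiles contain zeros, and how a samples x samples distance matrix for one protein can be built. It uses only the standard library; all function names are hypothetical, and whether RAPDOR reports the Jensen-Shannon divergence or its square root is an assumption here.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) with natural logarithm; infinite if q is 0 where p is not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # a term with p_i = 0 contributes nothing
        if qi == 0:
            return math.inf  # p has mass where q has none
        total += pi * math.log(pi / qi)
    return total


def symmetric_kl(p, q):
    return kl_divergence(p, q) + kl_divergence(q, p)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (assumed convention)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))


def add_epsilon(p, eps):
    """Add eps to every fraction and renormalize so the profile sums to 1."""
    shifted = [value + eps for value in p]
    total = sum(shifted)
    return [value / total for value in shifted]


def pairwise(profiles, metric):
    """Square matrix of metric values between all profiles."""
    return [[metric(a, b) for b in profiles] for a in profiles]


# Toy normalized profiles (hypothetical) with zeros in different fractions.
p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]
without_eps = symmetric_kl(p, q)
with_eps = symmetric_kl(add_epsilon(p, 1e-6), add_epsilon(q, 1e-6))
js_matrix = pairwise([p, q], jensen_shannon_distance)
```

Here `without_eps` is infinite because each profile has mass where the other has none, while `with_eps` is finite. The Jensen-Shannon distance stays finite without any epsilon because it compares both profiles to their mixture.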

calc_distribution_features()#

Calculates features used in a bubble plot

Sets the features in current_embedding that can be used to plot a bubble plot of the data

calc_mean_distance()#: Calculates the distance between the means of the two treatment groups.

calc_permanova_p_value(permutations: int, threads: int, seed: int = 0, mode: str = 'local')#

Calculates PERMANOVA p-value via shuffling and stores it in df. Adjusts for multiple testing.

Parameters:

permutations (int) – number of permutations used to calculate p-value
threads (int) – number of threads used for calculation
seed (int) – seed for random permutation
mode (str) – either local or global. Global uses distribution of pseudo F value of all proteins as background. Local uses protein specific distribution.

Returns:

p-values (np.ndarray) – FDR-corrected p-values for each protein
distribution (np.ndarray) – distribution of pseudo F values used to calculate p-values
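The adjustment for multiple testing can be illustrated with the Benjamini-Hochberg procedure. The text above only states that FDR-corrected p-values are returned, so the choice of Benjamini-Hochberg is an assumption made for this sketch, and the function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values can be forced to be monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Toy per-protein p-values (hypothetical).
raw = [0.01, 0.04, 0.03, 0.5]
adjusted = benjamini_hochberg(raw)
```

Each raw p-value is scaled by the number of tests divided by its rank, which is why the smallest value 0.01 becomes 0.04 in this four-protein example.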

determine_peaks(beta: float = 1000)#

Determines the Mean Distance, Peak Positions and shift direction.

The Peaks are determined the following way:

1. Calculate the mean of the norm_array per group (RNase & Control).
2. Calculate the mixture distribution of the mean distributions.
3. Calculate $D$, which is either:
- the position-wise relative entropy of both groups to the mixture distribution if the distance method is KL-Divergence or Jensen-Shannon
- the position-wise Euclidean distance of both groups to the mixture distribution if the distance method is Euclidean-Distance
4. Apply a soft-argmax to $D$, using the beta hyperparameter, to find the relative position shift.
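The steps above can be sketched for one protein as follows. This is a simplified reading, not RAPDOR's implementation: it assumes an equal-weight mixture, uses the term p_i * log(p_i / m_i) as the position-wise relative entropy, evaluates each group separately, and reports 0-based fraction positions. All function names and the toy profiles are hypothetical.

```python
import math


def column_means(replicates):
    """Mean profile over replicates, one value per fraction."""
    return [sum(column) / len(column) for column in zip(*replicates)]


def relative_entropy_terms(p, m):
    """Position-wise terms p_i * log(p_i / m_i); zero where p_i is zero."""
    return [pi * math.log(pi / mi) if pi > 0 else 0.0 for pi, mi in zip(p, m)]


def soft_argmax(values, beta=1000.0):
    """Softmax-weighted average position; approaches argmax for large beta."""
    peak = max(values)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(beta * (value - peak)) for value in values]
    return sum(i * w for i, w in enumerate(weights)) / sum(weights)


# Toy normalized replicate profiles (hypothetical), four fractions each.
control = [[0.8, 0.2, 0.0, 0.0], [0.6, 0.2, 0.2, 0.0]]
rnase = [[0.1, 0.1, 0.1, 0.7], [0.1, 0.1, 0.1, 0.7]]

mean_control = column_means(control)
mean_rnase = column_means(rnase)
mixture = [(c + r) / 2 for c, r in zip(mean_control, mean_rnase)]

control_peak = soft_argmax(relative_entropy_terms(mean_control, mixture))
rnase_peak = soft_argmax(relative_entropy_terms(mean_rnase, mixture))
```

In this toy case the control peak lies at the first fraction and the RNase peak at the last, so the sketch indicates a shift toward later fractions. With a small beta the soft-argmax moves toward the average position, which is why a large default such as 1000 behaves almost like a hard argmax.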

determine_strongest_shift()#: Determines the position of the strongest shift

export_csv(file: str, sep: str = ',')#

Exports the extra_df to a file.

Parameters:

file (str) – Path to file where dataframe should be exported to.
sep (str) – separator to use.

property extra_df#

Returns a slice of self.df containing all columns that are not intensity columns.

Returns:: slice of df
Return type:: pd.DataFrame

classmethod from_file(json_file)#

Creates a class instance from a JSON file.

Returns:: the RAPDORData stored in the file.
Return type:: RAPDORData

classmethod from_files(intensities: str, design: str, logbase: int = None, sep: str = ',')#

Constructor to generate instance from files instead of pandas dataframes.

Parameters:

intensities (str) – Path to the intensities File
design (str) – Path to the design file
logbase (Union[None, int]) – Logbase if intensities in the intensity file are log transformed
sep (str) – separator used in the intensities and design files. Must be the same for both.

Returns:: the RAPDORData instance created from the files
Return type:: RAPDORData

classmethod from_json(json_string)#

Creates class instance from JSON string.

Parameters:: json_string – string representation of the RAPDORData
Returns:: the RAPDORData stored in the string
Return type:: RAPDORData

normalize_and_get_distances(method: str, kernel: int = 0, eps: float = 0)#

Normalizes the array and calculates sample distances.

Parameters:

method (str) – One of the values from methods. The method used for sample distance calculation.
kernel (int) – Averaging kernel size. This kernel is applied to the fractions.
eps (float) – epsilon added to the intensities to overcome problems with zeros.

normalize_array_with_kernel(kernel_size: int = 0, eps: float = 0)#

Normalizes the array and sets norm_array attribute.

Parameters:

kernel_size (int) – Averaging kernel size. This kernel is applied to the fractions.
eps (float) – epsilon added to the intensities to overcome problems with zeros.
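A rough sketch of what normalizing one intensity profile could look like is shown below. The order of operations (averaging kernel, then epsilon, then scaling to a sum of 1) and the handling of profile edges are assumptions made for illustration; `normalize_profile` is a hypothetical helper, not the library function.

```python
def normalize_profile(intensities, kernel_size=0, eps=0.0):
    """Smooth, shift and scale one profile so that it sums to 1."""
    values = [float(value) for value in intensities]
    if kernel_size > 1:
        # Moving average over neighbouring fractions; only full windows are
        # kept, so the profile gets shorter by kernel_size - 1 entries.
        values = [
            sum(values[i:i + kernel_size]) / kernel_size
            for i in range(len(values) - kernel_size + 1)
        ]
    # Epsilon avoids exact zeros, which some distance measures cannot handle.
    values = [value + eps for value in values]
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize a profile that sums to zero")
    return [value / total for value in values]


# Toy raw intensities (hypothetical) across four fractions.
raw_profile = [2, 2, 4, 0]
plain = normalize_profile(raw_profile)
smoothed = normalize_profile(raw_profile, kernel_size=3)
shifted = normalize_profile(raw_profile, eps=1.0)
```

The plain result keeps the zero in the last fraction, while the epsilon variant makes every entry strictly positive.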

pca()#

Performs PCA on the normalized array.

Results are stored in df as PC1 and PC2. The explained variance is stored in pca_var.

rank_table(values, ascending)#

Ranks the df

This can be useful if you don't have a sufficient number of samples and thus can't calculate a p-value. The ranking scheme can be set via the function parameters.

Parameters:

values (List[str]) – which columns to use for ranking
ascending (List[bool]) – a boolean list indicating whether the column at the same index in values should be sorted ascending.
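How a list of columns and a matching list of ascending flags define a ranking can be sketched as below. The column names, the tie handling and the helper `rank_rows` are hypothetical and only illustrate the parameter semantics; they are not taken from RAPDOR.

```python
def rank_rows(rows, values, ascending):
    """Return a 1-based rank per row, ordering by several columns."""
    order = list(range(len(rows)))
    # Python's sort is stable, so sorting by the last column first and the
    # first column last yields a lexicographic ordering over all columns.
    for column, is_ascending in reversed(list(zip(values, ascending))):
        order.sort(key=lambda i: rows[i][column], reverse=not is_ascending)
    ranks = [0] * len(rows)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


# Toy score table (hypothetical column names and values).
table = [
    {"ANOSIM R": 1.0, "Mean Distance": 0.2},
    {"ANOSIM R": 1.0, "Mean Distance": 0.5},
    {"ANOSIM R": 0.4, "Mean Distance": 0.9},
]
ranks = rank_rows(table, ["ANOSIM R", "Mean Distance"], [False, False])
```

The first column dominates the ordering, and the second column only breaks ties: the two rows with an equal first score are separated by their second score.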

property raw_lfc#

Calculates the log2 fold change of the raw intensity means.

Returns:: array containing the log2 fold change of the raw intensities
Return type:: np.ndarray
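A log2 fold change of group means can be sketched for a single protein as follows. The direction of the ratio (treatment over control) and the use of one raw value per sample are assumptions for this illustration; `log2_fold_change` is a hypothetical helper.

```python
import math


def log2_fold_change(treatment, control):
    """log2 of the ratio between the treatment mean and the control mean."""
    mean_treatment = sum(treatment) / len(treatment)
    mean_control = sum(control) / len(control)
    return math.log2(mean_treatment / mean_control)


# Toy raw intensities (hypothetical), one value per replicate.
lfc = log2_fold_change(treatment=[8.0, 8.0], control=[2.0, 2.0])
```

A value of 2 corresponds to a fourfold higher mean intensity in the treatment group, and swapping the groups flips the sign.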

run_preprocessing(method: str = None, kernel: int = 0, impute: bool = False, impute_perc: float = 0.5, impute_nn: int = 10, impute_quantile: float = 0.95, eps: float = 0)#

Normalizes the array, imputes missing values if needed and calculates sample distances.

Parameters:

method (str) – One of the values from methods. The method used for sample distance calculation.
kernel (int) – Averaging kernel size. This kernel is applied to the fractions.
impute (bool) – Whether to impute missing values using KNN
impute_perc (float) – Maximum percentage of missing values per fraction and condition used for imputation.
impute_nn (int) – Number of neighbors to consider in data imputation.
impute_quantile (float) – Only imputes values for the quantile with the lowest mean distance between replicates. The others are assumed to be too noisy, and data imputation might cause problems.
eps (float) – epsilon added to the intensities to overcome problems with zeros.

to_json(file: str)#

Exports the object to JSON

Parameters:: file (str) – Path to the file where the JSON encoded object should be stored.

to_jsons()#

Encodes this object as a JSON string

Returns:: JSON string representation of this object
Return type:: str