RAPDORData#
- class RAPDOR.datastructures.RAPDORData(df: pandas.DataFrame, design: pandas.DataFrame, logbase: int = None, min_replicates: int = 2, control: str = 'Control', measure_type: str = 'Protein', measure: str = 'Intensities')#
The RAPDORData class stores results and provides functions for analysis.
- df#
the dataframe that stores intensities and additional columns per protein.
- Type:
pd.DataFrame
- min_replicates#
minimum number of replicates required to calculate scores
- Type:
int
- internal_design_matrix#
DataFrame in which the fraction columns are stored as a list instead of separate columns
- Type:
pd.DataFrame
- norm_array#
An array containing normalized values that add up to 1.
- Type:
Union[None, np.ndarray]
- distances#
An array of size num_proteins x num_samples x num_samples that stores the distance between samples. If no distance was calculated it is None.
- Type:
Union[None, np.ndarray]
- permutation_sufficient_samples#
True if there are at least 5 samples per condition, otherwise False.
- Type:
bool
- score_columns#
list of strings that are used as column names for scores that can be calculated via this object.
- Type:
List[str]
- control#
Name of the level of treatment that should be used as control.
- Type:
str
- methods#
List of supported distance functions
- Type:
List[str]
Examples
An instance of the RAPDORData class is obtained via the following code. Make sure your input files are correctly formatted as described in the Data Preparation Tutorial.
>>> import pandas as pd
>>> from RAPDOR.datastructures import RAPDORData
>>> df = pd.read_csv("../testData/testFile.tsv", sep="\t")
>>> design = pd.read_csv("../testData/testDesign.tsv", sep="\t")
>>> rapdor = RAPDORData(df, design, logbase=2)
- __init__(df: pandas.DataFrame, design: pandas.DataFrame, logbase: int = None, min_replicates: int = 2, control: str = 'Control', measure_type: str = 'Protein', measure: str = 'Intensities')#
Methods
__init__(df, design[, logbase, ...])
Calculates ANOSIM R for each protein and stores it in df.
Calculates PERMANOVA F for each protein and stores it in df.
calc_all_scores() – Calculates ANOSIM R, shift direction, peak positions and Mean Sample Distance.
calc_anosim_p_value(permutations, threads[, ...]) – Calculates ANOSIM p-value via shuffling and stores it in df.
calc_distance_stats() – Calculates the mean distance and variance of this distance inside the same treatment group.
calc_distances([method]) – Calculates between sample distances.
calc_distribution_features() – Calculates features used in a bubble plot.
calc_mean_distance() – Calculates the distance between the means of the two treatment groups.
calc_permanova_p_value(permutations, threads) – Calculates PERMANOVA p-value via shuffling and stores it in df.
determine_peaks([beta]) – Determines the Mean Distance, Peak Positions and shift direction.
determine_strongest_shift() – Determines the position of the strongest shift.
export_csv(file[, sep]) – Exports the extra_df to a file.
from_file(json_file) – Creates a class instance from a JSON file.
from_files(intensities, design[, logbase, sep]) – Constructor to generate instance from files instead of pandas dataframes.
from_json(json_string) – Creates class instance from JSON string.
normalize_and_get_distances(method[, ...]) – Normalizes the array and calculates sample distances.
normalize_array_with_kernel([kernel_size, eps]) – Normalizes the array and sets norm_array attribute.
pca() – Performs PCA on the normalized array.
rank_table(values, ascending) – Ranks the df.
remove_clusters()
run_preprocessing([method, kernel, impute, ...]) – Normalizes the array, imputes missing values if needed and calculates sample distances.
to_json(file) – Exports the object to JSON.
to_jsons() – Encodes this object as a JSON string.
Attributes
extra_df – Returns a DataFrame slice containing all columns of self.df that are not intensity columns.
raw_lfc – Calculates the log2 fold change of the raw intensity means.
- calc_all_scores()#
Calculates ANOSIM R, shift direction, peak positions and Mean Sample Distance.
- calc_anosim_p_value(permutations: int, threads: int, seed: int = 0, mode: str = 'local', callback=None)#
Calculates ANOSIM p-value via shuffling and stores it in df. Adjusts for multiple testing.
- Parameters:
permutations (int) – number of permutations used to calculate p-value. Set to -1 to use all possible distinct permutations
threads (int) – number of threads used for calculation
seed (int) – seed for random permutation
mode (str) – either local or global. Global uses distribution of R value of all proteins as background. Local uses protein specific distribution.
callback (Callable) – A callback function that receives the progress in the form of a percent string e.g. “50”. This can be used in combination with a progress bar.
- Returns:
p-values (np.ndarray) – FDR-corrected p-values for each protein
distribution (np.ndarray) – distribution of R values used to calculate p-values
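The permutation scheme described above can be illustrated with a small, dependency-free sketch. It implements the textbook ANOSIM R statistic for a single protein and a "local"-style p-value from shuffled labels. The helper names, the `(hits + 1) / (permutations + 1)` estimate, and the omission of threading and FDR correction are assumptions of this sketch, not necessarily what RAPDOR does internally.

```python
import random
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # mean of the 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def anosim_r(dist, labels):
    """Textbook ANOSIM R for one symmetric distance matrix."""
    pairs = list(combinations(range(len(labels)), 2))
    ranks = average_ranks([dist[i][j] for i, j in pairs])
    between = [r for r, (i, j) in zip(ranks, pairs) if labels[i] != labels[j]]
    within = [r for r, (i, j) in zip(ranks, pairs) if labels[i] == labels[j]]
    mean_between = sum(between) / len(between)
    mean_within = sum(within) / len(within)
    return (mean_between - mean_within) / (len(pairs) / 2)


def permutation_p_value(dist, labels, permutations, seed=0, callback=None):
    """Shuffle the labels and count how often R is at least the observed R."""
    rng = random.Random(seed)
    observed = anosim_r(dist, labels)
    shuffled = list(labels)
    hits = 0
    for step in range(1, permutations + 1):
        rng.shuffle(shuffled)
        if anosim_r(dist, shuffled) >= observed:
            hits += 1
        if callback is not None:
            # Progress is reported as a percent string, e.g. "50"
            callback(str(int(100 * step / permutations)))
    # The +1 correction is an assumption of this sketch
    return observed, (hits + 1) / (permutations + 1)


# Toy data: two well-separated groups of three samples on a line
positions = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in positions] for a in positions]
labels = ["Control"] * 3 + ["RNase"] * 3

progress = []
r_obs, p_value = permutation_p_value(dist, labels, permutations=99, callback=progress.append)
print(r_obs, p_value, progress[-1])
```

In a real run, the `callback` would typically update a progress bar instead of appending to a list.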
- calc_distance_stats()#
Calculates the mean distance and variance of this distance inside the same treatment group.
- calc_distances(method: str = None)#
Calculates between sample distances.
- Parameters:
method (str) – One of the values from methods. The method used for sample distance calculation.
- Raises:
ValueError – If the method string is not supported or symmetric-kl-divergence is used without adding an epsilon to the protein intensities
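To see why a symmetric KL divergence needs an epsilon, consider two normalized profiles that do not overlap. The sketch below uses plain Python lists and one common definition of each measure. The function names and the renormalization in `add_eps` are assumptions made for illustration only.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; positions with p_i == 0 contribute nothing."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue
        if q_i == 0:
            return math.inf  # p has mass where q has none
        total += p_i * math.log2(p_i / q_i)
    return total


def symmetric_kl(p, q):
    """Sum of both directed KL divergences."""
    return kl_divergence(p, q) + kl_divergence(q, p)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base 2, so within [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return math.sqrt((kl_divergence(p, m) + kl_divergence(q, m)) / 2)


def add_eps(p, eps):
    """Add eps to every position and renormalize so the profile sums to 1."""
    shifted = [v + eps for v in p]
    total = sum(shifted)
    return [v / total for v in shifted]


control = [0.5, 0.5, 0.0, 0.0]
rnase = [0.0, 0.0, 0.5, 0.5]

print(symmetric_kl(control, rnase))             # infinite without eps
print(jensen_shannon_distance(control, rnase))  # stays finite
print(symmetric_kl(add_eps(control, 1e-3), add_eps(rnase, 1e-3)))
```

The Jensen-Shannon distance compares each profile with their mixture, which is never zero where a profile has mass, so it stays finite without an epsilon.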
- calc_distribution_features()#
Calculates features used in a bubble plot
Sets the features in current_embedding that can be used to plot a bubble plot of the data.
- calc_mean_distance()#
Calculates the distance between the means of the two treatment groups.
- calc_permanova_p_value(permutations: int, threads: int, seed: int = 0, mode: str = 'local')#
Calculates PERMANOVA p-value via shuffling and stores it in df. Adjusts for multiple testing.
- Parameters:
permutations (int) – number of permutations used to calculate p-value
threads (int) – number of threads used for calculation
seed (int) – seed for random permutation
mode (str) – either local or global. Global uses distribution of pseudo F value of all proteins as background. Local uses protein specific distribution.
- Returns:
p-values (np.ndarray) – FDR-corrected p-values for each protein
distribution (np.ndarray) – distribution of pseudo F values used to calculate p-values
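As a rough illustration of the statistic being permuted, the following sketch computes the standard PERMANOVA pseudo F from a distance matrix and derives a p-value from a background distribution. For "local" mode the background would hold the permuted values of the same protein, for "global" mode the pooled values of all proteins. Both helper names and the `+1` correction are assumptions of this sketch, not the library's code.

```python
from itertools import combinations


def pseudo_f(dist, labels):
    """PERMANOVA pseudo F computed from pairwise distances only."""
    n = len(labels)
    groups = sorted(set(labels))
    # Total sum of squares from all pairwise squared distances
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    # Within-group sum of squares, one term per treatment group
    ss_within = 0.0
    for group in groups:
        members = [i for i, label in enumerate(labels) if label == group]
        ss_within += sum(dist[i][j] ** 2 for i, j in combinations(members, 2)) / len(members)
    ss_between = ss_total - ss_within
    return (ss_between / (len(groups) - 1)) / (ss_within / (n - len(groups)))


def p_value_from_background(observed, background):
    """Share of background values that are at least as large as the observed one."""
    hits = sum(1 for value in background if value >= observed)
    return (hits + 1) / (len(background) + 1)


# Toy data: with 1-D Euclidean distances the pseudo F equals the classical ANOVA F
positions = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in positions] for a in positions]
labels = ["Control"] * 3 + ["RNase"] * 3

f_obs = pseudo_f(dist, labels)
print(f_obs)  # 150.0
print(p_value_from_background(f_obs, [0.5, 1.2, 150.0, 3.4]))
```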
- determine_peaks(beta: float = 1000)#
Determines the Mean Distance, Peak Positions and shift direction.
The peaks are determined as follows:
1. Calculate the mean of the norm_array per group (RNase & Control).
2. Calculate the mixture distribution of the mean distributions.
3. Calculate $D$, which is either:
   - the relative position-wise entropy of both groups to the mixture distribution, if the distance method is KL-Divergence or Jensen-Shannon
   - the position-wise Euclidean distance of both groups to the mixture distribution, if the distance method is Euclidean-Distance
4. Apply a soft-argmax to $D$, using the beta hyperparameter, to find the relative position shift.
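The steps above can be sketched for a single protein using the relative-entropy variant of $D$. The profiles, the helper names, and the sign convention of `shift` (RNase peak minus Control peak) are assumptions chosen for illustration; the library's own implementation may differ in detail.

```python
import math


def positionwise_relative_entropy(p, m):
    """Per-fraction contribution p_i * log2(p_i / m_i) to KL(p || m)."""
    return [a * math.log2(a / b) if a > 0 else 0.0 for a, b in zip(p, m)]


def soft_argmax(values, beta):
    """Softmax-weighted mean of the positions 0..n-1."""
    peak = max(values)
    # Subtracting the maximum keeps exp() from overflowing for large beta
    weights = [math.exp(beta * (v - peak)) for v in values]
    total = sum(weights)
    return sum(i * w for i, w in enumerate(weights)) / total


# Step 1: mean normalized profile per group (hypothetical values)
control_mean = [0.05, 0.70, 0.20, 0.05, 0.00]
rnase_mean = [0.05, 0.10, 0.10, 0.70, 0.05]

# Step 2: mixture distribution of the two means
mixture = [(c + r) / 2 for c, r in zip(control_mean, rnase_mean)]

# Step 3: D for each group relative to the mixture
d_control = positionwise_relative_entropy(control_mean, mixture)
d_rnase = positionwise_relative_entropy(rnase_mean, mixture)

# Step 4: soft-argmax with a large beta behaves almost like a hard argmax
control_peak = soft_argmax(d_control, beta=1000)
rnase_peak = soft_argmax(d_rnase, beta=1000)
shift = rnase_peak - control_peak
print(control_peak, rnase_peak, shift)
```

A small beta pulls the result toward the average position of all fractions, while a large beta such as the default of 1000 selects the fraction where $D$ is largest.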
- determine_strongest_shift()#
Determines the position of the strongest shift
- export_csv(file: str, sep: str = ',')#
Exports the extra_df to a file.
- Parameters:
file (str) – Path of the file to which the DataFrame is exported.
sep (str) – separator to use.
- property extra_df#
Returns a DataFrame slice containing all columns of self.df that are not intensity columns.
- Returns:
slice of df
- Return type:
pd.DataFrame
- classmethod from_file(json_file)#
Creates a class instance from a JSON file.
- Returns:
the RAPDORData stored in the file.
- Return type:
RAPDORData
- classmethod from_files(intensities: str, design: str, logbase: int = None, sep: str = ',')#
Constructor to generate instance from files instead of pandas dataframes.
- Parameters:
intensities (str) – Path to the intensities file
design (str) – Path to the design file
logbase (Union[None, int]) – Log base, if the intensities in the intensity file are log-transformed
sep (str) – separator used in the intensities and design files. Must be the same for both.
- Return type:
RAPDORData
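A possible end-to-end workflow built only from the methods documented on this page might look like the sketch below. The wrapper name `score_from_files`, the choice of the first entry of `methods`, and `kernel=3` are assumptions for illustration; adjust them to your data.

```python
def score_from_files(intensities_path, design_path, out_path, logbase=None, sep="\t"):
    """Load, preprocess, score and export one dataset (hypothetical helper)."""
    # Imported inside the function so the sketch can be defined
    # even where RAPDOR is not installed.
    from RAPDOR.datastructures import RAPDORData

    rapdor = RAPDORData.from_files(intensities_path, design_path, logbase=logbase, sep=sep)
    # Pick one of the supported distance functions from the `methods` attribute
    method = rapdor.methods[0]
    # Normalize the profiles and calculate between-sample distances
    rapdor.run_preprocessing(method=method, kernel=3)
    # ANOSIM R, shift direction, peak positions and Mean Sample Distance
    rapdor.calc_all_scores()
    # Write the non-intensity columns, including the scores, to disk
    rapdor.export_csv(out_path, sep=sep)
    return rapdor
```

It could then be called as `score_from_files("../testData/testFile.tsv", "../testData/testDesign.tsv", "scores.tsv", logbase=2)`, where `scores.tsv` is a placeholder output path.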
- classmethod from_json(json_string)#
Creates class instance from JSON string.
- Parameters:
json_string – string representation of the RAPDORData
- Returns:
the RAPDORData stored in the string
- Return type:
RAPDORData
- normalize_and_get_distances(method: str, kernel: int = 0, eps: float = 0)#
Normalizes the array and calculates sample distances.
- Parameters:
method (str) – One of the values from methods. The method used for sample distance calculation.
kernel (int) – Averaging kernel size. This kernel is applied to the fractions.
eps (float) – epsilon added to the intensities to overcome problems with zeros.
- normalize_array_with_kernel(kernel_size: int = 0, eps: float = 0)#
Normalizes the array and sets norm_array attribute.
- Parameters:
kernel_size (int) – Averaging kernel size. This kernel is applied to the fractions.
eps (float) – epsilon added to the intensities to overcome problems with zeros.
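The effect of the two parameters can be illustrated on a single intensity profile. This sketch assumes that the averaging kernel is a moving average over complete windows only (so the profile becomes shorter), that `eps` is added after smoothing, and that an all-zero profile yields NaN. These details are assumptions and may not match the library.

```python
def normalize_profile(intensities, kernel_size=0, eps=0.0):
    """Smooth one profile across fractions, add eps, and scale it to sum to 1."""
    values = [float(v) for v in intensities]
    if kernel_size > 1:
        # Moving average over neighbouring fractions, complete windows only
        values = [
            sum(values[i:i + kernel_size]) / kernel_size
            for i in range(len(values) - kernel_size + 1)
        ]
    values = [v + eps for v in values]
    total = sum(values)
    if total == 0:
        # No signal at all: nothing to normalize (a choice made for this sketch)
        return [float("nan")] * len(values)
    return [v / total for v in values]


raw = [0, 2, 8, 2, 0]
plain = normalize_profile(raw)
smoothed = normalize_profile(raw, kernel_size=3)
with_eps = normalize_profile([0, 0, 4], eps=1.0)
print(plain)
print(smoothed)
print(with_eps)
```

Adding `eps` removes exact zeros from the normalized profile, which matters for distance functions that take logarithms or ratios.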
- pca()#
Performs PCA on the normalized array.
Results are stored in df as PC1 and PC2. The explained variance is stored in pca_var.
- rank_table(values, ascending)#
Ranks the df.
This can be useful if you don't have a sufficient number of samples and thus can't calculate a p-value. The ranking scheme can be set via the function parameters.
- Parameters:
values (List[str]) – which columns to use for ranking
ascending (List[bool]) – a boolean list indicating whether the column at the same index in values should be sorted ascending.
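How `values` and `ascending` pair up can be illustrated with a plain-Python sketch of multi-column ranking. The column names, the `Rank` key, and the tie-breaking by successive columns are assumptions of this example rather than a description of the library's ranking scheme.

```python
def rank_rows(rows, values, ascending):
    """Order rows by several columns and attach a 1-based rank."""
    if len(values) != len(ascending):
        raise ValueError("values and ascending must have the same length")
    ordered = list(rows)
    # Stable sorts applied from the least to the most significant column
    for column, asc in reversed(list(zip(values, ascending))):
        ordered.sort(key=lambda row: row[column], reverse=not asc)
    return [dict(row, Rank=position) for position, row in enumerate(ordered, start=1)]


# Hypothetical score columns for three proteins
proteins = [
    {"id": "A", "ANOSIM R": 1.0, "Mean Distance": 0.2},
    {"id": "B", "ANOSIM R": 0.4, "Mean Distance": 0.9},
    {"id": "C", "ANOSIM R": 1.0, "Mean Distance": 0.7},
]

# Highest ANOSIM R first; ties are broken by the highest Mean Distance
ranked = rank_rows(proteins, values=["ANOSIM R", "Mean Distance"], ascending=[False, False])
for row in ranked:
    print(row["Rank"], row["id"])
```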
- property raw_lfc#
Calculates the log2 fold change of the raw intensity means.
- Returns:
array containing the log2 fold change of the raw intensities
- Return type:
np.ndarray
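For a single protein, a log2 fold change of raw intensity means can be computed as in the sketch below. Taking the ratio as treatment over control and averaging over all raw values of each group are assumptions of this example.

```python
import math


def raw_log2_fold_change(treatment, control):
    """log2 of the ratio between the mean raw intensities of two groups."""
    mean_treatment = sum(treatment) / len(treatment)
    mean_control = sum(control) / len(control)
    return math.log2(mean_treatment / mean_control)


# Hypothetical raw intensities of one protein, pooled over fractions and replicates
lfc = raw_log2_fold_change(treatment=[300, 500], control=[90, 110])
print(lfc)  # 2.0, i.e. a four-fold higher mean intensity in the treatment group
```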
- run_preprocessing(method: str = None, kernel: int = 0, impute: bool = False, impute_perc: float = 0.5, impute_nn: int = 10, impute_quantile: float = 0.95, eps: float = 0)#
Normalizes the array, imputes missing values if needed and calculates sample distances.
- Parameters:
method (str) – One of the values from methods. The method used for sample distance calculation.
kernel (int) – Averaging kernel size. This kernel is applied to the fractions.
impute (bool) – Whether to impute missing values using KNN
impute_perc (float) – Maximum percentage of missing values per fraction and condition used for imputation.
impute_nn (int) – Number of nearest neighbors considered during data imputation.
impute_quantile (float) – Only imputes values for the quantile with the lowest mean distance between replicates. Others are assumed to be too noisy, so data imputation might cause problems.
eps (float) – epsilon added to the intensities to overcome problems with zeros.
- to_json(file: str)#
Exports the object to JSON
- Parameters:
file (str) – Path to the file where the JSON encoded object should be stored.
- to_jsons()#
Encodes this object as a JSON string
- Returns:
JSON string representation of this object
- Return type:
str