5. The ANOVA analysis in details ¶

Contents

The ANOVA analysis in details

5.1. About the settings ¶

When using the ANOVA instance or the ANOVAReport to create an HTML report (see HTML report), all tunable settings are accessible from an attribute called settings:

from gdsctools import ANOVA, ic50_test
gdsc = ANOVA(ic50_test)
gdsc.settings

This settings attribute is an instance of gdsctools.settings.ANOVASettings, which is fully documented in the reference. For example in the HTML report section, we will change the default output directory where HTML pages are saved to a user-defined value.

It is also important to note that when calling ANOVAReport, the first argument is an ANOVA instance that already contains the settings. So, ANOVAReport will use that settings automatically (it may still be changed later). Consider this example:

>>> from gdsctools import ANOVA, ic50_test, ANOVAReport
>>> gdsc = ANOVA(ic50_test)
>>> gdsc.settings.FDR_threshold = 15
>>> results = gdsc.anova_all()

>>> ar = ANOVAReport(gdsc, results)
>>> ar.settings.FDR_threshold
15

We will see more settings here below but first let us come back on the ANOVA analysis.

5.2. Regression analysis ¶

By default, the regression uses an OLS method. 4 Factors may be taken into account depending on the content of the GenomicFeature data set.

Here below, we use the R syntax where C() stands for categorical data and Y is the variable to be explained. Depending on the data and the gdsctools.anova.settings one of the following formula will be used:

$Y \sim C(tissue) + C(media) + C(MSI) + Feature$

$Y \sim C(tissue) + C(MSI) + Feature$

$Y \sim C(MSI) + Feature$

$Y \sim Feature$

Here are some rules applied:

Feature factor is always included by definition and is in last position
MSI and Media are included by default if found in the genomic feature data set. Note, however than one can exclude these factors using the settings.
Tissue is included if there are more than 2 tissues. Again, one can change the settings.analysis_type to the name of the tissue (instead of PANCAN, the default value).

Note

The order of the different features in the equations may have an impact on the analysis.

Since analysis may be time-consuming, we have hard-coded the regression formula. Note, however, that in version 0.16, we have added the anova_one_drug_one_feature_custom() method, which can be use for any type of regression based on a user formula. This is slower than the hard-coded version mentionned above but is more flexible. One can for instance set the formula to specify the treatement to be used as a reference:

an.settings['regression_formula'] = "Y ~ C(tissue, Treatment(reference='breast'))"

The ANOVA analysis itself uses a type I error. The summary can be obtained for a specific combination of drug and feature as follows:

from gdsctools import *
an = ANOVA(ic50_test, gf_v17)
drugid = 1047
feature = an.feature_names[0]
odof = an._get_one_drug_one_feature_data(drugid, feature)
res = an.anova_one_drug_one_feature(drugid, feature)
an._get_anova_summary(an.data_lm, output="dataframe", odof=odof)

and should show the following summary:

              Df       Sum Sq    Mean Sq  F value       PR(>F)
tissue      26.0   352.345257  13.551741  9.26853  1.63864e-31
msi          1.0     5.309389   5.309389  3.63129    0.0570537
feature      1.0     3.186109   3.186109   2.1791     0.140282
Residuals  817.0  1194.554709   1.462123     None         None

An alternative (simpler but slower) way since version 0.16 is to use:

an.anova_one_drug_one_feature_custom(drugid, feature,
    formula='Y ~ C(tissue) + C(msi) + feature')

5.2.1. Type I error ¶

The ANOVA analysis is based on a Type I error, also called sequential sum of squares. Consider 2 effects A and B, it tests the main effect of factor A, followed by the main effect of factor B after the main effect of A, followed by the interaction effect AB after the main effects. So, this type of sums of squares gives different results for unbalanced data depending on the sequence.

5.2.2. MSI factor ¶

MSI is always included by default. However, you may exclude it by setting its value to False:

settings.include_MSI_factor

If MSI_FACTOR column is not found in the Genomic Feature data set, the MSI factor will be excluded automatically and the parameter above set to False.

Warning

If you force the MSI factor to True whereas there is not enough data in the binary sets of the MSI factor, error will be raised.

5.2.3. MEDIA factor ¶

If included in the genomic feature data set, MEDIA are included by default. However, you may exclude it by setting its value to False:

settings.include_MEDIA_factor

If MEDIA_FACTOR column is not found in the Genomic Feature data set, the MEDIA factor will be set automatically to False.

5.2.4. Tissue factor ¶

Another factor used in the regression (tissue) will be automatically excluded if there is only one tissue (or none). If several tissues are available, you can still exclude it from the regression analysis by settings this parameter to anything different from the default value (PANCAN):

settings.analysis_type = PANCAN

5.3. Filtering ¶

When performing the analysis for a given drug and feature, the regression may not be performed if there is not enough statistics.

These parameters will influence the number of tests being performed (number of associations of drug vs feature in anova_all()):

- minimum_nonna_ic50
- MSI_feature_threshold
- feature_factor_threshold

The first parameter indicates the minimum number of valid IC50 required for a given drug to be analysed. The current default value is 6.

The second parameter indicates the minimum size of the positive and negative population when IC50 are filtered by MSI factor (defaults to 2).

The third parameter indicates the minimum size of the positive and negative population when IC50 are filtered by Feature factor (defaults to 3).

This table summarizes the effect of these parameters:

The left hand side table mimics the IC50 data. The first column should and last 3 rows are not to be included in an IC50 matrix (see Data Format and Readers) but are added here as annotations for the following discussions.

When the regression analysis is performed for a given drug and a given feature, 3 filters are applied. First, a minimum number of values is required (minimum_nonna_ic50 setting). Therefore, the drug is not analysed. The second check is performed with respect to the MSI values. A drug can be analysed only if (once NA have been discarded) the number of IC50s corresponding to positive and negative MSIs is greater or equal to MSI_feature_threshold. In our example, the drugs in column D_pMSI=0 and D_pMSI=1 are therefore discarded since they have zero and only one positive MSI, respectively.

Finally, similarly to the MSI check, a drug/feature association is analysed if the number of IC50s corresponding to positive and negative feature is or equal to feature_factor_threshold.

5.4. Multiple testing corrections ¶

By default, the multiple testing correction is based on Benjamini–Hochberg (BH) method but it can be set to other methods using

settings.pval_correction_method

5.5. volcano plots ¶

The volcano plots are one of the main results of the analysis and summarizes visually the significance of the different associations.

It is part of the AnovaResults class and is returned either by an ODAF or ADAF analysis:

from gdsctools import ANOVA, ic50_test
gdsc = ANOVA(ic50_test)
res = gdsc.anova_all()
res.volcano()

(Source code, png, hires.png, pdf)

Here are some parameters used to tune the plots and selection of significant events:

pvalue_threshold is used to select significant hits. See ANOVAReport.
effect_threshold is used to select significant hits as well.
FDR_threshold is used in gdsctools.volcano.VolcanoANOVA (horizontal lines)
volcano_FDR_interpolation uses interpolation to plot the FDR lines in the volcano plot.
volcano_additional_FDR_lines : [0.01, 0.1, 10]

5.6. others ¶

See ANOVASettings for the full listing.

Note

Some settings will be set automatically when calling some functions. For instance, if you call anova.ANOVA.set_cancer_type() to a single tissue, then the analysis_type will be set to the tissue’s name. If there are not enough positive or negative MSI, the MSI factor will ignored.