14. Reproducibility

In this section we show how to reproduce some specific results such as those published in the GDSC1000 paper (see Notes to reproduce v17a results (GDSC1000 Cell paper) section).

The motivation for this section is to show that results can be reproduced using an official release and the proper set of parameters used in the analysis.

14.1. Notes to reproduce v17a results (GDSC1000 Cell paper)

Reference:Iorio et al 2016

We refer to the data used in the reference above as the v17a data sets. It can be found on the CancerRxGene page. We will use this data to reproduce the results published in the same web page.

First, we need to download the data. For example, let us retrieve the BRCA tissue specific data set using wget command:

wget http://www.cancerrxgene.org/gdsc1000/GDSC1000_WebResources//Data/anova/BRCA/DATA/INPUT/ANOVA_input.txt

Warning

in v17a data, the input file is named ANOVA_input.txt It is a tabulated format and contains all IC50s and genomic features altogether. This is not recommended and we now expect the IC50s and genomic_features to be in two different files. However, for back compatibility, GDSCTools is able to extract the IC50 alone, or the genomic features alone from that kind of format. See hereafter and Data Format and Readers section for details.

Once downloaded, create an ANOVA instance as follows:

from gdsctools import ANOVA
an = ANOVA("ANOVA_input.txt", "ANOVA_input.txt")

Here, we provide the filename twice; this is not a mistake, please see the warning box above. Internally, the first argument extracts the IC50s and the second one extracts the genomic features.

In v17a, there are two parameters that are not the current default values that must be set to reproduce the results:

an.setttings.pvalue_correction_method = "qvalue"
an.settings.equal_var_ttest = False

The later parameter is used in some plots for annotation but is not essential. The first parameter is important since it sets the method used for multiple testing correction. It will also define the number of associations that are significant. The FDR threshold that defines the significant associations was set to 25.

In the case of a tissue specific analysis (here BRCA), the name of the tissue is unknown (not specified anywhere inside the file) and so one should provide the information. This is not important for the analysis itself but is used for instance to name the output of the directory where HTML reports are stored:

report = ANOVAReport(an)
report.settings.directory = "BLCA"
report.create_html_pages()

Note that the multiple corrected values reported by GDSCTools and found on the website are different by a systematic bais of 2-3 %. This is known and due to a different implementation of the qvalue method (smoothing function). However, the number of tests and the ANOVA_FEATURE_pval column (pvalues of the FEATURE factor) should agree perfectly. Finally, note that because the value of the FDR (corrected values) differ, the number of significant associations below that threshold may also slightly differ. However, results are consitent for FDR not close to the threshold.

As for the PANCAN case, results are currently different between GDSCTools and what is posted within the link at the top of the page because there is currently a mismatch between the ANOVA_input file provided and the results provided (one has the MEDIA factor while there other has not)

For information, on an intel i7 core, the analysis of the PANCAN data set (265 drugs and about 1000 features) takes about 20 minutes to finish. Tissue specific data files takes a few minutes or less in general.

14.2. Notes about v18 onwards

In V18 onwards (May 2016), the IC50 may have duplicated columns for a given drug. So some drugs are clustered together. The algorithm was implemented in GDSCTools in IC50Cluster class. It should be used as follow.

IC50 must be read with the IC50Cluster class as follows:

from gdsctools import *
ic50 = IC50Cluster("v18_data")

This will ensure also that the drug identifiers are unique. Indeed, in v18 data sets, columns for a given DRUG ID may be duplicated (for different drug concentration).

Then, as usual:

an = ANOVA(ic50, "GF.csv", "DRUG_DECODE.csv")

All default parameters were used except for the FDR threshold:

report.settings.FDR_threshold = 35