####################################################################################################

Electronic appendix to the BMC Bioinformatics article:

Combining location-and-scale batch effect adjustment with data cleaning by latent factor adjustment

Roman Hornung*, Anne-Laure Boulesteix, David Causeur

* Department of Medical Informatics, Biometry and Epidemiology,
  University of Munich, Marchioninistr. 15, D-81377, Munich, Germany;
  for questions please contact:  hornung@ibe.med.uni-muenchen.de

####################################################################################################


Program and Platform:
#####################

- Program: R, version 3.1.1

- Used R packages:

     'bapred', version: 0.2
     'corpcor', version: 1.6.6
     'ggplot2', version: 1.0.1 
     'gridExtra', version: 2.0.0 
     'Hmisc', version: 3.14-4 
     'MASS', version: 7.3-33 
     'Matrix', version: 1.1-4 
     'mvtnorm', version: 1.0-0 
     'plyr', version: 1.8.1 
     'pscl', version: 1.4.9 
     'snowfall', version: 1.84-6

  Used Bioconductor packages:  

     'affy', version: 1.42.3 
     'ArrayExpress', version: 1.24.0 
     'CMA', version: 1.22.0 
     'GEOquery', version: 2.30.1
       
  NOTE: Packages listed above depend on others, which might have to be installed
        manually in case this is not performed automatically.

- Platform: Linux (x86-64)
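
The installed package versions can be compared with the versions listed above
along the following lines. This is only a sketch: the helper "check_versions"
is hypothetical and not part of the appendix, and only an excerpt of the
listed packages is shown. Bioconductor packages can be checked the same way.

```r
# Versions as listed in this README (excerpt; extend as needed).
required <- c(bapred = "0.2", corpcor = "1.6.6", ggplot2 = "1.0.1",
              MASS = "7.3-33", snowfall = "1.84-6")

# Hypothetical helper (not part of the appendix): one row per package with
# the required version, the installed version (NA if the package is missing)
# and a flag telling whether both versions agree.
check_versions <- function(required) {
  rows <- lapply(names(required), function(pkg) {
    is_installed <- nzchar(system.file(package = pkg))
    installed <- if (is_installed) {
      as.character(packageVersion(pkg))
    } else {
      NA_character_
    }
    data.frame(
      package   = pkg,
      required  = required[[pkg]],
      installed = installed,
      match     = is_installed &&
        packageVersion(pkg) == package_version(required[[pkg]]),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

print(check_versions(required))
```

A mismatch does not necessarily mean that the scripts fail; the listed
versions are simply the ones the analyses were run with.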




General information:
####################

- The folder "FAbatchPaper" that contains this README has to be placed in the
  home directory ("~/") of a Linux machine.
  We use paths of the form "./FAbatchPaper/...".
  Note: Evaluating the results (see below) can also be performed under Windows.
  Retrieving the datasets and reproducing the analysis (which is performed in
  parallel) require Linux.

- The results are stored in the folder "Results".

- The folder "Functions" contains R-scripts with functions used in the simulations
  ("SimulationSnowfallFunctions.R") and real-data analyses 
  ("RealdataanalysisSnowfallFunctions.R").

- The empty folder "InterimResults" is used in the analyses to store intermediate
  files which are not part of the final results.
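
Before running any script it can help to verify that the folder layout is in
place. The sketch below uses folder names mentioned in this README; that all
of them sit directly below "FAbatchPaper" is an assumption, and the helper
"missing_dirs" is hypothetical.

```r
# Folder names as mentioned in this README. That they all sit directly
# below "FAbatchPaper" is an assumption.
expected_dirs <- c("Results", "Functions", "InterimResults",
                   "EvaluationOfResults",
                   file.path("Datasets", "PreparationScripts"))

# Hypothetical helper: return the expected folders that are missing
# below `root`. file.info() is used so that this also runs on older
# R versions.
missing_dirs <- function(root, dirs = expected_dirs) {
  info <- file.info(file.path(root, dirs))
  dirs[is.na(info$isdir) | !info$isdir]
}

# Intended use: start R in the home directory (or call setwd("~")) so that
# paths of the form "./FAbatchPaper/..." resolve.
print(missing_dirs("./FAbatchPaper"))
```

An empty result means that all checked folders were found.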




Evaluation of the results:
##########################

- For the evaluation of the results it is not necessary to re-run the analyses:
  In the folder "Results" we provide Rda-files containing the results in raw form.
  In the folder "EvaluationOfResults" we provide R-scripts that load these Rda-files
  and produce the results presented in the paper and in the Supplementary
  Materials. These R-scripts are as follows:
  "MainSimulationScript_EVALUATION.R" provides the results of the main simulation study.
  "RealdataanalysisScript_EVALUATION.R" computes the metric values after batch effect
  adjustment using the real datasets and generates the PC plots of the datasets
  presented in the Supplementary Materials. Note: the code for generating the PC plots
  requires the datasets, which have to be downloaded first (see the section
  "Datasets:" below).
  "CrossBatchPredictionScript_EVALUATION.R" provides the cross-batch prediction results
  for the real dataset presented in the section "Application in cross-batch prediction".
  "CrossBatchPredictionSimulation_EVALUATION.R" provides the cross-batch prediction results
  for the simulated datasets, also presented in the section "Application in cross-batch
  prediction".
  "SimulationSignalExaggeration_EVALUATION.R" provides the results of the simulation
  illustrating the overoptimism caused by applying SVA, discussed in the section
  "Artificial increase of measured class signal by applying SVA".

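To inspect one of the Rda-files from "Results" without running a full
evaluation script, the file can be loaded into a separate environment, as
sketched below. The helper "load_results" is hypothetical, and the temporary
file and the object name "metrics" are made up for the demonstration; the
actual file and object names are those used in the evaluation scripts.

```r
# Hypothetical helper: load an Rda-file into its own environment and return
# it, so that the loaded objects do not overwrite anything in the workspace.
load_results <- function(path) {
  env <- new.env()
  load(path, envir = env)
  env
}

# Self-contained demonstration: a temporary file stands in for a file from
# "Results" (the object name "metrics" is made up).
demo_file <- tempfile(fileext = ".Rda")
metrics <- data.frame(method = c("A", "B"), value = c(0.1, 0.2))
save(metrics, file = demo_file)

res <- load_results(demo_file)
print(ls(res))      # names of the objects stored in the file
str(res$metrics)    # structure of one stored object
```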



Reproducing the results:
########################

Datasets:
#########

- Because of the size of the datasets used in the real-data analyses, these are not
  included in the Electronic Appendix. However, they can be obtained automatically by
  executing the corresponding scripts in the folder "Datasets/PreparationScripts".
  Alternatively, all datasets can be obtained at once with the script "GetAllData.R"
  found in the same folder. Note, however, that executing the latter is very
  time-consuming.

- The datasets will be stored in the folder "Datasets/ProcessedData" in the form of
  Rda-files containing: "X", the covariate matrix, "y", the target variable and "batch",
  the batch variable.

- The folder "Datasets/DownloadedIntermediateData" only holds intermediate files
  downloaded while the data retrieval scripts are running. These files are deleted at
  the end of each data retrieval script.
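
After a dataset has been retrieved, the contents of its Rda-file in
"Datasets/ProcessedData" can be checked as sketched below. The helper
"check_dataset" is hypothetical, the demonstration uses a small artificial
file instead of a real dataset, and it is an assumption that the rows of "X"
correspond to the observations.

```r
# Hypothetical helper: load a processed dataset and check that it contains
# "X", "y" and "batch" with one entry per observation.
# Assumption: the rows of X are the observations.
check_dataset <- function(path) {
  env <- new.env()
  load(path, envir = env)
  stopifnot(all(c("X", "y", "batch") %in% ls(env)))
  stopifnot(nrow(env$X) == length(env$y),
            nrow(env$X) == length(env$batch))
  list(n = nrow(env$X), p = ncol(env$X), batches = table(env$batch))
}

# Self-contained demonstration with artificial data (6 observations,
# 3 variables, 2 batches) standing in for a real dataset file.
demo_file <- tempfile(fileext = ".Rda")
X <- matrix(rnorm(18), nrow = 6, ncol = 3)
y <- factor(c(1, 2, 1, 2, 1, 2))
batch <- factor(c(1, 1, 1, 2, 2, 2))
save(X, y, batch, file = demo_file)

info <- check_dataset(demo_file)
print(info)
```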

Simulations:
############

- The script "MainSimulationScript.R" performs the main simulation. This analysis, as
  well as all other analyses below, requires an MPI environment. The number of CPUs
  can be varied by changing "ncpus" to a different number in the code.
  The simulation settings are generated in the script "SetupSimulation.R", which also
  provides detailed information on how the parameter values were specified.
  In the specification of the simulation design we used a colon cancer dataset, which
  can be obtained by the script "ColoncbTranscr_preparationinfos.R". This dataset will
  be stored in the folder "InterimResults".
  "SetupSimulation.R" also specifies the additional setting in which the predictors
  are uncorrelated. This setting is used in the cross-batch prediction simulation.

- The script "CrossBatchPredictionSimulation.R" performs the cross-batch prediction
  simulation presented in the section "Application in cross-batch prediction".

- The script "SimulationSignalExaggeration.R" performs the simulation illustrating the
  overoptimism caused by applying SVA, discussed in the section "Artificial increase
  of measured class signal by applying SVA".
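
The role of "ncpus" in a snowfall-based parallel run can be illustrated as
follows. This is a minimal sketch and not the code of the appendix: the
wrapper "run_parallel" and the toy function "one_iteration" are hypothetical,
and the sketch defaults to a socket cluster (type = "SOCK") so that it can be
tried without MPI. The analyses themselves require an MPI environment, for
which type = "MPI" (and the package 'Rmpi') would be needed.

```r
# Hypothetical wrapper: apply `fun` to each element of `settings`, using
# snowfall if it is installed and plain lapply() otherwise.
run_parallel <- function(settings, fun, ncpus = 2, type = "SOCK") {
  if (!requireNamespace("snowfall", quietly = TRUE)) {
    return(lapply(settings, fun))
  }
  # With ncpus = 1 snowfall runs sequentially; otherwise it starts a
  # cluster with `ncpus` workers of the given type.
  snowfall::sfInit(parallel = ncpus > 1, cpus = ncpus, type = type)
  on.exit(snowfall::sfStop())
  snowfall::sfLapply(settings, fun)
}

# Toy stand-in for one simulation iteration (not the appendix's code).
one_iteration <- function(seed) {
  set.seed(seed)
  mean(rnorm(100))
}

res <- run_parallel(1:4, one_iteration, ncpus = 1)
print(unlist(res))
```

Setting the seed inside each iteration keeps the toy results independent of
the number of workers.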

Real-data analyses:
###################

- The calculation of the metric values can be performed by the script 
  "RealdataanalysisScript.R".

- The cross-batch prediction study is implemented in "CrossBatchPredictionScript.R".

- The plot from the example in the introduction is obtained from "ExampleIntroduction.R".
  This R-script requires the presence of the "AlcoholismTranscr"-dataset.

- The validity checks presented in the section "Verification of model assumptions
  on the basis of real data" can be performed by the script "CheckModelFit.R".