Multivariate meta-analysis of proteomics data from human prostate and colon tumours
Stockholm Bioinformatics Center Seminars
Wednesday 15 April 2009
to 17:00 at
Lina Hultin-Rosenberg (KI)
There is a great need for better diagnostic and prognostic methods in cancer therapy. The methods used today depend on experienced cytologists and pathologists and are very time-consuming. Hence, there is a need to find clinically applicable protein biomarkers to support diagnosis and tumour classification. Multivariate methods such as PLS, in which the expression of several genes or proteins is studied simultaneously, have previously been shown to be powerful in biomarker discovery.
The main aim of this study is to perform a multivariate meta-analysis of 2D gel electrophoresis data originating from several studies of different cancer types. By incorporating data from various tumour types, numerous clinical questions can be addressed. For example, potential biomarkers specific to a certain tumour type can be identified, as well as biomarkers that are general to all malignant tumour types. Results from a meta-analysis of prostate cancer (n=39) and colon cancer (n=43) epithelial tumour tissue profiling are presented in this study. The datasets were matched to each other using the PDQuest software, and an expression database containing the intensities of the spots in all samples was established. Further data analysis was performed in R.
Two different ways of treating missing values were run in parallel through the analysis. Spots with a large fraction of missing data were excluded prior to analysis, and the remaining missing data points were replaced with either the mean value of the spot or the 10% lowest value for the spot. PLS-DA (Partial Least Squares Discriminant Analysis) was used to build predictive models and to select the most important variables for distinguishing between the classes normal and tumour. The spots were ranked by the PLS-dependent VIP (Variable Importance in Projection) score, and the most important variables were selected for prediction. This was repeated with a decreasing number of variables, and the prediction success measures were evaluated. The modelling procedure was performed at two levels of validation, to ensure stable variable selection and model optimization and to measure the performance of the optimized model. The most stable variables from a bootstrap validation were selected for the final prediction of a test set.
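The two missing-value treatments can be illustrated with a minimal sketch. It is written in plain Python for illustration only; the study itself used R, and this is not the code used there. The function name `filter_and_impute`, the exclusion threshold `max_missing=0.5`, and the reading of "the 10% lowest value" as the nearest-rank 10th percentile of the observed intensities are all assumptions, since the abstract does not specify them.

```python
import math
from statistics import mean


def filter_and_impute(spots, max_missing=0.5, strategy="mean"):
    """Drop spots with too many missing values, then fill the remaining gaps.

    spots: dict mapping a spot id to a list of intensities, None = missing.
    max_missing: hypothetical cut-off; the abstract only says "a large fraction".
    strategy: "mean" fills with the spot mean; "low" fills with a low observed
              value (assumed here: nearest-rank 10th percentile).
    """
    imputed = {}
    for spot_id, values in spots.items():
        observed = [v for v in values if v is not None]
        missing_fraction = 1 - len(observed) / len(values)
        if not observed or missing_fraction > max_missing:
            continue  # spot excluded prior to analysis
        if strategy == "mean":
            fill = mean(observed)
        elif strategy == "low":
            ranked = sorted(observed)
            rank = max(1, math.ceil(0.10 * len(ranked)))  # nearest-rank method
            fill = ranked[rank - 1]
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        imputed[spot_id] = [fill if v is None else v for v in values]
    return imputed


# Toy data: spot "s2" is missing in 3 of 4 samples and is therefore dropped.
example = {"s1": [1.0, None, 3.0, 5.0], "s2": [None, None, None, 2.0]}
mean_filled = filter_and_impute(example, strategy="mean")
low_filled = filter_and_impute(example, strategy="low")
```

Running both strategies on the same filtered data, as in the last two lines, mirrors the idea of carrying the two treatments in parallel through the rest of the analysis.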
Despite the very different tissues in the datasets, around 50 variables were selected in at least 50% of the bootstrap rounds. This indicates some stability in the dataset and a strong signal for those variables. When applied in a PLS model to predict a held-out test set, the variables yielded a rather promising prediction success (the geometric mean of sensitivity and specificity was 86%). Further analysis will aim at identifying the selected proteins and validating their use as biomarkers in cancer diagnostics and therapy.
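The two quantities reported here, the bootstrap selection frequency and the geometric mean of sensitivity and specificity, can be sketched as follows. This is an illustrative plain-Python sketch rather than the R code used in the study. The function names, the toy inputs, and the choice of "tumour" as the positive class are assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def stable_variables(selections, min_fraction=0.5):
    """Return variables selected in at least min_fraction of bootstrap rounds.

    selections: one collection of selected variable ids per bootstrap round.
    """
    n_rounds = len(selections)
    # set() ensures a variable is counted at most once per round
    counts = Counter(v for selected in selections for v in set(selected))
    return sorted(v for v, c in counts.items() if c / n_rounds >= min_fraction)


def geometric_mean_score(y_true, y_pred, positive="tumour"):
    """Geometric mean of sensitivity and specificity for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sqrt(sensitivity * specificity)


# Toy example: four bootstrap rounds with hypothetical spot ids.
rounds = [["a", "b"], ["a", "c"], ["a", "b"], ["d"]]
stable = stable_variables(rounds)  # "a" and "b" reach the 50% threshold

truth = ["tumour", "tumour", "normal", "normal"]
predicted = ["tumour", "normal", "normal", "normal"]
score = geometric_mean_score(truth, predicted)
```

The geometric mean penalizes a model that does well on one class only: if either sensitivity or specificity is zero, the score is zero, whatever the overall accuracy.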