Modelling-based joint embedding of histology and genomics using canonical correlation analysis for breast cancer survival prediction
Abstract
Traditional approaches to predicting breast cancer patients’ survival outcomes have been based on clinical subgroups, the PAM50 genes, or the evaluation of histological tissue. With the growth of multi-modality datasets capturing diverse information (such as genomics, histology, radiology and clinical data) about the same cancer, this information can be integrated using advanced tools, which has improved survival prediction. These methods implicitly exploit the key observation that different modalities originate from the same cancer source and jointly provide a complete picture of the cancer. In this work, we investigate the benefits of explicitly modelling multi-modality data as originating from the same cancer under a probabilistic framework. Specifically, we consider histology and genomics as two modalities originating from the same breast cancer under a probabilistic graphical model (PGM). We construct maximum likelihood estimates of the PGM parameters based on canonical correlation analysis (CCA) and then infer the underlying properties of the cancer patient, such as survival. Equivalently, we construct CCA-based joint embeddings of the two modalities and input them to a learnable predictor. Penalized variants of CCA (pCCA) capture real-world properties such as sparsity and graph structure, and are therefore better suited to cancer applications. To generate richer multi-dimensional embeddings with pCCA, we introduce two novel embedding schemes that encourage orthogonality, yielding more informative embeddings. The efficacy of our proposed prediction pipeline is first demonstrated on simulated data, via low prediction errors of the hidden variable and the generation of informative embeddings. When applied to breast cancer histology and RNA-sequencing expression data from The Cancer Genome Atlas (TCGA), our model provides interpretable survival predictions with average concordance indices of up to 68.32%.
We also illustrate how the pCCA embeddings can be used for survival analysis through Kaplan–Meier curves.
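As a rough illustration of the joint-embedding idea described in the abstract, the following is a minimal sketch of classical (unpenalized) CCA on simulated two-view data that share one hidden variable. It is not the pCCA variants or the embedding schemes proposed in this work; the function name `cca_embed`, the ridge term `reg`, and the simulated dimensions are all assumptions made for illustration only.

```python
import numpy as np

def cca_embed(X, Y, k=2, reg=1e-3):
    """Classical CCA via whitening + SVD (illustrative; not the paper's pCCA).

    Returns the k-dimensional embeddings of each view and the top-k
    canonical correlations. `reg` is a small ridge term (an assumption)
    added to keep the covariance matrices invertible.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive-definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Wx, Wy = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(Wx @ Sxy @ Wy)
    A = Wx @ U[:, :k]       # projection for view X
    B = Wy @ Vt.T[:, :k]    # projection for view Y
    return Xc @ A, Yc @ B, s[:k]

# Simulated data: two views generated from one shared hidden variable z.
rng = np.random.default_rng(0)
n = 200
z = rng.standard_normal((n, 1))
X = z @ rng.standard_normal((1, 5)) + 0.1 * rng.standard_normal((n, 5))
Y = z @ rng.standard_normal((1, 4)) + 0.1 * rng.standard_normal((n, 4))

Zx, Zy, corrs = cca_embed(X, Y, k=2)
joint = np.hstack([Zx, Zy])  # joint embedding that could feed a downstream predictor
```

In this toy setting the first canonical direction should track the shared hidden variable closely, and `joint` plays the role of the embedding that would be passed to a survival predictor.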