Bayesian clinical classification from high-dimensional data

Signatures versus variability

Akram Shalabi, Masato Inoue, Johnathan Watkins, Emanuele De Rinaldis, Anthony C.C. Coolen

    Research output: Contribution to journal › Article

    2 Citations (Scopus)

    Abstract

    When data exhibit imbalance between a large number d of covariates and a small number n of samples, clinical outcome prediction is impaired by overfitting and prohibitive computation demands. Here we study two simple Bayesian prediction protocols that can be applied to data of any dimension and any number of outcome classes. Calculating Bayesian integrals and optimal hyperparameters analytically leaves only a small number of numerical integrations, and CPU demands scale as O(nd). We compare their performance on synthetic and genomic data to the mclustDA method of Fraley and Raftery. For small d they perform as well as mclustDA or better. For d = 10,000 or more mclustDA breaks down computationally, while the Bayesian methods remain efficient. This allows us to explore phenomena typical of classification in high-dimensional spaces, such as overfitting and the reduced discriminative effectiveness of signatures compared to intra-class variability.
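
    The abstract's claim that CPU demands scale as O(nd) can be illustrated with a toy model. The sketch below is not the Bayesian protocol developed in the paper; it is a minimal, hypothetical example of a factorised (diagonal-covariance) Gaussian class-conditional classifier, written only to show why a per-covariate model keeps both fitting and prediction linear in the number of samples n and covariates d. All function names and the toy data are illustrative assumptions.

    # Minimal sketch, NOT the paper's method: a diagonal-covariance Gaussian
    # class-conditional classifier whose fit and predict costs are O(nd).
    import numpy as np

    def fit_diagonal_gaussians(X, y, eps=1e-6):
        """Per-class, per-covariate means and variances; one pass over X, O(nd)."""
        params = {}
        for c in np.unique(y):
            Xc = X[y == c]                           # samples of class c
            mu = Xc.mean(axis=0)                     # d-dimensional mean
            var = Xc.var(axis=0) + eps               # d-dimensional variance (regularised)
            params[c] = (mu, var, Xc.shape[0] / X.shape[0])  # plus class prior
        return params

    def predict(X, params):
        """Assign each sample to the class with the largest log posterior; O(nd) per class."""
        classes = sorted(params)
        scores = []
        for c in classes:
            mu, var, prior = params[c]
            # log of the factorised Gaussian likelihood plus the log prior
            log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var, axis=1)
            scores.append(log_lik + np.log(prior))
        return np.array(classes)[np.argmax(np.vstack(scores), axis=0)]

    # Toy usage: n = 40 samples, d = 10,000 covariates, two outcome classes,
    # with a weak "signature" planted in the first five covariates of class 1.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 10_000))
    y = rng.integers(0, 2, size=40)
    X[y == 1, :5] += 1.0
    print(predict(X, fit_diagonal_gaussians(X, y))[:10])    # in-sample predictions

    Because every quantity above is computed covariate by covariate, doubling either n or d roughly doubles the runtime, which is the linear scaling the abstract refers to; the paper's actual protocols and their analytical hyperparameter treatment are more involved.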

    Original language: English
    Pages (from-to): 336-351
    Number of pages: 16
    Journal: Statistical Methods in Medical Research
    Volume: 27
    Issue number: 2
    DOI: 10.1177/0962280216628901
    Publication status: Published - 2018 Feb 1

    Keywords

    • Bayesian classification
    • curse of dimensionality
    • Discriminant analysis
    • outcome prediction
    • overfitting

    ASJC Scopus subject areas

    • Epidemiology
    • Statistics and Probability
    • Health Information Management

    Cite this

    Bayesian clinical classification from high-dimensional data: Signatures versus variability. / Shalabi, Akram; Inoue, Masato; Watkins, Johnathan; De Rinaldis, Emanuele; Coolen, Anthony C.C.

    In: Statistical Methods in Medical Research, Vol. 27, No. 2, 01.02.2018, p. 336-351.

    Shalabi, Akram; Inoue, Masato; Watkins, Johnathan; De Rinaldis, Emanuele; Coolen, Anthony C.C. / Bayesian clinical classification from high-dimensional data: Signatures versus variability. In: Statistical Methods in Medical Research. 2018; Vol. 27, No. 2. pp. 336-351.
    @article{690ee495860e492ba410aaeb061fc568,
    title = "Bayesian clinical classification from high-dimensional data: Signatures versus variability",
    abstract = "When data exhibit imbalance between a large number d of covariates and a small number n of samples, clinical outcome prediction is impaired by overfitting and prohibitive computation demands. Here we study two simple Bayesian prediction protocols that can be applied to data of any dimension and any number of outcome classes. Calculating Bayesian integrals and optimal hyperparameters analytically leaves only a small number of numerical integrations, and CPU demands scale as O(nd). We compare their performance on synthetic and genomic data to the mclustDA method of Fraley and Raftery. For small d they perform as well as mclustDA or better. For d = 10,000 or more mclustDA breaks down computationally, while the Bayesian methods remain efficient. This allows us to explore phenomena typical of classification in high-dimensional spaces, such as overfitting and the reduced discriminative effectiveness of signatures compared to intra-class variability.",
    keywords = "Bayesian classification, curse of dimensionality, Discriminant analysis, outcome prediction, overfitting",
    author = "Akram Shalabi and Masato Inoue and Johnathan Watkins and {De Rinaldis}, Emanuele and Coolen, {Anthony C.C.}",
    year = "2018",
    month = "2",
    day = "1",
    doi = "10.1177/0962280216628901",
    language = "English",
    volume = "27",
    pages = "336--351",
    journal = "Statistical Methods in Medical Research",
    issn = "0962-2802",
    publisher = "SAGE Publications Ltd",
    number = "2",

    }

    TY - JOUR
    T1 - Bayesian clinical classification from high-dimensional data
    T2 - Signatures versus variability
    AU - Shalabi, Akram
    AU - Inoue, Masato
    AU - Watkins, Johnathan
    AU - De Rinaldis, Emanuele
    AU - Coolen, Anthony C.C.
    PY - 2018/2/1
    Y1 - 2018/2/1
    N2 - When data exhibit imbalance between a large number d of covariates and a small number n of samples, clinical outcome prediction is impaired by overfitting and prohibitive computation demands. Here we study two simple Bayesian prediction protocols that can be applied to data of any dimension and any number of outcome classes. Calculating Bayesian integrals and optimal hyperparameters analytically leaves only a small number of numerical integrations, and CPU demands scale as O(nd). We compare their performance on synthetic and genomic data to the mclustDA method of Fraley and Raftery. For small d they perform as well as mclustDA or better. For d = 10,000 or more mclustDA breaks down computationally, while the Bayesian methods remain efficient. This allows us to explore phenomena typical of classification in high-dimensional spaces, such as overfitting and the reduced discriminative effectiveness of signatures compared to intra-class variability.
    AB - When data exhibit imbalance between a large number d of covariates and a small number n of samples, clinical outcome prediction is impaired by overfitting and prohibitive computation demands. Here we study two simple Bayesian prediction protocols that can be applied to data of any dimension and any number of outcome classes. Calculating Bayesian integrals and optimal hyperparameters analytically leaves only a small number of numerical integrations, and CPU demands scale as O(nd). We compare their performance on synthetic and genomic data to the mclustDA method of Fraley and Raftery. For small d they perform as well as mclustDA or better. For d = 10,000 or more mclustDA breaks down computationally, while the Bayesian methods remain efficient. This allows us to explore phenomena typical of classification in high-dimensional spaces, such as overfitting and the reduced discriminative effectiveness of signatures compared to intra-class variability.
    KW - Bayesian classification
    KW - curse of dimensionality
    KW - Discriminant analysis
    KW - outcome prediction
    KW - overfitting
    UR - http://www.scopus.com/inward/record.url?scp=85041947217&partnerID=8YFLogxK
    UR - http://www.scopus.com/inward/citedby.url?scp=85041947217&partnerID=8YFLogxK
    U2 - 10.1177/0962280216628901
    DO - 10.1177/0962280216628901
    M3 - Article
    VL - 27
    SP - 336
    EP - 351
    JO - Statistical Methods in Medical Research
    JF - Statistical Methods in Medical Research
    SN - 0962-2802
    IS - 2
    ER -