Predicting mostly disordered proteins by using structure-unknown protein data

Kana Shimizu, Yoichi Muraoka, Shuichi Hirose, Kentaro Tomii, Tamotsu Noguchi

    Research output: Contribution to journal, Article

    54 Citations (Scopus)

    Abstract

    Background: Predicting intrinsically disordered proteins is important in structural biology because they are thought to carry out various cellular functions even though they have no stable three-dimensional structure. We know the structures of far more ordered proteins than disordered proteins. The structural distribution of proteins in nature can therefore be inferred to differ from that of proteins whose structures have been determined experimentally. We also know many more protein sequences than protein structures, and many of the known sequences can be expected to be those of disordered proteins. Thus, using the information in structure-unknown proteins should be an efficient way to avoid training data sparseness. We propose a novel method for predicting which proteins are mostly disordered by using a spectral graph transducer (SGT) and training with a huge amount of structure-unknown sequences as well as structure-known sequences.

    Results: When the proposed method was evaluated on data that included 82 disordered proteins and 526 ordered proteins, its sensitivity was 0.723 and its specificity was 0.977. It resulted in a Matthews correlation coefficient 0.202 points higher than that obtained using FoldIndex, 0.221 points higher than that obtained using the method based on plotting hydrophobicity against the number of contacts, and 0.07 points higher than that obtained using support vector machines (SVMs). To examine robustness against training data sparseness, we investigated the correlation between two results obtained when the method was trained on different datasets and tested on the same dataset. The correlation coefficient for the proposed method is 0.14 higher than that for the method using SVMs.

    When the proposed SGT-based method was compared with four per-residue predictors (VL3, GlobPlot, DISOPRED2 and IUPred (long)), its sensitivity was 0.834 for disordered proteins, which is 0.052-0.523 higher than that of the per-residue predictors, and its specificity was 0.991 for ordered proteins, which is 0.036-0.153 higher than that of the per-residue predictors. The proposed method was also evaluated on data that included 417 partially disordered proteins. It predicted the frequency of disordered proteins to be 1.95% for the proteins with 5%-10% disordered sequences, 1.46% for the proteins with 10%-20% disordered sequences and 16.57% for the proteins with 20%-40% disordered sequences.

    Conclusion: The proposed method, which utilizes the information of structure-unknown data, predicts disordered proteins more accurately than other methods and is less affected by training data sparseness.

    Original language: English
    Article number: 78
    Journal: BMC Bioinformatics
    Volume: 8
    DOI: 10.1186/1471-2105-8-78
    Publication status: Published - 2007 Mar 6

    ASJC Scopus subject areas

    • Biochemistry
    • Molecular Biology
    • Computer Science Applications
    • Structural Biology
    • Applied Mathematics
    • Medicine (all)

    Cite this

    Predicting mostly disordered proteins by using structure-unknown protein data. / Shimizu, Kana; Muraoka, Yoichi; Hirose, Shuichi; Tomii, Kentaro; Noguchi, Tamotsu.

    In: BMC Bioinformatics, Vol. 8, 78, 06.03.2007.

    @article{dbca4f17956a4e78ba1ba1f949236683,
    title = "Predicting mostly disordered proteins by using structure-unknown protein data",
    author = "Kana Shimizu and Yoichi Muraoka and Shuichi Hirose and Kentaro Tomii and Tamotsu Noguchi",
    year = "2007",
    month = "3",
    day = "6",
    doi = "10.1186/1471-2105-8-78",
    language = "English",
    volume = "8",
    journal = "BMC Bioinformatics",
    issn = "1471-2105",
    publisher = "BioMed Central",

    }

    TY - JOUR

    T1 - Predicting mostly disordered proteins by using structure-unknown protein data

    AU - Shimizu, Kana

    AU - Muraoka, Yoichi

    AU - Hirose, Shuichi

    AU - Tomii, Kentaro

    AU - Noguchi, Tamotsu

    PY - 2007/3/6

    Y1 - 2007/3/6

    UR - http://www.scopus.com/inward/record.url?scp=34748854289&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=34748854289&partnerID=8YFLogxK

    U2 - 10.1186/1471-2105-8-78

    DO - 10.1186/1471-2105-8-78

    M3 - Article

    VL - 8

    JO - BMC Bioinformatics

    JF - BMC Bioinformatics

    SN - 1471-2105

    M1 - 78

    ER -