TY - GEN
T1 - Feature selection based on physicochemical properties of redefined N-term region and C-term regions for predicting disorder
AU - Shimizu, Kana
AU - Hirose, Shuichi
AU - Muraoka, Yoichi
AU - Noguchi, Tamotsu
PY - 2005/1/1
Y1 - 2005/1/1
N2 - The prediction of intrinsic disorder from amino acid sequence has been gaining increasing attention because disordered regions have come to be known as important for protein function. The most common way of predicting disorder is based on binary classification with machine learning. Since amino acid composition has different propensities in the N-term, C-term, and internal regions, prediction accuracy increases when the training data are divided into these three regions and each is predicted separately. However, previous work has not discussed a concrete definition of the N-term and C-term regions and has only used a heuristic length from the terminus. Other previous work has shown that general physicochemical properties rather than specific amino acids are important factors contributing to disorder, and that a reduced amino acid alphabet can maintain excellent precision in predicting disorder. In this paper, we redefine a suitable length and position for the N-term and C-term regions for predicting disorder. Moreover, we show that each region has different physicochemical properties, which are important factors contributing to disorder. We also suggest a region-specific reduced set of amino acids and a modified PSSM based on that set for predicting disorder. We implemented our method and (1) compared it with the conventional division method and (2) compared our feature selection with the use of all physicochemical features, on the CASP6 benchmark, a PDB dataset, and DisProt. The results support the effectiveness of the new data separation method and indicate that each region has different physicochemical properties that are important factors for predicting protein disorder.
AB - The prediction of intrinsic disorder from amino acid sequence has been gaining increasing attention because disordered regions have come to be known as important for protein function. The most common way of predicting disorder is based on binary classification with machine learning. Since amino acid composition has different propensities in the N-term, C-term, and internal regions, prediction accuracy increases when the training data are divided into these three regions and each is predicted separately. However, previous work has not discussed a concrete definition of the N-term and C-term regions and has only used a heuristic length from the terminus. Other previous work has shown that general physicochemical properties rather than specific amino acids are important factors contributing to disorder, and that a reduced amino acid alphabet can maintain excellent precision in predicting disorder. In this paper, we redefine a suitable length and position for the N-term and C-term regions for predicting disorder. Moreover, we show that each region has different physicochemical properties, which are important factors contributing to disorder. We also suggest a region-specific reduced set of amino acids and a modified PSSM based on that set for predicting disorder. We implemented our method and (1) compared it with the conventional division method and (2) compared our feature selection with the use of all physicochemical features, on the CASP6 benchmark, a PDB dataset, and DisProt. The results support the effectiveness of the new data separation method and indicate that each region has different physicochemical properties that are important factors for predicting protein disorder.
KW - C-term region
KW - Intrinsic disorder
KW - N-term region
KW - PSSM
KW - Physicochemical property
UR - http://www.scopus.com/inward/record.url?scp=33847206446&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33847206446&partnerID=8YFLogxK
U2 - 10.1109/cibcb.2005.1594927
DO - 10.1109/cibcb.2005.1594927
M3 - Conference contribution
AN - SCOPUS:33847206446
SN - 0780393872
SN - 9780780393875
T3 - Proceedings of the 2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, CIBCB '05
BT - Proceedings of the 2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, CIBCB '05
PB - IEEE Computer Society
T2 - 2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, CIBCB '05
Y2 - 14 November 2005 through 15 November 2005
ER -