TY - GEN
T1 - Integrating RoBERTa Fine-Tuning and User Writing Styles for Authorship Attribution of Short Texts
AU - Wang, Xiangyu
AU - Iwaihara, Mizuho
N1 - Publisher Copyright: © 2021, Springer Nature Switzerland AG.
PY - 2021
Y1 - 2021
N2 - Authorship Attribution (AA) is a fundamental branch of text classification that aims to identify the authors of given texts. However, authorship attribution of short texts faces many challenges, such as limited text length, feature sparsity, and non-standardized casual wording. Recent studies have shown that deep learning methods can greatly improve the accuracy of AA tasks; however, they still represent user posts using a set of predefined features (e.g., word n-grams and character n-grams) and adopt text classification methods to solve this task. In this paper, we propose a hybrid model for authorship attribution of short texts. The first part is a pretrained language model based on RoBERTa that produces post representations aware of tweet-related stylistic features and their contextualities. The second part is a CNN model built on a number of feature embeddings to represent users' writing styles. Finally, we combine these representations for final AA classification. Our experimental results show that our model achieves state-of-the-art results on a known tweet AA dataset.
AB - Authorship Attribution (AA) is a fundamental branch of text classification that aims to identify the authors of given texts. However, authorship attribution of short texts faces many challenges, such as limited text length, feature sparsity, and non-standardized casual wording. Recent studies have shown that deep learning methods can greatly improve the accuracy of AA tasks; however, they still represent user posts using a set of predefined features (e.g., word n-grams and character n-grams) and adopt text classification methods to solve this task. In this paper, we propose a hybrid model for authorship attribution of short texts. The first part is a pretrained language model based on RoBERTa that produces post representations aware of tweet-related stylistic features and their contextualities. The second part is a CNN model built on a number of feature embeddings to represent users' writing styles. Finally, we combine these representations for final AA classification. Our experimental results show that our model achieves state-of-the-art results on a known tweet AA dataset.
KW - Authorship attribution
KW - Pre-trained language model
KW - RoBERTa
KW - Salient writing styles
KW - Short text classification
KW - Social network contents
UR - http://www.scopus.com/inward/record.url?scp=85115181973&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85115181973&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-85896-4_32
DO - 10.1007/978-3-030-85896-4_32
M3 - Conference contribution
AN - SCOPUS:85115181973
SN - 9783030858957
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 413
EP - 421
BT - Web and Big Data - 5th International Joint Conference, APWeb-WAIM 2021, Proceedings
A2 - U, Leong Hou
A2 - Spaniol, Marc
A2 - Sakurai, Yasushi
A2 - Chen, Junying
PB - Springer Science and Business Media Deutschland GmbH
T2 - 5th International Joint Conference on Asia-Pacific Web and Web-Age Information Management, APWeb-WAIM 2021
Y2 - 23 August 2021 through 25 August 2021
ER -