Integrating RoBERTa Fine-Tuning and User Writing Styles for Authorship Attribution of Short Texts

Xiangyu Wang, Mizuho Iwaihara*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

Authorship attribution (AA) is a fundamental branch of text classification that aims to identify the authors of given texts. Authorship attribution of short texts, however, faces several challenges, including limited text length, feature sparsity, and the non-standard use of casual words. Recent studies have shown that deep learning methods can greatly improve the accuracy of AA tasks, but they still represent user posts using a set of predefined features (e.g., word n-grams and character n-grams) and adopt text classification methods to solve the task. In this paper, we propose a hybrid model for authorship attribution of short texts. The first part is a pretrained language model based on RoBERTa that produces post representations aware of tweet-related stylistic features and their contextualities. The second part is a CNN model built on a number of feature embeddings to represent users' writing styles. Finally, we combine these representations for the final AA classification. Experimental results on tweets show that our model achieves state-of-the-art results on a known tweet AA dataset.
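As an illustration of the predefined features mentioned in the abstract, the following is a minimal sketch of extracting character n-grams and word n-grams from a short post. It is not the authors' implementation: the function names `char_ngrams` and `word_ngrams`, the whitespace tokenization, and the sample post are assumptions made only for this example.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> list[str]:
    """Return all overlapping character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    """Return all overlapping word n-grams, splitting on whitespace (an assumption)."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical short post, used only to illustrate the feature types.
post = "gonna b late lol"

# Character bigram frequencies, a typical sparse stylistic feature.
char_counts = Counter(char_ngrams(post, 2))

# Word bigrams in order of appearance.
word_bigrams = word_ngrams(post, 2)
```

On posts this short, most n-grams occur only once, which is one way to see the feature sparsity the abstract refers to.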

Original language: English
Title of host publication: Web and Big Data - 5th International Joint Conference, APWeb-WAIM 2021, Proceedings
Editors: Leong Hou U, Marc Spaniol, Yasushi Sakurai, Junying Chen
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 413-421
Number of pages: 9
ISBN (Print): 9783030858957
Publication status: Published - 2021
Event: 5th International Joint Conference on Asia-Pacific Web and Web-Age Information Management, APWeb-WAIM 2021 - Guangzhou, China
Duration: 2021 Aug 23 to 2021 Aug 25

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12858 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 5th International Joint Conference on Asia-Pacific Web and Web-Age Information Management, APWeb-WAIM 2021
Country/Territory: China
City: Guangzhou
Period: 21/8/23 to 21/8/25

Keywords

  • Authorship attribution
  • Pre-trained language model
  • RoBERTa
  • Salient writing styles
  • Short text classification
  • Social network contents

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)
