Segmentation-based Phishing URL Detection

Eint Sandi Aung, Hayato Yamana

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Uniform resource locators (URLs), used for referencing web pages, play a vital role in cyber fraud because of their complicated structure; phishers (attackers) employ evasive techniques to deceive users. Information extracted from URLs can therefore reveal significant and meaningful patterns that are essential for phishing detection. To enhance the accuracy of URL-based phishing detection, we need an accurate word segmentation technique that splits URLs correctly. However, in contrast to the traditional word segmentation techniques used in natural language processing (NLP), URL segmentation requires meticulous attention, because tokenization, the process of turning meaningless data into meaningful data, is not as easy to apply to URLs as it is in NLP. In our work, we concentrate on URL segmentation and propose a novel tokenization method, named URL-Tokenizer, which combines the BERT tokenizer and the WordSegment tokenizer and adopts character-level and word-level segmentation simultaneously. Our experimental evaluations in detecting phishing URLs show that our proposed method achieves a high accuracy of 95.7% with a balanced dataset and 97.7% with an imbalanced dataset, whereas baseline models achieved 85.4% with a balanced dataset and 85.1% with an imbalanced dataset.
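To make the idea of combining word-level and character-level segmentation more concrete, the following is a minimal illustrative sketch. It is not the authors' URL-Tokenizer and does not use the BERT or WordSegment tokenizers; instead it assumes a tiny hypothetical vocabulary (VOCAB) and a simple dictionary-based dynamic program as a stand-in for a real word segmenter. All function names here are assumptions made for illustration.

```python
import re

# Hypothetical toy vocabulary (an assumption for this sketch); a real system
# would rely on a large corpus-derived lexicon or a pretrained tokenizer.
VOCAB = {"http", "www", "com", "paypal", "pay", "pal",
         "secure", "login", "update", "account", "verify"}


def split_on_delimiters(url):
    """Split a URL into lowercase alphanumeric chunks at punctuation."""
    return [chunk for chunk in re.split(r"[^A-Za-z0-9]+", url.lower()) if chunk]


def segment_words(chunk, vocab=VOCAB):
    """Segment a delimiter-free chunk with a dictionary-based dynamic program.

    Each vocabulary word costs 1 and each fallback single character costs 2,
    so the cheapest split prefers few, known words over stray characters.
    """
    n = len(chunk)
    # best[i] holds (cost, tokens) for the prefix chunk[:i].
    best = [(0, [])] + [None] * n
    for i in range(1, n + 1):
        candidates = []
        for j in range(i):
            piece = chunk[j:i]
            if piece in vocab:
                cost = 1
            elif len(piece) == 1:
                cost = 2
            else:
                continue
            candidates.append((best[j][0] + cost, best[j][1] + [piece]))
        best[i] = min(candidates, key=lambda c: c[0])
    return best[n][1]


def tokenize_url(url):
    """Return (word-level tokens, character-level tokens) for one URL."""
    word_tokens = [tok for chunk in split_on_delimiters(url)
                   for tok in segment_words(chunk)]
    char_tokens = list(url)
    return word_tokens, char_tokens


if __name__ == "__main__":
    words, chars = tokenize_url("http://paypal-secure.com/loginupdate")
    print(words)
    print(chars[:10])
```

In this sketch the two views are simply returned side by side; how the word-level and character-level sequences are fed into a classifier is left open, since the abstract does not specify it.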

Original language: English
Title of host publication: Proceedings - 2021 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, WI-IAT 2021
Publisher: Association for Computing Machinery
Pages: 550-556
Number of pages: 7
ISBN (Electronic): 9781450391153
Publication status: Published - 2021 Dec 14
Event: 2021 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, WI-IAT 2021 - Virtual, Online, Australia
Duration: 2021 Dec 14 to 2021 Dec 17

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 2021 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, WI-IAT 2021
Country/Territory: Australia
City: Virtual, Online
Period: 21/12/14 to 21/12/17

Keywords

  • Information extraction
  • Phishing URL detection
  • Word segmentation

ASJC Scopus subject areas

  • Human-Computer Interaction
  • Computer Networks and Communications
  • Computer Vision and Pattern Recognition
  • Software
