CatchPhish: detection of phishing websites by inspecting URLs

Routhu Srinivasa Rao, Tatti Vaishnavi, Alwyn Roshan Pais

Research output: Contribution to journal (Article)

3 Citations (Scopus)

Abstract

Many existing anti-phishing techniques use source code-based features and third-party services to detect phishing sites. These techniques have limitations, one of which is that they fail to handle drive-by downloads. Their reliance on third-party services for detecting phishing URLs also delays the classification process. Hence, in this paper, we propose CatchPhish, a lightweight application that predicts the legitimacy of a URL without visiting the website. The proposed technique extracts hostname features, full-URL features, Term Frequency-Inverse Document Frequency (TF-IDF) features, and phish-hinted words from the suspicious URL and classifies it with a random forest classifier. Using only TF-IDF features, the proposed model achieved an accuracy of 93.25% on our dataset. Combining TF-IDF with hand-crafted features achieved an accuracy of 94.26% on our dataset and accuracies of 98.25% and 97.49% on benchmark datasets, outperforming the existing baseline models.
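The feature pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the phish-hinted word list, the particular hostname and URL features, and the function names are all hypothetical choices made for illustration, and the TF-IDF weighting uses the plain `tf * ln(N / df)` form, which may differ from the variant used in the paper. The classifier step is omitted; the resulting features could be passed to any random forest implementation, for example scikit-learn's `RandomForestClassifier`.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical word list; the paper's actual phish-hinted vocabulary is not given here.
PHISH_HINTED_WORDS = {"login", "signin", "verify", "secure", "account", "update", "banking"}


def tokenize(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def handcrafted_features(url):
    """Illustrative hostname and full-URL features computed from the string alone."""
    hostname = urlparse(url).hostname or ""
    tokens = tokenize(url)
    return {
        "url_length": len(url),
        "hostname_length": len(hostname),
        "dots_in_hostname": hostname.count("."),
        "has_at_symbol": int("@" in url),
        "phish_hinted_count": sum(t in PHISH_HINTED_WORDS for t in tokens),
    }


def tfidf(urls):
    """Return one {token: weight} dict per URL, with weight = tf * ln(N / df)."""
    docs = [tokenize(u) for u in urls]
    # Document frequency: number of URLs in which each token appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    n = len(docs)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {tok: (c / len(doc)) * math.log(n / df[tok]) for tok, c in counts.items()}
        )
    return vectors


# Made-up example URLs; no network access is needed to compute the features.
urls = [
    "http://secure-login.example.com/account/verify",
    "https://www.example.org/docs/index.html",
]
features = handcrafted_features(urls[0])
weights = tfidf(urls)
```

Every feature here is computed from the URL string alone, which mirrors the abstract's point that the URL is classified without visiting the website or querying a third-party service.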

Original language: English
Journal: Journal of Ambient Intelligence and Humanized Computing
DOI: 10.1007/s12652-019-01311-4
Publication status: Accepted/In press - 01-01-2019
Externally published: Yes

Fingerprint

Websites
Classifiers
Experiments

All Science Journal Classification (ASJC) codes

  • Computer Science (all)

Cite this

Rao, Routhu Srinivasa; Vaishnavi, Tatti; Pais, Alwyn Roshan. / CatchPhish: detection of phishing websites by inspecting URLs. In: Journal of Ambient Intelligence and Humanized Computing. 2019.
@article{aff010ccca67440a98524ce78f3a6035,
title = "CatchPhish: detection of phishing websites by inspecting URLs",
author = "Rao, {Routhu Srinivasa} and Tatti Vaishnavi and Pais, {Alwyn Roshan}",
year = "2019",
month = "1",
day = "1",
doi = "10.1007/s12652-019-01311-4",
language = "English",
journal = "Journal of Ambient Intelligence and Humanized Computing",
issn = "1868-5137",
publisher = "Springer Verlag",

}

TY - JOUR

T1 - CatchPhish

T2 - detection of phishing websites by inspecting URLs

AU - Rao, Routhu Srinivasa

AU - Vaishnavi, Tatti

AU - Pais, Alwyn Roshan

PY - 2019/1/1

Y1 - 2019/1/1

UR - http://www.scopus.com/inward/record.url?scp=85065708499&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85065708499&partnerID=8YFLogxK

U2 - 10.1007/s12652-019-01311-4

DO - 10.1007/s12652-019-01311-4

M3 - Article

AN - SCOPUS:85065708499

JO - Journal of Ambient Intelligence and Humanized Computing

JF - Journal of Ambient Intelligence and Humanized Computing

SN - 1868-5137

ER -