2025

Python, Jupyter, ML, NLTK, SQLite, Pandas, Scikit-learn

AI-Powered Phishing Detection System using Machine Learning

Researched and trained an email classification model to identify phishing emails.

The Problem: The Ever-Evolving Phishing Threat

Phishing is no longer just about poorly written emails from princes. Attackers now use AI tools to craft perfect messages, deploy short-lived URLs, and use advanced techniques like HTML smuggling to bypass traditional security filters. Static, rule-based systems are struggling to keep up, leaving individuals and organisations vulnerable to data breaches, ransomware, and financial loss.

The Solution: A Proactive, ML-Driven Defense

I designed and built an end-to-end phishing detection system that uses machine learning to analyse email content and identify malicious intent. It is more than an academic model: it is a functional command-line tool that can connect to a real email inbox, scan messages, and log threats.

How I Built It: A Full Project Lifecycle

This project followed a complete development lifecycle, from initial research to a working prototype.

1. Research & Discovery

I began with a deep dive into the cybersecurity landscape, analysing the shortcomings of traditional methods (like DKIM and rule-based filters) and evaluating the strengths and weaknesses of various machine learning algorithms for text classification.

2. Data Engineering & Model Development

  • Data Sourcing & Cleaning: Aggregated and cleaned multiple public datasets from Kaggle, resulting in a robust dataset for training.
  • Feature Engineering: Built a custom text processing pipeline in Python to tokenize emails, remove URLs/stop words, and lemmatize text, transforming raw email content into features a model could understand.
  • Comparative Model Analysis: I didn't just pick one model. I trained and evaluated five different algorithms to find the best performer:
    • Support Vector Machine (SVM)
    • Logistic Regression
    • Random Forest
    • Naïve Bayes
    • Decision Tree
  • Hyperparameter Tuning: Used Grid Search to fine-tune the top-performing models, squeezing out every bit of performance.
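The steps in this list can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the toy emails, the tiny stop-word list, the `clean_text` helper, and the parameter grid are all assumptions made for the example. Lemmatization is left as an optional callable, so something like NLTK's `WordNetLemmatizer().lemmatize` could be passed in, and TF-IDF is assumed as the vectorizer because the write-up does not name one.

```python
import re

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
TOKEN_RE = re.compile(r"[a-z]+")
# Tiny illustrative stop-word list (assumption); a real run would likely use NLTK's.
STOP_WORDS = {"a", "an", "the", "to", "your", "is", "at", "for", "and", "of", "you"}


def clean_text(raw, lemmatize=None):
    """Lowercase, drop URLs and stop words, then optionally lemmatize each token."""
    text = URL_RE.sub(" ", raw.lower())
    tokens = [t for t in TOKEN_RE.findall(text) if t not in STOP_WORDS]
    if lemmatize is not None:
        tokens = [lemmatize(t) for t in tokens]
    return " ".join(tokens)


# Toy corpus for illustration only; the project used public Kaggle datasets.
PHISHING = [
    "Verify your account now at http://example.test/login",
    "Urgent: your password expires today, click to reset",
    "You won a prize, confirm your bank details",
    "Account suspended, verify identity immediately",
    "Click here to claim your refund now",
    "Security alert: confirm your password to avoid suspension",
]
LEGIT = [
    "Meeting moved to Thursday at ten",
    "Attached is the quarterly report for review",
    "Lunch tomorrow? Let me know what time works",
    "Thanks for the update on the project schedule",
    "Here are the notes from the team call",
    "Reminder: submit your timesheet by Friday",
]
texts = [clean_text(t) for t in PHISHING + LEGIT]
labels = [1] * len(PHISHING) + [0] * len(LEGIT)

# The five algorithm families named in the list above.
candidates = {
    "svm": SVC(kernel="linear"),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "naive_bayes": MultinomialNB(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Compare the candidates with 3-fold cross-validation on identical features.
scores = {}
for name, clf in candidates.items():
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    scores[name] = cross_val_score(pipeline, texts, labels, cv=3, scoring="accuracy").mean()

# Grid search over a small, hypothetical parameter grid for the SVM.
grid = GridSearchCV(
    make_pipeline(TfidfVectorizer(), SVC()),
    param_grid={"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]},
    cv=3,
    scoring="accuracy",
)
grid.fit(texts, labels)

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.2f}")
print("best SVM params:", grid.best_params_)
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary fitted only on each training fold, which avoids leaking test-fold words into the features.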

3. System Implementation & Architecture

The core of the project was integrating the best model into a practical system. The architecture features:

  • Email Fetcher: A Python module using imaplib to securely connect to email providers (Gmail, Outlook, etc.) via SSL and retrieve emails in real time.
  • Classification Engine: The processed email text is fed into the trained SVM model for instant prediction.
  • Threat Logging: Classified phishing emails are automatically logged to a local SQLite database for review and analysis, creating an audit trail.
  • CLI Interface: A clean command-line interface allows users to easily initiate scans and view results.
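A rough sketch of how these four components could fit together using only the standard library is shown below. The function names, the `threats` table schema, the `phishscan` program name, and the CLI flags are hypothetical and chosen for illustration; the `classify` argument stands in for the trained model's prediction function.

```python
import argparse
import email
import imaplib
import sqlite3
from email.header import decode_header, make_header


def fetch_unseen(host, user, app_password, mailbox="INBOX"):
    """Return unread messages from an IMAP mailbox over SSL."""
    messages = []
    with imaplib.IMAP4_SSL(host) as conn:
        conn.login(user, app_password)
        conn.select(mailbox, readonly=True)  # read-only: do not mark mail as seen
        _, data = conn.search(None, "UNSEEN")
        for num in data[0].split():
            _, parts = conn.fetch(num, "(RFC822)")
            messages.append(email.message_from_bytes(parts[0][1]))
    return messages


def plain_text(msg):
    """Extract the first text/plain part of a message as a string."""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return ""


def open_log(path="threats.db"):
    """Open the SQLite log, creating the (hypothetical) threats table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS threats ("
        "id INTEGER PRIMARY KEY, sender TEXT, subject TEXT, "
        "detected_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def log_threat(conn, msg):
    """Record the sender and decoded subject of a flagged message."""
    subject = str(make_header(decode_header(msg.get("Subject", ""))))
    conn.execute(
        "INSERT INTO threats (sender, subject) VALUES (?, ?)",
        (msg.get("From", ""), subject),
    )
    conn.commit()


def scan(messages, classify, conn):
    """Classify each message body and log those flagged as phishing."""
    flagged = 0
    for msg in messages:
        if classify(plain_text(msg)):
            log_threat(conn, msg)
            flagged += 1
    return flagged


def build_parser():
    """Command-line options for a scan run (flag names are illustrative)."""
    parser = argparse.ArgumentParser(prog="phishscan")
    parser.add_argument("--host", default="imap.gmail.com")
    parser.add_argument("--user", required=True)
    parser.add_argument("--db", default="threats.db")
    return parser
```

A scan run would then parse the arguments, call `fetch_unseen`, and pass the messages to `scan` along with the model's prediction function. Keeping `classify` as a plain callable means the fetching and logging code can be exercised without a trained model or a live mailbox.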

The Result: High-Stakes Performance

After rigorous testing and tuning, the Support Vector Machine (SVM) model emerged as the champion, achieving outstanding results:

Metric           | Score | What it means
Accuracy         | 98.7% | The model is correct 98.7% of the time.
Precision (Spam) | 97%   | When it flags an email as spam, it's correct 97% of the time (low false positives).
Recall (Spam)    | 98%   | It successfully catches 98% of all actual spam emails (low false negatives).

This means the system is both highly accurate and reliable, correctly identifying malicious emails while minimising the chance of incorrectly flagging legitimate communication.
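For readers less familiar with these metrics, the sketch below shows how accuracy, precision, recall, and F1 are derived from confusion-matrix counts. The counts are made up purely for illustration and are not the project's results; the function name is also an assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive standard metrics from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # of flagged emails, how many were truly malicious
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # of malicious emails, how many were caught
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only; NOT taken from the project's evaluation.
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=870)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision falls as false positives rise (legitimate mail flagged), while recall falls as false negatives rise (phishing missed), which is why the two are reported separately alongside accuracy.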

Key Challenges & Solutions

  • Challenge: Real-world testing required email server access, which is protected by 2-Factor Authentication.
    • Solution: Researched and implemented the use of App Passwords for secure, programmatic access without disabling 2FA.
  • Challenge: Risk of model overfitting on a limited, imbalanced dataset.
    • Solution: Implemented class weighting and strategic data sourcing to create a more balanced dataset, improving the model's ability to generalize.
  • Challenge: Saving and loading the complex trained model.
    • Solution: Leveraged the joblib library for efficient serialization and persistence of the Scikit-learn model.
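The class weighting and joblib persistence mentioned above might look something like the following sketch. The toy data, the file name, and the choice to persist the vectorizer together with the classifier in one pipeline are assumptions for illustration; the write-up only states that the Scikit-learn model was serialized with joblib.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Deliberately imbalanced toy data (illustrative only): 2 phishing, 6 legitimate.
texts = [
    "verify your password now",
    "confirm your bank account immediately",
    "meeting moved to thursday",
    "quarterly report attached",
    "lunch tomorrow at noon",
    "notes from the team call",
    "timesheet reminder for friday",
    "project schedule update",
]
labels = [1, 1, 0, 0, 0, 0, 0, 0]

# class_weight="balanced" reweights classes inversely to their frequency,
# so the minority (phishing) class is not drowned out during training.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear", class_weight="balanced"))
model.fit(texts, labels)

# Persist and restore the fitted pipeline; a temp directory keeps the demo tidy.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "phishing_svm.joblib")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

sample = ["please verify your password"]
same_prediction = list(model.predict(sample)) == list(restored.predict(sample))
print("restored model agrees with original:", same_prediction)
```

Saving the whole pipeline rather than the classifier alone keeps the fitted vocabulary with the model, so the restored object can score raw text directly. Files produced by joblib should only be loaded from trusted sources, since loading can execute arbitrary code.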

Skills Demonstrated

  • Machine Learning: Model Selection, Hyperparameter Tuning (Grid Search), Feature Engineering, Evaluation Metrics (Accuracy, Precision, Recall, F1, Confusion Matrix)
  • Programming & DevOps: Python, SQL, Git, Jupyter Notebooks
  • Cybersecurity: Understanding of Phishing Tactics, Secure Development (SSL, App Passwords), System Hardening
  • Software Engineering: System Architecture, Database Design (SQLite), API Integration (IMAP), CLI Development, Full Project Lifecycle
