2025
Python, Jupyter, ML, NLTK, SQLite, Pandas, Scikit-learn
AI-Powered Phishing Detection System using Machine Learning
Researched and trained an email classification model to identify phishing emails.
The Problem: The Ever-Evolving Phishing Threat
Phishing is no longer just about poorly written emails from princes. Attackers now use AI tools to craft perfect messages, deploy short-lived URLs, and use advanced techniques like HTML smuggling to bypass traditional security filters. Static, rule-based systems are struggling to keep up, leaving individuals and organisations vulnerable to data breaches, ransomware, and financial loss.
The Solution: A Proactive, ML-Driven Defense
I designed and built an end-to-end phishing detection system that uses machine learning to analyse email content and identify malicious intent. This is not just an academic model: it is a functional command-line tool that can connect to a real email inbox, scan messages, and log threats.
How I Built It: A Full Project Lifecycle
This project followed a complete development lifecycle, from initial research to a working prototype.
1. Research & Discovery
I began with a deep dive into the cybersecurity landscape, analysing the shortcomings of traditional methods (like DKIM and rule-based filters) and evaluating the strengths/weaknesses of various machine learning algorithms for text classification.
2. Data Engineering & Model Development
- Data Sourcing & Cleaning: Aggregated and cleaned multiple public datasets from Kaggle, resulting in a robust dataset for training.
- Feature Engineering: Built a custom text processing pipeline in Python to tokenize emails, remove URLs/stop words, and lemmatize text, transforming raw email content into features a model could understand.
- Comparative Model Analysis: I didn't just pick one model. I trained and evaluated five different algorithms to find the best performer:
  - Support Vector Machine (SVM)
  - Logistic Regression
  - Random Forest
  - Naïve Bayes
  - Decision Tree
- Hyperparameter Tuning: Used Grid Search to fine-tune the top-performing models, squeezing out every bit of performance.
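The data and modelling steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `preprocess` helper, the tiny stop-word set, the eight toy emails, and the parameter grids are all assumptions made for the example, and only two of the five algorithms are shown. Lemmatization is left as an optional hook so that, for instance, NLTK's `WordNetLemmatizer().lemmatize` could be passed in.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
TOKEN_RE = re.compile(r"[a-z]+")
# Tiny illustrative stop-word set (assumption); NLTK's English list is much larger.
STOP_WORDS = {"a", "an", "and", "at", "for", "in", "is", "of", "on", "the", "to", "with", "your"}


def preprocess(text, lemmatize=None):
    """Lowercase, strip URLs, tokenize, drop stop words, optionally lemmatize."""
    text = URL_RE.sub(" ", text.lower())
    tokens = [t for t in TOKEN_RE.findall(text) if t not in STOP_WORDS]
    if lemmatize is not None:  # hypothetical hook for a real lemmatizer
        tokens = [lemmatize(t) for t in tokens]
    return " ".join(tokens)


# Toy data, invented for this sketch: 1 = phishing, 0 = legitimate.
emails = [
    "Verify your account now at http://bad.example/login",
    "Urgent: your password expires, confirm at http://bad.example/reset",
    "You won a prize, claim the reward now",
    "Account suspended, verify password immediately",
    "Meeting notes attached for the project review",
    "Lunch on Friday with the team",
    "Project review moved to Monday",
    "Team notes from the Friday meeting",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
corpus = [preprocess(e) for e in emails]

# Two of the candidate models, each with a small grid over the regularisation strength C.
candidates = {
    "svm": (SVC(kernel="linear", class_weight="balanced"), {"clf__C": [0.1, 1, 10]}),
    "logreg": (LogisticRegression(class_weight="balanced", max_iter=1000), {"clf__C": [0.1, 1, 10]}),
}

results = {}
for name, (model, grid) in candidates.items():
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", model)])
    search = GridSearchCV(pipe, grid, cv=2, scoring="f1")  # 2-fold CV only because the toy set is tiny
    search.fit(corpus, labels)
    results[name] = (search.best_score_, search.best_params_)

for name, (score, params) in results.items():
    print(f"{name}: best F1={score:.2f} with {params}")
```

Keeping the vectorizer and classifier in one `Pipeline` means the TF-IDF vocabulary is refit inside each cross-validation fold, which avoids leaking test-fold vocabulary into training. Scores on a toy set this small say nothing about real performance.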
3. System Implementation & Architecture
The core of the project was integrating the best model into a practical system. The architecture features:
- Email Fetcher: A Python module using imaplib to securely connect to email providers (Gmail, Outlook, etc.) via SSL and retrieve emails in real time.
- Classification Engine: The processed email text is fed into the trained SVM model for instant prediction.
- Threat Logging: Classified phishing emails are automatically logged to a local SQLite database for review and analysis, creating an audit trail.
- CLI Interface: A clean command-line interface allows users to easily initiate scans and view results.
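A rough sketch of how the fetcher and threat log might fit together is shown below. The function names, the `threats` table schema, and the 0.97 score are hypothetical choices for illustration, not taken from the project. `fetch_unseen` shows the usual `imaplib` call sequence but is not executed here, since it needs a live mailbox and credentials; the demo at the bottom parses a hand-written message and logs it to an in-memory SQLite database instead.

```python
import email
import imaplib
import sqlite3
from email import policy


def fetch_unseen(host, user, app_password):
    """Yield raw RFC 822 bytes for each unseen INBOX message (requires network access)."""
    with imaplib.IMAP4_SSL(host) as conn:      # TLS-wrapped IMAP connection
        conn.login(user, app_password)         # an app password, not the account password
        conn.select("INBOX", readonly=True)    # read-only: scanning does not mark mail as seen
        _, data = conn.search(None, "UNSEEN")
        for num in data[0].split():
            _, parts = conn.fetch(num, "(RFC822)")
            yield parts[0][1]


def extract_text(raw_bytes):
    """Return (sender, subject, body text) from a raw message."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    body = msg.get_body(preferencelist=("plain", "html"))
    text = body.get_content() if body is not None else ""
    return str(msg["from"] or ""), str(msg["subject"] or ""), text


def log_threat(db, sender, subject, score):
    """Append one flagged email to the audit table, creating the table if needed."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS threats ("
        "id INTEGER PRIMARY KEY, "
        "detected_at TEXT DEFAULT CURRENT_TIMESTAMP, "
        "sender TEXT, subject TEXT, score REAL)"
    )
    db.execute(
        "INSERT INTO threats (sender, subject, score) VALUES (?, ?, ?)",
        (sender, subject, score),
    )
    db.commit()


# Offline demo with a hand-written message in place of a real fetch.
raw = b"From: a@example.com\r\nSubject: Verify your account\r\n\r\nClick the link.\r\n"
sender, subject, text = extract_text(raw)

db = sqlite3.connect(":memory:")
log_threat(db, sender, subject, 0.97)  # score is a made-up placeholder
rows = db.execute("SELECT sender, subject, score FROM threats").fetchall()
print(rows)
```

In a real run, the text returned by `extract_text` would go through the same preprocessing as the training data before being passed to the classifier, and only emails predicted as phishing would be logged.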
The Result: High-Stakes Performance
After rigorous testing and tuning, the Support Vector Machine (SVM) model emerged as the champion, achieving outstanding results:
| Metric | Score | What it means |
| --- | --- | --- |
| Accuracy | 98.7% | The model is correct 98.7% of the time. |
| Precision (Spam) | 97% | When it flags an email as spam, it's correct 97% of the time (low false positives). |
| Recall (Spam) | 98% | It successfully catches 98% of all actual spam emails (low false negatives). |
This means the system is both highly accurate and reliable, correctly identifying malicious emails while minimizing the chance of incorrectly flagging legitimate communication.
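For readers less familiar with these metrics, the short sketch below shows how each one is derived from confusion-matrix counts. The counts are invented purely to illustrate the arithmetic and are not the project's actual test results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 for the positive (phishing) class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of everything flagged, how much was truly phishing
    recall = tp / (tp + fn)      # of all real phishing, how much was caught
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 90 phishing caught, 10 legitimate emails wrongly flagged,
# 5 phishing missed, 895 legitimate emails correctly passed.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=895)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy alone can look strong when legitimate mail dominates the dataset, which is why precision and recall for the phishing class are reported alongside it.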
Key Challenges & Solutions
- Challenge: Real-world testing required email server access, which is protected by two-factor authentication (2FA).
  - Solution: Researched and implemented the use of App Passwords for secure, programmatic access without disabling 2FA.
- Challenge: Risk of model overfitting on a limited dataset.
  - Solution: Implemented class weighting and strategic data sourcing to create a more balanced dataset, improving the model's ability to generalize.
- Challenge: Saving and loading the complex trained model.
  - Solution: Leveraged the joblib library for efficient serialization and persistence of the Scikit-learn model.
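As an illustration of the persistence step, the sketch below saves and reloads a small Scikit-learn pipeline with joblib. The toy training data and the `phishing_svm.joblib` file name are assumptions for the example, and a temporary directory stands in for wherever the project actually stores its model.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy training data, invented for this sketch: 1 = phishing, 0 = legitimate.
texts = [
    "verify account password now",
    "claim prize reward now",
    "project meeting notes attached",
    "team lunch friday",
]
labels = [1, 1, 0, 0]

# Vectorizer and classifier are persisted together so the loaded model
# applies exactly the vocabulary it was trained with.
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", SVC(kernel="linear"))])
pipeline.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "phishing_svm.joblib")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions.
same = list(restored.predict(texts)) == list(pipeline.predict(texts))
print("round-trip predictions match:", same)
```

joblib files are pickle-based, so a saved model should only be loaded from a trusted source and ideally with the same Scikit-learn version that produced it.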
Skills Demonstrated
- Machine Learning: Model Selection, Hyperparameter Tuning (Grid Search), Feature Engineering, Evaluation Metrics (Accuracy, Precision, Recall, F1, Confusion Matrix)
- Programming & DevOps: Python, SQL, Git, Jupyter Notebooks
- Cybersecurity: Understanding of Phishing Tactics, Secure Development (SSL, App Passwords), System Hardening
- Software Engineering: System Architecture, Database Design (SQLite), API Integration (IMAP), CLI Development, Full Project Lifecycle