Machine Learning Spam Filter

Machine Learning Spam Filter in Python

Visit pythonforbiginners.com to discover Python tutorials.

About the project: This is a Python project that covers the fundamental steps in Natural Language Processing (NLP) and machine learning.

The project uses a single file and no external datasets. The code trains a Multinomial Naive Bayes (MNB) classifier on a small, in-memory dataset of example SMS messages. MNB is a fast, commonly used baseline for text classification tasks such as spam filtering.

Project Level: Advanced

Installation Prerequisite

You will need to install the scikit-learn library, which provides the machine learning tools:


  pip install scikit-learn

Below is the complete, runnable Python script:


from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score
from sklearn.pipeline import make_pipeline

# --- 1. Simulated Dataset (In-Memory) ---
# 0 = Ham (Legitimate), 1 = Spam
DATA = [
    ("Hey, are we still meeting for lunch tomorrow at 1?", 0),
    ("URGENT! You have won $1000! Click here now to claim your prize.", 1),
    ("Please call me back when you get this message.", 0),
    ("Free entry in weekly competition to win a 2024 iPad! Text WIN to 8009.", 1),
    ("What time does the movie start tonight?", 0),
    ("Congratulations! You are selected for a secret shopper role. Reply YES.", 1),
    ("Just checking in to see how you are doing.", 0),
    ("Reply to this message for a free ringtone download! Limited time offer.", 1),
    ("I'll send the documents over by the end of the day.", 0),
    ("Your bank account has been suspended. Log in immediately via the link.", 1),
    ("Can you pick up milk on your way home?", 0),
    ("This is the last chance to claim your reward before it expires! Call 0800-456-789.", 1),
    ("Meeting agenda is attached. Please review before 3 PM.", 0),
    ("We appreciate your business. Thank you for your continued support.", 0),
    ("New iPhone 16 Pro Max special price just for you! Visit our website.", 1),
    ("See you later!", 0),
    ("Important security alert regarding your online password.", 1),
]

# Separate features (text) and labels (0/1)
texts = [item[0] for item in DATA]
labels = [item[1] for item in DATA]

# --- 2. Data Splitting ---

# Split the data into training (about 80%) and testing (about 20%) sets.
# random_state makes the split reproducible; stratify keeps both classes
# represented in the small test set (4 of the 17 messages).
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.20, random_state=42, stratify=labels
)

# --- 3. Model Pipeline Setup ---

# We create a pipeline that combines two steps:
# 1. CountVectorizer: Converts text into a matrix of token counts (Word features).
# 2. MultinomialNB: The classification algorithm (Naive Bayes).
model = make_pipeline(
    CountVectorizer(), 
    MultinomialNB()
)

# --- 4. Model Training ---

print("--- Training Machine Learning Model (Multinomial Naive Bayes) ---")
# The pipeline handles the vectorization of X_train and then trains the MNB model
model.fit(X_train, y_train)
print("Training complete.")

# --- 5. Model Evaluation ---

# Predict labels for the test set
y_pred = model.predict(X_test)

print("\n--- Model Evaluation on Test Data ---")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print("\nClassification Report:")
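# labels=[0, 1] keeps the report valid even if a class is missing from the test set
print(classification_report(
    y_test, y_pred, labels=[0, 1],
    target_names=['Ham (0)', 'Spam (1)'], zero_division=0
))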

# --- 6. Prediction Function for New Messages ---

def predict_message(message: str, trained_model):
    """
    Takes a new message and predicts whether it is Spam (1) or Ham (0).
    """
    # The model expects a list of strings, even for a single message
    prediction = trained_model.predict([message])[0]

    # predict_proba columns follow trained_model.classes_, so look up the
    # column of the predicted label instead of assuming label == column index
    probabilities = trained_model.predict_proba([message])[0]
    class_index = list(trained_model.classes_).index(prediction)

    result = "SPAM" if prediction == 1 else "HAM"
    certainty = probabilities[class_index] * 100
    
    print(f"\n[TEST] Message: '{message}'")
    print(f"-> Prediction: {result}")
    print(f"-> Certainty: {certainty:.2f}%")
    
    return result

# --- 7. Live Demonstration ---

print("\n" + "="*50)
print("--- LIVE SPAM FILTER DEMONSTRATION ---")
print("="*50)

# Test case 1: A legitimate message (should be HAM)
predict_message(
    "Did you remember to send the final presentation files?",
    model
)

# Test case 2: A suspicious, spam-like message (should be SPAM)
predict_message(
    "You have a new payment notification! Click the link to prevent account closure.",
    model
)

# Test case 3: A simple, legitimate message (should be HAM)
predict_message(
    "I'm running five minutes late, sorry!",
    model
)

# Test case 4: A clear spam message (should be SPAM)
predict_message(
    "WIN $5000 CASH prize NOW! Text PRIZE to 4444 to enter.",
    model
)

print("\n" + "="*50)
print("End of Demonstration.")


How the Code Works

Dataset (DATA): We define a simple list of messages, manually labeled as Ham (0) or Spam (1).

Data Splitting: The train_test_split function reserves 20% of the data for testing, ensuring we evaluate the model on messages it hasn't seen during training.
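
With a dataset this small, a purely random split can leave the test set with very few messages of one class. The sketch below uses hypothetical placeholder messages (not the DATA list) to illustrate how the stratify argument of train_test_split keeps the class ratio the same in both parts. The variable names are illustrative assumptions.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 ham (0) and 10 spam (1) placeholder messages
toy_texts = [f"message {i}" for i in range(20)]
toy_labels = [0] * 10 + [1] * 10

# stratify=toy_labels asks for the same class proportions in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=0.20, random_state=42, stratify=toy_labels
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print("Train class counts:", dict(train_counts))
print("Test class counts:", dict(test_counts))
```

Here the 4 test messages contain 2 ham and 2 spam, and the 16 training messages contain 8 of each.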

Model Pipeline (make_pipeline):
- CountVectorizer: This is the core NLP preprocessing step. It transforms raw text into numerical features by counting word occurrences. This output is a bag-of-words representation, which is what the ML model needs.
- MultinomialNB: This is the classification algorithm. It uses probability to determine the likelihood of a text belonging to the "Ham" or "Spam" category based on the word counts provided by the vectorizer.

Training and Evaluation: The model is trained, and its performance is printed using the accuracy_score and a detailed classification_report.
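
The numbers in the classification report come from counting correct and incorrect predictions per class. The following sketch uses hypothetical labels and predictions (not output from the script) to show how precision and recall for the spam class relate to a confusion matrix.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions (0 = Ham, 1 = Spam)
true_labels = [0, 0, 0, 1, 1, 1, 1, 0]
predicted = [0, 0, 1, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(true_labels, predicted, labels=[0, 1])
tn, fp, fn, tp = matrix.ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# Precision: of the messages flagged as spam, how many really were spam
precision = precision_score(true_labels, predicted)
# Recall: of the real spam messages, how many were caught
recall = recall_score(true_labels, predicted)
print(f"Spam precision: {precision:.2f}, recall: {recall:.2f}")
```

In this toy case one ham message is wrongly flagged and one spam message is missed, so both precision and recall for spam are 3/4 = 0.75.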

predict_message Function: This utility allows you to pass new, arbitrary strings to the fully trained pipeline to see its real-time prediction.
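
The certainty value comes from predict_proba, which returns one probability per class. The sketch below trains a separate pipeline on a tiny made-up dataset (demo_model, train_texts and train_labels are illustrative names) and prints the probability for each class. With so little data the probabilities are rough estimates rather than reliable confidence scores.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set (0 = Ham, 1 = Spam)
train_texts = [
    "win a free prize now",
    "claim your free reward",
    "see you at lunch",
    "call me after the meeting",
]
train_labels = [1, 1, 0, 0]

demo_model = make_pipeline(CountVectorizer(), MultinomialNB())
demo_model.fit(train_texts, train_labels)

# Columns of predict_proba follow the order of classes_
classes = list(demo_model.classes_)
probabilities = demo_model.predict_proba(["free prize inside"])[0]

for label, probability in zip(classes, probabilities):
    print(f"class {label}: {probability:.2f}")
```

The word "inside" never appears in the training texts, so the vectorizer ignores it and the prediction rests on "free" and "prize" alone.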

👉 Download this code and experiment by adding your own messages to DATA or changing the test split size.

You can also try it with external datasets.
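
As a starting point for external data, here is a minimal sketch that reads labeled messages into the two lists the script expects. It assumes a tab-separated layout with "ham" or "spam" in the first column and the message in the second. That layout, the LABEL_TO_INT mapping, and the in-memory stand-in for a file are all assumptions, so adjust them to match the dataset you actually use.

```python
import csv
import io

# Stand-in for an external file; a real dataset would be opened with open(path).
# Assumed layout: one message per line, "ham" or "spam", a tab, then the text.
raw_file = io.StringIO(
    "ham\tAre we still on for lunch?\n"
    "spam\tWIN a FREE prize! Reply YES now.\n"
    "ham\tCall me when you get home.\n"
)

# Hypothetical mapping from text labels to the 0/1 labels used in the script
LABEL_TO_INT = {"ham": 0, "spam": 1}

loaded_texts = []
loaded_labels = []
reader = csv.reader(raw_file, delimiter="\t", quoting=csv.QUOTE_NONE)
for row in reader:
    if len(row) != 2 or row[0] not in LABEL_TO_INT:
        continue  # skip malformed lines
    loaded_labels.append(LABEL_TO_INT[row[0]])
    loaded_texts.append(row[1])

print(f"Loaded {len(loaded_texts)} messages")
```

The resulting loaded_texts and loaded_labels lists can be passed to train_test_split in place of texts and labels.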
