Language Detection App Project code in Python


Language Detection App in Python

About the project: This project builds a language detection tool in Python. It demonstrates fundamental techniques in Natural Language Processing (NLP), specifically the use of statistical features (such as character N-grams) to classify text.

For this project, we'll use the powerful scikit-learn library, combining a feature extraction method (CountVectorizer with character N-grams) and a classifier (LogisticRegression).

The whole project lives in a single file, and the code uses a small, simulated multi-language dataset.


Project Level: Advanced

Installation Prerequisite

You will need to install the scikit-learn library, which provides the machine learning tools:


  pip install scikit-learn
  



Below is the complete, runnable Python script:


from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from sklearn.pipeline import make_pipeline

# --- 1. Simulated Multi-Language Dataset ---
# This dataset is small, but the power of character N-grams allows a 
# simple model to learn unique language patterns quickly.
DATA = [
    # English (en)
    ("The quick brown fox jumps over the lazy dog.", "en"),
    ("I love to learn Python programming.", "en"),
    ("Hello, how are you doing today?", "en"),
    ("This is a simple text for testing purposes.", "en"),
    
    # French (fr)
    ("Le renard brun rapide saute par-dessus le chien paresseux.", "fr"),
    ("J'aime apprendre la programmation Python.", "fr"),
    ("Bonjour, comment allez-vous aujourd'hui?", "fr"),
    ("Ceci est un simple texte à des fins de test.", "fr"),
    
    # German (de)
    ("Der schnelle braune Fuchs springt über den faulen Hund.", "de"),
    ("Ich lerne gerne Python-Programmierung.", "de"),
    ("Hallo, wie geht es dir heute?", "de"),
    ("Dies ist ein einfacher Text zu Testzwecken.", "de"),
    
    # Spanish (es)
    ("El rápido zorro marrón salta sobre el perro perezoso.", "es"),
    ("Me encanta aprender la programación en Python.", "es"),
    ("Hola, ¿cómo estás hoy?", "es"),
    ("Este es un texto simple para fines de prueba.", "es"),
    
    # Italian (it)
    ("La volpe marrone veloce salta oltre il cane pigro.", "it"),
    ("Amo imparare la programmazione Python.", "it"),
    ("Ciao, come stai oggi?", "it"),
    ("Questo è un semplice testo a scopo di test.", "it"),
]

# Separate features (text) and labels (language code)
texts = [item[0] for item in DATA]
labels = [item[1] for item in DATA]

# --- 2. Data Splitting ---

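# Split the data into training (80%, 16 sentences) and testing (20%, 4 sentences) sets.
# With so few test sentences, the reported accuracy is only a rough sanity check.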
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.20, random_state=42
)

# --- 3. Model Pipeline Setup ---

# Character N-grams are short sequences of characters (like 'th', 'ing', 'sch').
# Their frequencies differ strongly between languages, which makes them effective
# features for language detection, even on small datasets.
# We use N-grams of 2 to 4 characters (ngram_range=(2, 4)).
# Logistic Regression is a strong baseline classifier for this type of problem.
model = make_pipeline(
    CountVectorizer(analyzer='char', ngram_range=(2, 4)), 
    LogisticRegression(max_iter=1000) # Increased max_iter for stability
)

# --- 4. Model Training ---

print("--- Training Machine Learning Model (Logistic Regression with Character N-grams) ---")
# The pipeline handles the vectorization of X_train and then trains the LR model
model.fit(X_train, y_train)
print(f"Training complete. Model trained on {len(X_train)} examples.")

# --- 5. Model Evaluation ---

# Predict labels for the test set
y_pred = model.predict(X_test)

print("\n--- Model Evaluation on Test Data ---")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# --- 6. Prediction Function for New Text ---

def detect_language(text: str, trained_model):
    """
    Takes a new text string and predicts its language.
    """
    # The model expects a list of strings, even for a single message
    prediction = trained_model.predict([text])[0]
    
    # Use the model's prediction probability for confidence
    probabilities = trained_model.predict_proba([text])[0]
    
    # The highest class probability is the confidence score
    # for the predicted language
    confidence = max(probabilities) * 100
    
    # Show at most 50 characters, adding an ellipsis only when the text is cut
    preview = text if len(text) <= 50 else text[:50] + "..."
    print(f"\n[TEXT] \"{preview}\"")
    print(f"-> Detected Language: {prediction.upper()}")
    print(f"-> Confidence: {confidence:.2f}%")
    
    return prediction

# --- 7. Live Demonstration ---

print("\n" + "="*70)
print("--- LIVE LANGUAGE DETECTION DEMONSTRATION ---")
print("="*70)

# Test case 1: Spanish
detect_language(
    "¿Podrías enviarme un mensaje de texto cuando llegues a la casa?",
    model
)

# Test case 2: German
detect_language(
    "Entschuldigung, können Sie mir bitte helfen, den Bahnhof zu finden?",
    model
)

# Test case 3: English
detect_language(
    "The library is closed today due to the national holiday.",
    model
)

# Test case 4: French
detect_language(
    "Je suis très heureux de participer à votre fête d'anniversaire.",
    model
)

# Test case 5: Italian
detect_language(
    "Che bella giornata per fare una passeggiata in campagna!",
    model
)

print("\n" + "="*70)
print("End of Demonstration.")

How the Code Works

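This code provides a compact, working implementation of the kind of pipeline used in real-world language detection. It can separate these five languages with very little training data because their statistical patterns (the character n-grams) are quite distinct. Keep in mind that the test split contains only four sentences, so the reported accuracy is a rough indication rather than a reliable benchmark.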
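Because a four-sentence test split can swing from 0% to 100% on a single mistake, a cross-validated score gives a steadier picture. The sketch below is one possible way to do that, not part of the original script: `SAMPLES` and `cross_validated_accuracy` are illustrative names, and the data is a three-language subset of the script's sentences so the example stays short.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Subset of the script's dataset: three languages, four sentences each
SAMPLES = [
    ("The quick brown fox jumps over the lazy dog.", "en"),
    ("I love to learn Python programming.", "en"),
    ("Hello, how are you doing today?", "en"),
    ("This is a simple text for testing purposes.", "en"),
    ("Le renard brun rapide saute par-dessus le chien paresseux.", "fr"),
    ("J'aime apprendre la programmation Python.", "fr"),
    ("Bonjour, comment allez-vous aujourd'hui?", "fr"),
    ("Ceci est un simple texte à des fins de test.", "fr"),
    ("Der schnelle braune Fuchs springt über den faulen Hund.", "de"),
    ("Ich lerne gerne Python-Programmierung.", "de"),
    ("Hallo, wie geht es dir heute?", "de"),
    ("Dies ist ein einfacher Text zu Testzwecken.", "de"),
]


def cross_validated_accuracy(samples, n_splits=4):
    """Return one accuracy score per stratified fold."""
    texts = [text for text, _ in samples]
    labels = [label for _, label in samples]
    pipeline = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(2, 4)),
        LogisticRegression(max_iter=1000),
    )
    # Stratification keeps every language present in each test fold
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    return cross_val_score(pipeline, texts, labels, cv=folds, scoring="accuracy")


scores = cross_validated_accuracy(SAMPLES)
print(f"Fold accuracies: {scores}")
print(f"Mean accuracy: {scores.mean():.2f}")
```

With four sentences per language, `n_splits=4` puts exactly one sentence of each language in every test fold; a larger dataset would allow more folds.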

👉 Download this code and experiment by adding more sentences or more languages to the dataset.

You can also test it with a sentence of your own!
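When trying your own sentences, it can help to see the probability for every language rather than only the winner. The following is a minimal sketch under a few assumptions: `language_probabilities`, `demo_model`, and `TRAIN` are hypothetical names, and the tiny two-language model exists only to keep the example self-contained. With the full script you would pass its trained `model` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny stand-in training set (sentences taken from the script's DATA list)
TRAIN = [
    ("The quick brown fox jumps over the lazy dog.", "en"),
    ("Hello, how are you doing today?", "en"),
    ("El rápido zorro marrón salta sobre el perro perezoso.", "es"),
    ("Hola, ¿cómo estás hoy?", "es"),
]

demo_model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
demo_model.fit([text for text, _ in TRAIN], [label for _, label in TRAIN])


def language_probabilities(text, trained_model):
    """Return (language, probability) pairs, most likely language first."""
    probabilities = trained_model.predict_proba([text])[0]
    # classes_ lists the labels in the same order as the probability columns
    pairs = [
        (str(lang), float(prob))
        for lang, prob in zip(trained_model.classes_, probabilities)
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


for lang, prob in language_probabilities("Where is the nearest train station?", demo_model):
    print(f"{lang}: {prob:.1%}")
```

A near-even split between two languages is a hint that the sentence is too short or too ambiguous for the model to decide reliably.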

