Text Prediction System – A Statistical Language Model in Scala 3

Associated Link:

https://github.com/Daniil-Hirchyts/prediction-app-scala

February 25, 2025


Introduction

This text prediction system is a comprehensive educational project I developed in Scala 3 as part of the Développement Avancé course in my third year of BUT Informatique at the IUT of Montpellier. The application employs statistical models based on n-grams to analyze text patterns and generate word predictions in real time. This technical document outlines the architecture, algorithms, implementation challenges, and potential future improvements for this natural language processing project.

Project Scope and Technical Foundation

Project Objectives

The text prediction project was designed with several clear objectives:

  1. Offer context-aware word suggestions in real time as the user types.

  2. Generate text from a user-provided starting phrase.

  3. Support multiple prediction models for different writing domains.

  4. Serve as an educational exercise in statistical natural language processing and functional programming.

Technology Stack

The project was built using a modern functional programming approach with the following technologies:

  1. Scala 3 for the core prediction engine and its immutable data structures.

  2. Akka for the reactive, actor-based backend.

  3. The JVM runtime, with Java interoperability for the web interface components.

  4. A web frontend providing the writing interface, suggestions panel, and model management views.

Why Scala?

Scala 3 was deliberately chosen as the implementation language for several compelling reasons:

  1. Functional Programming Paradigm: The immutable, functional approach of Scala aligns perfectly with the mathematical nature of language models. Linguistic algorithms involve data transformations rather than mutations, making functional programming a natural fit.

  2. Type Safety: Scala's strong static typing system prevented numerous potential runtime errors during development. The algebraic data types and pattern matching capabilities were particularly valuable when handling complex tree structures like Tries.

  3. Expressiveness: Scala's concise syntax enabled implementing complex algorithms with minimal code while maintaining readability. Features like case classes and pattern matching significantly simplified the implementation of the core data structures.

  4. JVM Compatibility: Being JVM-based, Scala offered excellent performance while maintaining interoperability with Java libraries when needed, particularly for web interface components.

  5. Actor Model with Akka: The reactive nature of a text prediction application (responding to user input events) mapped well to Akka's actor-based concurrency model, enabling responsive performance.

Technical Challenges

Several significant challenges arose during development, most notably designing immutable data structures for efficient n-gram storage and keeping suggestions responsive in real time; both are discussed in the sections that follow.

Core Algorithms and Principles

N-gram Model: Statistical Foundations

The heart of the prediction system is its n-gram model. An n-gram is a contiguous sequence of n items (in this case, words) from a given text sample. The fundamental principle behind n-gram prediction is the Markov assumption: the probability of a word depends only on the n-1 words that precede it.

This statistical approach works through the following principles:

  1. Frequency Analysis: The system analyzes large text corpora to count how often certain word sequences appear, establishing a statistical model of language patterns.

  2. Conditional Probability: For an n-gram model, we calculate P(wₙ|w₁...wₙ₋₁) — the probability that word wₙ follows the sequence w₁...wₙ₋₁. This is computed by dividing the frequency of the complete sequence (w₁...wₙ) by the frequency of the prefix sequence (w₁...wₙ₋₁).

  3. Context Window Sensitivity: Higher-order n-grams (like trigrams, where n=3) capture more context than lower-order ones (like unigrams, where n=1); the choice of n balances specificity against generalizability.

For example, in a trigram model, to predict what might follow "the large elephant," we look at historical frequencies of words that followed this exact phrase in our training corpus, creating a probability distribution of potential next words.
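As a toy illustration of this counting approach (a sketch, not the project's actual API), trigram probabilities can be estimated directly from a word sequence:

// Toy illustration: estimate P(next | context) from raw trigram counts.
val mots: Vector[String] =
  "the large elephant walked while the large elephant trumpeted"
    .split("\\s+").toVector

val n = 3 // trigram model: the context is the two preceding words

// Count each (context, next-word) pair over a sliding window.
val comptes: Map[(Vector[String], String), Int] =
  mots.sliding(n).map(g => (g.init, g.last)).toVector
    .groupBy(identity).view.mapValues(_.size).toMap

// P(w | context) = count(context :+ w) / count(context)
def proba(contexte: Vector[String], mot: String): Double =
  val total = comptes.collect { case ((c, _), k) if c == contexte => k }.sum
  if total == 0 then 0.0
  else comptes.getOrElse((contexte, mot), 0).toDouble / total

proba(Vector("the", "large"), "elephant") // 1.0 in this toy corpus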

The system supports:

  1. Configurable n-gram orders, from low-order models up to trigrams and beyond, letting users trade context sensitivity for generality.

  2. Multiple prediction models trained on different corpora (see Model Management below).

Trie Data Structure: Efficient Retrieval

A critical aspect of making predictions efficient was the choice of data structure. The project uses a Trie (prefix tree) for storing and retrieving n-grams, which offers several advantages:

  1. Prefix-Based Organization: In a Trie, words sharing common prefixes share storage, making it memory-efficient for storing large vocabularies with common stems.

  2. Lookup Efficiency: Finding all words with a given prefix has O(m) complexity, where m is the length of the prefix, regardless of the total vocabulary size.

  3. Incremental Matching: As a user types, the Trie can efficiently update suggestions by traversing just one additional node per character.

The Trie implementation follows one guiding principle: each path in the Trie represents a sequence of words (the context), and the node at the end of this path contains the distribution of words that can follow this sequence, along with their probabilities.

The Trie is implemented as an immutable data structure in Scala 3, adhering to functional programming principles:

case class Trie[T](
    enfants: Map[Char, Trie[T]] = Map.empty[Char, Trie[T]], // children, keyed by character
    valeur: Option[Probabilites[T]] = None                  // distribution stored at this key, if any
)
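
The repository's insertion and lookup operations are not reproduced here, but a minimal sketch, assuming Probabilites simply wraps a map from outcomes to probabilities (as the test suite shown later suggests), might look like this:

// Assumed shape of the probability container used by the Trie and the tests.
case class Probabilites[T](valeurs: Map[T, Double])

extension [T](trie: Trie[T])
  // Insert a key with its probabilities, returning a new Trie (no mutation).
  def inserer(mot: String, probs: Probabilites[T]): Trie[T] =
    mot.headOption match
      case None => trie.copy(valeur = Some(probs))
      case Some(c) =>
        val enfant = trie.enfants.getOrElse(c, Trie[T]())
        trie.copy(enfants = trie.enfants.updated(c, enfant.inserer(mot.tail, probs)))

  // Follow the key character by character; None if the path is absent.
  def trouver(mot: String): Option[Probabilites[T]] =
    mot.headOption match
      case None => trie.valeur
      case Some(c) => trie.enfants.get(c).flatMap(_.trouver(mot.tail))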

Prediction Algorithm Workflow

The prediction process works through these sequential steps:

  1. Context Extraction: When a user is typing, the system identifies the last n-1 words to form the context for prediction.

  2. Trie Traversal: The identified context is used to navigate the Trie, finding the node that represents this specific word sequence.

  3. Probability Ranking: Once the correct node is located, the system retrieves the probability distribution of potential next words.

  4. Suggestion Filtering: The top k words with the highest probabilities are selected and presented to the user as suggestions.

  5. Feedback Integration: If the user selects one of the suggestions, this reinforces the model's predictions in future iterations.

The implementation includes the following operations:

  1. Insertion: Adds a word and its associated probabilities to the Trie, creating a new Trie instance.
  2. Lookup: Finds a word (or prefix) in the Trie and returns its associated probabilities.
  3. Retrieval of most probable results: Gets the most likely words to follow a given sequence (sketched below).
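
Building on the trouver helper assumed earlier, a hedged sketch of this third operation might rank continuations like this:

// Sketch of operation 3: rank the k most probable continuations of a context.
def suggestions[T](trie: Trie[T], contexte: String, k: Int): List[(T, Double)] =
  trie.trouver(contexte)
    .map(_.valeurs.toList.sortBy(-_._2).take(k))
    .getOrElse(Nil)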

Feature Implementation

Real-time Suggestions

The real-time suggestion feature analyzes the user's current text input and offers relevant word predictions.

This implementation:

  1. Captures the user's current input
  2. Extracts the relevant context (the last n-1 words)
  3. Queries the API for potential next words
  4. Displays suggestions in real time
  5. Allows the user to accept a suggestion with the Tab key
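
Step 2 is simple enough to sketch; assuming whitespace tokenization (the function name is illustrative, not the project's API):

// Sketch of step 2: keep the last n-1 whitespace-separated words as context.
def contexteCourant(saisie: String, n: Int): Vector[String] =
  if saisie.trim.isEmpty then Vector.empty
  else saisie.trim.split("\\s+").toVector.takeRight(n - 1)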

Text Generation

For text generation, a similar process is used iteratively:

  1. Start with a user-provided prefix
  2. Predict the next word based on the current context
  3. Add this word to the generated text
  4. Update the context window by sliding it one word forward
  5. Repeat until the desired text length is reached

This approach creates text that statistically resembles the style and patterns of the training corpus, following natural language collocations.
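A sketch of this loop, assuming the Trie is keyed by the space-joined context and reusing the hypothetical suggestions helper from above:

// Sketch of the generation loop: predict one word, append it, slide the window.
def generer(trie: Trie[String], depart: Vector[String], n: Int, longueur: Int): Vector[String] =
  (0 until longueur).foldLeft(depart) { (texte, _) =>
    suggestions(trie, texte.takeRight(n - 1).mkString(" "), 1) match
      case (mot, _) :: _ => texte :+ mot
      case Nil           => texte // no known continuation: leave the text as is
  }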

Model Management

The application supports multiple prediction models, allowing users to:

  1. Create new models from their own training corpora

  2. Select which model is active while writing

  3. Manage existing models from the interface

This flexibility enables users to have specialized models for different domains (e.g., technical writing, creative writing, formal documents) with different vocabulary distributions.

Frontend Development

The frontend was designed to be intuitive and responsive, with key components including:

  1. Main Writing Interface: A clean editor with real-time suggestions as you type.
  2. Suggestions Panel: A dedicated view showing possible next words with their probabilities.
  3. Text Generator: An interface for generating text from a starting phrase.
  4. Model Management: Controls for creating, selecting, and managing prediction models.

Testing and Validation

The project includes a comprehensive test suite that verifies the correctness and performance of the major components.

For example, the Trie data structure tests verify correct behavior for various edge cases:

"Un Trie" should "stocker des mots et leurs probabilités" in {
  val trie = Trie[String]()
  val probsMap = Map("chien" -> 0.7, "chat" -> 0.3)
  val probs = Probabilites(probsMap)
  val trieMisAJour = trie.inserer("bonjour", probs)
  val resultat = trieMisAJour.trouver("bonjour")
  resultat shouldBe Some(probs)
}

Limitations and Future Improvements

Current Limitations

The n-gram approach, while effective, has inherent limitations:

  1. Limited context window: only the previous n-1 words influence a prediction, so longer-range dependencies are lost.

  2. Data sparsity: word sequences that never appear in the training corpus receive no probability mass.

  3. No semantic awareness: words are treated as discrete symbols, so the model cannot relate words like "automobile" and "car" unless they appear in identical contexts.

Neural Network Improvements

The system could be significantly enhanced by incorporating neural network approaches:

  1. Word Embeddings: Rather than treating words as discrete symbols, word embeddings (like Word2Vec or GloVe) represent words as dense vectors in a continuous space where similar words cluster together. This would allow the system to understand that "automobile" and "car" are related even if they don't appear in identical contexts in the training data.

  2. Recurrent Neural Networks (RNNs): LSTM or GRU networks that maintain state over longer sequences could capture dependencies beyond what n-grams can model.

  3. Bidirectional Context: While n-grams only look at preceding words, transformer-based models like BERT could analyze context in both directions, further improving prediction accuracy.

  4. Transfer Learning: Pre-trained models could provide a strong foundation of language understanding that gets fine-tuned on user-specific text, combining general language knowledge with personalized prediction.

Any of these techniques could be adopted incrementally, keeping the existing n-gram engine as a lightweight fallback while neural components are introduced.

Conclusion

This text prediction project successfully demonstrates a functional approach to natural language processing using statistical methods and immutable data structures. By implementing n-gram models and the Trie data structure in Scala 3, I created an efficient and responsive text prediction system that served as an excellent educational exercise in computational linguistics.

Key lessons learned include:

  1. Immutable, functional designs map naturally onto statistical language models.

  2. The choice of data structure (here, the Trie) largely determines retrieval performance.

  3. A solid grasp of the statistical fundamentals remains valuable even when more sophisticated techniques are the eventual goal.

While the statistical approach has its limitations, it provides a solid foundation for text prediction that could be enhanced with more advanced techniques like neural networks. Understanding these fundamental statistical principles remains valuable even as the field moves toward more sophisticated approaches.

This project not only fulfilled the course requirements but also provided valuable insights into natural language processing, functional programming patterns, and web application development—skills that transfer well to other complex software engineering challenges.


This project was developed as part of the Développement Avancé course in BUT3 Informatique at the IUT of Montpellier.
