Next Word Predictor

With smartphones so prevalent and almost everyone using SMS and WhatsApp, you have probably seen a next-word prediction feature in action. It works as follows: as you type, the application predicts and suggests the next word of your sentence.

Several newer techniques, such as Word2Vec, can be used to build such applications. You can also explore Gensim for this purpose.

Here I present a simple solution for building such an application, written in R.

Demonstration of the Tool

The Application at Work

The look and feel of this display is deliberately basic, as its only purpose is to demonstrate the logic. The logic has been built into a Shiny application, and the program can be extended to other applications.

Next Word Predictor - Screen Shot

The Source Data

The first step in creating this solution is to gather a sizeable amount of text. This predictor makes predictions in English, so I needed a lot of English text.

I got this data from newspaper websites and used it to form the corpus. (If this is unknown territory for you, please read appropriate sources on building a text corpus.)

Create the n-grams

After obtaining the data, the next step is to create the n-grams. In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs, depending on the application. n-grams are typically collected from a text or speech corpus. In this predictor, the items are words.
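
To make the definition concrete, here is a minimal base R sketch that lists the word-level n-grams of a single sentence. The helper name make_ngrams and the sample sentence are illustrative assumptions, not part of the original solution.

```r
# Build all word-level n-grams of size n from one sentence (base R only).
# make_ngrams is a hypothetical helper used for illustration.
make_ngrams <- function(sentence, n) {
  words <- strsplit(tolower(sentence), " ", fixed = TRUE)[[1]]
  words <- words[words != ""]                 # drop empty tokens
  if (length(words) < n) return(character(0)) # too short for any n-gram
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams  <- make_ngrams("the cat sat on the mat", 2)
trigrams <- make_ngrams("the cat sat on the mat", 3)
print(bigrams)
print(trigrams)
```

A six-word sentence yields five 2-grams and four 3-grams, because each n-gram is a sliding window of n consecutive words.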

n-grams can be created using the RWeka package in R. (If this is unknown territory for you, please read appropriate sources for this knowledge.)

I created n-grams for n = 2 through 8 (2-grams, 3-grams, and so on up to 8-grams) and stored them as RDS files. The purpose is to match the input text against these tables: if the sequence of words exists in our knowledge base, we can look up which words followed it previously, together with their probabilities.
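
The lookup tables used later have the columns context, word and p. The following base R sketch shows one plausible way to derive such a table from a list of n-grams. The toy data and the relative-frequency estimate of p are assumptions made for illustration; the original tables may have been built differently.

```r
# Toy trigram list; in practice these would be extracted from the corpus.
ngrams <- c("i like tea", "i like tea", "i like coffee", "you like tea")

# Count how often each distinct n-gram occurs.
counts <- as.data.frame(table(ngram = ngrams), stringsAsFactors = FALSE)

# Split each n-gram into its context (all but the last word) and its last word.
counts$context <- sub(" [^ ]+$", "", counts$ngram)
counts$word    <- sub("^.* ", "", counts$ngram)

# Assumed estimate: p(word | context) = count / total count for that context.
counts$p <- counts$Freq / ave(counts$Freq, counts$context, FUN = sum)

print(counts[, c("context", "word", "p")])
```

With this layout, predicting the next word reduces to filtering rows by context and picking the row with the highest p.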

The Prediction Logic

Step 1: Load the saved n-grams

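    library(data.table)  # n-gram tables are queried with data.table syntax
    library(tm)          # Corpus, tm_map, content_transformer
    library(stringi)     # stri_trans_tolower, stri_trans_general

    # Each file holds one table with the columns context, word and p
    gram2 <- readRDS("2gram.rds")
    gram3 <- readRDS("3gram.rds")
    gram4 <- readRDS("4gram.rds")
    gram5 <- readRDS("5gram.rds")
    gram6 <- readRDS("6gram.rds")
    gram7 <- readRDS("7gram.rds")
    gram8 <- readRDS("8gram.rds")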

Step 2: Load the text, form the corpus, and clean the text

The text is the sequence of words typed so far, from which the next word is to be predicted.

    # Text typed by the user in the Shiny input box
    text <- input$inputText

    # Build a one-document corpus from the input
    mydata.corpus <- Corpus(VectorSource(text))
    # Replace non-ASCII characters with spaces, then lowercase
    mydata.corpus <- tm_map(mydata.corpus, content_transformer(function(x) iconv(x, to = 'ASCII', sub = ' ')))
    mydata.corpus <- tm_map(mydata.corpus, content_transformer(tolower))
    # Remove digits and punctuation, then collapse repeated whitespace
    mydata.corpus <- tm_map(mydata.corpus, content_transformer(removeNumbers))
    mydata.corpus <- tm_map(mydata.corpus, content_transformer(removePunctuation))
    mydata.corpus <- tm_map(mydata.corpus, content_transformer(stripWhitespace))
    mydata.corpus <- tm_map(mydata.corpus, PlainTextDocument)
    # Final normalisation with stringi
    mydata.corpus <- tm_map(mydata.corpus, content_transformer(function(x) stri_trans_tolower(x)))
    mydata.corpus <- tm_map(mydata.corpus, content_transformer(function(x) stri_trans_general(x, "en_US")))
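
To see what the cleaning steps do to a piece of text, here is a rough base R approximation applied to a sample string. The function clean_text is a hypothetical stand-in, not the tm pipeline itself, and the final trimws call is an addition that the code above does not perform.

```r
# Approximate the cleaning steps with base R on a plain character string.
# clean_text is an illustrative helper, not part of the original solution.
clean_text <- function(x) {
  x <- iconv(x, to = "ASCII", sub = " ")  # replace non-ASCII bytes with spaces
  x <- tolower(x)                         # lowercase
  x <- gsub("[[:digit:]]+", "", x)        # remove numbers
  x <- gsub("[[:punct:]]+", "", x)        # remove punctuation
  x <- gsub("[[:space:]]+", " ", x)       # collapse runs of whitespace
  trimws(x)                               # extra step: trim both ends
}

cleaned <- clean_text("  Hello, World!  It's 2 o'clock   ")
print(cleaned)
```

Note that removing punctuation also strips apostrophes, so "It's" becomes "its". The n-gram tables should be built from text cleaned the same way, otherwise contexts will not match.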

Step 3: Collate the fragments of the text from the corpus

    # Cleaned input as a single character string
    frase <- unlist(mydata.corpus[[1]]$content)

    # Split into words; an n-gram table needs the last n - 1 words as context
    prev <- unlist(strsplit(frase, split = " ", fixed = TRUE))
    len <- length(prev)
    fra2 <- paste(tail(prev, 1), collapse = " ")  # context for the 2-gram table
    fra3 <- paste(tail(prev, 2), collapse = " ")
    fra4 <- paste(tail(prev, 3), collapse = " ")
    fra5 <- paste(tail(prev, 4), collapse = " ")
    fra6 <- paste(tail(prev, 5), collapse = " ")
    fra7 <- paste(tail(prev, 6), collapse = " ")
    fra8 <- paste(tail(prev, 7), collapse = " ")  # context for the 8-gram table
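
As a small illustration of this step, the sketch below extracts contexts of different lengths from a sample phrase. The phrase and the variable names ctx1, ctx3 and ctx7 are assumptions chosen for the example.

```r
# Split a sample phrase into words, as in Step 3.
prev <- strsplit("thank you very much for", " ", fixed = TRUE)[[1]]

# The last k words form the context for the (k + 1)-gram table.
ctx1 <- paste(tail(prev, 1), collapse = " ")
ctx3 <- paste(tail(prev, 3), collapse = " ")

# Asking for more words than exist simply returns all of them.
ctx7 <- paste(tail(prev, 7), collapse = " ")

print(c(ctx1, ctx3, ctx7))
```

Because tail never fails on short input, the longer contexts fall back to the whole phrase when the user has typed only a few words. Such a context has fewer words than the higher-order tables expect, so it is unlikely to match there and the shorter tables handle the prediction.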

Step 4: Make the prediction

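Each n-gram table is queried for the most probable word (the one with the highest p) following the matching context. The predicted word is stored in the variable predict.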

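    predict <- NULL
    # Default each candidate to "no match" so a failed lookup leaves it empty
    pred8 <- pred7 <- pred6 <- pred5 <- pred4 <- pred3 <- pred2 <- character(0)
    try(pred8 <- gram8[context == fra8, .SD[which.max(p), word]])
    try(pred7 <- gram7[context == fra7, .SD[which.max(p), word]])
    try(pred6 <- gram6[context == fra6, .SD[which.max(p), word]])
    try(pred5 <- gram5[context == fra5, .SD[which.max(p), word]])
    try(pred4 <- gram4[context == fra4, .SD[which.max(p), word]])
    try(pred3 <- gram3[context == fra3, .SD[which.max(p), word]])
    try(pred2 <- gram2[context == fra2, .SD[which.max(p), word]])
    # Assumed selection rule: prefer the longest context that produced a match
    for (pred in list(pred8, pred7, pred6, pred5, pred4, pred3, pred2)) {
        if (length(pred) > 0) { predict <- pred; break }
    }
    if (is.null(predict)) predict <- "Next word cannot be predicted."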
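
The idea of trying the longest context first and falling back to shorter ones can be sketched on toy data. The example below uses plain data frames instead of the data.table objects above, and every name in it (gram3_toy, gram2_toy, best_word, predict_next) as well as the probabilities are made up for illustration.

```r
# Toy lookup tables with the same columns as in the text: context, word, p.
gram3_toy <- data.frame(
  context = c("thank you", "thank you", "how are"),
  word    = c("very", "for", "you"),
  p       = c(0.6, 0.4, 1.0),
  stringsAsFactors = FALSE
)
gram2_toy <- data.frame(
  context = c("you", "you", "the"),
  word    = c("are", "can", "end"),
  p       = c(0.7, 0.3, 1.0),
  stringsAsFactors = FALSE
)

# Most probable word for a context, or character(0) if the context is unseen.
best_word <- function(tbl, ctx) {
  rows <- tbl[tbl$context == ctx, , drop = FALSE]
  if (nrow(rows) == 0) return(character(0))
  rows$word[which.max(rows$p)]
}

# Try the longest context first, then back off to shorter ones.
predict_next <- function(words, models) {
  for (m in models) {
    ctx  <- paste(tail(words, m$n_context), collapse = " ")
    pred <- best_word(m$table, ctx)
    if (length(pred) > 0) return(pred)
  }
  "Next word cannot be predicted."
}

# Models ordered from the highest-order table to the lowest.
models <- list(
  list(n_context = 2, table = gram3_toy),
  list(n_context = 1, table = gram2_toy)
)

r1 <- predict_next(c("i", "want", "to", "thank", "you"), models)
r2 <- predict_next(c("see", "you"), models)
r3 <- predict_next(c("good", "morning"), models)
print(c(r1, r2, r3))
```

Here the first input matches the 3-gram table directly, the second only matches after backing off to the 2-gram table, and the third matches nothing and yields the fallback message.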