N-grams | Introduction to Text Analytics with R Part 6


Part 6, N-grams, includes specific coverage of:

• Validate the effectiveness of TF-IDF in improving model accuracy.
• Introduce the concept of N-grams as an extension to the bag-of-words model to allow for word ordering.
• Discuss the trade-offs involved in using N-grams and how Text Analytics suffers from the “Curse of Dimensionality”.
• Illustrate how quickly Text Analytics can strain the limits of your computer hardware.
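For readers following along in R, the core bigram idea can be sketched with quanteda, the package used throughout this series; the toy sentence below is purely illustrative:

```r
library(quanteda)

# Tokenize a toy document and build bigrams from it.
toks <- tokens("the quick brown fox jumps over the lazy dog")
bigrams <- tokens_ngrams(toks, n = 2)

# Each bigram fuses two adjacent tokens (e.g. "quick_brown"), so the
# feature space grows rapidly with n - the curse of dimensionality
# this video warns about.
dfm(bigrams)
```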

The data and R code used in this series are available here:



18 Comments on “N-grams | Introduction to Text Analytics with R Part 6”

  1. Hey, I need help. I want to run a confusion matrix on rpart.cv.2, and I use the code confusionMatrix(train.tokens.tfidf.df$Label, rpart.cv.2$finalModel$predicted), but I get: Error: `data` and `reference` should be factors with the same levels

    Any suggestions for my problem?

    Thank you
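One possible fix for the error above, assuming the objects from the series (rpart.cv.2, train.tokens.tfidf.df) are in scope: caret's confusionMatrix() expects the predictions first and the reference labels second, and both must be factors sharing the same levels. A sketch:

```r
library(caret)

# Predictions first, reference labels second.
preds <- rpart.cv.2$finalModel$predicted
truth <- train.tokens.tfidf.df$Label

# Coerce both to factors over the union of observed levels so the
# level sets match exactly.
lvls <- union(levels(factor(preds)), levels(factor(truth)))
confusionMatrix(factor(preds, levels = lvls),
                factor(truth, levels = lvls))
```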

  2. Great video series! Question: why retain the unique unigrams after creating the n-grams? I am using gensim on the Python side and it keeps only the assembled n-grams. Let me know if I am theoretically losing detail. Thanks
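On the question above: in quanteda (used in this series), passing a vector to n retains the unigrams alongside the bigrams; the toy input below is illustrative:

```r
library(quanteda)

toks <- tokens("machine learning with r")

# n = 1:2 keeps every unigram AND adds the bigrams. Dropping the
# unigrams (n = 2 only) can lose signal whenever a word carries
# meaning on its own as well as inside a phrase.
tokens_ngrams(toks, n = 1:2)
```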

  3. Thank you so much for your help; your tutorials are extremely helpful. However, I am confused as to what you mean when you talk about preserving the IDF vector. Why is this one so important to preserve, as opposed to the TF vector or the TF-IDF vector? You say that it is important to use it to translate the new data into the space that the old data is in, but wouldn't the IDF vector itself change once we start adding new data? Won't its contents change as we get more data? Why do the contents of this vector remain constant as new data comes in? Thank you in advance!
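A sketch of the idea behind preserving the IDF vector. The helper functions below are written from scratch in the style of the series, and test.tokens.matrix is an assumed matrix of new documents with the same columns as the training data:

```r
# Per-document term frequency and per-term inverse document frequency.
term_frequency <- function(row) row / sum(row)
inverse_doc_freq <- function(col) log10(length(col) / length(which(col > 0)))

# IDF is fitted on the TRAINING corpus only:
train.idf <- apply(train.tokens.matrix, 2, inverse_doc_freq)

# New documents are projected with the SAME train.idf, so each column
# keeps exactly the meaning the model learned. Recomputing IDF on new
# data would silently shift the feature space under the trained model -
# which is why the IDF vector, not TF or TF-IDF, is what gets preserved.
test.tf <- apply(test.tokens.matrix, 1, term_frequency)
test.tfidf <- apply(test.tf, 2, function(x, idf) x * idf, idf = train.idf)
```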

  4. Hey there Dave!

    I seem to get a similar error when I execute the second CV:

    rpart.cv.2 = train(Label ~ ., data = train.tokens.tfidf.df, method = "rpart", trControl = cv.cntrl, tuneLength = 7)
    Error in terms.formula(formula, data = data) :
    variable names are limited to 10000 bytes
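Two hedged workarounds for the 10,000-byte error above: the formula interface pastes every column name into one formula, and very long or malformed n-gram names can exceed R's limit. Cleaning the names, or bypassing the formula entirely with caret's x/y interface, are both worth trying:

```r
# 1) Make the column names short, unique and syntactically valid:
names(train.tokens.tfidf.df) <- make.names(names(train.tokens.tfidf.df),
                                           unique = TRUE)

# 2) Or skip the formula entirely with caret's x/y interface, which
#    never builds a giant formula from the ~thousands of column names:
rpart.cv.2 <- train(x = train.tokens.tfidf.df[, names(train.tokens.tfidf.df) != "Label"],
                    y = train.tokens.tfidf.df$Label,
                    method = "rpart", trControl = cv.cntrl,
                    tuneLength = 7)
```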

  5. Hi Data Science Dojo, first I'd like to thank you for such excellent videos on Text Analytics using R. I have an issue with the script [rpart.cv.1 <- train(Label ~ ., data = train.tokens.df, method = "rpart", trControl = cv.cntrl, tuneLength = 7)]. It takes an awfully long time, ranging between 15 and 30 minutes, and at the end the following error message appears:

    Error: cannot allocate vector of size 354.7 Mb
    In addition: Warning message:
    In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
    There were missing values in resampled performance measures.
    Called from: terms.formula(formula, data = data)
    Timing stopped at: 172.5 9.39 182.4

Am I missing something here? The rest of the scripts ran fine.
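One hedged way to ease the memory pressure behind the "cannot allocate vector" error above is to trim rare features from the sparse quanteda dfm before it is ever converted to a dense data frame; the thresholds here are illustrative:

```r
library(quanteda)

# Drop terms that appear fewer than 5 times overall or in fewer than
# 2 documents, shrinking the dense representation (and every CV
# resample caret builds) substantially.
train.tokens.dfm <- dfm_trim(train.tokens.dfm,
                             min_termfreq = 5,
                             min_docfreq = 2)

# Explicitly freeing memory between steps can also help on smaller machines.
gc()
```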

  6. Hi Dave,
    When I am running the code
    rpart.cv.2 <- train(Label ~ .,data=train.token.tfidf.df, method = 'rpart',
    trControl = cv.cntrl, tuneLength = 7)

It throws the following error:
    Error in terms.formula(formula, data = data) :
    variable names are limited to 10000 bytes.
Can you please suggest how to correct this?

  7. Hi! Thanks a lot! These videos are brilliant!
    I keep running into a problem when trying to run the model:

    > rpart.cv.2 <-train(Label~., data=train.tokens.tfidf.df, method = "rpart",
    + trControl = cv.cntrl, tuneLength = 7)

    Error in e$fun(obj, substitute(ex), parent.frame(), e$data) :
    unable to find variable "optimismBoot"

    Would you happen to know why I am getting this error, and how to fix it?
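For reference, the "optimismBoot" error above was a known incompatibility between older caret releases and parallel backends such as doSNOW (used in this series). Two hedged workarounds:

```r
# 1) Update caret to a release where the bug is fixed:
install.packages("caret")

# 2) Or unregister the parallel backend and retrain sequentially,
#    so the workers never need to find caret's internal variables:
library(foreach)
registerDoSEQ()
```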

  8. Hi @Dave

I completely understood all that is happening in video #6, but when I try to train the model using train.tokens.tfidf.df as data, I get the following error:

    rpart.cv.2 <- train(Label ~ ., data = train.tokens.tfidf.df, method = "rpart",
    trControl = cv.cntrl, tuneLength = 7)

    Error in eval(predvars, data, env) : object 'Label' not found

I even looked it up on Stack Overflow, then used the GitHub repository for this piece of code, and then eventually the whole code, just in case I was typing something wrong, but the error won't go away.

The funny thing is that train.tokens.tfidf.df has a column named Label, and yet this error appears. Also, this same code works fine for rpart.cv.1.

    This same error persists for all the subsequent rpart.cv variables, i.e. rpart.cv.3, rpart.cv.4

Finally, this same code sometimes also throws: Error: protect(): protection stack overflow in R (maybe due to the 29K+ columns in the data set).

I've been stuck on this for a while now, and honestly it has become a roadblock in my learning. I have been waiting for your response and would really appreciate your help.
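A hedged diagnosis for the pair of errors above: if make.names() was run on the data frame earlier, a token column that clashed with "Label" can cause the original Label column to be renamed (e.g. to "Label.1"), and the protect() stack overflow comes from expanding ~29K column names into one formula. Both can be sidestepped with caret's x/y interface:

```r
# First, confirm Label actually survived any renaming:
"Label" %in% names(train.tokens.tfidf.df)

# The x/y interface never builds the giant formula, avoiding both the
# 'Label' lookup problem and the protection stack overflow:
rpart.cv.2 <- train(x = train.tokens.tfidf.df[, names(train.tokens.tfidf.df) != "Label"],
                    y = train.tokens.tfidf.df$Label,
                    method = "rpart", trControl = cv.cntrl,
                    tuneLength = 7)
```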


  9. Hey there,
Thanks for the tutorial. This is great. However, I got this problem when converting with as.matrix:
    Error in asMethod(object) :
    Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105

I guess it's because the dimensions are too large. How should I deal with this? Please advise.
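Correct diagnosis: as.matrix() on a very large sparse dfm asks Cholmod to materialize the whole dense matrix at once. Two hedged alternatives, with illustrative thresholds and chunk sizes:

```r
library(quanteda)

# 1) Trim rare features from the sparse dfm before densifying, so the
#    dense form fits in memory:
small.dfm <- dfm_trim(train.tokens.dfm, min_docfreq = 2)

# 2) Or densify in row chunks of 1000 documents instead of all at once,
#    then stitch the pieces back together:
chunk.ids <- split(seq_len(nrow(small.dfm)),
                   ceiling(seq_len(nrow(small.dfm)) / 1000))
dense <- do.call(rbind,
                 lapply(chunk.ids, function(idx) as.matrix(small.dfm[idx, ])))
```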

  10. Hey Dave,
Got this error. Please help me rectify it:

rpart.cv.2 <- train(Label ~ ., data = train.tokens.tfidf.df, method = "rpart", trControl = cv.ctrl, tuneLength = 7)

    Error in `[.data.frame`(m, labs) : undefined columns selected
    Called from: `[.data.frame`(m, labs)

  11. Thank you! This is really great! I love how you explain why each step of the process is done instead of just instructing us to do it. This is really helpful for a text-analytics project I'm working on. Could you finish off the tutorial by applying the model to predict a test set, in a confusion-matrix format? I'm not sure how to apply that model function to a random test sample. My project is due this Tuesday!
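A hedged sketch of scoring a held-out set, assuming a test.tokens.tfidf.df was built by projecting the test text with the training IDF vector so it has the same columns (including Label) as the training data frame:

```r
library(caret)

# caret's predict() on a train() object applies the tuned final model
# to new data with matching feature columns.
preds <- predict(rpart.cv.2, newdata = test.tokens.tfidf.df)

# Compare predictions against the true labels in a confusion matrix.
confusionMatrix(preds, test.tokens.tfidf.df$Label)
```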
