Multinomial text classification In Data Science: A machine learning approach

Hi all! It’s been a long time since I have posted. The time between my last post in March to today was filled with a high amount of enthusiasm and lots of learning. It was great to hear feedback from a few of my readers and contributing their ideas and suggestions to make my previous contributions better and optimized as per their use cases.

This time, I have come up with an implementation of Text classification which is more often used in most machine learning and data science applications like binary classification, NLP (Natural Language Processing) applications or Sentiment analysis. To begin with, its great to start with an idea of what exactly is text classification.

Text classification is the process of labeling natural language texts with relevant classes or categories from a predefined set

What are the applications of Text classification?

To name a few:

  • Automating CRM tasks by analyzing and assigning the tasks based on relevance
  • Analysing the sentiments from the text to determine the mood of an audience
  • Predicting the possible class or category of a given dataset and many more…

It sounds interesting!

What are the methods to do it?

Keeping it short, there are namely two popular methods for doing text classification:

  1. Bernoulli classification
    • This model uses binary occurrence information, ignoring the number of occurrences i.e the count of the elements while training our model.
    • For example, it is suitable to use when we have only two categories or classes for data classification, like a coin when tossed may produce heads or tails.
    • When the number of classes is more than two and the document is large enough, then this model may make mistakes when classifying long documents.
  2. Multinomial classification
    • This model does not ignore the number of occurrences of the elements in a text while training our model.
    • It provides the most generic form of text classification and supports classification in multiple categories of classes.
    • For example, it can be used to classify a document containing text into multiple classes, like rolling a dice having 6 sides (classes).

How did I do it?

Since multinomial classification is the most reliable method to classify large documents, I have used this method for my implementation of text classification. My implementation uses a training dataset to train my model and returns the accuracy (%) of my trained model when some test data is fed as an input. The higher the accuracy, the more reliable the model.

The technology stack which I have used to train my model and implement this application uses:

  1. Java 8
  2. MongoDB

Follow this link to find my implementation: https://goo.gl/V5kgkU

Feel free to comment your implementations or post your queries.