Naive Bayes Model for Machine Learning and AI

By: Harris Amjad   |   Updated: 2024-11-21   |   Related: More > Artificial Intelligence


Problem

As AI and Machine Learning have grown in popularity, especially Large Language Models, more professionals have explored how these systems work. Unfortunately, some put the cart before the horse, tackling complex algorithms before building a solid foundation, which often causes their interest in the topic to fade. This tip will introduce a simple yet powerful probabilistic classifier, the Naïve Bayes model, and implement it in Python.

Solution

In a previous tip, "Machine Learning Introduction: KNN Model," we explored the K-Nearest Neighbors (KNN) algorithm, which is perfect for supervised learning beginners. By using such simple and intuitive models, our goal is to help readers get comfortable with the jargon of machine learning, the problems to expect, and the nitpicky details to look out for, rather than merely explaining a model. With this purpose in mind, this tip focuses on another simple, yet effective algorithm: the Naive Bayes classifier.

Despite its simplicity, Naive Bayes is a very useful model for classification. Unlike KNN, which is a discriminative classifier (it works on the features that differentiate between classes), Naive Bayes is a generative classifier: it attempts to model the underlying joint probability distribution of the input features and the output label. Every combination of features and label has a probability associated with it, namely the chance of a certain label given a certain combination of input features. This is essentially the joint probability distribution of the feature-label space.

Example of scenarios where probabilities are highly relevant

Basics of Probability

Before we get ahead of ourselves, we must understand some basic concepts from probability.

So, what really is probability? It is simply a measure of how likely an event is to occur, quantified as a number between 0 and 1 (inclusive). One of the key concepts in probability is the 'experiment': an action that has more than one possible outcome. An example is flipping a coin, whose outcome can be either heads or tails. Rolling a die is another experiment, one that can yield 6 different possible outcomes.

There is a unique name for the set of all possible outcomes of an experiment: the 'sample space', denoted $S$. For instance, the sample space for flipping a coin is $S = \{\text{Heads}, \text{Tails}\}$.

Another important keyword is 'event', denoted $E$: a specific outcome or a set of outcomes from the sample space. As an example, in the context of rolling a die, one event can be getting an even number. This event comprises the following outcomes: $E = \{2, 4, 6\}$.

To calculate the probability of an event, we can use the following formula:

$P(E) = \frac{\text{number of outcomes in } E}{\text{total number of outcomes in } S}$

Therefore, if we were to calculate $P(E)$ for this event, we should get $P(E) = \frac{3}{6} = 0.5$ as an answer, as there are 3 outcomes in our event and 6 total possible outcomes in the sample space. This number conveys that there is a probability of 0.5, or a 50% chance, that rolling a fair, six-sided die will result in an even number.
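To make this concrete, here is a minimal Python sketch (not part of the original example, just an illustration) that computes this probability by counting outcomes:

# A minimal sketch: computing P(E) by counting outcomes for a fair, six-sided die
sample_space = {1, 2, 3, 4, 5, 6}                                   # S: all possible outcomes
event = {outcome for outcome in sample_space if outcome % 2 == 0}   # E: even numbers {2, 4, 6}

probability = len(event) / len(sample_space)
print(probability)   # 0.5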

Probability of Events Occurring Together

Moving on to more complicated matters, sometimes we want to find the probability of two or more events happening at the same time. To find the probability of two independent events $A$ and $B$ occurring simultaneously, we can use the multiplication rule:

$P(A \cap B) = P(A) \times P(B)$

But what does independence mean here? The two events are independent of each other if the occurrence of one event does not alter the probability of the other. This concept is best understood through examples.

Suppose one event is that it will rain today in your city, and another event is that rolling a die gets you a 6. These two events are independent, i.e., the weather in your city has no effect on what number you will get when you roll a die. On the other hand, let's consider the event that a student did not sleep the night before their exam and the event that the student scored poorly on that exam. Are these events independent? Obviously not! A lack of sleep can impair the student's judgment and memory, increasing the chances of performing poorly on the exam. Since we cannot use the multiplication rule for dependent events like these, we will see later how to compute their probability.
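As a quick sanity check of the multiplication rule in code, here is a small sketch; the choice of a coin flip and a die roll as the two independent events is mine, purely for illustration:

# Two independent events: a coin shows heads, and a die shows a 6
p_heads = 1 / 2
p_six = 1 / 6

# Multiplication rule for independent events: P(A and B) = P(A) * P(B)
p_heads_and_six = p_heads * p_six
print(p_heads_and_six)   # 0.0833..., i.e., 1/12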

Probability of One Event or the Other Event

What if the events are alternatives? For example, let's say that we want to calculate the probability of a person rolling a 1 or a 6 on their die. The crucial word here is 'or', when the two events belong to the same sample space.

Review the Venn diagram below. The blue circle denotes an event A, whereas the green circle shows another event B. We can think of these circles as enclosing all the possible outcomes of the event they represent. Outcomes that are in neither A nor B lie in the white part of the rectangle; therefore, the entire rectangle denotes the sample space of the experiment. In cases where the two events do not have any outcomes in common, we can use the addition rule as follows to calculate $P(A \cup B)$:

$P(A \cup B) = P(A) + P(B)$

A venn diagram of mutually exclusive events

Suppose that a restaurant has 3 desserts on its menu, but it only makes 1 of the desserts each day, each equally likely. The probability that a given dessert will be made on a random day is:

$P(\text{cake}) = \frac{1}{3}$

$P(\text{pudding}) = \frac{1}{3}$

$P(\text{third dessert}) = \frac{1}{3}$

Suppose that your favorite desserts are cake and pudding, and you are interested in knowing the chances that the restaurant will serve either of them on any given day. Using the addition rule, we get $P(\text{cake} \cup \text{pudding}) = \frac{1}{3} + \frac{1}{3} = \frac{2}{3}$. There is roughly a 66.7% chance that the restaurant will serve one of your favorite desserts! Events which cannot occur at the same time are called mutually exclusive events. It is impossible for them to happen together.

But what if the events are not mutually exclusive? Consider the Venn diagram below. Some of the outcomes are present in both event A and B; therefore, it is possible for the two events to happen together. In this case, the addition rule is:

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$

Now we are subtracting the probability that the two events occur together to prevent overestimating our results due to double counting.

A venn diagram of events which are not mutually exclusive
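Here is a short Python sketch of the general addition rule on a single die roll; the events 'even number' and 'greater than 3' are illustrative choices of mine, not from the text above:

sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}        # event A: an even number
B = {4, 5, 6}        # event B: a number greater than 3

p_A = len(A) / len(sample_space)
p_B = len(B) / len(sample_space)
p_A_and_B = len(A & B) / len(sample_space)   # intersection {4, 6}

# General addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_A_or_B = p_A + p_B - p_A_and_B
print(round(p_A_or_B, 4))                          # 0.6667
print(round(len(A | B) / len(sample_space), 4))    # 0.6667, matches counting A union B directly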

Conditional Probability

Next, we are entering territory of probability that is directly linked to our Naive Bayes classifier. Conditional probability measures the probability of an event given that another event has already occurred. The probability of an event $A$ given that $B$ has occurred is denoted by $P(A \mid B)$ and computed as:

$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$

Let's try to understand this formula using the Venn diagram below. Since we are given the information that $B$ has already occurred, our sample space is now reduced to the event $B$. We can then find the probability of the intersection, i.e., where both $A$ and $B$ occur, and divide it by the probability of $B$ (which is now our sample space). You might also now be thinking that the conditional probability of mutually exclusive events is $0$, and you are absolutely correct!

Venn diagram demonstrating reduction of sample space in conditional probabilities

Let's work through an example to absorb this concept fully. Suppose that a university campus has 1,000 students. 400 of these students are business students, and the remaining 600 are computer science students. We also know that 100 out of the 600 computer science students are introverted, and 200 out of the 400 business students are introverted. Now, given that a person is introverted, what is the probability that this person is a business student? We might expect this probability to be low, but let's verify the result using conditional probability. We basically need to evaluate:

$P(\text{Business} \mid \text{Introvert}) = \frac{P(\text{Business} \cap \text{Introvert})}{P(\text{Introvert})}$

We know that 200 out of the 1,000 students are both business students and introverted:

$P(\text{Business} \cap \text{Introvert}) = \frac{200}{1000} = 0.2$

We also know that the total number of introverted students in the university is 300 (combining both majors):

$P(\text{Introvert}) = \frac{300}{1000} = 0.3$

Now we can compute the conditional probability:

$P(\text{Business} \mid \text{Introvert}) = \frac{0.2}{0.3} \approx 0.667$

There is a 66.7% chance that an introverted student will be a business major!
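The following short sketch reproduces the calculation above from the raw counts; the variable names are mine:

total_students = 1000
business_introverts = 200
cs_introverts = 100

# P(Business and Introvert) and P(Introvert)
p_business_and_introvert = business_introverts / total_students        # 0.2
p_introvert = (business_introverts + cs_introverts) / total_students   # 0.3

# Conditional probability: P(Business | Introvert)
p_business_given_introvert = p_business_and_introvert / p_introvert
print(round(p_business_given_introvert, 3))   # 0.667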

Before we wrap up our discussion of conditional probability, let's shed some light on Bayes' Theorem, which is foundational to our Naive Bayes classifier. Here's an alternate way to compute $P(A \mid B)$ using Bayes' Theorem:

$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$

The important feature of this theorem is that it allows us to invert conditional probabilities, which, as we will see later, is central to the classification methodology of the Naive Bayes model.
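As a tiny illustration (a sketch, not library code), we can wrap Bayes' Theorem in a helper function and re-derive the university result by inverting the conditional probability:

def bayes_theorem(p_b_given_a, p_a, p_b):
    # Bayes' Theorem: P(A | B) = P(B | A) * P(A) / P(B)
    return p_b_given_a * p_a / p_b

# Re-deriving the university result: A = Business, B = Introvert
# P(Introvert | Business) = 200/400, P(Business) = 400/1000, P(Introvert) = 300/1000
print(round(bayes_theorem(200 / 400, 400 / 1000, 300 / 1000), 3))   # 0.667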

The Naive Bayes Algorithm

Now that we have a good understanding of probabilities, let's jump to the main matter of this tip: how does the Naive Bayes algorithm work? Consider the following formulation, where $x$ is a data point and $c$ is a class label:

$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}$

So, what's going on here?

  • $P(c \mid x)$ is known as the posterior probability, and it quantifies the probability of a certain class given the data point.
  • $P(x \mid c)$ is the likelihood of observing a particular data point, given a class label.
  • $P(c)$ is the prior probability: what is the probability of a certain class in the dataset? The Bayes Theorem helps us update this prior belief about our data.
  • $P(x)$ is the probability of the data point itself.

Generally, we aren't concerned with $P(x)$. Remember, our end goal is to classify a data point appropriately. Our aim, therefore, is to find the most probable class given the training data. We can thus modify Bayes' Theorem to achieve:

$\hat{c} = \underset{c \in C}{\operatorname{argmax}}\ P(c \mid x) = \underset{c \in C}{\operatorname{argmax}}\ \frac{P(x \mid c)\, P(c)}{P(x)}$

$\hat{c}$ is our most probable class, and we find it by taking the argmax of $P(c \mid x)$ across all the classes $C$ in the dataset. In simpler terms, we compute $P(c \mid x)$ for every class label and return the class for which this probability is highest. Since $P(x)$ does not change as we compute the posterior probability across different classes, we can remove it from the formula, as it does not give us any distinguishing information. The simplified formula then becomes:

$\hat{c} = \underset{c \in C}{\operatorname{argmax}}\ P(x \mid c)\, P(c)$

Let's illustrate this concept with an example. Going back to our university setting, suppose that in another batch of 100 students there are 50 math majors, out of which 10 are introverted; 30 business majors, out of which 9 are introverted; and 20 economics majors, out of which 12 are introverted. The question here is: given that a student is an introvert, what is their most likely major? Using the jargon of the above formula, our classes are now math, business, and economics. Let's compute the (unnormalized) posterior for these classes one at a time:

$P(\text{Math} \mid \text{Introvert}) \propto P(\text{Introvert} \mid \text{Math})\, P(\text{Math}) = \frac{10}{50} \times \frac{50}{100} = 0.10$

$P(\text{Business} \mid \text{Introvert}) \propto P(\text{Introvert} \mid \text{Business})\, P(\text{Business}) = \frac{9}{30} \times \frac{30}{100} = 0.09$

$P(\text{Economics} \mid \text{Introvert}) \propto P(\text{Introvert} \mid \text{Economics})\, P(\text{Economics}) = \frac{12}{20} \times \frac{20}{100} = 0.12$

Since the posterior probability for economics is larger than the rest, the introverted student is most likely to be an economics major.
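The same decision rule takes only a few lines of Python; the dictionary below simply mirrors the counts from the example above:

# (major: (number of students, number of introverts)) for a batch of 100 students
counts = {'Math': (50, 10), 'Business': (30, 9), 'Economics': (20, 12)}
total = sum(n for n, _ in counts.values())

# Unnormalized posterior: P(introvert | major) * P(major)
posteriors = {major: round((intro / n) * (n / total), 4)
              for major, (n, intro) in counts.items()}
print(posteriors)                            # {'Math': 0.1, 'Business': 0.09, 'Economics': 0.12}
print(max(posteriors, key=posteriors.get))   # Economics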

Naive Bayes in Action

We will now illustrate how to utilize the Naive Bayes classifier for text sentiment analysis. Suppose you are given the following training dataset of positive and negative movie reviews:

A dataset of movie reviews

How can we compute the probability of the data here? Previously, we had numbers, which made things easy, but now we only have words. A simple remedy to this situation is counting: depending on the label, we can keep separate counts of word frequencies. Here's an illustration below:

Counts of words from both negative and positive reviews separately

Now, from this illustration, we compute the likelihoods $P(\text{word} \mid \text{class})$. These probabilities are shown in the table below:

Likelihood of words belonging to positive and negative class

Let's try to understand this table. For example, the first row tells us the probability of the word 'good' given that the review is positive, $P(\text{good} \mid +)$. However, since the word 'good' does not appear in the negative reviews, $P(\text{good} \mid -)$ is 0.

Now that we have our likelihoods, let's calculate the prior probabilities as well:

$P(+) = \frac{\text{number of positive reviews}}{\text{total number of reviews}}$

$P(-) = \frac{\text{number of negative reviews}}{\text{total number of reviews}}$

These probabilities just represent the frequency of labels in the dataset.

Now that we have our entire toolset, let's use our Naive Bayes model to classify the following unseen review: "great fantastic acting".

Note: Although the words 'great' and 'acting' are present in our vocabulary, the word 'fantastic' does not appear in the training data for either class. Since we don't have any probabilities for it, what do we do with it? A simple strategy for out-of-vocabulary words is to drop them from the test data point. Therefore, our test data point now becomes "great acting".

Note: Although the word 'acting' is present in both classes, the word 'great' appears only in the positive class. Since $P(\text{great} \mid -)$ is $0$, it will zero out the entire posterior probability for the negative class. To deal with such zeros, we can employ one of many techniques; however, we will focus on add-1 (Laplace) smoothing. Simply put, we add an extra count of 1 for each vocabulary word in both classes. Our counts become:

Counts of words with add-1 smoothing

This will modify our likelihoods as follows:

Probabilities with add-1 smoothing

Now that we no longer have zero likelihoods or out of vocabulary words, we can calculate the posterior probability for our test point "great acting":

$P(+ \mid \text{great acting}) \propto P(\text{great} \mid +)\, P(\text{acting} \mid +)\, P(+)$

$P(- \mid \text{great acting}) \propto P(\text{great} \mid -)\, P(\text{acting} \mid -)\, P(-)$

Therefore, the predicted class for the review "great fantastic acting" by a Naive Bayes model will be positive.
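To tie the hand calculation together, below is a compact from-scratch sketch of the same procedure. The tiny training set is a hypothetical stand-in (the original table is shown as an image), so the exact probabilities will differ, but the mechanics are the same: word counts, class priors, add-1 smoothing, dropping out-of-vocabulary words, and comparing posteriors.

from collections import Counter

# Hypothetical training data (a stand-in for the review table above)
train = [('good great acting', 'positive'),
         ('great movie', 'positive'),
         ('boring bad acting', 'negative'),
         ('bad plot', 'negative')]

# Word counts per class and class counts for the priors
word_counts = {'positive': Counter(), 'negative': Counter()}
class_counts = Counter()
for review, label in train:
    word_counts[label].update(review.split())
    class_counts[label] += 1

vocab = set(word for counts in word_counts.values() for word in counts)

def posterior(review, label):
    # P(label) times the product of P(word | label) with add-1 smoothing;
    # out-of-vocabulary words are simply dropped
    prob = class_counts[label] / sum(class_counts.values())
    total_words = sum(word_counts[label].values())
    for word in review.split():
        if word in vocab:
            prob *= (word_counts[label][word] + 1) / (total_words + len(vocab))
    return prob

test = 'great fantastic acting'
scores = {label: posterior(test, label) for label in word_counts}
print(max(scores, key=scores.get))   # positive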

Implementing Naive Bayes using Python

For this part, we will work with a synthetic movie review dataset and implement the Naive Bayes algorithm using the Sklearn library to classify an unseen review as positive or negative. Below is the list of libraries we will be using for our implementation:

#MSSQLTips.com
import pandas as pd
import random
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

Now, let's get started.

#MSSQLTips.com
df_reviews.head()
Movie review dataset

Before we can think about designing a classification algorithm, we first need to prepare this dataset. Our first goal is to clean the dataset of any punctuation and stop words and to standardize it. The second goal is to generate counts of the words in our vocabulary.

Let's first design a function to clean and standardize the dataset.

#MSSQLTips.com
def clean_text(text):
    # Lowercase the review and split it into individual words
    text = text.lower()
    text = text.split()
    # Remove stop words
    text = [word for word in text if word not in stop_words]

    # Strip any non-alphabetic characters from each word
    for each_word in range(len(text)):
        text[each_word] = re.sub(r'[^a-zA-Z\s]', '', text[each_word])

    cleaned_sentence = " ".join(text)

    # Collapse any extra whitespace left behind by the cleaning
    return " ".join(cleaned_sentence.split())

In this function, we first make the text lower case. Then we split the sentence into a list of words and remove the stop words from this list. Non-alphabetic characters are then removed from each word, before the words are rejoined to form the cleaned sentence. We can now apply this function to our reviews' text.
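Before applying it to the whole column, we can sanity-check the function on a single made-up review (the example sentence is mine):

sample = "The acting was GREAT, and the plot was not boring!!!"
print(clean_text(sample))   # acting great plot boring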

#MSSQLTips.com
df_reviews['Review'] = df_reviews['Review'].apply(clean_text)
df_reviews.head()
Cleaned and prepared dataset

Before we vectorize our dataset to generate counts, let's first split our dataset into train and test sets. Since our dataset was shuffled when it was generated, we simply reserve the first 90 rows for our training dataset and keep the remaining 10 rows as the test dataset. We can do this by slicing the data frame and then resetting the index.

#MSSQLTips.com
df_train = df_reviews[:90].reset_index(drop=True)
df_test = df_reviews[-10:].reset_index(drop=True)

Let's now observe the class distribution across both of the datasets.

#MSSQLTips.com
df_train['Sentiment'].value_counts()
Class distribution in the training dataset
#MSSQLTips.com
df_test['Sentiment'].value_counts()
Class distribution in the testing dataset

Let's also turn our attention towards the dataset labels. We need to convert them from their textual form to numerical form so that they can be processed by the algorithm that will be used later. Following convention, we can use '1' for a positive review and '0' for a negative review.

#MSSQLTips.com
df_train['Sentiment'] = df_train['Sentiment'].replace({'Positive': 1, 'Negative': 0})
df_test['Sentiment'] = df_test['Sentiment'].replace({'Positive': 1, 'Negative': 0})
train_labels = df_train['Sentiment'].to_numpy()
test_labels = df_test['Sentiment'].to_numpy()

Now, we can move onto vectorizing our dataset.

#MSSQLTips.com
vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(df_train['Review'])
test_counts = vectorizer.transform(df_test['Review'])

We have used Sklearn's CountVectorizer; its documentation is available here. We can also print the shape of the returned objects to learn the vocabulary size of our dataset. It is 301 in our case, as shown below.

#MSSQLTips.com
print(train_counts.shape)
print(test_counts.shape)
Dimensions of the vectorized dataset
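If you want to peek at the learned vocabulary itself, recent versions of Sklearn expose it through get_feature_names_out() (older versions name it get_feature_names()); here is a quick look, assuming the vectorizer fitted above:

# Inspect the vocabulary learned by CountVectorizer
vocabulary = vectorizer.get_feature_names_out()
print(len(vocabulary))    # 301, matching the second dimension printed above
print(vocabulary[:10])    # the first few words, in alphabetical order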

Now, all that is left is to fit our data to the model and let it do the work. We already know most of the theory behind the inner workings of a Naive Bayes model; therefore, it is no longer a black-box model for us. Again, for ease and convenience, we will use Sklearn's Multinomial Naive Bayes classifier. Its documentation is available here.

#MSSQLTips.com
model = MultinomialNB()
model.fit(train_counts, train_labels)
pred = model.predict(test_counts)

Since the dataset is small, a thorough evaluation would not be very meaningful. Nonetheless, we should still observe our model's predictions, as outlined below.

#MSSQLTips.com
prediction = ['Negative' if label == 0 else 'Positive' for label in pred]
for index, row in df_test.iterrows():
   print(f"Row {index}:Prediction = {prediction[index]}, Review = {row['Review']}")
Reviews and their corresponding predictions by our model

Despite how minuscule our dataset was, our Naive Bayes model performed exceptionally well, with only two misclassifications. This amounts to 80% classification accuracy, which is impressive for a model this simple and a dataset this tiny!
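If you prefer a single number, Sklearn's accuracy_score can compare the predictions against the true labels; a quick check using the variables defined above:

from sklearn.metrics import accuracy_score

# Compare the predicted labels with the true test labels
print(accuracy_score(test_labels, pred))   # 0.8, i.e., two misclassifications out of ten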

Naive Bayes in the Real World

Although we have seen our model perform well beyond our expectations, it is time to ground ourselves and evaluate the Naive Bayes classifier itself in terms of its advantages and limitations. Most importantly, there is a big reason why our classifier is termed 'naive'. Remember how we calculated the posterior probabilities? As a demonstration, for a movie review "great movie loved it", we calculate the posterior probability for the positive class as follows:

$P(+ \mid \text{great movie loved it}) \propto P(\text{great} \mid +)\, P(\text{movie} \mid +)\, P(\text{loved} \mid +)\, P(\text{it} \mid +)\, P(+)$

There is, however, a strict underlying mathematical assumption behind computing the posterior in such a way. The 'naive' assumption our model makes is conditional independence: given a certain class $c$, the individual feature probabilities (here, the word probabilities) are assumed to be independent of one another. If we don't make this assumption, the chain rule of probability gives us the following computation instead:

$P(+ \mid \text{great movie loved it}) \propto P(\text{great, movie, loved, it} \mid +)\, P(+)$

$P(\text{great, movie, loved, it} \mid +) = P(\text{great} \mid +)\, P(\text{movie} \mid \text{great}, +)\, P(\text{loved} \mid \text{great, movie}, +)\, P(\text{it} \mid \text{great, movie, loved}, +)$

We can now easily see why we make the conditional independence assumption; otherwise, the calculations become very convoluted. Unfortunately, it is rare for data to be conditionally independent in real life, which can make our model less applicable at times.

However, we have also seen that this model is very easy to implement, does not require a huge training dataset to yield good performance, and is relatively insensitive to irrelevant features. This is why Naive Bayes is readily used for text sentiment analysis, email spam detection, topic classification, and similar tasks.

Summary

In this tip, we introduced the Naive Bayes classifier. To ensure the prerequisites were met, readers were first familiarized with basic probability theory, including Bayes' Theorem, to provide the necessary foundation. We then discussed the model's algorithm in detail, alongside various examples. To demonstrate the practical side of things, we implemented the model using the Sklearn library in Python. We concluded our discussion by highlighting the pros, cons, and real-world applications of the Naive Bayes classifier.

Next Steps
  • For interested readers, further explore the different types of Naive Bayes models, such as the Gaussian, Categorical, Bernoulli, and Complement Naive Bayes models available in Sklearn. Evaluate where these models shine the most, what their outputs are, and what sorts of datasets they are suitable for.
  • Since the word regression was not mentioned anywhere in this tip, readers are advised to investigate whether it is even feasible to conduct regression with a Naive Bayes model, and why that is so.
  • Readers are also encouraged to research smoothing methods other than add-1 smoothing and analyze their strengths and limitations with respect to the problem at hand.
  • Check out more AI-related tips.



About the author
Harris Amjad is a BI Artist, developing complete data-driven operating systems from ETL to Data Visualization.

This author pledges the content of this article is based on professional experience and not AI generated.
