Source code and outputs:

First we import all the required libraries.
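The exact imports depend on the notebook, but a typical set for this kind of pipeline (assuming pandas and scikit-learn, which the references use) might look like:

```python
# pandas for data handling; scikit-learn for splitting, vectorization,
# the three classifiers used below, and the evaluation metrics
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, mean_squared_error
```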




We load the Data

We split the data into training data, development data, and testing data.
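A minimal sketch of such a three-way split, assuming a pandas DataFrame and scikit-learn's train_test_split (the column names, toy data, and split ratios are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataframe with a review text column and a sentiment label
df = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "sentiment": ["positive", "negative"] * 5,
})

# First carve out the test set, then split the remainder into train/dev
train_dev, test = train_test_split(df, test_size=0.2, random_state=42)
train, dev = train_test_split(train_dev, test_size=0.25, random_state=42)
print(len(train), len(dev), len(test))  # 6 2 2
```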

Now we print the data to verify the loading process.

Here we verify that the loaded data is split properly into train, development, and test data.




Now we add a Column: Sentiment_id

We add this column and assign positive reviews the value 1 and negative reviews the value 0. We do this because a lot of the plotting and visualization requires numerical data.
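A minimal sketch of adding such a numeric column with pandas `map` (the column and label names are assumed from the text):

```python
import pandas as pd

df = pd.DataFrame({"review": ["great movie", "terrible plot"],
                   "sentiment": ["positive", "negative"]})

# Map the text label to a numeric column for plotting and metrics
df["sentiment_id"] = df["sentiment"].map({"positive": 1, "negative": 0})
print(df["sentiment_id"].tolist())  # [1, 0]
```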

Now we print the data again.

Here we print the data once again to make sure the sentiment_id column has been added in all of the train, development, and test data.



Data Visualization

Now we shall do some visualization to check the rows and columns of each type of data, and we shall also compare the number of positive and negative reviews in each data set.



We shall check the number of records in the train, test, and development data.



Data Cleaning

In order to obtain the best accuracy we need to clean the data. In this process we remove the duplicates and null values from the data. I did not have to remove special characters, as the data did not contain any.




Removing the null values and converting the data to lower case.

This step is essential because the classifier may otherwise treat the upper case "AMAZING" and the lower case "amazing" as two different words.
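The cleaning steps described above (dropping nulls and duplicates, lower-casing) can be sketched with pandas; the toy data here is illustrative:

```python
import pandas as pd

df = pd.DataFrame({"review": ["AMAZING film", "AMAZING film", None, "Dull ending"],
                   "sentiment": ["positive", "positive", "negative", "negative"]})

df = df.dropna(subset=["review"])           # remove null reviews
df = df.drop_duplicates(subset=["review"])  # remove duplicate reviews
df["review"] = df["review"].str.lower()     # "AMAZING" and "amazing" become one token
print(df["review"].tolist())  # ['amazing film', 'dull ending']
```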

Now we check the rating count

    We check the number of positive and negative reviews in the train data.

Creating X and y for train, test, and dev

To build classifiers (to train and predict) we will need X and y for each dataset.
The term frequency is calculated as:
TF(word) = (number of occurrences of the word) / (total words in the record)
And the inverse document frequency is calculated as:
IDF(word) = log((total number of records) / (number of records containing the word))
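The TF-IDF weighting can be computed with scikit-learn's TfidfVectorizer (note that scikit-learn uses a slightly smoothed variant of the IDF formula above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good movie", "bad movie", "good plot"]
vec = TfidfVectorizer()          # computes TF-IDF (with sklearn's smoothed IDF)
X = vec.fit_transform(docs)      # sparse matrix: one row per record
print(sorted(vec.vocabulary_))   # ['bad', 'good', 'movie', 'plot']
print(X.shape)                   # (3, 4)
```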




Above we have made a dictionary to store the results of the classifiers.

Building the Classifiers

We will build and compare multiple classifiers which may be useful for text classification:

1) Naive Bayes Classifier
2) SVM (Support Vector Machine) Classifier
3) Random Forest Classifier

There are various other text classifiers, but for this experiment we shall use the above three.



Naive Bayes Classifier

The naive Bayes method is one of the best methods for text classification and text analysis in machine learning. Naive Bayes is not a single algorithm; it is a family of algorithms. Some of those algorithms are:

Binomial naive Bayes: useful for features that are binary, i.e. 0 and 1.

Multinomial naive Bayes: the most suitable algorithm for text classification.

Gaussian naive Bayes: ideal for features that are continuous or that follow a normal distribution.

For our experiment we use Multinomial Naive Bayes with smoothing.
We calculate both the accuracy and the mean squared error, and later decide which metric to use to pick the best classifier.
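A minimal sketch of training Multinomial Naive Bayes with Laplace smoothing (alpha=1.0) and reporting both metrics; the toy data stands in for the real X and y:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, mean_squared_error

# Toy train/dev data standing in for the real splits
train_docs = ["good great fun", "bad awful boring", "great plot", "boring mess"]
train_y = [1, 0, 1, 0]
dev_docs = ["great fun", "awful mess"]
dev_y = [1, 0]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)
X_dev = vec.transform(dev_docs)

nb = MultinomialNB(alpha=1.0)  # alpha is the Laplace/Lidstone smoothing term
nb.fit(X_train, train_y)
pred = nb.predict(X_dev)

acc = accuracy_score(dev_y, pred)
mse = mean_squared_error(dev_y, pred)
print(acc, mse)  # 1.0 0.0 on this tiny separable example
```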

Support Vector Machine (SVM)

Amongst the many text classifiers, SVM is also one of the most efficient and fastest. According to the scikit-learn documentation:
For large datasets, consider using sklearn.svm.LinearSVC and not sklearn.svm.SVC. So, we will go with sklearn.svm.LinearSVC.
A support vector machine decides the best boundary between vectors that belong to different groups.
It can be applied to any kind of vector that encodes any kind of data. SVM decides where to draw the best line (or hyperplane) that divides the data efficiently; the space between each category and the hyperplane on both sides determines how well the classes are separated.
We will try SVM for C = 1, C = 100, and C = 1000.

C = 1

C = 100

C = 1000
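The C sweep described above can be sketched as follows (toy data; on the real data the three values will generally give different accuracies):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Toy train/dev data; the real notebook would reuse its X/y splits
train_docs = ["good great fun", "bad awful boring", "great plot", "boring mess"]
train_y = [1, 0, 1, 0]
dev_docs = ["great fun", "awful mess"]
dev_y = [1, 0]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)
X_dev = vec.transform(dev_docs)

accs = {}
for C in (1, 100, 1000):
    svm = LinearSVC(C=C)      # C controls the regularization strength
    svm.fit(X_train, train_y)
    accs[C] = accuracy_score(dev_y, svm.predict(X_dev))
    print(f"C={C}: dev accuracy {accs[C]:.2f}")
```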

Random Forest Classifier

A Random Forest classifier is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the accuracy and control overfitting.
In other words, random decision forests are an ensemble learning method for classification that builds decision trees at training time and outputs the class predicted by the majority of the individual trees. The number of trees in the forest (n_estimators) is considered one of the most important factors for getting better accuracy. We will use n_estimators = 10 and 50 to check the results.
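A sketch of comparing n_estimators values on toy data (RandomForestClassifier accepts the sparse TF-IDF matrix directly):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Toy train/dev data standing in for the real splits
train_docs = ["good great fun", "bad awful boring", "great plot", "boring mess"]
train_y = [1, 0, 1, 0]
dev_docs = ["great fun", "awful mess"]
dev_y = [1, 0]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)
X_dev = vec.transform(dev_docs)

rf_accs = {}
for n in (10, 50):
    rf = RandomForestClassifier(n_estimators=n, random_state=42)
    rf.fit(X_train, train_y)
    rf_accs[n] = accuracy_score(dev_y, rf.predict(X_dev))
    print(f"n_estimators={n}: dev accuracy {rf_accs[n]:.2f}")
```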



We can see that the Random Forest classifier with n_estimators = 20 has a better accuracy.

Comparing the accuracies

Here we compare the accuracies to find the best classifier. We could consider the mean squared error as well, but I have chosen accuracy, as our movie reviewer needs high accuracy to predict whether a review is positive or negative.
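With the results stored in a dictionary (as made earlier), picking the best classifier is a one-liner; the accuracy numbers here are purely illustrative, not the notebook's actual results:

```python
# Hypothetical dev-set accuracies collected for each classifier
results = {"MultinomialNB": 0.85, "LinearSVC": 0.83, "RandomForest": 0.79}

# The key with the largest accuracy wins
best = max(results, key=results.get)
print(best, results[best])  # MultinomialNB 0.85
```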



We can see that Multinomial Naive Bayes has the best accuracy.

Input field and test accuracy

Now we use the best classifier on the test data and find the accuracy. Once we have the accuracy, we build an input field that outputs whether the review is positive or negative.
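A minimal sketch of wrapping the trained classifier in a function that an input field (or GUI) could call; the training data and the helper name `classify` are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data standing in for the cleaned train set
train_docs = ["good great fun", "bad awful boring", "great plot", "boring mess"]
train_y = [1, 0, 1, 0]

vec = TfidfVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train_docs), train_y)

def classify(review):
    """Return 'positive' or 'negative' for one raw review string."""
    # lower-casing mirrors the cleaning step (TfidfVectorizer also lowercases)
    label = clf.predict(vec.transform([review.lower()]))[0]
    return "positive" if label == 1 else "negative"

print(classify("what a great movie"))  # positive
print(classify("boring and awful"))    # negative
```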

Output Samples

Graphic User Interface (GUI)

I have built a basic GUI that takes your review as input and tells us whether the review is positive or negative.







Overfitting Explanation

Overfitting is when a model is too complex: the training error is small but the test error is large. Two main reasons for overfitting are limited training size and high model capacity. Overfitting results in decision trees that are more complex than necessary. Increasing the size of the training data reduces the difference between training and testing errors at a given model size, and thus we can overcome overfitting.

Hyperparameter tuning

Hyperparameters are parameters used to obtain the best performance out of a classifier. For example, the Random Forest classifier has n_estimators, which is the number of decision trees: when n_estimators was 10 the classifier had a lower accuracy, but when n_estimators increased to 20 (more trees) the classifier's accuracy increased. In the same way, SVM has C = 1, 100, 1000, of which C = 100 has the best accuracy. This is the basic idea of hyperparameter tuning.

Explanation of basic algorithm

The classifiers used are: Multinomial Naive Bayes, Support Vector Machine (SVM), and Random Forest Classifier.


Multinomial Naive Bayes

Algorithm: it contains two basic steps, training the Multinomial NB and applying the Multinomial NB. Training the Multinomial NB contains the following steps:
Convert the dataset into a frequency table.
Create a likelihood table by finding the probabilities.
Use the naive Bayes equation to find the posterior probability of each class.
The class with the highest posterior probability will be the outcome.
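The steps above can be sketched as a toy calculation in plain Python, using Laplace (add-one) smoothing and equal class priors:

```python
import math
from collections import Counter

# Toy word counts per class (the "frequency table")
pos_words = Counter("good great fun great".split())
neg_words = Counter("bad awful boring".split())
vocab = set(pos_words) | set(neg_words)

def log_posterior(words, counts, prior):
    # log P(class) + sum of log P(word | class), with add-one smoothing
    total = sum(counts.values())
    score = math.log(prior)
    for w in words:
        score += math.log((counts[w] + 1) / (total + len(vocab)))
    return score

review = "great fun".split()
pos = log_posterior(review, pos_words, 0.5)
neg = log_posterior(review, neg_words, 0.5)
# The class with the highest posterior is the outcome
print("positive" if pos > neg else "negative")  # positive
```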

Support Vector Machine

Let's suppose there is a dataset that contains two categories, red and blue, plotted on a 2D plane; we can easily segregate the data by drawing a line. But when the data is spread over an n-dimensional space, SVM separates the data by drawing a hyperplane between the categories. This hyperplane is also called the decision boundary.

Random-Forest Classifier

Random forest, like its name suggests, contains a large number of decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction, and the class with the most votes becomes our model's prediction.
Working:
Select a set of random samples from the dataset.
Create a decision tree for each selected sample.
Then we get the predicted values from each tree created.
Then, for each predicted result, voting is done.
The result with the highest votes is selected.

Contributions

My major contribution to this would be choosing the appropriate classifiers required for text classification and increasing the accuracy so that the machine would appropriately predict whether a review is positive or not. I have used some references in order to build the classifiers, but I also tried tuning some hyperparameters to increase their accuracies. Overall I have been able to push the accuracy from 66% to 85%, and the machine is almost pinpoint in deciding whether a review is positive or negative.

Challenges

My major challenge was to choose and build the appropriate classifiers. Data cleaning was not a major issue, but to choose the right classifiers I had to learn how they function and thus pick the right ones for this experiment. To the best of my knowledge I have chosen the best ones, but there might be better classifiers as well. Completing the project within a limited timeframe, due to other exams, was also one of the challenges I faced!

Overcoming the Challenges

In order to overcome the challenges I used some references, which eased my way in learning the classifiers and also helped me implement them. The previous assignments helped me as well; for example, Naive Bayes had been implemented in Assignment 3 and thus helped me in this project.

Acknowledgement

I would like to thank Dr. Deok Gun Park for this wonderful session of data mining classes and this amazing opportunity to apply the learning in the form of this project. It has added great value to my resume. In the end, it was a pleasure taking up these classes. Thank you once again, Dr. Park!

References

I have referred to the code from the following references. These references gave a good head start to the project and helped me understand the concepts better. I have also tried to experiment with a few things from the referenced code, which is mentioned under "Contributions".

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
https://stackabuse.com/text-classification-with-python-and-scikit-learn/
https://www.reddit.com/r/MachineLearning/comments/2uhhbh/difference_between_binomial_multinomial_and/co8xxls/
https://scikit-learn.org/stable/modules/naive_bayes.html
https://monkeylearn.com/text-classification-support-vector-machines-svm/
https://scikit-learn.org/stable/modules/svm.html
https://www.datacamp.com/community/tutorials/svm-classification-scikit-learn-python
https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
https://en.wikipedia.org/wiki/Random_forest
https://medium.com/datadriveninvestor/deploy-your-machine-learning-model-using-flask-made-easy-now-635d2f12c50c
https://www.tutorialspoint.com/after-method-in-python-tkinter#:~:text=Tkinter%20is%20a%20python%20library%20to%20make%20GUIs.,function%20FuncName%20after%20the%20given%20delay%20in%20milisecond


Links

Kaggle Notebook
YouTube video tutorial for GUI
My portfolio
Download Proposal

Contact


Any queries? Contact me!

Email Id : Neelesh216@gmail.com

Address : 404 E Border st, Arlington, Texas 76010, USA

Phone : 682 375 1222