Overfitting Explanation
Overfitting occurs when a model is too complex: the training error is small, but the test error is large.
The two main reasons for overfitting are limited training-set size and high model capacity.
Overfitting results in decision trees that are more complex than necessary.
Increasing the size of the training data reduces the difference between training and testing errors for a model of a given size,
so adding training data is one way to overcome overfitting, as the sketch below illustrates.
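Below is a minimal sketch (not from the original project; the synthetic dataset and tree count are illustrative assumptions) that measures how the train/test accuracy gap shrinks as the training set grows, using scikit-learn's learning_curve:

    # Sketch: a fixed-capacity model evaluated on growing training sets.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import learning_curve

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    sizes, train_scores, test_scores = learning_curve(
        RandomForestClassifier(n_estimators=20, random_state=0),
        X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

    for n, tr, te in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
        # The train/test gap (a symptom of overfitting) narrows as n grows.
        print(f"n={n:5d}  train_acc={tr:.3f}  test_acc={te:.3f}  gap={tr - te:.3f}")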
Hyperparameter tuning
Hyperparameters are the settings tuned to get the best performance out of a classifier. For example,
the Random Forest classifier has n_estimators, which is the number of decision trees: with n_estimators = 10 the classifier had
low accuracy, but when n_estimators was increased to 20 (more trees) the classifier's accuracy improved. In the same way, the SVM was tried with C = 1, 10, and 100,
of which C = 100 gave the best accuracy. This is the basic idea of hyperparameter tuning; a sketch follows.
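A minimal sketch of this tuning with scikit-learn's GridSearchCV; the toy corpus and the exact grids are illustrative assumptions, not the project's actual data or settings:

    # Sketch: grid-search the hyperparameters mentioned above on toy data.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    texts = ["great movie", "loved it", "terrible film", "awful plot",
             "really enjoyable", "not good at all", "fantastic acting", "boring and bad"]
    labels = [1, 1, 0, 0, 1, 0, 1, 0]  # 1 = positive review, 0 = negative
    X = TfidfVectorizer().fit_transform(texts)

    rf_search = GridSearchCV(RandomForestClassifier(random_state=0),
                             param_grid={"n_estimators": [10, 20]}, cv=2)
    svm_search = GridSearchCV(SVC(), param_grid={"C": [1, 10, 100]}, cv=2)
    rf_search.fit(X, labels)
    svm_search.fit(X, labels)
    print("best RF params:", rf_search.best_params_)
    print("best SVM params:", svm_search.best_params_)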
Explanation of the Basic Algorithms
The classifiers used are: Multinomial Naive Bayes, Support Vector Machine (SVM), and Random Forest Classifier.
Multinomial Naive Bayes
Algorithm: it contains two basic steps, training the Multinomial NB and applying the Multinomial NB.
Training the Multinomial NB involves converting the dataset into a frequency table
and building a likelihood table from the word probabilities. The Naive Bayes equation is then applied to find the posterior probability of each class,
and the class with the highest posterior probability is the outcome. A sketch of both steps follows.
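A minimal sketch of the two steps using scikit-learn's MultinomialNB; the four training sentences are an illustrative assumption:

    # Step 1 (train): build a word-frequency table and fit the model,
    # which estimates class priors and per-class word likelihoods.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train_texts = ["good product", "excellent quality", "bad service", "poor quality"]
    train_labels = ["positive", "positive", "negative", "negative"]
    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train_texts)
    nb = MultinomialNB().fit(X_train, train_labels)

    # Step 2 (apply): compute the posterior probability of each class for a
    # new document; the class with the highest posterior is the outcome.
    X_new = vectorizer.transform(["good quality"])
    print(nb.predict_proba(X_new))  # posterior probability per class
    print(nb.predict(X_new))        # class with the highest posterior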
Support Vector Machine
Suppose there is a dataset containing two categories, red and blue, plotted on a 2-D plane; we can easily
segregate the data by drawing a line. When the data is spread over an n-dimensional space, the SVM separates the classes by drawing a hyperplane between
them. This hyperplane is also called the decision boundary. A sketch on toy 2-D points follows.
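A minimal sketch of the red/blue example above with scikit-learn's SVC; the coordinates are illustrative assumptions:

    # Sketch: fit a linear SVM on two 2-D clusters and classify new points.
    from sklearn.svm import SVC

    points = [[1, 1], [2, 1], [1, 2],   # "red" cluster
              [6, 6], [7, 6], [6, 7]]   # "blue" cluster
    colors = ["red", "red", "red", "blue", "blue", "blue"]

    # A linear kernel draws a separating hyperplane (a line in 2-D).
    clf = SVC(kernel="linear", C=100).fit(points, colors)
    print(clf.predict([[2, 2], [7, 7]]))  # side of the boundary each point falls on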
Random Forest Classifier
Random forest, like its name suggests, contains a large number of decision trees that operate as an ensemble.
Each individual tree in the random forest spits out a class prediction, and the class with the most votes becomes our model’s prediction.
Working (a sketch follows the steps):
Select a set of random samples from the dataset.
Create a decision tree for each selected sample.
Then get the predicted value from each tree.
Then, for each predicted result, voting is done.
The result with the highest number of votes is selected.
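A minimal sketch of these steps; the synthetic dataset is an illustrative assumption. scikit-learn's RandomForestClassifier handles the random sampling and tree building internally, and the per-tree votes can be inspected through estimators_:

    # Sketch: inspect the per-tree votes behind a forest's prediction.
    from collections import Counter
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

    sample = X[:1]
    votes = [int(tree.predict(sample)[0]) for tree in forest.estimators_]
    print("per-tree votes:", Counter(votes))
    # scikit-learn aggregates by averaging the trees' predicted probabilities
    # (soft voting), which usually matches the majority vote above.
    print("forest prediction:", forest.predict(sample))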
Contributions
My major contribution to this project was choosing the appropriate classifiers for text classification
and increasing the accuracy so that the machine would correctly predict whether a review is positive or not.
I did use some references in order to build the classifiers, but I also tuned some of their hyperparameters to increase their accuracies.
Overall I was able to push the accuracy from 66% to 85%, and the machine is almost pinpoint in deciding whether a review is positive or negative.
Challenges
My major challenge was choosing and building the appropriate classifiers. Data cleaning was not a major issue, but to choose the
right classifiers I had to learn how they function and then pick the right ones for this experiment.
To the best of my knowledge I have chosen the best ones, but there might be better classifiers as well.
Completing the project within a limited timeframe, due to other exams, was also one of the challenges I faced!
Overcoming the Challenges
To overcome these challenges I used some references, which eased my way in learning the classifiers and helped
me implement them. The previous assignments also helped; for example, Naive Bayes had been implemented in Assignment 3,
which carried over to this project.
Acknowledgement
I would like to thank Dr. Deok Gun Park for this wonderful session of data mining classes and this amazing opportunity to
apply the learning in the form of this project. It has added great value to my resume, and in the end it was a pleasure taking these classes.
Thank you once again, Dr. Park!
References
I have referred to code from the following references.
These references gave the project a good head start and helped me understand the concepts better.
I have also experimented with a few things from the referenced code, as mentioned under "Contributions".
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
https://stackabuse.com/text-classification-with-python-and-scikit-learn/
https://www.reddit.com/r/MachineLearning/comments/2uhhbh/difference_between_binomial_multinomial_and/co8xxls/
https://scikit-learn.org/stable/modules/naive_bayes.html
https://monkeylearn.com/text-classification-support-vector-machines-svm/
https://scikit-learn.org/stable/modules/svm.html
https://www.datacamp.com/community/tutorials/svm-classification-scikit-learn-python
https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
https://en.wikipedia.org/wiki/Random_forest
https://medium.com/datadriveninvestor/deploy-your-machine-learning-model-using-flask-made-easy-now-635d2f12c50c
https://www.tutorialspoint.com/after-method-in-python-tkinter
Links
Kaggle Notebook
YouTube video tutorial for GUI
My portfolio
Download Proposal