NJ Transit Tweet Analysis & Predictions : Part 2
After creating and finalizing the NJ Transit data set, I am now focusing on classifiers.
Naive Bayes Classifier
Naive Bayes is a probabilistic classifier based on Bayes' theorem under a simple assumption: the attributes are conditionally independent given the class.
Classification is conducted by deriving the maximum posterior, i.e., the class Ci that maximizes P(Ci|X), with the above assumption applied to Bayes' theorem. This assumption greatly reduces the computational cost, since only the per-class attribute distributions need to be counted. Even though the assumption is not valid in most cases, since the attributes are usually dependent, Naive Bayes is surprisingly able to perform impressively.
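As a rough illustration of this pipeline, here is a minimal sketch in scikit-learn, assuming the tweet-derived features have already been encoded numerically; the synthetic feature matrix, the class encoding, and the train/test split below are illustrative stand-ins, not the actual NJ Transit data or the Weka setup behind the results in this post.

```python
# Minimal Naive Bayes sketch on synthetic stand-in data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical numeric features (e.g., hour of day, line id, tweet count)
# and delay classes: 0 = "<20 min", 1 = "20-40 min", 2 = "40-60 min", 3 = "60+ min".
rng = np.random.default_rng(0)
X = rng.random((290, 3))
y = rng.integers(0, 4, size=290)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GaussianNB()          # treats attributes as conditionally independent given the class
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))  # correctly vs. incorrectly classified labels per class
print(accuracy_score(y_test, y_pred))    # fraction of correctly classified instances
```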
Figure 1 shows the confusion matrix, which presents the correctly classified and misclassified labels for NJ Transit trains that are late by 60+ minutes, less than 20 minutes, 20 to 40 minutes, and 40 to 60 minutes respectively.
Figure 2 shows that the correctly classified instances are 210 (72% accuracy) while the incorrectly classified instances are 80 (27%).
Bayes Net Classifier (Probabilistic model)
Bayesian networks — also known as “belief networks” or “causal networks” — are graphical models for representing multivariate probability distributions. Each variable Xi is represented as a vertex in a directed acyclic graph (“DAG”); the probability distribution is represented in factorized form as follows:
P(X1, …, Xn) = ∏i P(Xi | Pa(Xi))
where Pa(Xi) is the set of vertices that are Xi’s parents in the graph. A Bayesian network is fully specified by the combination of:
· The graph structure, i.e., what directed arcs exist in the graph.
· The probability table P(Xi | Pa(Xi)) for each variable Xi.
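To make the factorization concrete, here is a toy two-node network (Line → Delay) in plain Python; the structure and the probability tables are invented for illustration, not learned from the NJ Transit data.

```python
# Toy illustration of the factorization P(X1, ..., Xn) = prod_i P(Xi | Pa(Xi))
# on a hypothetical network Line -> Delay with hand-specified tables.
p_line = {"NEC": 0.6, "NJCL": 0.4}        # P(Line); Line has no parents
p_delay_given_line = {                    # P(Delay | Line)
    "NEC":  {"<20": 0.70, "20-40": 0.20, "40-60": 0.07, "60+": 0.03},
    "NJCL": {"<20": 0.50, "20-40": 0.30, "40-60": 0.15, "60+": 0.05},
}

def joint(line, delay):
    """Joint probability computed via the Bayesian-network factorization."""
    return p_line[line] * p_delay_given_line[line][delay]

# P(Line = NEC, Delay = 60+) = P(NEC) * P(60+ | NEC)
print(joint("NEC", "60+"))  # 0.6 * 0.03 = 0.018
```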
Figure 3 shows the confusion matrix, which presents the correctly classified and misclassified labels for NJ Transit trains that are late by 60+ minutes, less than 20 minutes, 20 to 40 minutes, and 40 to 60 minutes respectively.
Figure 4 shows that the correctly classified instances are 225 (77% accuracy) while the incorrectly classified instances are 65 (22%).
After applying the Bayes Net classifier, the misclassified label count is lower than with the Naive Bayes classifier, and the accuracy of the classified labels is higher.
Logistic Regression
Model
Output = 0 or 1
Hypothesis: Z = WX + B
hΘ(x) = sigmoid(Z) = 1 / (1 + e^(−Z))
If Z goes to infinity, Y (predicted) becomes 1, and if Z goes to negative infinity, Y (predicted) becomes 0.
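A quick numeric check of this limiting behavior, using the standard sigmoid definition (plain Python, illustrative values):

```python
# As Z -> +inf the sigmoid approaches 1; as Z -> -inf it approaches 0.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for z in (-10, -1, 0, 1, 10):
    print(z, round(sigmoid(z), 4))  # -> 0.0, 0.2689, 0.5, 0.7311, 1.0
```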
Figure 5 shows the confusion matrix, which presents the correctly classified and misclassified labels for NJ Transit trains that are late by 60+ minutes, less than 20 minutes, 20 to 40 minutes, and 40 to 60 minutes respectively.
Figure 6 shows that the correctly classified instances are 241 (83% accuracy) while the incorrectly classified instances are 49 (16%).
After applying the logistic regression classifier, the misclassified label count is lower than with the Naive Bayes classifier, and the accuracy of the classified labels is higher.
K Nearest Neighbor Classifier
After logistic regression, I applied the K-Nearest Neighbor classifier, which returns the majority vote of the labels of the K nearest training points, e.g., y(X1) = x, y(X2) = o.
A larger K yields smoother predictions, since we average over more data (a short code sketch follows the list below):
• K = 1 yields a piecewise-constant labeling
• K = N predicts the globally constant (majority) label
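A minimal K-Nearest Neighbor sketch in scikit-learn, with synthetic stand-in data and an illustrative K = 3 sitting between the two extremes above:

```python
# Minimal KNN sketch; data and K are illustrative, not the post's Weka setup.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X_train = rng.random((100, 2))
y_train = rng.integers(0, 4, size=100)       # four delay classes

knn = KNeighborsClassifier(n_neighbors=3)    # majority vote of the 3 nearest points
knn.fit(X_train, y_train)
print(knn.predict(rng.random((2, 2))))       # predicted labels for two new points
```

In practice K is a tuning parameter, typically chosen by cross-validation rather than fixed in advance.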
Figure 7 shows the confusion matrix, which presents the correctly classified and misclassified labels for NJ Transit trains that are late by 60+ minutes, less than 20 minutes, 20 to 40 minutes, and 40 to 60 minutes respectively.
Figure 8 shows that the correctly classified instances are 278 (95% accuracy) while the incorrectly classified instances are 12 (4%).
After applying the K-Nearest Neighbor classifier, the misclassified label count is lower than with the logistic regression classifier, and the accuracy of the classified labels is higher.
Across the classifiers above, accuracy increased steadily: 72% for Naive Bayes, 77% for Bayes Net, 83% for logistic regression, and 95% for K-Nearest Neighbors.
In the next part, I will discuss visualizing the data using the Weka Explorer.