NJ Transit Tweet Analysis & Predictions: Part 1
NJ Transit is the third-largest commuter rail network in the USA. Eleven train lines are currently in service, covering 5,325 square miles across New Jersey, New York, and Pennsylvania (Philadelphia).
Over the past decades, negligence has left NJ Transit rail prone to failures such as train delays and train cancellations. Many people who commute daily with NJ Transit struggle to manage their travel time.
As a result, NJ Transit has become an unreliable and unpredictable service for commuters.
Because trains often fail to reach their destinations on time, NJ Transit communicates train status to commuters through its Twitter handles. There is an individual Twitter handle for each train line.
· @NJTRANSIT_NEC (Northeast Corridor)
· @NJTRANSIT_NJCL (North Jersey Coast Line)
· @NJTRANSIT_ME (Morris & Essex Line)
· @NJTRANSIT_MOBO (Montclair-Boonton Line)
· @NJTRANSIT_MBPJ (Main/Bergen County Line)
· @NJTRANSIT_PVL (Pascack Valley Line)
· @NJTRANSIT_RVL (Raritan Valley Line)
· @NJTRANSIT_ACRL (Atlantic City Rail Line)
Machine learning uses classifiers and other algorithms to model NJ Transit data, usually with the goal of predicting train status. These algorithms rely heavily on statistics and mathematical optimization.
Data Collection
NJ Transit publishes only aggregate data that simply labels trains as “on-time” or “late”. I collected and extracted a data set from the NJ Transit Twitter handles, as shown in Figure 1.
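As a rough illustration of how a status could be pulled out of tweet text, here is a minimal sketch that labels a tweet by keyword. The keyword patterns, the `label_tweet` helper, and the sample tweets are all assumptions made up for this example. They are not the actual extraction logic used for this data set, and real NJ Transit tweets may be worded differently.

```python
import re

# Assumed keyword patterns; real tweets may need a richer set of rules.
CANCEL_PATTERN = re.compile(r"\bcancel\w*", re.IGNORECASE)
LATE_PATTERN = re.compile(r"\b(?:late|delay\w*)\b", re.IGNORECASE)


def label_tweet(text):
    """Return 'cancelled', 'late', or 'other' for one tweet's text."""
    # Check cancellation first, since a cancelled train may also mention delays.
    if CANCEL_PATTERN.search(text):
        return "cancelled"
    if LATE_PATTERN.search(text):
        return "late"
    return "other"


# Made-up sample tweets, for illustration only.
samples = [
    "Train 1234 is cancelled due to equipment availability.",
    "Train 5678 is up to 20 min. late due to signal problems.",
    "Service is operating on or close to schedule.",
]
labels = [label_tweet(t) for t in samples]
```

Keyword matching like this is crude, but it is easy to inspect and gives a first set of labels to check by hand.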
Data Reduction:
After collecting the March 2019 data from the NJ Transit Twitter handles, I found that the data set was very large. Since the project aims to classify train status, I removed unnecessary fields from the data set, such as stop sequence, train start time, and train end station.
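The reduction step can be sketched in a few lines of plain Python. The column names below (`stop_sequence`, `train_start_time`, `train_end_station`) and the sample row are assumed stand-ins for the fields mentioned above. The real data set may name them differently.

```python
# Assumed column names standing in for the fields removed in this step.
UNNEEDED_COLUMNS = {"stop_sequence", "train_start_time", "train_end_station"}


def drop_columns(rows, unneeded=UNNEEDED_COLUMNS):
    """Return new rows (dicts) without the columns listed in `unneeded`."""
    return [
        {key: value for key, value in row.items() if key not in unneeded}
        for row in rows
    ]


# Made-up sample row, for illustration only.
rows = [
    {
        "train_id": "1234",
        "line": "NEC",
        "stop_sequence": 3,
        "train_start_time": "07:15",
        "train_end_station": "Trenton",
        "status": "late",
    },
]
reduced = drop_columns(rows)
```

The function builds new dictionaries rather than modifying the input, so the raw rows stay available for later checks.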
Data Cleaning:
One of the first steps in fixing errors in the NJ Transit data set was to find incomplete values and fill them in. I filled in missing and inappropriate values. For example, the “Evening” field is filled with false (0) or true (1), and the “isCancelled” field is filled with 1 or 0, as shown in Figure 2.
The final NJ Transit data set is saved at https://github.com/mhaskeshradha1/NJTransitTweetAnalysis
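A minimal sketch of this kind of cleaning is shown below. The `Evening` and `isCancelled` field names come from the data set. The set of accepted "true" spellings, the choice to fill missing values with 0, and the sample rows are assumptions for illustration only.

```python
# Assumed spellings that count as "true"; anything else becomes 0.
TRUE_VALUES = {"1", "true", "yes", "y"}


def to_flag(value):
    """Map a raw value to 1 or 0; missing or unrecognized values become 0 (assumed default)."""
    if value is None:
        return 0
    return 1 if str(value).strip().lower() in TRUE_VALUES else 0


def clean_row(row):
    """Return a copy of `row` with the two boolean fields normalized to 0/1."""
    cleaned = dict(row)
    for column in ("Evening", "isCancelled"):
        cleaned[column] = to_flag(row.get(column))
    return cleaned


# Made-up sample rows, for illustration only.
raw_rows = [
    {"train_id": "1234", "Evening": "True", "isCancelled": ""},
    {"train_id": "5678", "Evening": None, "isCancelled": "yes"},
]
cleaned_rows = [clean_row(r) for r in raw_rows]
```

Defaulting missing values to 0 is a simplification. If many values are missing, it may be safer to inspect those rows before filling them.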
Data Analysis
In the next part, I will focus on how train status can be classified as “Late” or “Cancelled” by applying machine learning classifiers and algorithms.
You can follow me on Medium for the next posts, and I welcome feedback.