NJ Transit Tweet Analysis & Predictions : Part 1

Padmajeet Mhaske
3 min read · May 16, 2019

NJ Transit operates the third-largest commuter rail network in the USA. Eleven train lines are currently in service, covering 5,325 square miles across the states of New Jersey, New York, and Pennsylvania (Philadelphia).

Over the past decades, because of negligence, NJ Transit rail has been going through various failures, such as train delays and train cancellations. Many people who commute daily with NJ Transit have difficulty managing their travel time.

Due to carelessness, NJ Transit is now a very unreliable and unpredictable service for commuters.

Because trains fail to reach their destinations on time, NJ Transit conveys train status to commuters via its Twitter handles. There is an individual Twitter handle for each train line.

· @NJTRANSIT_NEC (Northeast Corridor)

· @NJTRANSIT_NJCL (North Jersey Coast Line)

· @NJTRANSIT_ME (Morris & Essex Line)

· @NJTRANSIT_MOBO (Montclair-Boonton Line)

· @NJTRANSIT_MBPJ (Main/Bergen County Line)

· @NJTRANSIT_PVL (Pascack Valley Line)

· @NJTRANSIT_RVL (Raritan Valley Line)

· @NJTRANSIT_ACRL (Atlantic City Rail Line)

Machine learning uses classifiers and other algorithms to model NJ Transit service, usually with the goal of predicting a response such as train status. These algorithms are heavily based on statistics and mathematical optimization.

Data Collection

NJ Transit furnishes only summary data that simply classifies trains as “on-time” or “late”. I therefore collected and extracted a data set from the NJ Transit Twitter handles, as shown in Figure 1.

Figure 1 Tweet made by NJ Transit — NEC
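
To make the extraction step more concrete, here is a minimal Python sketch of how a status tweet could be turned into a labeled record. It is not the exact script used for this post. The sample tweet wording, the regular expressions, and the field names `train_id`, `delay_minutes`, and `isLate` are assumptions made for illustration; only the “isCancelled” label appears in the actual data set.

```python
import re

# Hypothetical tweet wording, made up for illustration; real tweets may differ.
SAMPLE_TWEETS = [
    "NEC train #3840, the 8:05am from Trenton, is up to 15 min. late due to signal problems.",
    "NEC train #3924, the 6:12pm from New York, is cancelled due to equipment availability.",
    "NEC rail service is operating on or close to schedule.",
]

TRAIN_RE = re.compile(r"train\s+#(\d+)", re.IGNORECASE)
DELAY_RE = re.compile(r"(\d+)\s*min", re.IGNORECASE)
LATE_RE = re.compile(r"\blate\b", re.IGNORECASE)
CANCEL_RE = re.compile(r"\bcancel", re.IGNORECASE)


def parse_tweet(text):
    """Extract a train number and a status label from one tweet (assumed format)."""
    train = TRAIN_RE.search(text)
    delay = DELAY_RE.search(text)
    is_cancelled = int(bool(CANCEL_RE.search(text)))
    # A cancelled train is not also counted as late.
    is_late = int(not is_cancelled and bool(LATE_RE.search(text)))
    return {
        "train_id": train.group(1) if train else None,
        "delay_minutes": int(delay.group(1)) if (delay and is_late) else None,
        "isLate": is_late,
        "isCancelled": is_cancelled,
    }


records = [parse_tweet(t) for t in SAMPLE_TWEETS]
for record in records:
    print(record)
```

Tweets that mention neither a delay nor a cancellation end up with both labels set to 0, so they can be filtered out or kept as an “on-time” class, depending on the goal.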

Data Reduction

After collecting the March 2019 data from the NJ Transit Twitter handles, I observed that the data set was very large. Because the aim of the project is to classify train status, I removed labels that are unnecessary for that task, such as stop sequence, train start time, and train end station.
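
The reduction step can be sketched in a few lines of Python using only the standard library. The column names below (`stop_sequence`, `train_start_time`, `train_end_station`, and the others) and the two sample rows are assumptions based on the labels mentioned above, not the actual headers or contents of the data set.

```python
import csv
import io

# Hypothetical raw rows; column names and values are illustrative assumptions.
RAW_CSV = """train_id,line,stop_sequence,train_start_time,train_end_station,Evening,isCancelled
3840,NEC,4,08:05,New York,0,0
3924,NEC,7,18:12,Trenton,1,1
"""

# Labels that do not help classify train status (assumed names).
DROP_COLUMNS = {"stop_sequence", "train_start_time", "train_end_station"}


def reduce_rows(rows, drop=DROP_COLUMNS):
    """Return copies of the rows without the columns listed in `drop`."""
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]


reduced = reduce_rows(csv.DictReader(io.StringIO(RAW_CSV)))
print(list(reduced[0].keys()))
```

Dropping columns by name, rather than by position, keeps the step valid even if the column order in the export changes.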

Data Cleaning

One of the first steps I followed to fix errors in the NJ Transit data set was to find incomplete values and fill them in. I filled in missing and inappropriate values; for example, the “Evening” label was filled with false (0) or true (1), and the “isCancelled” label with 1 or 0, as shown in Table 1.

Table 1 Clean data set .csv file
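
As a rough illustration of the cleaning step, the sketch below maps the “Evening” and “isCancelled” labels to 0 or 1. The sets of accepted spellings and the choice to fill missing or unrecognized values with 0 are assumptions made for this example; the fill rule used for the real data set may differ.

```python
# Spellings treated as true or false; these sets are illustrative assumptions.
TRUE_VALUES = {"1", "true", "yes", "y"}
FALSE_VALUES = {"0", "false", "no", "n"}

BINARY_LABELS = ("Evening", "isCancelled")


def to_binary(value, default=0):
    """Map a raw cell to 1 or 0; missing or unrecognized values get `default`."""
    if value is None:
        return default
    text = str(value).strip().lower()
    if text in TRUE_VALUES:
        return 1
    if text in FALSE_VALUES:
        return 0
    return default  # assumption: anything else is treated like a missing value


def clean_row(row):
    """Return a copy of the row with the binary labels normalized to 0/1."""
    cleaned = dict(row)
    for label in BINARY_LABELS:
        cleaned[label] = to_binary(row.get(label))
    return cleaned


sample = {"train_id": "3924", "Evening": "TRUE", "isCancelled": ""}
print(clean_row(sample))
```

Keeping the default as a parameter makes it easy to try a different fill value, or to flag such rows for manual review instead.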

The final NJ Transit data set is saved at https://github.com/mhaskeshradha1/NJTransitTweetAnalysis

Data Analysis

Figure 2 Data analysis of all labels using Weka Explorer

In the next part, I will focus on how train status can be classified as “Late” or “Cancelled” by applying machine learning classifiers and algorithms.

You can follow me on Medium for the next posts, and I welcome feedback.

Written by Padmajeet Mhaske

Padmajeet is a seasoned leader in artificial intelligence and machine learning, currently serving as the VP and AI/ML Application Architect at JPMorgan Chase.