Analysis

Since our study use interactive map to visualize better, analysis methods here is simple and straightforward. We find that the velocity fits gaussian distribution well.A simple t.test() or aov() test will be good. Since we are considering a more detailed location based model. Multinomial Mixed Logistic model which is generally the best model for transportation decision question is not necessary here. We have the following function to make the selection process tidy and we will adopt data.table all the time to make the deployment effcient. Here is a sample function for analysis on our website. We can analysis result from any scale simply by cancel some key variables

test_function = function(month_in, week_in, data = test_dt,PU,DO, distance_range_low = 0, distance_range_up = 6){
  x = data[month %in% month_in][week %in% week_in][type == 'taxi'][trip_distance >= 1000*distance_range_low][trip_distance <= 1000*distance_range_up][PUZone == PU][DOZone == DO][,.(velocity)]
  x %>% hist()
  y = data[month %in% month_in][week %in% week_in][type == 'bike'][trip_distance >= 1000*distance_range_low][trip_distance <= 1000*distance_range_up][PUZone == PU][DOZone == DO][,.(velocity)]
  hist(as.numeric(as.data.frame(y)))
  z = data[month %in% month_in][week %in% week_in][type == 'bike'][trip_distance >= 1000*distance_range_low][trip_distance <= 1000*distance_range_up][PUZone == PU][DOZone == DO][,.(velocity)]
  hist(as.numeric(as.data.frame(z)))
}
test_function(month_in = c(1,6,7),week_in = c(1,2,3), PU = "Washington Heights South", DO = "Morningside Heights",
              distance_range_low = 0, distance_range_up = 20, data = rbind(test_dt,test_mta_final))

Afterwards we can extract the mean estimate and a p.value based on either t.test or ANOVA.

Regression

We wound like to address a regression analysis to given out a more precise description of how climate factors affect the velocity besides initial data exploration.

For bike data:

lm_velocity <-
  weather_df %>%
  filter(type == "bike") %>%
  mutate(month = as.factor(month)) %>% as.data.frame()

lm(velocity ~ trip_distance + awnd + prcp + snwd + tmax + tmin, data = lm_velocity) %>% summary()

## 
## Call:
## lm(formula = velocity ~ trip_distance + awnd + prcp + snwd + 
##     tmax + tmin, data = lm_velocity)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2692 -0.7067  0.0901  0.7926  3.8033 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.646e+00  1.676e-01  15.785   <2e-16 ***
## trip_distance  1.844e-04  1.642e-05  11.233   <2e-16 ***
## awnd           5.446e-03  3.686e-03   1.477   0.1398    
## prcp           9.348e-04  5.004e-04   1.868   0.0620 .  
## snwd          -8.173e-04  1.249e-03  -0.654   0.5130    
## tmax           5.846e-04  1.125e-03   0.520   0.6035    
## tmin          -2.337e-03  1.209e-03  -1.933   0.0534 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 1417 degrees of freedom
##   (60 observations deleted due to missingness)
## Multiple R-squared:  0.09777,    Adjusted R-squared:  0.09395 
## F-statistic: 25.59 on 6 and 1417 DF,  p-value: < 2.2e-16

We can discover that the regression coefficient for prcp,tmin,trip_distance is significant under 0.1 significance level.

The estimates are:

prcp: 9.348e-04
tmin: -2.337e-03
trip_distance: 1.844e-04

This is consistent with our initial data visualizaiton findings.

For taxi data:

lm_velocity <-
  weather_df %>%
  filter(type == "taxi") %>%
  mutate(month = as.factor(month)) %>% as.data.frame()
lm(velocity ~ trip_distance + awnd + prcp + snwd + tmax + tmin, data = lm_velocity) %>% summary()

## 
## Call:
## lm(formula = velocity ~ trip_distance + awnd + prcp + snwd + 
##     tmax + tmin, data = lm_velocity)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9772 -1.1170 -0.3150  0.7746  8.4551 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.753e+00  2.162e-01  17.360   <2e-16 ***
## trip_distance  4.114e-04  1.937e-05  21.235   <2e-16 ***
## awnd           1.084e-03  5.103e-03   0.212    0.832    
## prcp          -1.055e-04  6.255e-04  -0.169    0.866    
## snwd           1.690e-04  1.011e-03   0.167    0.867    
## tmax           1.163e-03  1.493e-03   0.779    0.436    
## tmin          -1.166e-03  1.739e-03  -0.671    0.503    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.745 on 1447 degrees of freedom
##   (26 observations deleted due to missingness)
## Multiple R-squared:  0.2389, Adjusted R-squared:  0.2357 
## F-statistic: 75.69 on 6 and 1447 DF,  p-value: < 2.2e-16

We can see that only trip_distance is significant, which is consistent with our prior knowledge and the data exploration before.