Since our study use interactive map to visualize better, analysis methods here is simple and straightforward. We find that the velocity fits gaussian distribution well.A simple t.test()
or aov()
test will be good. Since we are considering a more detailed location based model. Multinomial Mixed Logistic model which is generally the best model for transportation decision question is not necessary here. We have the following function to make the selection process tidy and we will adopt data.table
all the time to make the deployment effcient. Here is a sample function for analysis on our website. We can analysis result from any scale simply by cancel some key variables
test_function = function(month_in, week_in, data = test_dt,PU,DO, distance_range_low = 0, distance_range_up = 6){
x = data[month %in% month_in][week %in% week_in][type == 'taxi'][trip_distance >= 1000*distance_range_low][trip_distance <= 1000*distance_range_up][PUZone == PU][DOZone == DO][,.(velocity)]
x %>% hist()
y = data[month %in% month_in][week %in% week_in][type == 'bike'][trip_distance >= 1000*distance_range_low][trip_distance <= 1000*distance_range_up][PUZone == PU][DOZone == DO][,.(velocity)]
hist(as.numeric(as.data.frame(y)))
z = data[month %in% month_in][week %in% week_in][type == 'bike'][trip_distance >= 1000*distance_range_low][trip_distance <= 1000*distance_range_up][PUZone == PU][DOZone == DO][,.(velocity)]
hist(as.numeric(as.data.frame(z)))
}
test_function(month_in = c(1,6,7),week_in = c(1,2,3), PU = "Washington Heights South", DO = "Morningside Heights",
distance_range_low = 0, distance_range_up = 20, data = rbind(test_dt,test_mta_final))
Afterwards we can extract the mean estimate and a p.value based on either t.test or ANOVA.
We wound like to address a regression analysis to given out a more precise description of how climate factors affect the velocity besides initial data exploration.
For bike data:
lm_velocity <-
weather_df %>%
filter(type == "bike") %>%
mutate(month = as.factor(month)) %>% as.data.frame()
lm(velocity ~ trip_distance + awnd + prcp + snwd + tmax + tmin, data = lm_velocity) %>% summary()
##
## Call:
## lm(formula = velocity ~ trip_distance + awnd + prcp + snwd +
## tmax + tmin, data = lm_velocity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2692 -0.7067 0.0901 0.7926 3.8033
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.646e+00 1.676e-01 15.785 <2e-16 ***
## trip_distance 1.844e-04 1.642e-05 11.233 <2e-16 ***
## awnd 5.446e-03 3.686e-03 1.477 0.1398
## prcp 9.348e-04 5.004e-04 1.868 0.0620 .
## snwd -8.173e-04 1.249e-03 -0.654 0.5130
## tmax 5.846e-04 1.125e-03 0.520 0.6035
## tmin -2.337e-03 1.209e-03 -1.933 0.0534 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 1417 degrees of freedom
## (60 observations deleted due to missingness)
## Multiple R-squared: 0.09777, Adjusted R-squared: 0.09395
## F-statistic: 25.59 on 6 and 1417 DF, p-value: < 2.2e-16
We can discover that the regression coefficient for prcp
,tmin
,trip_distance
is significant under 0.1 significance level.
The estimates are:
prcp
: 9.348e-04tmin
: -2.337e-03trip_distance
: 1.844e-04This is consistent with our initial data visualizaiton findings.
For taxi data:
lm_velocity <-
weather_df %>%
filter(type == "taxi") %>%
mutate(month = as.factor(month)) %>% as.data.frame()
lm(velocity ~ trip_distance + awnd + prcp + snwd + tmax + tmin, data = lm_velocity) %>% summary()
##
## Call:
## lm(formula = velocity ~ trip_distance + awnd + prcp + snwd +
## tmax + tmin, data = lm_velocity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9772 -1.1170 -0.3150 0.7746 8.4551
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.753e+00 2.162e-01 17.360 <2e-16 ***
## trip_distance 4.114e-04 1.937e-05 21.235 <2e-16 ***
## awnd 1.084e-03 5.103e-03 0.212 0.832
## prcp -1.055e-04 6.255e-04 -0.169 0.866
## snwd 1.690e-04 1.011e-03 0.167 0.867
## tmax 1.163e-03 1.493e-03 0.779 0.436
## tmin -1.166e-03 1.739e-03 -0.671 0.503
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.745 on 1447 degrees of freedom
## (26 observations deleted due to missingness)
## Multiple R-squared: 0.2389, Adjusted R-squared: 0.2357
## F-statistic: 75.69 on 6 and 1447 DF, p-value: < 2.2e-16
We can see that only trip_distance
is significant, which is consistent with our prior knowledge and the data exploration before.