Showcase and reviewing the different tools and technique within the tidyvert and timetk collection that allow user to be able to perform data wrangling, data decomposition as well as feature extraction for time series analysis
As mentioned in my earlier post, in this article we will be continuing our reviewing the current techniques used by timetk collection on how the data structure is being set up to perform data cleaning and wrangling after the extraction of dataset via web scrapping.
After which, we will be reviewing the ability to perform feature engineering to look at how time series features, decompositions, statistical summaries and convenient visualizations could be perform by timetk collection.
Similarly, we would first start with setting up the environment and installation of the packages required for data transformation using R. To ensure that we had cleared the environment to perform data manipulation, we would remove prior R object using the code below
Next, we would run the following code chunk to validate if the required packages are installed. In the event that the packages are not installed, the code will install the missing packages. Afterwhich, the code would read the required package library onto the current environment.
packages = c('tidyverse','tidyquant','tsibble','timetk','feasts','ggplot2','stats','lubridate'
,'data.table','rmarkdown','knitr','nycflights13')
for (p in packages) {
if(!require(p,character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
In this article, we will be using the same weather dataset from the nycflights13 package would be used and re-ingested to ensure that the comparison of the tidyvert and timetk collection is consistent.
The dataset is inclusive of 5 set of data points, namely airlines, airports, flights, planes and weather. Over here in this article, we will be using the weather dataset, where it contain weather related variable like temperature, humid and precipitation.
weather = read_csv("data/weather.csv")
paged_table(weather)
The timetk collection is a tidyverse toolkit to visualize, wrangle and transform time series data with much more interactive graphical visual. As mention in the post earlier here, we are able to understand and review the different function and tool that is within the timetk collection mianly for data visulization.
In this article I will looking into how data manipulation, time series decomposition and feature extraction could be done using timetk collection and how it is different from the tidyvert collection.
To begin our analysis, similarly as before we will start with data wrangling. We also be injecting the dataset to ensure that the dataset used had not been manipulated using the (collection. The timetk collection includes several essential data wrangling tools. In this article, we will be looking at the following functions:
Summarise by Time - Time-based aggregations
Filter by Time - Time-based filtering
Pad by Time - Filling in gaps
With summarise_by_time() we are able to aggregate the dataset with the required period using SUM() as well as period smoothing with either AVERAGE(), FIRST(),LAST() function. With the SUM() function we are able to compute the total humidity level of each week, as shown in the next line of code.
weather = weather %>%
select(origin, year, month, day, time_hour, temp, humid, precip)
weather %>%
group_by(origin) %>%
summarise_by_time(
time_hour, .by = "week",
humid = SUM(humid)
) %>%
plot_time_series(time_hour, humid, .facet_ncol = 2, .interactive = FALSE, .y_intercept = 0)
Period smoothing can also be computed where using AVERAGE() where we will be able to get the dataset average value that help in the smoothing of the weather dataset as shown. Similar technique could be done using FIRST() or MEDIAN().
weather %>%
group_by(origin) %>%
summarise_by_time(
time_hour, .by = "month",
adjusted = AVERAGE(humid)
) %>%
plot_time_series(time_hour, adjusted, .facet_ncol = 2, .interactive = FALSE, .y_intercept = 0)
In next part of data wrangling, we will looking at time filter to allow user to be able to select the dataset with their required time period using the filter_by_time() function. For the weather dataset, we would like to understand how the humidity level had change in the lower half of year 2013, by allocating the start_date to beginning of June and end_date to be the end of December we would be able to filter the required time period.
weather %>%
group_by(origin) %>%
filter_by_time(time_hour, "2013-06-01","2013-12-31") %>%
plot_time_series(time_hour, humid, .facet_ncol = 2, .interactive = FALSE, .y_intercept = 0)
Padding data would implies that we are able to fill in missing datapoint of the dataset which tidyvert collection had a similar concept as well. Using the padr package within the timetk collection, we are able to fill in missing datapoint easily.
Dataset that had been collected at times might not always be completed hence through pad_by_time() user will be able to fill in missing data to allow the irregular time series be regular. This is process is done through using pad_value where it would auto filling in the missing value/gaps of the weather dataset based on current dataset.
full_weather = weather %>%
pad_by_time(time_hour, .by = "auto")
head(full_weather)
# A tibble: 6 x 8
origin year month day time_hour temp humid precip
<chr> <dbl> <dbl> <dbl> <dttm> <dbl> <dbl> <dbl>
1 EWR 2013 1 1 2013-01-01 06:00:00 39.0 59.4 0
2 JFK 2013 1 1 2013-01-01 06:00:00 39.0 59.4 0
3 LGA 2013 1 1 2013-01-01 06:00:00 39.9 57.3 0
4 EWR 2013 1 1 2013-01-01 07:00:00 39.0 61.6 0
5 JFK 2013 1 1 2013-01-01 07:00:00 39.0 59.4 0
6 LGA 2013 1 1 2013-01-01 07:00:00 41 55.0 0
After data wrangling of the dataset, we will be looking into the various graphical feature on the ploting of time plot of the weather dataset. Where within the timetk collection, the plot_time_series() function is used to generates an interactive plotly chart by default.
Also, we are able to use the following two function to allow the time series plot to be much more interactive where setting .interactive = TRUE and .plotly_slider = TRUE allow the user to hover datapoint and use the date slider at the bottom of the chart.C
ompared to plot using tidyvert collection in feasts package, the figure plotted using timetk allow the user to better understand the dataset and filter the dataset according to their time series analysis needs.
full_weather %>%
plot_time_series(.date_var= time_hour, .value = humid,
.line_size = 0.2, .line_color = "#3879A3",
.title = "Humidity level of hourly meterological data",
.x_lab = "Time period",
.y_lab = "Humidity Level",
.interactive=TRUE, .plotly_slider = TRUE)
The next line of code would demonstrate how we are able grouped data base on their different origin of the weather dataset with group_by() before we plot using plot_time_series(). As the dataset consist of humidity level of different origin the time plot above could be rather confusing and does not allow user to be able to understand with its huge volatility. Having a group plot of the individual time plot of each airport, would enable user to have a better judgement.
Other than grouping we are also able to filter dataset to the time period of Q1 only using a combination of mutate() and filter(). Moreover, we are also able to apply different color to each of the week number using color_var within the plot_time_series() function. With the color_var the date column is transformed into lubridate::month() number.
full_weather %>%
group_by(origin) %>%
mutate(quarter= case_when(
month >= 1 & month <= 3 ~ 'Q1'
, month >= 4 & month <= 6 ~ 'Q2'
, month >= 7 & month <= 9 ~ 'Q3'
, month >= 10 & month <= 12 ~ 'Q4')) %>%
filter(quarter == "Q1") %>%
plot_time_series(.date_var= time_hour, .value = humid,
.color_var = month(time_hour),
#.facet_ncol = 2, .facet_scales = "free_x",
.title = "Humidity level of hourly meterological data bases on individual origin",
.interactive=TRUE)
From the above figure, we are able to notice that with the use of color_var we better able to identify a common trend of humidity level within the same time period of the three airport location. This further shows the important of color visualization to allow the analysis to be more effective.
Also the timetk collection allow the time series visualization to be converted from interactive plotly to static format of ggplot2 just by changing the interactive option from TRUE to FALSE as shown in the code below. From this user, is easily able to translate their result onto their report or presentation while having the ability to choose to have it interactive if required.
full_weather %>%
group_by(origin) %>%
mutate(quarter= case_when(
month >= 1 & month <= 3 ~ 'Q1'
, month >= 4 & month <= 6 ~ 'Q2'
, month >= 7 & month <= 9 ~ 'Q3'
, month >= 10 & month <= 12 ~ 'Q4')) %>%
filter(quarter == "Q1" & origin == "JFK") %>%
plot_time_series(.date_var= time_hour, .value = humid,
.color_var = month (time_hour),
.interactive=FALSE,
.facet_ncol = 2, .facet_scales = "free",
.title = "Humidity level of hourly meterological data bases on individual origin",
.x_lab = "Time period",
.y_lab = "Humidity Level",
.color_lab = "Month") + scale_y_continuous(labels = scales::comma_format())
As mentioned before, correlation analysis is done in time series where we will be using autocorrelation to measure the linear relationship between lagged values of a time series. As some of the past lags have predictive information that would be useful in our forecast of the predicted time series.
Measuring level of correlation between a series and its lags using the timetk package is performed using plot_acf_diagnostics(), where we are able to compute ACF and PACF with the desired lag time that user input.
ACF plot - Autocorrelation between a target variable and lagged versions of itself
PACF plot - Partial Autocorrelation removes the dependence of lags on other lags highlighting key seasonalities.
weather %>%
group_by(origin) %>%
plot_acf_diagnostics(
time_hour, humid, # ACF & PACF
.lags = "9 days", # 9-Days of hourly lags
.interactive = TRUE
)
The formula for the critical value is +1.96/sqrt(n) and -1.96/sqrt(n) where n is the length of the time series. This is to ensure the significance of the ACF plots and that for a white noise time series, 95% of the acf points should fall within this range.
As the figure plotted above is not the usual ACF plot, it is rather difficult to translate the ACF result to under at which lag k would the time series cut. We will be reploting the ACF value with the used of the full_weather_acf dataset that we had compuated using the tidyverse collection. After we had calculated the ACF value of each lag k and its critical value, we will then use ggplot2 package to plot out the common ACF plot for interpretation.
weather_tsbl = as_tsibble(full_weather, key = origin, index = time_hour)
weather_tsbl = weather_tsbl %>%
fill_gaps() %>%
group_by_key() %>%
tidyr::fill(temp, humid, .direction = "down")
full_weather_2013 = weather_tsbl %>%
filter(origin == "JFK") %>%
mutate(quarter= case_when(
month >= 1 & month <= 3 ~ 'Q1'
, month >= 4 & month <= 6 ~ 'Q2'
, month >= 7 & month <= 9 ~ 'Q3'
, month >= 10 & month <= 12 ~ 'Q4')) %>%
select(origin, month, temp, humid, precip)
full_weather_acf=full_weather_2013 %>% ACF(humid, lag_max = 36)
full_weather_acf_plot = full_weather_acf %>%
mutate(Upper_critical=1.96/sqrt(8730)) %>%
mutate(Lower_critical=-1.96/sqrt(8730))
head(full_weather_acf_plot)
# A tsibble: 6 x 5 [1h]
# Key: origin [1]
origin lag acf Upper_critical Lower_critical
<chr> <lag> <dbl> <dbl> <dbl>
1 JFK 1h 0.966 0.0210 -0.0210
2 JFK 2h 0.916 0.0210 -0.0210
3 JFK 3h 0.856 0.0210 -0.0210
4 JFK 4h 0.791 0.0210 -0.0210
5 JFK 5h 0.724 0.0210 -0.0210
6 JFK 6h 0.657 0.0210 -0.0210
Similar result as the tidyvert package is observed where the ACF plot shows that the dataset is not stationary and that it contain a seasonal pattern. The following code chunk, would display the how the ACF plot would be manually plot with the use of having line and bar plot in one figure.
ggplot()+
geom_bar(data=full_weather_acf_plot, aes(x = lag ,y = acf),
stat='identity',width = 0.01, color = "black") +
geom_line(data=full_weather_acf_plot, aes(x = lag ,y = Upper_critical),
stat='identity',width = 0.05, color = "blue",linetype = 2) +
geom_line(data=full_weather_acf_plot, aes(x = lag ,y = Lower_critical),
stat='identity',width = 0.05, color = "blue",linetype = 2) +
labs(title = 'ACF plot of humidity level at JFK '
,y = 'ACF value'
,x = 'Lag') +
theme(text = element_text(color = "#444444", family = 'Helvetica Neue')
,plot.title = element_text(size = 26, color = '#333333')
,plot.subtitle = element_text(size = 13)
,axis.title = element_text(size = 14, color = '#333333')
,axis.title.y = element_text(angle = 0, vjust = .5)
)
Beside the single and group time series visualization, timetk collection allow user to perform visualize the time series by seasonality (in hours, days, weeks or months).By using the plot_seasonal_diagnostics() function as shown in the code below, we are able to identify any seasonal trend or hidden pattern of the whole weather dataset or individual airport weather dataset as shown in the plot below respectively.
weather %>%
plot_seasonal_diagnostics(time_hour, humid, .interactive = FALSE)
weather %>%
group_by(origin) %>%
plot_seasonal_diagnostics(time_hour, humid, .interactive = FALSE)
Another important function of timetk would be the easiness to perform anomaly detection. Anomaly detection is where we identify data point that has unusual high/low value within the dataset. In other words, that data point is different from the rest.
In timetk, we are able to identify anomaly detection visualization using plot_anomaly_diagnostics() function and to split the dataset by year. The code below allow us to be to identify the anomaly detection for the humidity level of the weather dataset in year 2013. Hence, we can see from the figure below that the dataset do not contain anomalies datapoint as it did not flag out any red dots.
weather %>%
group_by(origin) %>%
filter(origin == "JFK") %>%
plot_anomaly_diagnostics(time_hour, humid, .interactive=TRUE)
Overall, with timetk we are able to perform time series analysis with a greater control in data visulization. This is especially so with the different interactive function that allow data filtering, data selection and colourization of the graphical visual to be much simpler. With those build-in function within the timetk collection, user is able to be focus in the building of report apporved time series graphical visulization.
Also, even though the data wrangling tool that timetk collection had is very similar to those in tidyvert collection. The advantage of the timetk is the ability to work in both traditional R dataframe as well as tsibble dataframe. This is especially useful for raw dataset that is rather messy and unstructured, to get a quick analysis we are able to select the required variable and perform a quick time series analysis with timetk
However, the one of the limitation of the timetk collection is the different from norm approach of ploting ACF plot using the function within timetk collection. This non-standardize approach might add confusion to user that are familar with the usual ACF plot as shown in the tidyvert collection.
Overall, both tidyvert and timetk collection had enhanced the whole time series analysis experience since the forcast package days. From a very statics approach in the past to a much more interactive graphical visual. This whole new experience, allow the user to have more control and ability to beautify their various time series plots and ehanced the whole coding experience with various build-in function.
As mentioned earlier, tidyvert collection is a much more suitable approach for a standardize tsibble dataframe format while timetk collection is useful for a quick analysis of any dataframe format.
In terms of time series visualization, timetk collection is much more interactive as it allow dataset to be filtered for selected time period with slider, ability to look into details by hovering data as well as the ability to color datapoint by time period which is a function that was really useful through my review.