Showcasing and reviewing the different tools and techniques within the tidyverts and timetk collections that allow users to perform data wrangling, data decomposition and feature extraction for time series analysis
In this article, we will review and explore the timetk and tidyverts families of R packages, conducting an overall comparison of how the two collections support time series forecasting on a target dataset. The review is split into two articles: the first covers the differences in data manipulation and feature extraction between timetk and tidyverts, while the second reviews the forecasting methodology of each collection.
For part 1 of the series, the first section looks into the techniques used by the tidyverts collection: how the data structure is set up to perform data cleaning and wrangling after the dataset has been extracted via web scraping. After that, we will review its feature engineering capabilities to see how time series features, decompositions, statistical summaries and convenient visualisations can be produced with the tidyverts collection.
In the next section, we will look at the same concepts using the timetk collection instead. Lastly, we will compare both collections to analyse their similarities and differences, as well as the strengths and use cases of each.
We will first set up the environment and install the packages required for data transformation in R. To ensure a clean environment for data manipulation, we remove prior R objects using the code below.
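The environment-clearing chunk itself is not shown above; a typical version in base R (no tidyverts dependency) might look like:

```r
# Remove all objects from the current R environment
rm(list = ls())
```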
Next, we run the following code chunk to check whether the required packages are installed. Any missing packages are installed, after which the code loads the required package libraries into the current environment.
packages = c('tidyverse', 'tidyquant', 'tsibbledata', 'tsibble', 'feasts',
             'stats', 'lubridate', 'data.table', 'rmarkdown', 'knitr', 'nycflights13')
for (p in packages) {
  if (!require(p, character.only = T)) {
    install.packages(p)
  }
  library(p, character.only = T)
}
In this article, we will be using data from the nycflights13 package, which contains a collection of datasets related to the different airlines flying out of the New York City (NYC) airports.
The package includes 5 datasets, namely airlines, airports, flights, planes and weather. In this article we will use the weather dataset, which contains weather-related variables such as temperature, humidity and precipitation.
weather = read_csv("data/weather.csv")
paged_table(weather)
The tidyverts collection is adapted from the popular forecast R package, widely used in the past for time series forecasting with various classic time series models. The forecast package provides methods and tools for displaying and analysing time series forecasts, including exponential smoothing via state space models and automatic ARIMA modelling.
tidyverts is the new suite of packages for tidy time series analysis that integrates easily into the tidyverse way of working. Much of the work on the packages under the tidyverts collection has been done by Rob Hyndman, professor of statistics at Monash University, and his team. The collection was developed as a replacement for the very popular forecast package; as of today, the forecast package has retired in favour of the fable package, which is part of the tidyverts collection.
Packages under the tidyverts collection for data manipulation and feature extraction:
1. tsibble - works along with dplyr for data manipulation
2. feasts - tools for feature extraction, decomposition and statistical analysis of time series
In the next few sections of the article, we will review and discuss how the packages within the tidyverts collection work together to perform time series analysis and data manipulation.
The tsibble package provides an infrastructure for tidy temporal data (each row represents a state in time) along with wrangling tools. It is mainly used to transform a dataframe into the tsibble dataframe structure; adopting tidy data principles, the tsibble format is required by the rest of the tidyverts collection.
A tsibble has two defining properties: an index (the variable that describes time) and a key (one or more variables that identify each observational unit over time).
The following code demonstrates how the as_tsibble() function from the tsibble package transforms a dataframe into the tsibble format. To coerce a dataframe into a tsibble, the user needs to declare the key and index. The code is demonstrated using the weather dataset from the nycflights13 package mentioned earlier.
weather_tsbl = as_tsibble(weather, key = origin, index = time_hour)
head(weather_tsbl)
# A tsibble: 6 x 15 [1h] <UTC>
# Key: origin [1]
origin year month day hour temp dewp humid wind_dir wind_speed
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
# … with 5 more variables: wind_gust <dbl>, precip <dbl>,
# pressure <dbl>, visib <dbl>, time_hour <dttm>
Additionally, tsibble is able to turn implicit missing values into explicit missing values using fill_gaps(). The code below shows how fill_gaps() is used; combined with tidyr::fill(), each NA is then replaced with its previous observation, just as we would use fill() within tidyr on ordinary data.
full_weather = weather_tsbl %>%
  fill_gaps() %>%
  group_by_key() %>%
  tidyr::fill(temp, humid, .direction = "down")
head(full_weather)
# A tsibble: 6 x 15 [1h] <UTC>
# Key: origin [1]
# Groups: origin [1]
origin year month day hour temp dewp humid wind_dir wind_speed
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
# … with 5 more variables: wind_gust <dbl>, precip <dbl>,
# pressure <dbl>, visib <dbl>, time_hour <dttm>
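To illustrate what the downward fill above does conceptually, here is a small base-R sketch of last-observation-carried-forward (a simplified stand-in for tidyr::fill(); the function name locf is our own):

```r
# Carry the last non-missing observation forward (LOCF),
# mimicking tidyr::fill(..., .direction = "down")
locf <- function(x) {
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i])) x[i] <- x[i - 1]
  }
  x
}

locf(c(39.0, NA, NA, 39.9))  # the two gaps take the previous value 39.0
```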
Another feature of the tsibble package is the ability to aggregate over calendar periods, as demonstrated in the code chunk below using index_by() and summarise(). index_by() plays a similar role to group_by(), except that it groups the index only, while summarise() aggregates the selected variables over the time periods.
index_by() can also be used with index functions such as as.Date(), yearweek(), yearmonth() and yearquarter(), allowing us to compute quarterly/monthly/weekly aggregations.
full_weather_mth_agg = full_weather %>%
  group_by_key() %>%
  index_by(year_month = ~ yearmonth(.)) %>%
  summarise(
    avg_temp = mean(temp, na.rm = TRUE),
    sum_precip = sum(precip, na.rm = TRUE)
  )
head(full_weather_mth_agg)
# A tsibble: 6 x 4 [1M]
# Key: origin [1]
origin year_month avg_temp sum_precip
<chr> <mth> <dbl> <dbl>
1 EWR 2013 Jan 35.6 3.53
2 EWR 2013 Feb 34.2 3.83
3 EWR 2013 Mar 40.1 2.93
4 EWR 2013 Apr 52.9 1.54
5 EWR 2013 May 63.2 5.44
6 EWR 2013 Jun 73.3 8.73
Functions such as summarise(), group_by() and index_by() from the tsibble package take care of updating the key and index respectively. The combination of index_by() + summarise(), as shown in the code above, can also help regularise a tsibble with irregular time spacing.
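In base-R terms, the index_by() + summarise() aggregation above is a group-by-month summary, which can be sketched with aggregate() (simulated hourly data; all names here are ours):

```r
set.seed(123)
# 90 days of simulated hourly readings: Jan-Mar spans exactly 3 months
hours <- seq(as.POSIXct("2013-01-01 00:00", tz = "UTC"),
             by = "hour", length.out = 24 * 90)
df <- data.frame(time = hours,
                 temp = 40 + rnorm(length(hours), sd = 5))

# Aggregate hourly temperatures to a monthly average,
# like index_by(yearmonth) + summarise(mean(temp))
df$year_month <- format(df$time, "%Y-%m")
monthly <- aggregate(temp ~ year_month, data = df, FUN = mean)
monthly  # one row per month: 2013-01, 2013-02, 2013-03
```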
The feasts package within the tidyverts collection is mainly used for feature extraction and statistics for time series analysis, and provides a set of tools useful for analysing time series data.
Working with the tidy temporal data previously set up with the tsibble package, it can compute time series features, decompositions, statistical summaries and graphical visualisations. Feature extraction is useful for understanding the behaviour of time series data, and integrates closely with the tidy forecasting workflow used in the fable package.
To begin our analysis, we first plot a time plot using the autoplot() function. autoplot() automatically creates an appropriate plot of the chosen variable against time. In this case, it recognises the humidity level as a time series and produces the time plot shown below.
From the figure below, we can observe that the humidity level fluctuates with high volatility, which makes the time plot rather challenging to interpret. In the next few sections of the article, we will look at different analyses of the time plot to identify whether any trend, seasonal or cyclic patterns can be observed.
full_weather %>%
  autoplot(humid) +
  labs(title = "Humidity level of hourly meteorological data",
       subtitle = "mainly from LGA, JFK and EWR airport dataset",
       y = "Humidity level")
With the feasts package, the user can plot time plots for a given time period in the dataset. We can also use gg_season() and gg_subseries() to plot the seasonal plot and the changes in seasonality respectively. Without using the group_by() function, we can review the time plot of each individual airport using gg_season() and gg_subseries(). The code below shows how gg_season() plots a separate seasonal plot for each airport; a similar technique applies to gg_subseries() as well.
full_weather %>%
  gg_season(humid) +
  labs(title = "Humidity level of hourly meteorological data",
       subtitle = "Individual time plot for LGA, JFK and EWR airport dataset",
       y = "Humidity level")
The figure above shows the same data as the autoplot() time plot, but with the data points of each season overlapped. A seasonal plot allows the underlying seasonal pattern to be seen more clearly. This is especially useful for datasets spanning multiple years or locations, where it allows the user to analyse how the pattern changes over the years or across locations.
The time period of the plots produced by autoplot() and gg_season() can be further drilled into yearly, monthly or weekly views using the period argument of both functions, as shown in the following code.
full_weather %>%
  gg_season(humid, period = "week") +
  labs(title = "Humidity level of hourly meteorological data",
       subtitle = "Individual time plot for LGA, JFK and EWR airport dataset",
       y = "Humidity level")
From the figure above, we can observe that overall there is a similar seasonal trend for the EWR, JFK and LGA airports, with the humidity fluctuating over the same cycle within the same week. The period argument allows the user to perform time analysis much more effectively.
A lag plot is one approach to looking at the correlation of lagged observations (vertical axis) against the current observations, with points coloured by hour in a scatterplot format. The correlations in the lag plots produced below with gg_lag() are what make up the ACF: the stronger the correlation, the closer the scatterplot points lie to the dotted diagonal line.
full_weather_2013 = full_weather %>%
  filter(origin == "JFK") %>%
  mutate(quarter = case_when(
    month >= 1  & month <= 3  ~ 'Q1',
    month >= 4  & month <= 6  ~ 'Q2',
    month >= 7  & month <= 9  ~ 'Q3',
    month >= 10 & month <= 12 ~ 'Q4')) %>%
  select(origin, month, quarter, temp, humid, precip)
full_weather_2013 %>%
  gg_lag(humid, geom = "point") +
  labs(title = "Lag plot of the humidity level of hourly meteorological data",
       y = "Humidity level")
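The idea behind the lag plot can be sketched in base R using the built-in co2 series (a stand-in for our weather data): correlate a series against its own lagged copy.

```r
# Lag-1 relationship: y[t] against y[t-1], using the built-in co2 series
y <- as.numeric(co2)
n <- length(y)

# Correlation between the series and its one-step lag;
# a strongly trended series gives a value close to 1
lag1_cor <- cor(y[2:n], y[1:(n - 1)])
lag1_cor
```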
Correlation analysis is another measure we can compute within the feasts package. Correlation measures the extent of a relationship between two variables, while autocorrelation measures the linear relationship between lagged values of a time series.
There are several autocorrelation coefficients, corresponding to each panel in the lag plot: r1 measures the relationship between yt and yt-1, r2 measures the relationship between yt and yt-2, and so on. In general,

r_k = Σ_{t=k+1}^{T} (y_t − ȳ)(y_{t−k} − ȳ) / Σ_{t=1}^{T} (y_t − ȳ)²

where T is the length of the time series. The autocorrelation coefficients make up the autocorrelation function (ACF). ACF() within the feasts package computes the ACF value at each individual lag, as shown below using the weather data for JFK airport only. A similar technique applies to the partial autocorrelation function, PACF(), and the cross-correlation function, CCF(), as well.
full_weather_acf=full_weather_2013 %>% ACF(humid, lag_max = 36)
head(full_weather_acf)
# A tsibble: 6 x 3 [1h]
# Key: origin [1]
origin lag acf
<chr> <lag> <dbl>
1 JFK 1h 0.966
2 JFK 2h 0.916
3 JFK 3h 0.856
4 JFK 4h 0.791
5 JFK 5h 0.724
6 JFK 6h 0.657
The values in the acf column are r1, …, r9, corresponding to the nine scatterplots we plotted earlier in the lag plot section. Plotting the ACF lets us see how the correlations change with the lag k; this plot is also known as a correlogram.
full_weather_2013 %>%
  ACF(humid) %>%
  autoplot() +
  labs(title = "ACF plot of the humidity level of hourly meteorological data of JFK")
As shown in the ACF plot above, the humidity data for JFK airport does have a trend, which is consistent with our earlier analysis. This is further supported by the autocorrelations for small lags being large and positive, since observations near in time are also similar in value. As such, the ACF of a trended time series tends to have positive values that slowly decrease as the lags increase, which is exactly what the plot above shows.
When data are seasonal, the autocorrelations will be larger for the seasonal lags (at multiples of the seasonal period) than for other lags.
When data are both trended and seasonal, you see a combination of these effects. The data plotted in the figure above show both trend and seasonality in the ACF plot: the slow decrease in the ACF as the lags increase is due to the trend, while the "scalloped" shape is due to the seasonality.
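As a sanity check on the autocorrelation coefficient r_k discussed above, we can compute it by hand in base R and compare it with stats::acf() (simulated series; the variable names are ours):

```r
set.seed(42)
# A noisy seasonal-ish series standing in for the humidity data
y <- sin(seq(0, 10, length.out = 200)) + rnorm(200, sd = 0.2)
T_len <- length(y)
y_bar <- mean(y)

# r_k = sum_{t=k+1}^T (y_t - ybar)(y_{t-k} - ybar) / sum_t (y_t - ybar)^2
r_k <- function(k) {
  sum((y[(k + 1):T_len] - y_bar) * (y[1:(T_len - k)] - y_bar)) /
    sum((y - y_bar)^2)
}

manual  <- sapply(1:5, r_k)
builtin <- as.numeric(acf(y, lag.max = 5, plot = FALSE)$acf)[2:6]
all.equal(manual, builtin)  # the two computations agree
```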
The gg_tsdisplay() function within the feasts package allows the user to plot a time series along with its ACF and a customisable third graphic: either a PACF, histogram, lagged scatterplot or spectral density, chosen via the plot_type option of gg_tsdisplay().
The following code demonstrates how it is used; for our composite plot we keep the default plot_type = 'auto' to view the default composite layout. A similar technique applies to gg_tsresiduals(), which lets the user assemble plots for time series residuals.
full_weather_2013 %>%
  gg_tsdisplay(humid) +
  labs(title = "Summary time series analysis of the humidity level of hourly meteorological data of JFK airport",
       y = "Humidity level")
Another major concept in time series analysis is decomposition. Time series decomposition isolates structural components such as trend and seasonality from the main data, allowing the user to better analyse trend, seasonal and cyclic patterns.
When decomposing a time series, the three components we can extract are the trend-cycle, the seasonal component and the remainder (containing anything else in the time series). A time series can have more than one seasonal component, corresponding to different seasonal periods, which is why decomposition is important for time series analysis.
Within the feasts package, we will use the classical_decomposition() and STL() functions, which derive from decompose() and stl() in the stats package. The decomposition functions in the feasts package extract the seasonal/trend/cycle components from a time series.
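Since these functions wrap functionality from the stats package, the underlying idea can be shown with base decompose() on the built-in monthly co2 series (a stand-in for our weather data):

```r
# Classical additive decomposition: observed = trend + seasonal + random
dec <- decompose(co2)

# Each component is itself a time series of the same length as the input
str(dec$seasonal)
# plot(dec) would draw the observed series and the three components
```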
When decomposing a time series, it is useful to adjust the series or transform the data in order to make the decomposition as simple as possible. Some methods of adjustment include:
Calendar adjustment
Population adjustment
Inflation adjustment
Box-Cox transformation - the use of power and logarithm transformations, useful when the variation of the dataset increases or decreases with the level of the series.
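The Box-Cox transformation can be sketched in a few lines of base R (the function name box_cox here is our own; the tidyverts ecosystem ships an equivalent of the same form):

```r
# Box-Cox: log(y) when lambda == 0, else (y^lambda - 1) / lambda
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

box_cox(100, 0)    # lambda = 0 gives a pure log transform
box_cox(100, 0.5)  # lambda = 0.5 gives a square-root-like transform
```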
The STL() function in the feasts package uses a model-like formula interface, allowing you to control many aspects of the decomposition (for example, season(window = 5) allows the seasonality to change fairly quickly for quarterly data). The following code shows how STL() is used; in this article we filter the dataset down to JFK airport weather only for our time series decomposition.
full_weather_stl = full_weather %>%
  group_by(origin) %>%
  filter(origin == "JFK") %>%
  summarise(humid = sum(humid)) %>%
  model(STL(humid ~ season(window = 5))) %>%
  components()
head(full_weather_stl)
# A dable: 6 x 9 [1h] <UTC>
# Key: origin, .model [1]
# : humid = trend + season_week + season_day + remainder
origin .model time_hour humid trend season_day season_week
<chr> <chr> <dttm> <dbl> <dbl> <dbl> <dbl>
1 JFK STL(h… 2013-01-01 06:00:00 59.4 47.7 -0.596 5.82
2 JFK STL(h… 2013-01-01 07:00:00 59.4 47.8 0.898 4.87
3 JFK STL(h… 2013-01-01 08:00:00 59.5 47.9 3.53 2.57
4 JFK STL(h… 2013-01-01 09:00:00 62.2 48.0 4.76 2.48
5 JFK STL(h… 2013-01-01 10:00:00 61.6 48.1 6.50 2.49
6 JFK STL(h… 2013-01-01 11:00:00 64.3 48.1 7.62 2.42
# … with 2 more variables: remainder <dbl>, season_adjust <dbl>
Next, we will use autoplot() and gg_season() to plot time series graphics showing how each decomposed component varies over time. This also allows us to visualise the seasonality without the distraction of the trend and remainder terms. The code chunk below shows how autoplot() plots the decomposition of the JFK airport weather data into four components (trend, season_week, season_day and remainder). A similar technique applies to gg_season() as well.
full_weather_stl %>%
  autoplot() +
  labs(title = "Time series decomposition of the humidity level of hourly meteorological data of JFK airport",
       y = "Humidity level")
The four components (trend, season_week, season_day and remainder) of the time series decomposition are shown separately in the bottom four panels. These components can be added together to reconstruct the time plot shown in the top panel. We can observe that the seasonal component changes over time, with the weekly humidity level showing a more pronounced seasonal pattern than the daily period. The remainder component in the bottom panel is what is left after the seasonal and trend-cycle components have been removed from the data.
The grey bars at the left of each panel show the relative scales of the components. Each bar represents the same length, but because the plots are on different scales, the bars vary in size. The largest grey bar, in the trend panel, tells us that the variation in the trend component is the smallest compared with the other components; if we rescaled the other three components, they would all be on the same scale.
Within the feasts package, we can also conduct statistical analysis such as calculating the mean, maximum, and so on. Autocorrelations, as discussed earlier, are among the key features of a time series. In this part of the code, we look at how a numerical summary is calculated using the features() function. Here, we compute the average humidity level for each airport.
full_weather_mean = full_weather %>%
  features(humid, list(mean = mean)) %>%
  arrange(mean)
head(full_weather_mean)
# A tibble: 3 x 2
origin mean
<chr> <dbl>
1 LGA 59.3
2 EWR 63.0
3 JFK 65.2
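In base-R terms, features(humid, list(mean = mean)) performs a group-wise summary per key, which can be sketched with tapply() (toy data; the values below are ours, not the real weather figures):

```r
# Group-wise mean per airport, mirroring features() with a mean summary
df <- data.frame(
  origin = rep(c("EWR", "JFK", "LGA"), each = 4),
  humid  = c(60, 64, 62, 66,  63, 67, 65, 69,  55, 59, 57, 61)
)

# sort() orders the group means smallest-first, like arrange(mean)
sort(tapply(df$humid, df$origin, mean))
```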
Overall, the tidyverts collection is very useful for data manipulation, feature extraction and time series analysis. Having the dataset easily convertible from a standard dataframe to the tsibble format allows the remaining analysis to be conducted easily on a fixed, structured data format.
The time series graphics within the feasts package allow the user to manipulate the dataset easily according to their needs, following the dplyr package structure. Using the various time series plots, we can see the individual and combined seasonal, trend and cycle components that underpin the time series analysis. In the next article, we will discuss the timetk collection.