Time Series Forecasting with Excel and R
Last updated
Last updated
We were allowed to pick the subject and data set of our interest, and I managed to convinced Henry to analyze seed fund in the US since tech startup always has my highest level of interest.
This image may not relate to this project at all. Source: www.biospace.com. All images, data and R Script can be found here
This is a project I did with my classmate He Liu in DSO 522 Applied Time Series Analysis for Forecasting class by professor Robertas Gabrys, USC.
Working capital is a critical problem for every startup. Starting a company is always an expensive venture, and most founders do not have the means to bootstrap or solely fund initial startup activities. At early stage, a common external funding type is seed capital, which is a type of equity-based funding in which an investor invests capital into a business during its early stages in exchange for equity stake. We notice an increase in the investment size for a funding deal, and we are sure founders care about the investment capacity of the market. Therefore, we decide to analyze the changes in the average seeding fund size for a startup in the US.
Our objective is to build a model that can best forecast this number
Our data source is Crunchbase.com, a website that crowd sources information about the fundraising of many startups around the world. Unfortunately, one had to pay a Pro membership to download the most recent data from Crunchbase. Therefore, we decided to use the data from 2006 to 2015, which is not up to date but can serve to train our models. We avoid using data for before 2006 period due to low quality and unstable data structure. We also notice this data set is biased and probably missing observation information, but we try to fit the best model and will reconsider its applicability for future data set.
Our data set has 22,791 observations from 2006 to 2015. We take the average of funding amount for each quarter and come up with the final data set for building models, which has 40 rows (40 quarters). Our response variable is “Average Fund”. Our training set includes the first 36 quarters, and our testing set includes our last 4 quarters. Explanatory variables include several economics indicators, which we adapt from the U.S. Census Bureau and the U.S. Bureau of Economic Analysis’s data.
Models
RMSE
MAPE
Holt-Winter’s Smoothing
188,027
15.2%
Time Series Regression with Quadratic Time
132,364
11.6%
Regression Against Economic Indicators
153,345
13.7%
ARIMA
171,804
15.1%
Of all models we tried, Time Series Regression with Quadratic Time has the lowest RMSE and MAPE. This model also shows the best performance on testing set with MAPE of 14%. From our models we can see that the average seed stage investment into startup companies are steadily increasing over time. The average early stage investment fund for startups are seeing an upward trend with small degrees of fluctuation. However, from our fitted models there are no strong quarterly seasonality effect on the amount funded.
We’re aware of the unsystematic nature of this data set. However, it is our high interest to see the trend and changes in seed fund size in the US. Fluctuations in the data may be caused by other variables that we were not able to capture. We had to limit our number of observations since data has multiple missing values and some extreme outliers. But using the models that we created, it is shown that startups looking for early stage fundraising do not need to time the market for the best entering date. Besides, we also notice that with 35 observations in our training set, the ARIMA model may not show the optimal result.
In this class, we tried to use Excel as much as possible. When it comes to the function that Excel inherently does not provide, we use R.
The raw data is available at "Raw Dataset" sheet in the excel file "Duong_Liu.xlsx". The data includes information of the startups and their seed investor(s), the amount and date of the deal.
For model 1 to 6, we created dummy variables for quarters, and added trend variable. For model 7, we added other economic indicators into the data. The first observation is likely to be an outlier; we found it after building Model 1, so we removed it from Model 2 forward.
Through visualization, we observe trend and seasonality in the data set, so we do not try MA and SES as our models. Note: In the first quarter of 2006, the $100M investment in SpaceX causes the first observation in our dataset to be an outlier.
To find all parameters for the method, we set up the predicting function on Excel and used Solver to find the optimal values.
RMSE
MAPE
Training
188,027
15.2%
Testing
706,675
15.0%
Smoothing-based methods is very adaptive, however its predictive power is not strong in this data set.
Model 1 include trend variable and quarter variables to capture seasonal behavior of the data. We notice a low R-square of 9%.
Model 2 (after removing the outlier) sees an improvement in R-square (73%). However, there's a pattern in residual plot, which implies quadratic variables may be needed. We add quadratic Trend in Model 3.
Model 3 sees further improvement in R-square (81%). The quadratic variable is significant. There's no sign of autocorrelation. However, trend variable is insignificant, we go ahead to remove this variable in Model 5 (Model 4 is a stupid atempt I don't really want to metion. Yes, stupidity happens). We also built ACF plot to see if autocorellation exists. The plots show no sign of autocorrelation.
After trying different transformations. We decided to go with this model as our final prediction model. R-square is acceptable, and there's no sign of autocorrelation.
RMSE
MAPE
Training
132364.3
11.6%
Testing
322935.5
14.1%
It is reasonable to consider the impact of overall economic landscape on the amount of seed fund received in the US. We added two variables Real Quarterly GDP and Quarterly Unemployment Rate together with seasonal dummy variables in an attempt to explain the response variable – Funding size.
RMSE
MAPE
Training
153,345
13.7%
Testing
384,997
16.8%
This regression model shows useful result with R-square of 73%. Both additional variables are significantly useful. However, we do not advocate this model since we need to make assumptions regarding future value of explanatory variables (GDP and Unemployment Rate). This additional uncertainty is unwelcomed.
We conduct ARIMA modelling automatically in R to pick the best model.
Import Data
From visualization of the seasonal graph we can see that quarterly seasonality is not very significant in explaining our model. Quarterly average fund being raised for each company may follow another seasonal pattern or could be influenced by other variables.
Build ARIMA Model
Testing data does not really fit onto our predicted values, lying above the confidence interval of our prediction. The fitted values seem not to fit our training data very well.
Residuals Analysis
Residuals do not show signs of autocorrelation. However, the residuals are not very normal looking. This may be due to having too little observations.
RMSE
MAPE
Training set
171804.5
15.1%
Test set
389669.8
16.6%