Time Series Forecasting with Excel and R

Forecasting Average Seed Fund Size for Startups in the US

We were allowed to pick the subject and data set of our interest, and I managed to convinced Henry to analyze seed fund in the US since tech startup always has my highest level of interest.

This image may not relate to this project at all. Source: www.biospace.com. All images, data and R Script can be found here

This is a project I did with my classmate He Liu in DSO 522 Applied Time Series Analysis for Forecasting class by professor Robertas Gabrys, USC.

Problem Statement

Working capital is a critical problem for every startup. Starting a company is always an expensive venture, and most founders do not have the means to bootstrap or solely fund initial startup activities. At early stage, a common external funding type is seed capital, which is a type of equity-based funding in which an investor invests capital into a business during its early stages in exchange for equity stake. We notice an increase in the investment size for a funding deal, and we are sure founders care about the investment capacity of the market. Therefore, we decide to analyze the changes in the average seeding fund size for a startup in the US.

Our objective is to build a model that can best forecast this number

Our data

Our data source is Crunchbase.com, a website that crowd sources information about the fundraising of many startups around the world. Unfortunately, one had to pay a Pro membership to download the most recent data from Crunchbase. Therefore, we decided to use the data from 2006 to 2015, which is not up to date but can serve to train our models. We avoid using data for before 2006 period due to low quality and unstable data structure. We also notice this data set is biased and probably missing observation information, but we try to fit the best model and will reconsider its applicability for future data set.

Our data set has 22,791 observations from 2006 to 2015. We take the average of funding amount for each quarter and come up with the final data set for building models, which has 40 rows (40 quarters). Our response variable is “Average Fund”. Our training set includes the first 36 quarters, and our testing set includes our last 4 quarters. Explanatory variables include several economics indicators, which we adapt from the U.S. Census Bureau and the U.S. Bureau of Economic Analysis’s data.

Our Model Prediction Performance

Models

RMSE

MAPE

Holt-Winter’s Smoothing

188,027

15.2%

Time Series Regression with Quadratic Time

132,364

11.6%

Regression Against Economic Indicators

153,345

13.7%

ARIMA

171,804

15.1%

Of all models we tried, Time Series Regression with Quadratic Time has the lowest RMSE and MAPE. This model also shows the best performance on testing set with MAPE of 14%. From our models we can see that the average seed stage investment into startup companies are steadily increasing over time. The average early stage investment fund for startups are seeing an upward trend with small degrees of fluctuation. However, from our fitted models there are no strong quarterly seasonality effect on the amount funded.

Consideration

We’re aware of the unsystematic nature of this data set. However, it is our high interest to see the trend and changes in seed fund size in the US. Fluctuations in the data may be caused by other variables that we were not able to capture. We had to limit our number of observations since data has multiple missing values and some extreme outliers. But using the models that we created, it is shown that startups looking for early stage fundraising do not need to time the market for the best entering date. Besides, we also notice that with 35 observations in our training set, the ARIMA model may not show the optimal result.

Steps

Data Preparation

In this class, we tried to use Excel as much as possible. When it comes to the function that Excel inherently does not provide, we use R.

The raw data is available at "Raw Dataset" sheet in the excel file "Duong_Liu.xlsx". The data includes information of the startups and their seed investor(s), the amount and date of the deal.

For model 1 to 6, we created dummy variables for quarters, and added trend variable. For model 7, we added other economic indicators into the data. The first observation is likely to be an outlier; we found it after building Model 1, so we removed it from Model 2 forward.

Moving Average and Simple Exponential Smoothing

Through visualization, we observe trend and seasonality in the data set, so we do not try MA and SES as our models. Note: In the first quarter of 2006, the $100M investment in SpaceX causes the first observation in our dataset to be an outlier.

Holt-Winter’s Method - Multiplicative Seasonality

To find all parameters for the method, we set up the predicting function on Excel and used Solver to find the optimal values.

RMSE

MAPE

Training

188,027

15.2%

Testing

706,675

15.0%

Smoothing-based methods is very adaptive, however its predictive power is not strong in this data set.

Times Series Model

Model 1 include trend variable and quarter variables to capture seasonal behavior of the data. We notice a low R-square of 9%.

Model 2 (after removing the outlier) sees an improvement in R-square (73%). However, there's a pattern in residual plot, which implies quadratic variables may be needed. We add quadratic Trend in Model 3.

Model 3 sees further improvement in R-square (81%). The quadratic variable is significant. There's no sign of autocorrelation. However, trend variable is insignificant, we go ahead to remove this variable in Model 5 (Model 4 is a stupid atempt I don't really want to metion. Yes, stupidity happens). We also built ACF plot to see if autocorellation exists. The plots show no sign of autocorrelation.

After trying different transformations. We decided to go with this model as our final prediction model. R-square is acceptable, and there's no sign of autocorrelation.

RMSE

MAPE

Training

132364.3

11.6%

Testing

322935.5

14.1%

Regression Against Economic Indicators

It is reasonable to consider the impact of overall economic landscape on the amount of seed fund received in the US. We added two variables Real Quarterly GDP and Quarterly Unemployment Rate together with seasonal dummy variables in an attempt to explain the response variable – Funding size.

RMSE

MAPE

Training

153,345

13.7%

Testing

384,997

16.8%

This regression model shows useful result with R-square of 73%. Both additional variables are significantly useful. However, we do not advocate this model since we need to make assumptions regarding future value of explanatory variables (GDP and Unemployment Rate). This additional uncertainty is unwelcomed.

ARIMA

We conduct ARIMA modelling automatically in R to pick the best model.

Import Data

> # Import data
> y=read.csv("AverageFund.csv")
> # create a time series object
> y.ts=ts(y,start=c(2006,2),frequency = 4)
> y.ts
          Qtr1      Qtr2      Qtr3      Qtr4
2006            882185.2  835388.8 1066515.2
2007  551335.6  984146.3  786267.1  835583.3
2008 1082810.0  622218.5  607822.3  764711.5
2009 1053471.8  745861.8  867047.2 1085975.8
2010  956694.2  929708.9  941842.8 1118965.9
2011 1269820.1 1056194.7 1104696.0 1081424.2
2012 1022974.4 1284014.9 1142556.5 1567631.2
2013 1284622.1 1163391.9 1447087.2 1588991.9
2014 1684298.7 1443340.1 1455709.0 1712275.1
2015 1547170.4 2098802.4 2142926.2 1999798.9
> # upload library forecast(models and time series visualization functions)
> library(forecast)
> library(ggplot2)

> # create a seasonal graph for subset of data
> ggseasonplot(window(y.ts,start=c(2006,2)),main="Seasonal Plot")+geom_point()
> # Divide data into training and testing sets
> train.start=c(2006,2)
> train.end=c(2014,4)
> test.start=c(2015,1)
> test.end=c(2015,4)
> train.ts=window(y.ts,start=train.start,end=train.end)
> test.ts=window(y.ts,start=test.start,end=test.end)
> # size of training and testing sets
> nTrain=length(train.ts)
> nTest=length(test.ts)

From visualization of the seasonal graph we can see that quarterly seasonality is not very significant in explaining our model. Quarterly average fund being raised for each company may follow another seasonal pattern or could be influenced by other variables.

Build ARIMA Model

> # Build ARIMA model
> M1=auto.arima(train.ts,lambda="auto")
> # Generate future forecast
> M1F=forecast(M1,h=nTest,level=95)
> # predicted values on testing set
> M1F
        Point Forecast   Lo 95   Hi 95
2015 Q1        1592522 1254763 1921463
2015 Q2        1614969 1270103 1950768
2015 Q3        1637376 1285589 1979859
2015 Q4        1659745 1301209 2008746
> # Fitted values
> fitted(M1)
          Qtr1      Qtr2      Qtr3      Qtr4
2006            881425.4  891238.3  911612.7
2007  958712.9  884043.3  929485.1  920145.4
2008  926522.4  983567.1  929875.7  885606.6
2009  884116.8  944779.2  926585.8  937925.1
2010  993788.6 1009661.6 1016381.9 1024239.2
2011 1068294.7 1135345.7 1141888.0 1157339.4
2012 1164511.6 1157769.6 1208259.8 1217475.3
2013 1316538.2 1332702.0 1319669.3 1370029.1
2014 1440094.1 1515486.7 1522670.0 1530944.2
> # plot data and prediction
> autoplot(M1F)+ autolayer(test.ts)+autolayer(fitted(M1))
> # accuracy metrics
> accuracy(M1F,test.ts)[,c("RMSE","MAPE")]
                 RMSE     MAPE
Training set 171804.5 15.05462
Test set     389669.8 16.64504

Testing data does not really fit onto our predicted values, lying above the confidence interval of our prediction. The fitted values seem not to fit our training data very well.

Residuals Analysis

> # residual diagnistics
> checkresiduals(M1)

    Ljung-Box test

data:  Residuals from ARIMA(0,1,1) with drift
Q* = 6.0984, df = 5, p-value = 0.2968

Model df: 2.   Total lags used: 7

Residuals do not show signs of autocorrelation. However, the residuals are not very normal looking. This may be due to having too little observations.

RMSE

MAPE

Training set

171804.5

15.1%

Test set

389669.8

16.6%

PreviousMultiple Linear Regression with R NextCleaning Data in R, Average daily attendance (ADA) USA

Last updated 6 years ago