I recently presented our work on big housing data at the Bank of England conference on “Modelling with Big Data and Machine Learning”. It was a super-interesting conference and I learned a lot. Now that the slides of the workshop have been uploaded online, I thought I would write a blog post to share some of what I learned. I’ll also take this chance to write about how big data relate to this blog and how they could influence theoretical economic models.

The first session of the conference was about *nowcasting*. I particularly liked the talk by Xinyuan Li, a PhD student at London Business School. In her job market paper, she asks if Google information is useful for nowcasting even when other macroeconomic time series are available. Indeed, most papers showing that Google Trends data improve nowcasting accuracy of, say, the unemployment rate, do not check if this improvement still holds once the researcher considers series of payrolls, industrial production, capacity utilization, etc. Li combines macroeconomic and Google Trends time series in a state-of-the-art dynamic factor model and shows that Google Trends add little, if any, nowcasting accuracy. However, if one increases the number of associated Google Trends time series by using Google Correlate, a tool that finds the Google searches most correlated with a given series, nowcasting accuracy improves. So under some conditions Google information is indeed useful.

The first keynote speaker was Domenico Giannone, from the New York Fed. The question in his paper is whether predictive models of economic variables should be *dense* or *sparse*. In a sparse model only a few predictors are important, while in a dense model most predictors matter. To answer this question it is not enough to estimate a LASSO model and count how many coefficients “survive”. Indeed, for LASSO to be well-specified, the correct model must be sparse. The key idea of the paper is to allow for sparsity, without assuming it, and let the data decide. This is done via a “spike and slab” model, which contains two elements: a parameter q that quantifies the probability that a coefficient is non-zero; and a parameter γ that shrinks the coefficients. The same predictive power can in principle be achieved either by including only a few coefficients or by keeping all coefficients but shrinking them. In a Bayesian setting, if the posterior distribution is concentrated at high values of q (and so low values of γ), the model should be dense. This is what happens in the figure below, in five out of six datasets in micro, macro and finance. Yellow means a high value for the posterior, and only in the case of micro 1 is it high for q ≈ 0. So in most cases a significant fraction of predictors is useful for forecasting, hence the *illusion of sparsity*.
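To make the roles of q and γ a bit more concrete, here is a minimal sketch of drawing coefficients from a spike-and-slab prior. It is only an illustration, not the estimation procedure from the paper: I assume a Gaussian slab with standard deviation `sigma * gamma`, and the function names and the particular values of q and γ are my own choices.

```python
import random


def draw_spike_and_slab(k, q, gamma, sigma=1.0, seed=0):
    """Draw k coefficients from a spike-and-slab prior.

    Each coefficient is exactly zero with probability 1 - q (the spike)
    and Gaussian with standard deviation sigma * gamma otherwise (the slab).
    The Gaussian slab is an assumption made for this illustration.
    """
    rng = random.Random(seed)
    return [
        rng.gauss(0.0, sigma * gamma) if rng.random() < q else 0.0
        for _ in range(k)
    ]


def share_nonzero(coefs):
    """Fraction of coefficients that are not exactly zero."""
    return sum(c != 0.0 for c in coefs) / len(coefs)


# Sparse configuration: few predictors included, little shrinkage.
sparse = draw_spike_and_slab(k=1000, q=0.05, gamma=1.0)
# Dense configuration: most predictors included, but heavily shrunk.
dense = draw_spike_and_slab(k=1000, q=0.90, gamma=0.2)

print("sparse:", share_nonzero(sparse), "dense:", share_nonzero(dense))
```

In the sparse draw only a handful of coefficients are non-zero but they can be large, while in the dense draw almost all are non-zero but small. The paper's question is which of these two regions the posterior over (q, γ) favours.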

The most thought-provoking talk in the panel discussion on “Opportunities and risks using big data and machine learning” was again by Giannone. What he said is best summarized in a paper that everyone interested in time series forecasting with economic big data should read. His main point is that macroeconomists have had to deal with “big data” since the birth of national accounting and business cycle measurement. State-of-the-art nowcasting and forecasting techniques that he jointly developed at the New York Fed include a multitude of time series at different frequencies, such as the ones shown in the figure below. These series are highly collinear and rise and fall together, as shown in the heat map in the horizontal plane. According to Giannone, apart from a few exceptions, big data coming from the internet have little chance of improving on data carefully collected by established national statistical agencies.

On a different note, in a following Methodology session I found out about a very interesting technique: Shapley regressions. Andreas Joseph from the Bank of England talked about the analogy between Shapley values in game theory and in machine learning. In cooperative game theory Shapley values quantify how much every player contributes to the collective payoff. A recent paper advanced the idea of applying the same formalism to machine learning. Players become predictors and Shapley values quantify the contribution of each predictor. While there exist several ways to quantify the importance of predictors in linear models, Shapley values extend nicely to potentially highly non-linear models. His colleague Marcus Buckmann presented an application to financial crisis forecasting, using data back to 1870 (see figure below). Interestingly, global and domestic credit contribute a lot to forecasting, while current account and broad money are not so important. In general, Shapley regressions might help with the interpretability of machine learning “black boxes”.
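For readers who have not met Shapley values before, the game-theoretic definition can be sketched in a few lines. This is a brute-force computation for a tiny cooperative game, not the Shapley regression method presented in the talk. The payoff function `toy_value` and the player names `a`, `b`, `c` are invented for illustration, with "players" standing in for predictors.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a cooperative game.

    `value` maps a frozenset of players to the payoff of that coalition.
    The cost grows exponentially in the number of players, so this is
    only practical for small games.
    """
    n = len(players)
    result = {}
    for p in players:
        others = [x for x in players if x != p]
        total = 0.0
        for size in range(n):
            # Probability that exactly `size` specific players precede p
            # in a uniformly random ordering of all players.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of p when joining coalition s.
                total += weight * (value(s | {p}) - value(s))
        result[p] = total
    return result


def toy_value(coalition):
    """Hypothetical payoff: 'a' and 'b' matter and interact, 'c' is useless."""
    v = 0.0
    if "a" in coalition:
        v += 4.0
    if "b" in coalition:
        v += 1.0
    if "a" in coalition and "b" in coalition:
        v += 1.0  # interaction term, split equally by the Shapley formula
    return v


phi = shapley_values(["a", "b", "c"], toy_value)
print(phi)
```

The values sum to the payoff of the full coalition, and the interaction between `a` and `b` is shared equally between them. That additivity is what makes the approach attractive for attributing a non-linear model's prediction to individual predictors.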

The last session I’d like to write about is the one on text analytics. Eleni Kalamara, a PhD student at King’s College, presented her work on “making text count”. The general goal of her project is to see whether text from UK newspapers proxies sentiment and uncertainty and is useful to predict macroeconomic variables. What I found most interesting was the comparison of 13 different dictionaries that turn text into sentiment and uncertainty indicators. Given such a proliferation of metrics, it seems very useful to systematically compare them. Another interesting talk in the same session was given by Paul Soto. In his job market paper “breaking the word bank”, he used Word2Vec to find words related to “uncertainty” in transcripts of banks’ conference calls. Word2Vec is a machine learning algorithm that finds a vector representation for words, taking into account both syntax and semantics. The figure below shows a two-dimensional projection of the vector space; words related to uncertainty are highlighted in yellow to the right. In his paper, Soto shows that banks with higher idiosyncratic uncertainty are less likely to give loans and more likely to increase their liquidity.
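Once words have vectors, "finding words related to uncertainty" boils down to a nearest-neighbour search in the vector space. The sketch below shows only that lookup step, using cosine similarity. It does not train Word2Vec, and the three-dimensional vectors in `toy_vectors` are made up by me for illustration (real embeddings have hundreds of dimensions and are learned from the text).

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, vectors, k=2):
    """Return the k words whose vectors are most similar to `word`."""
    target = vectors[word]
    scored = [
        (cosine(target, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    return [other for _, other in sorted(scored, reverse=True)[:k]]


# Hypothetical embeddings, invented for this example only.
toy_vectors = {
    "uncertainty": [0.9, 0.1, 0.0],
    "risk": [0.8, 0.2, 0.1],
    "volatility": [0.7, 0.3, 0.0],
    "dividend": [0.1, 0.9, 0.2],
    "branch": [0.0, 0.2, 0.9],
}

related = nearest("uncertainty", toy_vectors)
print(related)
```

With these toy vectors the two nearest neighbours of "uncertainty" are "risk" and "volatility". A word list grown this way can then be counted in each transcript to build an uncertainty indicator.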

There were a lot of other great talks. For example, Thomas Renault from Sorbonne showed how to detect financial market manipulation (in particular, pump and dump schemes) from Twitter. Luca Onorante from the European Central Bank demonstrated how to select the most relevant Google Trends in the context of Bayesian Model Averaging. Emanuele Ciani from the Bank of Italy built on a method first introduced by Jon Kleinberg to predict the agents that would most benefit from policies, nicely combining ideas from prediction and from causal inference. For the many other interesting talks, please check the program or look at the slides.

So, what do big data have to do with complexity economics? This conference was purely about statistical models. My sense is that economic theorists are not responding to big data as much as empirical economists. True, heterogeneous agent models that use micro evidence to discriminate between different macro models that produce the same macro outcomes are increasingly popular, but I don’t think they quite exploit the power of big data. On the other hand, large-scale “microsimulation” Agent-Based Models (ABMs) that are fed directly with data and solved forward without imposing equilibrium constraints seem more promising for exploiting the big data opportunities. A nice example of this is the ongoing work by Sebastian Poledna and coauthors on “Economic forecasting with an agent-based model”, exploiting comprehensive datasets for the Austrian economy. I plan to work on prediction with ABMs too during my postdoc funded by the James S. McDonnell Foundation. Better out-of-sample forecasting performance would be a compelling motivation for the enhanced realism of ABMs, which comes at the cost of other features that are considered important in mainstream theoretical models.