COVID & data science
After the health crisis, the predictive crisis?
Co-authors: Raphaël Hamez, Hervé Mignot
In the last few weeks, we have witnessed a proliferation of dashboards on the spread and then the ebb of the epidemic, as well as attempts, often inconclusive, to use epidemiological models rediscovered for the occasion, to measure the progress of the epidemic. Paradoxically enough, this crisis has led to the globalization of data, even though nations were confined.
We can now ask what impact this period will have on data analysis practices, and more particularly on the development of predictive models. If data has rarely weighed so heavily on the analysis of a crisis, a crisis has rarely weighed so heavily on the uses of data.
After this health crisis, are we going to experience a predictive crisis? Are we going to be blinded for a while, assuming we were ever able to predict in the first place? During the period of global containment, a kind of huge "edema" of data has indeed built up in all data sets, those of companies as well as those of states.
Organization of the article:
the impacts on modeling practice
the areas of intervention to remedy the situation
an illustration on a sales forecasting model
the opportunities offered by this new "trace" in the data.
A predictive crisis?
We have already seen the real impact of the health crisis on predictive approaches in companies:
the indefinite suspension of customer behavior models (for predictive marketing: segmentation of customers by their purchases, attrition scores, propensity scores, etc.), whether due to a lack of data (store closures) or due to the exceptional context invalidating their construction hypotheses (no more use of loyalty cards, generalized unavailability of products biasing purchases, atypical and extreme behavior),
the temporary uselessness of sales forecasting and logistics planning models, whether through force of circumstance (store closures, which make forecasting trivial, etc.) or because of the new operating conditions (disappearance of significant transport capacity, etc.).
Nevertheless, data has retained its strategic dimension, and in some cases it is the only factual element on which companies try to build a rational response to events. Thus, very agile approaches to analyzing the first available signals, for the short term, have been put in place to prepare and optimize the response to the resumption of post-crisis activity (what can be extrapolated from the first reopenings? what is the demand? has it changed compared to before the crisis? how to optimize stock allocations in this case? etc.).
Paradoxically, it is perhaps risky to make predictions about the impact of the crisis on data science practices. But let's play the game and see which of our findings will stand the test of time in the months to come.
First, what are the potential impacts of containment and controlled decontainment on data analysis practices?
Will containment permanently impair the ability of companies to predict?
one of the assumptions of the use of predictive models is a form of continuity and the presence of patterns in the analyzed data
will the normal future be like the normal past? Will the behaviors that the data capture be the same before and after? And if not, since the historical data is no longer relevant, will it be necessary to carry out the great purge?
How long will we be in a transitory regime, with perhaps instabilities that are unsuitable for statistical modeling?
Axis 1: data pre-processing
● simply replace the data of the containment period by the data of the equivalent period of the last year (or a "smart" average of the past years), with possible adjustments to take into account the evolution in trends, the continuity with the data just before containment, etc.
● consider the data from the containment period as missing data and apply the usual techniques in this case: replacement by statistics such as the mean or the median (this is similar to the previous case), or imputation by a model adjusted on the historical data
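As a rough illustration of the first option, the sketch below replaces the weeks flagged as containment with the value observed 52 weeks earlier, rescaled by a year-over-year ratio measured just before containment. The function name, the 52-week lag, the 8-week adjustment window and the toy series are all assumptions made for the example, not details taken from the article.

```python
from statistics import mean


def replace_containment_weeks(sales, start, end, lag=52, window=8):
    """Return a copy of `sales` where weeks start..end-1 are replaced by the
    value observed `lag` weeks earlier, rescaled by a year-over-year ratio.

    The ratio compares the `window` weeks just before containment with the
    same weeks one year earlier, so the patch follows the recent trend.
    """
    if start - window - lag < 0:
        raise ValueError("not enough history before the containment period")
    recent = mean(sales[start - window:start])
    year_ago = mean(sales[start - window - lag:start - lag])
    ratio = recent / year_ago
    patched = list(sales)  # leave the original series untouched
    for week in range(start, end):
        patched[week] = sales[week - lag] * ratio
    return patched


# Toy weekly series (hypothetical): one year at 1.0, one year at 1.1,
# then eight weeks of collapse standing in for the containment period.
history = [1.0] * 52 + [1.1] * 52 + [0.1] * 8
cleaned = replace_containment_weeks(history, start=104, end=112)
print(cleaned[104:112])
```

The ratio is only one possible "adjustment"; a smoother correction, or an average over several past years, would fit the same slot.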
Axis 2: the creation of "context" variables added to the data points of the period
● addition of binary variables over the period so that they "absorb" the particularities of the period (e.g., a binary regressor over the data points of the period)
● addition of variables that are not usually included in the models because they do not vary sufficiently (e.g., in the case of a forecasting model, reintroducing the number of stores when usually we are satisfied with a trend)
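To make the idea of a period regressor concrete, here is a minimal sketch with a single binary variable. With an intercept and one 0/1 regressor, the least-squares solution reduces to the two group means, so the dummy "absorbs" the drop and the baseline level is estimated from normal weeks only. The data and the function name are hypothetical.

```python
from statistics import mean


def fit_level_with_period_dummy(y, in_period):
    """Least-squares fit of y = level + shift * d, where d is 1 inside the
    flagged period and 0 elsewhere. With a single binary regressor the OLS
    solution is given by the two group means."""
    inside = [v for v, d in zip(y, in_period) if d]
    outside = [v for v, d in zip(y, in_period) if not d]
    if not inside or not outside:
        raise ValueError("need observations both inside and outside the period")
    level = mean(outside)          # baseline, estimated on normal weeks
    shift = mean(inside) - level   # effect absorbed by the dummy
    return level, shift


sales = [1.0, 1.1, 0.9, 1.0, 0.2, 0.1, 0.3, 1.0]   # hypothetical weekly sales
containment = [0, 0, 0, 0, 1, 1, 1, 0]             # hypothetical period flag

level, shift = fit_level_with_period_dummy(sales, containment)
naive_level = mean(sales)  # level estimated without the dummy
print(level, shift, naive_level)
```

Without the flag, the average level is dragged down by the containment weeks; with it, the baseline stays close to its pre-crisis value.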
Axis 3: the choice of modeling techniques that better manage sudden changes in the data
● techniques based on decision trees
● "robust" statistical learning approaches that resample recent data (as can be done for target variables that are highly unbalanced)
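One possible reading of "resampling on recent data" is a weighted bootstrap in which recent observations are drawn more often than older ones, much as a minority class is oversampled. The sketch below is only that reading: the function name, the weighting factor and the toy rows are assumptions.

```python
import random


def oversample_recent(rows, n_recent, factor, seed=0):
    """Resample `rows` with replacement so that each of the last `n_recent`
    rows is `factor` times more likely to be drawn than an older row."""
    if not 0 < n_recent <= len(rows):
        raise ValueError("n_recent must be between 1 and len(rows)")
    n_old = len(rows) - n_recent
    weights = [1.0] * n_old + [float(factor)] * n_recent
    rng = random.Random(seed)  # local generator, reproducible draws
    return rng.choices(rows, weights=weights, k=len(rows))


# Hypothetical training rows: (week index, normalized sales).
rows = [(week, 1.0) for week in range(100)] + [(week, 0.4) for week in range(100, 110)]
sample = oversample_recent(rows, n_recent=10, factor=10)
recent_share = sum(1 for week, _ in sample if week >= 100) / len(sample)
print(f"share of recent rows in the resample: {recent_share:.2f}")
```

Any model can then be fitted on `sample` instead of `rows`; the factor controls how fast the model "forgets" the pre-crisis regime.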
Illustration
Let us give here an "optimistic" illustration of the impact of the Covid-19 crisis on predictive models, focusing on variables that are "transparent" under normal conditions. The data, which are real, come from a large retailer with a network of stores around the world. This company wants to update its sales forecasting model following the pandemic. The figure below represents the weekly sales observed on its Chinese perimeter (normalized to 1 for confidentiality), along with the number of open stores (also normalized to 1). The period framed in light red corresponds to the containment period.
For the sake of the example, let us postulate a simplistic linear model of sales forecasting in China, with the following explanatory variables:
● the sales of one year ago over the equivalent period (past_year),
● the number of open stores (nb_opened).
Over the pre-containment period, the number of open stores clearly has no impact on the sales forecasting model. Even the small break observed at the transition from 2018 to 2019 (corresponding to a store closure) does not seem to be reflected in any change of trend or level. The model's statistics are unequivocal: neither the constant (Intercept) nor the number of open stores is significant, and their effects almost entirely offset each other (0.28 vs. -0.27, standardized coefficients).
For "parsimony" enthusiasts, this observation will simply lead to removing the number of open stores from the modeling of future sales. For fans of feature engineering, this variable will never pass the "low variance threshold" filter. And for others, probably the majority, it will mean keeping in the predictive model a variable that is a priori silent (at least in a normal situation).
Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   0.28098     1.76752    0.159     0.874
past_year     0.95219     0.07049   13.508    <2e-16 ***
nb_opened    -0.27352     1.80216   -0.152     0.880
Multiple R-squared: 80%
A very simple calculation shows that using this model to predict sales during the Covid period would be disastrous: with nb_opened set to 0 and past-year sales at their normalized level of 1, it predicts sales of about 0.95 + 0.28 = 1.23, almost 25% higher than past values, even though every store is closed!
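The calculation can be spelled out in a few lines. The coefficients are copied from the pre-containment fit reported above; the choice of past_year = 1.0 (sales at their normalized level) is an assumption made for the example.

```python
# Coefficients of the model fitted on the pre-containment period only.
INTERCEPT = 0.28098
B_PAST_YEAR = 0.95219
B_NB_OPENED = -0.27352


def predict_sales(past_year, nb_opened):
    """Linear prediction: intercept + past-year effect + open-stores effect."""
    return INTERCEPT + B_PAST_YEAR * past_year + B_NB_OPENED * nb_opened


# past_year = 1.0 is an assumed "typical" normalized level.
normal = predict_sales(past_year=1.0, nb_opened=1.0)
all_closed = predict_sales(past_year=1.0, nb_opened=0.0)
print(round(normal, 3), round(all_closed, 3))
```

With all stores open the prediction stays close to last year's level (about 0.96), while closing every store raises it to about 1.23, which is exactly the absurd behavior described above.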
On the other hand, if we extend the "training" period of the linear model to include the containment period, the nb_opened variable becomes very strongly significant (column Pr(>|t|): 1.46e-15 ***). The weights of the model are now such that a number of open stores equal to 0 lowers the prediction by about 0.6, that is, about 60% of normal sales, whereas in normal times (100% of stores open) the intercept and the nb_opened term cancel out (-0.59 + 0.61) and the variable remains transparent.
Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  -0.59426     0.07084   -8.389  3.51e-12 ***
past_year     0.94193     0.08212   11.470   < 2e-16 ***
nb_opened     0.60757     0.05932   10.243  1.46e-15 ***
Multiple R-squared: 78%
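The same arithmetic applied to the extended fit shows the change of behavior. The coefficients are those of the table above; past_year = 1.0 and the intermediate shares of open stores are assumptions, and reading the model at partial closures relies on the linear form holding between the two observed extremes.

```python
# Coefficients of the model fitted with the containment period included.
INTERCEPT = -0.59426
B_PAST_YEAR = 0.94193
B_NB_OPENED = 0.60757


def predict_sales(past_year, nb_opened):
    """Linear prediction: intercept + past-year effect + open-stores effect."""
    return INTERCEPT + B_PAST_YEAR * past_year + B_NB_OPENED * nb_opened


# Hypothetical closure scenarios, from all stores open to all stores closed.
scenarios = {
    share_open: predict_sales(past_year=1.0, nb_opened=share_open)
    for share_open in (1.0, 0.75, 0.5, 0.25, 0.0)
}
for share_open, predicted in scenarios.items():
    print(f"{share_open:.0%} of stores open -> predicted sales {predicted:.3f}")
```

At 100% of stores open the intercept and the nb_opened term cancel and the prediction stays near last year's level; at 0% it falls by the full nb_opened coefficient.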
The lesson is clear: while the Covid-19 crisis calls into question the validity of existing predictive models, it also gives a voice to previously "silent" variables, which are undoubtedly easy to obtain or even already present in the models in place.
Paradoxically, now that we have traces of the impact of such a crisis in the data, a first use case is to generate realistic "catastrophe" scenarios based on real observations in times of crisis (what happens in two months if I have to close X stores tomorrow?). The more finely the effects of the variable in question are observed and measured (by taking a daily rather than weekly granularity, by looking at closures by province rather than globally, by cross-referencing with the duration of closure, etc.), the better the reaction to force majeure events can be estimated. We leave aside the question of whether it is appropriate to take these scenarios into account if they remain exceptional.
Other modeling approaches can be considered:
build two-component models: a short-term model driven by a short history, a long-term model driven by a long history (it remains to be determined how to shift from one to the other, possibly on the basis of an external signal managing the relative weight of the two models, such as a recovery indicator for example)
rely less on models learned purely from the data, and more on structural models (assumptions are made about the equations governing the links between the explanatory variables and the target variable, and the parameters of these equations are adjusted to the data).
We deliberately leave aside at this stage the statistics of rare or extreme events, and catastrophe theory.
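The two-component idea can be sketched as a weighted average of two forecasts, with the weight driven by an external recovery signal. Everything below is illustrative: the two toy components (a mean over the last weeks, and the value observed one year earlier), the recovery indicator in [0, 1] and the function names are assumptions, not the article's method.

```python
from statistics import mean


def short_term_forecast(sales, window=4):
    """Short-history component: mean of the last `window` observations."""
    return mean(sales[-window:])


def long_term_forecast(sales, lag=52):
    """Long-history component: value observed `lag` periods before the
    period being forecast (the one right after the end of `sales`)."""
    return sales[-lag]


def blend_forecasts(short_term, long_term, recovery):
    """Weighted average: recovery = 0 uses only the short-term model,
    recovery = 1 uses only the long-term model."""
    if not 0.0 <= recovery <= 1.0:
        raise ValueError("recovery must lie in [0, 1]")
    return (1.0 - recovery) * short_term + recovery * long_term


# Hypothetical history: one normal year, then four depressed weeks.
history = [1.0] * 52 + [0.4] * 4
short = short_term_forecast(history)
long_ = long_term_forecast(history)
for recovery in (0.0, 0.25, 1.0):
    print(recovery, round(blend_forecasts(short, long_, recovery), 3))
```

How the recovery signal is built (share of reopened stores, traffic index, etc.) remains the open question mentioned above.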
In summary
Preprocessing
● Intelligent copy and paste
● Missing value processing & imputation by model
Variable creation
● Period regressors
● Re-introduction of usually "transparent" variables
Modeling
● Tree-based methods for breaks
● Resampling
Other
● Short- and long-term two-component models
● Structural models
Opportunities
The health crisis is also creating many opportunities from a data science perspective:
it has created a "natural experiment" on a global scale, and analyzing the data will probably make it possible to highlight relationships and measure impacts that were impossible to analyze in a "stationary regime", that is, in the normal functioning of society and the economy (think of massive telecommuting, which no economy would have tested on this scale without fearing the worst)
new data sets have been created that reveal the impact of causal variables whose effect was previously difficult to isolate (e.g., in the case of sales forecasting models, the number of sales outlets in a stable network vs. a time trend)
it is probably a gold mine for testing models: the important differences in situations (containment, closures, etc.) between countries or regions can be exploited to evaluate the impact of variables that are usually "transparent" for the models (number of stores, logistic flows, etc.)
Data has therefore not lost its relevance and remains, as already mentioned, one of the only reference points for trying to understand what is happening and to steer through these unusual times. It is likely that predictive capabilities will be challenged for some time. Simple analyses, on short cycles, will probably prevail operationally. But a formidable set of data has been created in the last few months, and the data science community will be eager to explore all the lessons it holds.
Hervé Mignot
Data and R&D Associate