Data across disciplines

Having started as an economist and then transitioned into data science, I’ve been very interested in a) how to teach data science to economists, and b) what the disciplines can learn from each other.

A few weeks ago, I was assisting with some training (with Data Science Dojo) on behalf of the World Bank in Bucharest. The participants were mostly ministry of health employees from South America and Eastern Europe, with a few externals too. On our usual bootcamps the participants are mostly analysts or software engineers, whereas the Bucharest participants came from a more academic background – mostly epidemiologists and economists. I could see the objection forming on the lips of the economists: “But it’s not causal!” And I could hear a few mumblings from the epidemiologists when it came to discussing precision and accuracy. To avoid too much confusion over false friends and discipline sub-cultures, I put together two tables: one comparing economics / traditional statistics with predictive analytics, and another comparing epidemiology with predictive analytics. Epidemiology isn’t my discipline, but I had a little help from my friends (thanks Lizzie and Rosie!) and from a doctor / epidemiologist on the course.

Economics / Traditional Statistics vs Predictive Analytics

Terminology
  Traditional statistics: independent / predictor variables; dependent / outcome variable; model = algorithm (+ parameters).
  Predictive analytics: features; target; model = algorithm + parameters + data.

Focus
  Traditional statistics: causal estimation – unbiased estimates of a treatment effect.
  Predictive analytics: prediction of an overall outcome.

Data
  Traditional statistics: often deliberate data collection (own or national surveys).
  Predictive analytics: more likely to use data collected in everyday operations.

Algorithm choice
  Traditional statistics: depends on the data you have (e.g. time series, instrumental variables).
  Predictive analytics: considers performance on prediction, along with interpretability and computing time.

Approach for choosing variables
  Traditional statistics: the focus is unbiased estimates, so IDs, time dummies etc. can be included; a more theoretical approach to choosing control variables; check the significance of adding each individual variable.
  Predictive analytics: the focus is performance on unseen data, so IDs and time dummies can’t be included; an agnostic approach; include everything, then prune back according to how much extra predictive power each feature adds.

Evaluate model
  Traditional statistics: test for broken assumptions; test the significance of added variables.
  Predictive analytics: look at the mean and standard deviation of multiple estimates of predictive performance.

Avoidance of overfitting
  Traditional statistics: in an ideal scenario, pre-registering our hypotheses.
  Predictive analytics: test the model’s predictive performance on unseen data.

Example
  Traditional statistics: which drug works to treat the malignant tumours?
  Predictive analytics: is the tumour malignant?
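
To make the “evaluate model” row a little more concrete, here’s a minimal sketch in Python (scikit-learn, on made-up data – nothing from the course) of the predictive analytics approach: looking at the mean and standard deviation of repeated estimates of predictive performance via cross-validation.

```python
# A hedged sketch: evaluating a classifier by the mean and standard deviation
# of its cross-validated accuracy, rather than by coefficient significance.
# The dataset is synthetic; model and metric choices are illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="accuracy")
print(f"accuracy: mean = {scores.mean():.3f}, sd = {scores.std():.3f}")
```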

Epidemiology vs Predictive Analytics

Terminology (false friends!)
  Epidemiology: accuracy (the estimate represents the true value); precision (similar results achieved with repeated measurement); bias (systematic source of error, arising from selection or information).
  Predictive analytics: accuracy (a model evaluation metric); precision (a model evaluation metric); bias (systematic source of error, arising from overfitting the model to the training data).

Focus
  Epidemiology: risk factors and disease outcome; predict disease incidence at the aggregate level.
  Predictive analytics: prediction of an overall outcome; predict at the individual level.

Data
  Epidemiology: often purposeful collection (10s to 100,000s of observations).
  Predictive analytics: more likely to use data collected in everyday operations (100s to billions+).

Algorithm choice
  Epidemiology: depends on the data you have (independent – Cox regression, Kaplan–Meier survival analysis; or not – linear, logistic, hierarchical).
  Predictive analytics: considers performance on prediction, along with interpretability and computing time.

Approach for choosing variables
  Epidemiology: theoretical approach – driven by the research questions; is a biological mechanism plausible for the main effect and control variables? Use causal diagrams.
  Predictive analytics: the focus is performance on unseen data, so IDs and time dummies can’t be included; an agnostic approach; include everything, then prune back according to how much extra predictive power each feature adds.

Evaluate model
  Epidemiology: look at the significance of the coefficient on the variable; look at the confidence intervals around the coefficient.
  Predictive analytics: look at the mean and standard deviation of multiple estimates of predictive performance.

Avoidance of overfitting
  Predictive analytics: test the model’s predictive performance on unseen data.

Example
  Epidemiology: assess the odds ratio of lung cancer in smokers vs. non-smokers.
  Predictive analytics: is the tumour malignant?
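
To make the “false friends” row concrete: in predictive analytics, accuracy and precision are evaluation metrics computed by comparing a model’s predictions with the true labels, not properties of a measurement process. A tiny illustrative sketch with made-up labels (1 = malignant, 0 = benign):

```python
# A hedged sketch of the predictive-analytics meanings of accuracy and precision.
# The labels below are invented purely for illustration.
from sklearn.metrics import accuracy_score, precision_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up ground truth
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # made-up model predictions

print("accuracy :", accuracy_score(y_true, y_pred))   # proportion of predictions that are correct
print("precision:", precision_score(y_true, y_pred))  # proportion of predicted positives that are truly positive
```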

Some of the differences are pretty superficial – different terminologies for the same concepts, or the same terminology for different concepts. Others come from more fundamental differences in focus: for example, the economists care more about whether a particular policy caused a particular effect, whilst data scientists (in predictive analytics) care more about the overall predictive power of the model. This then translates into different data. If you really care about identifying a causal estimate, that often means paying to set up a study and collect detailed data on a small number of people. Here the decision is often pretty binary (“should we roll out this policy?”) and made at a high level (e.g. on behalf of an entire borough or even nation).

If instead you’re more interested in personalising the decision for each individual you interact with, then you may have millions of decisions to make, and your model needs to give an appropriate answer (which may differ) for each of them using the data available in the course of everyday operations (a survey of every individual would be unfeasible!). Because in an economics study you’re most interested in obtaining an unbiased estimate on the variable of interest, you can bring in control variables which would never get included in a predictive analytics model – for example, fixed-effect and time dummies. In economics, panel techniques are a wonderful trick for controlling for unobserved variation, but in predictive analytics including such variables would be akin to cheating, as you won’t necessarily be predicting for the same individuals.

So there are definitely good reasons, to do with the focus of each discipline, why the techniques have diverged. There are, however, some areas where I think the disciplines could learn from each other. Two aspects of a recent freelance project I completed illustrate this quite well, I think. I was helping a membership company encourage its members to renew their subscriptions.

  1. Combining techniques: A colleague had previously built a predictive model to better identify members whose subscriptions were coming up for renewal but who were at risk of terminating their membership. This helped the sales team focus their efforts on the members they were at risk of losing. (Of course, it would be even better if we could build a model to predict those who were at risk of not renewing but who would be responsive to a call – we needed slightly different data for that!) I then used panel techniques (which come from econometrics) to try to get closer to a causal estimate of what contributes towards the average member renewing their membership; there’s a rough sketch of this approach after this list. Combining techniques usually confined to different disciplines allowed us to target who to contact, and then gave us a somewhat better idea of what might encourage those individuals to renew their membership.
  2. Holdout data: For the model trying to understand what contributes towards a member renewing, I withheld a proportion of my data to test the model against. Usually in predictive analytics, because the focus is predictive power, the model is evaluated on how well its predictions perform on this unseen ‘holdout’ dataset (e.g. accuracy, precision etc., or, more familiarly for economists, R squared). However, because I was more interested in getting unbiased estimates of the coefficients on the possible ‘treatment’ variables, the overall predictive power mattered less. (Some phenomena are just difficult to explain with the amount of data available. To give you a sense of this: when I worked on identifying what contributes to wellbeing, we’d often be happy with R squareds in the region of 0.3, which would be pretty bad performance if you were trying to predict wellbeing!) Because of this focus on unbiased estimates, I instead looked at the coefficients to see whether they changed significantly when running the same model on the holdout data: I conducted paired t-tests of the coefficients from the models fitted on the original training dataset and on the holdout dataset, and found no significant difference between the paired coefficients at the 95% confidence level (there’s a sketch of this check after the list too). I have heard Spencer Greenberg speak briefly about the use of holdout data in psychology, but I couldn’t find much online about how best to use holdout data when the focus is unbiased coefficient estimation rather than prediction. So I’m very open to discussion on this!
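
For anyone curious what the panel approach in (1) might look like, here’s a rough sketch in Python using statsmodels: a linear probability model of renewal with member and year fixed effects (as dummy variables) and standard errors clustered by member. The data and column names are entirely invented – this is not the client’s model.

```python
# A hedged sketch of a fixed-effects (panel) specification, not the actual client model.
# The panel, outcome and feature names are hypothetical placeholders.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented member-year panel: one row per member per renewal cycle.
rng = np.random.default_rng(0)
n_members, n_years = 200, 4
df = pd.DataFrame({
    "member_id": np.repeat(np.arange(n_members), n_years),
    "year": np.tile(np.arange(2019, 2019 + n_years), n_members),
    "emails_opened": rng.poisson(5, n_members * n_years),
    "events_attended": rng.poisson(2, n_members * n_years),
})
df["renewed"] = (rng.random(len(df)) < 0.6 + 0.02 * df["events_attended"]).astype(int)

# Member and year dummies absorb unobserved member-level and year-level variation;
# standard errors are clustered by member.
fe_model = smf.ols(
    "renewed ~ emails_opened + events_attended + C(member_id) + C(year)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["member_id"]})

# Only the 'treatment'-style coefficients are of interest; the dummies are nuisance terms.
print(fe_model.params[["emails_opened", "events_attended"]])
print(fe_model.conf_int().loc[["emails_opened", "events_attended"]])
```

The member dummies soak up anything constant about a given member – exactly the kind of variable you would not include in a predictive model, for the reasons discussed above.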
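And a sketch of the holdout check in (2): fit the same specification separately on the training data and on the holdout data, then run a paired t-test across the two vectors of coefficients. Again, the data and variable names are invented for illustration.

```python
# A hedged sketch of comparing coefficients between a training fit and a holdout fit
# with a paired t-test. Data and variable names are invented.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import ttest_rel
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "emails_opened": rng.poisson(5, 1000),
    "events_attended": rng.poisson(2, 1000),
    "tenure_years": rng.integers(1, 10, 1000),
})
df["renewed"] = (rng.random(1000) <
                 0.4 + 0.03 * df["events_attended"] + 0.02 * df["tenure_years"]).astype(int)

# Withhold ~30% of the data as a holdout set.
train, holdout = train_test_split(df, test_size=0.3, random_state=1)
formula = "renewed ~ emails_opened + events_attended + tenure_years"

coefs_train = smf.ols(formula, data=train).fit().params
coefs_holdout = smf.ols(formula, data=holdout).fit().params

# Paired t-test across the matched coefficients from the two fits.
stat, p_value = ttest_rel(coefs_train.values, coefs_holdout.values)
print(f"paired t-test across coefficients: t = {stat:.2f}, p = {p_value:.3f}")
```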

Just a brief note on sample size and expense – I know one of the restrictions researchers (in economics and other social sciences) work within is the budget for data collection. Because the data collection in causal estimation studies is often tailored to the study, it is expensive, and there often isn’t the appetite to collect much more than the minimum sample size required to detect the expected effect size. Holdout data is usually about 30% of the sample, so to keep the original analysis sample after holding out 30% you would need n/0.7 ≈ 1.43n observations – roughly a 43% increase in sample size – to do a training / holdout split. I can definitely see there being pushback on this suggested change in technique, because it increases the cost of data collection for something the discipline has not so far recognised as important.

However, I would argue the added expense is justified. We have already made the decision to try to figure out whether a programme or policy is worth scaling up or continuing, and have justified the overheads of the evaluation. The additional cost is likely to be proportionally much smaller than the ~43% increase in sample size, as the overheads of the evaluation remain the same (the training and recruitment of data collectors, the set-up of the data collection itself, the project management and the analysis). Spending a relatively small proportion more would enable us to be much more confident that the results did not arise purely by chance.

The current solution in academia (a slightly different setting from impact evaluation in economics) of pre-registering a small number of hypotheses and only researching those feels unsatisfactory: it restricts us to our knowledge base at the time of proposing the study and does not do justice to the exploratory nature of the research process. We often have a much better understanding of complex phenomena after exploring the data. Trying to solve overfitting through pre-registration means we have to commission further studies to explore the insights we picked up while exploring the data we already have. Holdout data acts as another dataset against which to test those explorations.

I find it fascinating that these sub-cultures – using statistical techniques in slightly different ways – have developed across disciplines, and I’d be very interested in talking to more people from each of them to see what we can cross-fertilise!
