Tea With Strangers: new directions?

In autumn last year, the Tea With Strangers London hosts sat around a table full of food and tea, contemplating the minds of strangers. We were trying to figure out why people cancel last minute or don’t show (which, as you can imagine, is disappointing for us!), and we also turned our analysis on ourselves to ask why we’re sometimes a bit slow to organise teas (which, as you can imagine, is disappointing for our strangers). Since the tools of my trade are research ones, I rolled up my sleeves and a) analysed our participation data, and b) conducted qualitative interviews with some of our hosts and some of our strangers.

The numbers are reasonably small, and the direction of causality in the quantitative analysis is reasonably difficult to interpret, but the queries did provide lists of participants grouped by characteristic (including one participant who’d cancelled last minute 17 times!), which made it easier to select participants to interview.

What I learnt most from the interviews is that each host conducts their teas very differently, and it’d be useful for new recruits to co-host (in a “Community-Tea”!) with experienced ones to see the range of approaches and find their unique style. For the full write-up, please click here.



Model Metrics in Singapore

I helped teach the Data Science and Data Engineering Bootcamp run by Data Science Dojo in Singapore earlier this month. I came away with a refreshed appreciation of the importance of randomness, which pops up frequently across techniques, and a renewed love for ensemble methods. One of the things that students consistently found difficult was getting their heads around the different evaluation metrics, so here’s my attempt to explain and simplify.

One of the things which always surprises students is being able to write a machine learning algorithm in one line of code. How could it be so easy, they ask? And then we start to ask: is this model any good? What do the metrics tell you? And then students fall straight from the cliff of confusion into the desert of despair.


Binary Classification Metrics

So a binary classification model predicts “yes” or “no” for an observation, for example, “is this a fraudulent transaction?” or “does this person have a disease?” There are 4 states of the world:

  1. You guess “yes” and it’s a “yes” – yay!
  2. You guess “no” and it’s a “no” – yay!
  3. You guess “yes” and it’s a “no” – less yay…
  4. You guess “no” and it’s a “yes” – less yay again…

These states are often put into a table called a “confusion matrix” (one hopes this is named ironically!). A prediction is said to be “true” if the actual class matches the predicted class, and “false” if it doesn’t.

[Figure: confusion matrix]
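As a quick sketch of the four states (using scikit-learn, with made-up labels purely for illustration):

```python
# Build a confusion matrix for a toy set of true and predicted labels.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # invented actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # invented predictions

# For binary labels, rows are actual classes and columns are predicted:
# [[true negatives, false positives],
#  [false negatives, true positives]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```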

The simplest metric is accuracy: out of your predictions, how many are correct ((true positives + true negatives) / total)? But accuracy turns out not to be a good measure for rare events. Sticking with the example of fraud: say 1 in every 1,000 transactions is fraudulent, and as a simple starting model you predicted that none of them were fraudulent; the accuracy of the model would be 99.9%. Not bad, right? But actually not helpful in giving you any action to take. As data scientists, we need to be aware of multiple metrics because data sets differ in two key ways:

  • The ratio between positives and negatives (“the class distribution”) => this is rarely 50:50;
  • The cost of wrong predictions (false positives versus false negatives).

Accuracy is only good for symmetric data sets, where the class distribution is 50:50 and the costs of false positives and false negatives are roughly the same. For example, suppose you were trying to predict whether someone is female (ignoring non-binary people for the sake of the example!): there are roughly equal numbers of males and females, and the cost of a false positive (marking someone as female when they are actually male) is about the same as the cost of a false negative (marking someone as male when they are actually female).
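A minimal sketch of why accuracy misleads on rare events, using the 1-in-1,000 fraud rate from the example above:

```python
# 1,000 transactions, only 1 fraudulent; a "model" that always predicts
# "not fraud" is right 999 times out of 1,000.
actual = [1] + [0] * 999     # 1 = fraud, 0 = legitimate
predicted = [0] * 1000       # always guess "not fraud"

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(accuracy)  # 0.999 -- yet the model catches zero fraud
```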

Precision is the ratio of correct positive predictions to all positive predictions (true positives / (true positives + false positives)). You improve precision by reducing the number of false positives. It measures how good predictions are with regard to false positives, and so is useful when false positives are costly. For example, while it may seem useful to detect as many cases of a disease as possible, if the treatment has serious side effects, you may want to reduce the number of false positives that you subject to the treatment unnecessarily.

Recall (also called sensitivity) is the ratio of correctly predicted positive events to all actual positives (true positives / (true positives + false negatives)). It measures how good predictions are with regard to false negatives, and you improve it by reducing the number of false negatives, i.e. missed true cases. You will want to focus on improving recall when the cost of missing a case is high, for example, in predicting terrorism.

The F1 score is the harmonic mean of precision and recall. It is for cases where an uneven class distribution matters and false positives and false negatives have similar costs. For example, in the case of tax dodgers (few relative to the population, i.e. an uneven class distribution), it may be equally costly to miss a tax dodger (due to the lost tax revenue) and to falsely accuse someone of dodging tax (due to the undermined trust). You take the harmonic mean instead of a simple mean because the denominators for calculating precision and recall are different, and so it doesn’t make sense to average them directly*.
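Putting the three formulas together in a small sketch (the counts are invented for illustration):

```python
# Hypothetical counts from a confusion matrix.
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)  # 8 / 10 = 0.8
recall = tp / (tp + fn)     # 8 / 12 ~= 0.667

# Harmonic mean of precision and recall:
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.667 0.727
```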

You can also use other combinations of precision and recall which reflect how much you care about each. This is known as the F-beta score, where beta is how much more you care about recall than precision. It is useful for translating the actual costs of false positives and false negatives into the metric. For example, say falsely identifying an employee as about to leave their role costs the company $1,000 in a pay rise, and missing that an employee is leaving costs the company $10,000 in replacing them; then you’d use an F10 score, as the cost of the false negative is 10x the cost of the false positive.
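The F-beta generalisation can be sketched from its standard formula (the precision and recall values here are invented; the beta of 10 mirrors the employee-churn cost ratio above):

```python
def f_beta(precision, recall, beta):
    # Standard F-beta: beta > 1 weights recall more heavily,
    # beta < 1 weights precision more heavily (beta = 1 gives F1).
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 0.8, 0.5
print(round(f_beta(precision, recall, 1), 3))   # F1
print(round(f_beta(precision, recall, 10), 3))  # F10: dominated by recall
```

Note how the F10 score lands close to the recall value: with a large beta, precision barely moves the metric.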

The important thing to remember in all of this is that whichever type of error is more important or costs more is the one that should receive more attention.

In cross-validation, you run the algorithm on multiple samples of the data, which creates lots of values for each metric. Look at the mean and standard deviation of the metric. Ideally, you want a high mean (of accuracy, precision, recall / sensitivity or F1 score) and a low standard deviation, which suggest low bias and low variance respectively. BUT there is a trade-off between bias and variance, which means that you’re always looking for the sweet spot between low bias (indicated by the high mean) and low variance. As you increase the model complexity you decrease the bias, but if you go too far you end up overfitting and increasing the variance. If there is a low standard deviation but also a low mean, you can increase the complexity of your model. If there is a high mean but also a high standard deviation, it probably means that you’ve overfitted your model.

Practically speaking, in tree-based models you can decrease the complexity by doing things like reducing the depth, increasing the minimum number of samples per leaf node or decreasing the number of random splits per node. Do the opposite to make your model more complex. I would highly recommend changing one parameter at a time and observing the change in the mean and standard deviation of whichever metric you’re using.
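As a sketch of comparing the cross-validated mean and standard deviation while varying one complexity parameter (the synthetic dataset and parameter values are invented for illustration, using scikit-learn):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, invented for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Compare a shallow (simpler) tree with a fully grown (more complex) one:
for depth in (2, None):  # None = grow until leaves are pure
    scores = cross_val_score(
        DecisionTreeClassifier(max_depth=depth, random_state=0),
        X, y, cv=5, scoring="accuracy",
    )
    # A high mean with a high standard deviation hints at overfitting.
    print(depth, scores.mean().round(3), scores.std().round(3))
```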

The table below summarises how to choose the evaluation metric:

[Table: choosing an evaluation metric]

Continuous Variable Model Metrics

When predicting a continuous variable, the idea of being right or wrong for each prediction doesn’t really work. As an example, if I have the task of predicting the height of a group of people, I am not predicting “are they exactly 170cm tall?” The answer for pretty much everyone would be “no”. So I have to predict a height for each person depending on their characteristics. This is a rather sad way to put it, but predicting a continuous variable is about minimising how wrong you are. For this reason, the above framework of true positives and true negatives doesn’t really work, so we need to turn to some other metrics to evaluate how good our continuous model is.

RMSE is the root mean squared error. It is calculated by taking the differences between the predicted values, say house prices, and the actual house prices, squaring them, averaging the squares, and then taking the square root. It gives you an idea of how far the predicted values are from the actual values. Because the errors are squared before they are averaged, the RMSE overweights large errors in prediction. This makes it useful when predicting well on outliers is particularly important.

If predicting on outliers is not important (e.g. if they probably represent measurement error in your equipment) then you can use the MAE (mean absolute error), which is just the average of the absolute differences between the predicted values and the actual values.
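A minimal sketch of both formulas (the house-price values are invented):

```python
import math

# Hypothetical actual and predicted house prices (in £000s).
actual = [250, 300, 410, 500]
predicted = [240, 330, 400, 560]

errors = [p - a for p, a in zip(predicted, actual)]

# RMSE: square the errors, average them, then take the square root.
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))
# MAE: average of the absolute errors.
mae = sum(abs(e) for e in errors) / len(errors)

# The single large error (60) pulls the RMSE well above the MAE.
print(round(rmse, 1), round(mae, 1))  # 34.3 27.5
```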

In practice, the RMSE and MAE are used less commonly than R squared (also known as the “coefficient of determination”) because R squared is standardised. R squared is effectively the ratio between explained variation and total variation observed. The values of R squared lie between 0 and 1, with 0 being the worst model (no variation explained) and 1 the best (all observed variation explained by the model). (Well, actually R squared can go below 0, but this would mean that your model explains less than simply guessing the mean, which is terrible!) The fact that R squared is standardised makes it a lot easier to develop an intuition about whether it indicates a good or a bad model compared with RMSE and MAE. One thing to note is that R squared always increases when you add more variables to your model. The adjusted R squared adds a penalty for every variable you add so that you don’t overfit. The disadvantage of the adjusted R squared is that it weights each added parameter equally, and so doesn’t take into account that adding x3 may be worthwhile because it explains lots of the variance while adding x6 isn’t because it adds little.
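Both can be sketched directly from their definitions (the data is invented; n is the number of observations and k the number of predictors):

```python
# R squared = 1 - (residual sum of squares / total sum of squares).
actual = [3.0, 5.0, 7.0, 9.0, 11.0]
predicted = [2.8, 5.3, 6.9, 9.4, 10.6]

mean_actual = sum(actual) / len(actual)
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

# Adjusted R squared penalises each of the k added predictors:
n, k = len(actual), 2
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(r2, adj_r2)  # adjusted value is always <= plain R squared
```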

*See the second answer on this stack overflow.


What drives bookings of accessible accommodation?

Accomable is a website which allows disabled travellers to find and book accessible accommodation. [Edit: Accomable were acquired by Airbnb Nov 2017]. They were interested in assessing how they measured up to market standard metrics, and also what drives the number of bookings, a key driver of growth in their business.


Accomable was interested in tracking metrics recommended by a VC firm for marketplaces. The metrics look at buyers, sellers and the overall marketplace to understand how it’s growing and what could fuel further growth. I wrote Stata code which enabled Accomable to select the months of interest and which populated a spreadsheet of the metrics, broken down by month, for the months selected. This involved looping over the selected months for each calculation and using “putexcel” to refresh the spreadsheet.

Modelling Bookings

One of the most important questions at Accomable is “what drives the number of bookings?” More bookings means more customers finding what they need on the site, and ultimately drives profit.

I was interested in answering this question from both the travellers’ and the hosts’ perspectives, as the decision to book is a combination of factors. Some of that data we didn’t have access to, for example the travellers’ specific circumstances at the time: whether they’d just got a windfall or just really needed a break. But we had data about the travellers’ interaction with the site, which is what is within Accomable’s locus of control anyway. The decision to book a particular property of course also brings in the characteristics of that property.

I had access to data on users, properties and bookings which I merged together. I then started engineering features, for example, on the travellers’ side:

  • I extracted whether the prospective traveller had provided their email address and / or phone number (which indicates a certain legitimacy of the booking enquiry)
  • I extracted the length of their message (Airbnb “design for trust” in suggesting the length of messages through the box size – too short a message may not show enough effort but too long a message may scare off a host!)

The characteristics of the property, such as having a step-free bathroom or a pool, were already well-delineated into features, so there wasn’t much more to add. Such characteristics are particularly interesting in this case, given that disabled people often find it difficult to find goods and services that meet their accessibility needs. I did, however, add a feature which indicated whether the property was popular – Accomable doesn’t communicate to travellers which are the most popular properties, so I used this feature as a proxy for underlying quality which isn’t observed in our dataset (and which makes it more likely to get an unbiased, consistent estimator on the other variables).

I also extracted some features of the booking itself, for example:

  • The time between the booking enquiry and the check-in date (with the rationale being to understand whether last minute or pipe dream travel plans contributed more to the number of bookings).
  • The time of day that the traveller sent the booking enquiry (with the rationale being to understand the booking behaviour of the traveller, e.g. are they getting bored at work mid-afternoon and booking a holiday?)
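The features above can be sketched with pandas (the column names and data are hypothetical, not Accomable’s actual schema):

```python
import pandas as pd

# Hypothetical booking enquiries; the column names are invented.
enquiries = pd.DataFrame({
    "enquiry_time": pd.to_datetime(["2017-03-01 15:20", "2017-03-02 09:05"]),
    "check_in": pd.to_datetime(["2017-03-04", "2017-08-15"]),
    "message": ["Hi, is the bathroom step-free?", "Hello"],
    "email": ["a@example.com", None],
})

# Days between enquiry and check-in: last-minute vs pipe-dream trips.
enquiries["lead_days"] = (enquiries["check_in"] - enquiries["enquiry_time"]).dt.days
# Hour of day the enquiry was sent.
enquiries["enquiry_hour"] = enquiries["enquiry_time"].dt.hour
# Message length, and whether contact details were provided.
enquiries["message_len"] = enquiries["message"].str.len()
enquiries["has_email"] = enquiries["email"].notna()

print(enquiries[["lead_days", "enquiry_hour", "message_len", "has_email"]])
```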

I then tried out a number of OLS regression models for the following dependent variables: the number of orders per traveller, the number of bookings per property and the price. The adjusted R squared terms ranged from 0.15 to 0.44 with the price models being the least stable (in terms of the coefficients varying greatly upon the inclusion of additional variables).


  • Bookings per traveller is higher if the traveller had booked a popular property (most popular that month) in the past, and also if they provided their email and their phone number (likely to be indications of seriousness of interest)
  • The number of bookings a property receives is much, much higher when it’s a swap (this could be because it may be easier to trust that the property is suitable if the traveller is swapping homes with a host they know to have similar accessibility requirements), and when there’s an electric bed and a ceiling hoist (this suggests a scarcity in supply of such properties relative to demand). Properties with an electric bed have, on average, 2.2 more bookings than properties without one.
  • Bookings were most popular at 12 noon and 7pm, suggesting lunchtime and evening browsing!


I recommended that Accomable focus on recruiting properties which allow a home swap, and also which cater for travellers requiring the more intensive accessibility features, such as electric beds and ceiling hoists, which other websites are unlikely to be able to cater for. I also recommended focusing marketing emails on the popular booking times.



Promoting Mobile Money in Kenya

“Mobile money” refers to mobile-based money transfer and savings services. Mobile money has been around in Kenya for about 10 years, and according to CCN, the biggest brand, M-Pesa, has 18 million active users in the country and has lifted 2% of Kenyan households out of extreme poverty. Proponents attribute this to mobile money enabling safer and easier savings, and reducing the financial barriers and transaction costs of starting a small business. A competitor was interested in seeing how it could increase its market share.

The client conducted a survey of 399 individuals’ mobile phone and mobile money use. I was also provided with access to a snapshot of the same individuals’ mobile money transaction data. (I describe my approach to the data analysis below, but if you’d like to skip to the presentation of my findings, here it is.)

The data required considerable cleaning for removal of duplicates and correction of typos. I then visualised the transaction data. One of the most interesting things I found as I was exploring was that the balance on the mobile money accounts seemed highly skewed towards low balances.


So I zoomed in a bit…


These graphs gave initial indications that these particular customers don’t seem to use their mobile money accounts for long-term savings (the modal value of savings is 0–5 Kenyan shillings, about US$0.05). The graphs below, which show deposits being low relative to receipts of money, also suggest that the accounts are used more to enable transactions than savings.

I linked the transaction data with the survey data to understand who the customers were, and their behaviour with regard to mobile phone and mobile money use. I used OLS regression (with accompanying tests) to model the costs associated with mobile money, and probits to model the probability of choosing a provider, of using their mobile money service, and of having started using it within the last year. I then hypothesised about the behavioural barriers to accessing the client’s mobile money services, and created some recommendations based on the findings.

I was most interested to find that customers perceive comparing prices between providers, switching providers and sending money to another network as difficult. The customers were very familiar with the price of sending money, weren’t persuaded to change networks by promotions, and most frequently chose a provider because they trusted them. Because of this lack of responsiveness to promotions, I advised the client to focus on developing products which are relatively expensive or under-provided in the current market.

Please see the presentation of my findings.


Previous Projects