Data Science meets life: finding a car

Challenge: Having just moved to San Francisco, I needed to find a specialist car that was wheelchair accessible (my partner needs to be able to get in the back still on his wheels!). I was shocked at the prices at a local specialist dealership, where even the cheapest, oldest cars start at $30k.


  • Predict how much a car should cost on the basis of the characteristics you’re generally told when you’re buying (age, brand, engine size, etc.).
  • Compare similar cars at the dealership and on Craigslist to see how much of a mark-up there is.
  • Build a web app with a search function suited to finding additional accessibility features.

The first step was getting the data on which to build the model. Having scraped the listings website of a local specialist dealership, I ended up with a list of about 600 cars, their prices and their characteristics. Hmm, not enough data to do much validating and testing with. So I scraped all car and truck listings across the US on Craigslist. This returned about 80,000 listings. Now we’re in business! It returned such beauties as…


[Image: old car]

So I restricted the data to cars that were at least driveable, then iteratively added features to my model, testing each time whether the addition reduced how far off the mark my predictions were. The most complex model I tested was a linear regression with second-order polynomial terms and interaction terms. The next stage is cross-validating my model. With just training and test datasets, I was at risk of learning too much from the test dataset and overfitting to it. Watch this space!
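For readers who want to see what that kind of model and the planned cross-validation might look like, here is a minimal sketch. It is not the code behind this project: the data is synthetic, the two features (age and engine size) and the pricing formula are made up for illustration, and the helper names (`poly2_features`, `fit`, `cross_val_mae`) are hypothetical. It uses only the Python standard library so it runs anywhere.

```python
import random

def poly2_features(row):
    """Expand [x1..xn] into [1, x1..xn, every product xi*xj with i <= j]."""
    feats = [1.0] + list(row)
    n = len(row)
    for i in range(n):
        for j in range(i, n):
            feats.append(row[i] * row[j])  # squares and interaction terms
    return feats

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit(X, y, ridge=1e-8):
    """Least squares via the normal equations (tiny ridge term for stability)."""
    Z = [poly2_features(r) for r in X]
    p = len(Z[0])
    A = [[sum(z[i] * z[j] for z in Z) + (ridge if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(z[i] * t for z, t in zip(Z, y)) for i in range(p)]
    return solve(A, b)

def predict(coefs, row):
    return sum(c * f for c, f in zip(coefs, poly2_features(row)))

def cross_val_mae(X, y, k=5, seed=0):
    """k-fold cross-validation: mean absolute error on each held-out fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        coefs = fit([X[i] for i in train], [y[i] for i in train])
        errs = [abs(predict(coefs, X[i]) - y[i]) for i in held_out]
        scores.append(sum(errs) / len(errs))
    return scores

# Synthetic listings (an assumption for illustration, not scraped data):
# age in years, engine size in litres, price from a made-up formula plus noise.
rng = random.Random(42)
X = [[rng.uniform(0, 20), rng.uniform(1, 5)] for _ in range(200)]
y = [30000 - 2000 * a + 50 * a * a + 3000 * e - 100 * a * e + rng.gauss(0, 500)
     for a, e in X]

scores = cross_val_mae(X, y, k=5)
print("MAE per fold:", [round(s) for s in scores])
print("Mean MAE:", round(sum(scores) / len(scores)))
```

Every listing is held out exactly once, so the averaged error is an estimate of how far off the predictions are on unseen cars, without repeatedly tuning against a single test set. In practice, scikit-learn's `PolynomialFeatures`, `LinearRegression` and `cross_val_score` would replace the hand-written helpers.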

See a technical write-up of my progress so far here and my presentation of the project here.
