If you are working towards the Data Scientist certification, you will need to demonstrate your ability to select and apply appropriate models to a given business problem.
Specifically, we are looking for:
- Correctly identified the type of problem (regression, classification or clustering)
- Has selected and fitted a model for that problem to be used as a baseline.
- Has selected and fitted a comparison model for the problem that they were provided.
Along with:
- Compared the performance of the two models/approaches using any method appropriate to the type of problem.
- Has described what the model comparison shows about the selected approaches.
Be aware that there are some additional requirements related to the business application that you will also need to meet. Take a look at the business focus and business metrics sections of the rubric.
What type of problem is it?
The first thing we want you to demonstrate is that you can convert a business problem into an analytic one. We will give you a problem posed from the business perspective. But what type of modelling problem is it? Is it a regression problem? A classification problem? An unsupervised clustering problem?
(Hint: it will only be one of those three).
For example, we might give you a problem where the business wants to predict high or low numbers of reviews. You want to be able to predict a binary (high or not) outcome, so this would be a binary classification problem.
And that is all we need you to tell us.
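To make the reviews example concrete, here is a minimal sketch of how a "high or not" outcome could be turned into a binary target. The review counts are made up, and treating "high" as above the median is my assumption for illustration; the business problem you are given may define it differently.

```python
from statistics import median

# Hypothetical review counts for a handful of listings (made-up data).
review_counts = [3, 120, 45, 8, 260, 15, 77, 2]

# Assumption: "high" means strictly above the median review count.
threshold = median(review_counts)

# Binary target: 1 = high number of reviews, 0 = not high.
is_high = [1 if count > threshold else 0 for count in review_counts]

# Exactly two distinct outcome values, so this is binary classification.
n_classes = len(set(is_high))
print(threshold, is_high, n_classes)
```

A target with two categories points to classification, a continuous numeric target points to regression, and no target at all points to clustering.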
How many models?
Now that you know what type of problem you are working on, you want to start fitting some models. There are two things to avoid here. The first is fitting too many models, the second is not fitting enough models.
As you will see from the criteria above, we are looking for you to fit two models. Any time you are developing a model, it is really good practice to start with something simple to use as a baseline. You can then compare any other model you choose to fit against it.
My tip for your baseline model is to keep it really simple. Remember when you were taught about linear regression or logistic regression? Now is the time to use those methods. They are simple to fit, simple to interpret and give you a starting point to go from.
From there, you can really go on and fit anything that you think appropriate. But, you only need to fit one additional model. As well as seeing people forget to fit a second model, one of the other things we see a lot is people fitting every single model they have seen before. This isn't really recommended for any analysis project, and in your practical exam, we would rather you focus on fitting a small number of models and demonstrate your ability to pick appropriate methods to solve the problem.
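As a rough sketch of what "a baseline plus one comparison model" might look like for a classification problem, the example below fits a logistic regression and then a random forest. It assumes scikit-learn is installed. The synthetic data from make_classification, the choice of a random forest as the second model, and the parameter values are all illustrative assumptions, not requirements of the exam.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the exam data (assumption: binary target, 8 features).
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Hold out a test set so both models are judged on the same unseen rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Baseline: simple to fit and simple to interpret.
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)

# Comparison: one additional model, here a random forest.
comparison = RandomForestClassifier(n_estimators=100, random_state=42)
comparison.fit(X_train, y_train)

# Predictions on the same held-out rows, ready for evaluation.
baseline_pred = baseline.predict(X_test)
comparison_pred = comparison.predict(X_test)
```

For a regression problem the same pattern would apply with, for example, a linear regression as the baseline.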
Evaluating the models
Now that you have two models, you need to determine which of them is doing a better job. When it comes to the evaluation criteria, we want to see that you have performed a technical evaluation of your models. Unlike in your courses, here you are going to have to choose which method you are going to use to compare models.
And when you have done that, we also want to know what that tells you about the models. Which of the models performs better? Which would you recommend for the business problem you were given? It doesn't need to be long, but we want to see that you can draw conclusions from your model development process.
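One possible way to do the comparison for a binary classification problem is sketched below: compute the same metrics for both models on the same held-out labels. The labels and predictions are made-up values, classification_metrics is a hypothetical helper, and accuracy, precision and recall are just one reasonable choice. A regression or clustering problem would need different measures.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when a model predicts no positives.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Made-up held-out labels and predictions from two hypothetical models.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
baseline_preds = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]
comparison_preds = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]

baseline_metrics = classification_metrics(y_true, baseline_preds)
comparison_metrics = classification_metrics(y_true, comparison_preds)

# Print the two models side by side, one metric per row.
for name in ("accuracy", "precision", "recall"):
    print(f"{name:<10} baseline={baseline_metrics[name]:.2f} "
          f"comparison={comparison_metrics[name]:.2f}")
```

With these made-up numbers the comparison model scores higher on all three metrics, which is the kind of observation you would then tie back to a recommendation for the business.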