DataCamp Certifications: Assess data quality and perform validation tasks

Why validate data?

The best way to understand why this is so important is to think about an example.

Let’s say we ask you to help the marketing team decide which types of coffee shop they should market their new product in. The data description asks you to remove any rows where the number of reviews is missing. Suppose you did not do that. You might go on to tell the marketing team that they should work with a particular type of shop. Months later, that approach is not working and the company is losing money. After a lot of looking back and analyzing why, it turns out the cause was that you did not remove the missing values.

You might think this is an extreme, invented story to scare you, but these are the potential consequences of failing to validate data. While you are studying, the impact is low: you just don't quite get the answer you expected. To a business, the impact could be huge, because decisions get made on inaccurate information simply because the data was not validated and cleaned.

What sort of validation and cleaning will you need to do?

We have included a couple of problems that you need to fix in every project data set. We could make it really easy for you and tell you exactly where to find them, but we want this to be as real as possible. We will tell you how each column should be structured, and you need to make sure that it meets those criteria, just as you will have to in your data jobs.

This could include:

  • Replacing values with another value
  • Removing rows that meet (or do not meet) some criteria
  • Correcting mistakes in the data (e.g. spellings)
  • Converting data to a different type (e.g. converting characters to numbers)
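
As a rough illustration, the four kinds of fix above might look like this in pandas. This is only a sketch: the data, the column names and the rules (for example, treating "-" as "Others") are invented for the example, and your project's data description will tell you what the real rules are.

```python
import pandas as pd

# Hypothetical data: the columns and values are invented for illustration.
df = pd.DataFrame({
    "place_type": ["Cafe", "Coffee Shop", "Espreso Bar", "-"],
    "reviews": ["120", None, "85", "40"],
    "delivery": pd.array([True, True, None, True], dtype="boolean"),
})

# 1. Replace values with another value (assumed rules: "-" means "Others",
#    and a missing delivery flag means False).
df["place_type"] = df["place_type"].replace("-", "Others")
df["delivery"] = df["delivery"].fillna(False).astype(bool)

# 2. Remove rows that meet some criteria: here, rows with no review count.
df = df.dropna(subset=["reviews"]).reset_index(drop=True)

# 3. Correct mistakes in the data, such as a misspelled category.
df["place_type"] = df["place_type"].replace({"Espreso Bar": "Espresso Bar"})

# 4. Convert data to a different type: review counts stored as text become numbers.
df["reviews"] = pd.to_numeric(df["reviews"])

print(df)
```

Checking `df.dtypes` and `df["place_type"].unique()` before and after each step is a simple way to confirm that the fix did what you intended.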

For those of you working on the associate level, the instructions we give will usually be more specific - for example, we will probably tell you exactly what to do with a missing value. But if you are working on the higher level, we may expect you to use your own judgment from time to time.

Because this is an area a lot of candidates find difficult, we produced a webinar especially to discuss this in detail. We recommend you take a look!

https://www.datacamp.com/resources/webinars/data-cleaning-for-everyone

Higher levels: Data Analyst and Data Scientist

If you are taking the higher level Data Analyst and Data Scientist certifications, you will be asked to write a report, which is graded by human markers. 

My tip is to make it really easy for whoever will grade your work. Create a list with one point for each column. That way, the grader will be absolutely certain you have looked at every single column and won't be able to fail you for missing one. Not only are you making the grading easier, but it is also easier for you to see what you have done and be certain you have checked every column.

Here is an example solution:

The original data is 200 rows and 9 columns. After validation, there were 198 rows remaining. The following describes what I did to each column:

  • Region: There were 10 unique regions, as expected
  • Place name: There were 185 unique place names, suggesting that some names are duplicated. This should be confirmed with the team providing the data
  • Place type: There are only 4 place types: Coffee Shop, Cafe, Espresso Bar and Others. This matches what is expected
  • Rating: Values range from 3.9 to 5.0, so all are within the range expected
  • Reviews: I removed rows where the Reviews value was missing. This was 2 rows, leaving 198 rows of data
  • Price: There are 3 price categories, as expected
  • Delivery option: There are 2 delivery options - True/False, as expected
  • Dine-in option: I converted missing values to False. There were originally no False values
  • Takeaway option: I converted missing values to False. There were also originally no False values
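
One possible way to produce the numbers for a list like this is to run one check per column and collect the results as you go. The sketch below uses a tiny invented data set, not the real 200-row one. The column names, the assumed rating range of 0 to 5 and the expected place types are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical miniature data set; all values are invented.
raw = pd.DataFrame({
    "Region": ["A", "B", "A", "C"],
    "Place type": ["Cafe", "Coffee Shop", "Espresso Bar", "Others"],
    "Rating": [4.5, 3.9, 5.0, 4.2],
    "Reviews": [120.0, None, 85.0, 40.0],
    "Dine in option": pd.array([True, None, True, None], dtype="boolean"),
})

# Reviews: remove rows where the value is missing, and count them first.
n_missing_reviews = int(raw["Reviews"].isna().sum())
clean = raw.dropna(subset=["Reviews"]).reset_index(drop=True)

# Dine in option: convert missing values to False, and count them first.
n_missing_dine_in = int(clean["Dine in option"].isna().sum())
clean["Dine in option"] = clean["Dine in option"].fillna(False).astype(bool)

# Place type: compare against the categories we were told to expect (assumed).
expected_types = {"Coffee Shop", "Cafe", "Espresso Bar", "Others"}
unexpected_types = set(clean["Place type"]) - expected_types

# Rating: check every value is inside the assumed valid range of 0 to 5.
ratings_ok = bool(clean["Rating"].between(0, 5).all())

# One line per column, ready to paste into the report.
report = [
    f"The original data is {raw.shape[0]} rows and {raw.shape[1]} columns. "
    f"After validation, there were {len(clean)} rows remaining.",
    f"Region: {clean['Region'].nunique()} unique regions",
    f"Place type: unexpected values found: {sorted(unexpected_types)}",
    f"Rating: values range from {clean['Rating'].min()} to {clean['Rating'].max()}, "
    f"all within expected range: {ratings_ok}",
    f"Reviews: removed {n_missing_reviews} rows with a missing value",
    f"Dine in option: converted {n_missing_dine_in} missing values to False",
]
print("\n".join(report))
```

Counting the missing values before changing them is what lets you state exactly how many rows were affected, which is the detail a grader will look for.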