The competency model for data scientists was developed using job/task analysis, industry research, and expert interviews. The list is not intended to be exhaustive, since expectations of data scientists vary from organization to organization. Instead, we aimed to capture the most common core data competencies a data scientist is expected to demonstrate to be effective in their role.
Data Management: The process of gathering, storing, and using data. Includes data collection, data cleaning and transformation, data storage, data quality, and data security. |
Associate (Level 1) |
Professional (Level 2) |
Perform standard data import, joining and aggregation tasks using R or Python
- Import data from flat files into R or Python.
- Import data from databases into R or Python.
- Aggregate numeric, categorical variables and dates by groups using R or Python.
- Combine multiple tables by rows or columns using R or Python.
- Filter data based on different criteria using R or Python.
|
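As a rough illustration of the import, filter, and aggregate tasks listed above, the sketch below uses only the Python standard library. The in-memory file contents and the column names (`region`, `product`, `amount`) are hypothetical; in practice a library such as pandas would likely be used for the same steps.

```python
import csv
import io
from collections import defaultdict

# Hypothetical flat-file contents; a real script would open a CSV file instead.
raw = io.StringIO(
    "region,product,amount\n"
    "north,widget,10.0\n"
    "north,gadget,5.5\n"
    "south,widget,7.0\n"
    "south,widget,3.0\n"
)

# Import: read each row as a dict and convert the numeric column to float.
rows = [{**r, "amount": float(r["amount"])} for r in csv.DictReader(raw)]

# Filter: keep only the widget sales.
widgets = [r for r in rows if r["product"] == "widget"]

# Aggregate: total amount per region.
totals = defaultdict(float)
for r in widgets:
    totals[r["region"]] += r["amount"]

print(dict(totals))
```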
Collect data from non-standard formats (e.g. JSON) by modifying existing code
- Adapt provided code to import data from an API using R or Python.
- Identify the structure of HTML and JSON data and parse them into a usable format for data processing and analysis using R or Python.
|
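A minimal sketch of parsing nested JSON into flat rows is shown below. The payload and its field names (`results`, `user`, `tags`) are invented for illustration; a real API response would have its own structure, which should be inspected before writing the flattening step.

```python
import json

# Hypothetical API response body; a real call would fetch this over HTTP.
payload = """
{
  "results": [
    {"id": 1, "user": {"name": "Ada", "country": "UK"}, "tags": ["a", "b"]},
    {"id": 2, "user": {"name": "Lin", "country": "SG"}, "tags": []}
  ]
}
"""

data = json.loads(payload)

# Flatten each nested record into one flat row suitable for analysis.
records = [
    {
        "id": item["id"],
        "name": item["user"]["name"],
        "country": item["user"]["country"],
        "n_tags": len(item["tags"]),
    }
    for item in data["results"]
]

print(records)
```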
Perform standard cleaning tasks to prepare data for analysis
- Match strings in a dataset with specific patterns using R or Python.
- Convert values between data types in R or Python.
- Clean categorical and text data by manipulating strings in R or Python.
- Clean date and time data in R or Python.
|
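The cleaning tasks above (pattern matching, type conversion, string and date cleanup) might look like the following sketch. The sample rows, the SKU pattern, and the two date formats are assumptions made for the example, not part of the competency model.

```python
import re
from datetime import datetime

# Hypothetical messy input rows.
raw_rows = [
    {"sku": " ab-101 ", "price": "$1,200.50", "sold_on": "2023-01-15"},
    {"sku": "AB-102", "price": "300", "sold_on": "15/02/2023"},
]

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumed formats present in the data


def parse_date(text):
    """Try each known format in turn and raise if none matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def clean_row(row):
    # Normalize the string, then check it against the expected pattern.
    sku = row["sku"].strip().upper()
    if not re.fullmatch(r"[A-Z]{2}-\d{3}", sku):
        raise ValueError(f"bad sku: {sku!r}")
    # Strip currency symbols and thousands separators before converting.
    price = float(re.sub(r"[^\d.]", "", row["price"]))
    return {"sku": sku, "price": price, "sold_on": parse_date(row["sold_on"])}


cleaned = [clean_row(r) for r in raw_rows]
print(cleaned)
```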
Perform standard data extraction, joining and aggregation tasks using SQL
- Aggregate numeric, categorical variables and dates by groups using PostgreSQL.
- Interpret a database schema and combine multiple tables by rows or columns using PostgreSQL.
- Extract data based on different conditions using PostgreSQL.
- Use subqueries to reference a second table (e.g. a different table, an aggregated table) within a query in PostgreSQL.
|
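The query below sketches a join, a grouped aggregation, and a subquery in one statement. To keep the example self-contained it runs against an in-memory SQLite database from Python, as a stand-in for PostgreSQL; the SQL used here is standard enough that it should behave the same way in PostgreSQL. The `customers` and `orders` tables are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'north'), (2, 'south'), (3, 'south');
    INSERT INTO orders VALUES (1, 1, 50.0), (2, 2, 20.0), (3, 2, 40.0), (4, 3, 10.0);
""")

# Join the two tables, keep orders above the overall average (subquery),
# then total the remaining amounts per region.
query = """
    SELECT c.region, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    WHERE o.amount > (SELECT AVG(amount) FROM orders)
    GROUP BY c.region
    ORDER BY c.region;
"""
result = conn.execute(query).fetchall()
conn.close()

print(result)
```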
Assess data quality and perform validation tasks
- Identify and replace missing values using R or Python.
- Perform different types of data validation tasks (e.g. consistency, constraints, range validation, uniqueness) using R or Python.
- Identify and validate data types in a data set using R or Python.
|
|
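One possible shape for the validation tasks above is a function that collects problems rather than stopping at the first one. The records, the field names, and the 0 to 120 age range are assumptions chosen for the example.

```python
# Hypothetical records with a missing value, a duplicate id, and a bad age.
records = [
    {"id": 1, "age": 34, "email": "a@example.com"},
    {"id": 2, "age": None, "email": "b@example.com"},
    {"id": 2, "age": 151, "email": None},
]


def validate(rows):
    """Return a list of (row_index, problem) tuples."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Uniqueness check on the id column.
        if row["id"] in seen_ids:
            problems.append((i, "duplicate id"))
        seen_ids.add(row["id"])
        # Missing-value check.
        for field in ("age", "email"):
            if row[field] is None:
                problems.append((i, f"missing {field}"))
        # Type and range checks on age.
        age = row["age"]
        if age is not None:
            if not isinstance(age, int):
                problems.append((i, "age not int"))
            elif not 0 <= age <= 120:
                problems.append((i, "age out of range"))
    return problems


issues = validate(records)
print(issues)
```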
Exploratory Analysis: Use of statistics and visualizations to analyze and identify trends in data sets. Includes calculating metrics and creating data visualizations. |
Associate (Level 1) |
Professional (Level 2) |
Calculate metrics to effectively report characteristics of data and relationships between features using Python or R
- Calculate measures of center (e.g. mean, median, mode) for variables using R or Python.
- Calculate measures of spread (e.g. range, standard deviation, variance) for variables using R or Python.
- Calculate skewness for variables using R or Python.
- Calculate missingness for variables and explain its influence on reporting characteristics of data and relationships in R or Python.
- Calculate the correlation between variables using R or Python.
|
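The metrics listed above can be sketched with the standard library alone. The data values are invented, the skewness shown is the simple moment-based estimate (other definitions apply bias corrections), and the correlation is Pearson's r written out by hand so that each step is visible.

```python
import statistics

# Hypothetical numeric variables.
x = [2.0, 4.0, 4.0, 5.0, 10.0]
y = [1.0, 3.0, 2.0, 5.0, 9.0]
column_with_gaps = [2.0, None, 4.0, 5.0, None, 10.0]

center = {
    "mean": statistics.mean(x),
    "median": statistics.median(x),
    "mode": statistics.mode(x),
}
spread = {
    "range": max(x) - min(x),
    "variance": statistics.variance(x),  # sample variance (n - 1)
    "stdev": statistics.stdev(x),
}

# Missingness: share of values that are absent.
missing_rate = sum(v is None for v in column_with_gaps) / len(column_with_gaps)


def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    m = statistics.mean(values)
    m2 = sum((v - m) ** 2 for v in values) / len(values)
    m3 = sum((v - m) ** 3 for v in values) / len(values)
    return m3 / m2 ** 1.5


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    var_a = sum((p - ma) ** 2 for p in a)
    var_b = sum((q - mb) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


skew_x = skewness(x)
r_xy = pearson(x, y)
print(center, spread, missing_rate, skew_x, r_xy)
```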
Identify and reduce the impact of characteristics of data
- Identify when imputation methods should be used and implement them to reduce the impact of missing data on analysis or modeling using R or Python.
- Describe when a transformation to a variable is required and implement corresponding transformations using R or Python.
- Describe the differences between types of missingness and identify relevant approaches to handling types of missingness.
- Identify and handle outliers using R or Python.
|
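As one possible way to handle missing values and outliers, the sketch below imputes with the median and flags outliers with the 1.5 × IQR rule. Both choices are assumptions for illustration; the right method depends on the type of missingness and on the analysis being done.

```python
import statistics

# Hypothetical measurements with two missing values and one extreme value.
values = [12.0, 15.0, None, 14.0, 13.0, 95.0, None, 16.0]

observed = [v for v in values if v is not None]

# Median imputation: the median is less sensitive to the extreme value than the mean.
median = statistics.median(observed)
imputed = [median if v is None else v for v in values]

# Outlier detection with the 1.5 * IQR rule.
q1, _, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in observed if v < low or v > high]

print(imputed, outliers)
```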
Create data visualizations in R or Python to demonstrate the characteristics of data
- Create and customize bar charts using R or Python.
- Create and customize box plots using R or Python.
- Create and customize line graphs using R or Python.
- Create and customize histograms using R or Python.
|
|
Create data visualizations in R or Python to represent the relationships between features
- Create and customize scatterplots using R or Python.
- Create and customize heatmaps using R or Python.
- Create and customize pivot tables using R or Python.
|
|
Statistical Experimentation: Procedure for planning experiments so that the data obtained can be analyzed to yield valid and objective conclusions. |
Associate (Level 1) |
Professional (Level 2) |
Describe statistical concepts that underpin hypothesis testing and experimentation
- Define different statistical distributions (e.g. binomial, normal, Poisson, t-distribution, chi-square, and F-distribution).
- Explain the statistical concepts in hypothesis testing (e.g. null hypothesis, alternative hypothesis, one-tailed and two-tailed hypothesis tests).
- Explain the statistical concepts in experimental design (e.g. control group, randomization, confounding variables).
- Explain parameter estimation and confidence intervals.
|
N/A |
Apply sampling methods to data
- Distinguish between different types of random sampling techniques and apply the methods using R or Python.
- Sample data from a statistical distribution (e.g. normal, binomial, Poisson, exponential) using R or Python.
- Calculate a probability from a statistical distribution (e.g. normal, binomial, Poisson, exponential) using R or Python.
|
|
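The sampling tasks above could be sketched as follows. The population, the `group` field, and the helper names `stratified_sample` and `binomial_pmf` are hypothetical; a fixed seed is used only so that repeated runs give the same draws.

```python
import math
import random

rng = random.Random(42)  # fixed seed so runs are repeatable

# Hypothetical population: 80 members of group A and 20 of group B.
population = [{"id": i, "group": "A" if i < 80 else "B"} for i in range(100)]

# Simple random sample without replacement.
simple = rng.sample(population, k=10)


def stratified_sample(rows, key, fraction, rng):
    """Sample the same fraction from each stratum defined by `key`."""
    strata = {}
    for row in rows:
        strata.setdefault(row[key], []).append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


stratified = stratified_sample(population, "group", 0.1, rng)

# Sample from a statistical distribution: 1000 draws from a standard normal.
draws = [rng.gauss(0.0, 1.0) for _ in range(1000)]


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


# Probability of exactly 3 heads in 5 fair coin flips.
p_three_heads = binomial_pmf(3, 5, 0.5)
print(len(simple), len(stratified), p_three_heads)
```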
Implement methods for performing statistical tests
- Run statistical tests (e.g. t-test, ANOVA test, chi-square test) using R or Python.
- Analyze the results of statistical tests from R or Python.
|
|
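To make the mechanics of a statistical test concrete, the sketch below computes a chi-square test of independence by hand for a hypothetical 2 × 2 table and compares the statistic with the critical value for one degree of freedom at the 5% level (3.841). No continuity correction is applied here. In practice a statistics library would normally be used, and it would also report a p-value.

```python
# Hypothetical contingency table: rows are variants A and B,
# columns are converted / not converted.
observed = [[30, 20], [20, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sum (observed - expected)^2 / expected over all cells, where the expected
# count under independence is row_total * col_total / n.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (obs - expected) ** 2 / expected

CRITICAL_DF1_ALPHA_05 = 3.841  # chi-square critical value, df = 1, alpha = 0.05
reject_null = chi2 > CRITICAL_DF1_ALPHA_05

print(chi2, reject_null)
```

Analyzing the result means reading it against the hypotheses: here the statistic exceeds the critical value, so the null hypothesis of independence would be rejected at the 5% level for this invented table.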
Model Development: Iterative process in which models are designed, tested and built upon. Includes feature engineering, model selection and evaluation. |
Associate (Level 1) |
Professional (Level 2) |
Prepare data for modeling by implementing relevant transformations.
- Create new features from existing data (e.g. categories from continuous data, combining variables with external data) using R or Python.
- Explain the importance of splitting data and split data for training, testing, and validation using R or Python.
- Explain the importance of scaling data and implement scaling methods using R or Python.
- Transform categorical data for modeling using R or Python.
|
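The preparation steps above (splitting, scaling, and encoding categorical data) might be sketched as below. The rows and the 80/20 split are assumptions. The scaling parameters and category list are taken from the training split only, which is one common way to avoid leaking information from the test split.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the split is repeatable

# Hypothetical rows with one numeric and one categorical feature.
pairs = [
    (50, "paris"), (80, "rome"), (65, "paris"), (120, "oslo"), (95, "rome"),
    (70, "oslo"), (55, "paris"), (110, "rome"), (60, "oslo"), (90, "paris"),
]
data = [{"size": float(s), "city": c} for s, c in pairs]

# Split: shuffle a copy, then cut 80% for training and 20% for testing.
shuffled = data[:]
rng.shuffle(shuffled)
cut = int(len(shuffled) * 0.8)
train, test = shuffled[:cut], shuffled[cut:]

# Fit the scaling parameters and category list on the training split only.
train_sizes = [r["size"] for r in train]
mu = statistics.mean(train_sizes)
sigma = statistics.stdev(train_sizes)
categories = sorted({r["city"] for r in train})


def transform(row):
    """Standardize the numeric feature and one-hot encode the category."""
    features = [(row["size"] - mu) / sigma]
    features += [1.0 if row["city"] == c else 0.0 for c in categories]
    return features


X_train = [transform(r) for r in train]
X_test = [transform(r) for r in test]
print(X_train[0], X_test[0])
```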
N/A |
Implement standard modeling approaches for supervised learning problems.
- Identify regression problems and implement models using R or Python.
- Identify classification problems and implement models using R or Python.
|
|
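For the regression case, a minimal sketch of fitting a one-variable linear model by ordinary least squares is shown below. The data points and the helper name `fit_line` are hypothetical, and a modeling library would normally be used for anything beyond a toy example.

```python
import statistics

# Hypothetical training data that follows y = 2x + 1 exactly.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


slope, intercept = fit_line(x, y)
prediction = slope * 5.0 + intercept  # predict for a new input x = 5
print(slope, intercept, prediction)
```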
Implement approaches for unsupervised learning problems.
- Identify clustering problems and implement approaches for them using R or Python.
- Explain dimensionality reduction techniques and implement the techniques using R or Python.
|
|
Use suitable methods to assess the performance of a model.
- Select metrics to evaluate regression models and calculate the metrics using R or Python.
- Select metrics to evaluate classification models and calculate the metrics using R or Python.
- Select metrics and visualizations to evaluate clustering models and implement them using R or Python.
|
|
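The evaluation metrics named above can be written out directly, which helps when explaining what each one measures. The sketch covers RMSE for regression and accuracy, precision, and recall for binary classification; the label and prediction lists are invented.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error for regression predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, and recall for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        # Guard against division by zero when nothing is predicted or present.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical regression and classification results.
reg_error = rmse([3.0, 5.0, 2.0, 7.0], [2.0, 5.0, 4.0, 7.0])
clf = classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(reg_error, clf)
```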
Programming for Data Science: Application of fundamental programming concepts to solve data science problems. |
Associate (Level 1) |
Professional (Level 2) |
Use common programming constructs to write repeatable production quality code for analysis in R or Python.
- Define, write and execute functions
- Use and write control flow statements
- Use and write loops and iterations
|
Demonstrate best practices in production code including version control, testing, and package development using R or Python
- Describe the basic flow and structures of package development
- Explain how to document code in packages or modules
- Explain the importance of testing and write test statements
- Explain the importance of version control and describe key concepts of versioning
|
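A small sketch of documented, tested code is shown below. The function `winsorize` and its two tests are hypothetical examples; the tests are written as plain functions with `assert` statements so they could be picked up by a test runner or called directly.

```python
def winsorize(values, lower, upper):
    """Clamp each value to the closed interval [lower, upper].

    Raises ValueError if the bounds are inverted.
    """
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    return [min(max(v, lower), upper) for v in values]


def test_winsorize_clamps_both_ends():
    assert winsorize([-5, 0, 5, 50], 0, 10) == [0, 0, 5, 10]


def test_winsorize_rejects_inverted_bounds():
    try:
        winsorize([1], 10, 0)
    except ValueError:
        return
    raise AssertionError("expected ValueError")


if __name__ == "__main__":
    # Run the tests directly when the file is executed as a script.
    test_winsorize_clamps_both_ends()
    test_winsorize_rejects_inverted_bounds()
    print("all tests passed")
```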
Data Communication: The practice of presenting data findings in a compelling and actionable way to technical and non-technical audiences. Includes data storytelling, data visualization, presentation and writing skills. |
Associate (Level 1) |
Professional (Level 2) |
Present data concepts to small, diverse audiences
- Explain findings and/or the reasoning for selecting approaches.
|
Frame, convey, and summarize stories using data
- Employ techniques in data storytelling to propose findings and relay solutions to business stakeholders
|
Employ data visualization to support findings
- Create charts using visualization tools.
- Use visualizations that support the findings being presented.
|
Employ multiple tactics (written and verbal) to communicate to business leaders
- Deliver a verbal presentation addressing the business goals, outcomes and recommendations
- Provide a written explanation of findings and/or reasoning for selecting approaches
|
Business Acumen: The collection of both general and organization-specific knowledge about how things get done and why, used with the intent to positively impact the organization. |
Associate (Level 1) |
Professional (Level 2) |
N/A |
Make recommendations for analytic approaches based on business goals
- Explain how the solution addresses the business problem
- Provide recommendations for future action to be taken based on the outcome of the work done
|
|
Judge performance of analytic results against relevant business criteria
- Define a KPI to compare model performance to business criteria in the problem
- Compare the performance of two models or approaches using the defined KPI
|
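To illustrate judging models against a business criterion, the sketch below defines a hypothetical KPI (net profit, assuming each true positive is worth 100 and each false positive costs 20) and compares two invented sets of predictions. The values and the KPI itself are assumptions; a real KPI would come from the business problem.

```python
def net_profit(y_true, y_pred, gain_tp=100.0, cost_fp=20.0):
    """Hypothetical KPI: value of correct positives minus cost of false alarms."""
    profit = 0.0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            profit += gain_tp if truth == 1 else -cost_fp
    return profit


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels and predictions from two candidate models.
y_true = [1, 0, 1, 0, 0, 1]
predictions = {
    "model_a": [1, 1, 1, 1, 0, 1],  # catches every positive, with two false alarms
    "model_b": [1, 0, 0, 0, 0, 1],  # no false alarms, but misses one positive
}

kpi = {name: net_profit(y_true, pred) for name, pred in predictions.items()}
acc = {name: accuracy(y_true, pred) for name, pred in predictions.items()}
best_by_kpi = max(kpi, key=kpi.get)
best_by_accuracy = max(acc, key=acc.get)

print(kpi, acc, best_by_kpi, best_by_accuracy)
```

In this invented case the model with the lower accuracy scores higher on the KPI, which is the kind of gap between statistical and business performance that the comparison is meant to surface.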