The competency model for data scientists was developed using job/task analysis, industry research, and expert interviews. The list is not exhaustive, because expectations of data scientists vary from organization to organization. It aims to capture the most common core competencies a data scientist is expected to demonstrate to be effective in the role.
Data Management: The process of gathering, storing, and using data. Includes data collection, data cleaning and transformation, data storage, data quality, and data security. 
Associate (Level 1) 
Professional (Level 2) 
Perform standard data import, joining and aggregation tasks using R or Python
 Import data from flat files into R or Python.
 Import data from databases into R or Python.
 Aggregate numeric, categorical variables and dates by groups using R or Python.
 Combine multiple tables by rows or columns using R or Python.
 Filter data based on different criteria using R or Python.

Collect data from nonstandard formats (e.g. JSON) by modifying existing code
 Adapt provided code to import data from an API using R or Python.
 Identify the structure of HTML and JSON data and parse them into a usable format for data processing and analysis using R or Python.
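
As an illustration of parsing JSON into a usable format, the following minimal Python sketch flattens a nested payload into one flat record per item. The payload, its field names, and the `flatten_records` helper are hypothetical; a real API response would have its own structure.

```python
import json

# Hypothetical API response; the field names are illustrative assumptions.
payload = """
{
  "results": [
    {"id": 1, "user": {"name": "Ana", "country": "PT"}, "orders": [10.5, 4.0]},
    {"id": 2, "user": {"name": "Ben", "country": "US"}, "orders": []}
  ]
}
"""


def flatten_records(raw):
    """Parse a JSON string and flatten each nested item into a flat dict."""
    data = json.loads(raw)
    rows = []
    for item in data["results"]:
        rows.append({
            "id": item["id"],
            "name": item["user"]["name"],
            "country": item["user"]["country"],
            "order_count": len(item["orders"]),
            "order_total": sum(item["orders"]),
        })
    return rows


rows = flatten_records(payload)
for row in rows:
    print(row)
```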

Perform standard cleaning tasks to prepare data for analysis
 Match strings in a dataset with specific patterns using R or Python.
 Convert values between data types in R or Python.
 Clean categorical and text data by manipulating strings in R or Python.
 Clean date and time data in R or Python.
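
The cleaning tasks above (pattern matching, type conversion, string manipulation, date handling) might look like the following Python sketch. The sample rows, the two accepted date formats, and the helper names are assumptions made for illustration only.

```python
import re
from datetime import datetime

# Hypothetical raw rows with inconsistent text, prices, and date formats.
raw_rows = [
    {"category": "  Electronics ", "price": "$1,200.50", "date": "2023-01-15"},
    {"category": "electronics", "price": "300", "date": "15/02/2023"},
]

# Assumed set of date formats present in the data.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(text):
    """Try each known format in turn; return None if none of them match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def clean_row(row):
    return {
        # Trim whitespace and normalize case so equal categories compare equal.
        "category": row["category"].strip().lower(),
        # Drop everything except digits and the decimal point, then convert.
        "price": float(re.sub(r"[^0-9.]", "", row["price"])),
        "date": parse_date(row["date"]),
    }


cleaned = [clean_row(r) for r in raw_rows]
```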

Perform standard data extraction, joining and aggregation tasks using SQL
 Aggregate numeric, categorical variables and dates by groups using PostgreSQL.
 Interpret a database schema and combine multiple tables by rows or columns using PostgreSQL.
 Extract data based on different conditions using PostgreSQL.
 Use subqueries to reference a second table (e.g. a different table, an aggregated table) within a query in PostgreSQL
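
A minimal sketch of the join, aggregation, and subquery tasks above. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL so that it runs without a database server; the queries use only standard SQL. The `customers` and `orders` tables and their contents are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for a PostgreSQL instance.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'North'), (2, 'South'), (3, 'North');
INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 30.0), (3, 2, 20.0), (4, 3, 100.0);
""")

# Join two tables and aggregate: total order amount per region.
totals_by_region = conn.execute("""
    SELECT c.region, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()

# Subquery on an aggregated table: orders above the overall average amount.
above_average = conn.execute("""
    SELECT id, amount
    FROM orders
    WHERE amount > (SELECT AVG(amount) FROM orders)
    ORDER BY id
""").fetchall()

conn.close()
```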

Assess data quality and perform validation tasks
 Identify and replace missing values using R or Python.
 Perform different types of data validation tasks (e.g. consistency, constraints, range validation, uniqueness) using R or Python.
 Identify and validate data types in a data set using R or Python.
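
One possible shape for the validation checks above (missing values, data types, range, uniqueness) is sketched below in Python. The records, the accepted age range, and the `validate` helper are hypothetical.

```python
# Hypothetical records containing a missing value, a duplicate id,
# and an out-of-range age.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 2, "age": 150},
]


def validate(rows, age_range=(0, 120)):
    """Return a list of (row index, issue) pairs found in the rows."""
    issues = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Uniqueness check on the identifier.
        if row["id"] in seen_ids:
            issues.append((i, "duplicate id"))
        seen_ids.add(row["id"])

        age = row["age"]
        if age is None:
            issues.append((i, "missing age"))  # missing-value check
        elif not isinstance(age, int):
            issues.append((i, "age is not an integer"))  # data type check
        elif not age_range[0] <= age <= age_range[1]:
            issues.append((i, "age out of range"))  # range check
    return issues


issues = validate(records)
```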


Exploratory Analysis: Use of statistics and visualizations to analyze and identify trends in data sets. Includes calculating metrics and creating data visualizations. 
Associate (Level 1) 
Professional (Level 2) 
Calculate metrics to effectively report characteristics of data and relationships between features using Python or R
 Calculate measures of center (e.g. mean, median, mode) for variables using R or Python.
 Calculate measures of spread (e.g. range, standard deviation, variance) for variables using R or Python.
 Calculate skewness for variables using R or Python.
 Calculate missingness for variables and explain its influence on reporting characteristics of data and relationships in R or Python.
 Calculate the correlation between variables using R or Python.
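
The metrics above can be computed with Python's standard library alone, as in the following sketch. The sample values are made up, and `skewness` and `pearson` are illustrative helpers that use the population forms of the formulas; a statistics package may default to sample-adjusted versions.

```python
import math
import statistics

values = [2, 4, 4, 5, 7, 9, 25]  # made-up sample with a long right tail

# Measures of center.
mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)

# Measures of spread.
value_range = max(values) - min(values)
variance = statistics.pvariance(values)
std_dev = statistics.pstdev(values)


def skewness(xs):
    """Population skewness: the mean of the cubed z-scores."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Share of missing entries in a column that uses None for missing values.
column = [1.0, None, 3.0, None]
missing_share = sum(v is None for v in column) / len(column)
```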

Identify and reduce the impact of characteristics of data
 Identify when imputation methods should be used and implement them to reduce the impact of missing data on analysis or modeling using R or Python.
 Describe when a transformation to a variable is required and implement corresponding transformations using R or Python.
 Describe the differences between types of missingness and identify relevant approaches to handling types of missingness.
 Identify and handle outliers using R or Python.
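
As a sketch of two common choices for the tasks above, the Python code below fills missing values with the median and flags outliers with the 1.5 × IQR rule. Both are illustrative assumptions rather than prescribed methods; the appropriate approach depends on the type of missingness and on the analysis. The readings are made up.

```python
import statistics

readings = [10, 12, None, 11, 13, 12, 95]  # None marks a missing value


def impute_median(values):
    """Replace missing entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


imputed = impute_median(readings)
outliers = iqr_outliers(imputed)
```

The median is used for imputation here because, unlike the mean, it is not pulled toward the extreme reading.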

Create data visualizations in R or Python to demonstrate the characteristics of data
 Create and customize bar charts using R or Python.
 Create and customize box plots using R or Python.
 Create and customize line graphs using R or Python.
 Create and customize histograms using R or Python.


Create data visualizations in R or Python to represent the relationships between features
 Create and customize scatterplots using R or Python.
 Create and customize heatmaps using R or Python.
 Create and customize pivot tables using R or Python.


Statistical Experimentation: Procedure for planning experiments so that the data obtained can be analyzed to yield valid and objective conclusions. 
Associate (Level 1) 
Professional (Level 2) 
Describe statistical concepts that underpin hypothesis testing and experimentation
 Define different statistical distributions (e.g. binomial, normal, Poisson, t-distribution, chi-square, and F-distribution).
 Explain the statistical concepts in hypothesis testing (e.g. null hypothesis, alternative hypothesis, one-tailed and two-tailed hypothesis tests).
 Explain the statistical concepts in experimental design (e.g. control group, randomization, confounding variables).
 Explain parameter estimation and confidence intervals.

N/A 
Apply sampling methods to data
 Distinguish between different types of random sampling techniques and apply the methods using R or Python
 Sample data from a statistical distribution (e.g. normal, binomial, Poisson, exponential, etc.) using R or Python
 Calculate a probability from a statistical distribution (e.g. normal, binomial, Poisson, exponential, etc.) using R or Python
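
The sampling tasks above can be sketched with Python's standard library as follows. The population, the distribution parameters, and the `binomial_pmf` helper are assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist

rng = random.Random(42)  # fixed seed so the draws are repeatable

# Simple random sample without replacement from a hypothetical population.
population = list(range(1, 101))
simple_sample = rng.sample(population, k=10)

# Sample data from a normal distribution with mean 50 and std. deviation 10.
normal_draws = [rng.gauss(50, 10) for _ in range(5)]

# Probability from a normal distribution: P(X <= 60) for X ~ N(50, 10).
p_below_60 = NormalDist(mu=50, sigma=10).cdf(60)


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success rate p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability of exactly 2 heads in 4 fair coin flips.
p_two_heads = binomial_pmf(2, 4, 0.5)
```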


Implement methods for performing statistical tests
 Run statistical tests (e.g. t-test, ANOVA, chi-square test) using R or Python.
 Analyze the results of statistical tests from R or Python.
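
To make one of these tests concrete, the Python sketch below computes the chi-square statistic for a test of independence on a 2×2 table and compares it with the critical value for one degree of freedom at the 5% level (3.841). The contingency table is made up, and the statistic is computed by hand without a continuity correction; in practice a library routine such as `scipy.stats.chi2_contingency` would normally be used and would also return a p-value.

```python
def chi_square_statistic(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat


# Hypothetical counts: rows are groups, columns are outcomes.
observed = [[30, 10], [20, 40]]
statistic = chi_square_statistic(observed)

CRITICAL_VALUE = 3.841  # chi-square distribution, df = 1, alpha = 0.05
reject_null = statistic > CRITICAL_VALUE
```

Here the statistic exceeds the critical value, so the null hypothesis of independence would be rejected at the 5% level.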


Model Development: Iterative process in which models are designed, tested and built upon. Includes feature engineering, model selection and evaluation. 
Associate (Level 1) 
Professional (Level 2) 
Prepare data for modeling by implementing relevant transformations.
 Create new features from existing data (e.g. categories from continuous data, combining variables with external data) using R or Python.
 Explain the importance of splitting data and split data for training, testing, and validation using R or Python.
 Explain the importance of scaling data and implement scaling methods using R or Python.
 Transform categorical data for modeling using R or Python.
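
The preparation steps above (splitting, scaling, encoding categories) are sketched below using only Python's standard library. The rows and helper names are hypothetical, and a modeling library would normally provide equivalents. The scaler is fitted on the training rows only, so no information from the test rows leaks into the transformation.

```python
import random
import statistics


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split off a test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_scaler(values):
    """Return the mean and population standard deviation of the values."""
    return statistics.fmean(values), statistics.pstdev(values)


def scale(values, mean, std):
    """Standardize values with a previously fitted mean and std. deviation."""
    return [(v - mean) / std for v in values]


def one_hot(value, categories):
    """Encode a category as a 0/1 vector; unseen categories give all zeros."""
    return [1 if value == c else 0 for c in categories]


# Hypothetical data set with one numeric and one categorical feature.
incomes = [20.0, 35.0, 50.0, 65.0, 80.0, 95.0, 110.0, 125.0]
segments = ["a", "b", "a", "c", "b", "c", "a", "b"]
rows = [{"income": x, "segment": s} for x, s in zip(incomes, segments)]

train, test = train_test_split(rows)
mean, std = fit_scaler([r["income"] for r in train])
train_scaled = scale([r["income"] for r in train], mean, std)
test_scaled = scale([r["income"] for r in test], mean, std)
encoded = [one_hot(r["segment"], ["a", "b", "c"]) for r in train]
```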

N/A 
Implement standard modeling approaches for supervised learning problems.
 Identify regression problems and implement models using R or Python.
 Identify classification problems and implement models using R or Python.


Implement approaches for unsupervised learning problems.
 Identify clustering problems and implement approaches for them using R or Python.
 Explain dimensionality reduction techniques and implement the techniques using R or Python.


Use suitable methods to assess the performance of a model.
 Select metrics to evaluate regression models and calculate the metrics using R or Python.
 Select metrics to evaluate classification models and calculate the metrics using R or Python.
 Select metrics and visualizations to evaluate clustering models and implement them using R or Python.
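
As an example of calculating evaluation metrics, the Python sketch below computes RMSE for a regression model and accuracy, precision, and recall for a binary classifier. The predictions are made up, and the helper functions are illustrative rather than a reference implementation.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, and recall for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        # Guard against division by zero when nothing is predicted positive
        # or no positives exist.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


regression_error = rmse([3, 5, 2], [2, 5, 4])
metrics = classification_metrics([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1])
```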


Programming for Data Science: Application of fundamental programming concepts to solve data science problems. 
Associate (Level 1) 
Professional (Level 2) 
Use common programming constructs to write repeatable production quality code for analysis in R or Python.
 Define, write and execute functions
 Use and write control flow statements
 Use and write loops and iterations

Demonstrate best practices in production code, including version control, testing, and package development, using R or Python
 Describe the basic flow and structures of package development
 Explain how to document code in packages, or modules
 Explain the importance of testing and write test statements
 Explain the importance of version control and describe key concepts of versioning

Data Communication: The practice of presenting data findings in a compelling and actionable way to technical and non-technical audiences. Includes data storytelling, data visualization, presentation and writing skills. 
Associate (Level 1) 
Professional (Level 2) 
Present data concepts to small, diverse audiences
 Explain findings and/or the reasoning for selecting approaches.

Frame, convey, and summarize stories using data
 Employ techniques in data storytelling to propose findings and relay solutions to business stakeholders

Employ data visualization to support findings
 Create charts using visualization tools.
 Use visualizations that support the findings being presented.

Employ multiple tactics (written and verbal) to communicate to business leaders
 Deliver a verbal presentation addressing the business goals, outcomes and recommendations
 Provide a written explanation of findings and/or reasoning for selecting approaches

Business Acumen: The collection of both general and organization-specific knowledge about how things get done and why, used with the intent to positively impact the organization. 
Associate (Level 1) 
Professional (Level 2) 
N/A 
Make recommendations for analytic approaches based on business goals
 Explain how the solution addresses the business problem
 Provide recommendations for future action to be taken based on the outcome of the work done


Judge performance of analytic results against relevant business criteria
 Define a KPI to compare model performance to business criteria in the problem
 Compare the performance of two models/approaches using the defined KPI
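
A worked illustration of defining a KPI and using it to compare two models. It assumes a hypothetical business setting in which a false negative costs ten times as much as a false positive; the cost figures, error counts, and model names are invented for the example.

```python
# Assumed business costs per error type (hypothetical figures).
COST_FALSE_POSITIVE = 5.0
COST_FALSE_NEGATIVE = 50.0


def error_cost(false_positives, false_negatives):
    """KPI: total cost of a model's errors under the assumed business costs."""
    return (false_positives * COST_FALSE_POSITIVE
            + false_negatives * COST_FALSE_NEGATIVE)


# Hypothetical error counts for two candidate models on the same test set.
models = {
    "model_a": {"false_positives": 10, "false_negatives": 4},
    "model_b": {"false_positives": 30, "false_negatives": 1},
}

costs = {name: error_cost(**counts) for name, counts in models.items()}
best_model = min(costs, key=costs.get)
```

Under these assumptions the second model makes more errors in total (31 versus 14) but has the lower error cost (200 versus 250), which shows why a KPI tied to business criteria can rank models differently than raw accuracy.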
