For Data Engineers

Updated November 03, 2023 21:58

Data Management: The process of gathering, storing and using data. Includes data collection, data cleaning and transformation, data storage, data quality, and data security.

Associate (Level 1)

Perform data extraction, joining and aggregation tasks (SQL)

Aggregate numeric variables, categorical variables, and dates by group using PostgreSQL.
Interpret a database schema and combine multiple tables by rows or columns using PostgreSQL.
Extract data based on different conditions using PostgreSQL.
Use subqueries to reference a second table (e.g. a different table or an aggregated table) within a query in PostgreSQL.
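
As a rough illustration of the tasks above, the sketch below joins two tables, filters on a condition, aggregates by group, and uses a subquery. It is only a sketch under stated assumptions: the `customers` and `orders` tables are hypothetical, and Python's built-in `sqlite3` module stands in for a PostgreSQL connection so the example runs anywhere. The SQL itself sticks to standard syntax, but dates are stored as text here, whereas a PostgreSQL table would normally use a DATE column.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         amount REAL, order_date TEXT);
    INSERT INTO customers VALUES
        (1, 'Ana', 'North'), (2, 'Ben', 'South'), (3, 'Cy', 'North');
    INSERT INTO orders VALUES
        (1, 1, 50.0, '2023-01-05'),
        (2, 1, 70.0, '2023-02-10'),
        (3, 2, 20.0, '2023-01-20'),
        (4, 3, 200.0, '2023-02-15');
""")

# Join by column, extract rows matching a condition, aggregate by group.
totals_by_region = conn.execute("""
    SELECT c.region, COUNT(*) AS n_orders, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    WHERE o.order_date >= '2023-01-01'
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()

# Subqueries: compare each customer's total with the average customer total.
big_spenders = conn.execute("""
    SELECT c.name
    FROM customers AS c
    JOIN (SELECT customer_id, SUM(amount) AS total
          FROM orders
          GROUP BY customer_id) AS t ON t.customer_id = c.id
    WHERE t.total > (SELECT AVG(s.total)
                     FROM (SELECT SUM(amount) AS total
                           FROM orders
                           GROUP BY customer_id) AS s)
    ORDER BY c.name
""").fetchall()

print(totals_by_region)
print(big_spenders)
```

Combining tables by rows would use UNION or UNION ALL in the same way; it is left out here to keep the sketch short.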

Perform standard data import, joining and aggregation tasks using Python

Import data from flat files into Python.
Import data from databases into Python.
Aggregate numeric variables, categorical variables, and dates by group using Python.
Combine multiple tables by rows or columns using Python.
Filter data based on different criteria using Python.
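
The tasks above might look like the following minimal sketch. It uses only the standard library so that it runs without extra installs; in practice a library such as pandas is commonly used for the same steps. The file contents, column names, and the `regions` lookup are hypothetical, and `io.StringIO` stands in for a real file opened with `open()`.

```python
import csv
import io
from collections import defaultdict

# Assumption: this text stands in for the contents of a flat file on disk.
raw = io.StringIO(
    "store,category,amount\n"
    "A,food,10.5\n"
    "A,toys,4.0\n"
    "B,food,7.5\n"
    "B,food,2.0\n"
)
rows = list(csv.DictReader(raw))  # each row is a dict of strings

# Filter: keep only rows matching a criterion.
food_rows = [r for r in rows if r["category"] == "food"]

# Aggregate: total amount per store.
totals = defaultdict(float)
for r in food_rows:
    totals[r["store"]] += float(r["amount"])
totals = dict(totals)

# Combine by rows: append records that share the same columns.
extra = [{"store": "C", "category": "food", "amount": "1.0"}]
combined = rows + extra

# Combine by columns: add a column by looking up a key in a second table.
regions = {"A": "North", "B": "South"}
with_region = [{**r, "region": regions[r["store"]]} for r in rows]

print(totals)
```

Importing from a database follows the same pattern, with the rows coming from a cursor (for example `sqlite3`'s `execute(...).fetchall()`) instead of `csv.DictReader`.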

Perform cleaning tasks to prepare data for analysis (SQL & Python)

Match strings in a dataset with specific patterns.
Convert values between data types.
Clean categorical and text data by manipulating strings.
Clean date and time data.
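
A small Python sketch of these cleaning steps follows. The records, the email pattern, and the two accepted date formats are assumptions chosen for illustration; real data would need its own rules.

```python
import re
from datetime import datetime

# Hypothetical raw records with messy strings, prices, and dates.
records = [
    {"email": " Ana@Example.com ", "price": "$1,200.50", "signup": "03/11/2023"},
    {"email": "not-an-email", "price": "80", "signup": "2023-11-04"},
]

# Deliberately simple pattern; it is not a full email validator.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")


def parse_date(text):
    """Try each assumed format in turn; return None if none match."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


cleaned = []
for rec in records:
    email = rec["email"].strip().lower()  # clean text by manipulating strings
    cleaned.append({
        "email": email,
        "email_ok": bool(EMAIL_RE.match(email)),  # match a specific pattern
        "price": float(re.sub(r"[^\d.]", "", rec["price"])),  # str -> float
        "signup": parse_date(rec["signup"]),  # str -> date
    })

for row in cleaned:
    print(row)
```

Returning None for an unparseable date, rather than raising, keeps bad rows visible so they can be handled in a later validation step.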

Assess data quality and perform validation tasks (SQL & Python)

Identify and replace missing values.
Perform different types of data validation tasks (e.g. consistency, constraints, range validation, uniqueness).
Identify and validate data types in a data set.
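
The checks above could be sketched in Python as follows. The rows, the 0 to 120 age range, the upper-case country rule, and the choice of the median as a fill value are all assumptions made for the example.

```python
from collections import Counter
from statistics import median

# Hypothetical rows containing one problem of each kind.
rows = [
    {"id": 1, "age": 34, "country": "US"},
    {"id": 2, "age": None, "country": "us"},
    {"id": 2, "age": 210, "country": "DE"},
    {"id": 4, "age": 28, "country": "FR"},
]

# Missing values: identify them, then replace with the median of known ages.
known_ages = [r["age"] for r in rows if r["age"] is not None]
fill_value = median(known_ages)
filled = [{**r, "age": fill_value if r["age"] is None else r["age"]} for r in rows]

# Data types: every age should be an integer after filling.
type_errors = [r["id"] for r in filled if not isinstance(r["age"], int)]

# Range validation: ages outside the assumed plausible range.
range_errors = [(r["id"], r["age"]) for r in filled if not 0 <= r["age"] <= 120]

# Uniqueness: ids that appear more than once.
id_counts = Counter(r["id"] for r in filled)
duplicate_ids = [key for key, n in id_counts.items() if n > 1]

# Consistency: country codes are assumed to be upper case.
inconsistent = [r["country"] for r in filled if r["country"] != r["country"].upper()]

print(fill_value, type_errors, range_errors, duplicate_ids, inconsistent)
```

Note that the out-of-range value 210 still takes part in the median here; a stricter pipeline might run the range check first and exclude such values before choosing a fill value.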

Collect data from non-standard formats (e.g. JSON) by modifying existing code (Python)

Adapt provided code to import data from an API using Python.
Identify the structure of HTML and JSON data and parse them into a usable format for data processing and analysis using Python.
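
The following sketch shows one way to turn JSON and HTML into flat Python structures using only the standard library. The `api_response` string is a hypothetical stand-in for the body returned by an API call (no network request is made), and the HTML table is likewise invented for the example.

```python
import json
from html.parser import HTMLParser

# Assumption: this string is what a hypothetical API would return.
api_response = '{"results": [{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": []}]}'
payload = json.loads(api_response)

# Flatten the nested structure into one record per result.
flat = [{"id": item["id"], "n_tags": len(item["tags"])} for item in payload["results"]]


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # start a new row
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self.rows:
            self.rows[-1].append(data.strip())


html = "<table><tr><td>Ana</td><td>34</td></tr><tr><td>Ben</td><td>28</td></tr></table>"
parser = CellCollector()
parser.feed(html)
parser.close()
table_rows = parser.rows

print(flat)
print(table_rows)
```

Adapting this to a live API would mostly mean replacing the `api_response` string with the text of a real response and adjusting the keys to match its structure.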

Interpret a database schema and explain database design concepts (such as normalization, design schemas, and data storage options)

Explain the design schema of a database.
Identify from a schema how tables are connected and how to join multiple tables.
Explain concepts in database design (normalization, design schemas, data storage options, etc.).
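
To make the idea of reading connections from a schema concrete, here is a minimal sketch with two normalized tables linked by a foreign key. The `authors` and `books` tables are hypothetical, and SQLite (through Python's `sqlite3`) is used as a stand-in database; `PRAGMA foreign_key_list` is SQLite-specific, and other systems expose the same information differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized design: author details live in one place only.
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE books (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        author_id INTEGER NOT NULL REFERENCES authors(id)
    );
    INSERT INTO authors VALUES (1, 'Le Guin');
    INSERT INTO books VALUES
        (1, 'The Dispossessed', 1),
        (2, 'The Lathe of Heaven', 1);
""")

# Each result row is (id, seq, table, from, to, on_update, on_delete, match).
links = [
    (row[3], row[2], row[4])  # (column in books, referenced table, referenced column)
    for row in conn.execute("PRAGMA foreign_key_list(books)")
]

# The foreign key tells us which columns to join on.
joined = conn.execute("""
    SELECT b.title, a.name
    FROM books AS b
    JOIN authors AS a ON a.id = b.author_id
    ORDER BY b.id
""").fetchall()

print(links)
print(joined)
```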

Identify different cloud tools that can be used for storing data and creating and maintaining data pipelines

Identify the most common cloud tools used for data storage (file storage and databases).
Identify the most common cloud tools used for creating and managing data pipelines.

Programming for Data Engineering: Application of fundamental programming concepts to solve data engineering problems

Associate (Level 1)

Use common programming constructs to write repeatable, production-quality code for data processing (Python)

Define, write, and execute functions.
Use and write control flow statements.
Use and write loops and iterations.
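
A short sketch combining the three constructs above: a function with if/elif/else control flow, called from a second function that loops over a dictionary. The function names, size thresholds, and file names are all hypothetical.

```python
def classify_size(n_bytes):
    """Return a size label for a file, using assumed thresholds."""
    if n_bytes < 0:
        raise ValueError("size cannot be negative")
    elif n_bytes < 1_000:
        return "small"
    elif n_bytes < 1_000_000:
        return "medium"
    else:
        return "large"


def summarize(files):
    """Count files per size label, skipping entries with no known size."""
    counts = {}
    for name, n_bytes in files.items():
        if n_bytes is None:
            continue  # control flow inside a loop: skip this entry
        label = classify_size(n_bytes)
        counts[label] = counts.get(label, 0) + 1
    return counts


files = {"a.csv": 512, "b.parquet": 2_500_000, "c.json": 40_000, "d.tmp": None}
summary = summarize(files)
print(summary)
```

Keeping the classification rule in its own function means the thresholds can change in one place without touching the loop.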

Demonstrate best practices in production code, including version control, testing, and package development (Python)

Describe the basic flow and structures of package development.
Explain how to document code in packages or modules.
Explain the importance of testing and write test statements.
Explain the importance of version control and describe key concepts of versioning.
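
For the documentation and testing points above, a minimal sketch might pair a docstring-documented function with a few `unittest` test cases. The function `normalize_name` and its behavior are hypothetical; in a real package the tests would usually live in a separate test module.

```python
import unittest


def normalize_name(name):
    """Return ``name`` stripped of surrounding whitespace and title-cased.

    Raises:
        TypeError: if ``name`` is not a string.
    """
    if not isinstance(name, str):
        raise TypeError("name must be a string")
    return name.strip().title()


class NormalizeNameTests(unittest.TestCase):
    def test_strips_and_title_cases(self):
        self.assertEqual(normalize_name("  ada lovelace "), "Ada Lovelace")

    def test_rejects_non_strings(self):
        with self.assertRaises(TypeError):
            normalize_name(42)


# Run the tests programmatically so the example works as a plain script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeNameTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Committing the function and its tests together under version control makes it possible to see which change broke a test and to roll back to a known-good version.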

Demonstrate software engineering principles (OOP, profiling, debugging) to write efficient, modular code in Python

Use object-oriented programming principles to create basic classes and methods.
Identify inefficient or memory/CPU-intensive code and suggest approaches to improve efficiency and balance requirements.
Identify common coding errors and adapt code to remove errors.
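
The sketch below touches each of the three points: a basic class with methods, a common coding error (a mutable default argument) and how it is avoided, and an inefficient function next to a faster equivalent. The class and function names are hypothetical.

```python
class Batch:
    """A small container for records."""

    def __init__(self, records=None):
        # Common error avoided: writing `records=[]` as the default would
        # share one list between every Batch created without arguments.
        self.records = list(records) if records is not None else []

    def add(self, record):
        self.records.append(record)

    def __len__(self):
        return len(self.records)


def dedupe_slow(ids):
    """Quadratic time: `in` scans the whole list on every iteration."""
    seen = []
    for i in ids:
        if i not in seen:
            seen.append(i)
    return seen


def dedupe_fast(ids):
    """Linear time: dict keys are unique and keep insertion order."""
    return list(dict.fromkeys(ids))


first, second = Batch(), Batch()
first.add({"id": 1})
ids = [3, 1, 3, 2, 1]
print(len(first), len(second))
print(dedupe_slow(ids), dedupe_fast(ids))
```

To confirm which version is faster on realistic input sizes, the standard library's `timeit` and `cProfile` modules can be used to measure rather than guess.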

Exploratory Analysis: Use of statistics and visualizations to analyze and identify trends in datasets.

Associate (Level 1)

Use data visualization tools to demonstrate characteristics of data (theory)

Distinguish between different types of data visualizations (bar chart, box plot, line graph, and histogram) in demonstrating the characteristics of data.
Interpret data visualizations (bar chart, box plot, line graph, and histogram) and summarize the characteristics of the data.

Read and analyze data visualizations to represent the relationships between features (theory)

Distinguish between different types of data visualizations (scatterplot, heatmap, and pivot table) in representing the relationships between features.
Interpret the data visualizations (scatterplot, heatmap, and pivot table) and summarize the relationship between features.