For Data Engineers

 

Data Management: The process of gathering, storing and using data. Includes data collection, data cleaning and transformation, data storage, data quality, and data security.
Associate (Level 1)
Perform data extraction, joining and aggregation tasks (SQL)
  • Aggregate numeric, categorical variables and dates by groups using PostgreSQL.
  • Interpret a database schema and combine multiple tables by rows or columns using PostgreSQL.
  • Extract data based on different conditions using PostgreSQL.
  • Use subqueries to reference a second table (e.g. a different table, an aggregated table) within a query in PostgreSQL
Perform standard data import, joining and aggregation tasks using Python 
  • Import data from flat files into Python.
  • Import data from databases into Python
  • Aggregate numeric, categorical variables and dates by groups using Python.
  • Combine multiple tables by rows or columns using Python.
  • Filter data based on different criteria using Python.
Perform cleaning tasks to prepare data for analysis (SQL & Python)
  • Match strings in a dataset with specific patterns.
  • Convert values between data types.
  • Clean categorical and text data by manipulating strings.
  • Clean date and time data.
Assess data quality and perform validation tasks (SQL & Python)
  • Identify and replace missing values.
  • Perform different types of data validation tasks (e.g. consistency, constraints, range validation, uniqueness).
  • Identify and validate data types in a data set.
Collect data from non-standard formats (e.g. json) by modifying existing code (Python)
  • Adapt provided code to import data from an API using Python.
  • Identify the structure of HTML and JSON data and parse them into a usable format for data processing and analysis using Python.
Interpret a database schema and explain database design concepts (such as normalization, design, schemas, data storage options) 
  • Explain the design schema of a database
  • Identify from a schema how tables are connected and how to join multiple tables
  • Explain concepts in database design (normalization, design schemas, data storage options, etc)
Identify different cloud tools that can be used for storing data and creating and maintaining data pipelines
  • Identify the most common cloud tools used for data storage (file storage and databases)
  • Identify the most common cloud tools used for creating and managing data pipelines
Programming for Data Engineering: Application of fundamental programming concepts to solve data engineering problems
Associate (Level 1)
Use common programming constructs to write repeatable production quality code for data processing (Python)
  • Define, write and execute functions
  • Use and write the control flow statements
  • Use and write loops and iterations
Demonstrates best practices in production code including version control, testing and package development (Python)
  • Describe the basic flow and structures of package development
  • Explain how to document code in packages, or modules
  • Explain the importance of the testing and write testing statements
  • Explain the importance of version control and describe key concepts of versioning
Demonstrates software engineering principles (OOP, profiling, debugging) to write efficient, modular code in Python
  • Use object-oriented programming principles to create basic classes and methods
  • Identify inefficient or memory/CPU intensive code and be able to suggest approaches to improving efficiency and balancing requirements
  • Identify common coding errors and adapt code to remove errors
Exploratory Analysis: Use of statistics and visualizations to analyze and identify trends in datasets. 
Associate (Level 1)
Use data visualization tools to demonstrate characteristics of data (theory)
  • Distinguish between different types of data visualizations (bar chart, box plot, line graph, and histogram) in demonstrating the characteristics of data.
  • Interpret data visualizations (bar chart, box plot, line graph, and histogram) and summarize the characteristics of the data.
Read and analyze data visualizations to represent the relationships between features (theory)
  • Distinguish between different types of data visualizations (scatterplot, heatmap, and pivot table) in representing the relationships between features.
  • Interpret the data visualizations (scatterplot, heatmap, and pivot table) and summarize the relationship between features.