
Generic orchestration framework for knowledge warehousing workloads utilizing Amazon Redshift RSQL

Tens of thousands of customers run business-critical workloads on Amazon Redshift, AWS's fast, petabyte-scale cloud data warehouse delivering the best price-performance. With Amazon Redshift, you can query data across your data warehouse, operational data stores, and data lake using standard SQL. You can also integrate AWS services like Amazon EMR, Amazon Athena, Amazon SageMaker, AWS Glue, AWS Lake Formation, and Amazon Kinesis to take advantage of all the analytic capabilities in the AWS Cloud.

Amazon Redshift RSQL is a native command-line client for interacting with Amazon Redshift clusters and databases. You can connect to an Amazon Redshift cluster, describe database objects, query data, and view query results in various output formats. You can use Amazon Redshift RSQL to replace existing extract, transform, load (ETL) and automation scripts, such as Teradata BTEQ scripts. You can wrap Amazon Redshift RSQL statements within a shell script to replicate existing functionality in on-premises systems. Amazon Redshift RSQL is available for Linux, Windows, and macOS operating systems.

This post explains how you can create a generic configuration-driven orchestration framework using AWS Step Functions, Amazon Elastic Compute Cloud (Amazon EC2), AWS Lambda, Amazon DynamoDB, and AWS Systems Manager to orchestrate RSQL-based ETL workloads. If you're migrating from legacy data warehouse workloads to Amazon Redshift, you can use this approach to orchestrate your data warehousing workloads.

Solution overview

Customers migrating from legacy data warehouses to Amazon Redshift may have a significant investment in proprietary scripts like Basic Teradata Query (BTEQ) scripting for database automation, ETL, or other tasks. You can now use the AWS Schema Conversion Tool (AWS SCT) to automatically convert proprietary scripts like BTEQ scripts to Amazon Redshift RSQL scripts. The converted scripts run on Amazon Redshift with little to no changes. To learn about new options for database scripting, refer to Accelerate your data warehouse migration to Amazon Redshift – Part 4.

During such migrations, you may also want to modernize your current on-premises, third-party orchestration tools with a cloud-native framework to replicate and enhance your current orchestration capability. Orchestrating data warehouse workloads includes scheduling the jobs, checking if the preconditions have been met, running the business logic embedded within RSQL, monitoring the status of the jobs, and alerting if there are any failures.

This solution allows on-premises customers to migrate to a cloud-native orchestration framework that uses AWS serverless services such as Step Functions, Lambda, DynamoDB, and Systems Manager to run the Amazon Redshift RSQL jobs deployed on a persistent EC2 instance. You can also deploy the solution for greenfield implementations. In addition to meeting functional requirements, this solution also provides full auditing, logging, and monitoring of all ETL and ELT processes that are run.

To ensure high availability and resilience, you can use multiple EC2 instances that are part of an Auto Scaling group along with Amazon Elastic File System (Amazon EFS) to deploy and run the RSQL jobs. When using Auto Scaling groups, you can install RSQL on the EC2 instance as part of the bootstrap script. You can also deploy the Amazon Redshift RSQL scripts onto the EC2 instance using AWS CodePipeline and AWS CodeDeploy. For more details, refer to Auto Scaling groups, the Amazon EFS User Guide, and Integrating CodeDeploy with Amazon EC2 Auto Scaling.

The following diagram illustrates the architecture of the orchestration framework.

Architecture Diagram

The key components of the framework are as follows:

  1. Amazon EventBridge is used as the ETL workflow scheduler, and it triggers a Lambda function on a preset schedule.
  2. The function queries a DynamoDB table for the configuration associated with the RSQL job and queries the status of the job, run mode, and restart information for that job.
  3. After receiving the configuration, the function triggers a Step Functions state machine by passing the configuration details.
  4. Step Functions starts running the different stages (like configuration iteration, run type check, and more) of the workflow.
  5. Step Functions uses the Systems Manager SendCommand API to trigger the RSQL job and goes into a paused state with TaskToken. The RSQL scripts are persisted on an EC2 instance and are wrapped in a shell script. Systems Manager runs an AWS-RunShellScript SSM document to run the RSQL job on the EC2 instance.
  6. The RSQL job performs ETL and ELT operations on the Amazon Redshift cluster. When it's complete, it returns a success/failure code and status message back to the calling shell script.
  7. The shell script calls a custom Python module with the success/failure code, status message, and the callwait TaskToken that was received from Step Functions. The Python module logs the RSQL job status in the DynamoDB job audit table and exports the logs to the Amazon CloudWatch log group.
  8. The Python module then performs a SendTaskSuccess or SendTaskFailure API call based on the RSQL job run status (see the sketch after this list). Based on that status, Step Functions either resumes the flow or stops with failure.
  9. Step Functions logs the workflow status (success or failure) in the DynamoDB workflow audit table.
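The following is a minimal sketch, in Python with boto3, of what the custom status module in steps 7 and 8 might look like. The function name, audit table key schema, and output shape are illustrative assumptions, not the exact module shipped in the repo.

    import json
    import boto3

    sfn = boto3.client("stepfunctions")
    audit_table = boto3.resource("dynamodb").Table("rsql-blog-rsql-job-audit-table")


    def report_job_status(task_token, job_name, workflow_execution_id, exit_code, message):
        """Log the RSQL job result and resume or fail the paused Step Functions branch."""
        status = "Completed" if exit_code == 0 else "Failed"

        # Record the job-level outcome in the job audit table (key schema assumed).
        audit_table.update_item(
            Key={"job_name": job_name, "workflow_execution_id": workflow_execution_id},
            UpdateExpression="SET execution_status = :s",
            ExpressionAttributeValues={":s": status},
        )

        # Hand control back to the waiting Step Functions branch.
        if exit_code == 0:
            sfn.send_task_success(
                taskToken=task_token,
                output=json.dumps({"job_name": job_name, "status": status, "message": message}),
            )
        else:
            sfn.send_task_failure(taskToken=task_token, error="RSQLJobFailed", cause=message)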

Prerequisites

You should have the following prerequisites:

Deploy AWS CDK stacks

Complete the following steps to deploy your resources using the AWS CDK:

  1. Clone the GitHub repo:
    git clone https://github.com/aws-samples/amazon-redshift-rsql-orchestration-framework.git

  2. Update the following environment parameters in cdk.json (this file can be found in the infra directory):
    1. ec2_instance_id – The EC2 instance ID on which the RSQL jobs are deployed
    2. redshift_secret_id – The name of the Secrets Manager key that stores the Amazon Redshift database credentials
    3. rsql_script_path – The absolute directory path on the EC2 instance where the RSQL jobs are stored
    4. rsql_log_path – The absolute directory path on the EC2 instance used for storing the RSQL job logs
    5. rsql_script_wrapper – The absolute directory path of the RSQL wrapper script (rsql_trigger.sh) on the EC2 instance

    The following is a sample cdk.json file after being populated with the parameters:

        "surroundings": {
          "ec2_instance_id" : "i-xxxx",
          "redshift_secret_id" : "blog-secret",
          "rsql_script_path" : "/residence/ec2-user/blog_test/rsql_scripts/",
          "rsql_log_path" : "/residence/ec2-user/blog_test/logs/",
          "rsql_script_wrapper" : "/residence/ec2-user/blog_test/instance_code/rsql_trigger.sh"
        }
    

  3. Deploy the AWS CDK stack with the following code:
    cd amazon-redshift-rsql-orchestration-framework/lambdas/lambda-layer/
    sh zip_lambda_layer.sh
    cd ../../infra/
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
    cdk bootstrap <AWS Account ID>/<AWS Region>
    cdk deploy --all

Let's look at the resources the AWS CDK stack deploys in more detail.

CloudWatch log group

A CloudWatch log group (/ops/rsql-logs/) is created, which is used to store, monitor, and access log files from EC2 instances and other sources.

The log group is used to store the RSQL job run logs. For each RSQL script, all the stdout and stderr logs are stored as a log stream within this log group.
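As an illustration, the following hedged sketch shows one way a job's stdout/stderr file could be pushed into that log group with boto3; the log stream naming convention is an assumption.

    import time
    import boto3

    logs = boto3.client("logs")
    LOG_GROUP = "/ops/rsql-logs/"


    def export_rsql_log(script_name, log_file):
        """Upload an RSQL job's log file as one log stream in the RSQL log group."""
        stream = f"{script_name}-{int(time.time())}"  # assumed naming convention
        logs.create_log_stream(logGroupName=LOG_GROUP, logStreamName=stream)
        with open(log_file) as f:
            events = [
                {"timestamp": int(time.time() * 1000), "message": line.rstrip("\n")}
                for line in f
                if line.strip()
            ]
        if events:
            logs.put_log_events(logGroupName=LOG_GROUP, logStreamName=stream, logEvents=events)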

DynamoDB configuration table

The DynamoDB configuration table (rsql-blog-rsql-config-table) is the basic building block of this solution. All the RSQL jobs, restart information and run mode (sequential or parallel), and the sequence in which the jobs are to be run are stored in this configuration table.

The table has the following structure:

  • workflow_id – The identifier for the RSQL-based ETL workflow.
  • workflow_description – The description for the RSQL-based ETL workflow.
  • workflow_stages – The sequence of stages within a workflow.
  • execution_type – The type of run for RSQL jobs (sequential or parallel).
  • stage_description – The description for the stage.
  • scripts – The list of RSQL scripts to be run. The RSQL scripts must be placed in the location defined in a later step.

The following is an example of an entry in the configuration table. You can see the workflow_id is blog_test_workflow and the description is Test Workflow for Blog.

It has three stages that are triggered in the following order: Schema & Table Creation Stage, Data Insertion Stage 1, and Data Insertion Stage 2. The stage Schema & Table Creation Stage has two RSQL jobs running sequentially, and Data Insertion Stage 1 and Data Insertion Stage 2 each have two jobs running in parallel.

{
	"workflow_id": "blog_test_workflow",
	"workflow_description": "Take a look at Workflow for Weblog",
	"workflow_stages": [{
			"execution_flag": "y",
			"execution_type": "sequential",
			"scripts": [
				"rsql_blog_script_1.sh",
				"rsql_blog_script_2.sh"
			],
			"stage_description": "Schema & Desk Creation Stage"
		},
		{
			"execution_flag": "y",
			"execution_type": "parallel",
			"scripts": [
				"rsql_blog_script_3.sh",
				"rsql_blog_script_4.sh"
			],
			"stage_description": "Knowledge Insertion Stage 1"
		},
		{
			"execution_flag": "y",
			"execution_type": "parallel",
			"scripts": [
				"rsql_blog_script_5.sh",
				"rsql_blog_script_6.sh"
			],
			"stage_description": "Knowledge Insertion Stage 2"
		}
	]
}
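If you prefer to load this configuration programmatically instead of through the DynamoDB console, a short boto3 snippet along these lines would work; the file name workflow_config.json is just a placeholder for the JSON shown above.

    import json
    import boto3

    config_table = boto3.resource("dynamodb").Table("rsql-blog-rsql-config-table")

    # Load the workflow configuration shown above into the config table.
    with open("workflow_config.json") as f:
        config_table.put_item(Item=json.load(f))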

DynamoDB audit tables

The audit tables store the run details for each RSQL job within the ETL workflow with a unique identifier for tracking and reporting purposes. There are two audit tables because one table stores the audit information at the RSQL job level and the other stores it at the workflow level.

The job audit table (rsql-blog-rsql-job-audit-table) has the following structure:

  • job_name – The name of the RSQL script
  • workflow_execution_id – The run ID for the workflow
  • execution_start_ts – The start timestamp for the RSQL job
  • execution_end_ts – The end timestamp for the RSQL job
  • execution_status – The run status of the RSQL job (Running, Completed, Failed)
  • instance_id – The EC2 instance ID on which the RSQL job is run
  • ssm_command_id – The Systems Manager command ID used to trigger the RSQL job
  • workflow_id – The workflow_id under which the RSQL job is run

The workflow audit table (rsql-blog-rsql-workflow-audit-table) has the following structure:

  • workflow_execution_id – The run ID for the workflow
  • workflow_id – The identifier for a particular workflow
  • execution_start_ts – The start timestamp for the workflow
  • execution_status – The run status of the workflow or state machine (Running, Completed, Failed)
  • rsql_jobs – The list of RSQL scripts that are part of the workflow
  • execution_end_ts – The end timestamp for the workflow
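For ad hoc monitoring, you can read these audit records back programmatically. The following sketch assumes workflow_execution_id is the table's partition key; adjust the key condition to match the deployed schema.

    import boto3
    from boto3.dynamodb.conditions import Key

    audit_table = boto3.resource("dynamodb").Table("rsql-blog-rsql-workflow-audit-table")


    def get_workflow_status(workflow_execution_id):
        """Fetch the run status recorded for one workflow run (key schema assumed)."""
        resp = audit_table.query(
            KeyConditionExpression=Key("workflow_execution_id").eq(workflow_execution_id)
        )
        return [item.get("execution_status") for item in resp["Items"]]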

Lambda functions

The AWS CDK creates the Lambda functions that retrieve the config data from the DynamoDB config table, update the audit details in DynamoDB, trigger the RSQL scripts on the EC2 instance, and iterate through each stage. The following is a list of the functions; a simplified sketch of the trigger-side logic appears after the list:

  • rsql-blog-master-iterator-lambda
  • rsql-blog-parallel-load-check-lambda
  • rsql-blog-sequential-iterator-lambda
  • rsql-blog-rsql-invoke-lambda
  • rsql-blog-update-audit-ddb-lambda
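To make the control flow concrete, the following is a simplified sketch of that trigger-side logic: fetch the workflow configuration from DynamoDB and start the primary state machine with it. The environment variable holding the state machine ARN is a placeholder, not the exact name used in the repo.

    import json
    import os
    import boto3

    sfn = boto3.client("stepfunctions")
    config_table = boto3.resource("dynamodb").Table("rsql-blog-rsql-config-table")


    def handler(event, context):
        """Look up the workflow config and kick off the Step Functions run."""
        config = config_table.get_item(Key={"workflow_id": event["workflow_id"]})["Item"]
        sfn.start_execution(
            stateMachineArn=os.environ["MASTER_STATE_MACHINE_ARN"],  # placeholder
            name=event["workflow_execution_id"],
            input=json.dumps(config, default=str),  # default=str handles DynamoDB Decimals
        )
        return {"status": "triggered"}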

Step Functions state machines

This solution implements a Step Functions callback task integration pattern that enables Step Functions workflows to send a token to an external system via a number of AWS services. A CDK sketch of this pattern follows the list of state machines.

The AWS CDK deploys the following state machines:

  • RSQLParallelStateMachine – The parallel state machine is triggered if the execution_type for a stage in the configuration table is set to parallel. The Lambda function with a callback token is triggered in parallel for each of the RSQL scripts using a Map state.
  • RSQLSequentialStateMachine – The sequential state machine is triggered if the execution_type for a stage in the configuration table is set to sequential. This state machine uses an iterator design pattern to run each RSQL job within the stage in the sequence specified in the configuration.
  • RSQLMasterStatemachine – The primary state machine iterates through each stage and triggers different state machines based on the run mode (sequential or parallel) using a Choice state.
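In a state machine definition, the callback pattern corresponds to invoking the Lambda function with the waitForTaskToken integration pattern. The following is a minimal AWS CDK (Python) sketch under assumed construct and payload field names; it is not the exact code in the repo.

    from aws_cdk import aws_lambda as _lambda
    from aws_cdk import aws_stepfunctions as sfn
    from aws_cdk import aws_stepfunctions_tasks as tasks
    from constructs import Construct


    def rsql_invoke_task(scope: Construct, fn: _lambda.IFunction) -> tasks.LambdaInvoke:
        """Task state that triggers the RSQL invoke Lambda and pauses for a callback."""
        return tasks.LambdaInvoke(
            scope,
            "TriggerRSQLJob",
            lambda_function=fn,
            # Pause this branch until the EC2 job calls SendTaskSuccess or SendTaskFailure.
            integration_pattern=sfn.IntegrationPattern.WAIT_FOR_TASK_TOKEN,
            payload=sfn.TaskInput.from_object({
                "task_token": sfn.JsonPath.task_token,  # forwarded to the RSQL job
                "script": sfn.JsonPath.string_at("$.script"),
            }),
        )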

Move the RSQL script and instance code

Copy the instance_code and rsql_scripts directories (present in the GitHub repo) to the EC2 instance. Make sure the framework directory within instance_code is copied as well.

The following screenshots show that the instance_code and rsql_scripts directories are copied to the same parent folder on the EC2 instance.

Instance Code Scripts Image
Instance Code EC2 Copy Image
RSQL Script Image
RSQL Script EC2 Copy Image

RSQL script run workflow

To further illustrate the mechanism to run the RSQL scripts, see the following diagram.

RSQL Script Workflow Diagram

The Lambda function, which gets the configuration details from the configuration DynamoDB table, triggers the Step Functions workflow, which performs the following steps:

  1. A Lambda function defined as a workflow step receives the Step Functions TaskToken and configuration details.
  2. The TaskToken and configuration details are passed to the EC2 instance using the Systems Manager SendCommand API call. After the Lambda function is run, the workflow branch goes into a paused state and waits for a callback token.
  3. The RSQL scripts are run on the EC2 instance, where they perform ETL and ELT on Amazon Redshift. After the scripts are run, the RSQL script passes the completion status and TaskToken to a Python script. This Python script is embedded within the RSQL script.
  4. The Python script updates the RSQL job status (success/failure) in the job audit DynamoDB table. It also exports the RSQL job logs to the CloudWatch log group.
  5. The Python script passes the RSQL job status (success/failure) and the status message back to the Step Functions workflow along with the TaskToken using the SendTaskSuccess or SendTaskFailure API call.
  6. Depending on the job status received, Step Functions either resumes or stops the workflow.
If EC2 Auto Scaling groups are used, you can use the Systems Manager SendCommand to ensure resilience and high availability by specifying multiple EC2 instances (that are part of the Auto Scaling group). For more information, refer to Run commands at scale.

When multiple EC2 instances are used, set the max-concurrency parameter of the RunCommand API call to 1, which makes sure the RSQL job is triggered on only one EC2 instance. For further details, refer to Using concurrency controls.
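For reference, the following is a hedged boto3 sketch of such a SendCommand call; the instance tag used for targeting is an assumption.

    import boto3

    ssm = boto3.client("ssm")

    # Trigger the RSQL wrapper script on at most one instance of the group at a time.
    response = ssm.send_command(
        Targets=[{"Key": "tag:Role", "Values": ["rsql-runner"]}],  # assumed tag
        DocumentName="AWS-RunShellScript",
        Parameters={
            "commands": ["sh /home/ec2-user/blog_test/instance_code/rsql_trigger.sh"]
        },
        MaxConcurrency="1",  # run on only one instance at a time
        MaxErrors="0",
    )
    command_id = response["Command"]["CommandId"]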

Run the orchestration framework

To run the orchestration framework, complete the following steps:

  1. On the DynamoDB console, navigate to the configuration table and insert the configuration details provided earlier. For instructions on how to insert the example JSON configuration details, refer to Write data to a table using the console or AWS CLI.
    DynamoDB Config Insertion
  2. On the Lambda console, open the rsql-blog-rsql-workflow-trigger-lambda function and choose Test.
    Workflow Trigger Lambda Function
  3. Add a test event similar to the following code and choose Test:
    {
    	"workflow_id": "blog_test_workflow",
    	"workflow_execution_id": "demo_test_26"
    }

    Workflow Trigger Lambda function Payload

  4. On the Step Functions console, navigate to the rsql-master-state-machine state machine to open the details page.
    RSQL Master Step Function
  5. Choose Edit, then choose Workflow Studio New. The following screenshot shows the primary state machine.
    RSQL Master Step Function Flow
  6. Choose Cancel to leave Workflow Studio, then choose Cancel again to leave edit mode. You're directed back to the details page.
    RSQL Master Step Function Details
  7. On the Executions tab, choose the latest run.
    RSQL Master Step Function Execution
  8. From the Graph view, you can check the status of each state by choosing it. Every state that uses an external resource has a link to it on the Details tab.
    RSQL Master Step Function Execution Graph
  9. The orchestration framework runs the ETL load, which consists of the following sample RSQL scripts:
    • rsql_blog_script_1.sh – Creates a schema rsql_blog within the database
    • rsql_blog_script_2.sh – Creates a table blog_table within the schema created by the previous script
    • rsql_blog_script_3.sh – Inserts one row into the table created by the previous script
    • rsql_blog_script_4.sh – Inserts one row into the table created by the previous script
    • rsql_blog_script_5.sh – Inserts one row into the table created by the previous script
    • rsql_blog_script_6.sh – Inserts one row into the table created by the previous script

You need to replace these RSQL scripts with the RSQL scripts developed for your workloads by inserting the relevant configuration details into the configuration DynamoDB table (rsql-blog-rsql-config-table).

Validation

After you run the framework, you'll find a schema (called rsql_blog) with one table (called blog_table) created. This table contains four rows.

RSQL Execution Table

You can check the logs of the RSQL job in the CloudWatch log group (/ops/rsql-logs/) as well as the run status of the workflow in the workflow audit DynamoDB table (rsql-blog-rsql-workflow-audit-table).

RSQL Script CloudWatch Logs
RSQL Workflow Audit Record

Clean up

To avoid ongoing charges for the resources that you created, delete them. The AWS CDK deletes all resources except data resources such as DynamoDB tables.

  • First, delete all the AWS CDK stacks.
  • On the DynamoDB console, select the following tables and delete them:
    • rsql-blog-rsql-config-table
    • rsql-blog-rsql-job-audit-table
    • rsql-blog-rsql-workflow-audit-table

Conclusion

You can use Amazon Redshift RSQL, Systems Manager, EC2 instances, and Step Functions to create a modern and cost-effective orchestration framework for your ETL workflows. There is no overhead to create and manage different state machines for each of your ETL workflows. In this post, we demonstrated how to use this configuration-based generic orchestration framework to trigger complex RSQL-based ETL workflows.

You can also trigger an email notification through Amazon Simple Notification Service (Amazon SNS) within the state machine to notify the operations team of the completion status of the ETL process. Further, you can achieve an event-driven ETL orchestration framework by using EventBridge to start the workflow trigger Lambda function.


About the Authors

Akhil is a Data Analytics Consultant at AWS Professional Services. He helps customers design and build scalable data analytics solutions and migrate data pipelines and data warehouses to AWS. In his spare time, he loves traveling, playing games, and watching movies.


Ramesh Raghupathy is a Senior Data Architect with WWCO ProServe at AWS. He works with AWS customers to architect, deploy, and migrate to data warehouses and data lakes on the AWS Cloud. While not at work, Ramesh enjoys traveling, spending time with family, and yoga.

Raza Hafeez is a Senior Data Architect within the Shared Delivery Practice of AWS Professional Services. He has over 12 years of professional experience building and optimizing enterprise data warehouses and is passionate about enabling customers to realize the power of their data. He specializes in migrating enterprise data warehouses to AWS Modern Data Architecture.

Dipal Mahajan is a Lead Consultant with Amazon Web Services based out of India, where he guides global customers to build highly secure, scalable, reliable, and cost-efficient applications in the cloud. He brings extensive experience in software development, architecture, and analytics from industries like finance, telecom, retail, and healthcare.
