About This Architecture
This metadata-driven AWS data pipeline ingests data from Kafka topics and Oracle CDC logs into S3 raw and staging buckets, where Lambda functions validate schemas and transform records. AWS Glue ETL jobs load batch data into Redshift, while Glue streaming jobs handle real-time data flowing from MSK through Kinesis Firehose; Amazon MWAA (Airflow) orchestrates the pipelines on EventBridge schedules. Configuration and metadata stored in dedicated S3 buckets drive pipeline behavior dynamically, so both batch and streaming workloads can be reconfigured without code changes, with CloudWatch providing monitoring and SNS handling alerting. Fork this diagram on Diagrams.so to customize the metadata schema, add data quality checks, or integrate additional source systems such as DynamoDB Streams or RDS change data capture.
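To make the metadata-driven idea concrete, here is a minimal sketch of how a pipeline metadata document stored in the config S3 bucket could drive validation and Glue job arguments. The field names (`source`, `schema`, `target`, `mode`) and the helper functions are illustrative assumptions, not the diagram's actual schema:

```python
import json

# Hypothetical metadata document, as it might be stored in the
# dedicated configuration S3 bucket. Field names are assumptions.
PIPELINE_METADATA = json.loads("""
{
  "pipeline": "orders_ingest",
  "source": {"type": "kafka", "topic": "orders.raw"},
  "schema": {"order_id": "string", "amount": "double", "ts": "timestamp"},
  "target": {"type": "redshift", "table": "analytics.orders"},
  "mode": "batch"
}
""")

def validate_record(record: dict, schema: dict) -> bool:
    """Schema check a validation Lambda might run: every
    field declared in the metadata must be present."""
    return all(field in record for field in schema)

def build_job_args(meta: dict) -> dict:
    """Translate metadata into Glue job arguments, so pipeline
    behavior changes when the S3 config changes, not the code."""
    return {
        "--SOURCE_TOPIC": meta["source"]["topic"],
        "--TARGET_TABLE": meta["target"]["table"],
        "--JOB_MODE": meta["mode"],
    }

record = {"order_id": "A-100", "amount": 42.5, "ts": "2024-01-01T00:00:00Z"}
print(validate_record(record, PIPELINE_METADATA["schema"]))  # True
print(build_job_args(PIPELINE_METADATA)["--JOB_MODE"])       # batch
```

In a real deployment the JSON would be fetched with `boto3` from the config bucket at job start, and adding a new source system would mean adding a new metadata document rather than writing a new job.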