About This Architecture
This metadata-driven AWS data pipeline ingests data from Kafka topics and Oracle CDC logs into S3 raw and staging buckets, where Lambda functions validate schemas and transform records. AWS Glue ETL jobs load batch data into Redshift, while Glue streaming jobs handle real-time data flowing from MSK through Kinesis Firehose; Amazon MWAA (Airflow) orchestrates the pipelines on EventBridge schedules. Configuration and metadata stored in dedicated S3 buckets drive pipeline behavior dynamically, so both batch and streaming workloads can be reconfigured without code changes, with CloudWatch providing monitoring and SNS handling alerting. Fork this diagram on Diagrams.so to customize the metadata schema, add data quality checks, or integrate additional source systems such as DynamoDB Streams or RDS change data capture.
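To make the metadata-driven idea concrete, here is a minimal sketch of how a pipeline metadata document stored in the config S3 bucket could drive validation and Glue job arguments. The field names (`source`, `schema`, `target`, `mode`) and the helper functions are illustrative assumptions, not the diagram's actual schema:

```python
import json

# Hypothetical metadata document, as it might be stored in the
# dedicated configuration S3 bucket. Field names are assumptions.
PIPELINE_METADATA = json.loads("""
{
  "pipeline": "orders_ingest",
  "source": {"type": "kafka", "topic": "orders.raw"},
  "schema": {"order_id": "string", "amount": "double", "ts": "timestamp"},
  "target": {"type": "redshift", "table": "analytics.orders"},
  "mode": "batch"
}
""")

def validate_record(record: dict, schema: dict) -> bool:
    """Schema check a validation Lambda might run: every
    field declared in the metadata must be present."""
    return all(field in record for field in schema)

def build_job_args(meta: dict) -> dict:
    """Translate metadata into Glue job arguments, so pipeline
    behavior changes when the S3 config changes, not the code."""
    return {
        "--SOURCE_TOPIC": meta["source"]["topic"],
        "--TARGET_TABLE": meta["target"]["table"],
        "--JOB_MODE": meta["mode"],
    }

record = {"order_id": "A-100", "amount": 42.5, "ts": "2024-01-01T00:00:00Z"}
print(validate_record(record, PIPELINE_METADATA["schema"]))  # True
print(build_job_args(PIPELINE_METADATA)["--JOB_MODE"])       # batch
```

In a real deployment the JSON would be fetched with `boto3` from the config bucket at job start, and adding a new source system would mean adding a new metadata document rather than writing a new job.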