Role: Data Engineer
Focus: Cloud Architecture
Tools: Spark, Airflow, SQL
Passion: Big Data & Pipelines
Developed a scalable streaming pipeline using Kafka, Spark Streaming, and Druid to ingest and analyze clickstream data, providing dashboards with sub-second latency.
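For illustration, a minimal sketch of the ingestion leg of such a pipeline, assuming Spark Structured Streaming reading from a hypothetical `clickstream` Kafka topic; broker addresses, the event schema, and the console sink are placeholders, not the production setup (which wrote to Druid):

```python
# Minimal Spark Structured Streaming sketch: read clickstream events from Kafka
# and maintain a windowed page-view count. Requires the spark-sql-kafka package;
# topic name, brokers, and schema are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("clickstream-ingest").getOrCreate()

event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("page", StringType()),
    StructField("event_time", TimestampType()),
])

clicks = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder brokers
    .option("subscribe", "clickstream")                  # placeholder topic
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Page views per 10-second window; a real job would write to Druid via a
# suitable connector instead of the console sink used here for illustration.
page_counts = clicks.groupBy(window(col("event_time"), "10 seconds"), col("page")).count()

query = (
    page_counts.writeStream
    .outputMode("update")
    .format("console")
    .start()
)
query.awaitTermination()
```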
Led the migration of an on-premises data warehouse to Snowflake, designing new schemas, optimizing ETL jobs (using Airflow & dbt), and reducing query times by 60%.
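As a rough illustration of how that kind of ETL orchestration fits together, a minimal Airflow DAG that loads raw data and then triggers dbt via a BashOperator; the DAG id, schedule, script, and project paths are hypothetical, not the actual migration code:

```python
# Minimal Airflow DAG sketch: load raw data into Snowflake, then run dbt models.
# DAG id, schedule, and paths are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="snowflake_elt_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_raw = BashOperator(
        task_id="load_raw",
        bash_command="python /opt/pipelines/load_raw_to_snowflake.py",  # placeholder script
    )

    run_dbt = BashOperator(
        task_id="run_dbt",
        bash_command="dbt run --project-dir /opt/dbt/warehouse --profiles-dir /opt/dbt",
    )

    load_raw >> run_dbt
```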
Built a robust batch data processing framework on AWS using an S3 data lake, PySpark on EMR, and Airflow for orchestration, handling terabytes of data daily.
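A stripped-down sketch of one daily batch job in such a framework, assuming PySpark reading Parquet from S3 and writing partitioned output; bucket names, columns, and the metric are placeholders:

```python
# Minimal PySpark batch job sketch: read a day of raw Parquet from S3,
# aggregate, and write partitioned output. Bucket names and columns are
# illustrative placeholders; on EMR this would be submitted via spark-submit.
import sys

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def main(run_date: str) -> None:
    spark = SparkSession.builder.appName("daily-batch-example").getOrCreate()

    events = spark.read.parquet(f"s3://example-raw-bucket/events/date={run_date}/")

    daily_metrics = (
        events.groupBy("customer_id")
        .agg(F.count("*").alias("event_count"), F.sum("amount").alias("total_amount"))
        .withColumn("run_date", F.lit(run_date))
    )

    (
        daily_metrics.write
        .mode("overwrite")
        .partitionBy("run_date")
        .parquet("s3://example-curated-bucket/daily_metrics/")
    )

    spark.stop()

if __name__ == "__main__":
    main(sys.argv[1])  # e.g. 2024-01-01, passed in by the orchestrating Airflow task
```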
A comprehensive 5-part series covering advanced techniques for performance tuning Apache Spark applications, including partitioning strategies, memory management, and optimization best practices for production environments.
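By way of example, a sketch of a few of the tuning knobs the series covers (shuffle partitioning, caching, and memory settings); the values are illustrative only, not recommendations for any particular workload:

```python
# Illustrative Spark tuning sketch: partitioning, caching, and memory-related
# settings of the kind discussed in the series. Values are examples only; real
# settings depend on cluster size and data volume.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuning-example")
    .config("spark.sql.shuffle.partitions", "400")   # size shuffles to the cluster
    .config("spark.sql.adaptive.enabled", "true")    # let AQE coalesce small partitions
    .config("spark.executor.memory", "8g")           # executor heap
    .config("spark.memory.fraction", "0.6")          # unified memory fraction
    .getOrCreate()
)

df = spark.read.parquet("s3://example-bucket/events/")  # placeholder path

# Repartition by the join key before a heavy join to reduce skewed shuffles,
# and persist a reused intermediate result in memory, spilling to disk if needed.
keyed = df.repartition(200, "customer_id").persist(StorageLevel.MEMORY_AND_DISK)
keyed.count()  # action to materialize the cache
```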
Deep dive into implementing ACID transactions, time travel, and schema evolution using Delta Lake on AWS. Includes practical examples and performance comparisons with traditional data lake architectures.
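A brief sketch of the features the post walks through (transactional writes, time travel, schema evolution), assuming the delta-spark bindings and a placeholder table path:

```python
# Delta Lake sketch: ACID append, time travel read, and schema evolution.
# Requires the delta-spark package; the table path and columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("delta-example")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "s3://example-bucket/delta/orders"  # placeholder table location

# Transactional append; concurrent readers see a consistent snapshot.
spark.range(100).withColumnRenamed("id", "order_id") \
    .write.format("delta").mode("append").save(path)

# Time travel: read the table as of an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# Schema evolution: mergeSchema lets a write add a new column.
new_rows = spark.range(10).withColumnRenamed("id", "order_id").withColumn("channel", F.lit("web"))
new_rows.write.format("delta").mode("append").option("mergeSchema", "true").save(path)
```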
Step-by-step guide to building scalable real-time data pipelines using Apache Kafka, including producer/consumer patterns, partitioning strategies, and monitoring best practices for high-throughput systems.
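For a taste of the producer/consumer patterns covered, a minimal sketch using the kafka-python client; brokers, topic, and the keying scheme are hypothetical:

```python
# Minimal kafka-python sketch: keyed producer plus a consumer-group reader.
# Brokers, topic, and key scheme are illustrative placeholders; a production
# pipeline would add schemas, retries, and monitoring.
import json

from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers=["broker1:9092"],
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",  # wait for in-sync replicas before acknowledging
)

# Keying by user_id keeps each user's events ordered within one partition.
producer.send("clickstream", key="user-123", value={"page": "/home", "ts": 1700000000})
producer.flush()

consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers=["broker1:9092"],
    group_id="clickstream-processors",  # consumer group enables horizontal scaling
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    print(message.partition, message.offset, message.value)
```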
Implementing automated data quality monitoring using the Great Expectations framework, including custom expectation development, integration with Airflow DAGs, and alerting strategies for data anomalies.
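A minimal sketch of the kind of check described, using the classic pandas-backed Great Expectations API (the exact API varies by version); columns and thresholds are placeholders:

```python
# Minimal Great Expectations sketch using the classic pandas-backed API
# (API details differ across GE versions). Column names and thresholds are
# illustrative placeholders; in practice this would run inside an Airflow
# task that alerts when validation fails.
import great_expectations as ge
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [10.0, 25.5, 7.25],
})

batch = ge.from_pandas(df)

batch.expect_column_values_to_not_be_null("order_id")
batch.expect_column_values_to_be_between("amount", min_value=0, max_value=10000)

results = batch.validate()
if not results.success:
    raise ValueError("Data quality checks failed")  # would trigger an Airflow alert
```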
Comprehensive guide to migrating from on-premises data warehouses to cloud solutions like Snowflake, including schema design, ETL optimization, cost analysis, and lessons learned from real-world migrations.
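To make the bulk-loading step concrete, a sketch using snowflake-connector-python to copy staged files into a table; the account, credentials, stage, and table names are placeholders:

```python
# Sketch of one migration step: bulk-load staged files into Snowflake with
# COPY INTO via snowflake-connector-python. Credentials, stage, and table
# names are placeholders; a real migration wraps this in orchestration and
# validates row counts against the source warehouse.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account",
    user="example_user",
    password="example_password",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="STAGING",
)

try:
    cur = conn.cursor()
    # Load Parquet files exported from the on-premises warehouse into a staging table.
    cur.execute(
        """
        COPY INTO STAGING.ORDERS
        FROM @EXPORT_STAGE/orders/
        FILE_FORMAT = (TYPE = PARQUET)
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
        """
    )
    print(cur.fetchall())  # per-file load status returned by COPY INTO
finally:
    conn.close()
```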
Contributed code to enhance the AWS provider for Apache Airflow, adding new operators and improving existing sensor logic. Merged into the main branch (v2.x).
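As an illustration of how operators and sensors from that provider are used (the contribution itself lives in the provider source and is not reproduced here), a usage sketch with S3KeySensor; bucket, key, and DAG details are placeholders, and import paths can differ slightly between provider versions:

```python
# Usage sketch for the Amazon provider: wait for a file in S3, then kick off
# downstream processing. Bucket, key, and DAG details are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

with DAG(
    dag_id="s3_landing_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    wait_for_export = S3KeySensor(
        task_id="wait_for_export",
        bucket_name="example-landing-bucket",
        bucket_key="exports/{{ ds }}/orders.parquet",  # templated daily key
        poke_interval=300,
    )

    process = BashOperator(
        task_id="process",
        bash_command="echo 'launch downstream processing here'",
    )

    wait_for_export >> process
```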
Authored a 5-part blog series detailing techniques for performance tuning Apache Spark applications, covering partitioning, caching, and memory management.