AWS · Snowflake · Spark · Kafka · dbt · Airflow

Raw data in.
Reliable decisions out.
— at any scale.

Data Engineer with 5+ years designing cloud-native data platforms at Amazon and TouchWorld. I build the pipelines that power ML models, BI dashboards, and regulatory compliance across insurance, healthcare, and retail data.

Amazon: 3 years at high-volume scale
HIPAA: Healthcare data governance built
ML-Ready: Feature stores for fraud & risk models
TouchWorld
Data Engineer · Farmington, MI
Feb 2025 — Present
Current role
Designed AWS Glue + PySpark pipelines processing large-scale insurance data — policies, claims, underwriting, and customer records — for downstream analytics and fraud detection.
Implemented CDC and incremental ingestion for historical and late-arriving claims, enabling timely underwriting insights and reducing stale data incidents.
Built ML-ready feature stores in Snowflake and Redshift — curated datasets powering fraud detection, risk scoring, and customer churn models for data science teams.
Strengthened data governance with IAM, KMS, Secrets Manager, and Lake Formation to meet HIPAA and GDPR standards across all data assets.
Containerized ETL jobs with Docker + EKS and automated all infrastructure with Terraform — zero manual deployments to production.
Python · PySpark · AWS Glue · S3 · Redshift · Snowflake · Airflow · Lambda · Lake Formation · Docker · Terraform · EKS
Amazon
Data Engineer · Hyderabad, India
Sep 2020 — Jul 2023
3 years
Designed and deployed AWS infrastructure (Lambda, Step Functions, DynamoDB, Redshift) supporting high-volume applications that handle millions of transactions per day, with a focus on reliability and performance.
Built and optimized ETL pipelines and Redshift data models integrating diverse transactional datasets, enabling accurate, timely analytics for cross-functional teams.
Operated large-scale Spark/EMR jobs and fine-tuned SQL queries for high-performance data processing across big data workloads.
Automated data ingestion with Python and REST APIs, eliminating manual dependencies and enabling self-service data access for analysts.
Contributed to CI/CD with Jenkins and Terraform; orchestrated microservices on Kubernetes (EKS) to shorten deployment cycles and improve scalability.
Python · SQL · Spark · EMR · Redshift · Lambda · Step Functions · DynamoDB · Jenkins · Terraform · Kubernetes
GSPANN
Data Analyst · Hyderabad, India
Aug 2019 — Sep 2020
1 year
Built data pipelines consolidating sales, customer, and inventory data from multiple retail systems, improving accessibility for reporting and analytics.
Migrated ETL workflows to AWS EMR and Aurora Postgres, increasing batch processing reliability and throughput.
Automated ingestion pipelines (API, ODBC, file-based) to streamline POS data integration and reduce manual processing effort.
SQL · Python · AWS EMR · Aurora Postgres · Redshift
01
AWS Glue · PySpark · Snowflake · Airflow
Problem — Insurer drowning in unprocessed, low-quality claims

Healthcare Claims Data Platform

End-to-end platform ingesting, validating, and curating insurance claims. Batch and CDC ingestion with HIPAA-compliant handling, late-arriving data logic, and automated quality gates — producing ML-ready feature stores and analytics datasets.

HIPAA compliant
Batch+CDC dual ingestion
ML-ready feature store output
View on GitHub
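The late-arriving data logic mentioned above can be illustrated with a watermark plus a lookback window. This is a minimal sketch under stated assumptions, not the repository's actual code: the function name `select_incremental`, the `service_date` field, and the three-day lookback are all hypothetical.

```python
from datetime import date, timedelta


def select_incremental(records, last_watermark, lookback=timedelta(days=3)):
    """Pick the records to (re)process and compute the next watermark.

    Records whose event date falls inside the lookback window before the
    previous watermark are selected again, so a claim that arrives late is
    still captured on a later run. Downstream writes must therefore be
    idempotent (for example, a keyed merge).
    """
    cutoff = last_watermark - lookback
    selected = [r for r in records if r["service_date"] > cutoff]
    # The watermark never moves backwards, even if only late rows arrived.
    new_watermark = max([r["service_date"] for r in selected] + [last_watermark])
    return selected, new_watermark
```

A wider lookback catches more late claims at the cost of reprocessing more rows on every run.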
02
PostgreSQL · AWS DMS · S3 · Snowflake
Problem — Nightly full-loads missing real-time transactional changes

Change Data Capture Pipeline

CDC pipeline from PostgreSQL to Snowflake via AWS DMS. Full idempotency, deduplication, schema evolution, and historical preservation — replacing expensive full-loads with sub-minute incremental ingestion and zero data loss.

<60s change latency
100% idempotent
Zero data loss
View on GitHub
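To make the idempotency and deduplication claims concrete, here is a small in-memory sketch of applying change events so that replaying a batch leaves the target unchanged. It is an illustration only: the event fields (`pk`, `op`, `lsn`, `row`) and the function `apply_cdc` are hypothetical, and a real pipeline would express the same rule as a keyed merge in the warehouse.

```python
def apply_cdc(target, events):
    """Apply CDC events to `target` (a dict keyed by primary key).

    Each event carries a log position (`lsn`). An event is applied only if
    its position is newer than what the target already holds for that key,
    so duplicates and replays are skipped.
    """
    for ev in sorted(events, key=lambda e: e["lsn"]):
        current = target.get(ev["pk"])
        if current is not None and current["lsn"] >= ev["lsn"]:
            continue  # duplicate or stale event, already reflected
        if ev["op"] == "D":
            # Soft delete: keep a tombstone so older events cannot
            # resurrect the row and history is preserved.
            target[ev["pk"]] = {"lsn": ev["lsn"], "row": None, "deleted": True}
        else:  # "I" (insert) or "U" (update)
            target[ev["pk"]] = {"lsn": ev["lsn"], "row": ev["row"], "deleted": False}
    return target
```

Running the same batch twice produces the same state, which is the property that makes retries safe.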
03
EventBridge · Lambda · Glue · S3
Problem — Batch jobs creating hours-long analytics delays

Event-Driven Processing Pipeline

Near-real-time pipeline triggered on S3 uploads via EventBridge. Lambda and Glue handle automated transformations with zero idle compute — cutting end-to-end latency from hours to under 2 minutes on a fully serverless architecture.

<2min end-to-end
$0 idle compute
Serverless architecture
View on GitHub
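The trigger path described above (S3 upload, EventBridge, Lambda, Glue) can be sketched as a handler that turns an S3 "Object Created" event into a Glue job request. This is a hedged sketch, not the project's code: the job name `curate-uploads`, the argument names, and the injected `start_job` callable are assumptions. In a deployed function `start_job` would typically be boto3's Glue `start_job_run`.

```python
def build_glue_request(event, job_name="curate-uploads"):
    """Translate an EventBridge S3 'Object Created' event into Glue job kwargs.

    Returns None for any other event so the handler can ignore it.
    """
    if event.get("source") != "aws.s3":
        return None
    if event.get("detail-type") != "Object Created":
        return None
    detail = event["detail"]
    return {
        "JobName": job_name,  # hypothetical job name
        "Arguments": {
            "--source_bucket": detail["bucket"]["name"],
            "--source_key": detail["object"]["key"],
        },
    }


def handler(event, context, start_job=None):
    """Lambda entry point. `start_job` is injected to keep the sketch testable."""
    request = build_glue_request(event)
    if request is None:
        return {"status": "ignored"}
    if start_job is not None:
        start_job(**request)
    return {"status": "submitted", "request": request}
```

Keeping the event parsing separate from the AWS call lets the routing logic be unit-tested without any cloud resources.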
04
AWS Glue · Lambda · IAM · Lake Formation
Problem — Ad-hoc AWS scripts with no reusability or governance

Cloud-Native AWS Data Platform

Library of production-ready Glue jobs, Lambda functions, and architecture patterns with IAM least-privilege policies and Lake Formation governance. Annotated architecture diagrams for batch and event-driven designs — built for reuse across projects.

IaC for all infra
Governed by Lake Formation
Zero manual deploys
View on GitHub
05
PySpark · Python · SQL · pytest
Problem — One-off Spark jobs with no tests or reuse path

Spark & Python Utility Library

Reusable PySpark transformation modules and Python utilities with unit tests. Covers window functions, advanced aggregations, partitioning strategies, and common ETL patterns — designed as plug-in components for any pipeline.

Unit-test coverage
Modular plug-in design
SQL+Py dual language
View on GitHub
06
Terraform · Docker · Kubernetes · Jenkins
Problem — "Works on my machine" kills production reliability

DevOps & Infrastructure as Code

Complete DevOps layer: Terraform modules for S3, Glue, and IAM; Dockerized PySpark for local dev parity; Jenkins + GitHub Actions CI/CD on every merge; Kubernetes (EKS) orchestration. Production never gets a surprise.

IaC with Terraform
CI/CD with Jenkins + GHA
EKS orchestrated
View on GitHub

The modules that separate a junior builder from a senior platform engineer — quality, governance, performance, and observability as first-class concerns.

07
Data Quality

Data Quality Framework

Schema-agnostic validation enforcing null checks, referential integrity, accepted values, and drift detection — versioned code, not manual checks.

Accepted value validation via accepted_values.sql
Schema drift alerting before data reaches consumers
Pluggable across Glue, Spark, and dbt pipelines
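As an illustration of the checks listed above, the following sketch validates a batch of rows for schema drift, nulls, and accepted values. It is a simplified, engine-agnostic example rather than the framework itself: `run_checks` and its parameters are hypothetical names, and referential integrity is left out for brevity.

```python
def run_checks(rows, expected_columns, not_null, accepted_values):
    """Return a list of (row_index, check_name, detail) failures.

    rows:             list of dicts, one per record
    expected_columns: column names the consumer expects
    not_null:         columns that must not be None
    accepted_values:  mapping of column -> set of allowed values
    """
    failures = []
    expected = set(expected_columns)
    for i, row in enumerate(rows):
        # Schema drift: columns added or removed relative to the contract.
        drift = set(row) ^ expected
        if drift:
            failures.append((i, "schema_drift", sorted(drift)))
        for col in not_null:
            if row.get(col) is None:
                failures.append((i, "null", col))
        for col, allowed in accepted_values.items():
            value = row.get(col)
            if value is not None and value not in allowed:
                failures.append((i, "accepted_values", col))
    return failures
```

A pipeline can fail the run when the returned list is non-empty, which is what turns the checks into a quality gate.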
08
Governance

Healthcare Data Governance

HIPAA-compliant layer with dynamic masking, row-level access control, PHI classification, and audit trails — built from real healthcare platform work.

Dynamic column masking for SSN, DOB, diagnosis codes
Role-based access via Snowflake + Lake Formation
PHI lineage tracking for HIPAA audit readiness
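The masking behaviour described above can be sketched in plain Python. In the platform itself this would be enforced by warehouse-level policies, not application code, so treat the following as an illustration only: the role name `compliance_officer`, the column names, and the masking formats are assumptions.

```python
# Columns treated as PHI in this sketch (hypothetical names).
MASKED_COLUMNS = {"ssn", "dob", "diagnosis_code"}
# Roles allowed to see unmasked values (hypothetical role name).
PRIVILEGED_ROLES = {"compliance_officer"}


def mask_value(column, value):
    """Return a masked representation of a PHI value."""
    if value is None:
        return None
    if column == "ssn":
        return "***-**-" + value[-4:]  # keep last four digits
    if column == "dob":
        return value[:4] + "-**-**"    # keep year only (YYYY-MM-DD input)
    return "****"                      # fully redact anything else


def mask_row(row, role):
    """Apply masking to a row unless the caller's role is privileged."""
    if role in PRIVILEGED_ROLES:
        return dict(row)
    return {
        col: mask_value(col, val) if col in MASKED_COLUMNS else val
        for col, val in row.items()
    }
```

The masking decision depends on the caller's role at read time, so the stored data is never altered.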
09
Performance

Snowflake Query Optimization

Query optimization patterns and warehouse tuning at scale — clustering keys, materialized views, result caching, and right-sizing with cost analysis.

Clustering key selection for partition pruning
Query profiling and bottleneck identification
Cost-per-query analysis and auto-suspend tuning
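Cost-per-query analysis comes down to simple arithmetic on warehouse size and runtime. The sketch below is a rough estimate under stated assumptions: it attributes the whole warehouse to one query, ignores concurrency and minimum billing increments, and takes the price per credit as an input because it varies by contract. The function names are hypothetical.

```python
# Credits consumed per hour by a standard warehouse of each size.
CREDITS_PER_HOUR = {"XSMALL": 1, "SMALL": 2, "MEDIUM": 4, "LARGE": 8, "XLARGE": 16}


def query_cost(size, runtime_seconds, price_per_credit):
    """Estimate the cost of one query that has the warehouse to itself."""
    credits = CREDITS_PER_HOUR[size] * runtime_seconds / 3600
    return credits * price_per_credit


def idle_cost_per_suspend(size, auto_suspend_seconds, price_per_credit):
    """Worst-case cost of the idle period before auto-suspend kicks in."""
    credits = CREDITS_PER_HOUR[size] * auto_suspend_seconds / 3600
    return credits * price_per_credit
```

For example, a 90-second query on a MEDIUM warehouse uses 4 × 90 / 3600 = 0.1 credits. Comparing that figure with the idle cost of a long auto-suspend setting shows where tuning pays off.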
10
Observability

Pipeline Monitoring & SLA

Production observability with SLA tracking, severity-tiered alerting, and runbook docs — because a pipeline without monitoring is a future incident.

SLA breach alerting via CloudWatch & SNS
Dead-letter queue handling for failed records
Severity tiers and escalation paths documented
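Severity-tiered alerting can be sketched as a mapping from how late a pipeline finished to a severity level. The thresholds and tier names below are hypothetical and would come from the documented escalation paths. Publishing the alert (for example to an SNS topic) is left out.

```python
from datetime import datetime, timedelta

# (delay threshold, severity), checked from most to least severe.
SEVERITY_TIERS = [
    (timedelta(hours=4), "SEV1"),
    (timedelta(hours=1), "SEV2"),
    (timedelta(0), "SEV3"),
]


def classify_breach(deadline, finished_at):
    """Return the severity of an SLA breach, or None if the SLA was met."""
    delay = finished_at - deadline
    if delay <= timedelta(0):
        return None
    for threshold, severity in SEVERITY_TIERS:
        if delay > threshold:
            return severity
    return None
```

Keeping the tiers in a single table makes the escalation policy easy to review alongside the runbook.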
Cloud & Infrastructure
AWS (Glue, EMR, Redshift, Lambda): Expert
Snowflake: Expert
Terraform + IaC: Proficient
Docker / Kubernetes (EKS): Proficient
Processing & Orchestration
Apache Spark / PySpark: Expert
Apache Airflow: Expert
CDC (DMS, Debezium): Expert
Kafka / Databricks: Proficient
Languages & Querying
Python: Expert
SQL (CTEs, Window Fns): Expert
dbt: Proficient
Scala: Familiar
Governance & Analytics
HIPAA / GDPR: Proficient
Lake Formation / IAM: Proficient
Power BI / Tableau / QuickSight: Proficient
SageMaker: Exposure
Master of Science
Computer Science
University of the Pacific · Stockton, CA
December 2024
Bachelor of Science
Electronics & Communication Engineering
Vishnu Institute of Technology · Andhra Pradesh, India
May 2019