Pawan Yandapalli — Data Engineer

About

I'm Pawan.

the person behind the pipelines

Bay Area, CA

I've spent the last five-plus years building data infrastructure — first at Amazon, where my pipelines supported millions of transactions a day, and more recently in healthcare, where the data is messier and the stakes are higher.

I like the unglamorous parts of this job: schema contracts, backfills, and data quality checks. Most of this site is about things that broke, because that's where the real engineering happens.

Right now I'm pointing that same discipline at AI infrastructure — RAG pipelines, eval harnesses, feature stores — and looking for a team that takes its data as seriously as its product. If that sounds like yours, say hello.

Experience

Five years of building data systems that hold up in production.

Everything below was designed for bad data and 3am pages, not ideal conditions.

Master’s · University of the Pacific · Microsoft Fabric DE · AWS · LLM & RAG certified

2019 — present

Feb 2025 → Present · 1 yr · current

TouchWorld · Farmington, MI

Data Engineer · Healthcare data platform
AWS DMS/Glue ingestion · S3 zone architecture · SCD policy modeling · late-arriving claims handling

≈88% faster ingestion: 4h → 30min on 500K+ records/day

01 Built AWS DMS and Glue ingestion pipelines unifying policy, claims, and customer data — 500K+ records/day — from disparate policy-administration and claims-management systems into a governed data lake.

02 Implemented Glue ETL validation rules — claim-date-vs-policy-period checks, customer dedup, medical/claim code standardization — fixing data quality issues at the source for underwriting and fraud analytics.

03 Designed a raw/cleansed/curated S3 zone architecture feeding Amazon Redshift, and modeled policy history with Slowly Changing Dimensions (SCD) to support multi-year compliance retention.

04 Built pipelines to handle late-arriving claims across the FNOL, adjudication, settlement, and subrogation lifecycle, supporting downstream fraud-pattern detection.

05 Enforced HIPAA/GDPR-compliant security using IAM, KMS encryption, and Secrets Manager across the ingestion and analytics pipeline.
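The claim-date-vs-policy-period check and customer dedup from item 02 can be sketched in plain Python. This is a minimal illustration, not the Glue job itself: the field names (`claim_date`, `policy_start`, `policy_end`, `customer_id`, `updated_at`) and the "latest row wins" dedup rule are assumptions.

```python
from datetime import date

def claim_in_policy_period(claim: dict) -> bool:
    """True when the claim date falls inside the policy's coverage window."""
    return claim["policy_start"] <= claim["claim_date"] <= claim["policy_end"]

def dedup_customers(rows: list[dict]) -> list[dict]:
    """Keep the most recently updated row per customer_id (assumed rule)."""
    latest: dict[str, dict] = {}
    for row in rows:
        current = latest.get(row["customer_id"])
        if current is None or row["updated_at"] > current["updated_at"]:
            latest[row["customer_id"]] = row
    return list(latest.values())

# Hypothetical sample records
claims = [
    {"claim_id": 1, "claim_date": date(2024, 3, 1),
     "policy_start": date(2024, 1, 1), "policy_end": date(2024, 12, 31)},
    {"claim_id": 2, "claim_date": date(2025, 2, 1),
     "policy_start": date(2024, 1, 1), "policy_end": date(2024, 12, 31)},
]
valid = [c for c in claims if claim_in_policy_period(c)]
rejected = [c for c in claims if not claim_in_policy_period(c)]

customers = [
    {"customer_id": "c1", "updated_at": date(2024, 1, 1), "name": "A. Rao"},
    {"customer_id": "c1", "updated_at": date(2024, 5, 1), "name": "Anita Rao"},
]
unique_customers = dedup_customers(customers)
```

Rejected claims would normally be routed to a quarantine area rather than dropped, so they can be inspected and replayed.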

Ingestion speed

88% faster

Chose S3 zone architecture (raw/cleansed/curated) over direct-to-Redshift loads — isolates data quality issues before they reach analytics. Slowly Changing Dimensions for policy history — supports multi-year compliance retention without duplicating full snapshots.
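One common way to keep policy history without full snapshots is an SCD Type 2 table, where each change closes the current row and opens a new one. The sketch below assumes Type 2 (the page only says "SCD"), and the column names `valid_from`, `valid_to`, and `attrs` are illustrative.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # conventional marker for the current version

def apply_scd2(history: list[dict], incoming: dict, as_of: date) -> list[dict]:
    """Close the current version if attributes changed, then append a new one."""
    current = next(
        (r for r in history
         if r["policy_id"] == incoming["policy_id"] and r["valid_to"] == OPEN_END),
        None,
    )
    if current is not None and current["attrs"] == incoming["attrs"]:
        return history  # nothing changed, so re-running is a no-op
    if current is not None:
        current["valid_to"] = as_of  # close the old version
    history.append({
        "policy_id": incoming["policy_id"],
        "attrs": incoming["attrs"],
        "valid_from": as_of,
        "valid_to": OPEN_END,
    })
    return history

history: list[dict] = []
apply_scd2(history, {"policy_id": 82, "attrs": {"premium": 100}}, date(2024, 1, 1))
apply_scd2(history, {"policy_id": 82, "attrs": {"premium": 120}}, date(2024, 6, 1))
apply_scd2(history, {"policy_id": 82, "attrs": {"premium": 120}}, date(2024, 6, 2))
```

Only rows that actually change create a new version, which is where the storage saving over daily snapshots comes from.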

AWS DMS · AWS Glue · PySpark · Amazon Redshift · Apache Airflow · Delta Lake · Lake Formation · Terraform · EKS · Power BI

Ingestion latency: 4h → 30min
Records/day: 500K+
Compliance: HIPAA / GDPR

Aug 2023 → Dec 2024 · M.S.

University of the Pacific · California, US

M.S. Computer Science

Focused on cloud computing, data engineering, and applied machine learning, with coursework spanning distributed data systems, database optimization, cloud & mobile security, AI in healthcare, and BI storytelling.

Distributed Systems · Machine Learning · Data Systems

Degree: M.S. CS
Where: California

Sep 2020 → Jul 2023 3 yrs

Amazon Hyderabad, India

Software Development Engineer, Data Platforms · AWS data infrastructure

Distributed AWS data infrastructure supporting millions of daily transactions across enterprise operational and analytics systems.

~60% query runtime reduction Redshift distribution-key & sort-key redesign

01 Owned and scaled distributed AWS data infrastructure supporting millions of daily transactions across enterprise operational and analytics systems.

02 Built scalable Apache Spark and AWS EMR distributed processing workflows, integrating 10+ transactional data sources into centralized analytics platforms.

03 Optimized Amazon Redshift performance through distribution-key and sort-key redesign — approximately 60% reduction in analytical query runtime.

04 Automated infrastructure provisioning and orchestration using Jenkins, Terraform, Lambda, and Step Functions to support production-grade deployments.

Python · Spark · EMR · Redshift · Lambda · Step Functions · Jenkins · Terraform

Query speedup: ~60%
Data sources: 10+
Scale: Millions of daily transactions

Aug 2019 → Sep 2020 1 yr

GSPANN Hyderabad, India

Data Analyst · Retail data integration (Client: Kohl's)

Analyzed retail sales, inventory, customer, and shipment data across BigQuery and Redshift — ingested by the team's data engineering pipelines — for the Kohl's account.

Multi-cloud · retail analytics BigQuery + Redshift reporting for Kohl's

01 Designed and optimized SQL-based analytical datasets and reporting queries across BigQuery and Redshift, improving accuracy and performance of business reporting.

02 Performed data profiling, cleansing, and validation on structured and semi-structured datasets (CSV, JSON, TXT, GZIP), improving reporting accuracy and consistency.

03 Partnered with data engineers to translate business reporting requirements into validated, analysis-ready datasets, supporting store-level performance visibility.

SQL · Python · Google BigQuery · Amazon Redshift

Client: Kohl's
Platforms: BigQuery + Redshift
Focus: Analytics & Reporting

Projects

Selected work.

Each writeup covers the problem, the constraints, the tradeoffs, and what broke along the way. Flagship systems first; the rest build context.

case studies & modules

★ Flagship · CS · 01/Healthcare · HIPAA · production

Healthcare claims data platform.

Problem Insurance claims arriving from disparate sources with no governance, no quality gates, and no ML-ready surface for downstream teams.

Why It Mattered Nightly full-load batch meant stale data was driving underwriting and fraud decisions. A 4-hour signal is useful; a next-morning one often isn't — and claims corrections arrived out-of-order, so ingestion had to be replay-safe.

Constraints HIPAA PHI boundary · sub-hour SLA · 24/7 OLTP source that can’t be locked or scanned during business hours.

Solution Dual batch + CDC ingestion through AWS Glue & PySpark; HIPAA-compliant masking via Lake Formation; curated Redshift zones (raw/cleansed/curated) with automated quality gates.

Impact 500K+ records/day; latency cut from 4h → 30min; supports underwriting, fraud, and claims analytics.

Trade-off CDC via AWS DMS over Debezium — less control over the binlog reader, but ops cost is near-zero and HIPAA boundaries are easier to argue.

AWS Glue · PySpark · Amazon Redshift · Airflow · Lake Formation

View on GitHub

claims-platform · CDC sequence · last 30s · LIVE

POSTGRES OLTP → AWS DMS · CDC → REDSHIFT · TARGET

INSERT 0ms (claim_id=4719) → CAPTURE 3ms (WAL → S3) → UPSERT 8ms (claims.fact)
UPDATE 12ms (policy_id=82) → STREAM 15ms (parquet batch) → MERGE 20ms (policies.dim)
DELETE 28ms (claim_id=4682) → TOMBSTONE 31ms (soft-delete tag) → SOFT-DEL 43ms (claims.fact)

cdc event processor · Python

    # cdc event processor: idempotent
    for event in cdc_stream:
        if event.op in ("INSERT", "UPDATE"):
            # inserts and updates are both applied as upserts
            target.upsert(event.after)
        elif event.op == "DELETE":
            target.soft_delete(event.before)
        metrics.record(latency_ms=43)
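Replay safety with out-of-order corrections usually needs a per-key ordering value, so a late or repeated event cannot overwrite newer state. The sketch below is one way to do that under stated assumptions: `seq` is a hypothetical monotonically increasing sequence per key (for example a log position), and the in-memory `target` dict stands in for the warehouse table.

```python
def apply_event(target: dict, event: dict) -> bool:
    """Apply one CDC event; return False when it is stale or a replay."""
    key = event["key"]
    existing = target.get(key)
    if existing is not None and existing["seq"] >= event["seq"]:
        return False  # already applied, or an older out-of-order event
    if event["op"] == "DELETE":
        # soft delete: keep a tombstone so a late update cannot resurrect the row
        target[key] = {"seq": event["seq"], "deleted": True, "row": None}
    else:  # INSERT and UPDATE are both upserts
        target[key] = {"seq": event["seq"], "deleted": False, "row": event["after"]}
    return True

target: dict = {}
events = [
    {"op": "INSERT", "key": 4719, "seq": 1, "after": {"status": "open"}},
    {"op": "UPDATE", "key": 4719, "seq": 3, "after": {"status": "settled"}},
    {"op": "UPDATE", "key": 4719, "seq": 2, "after": {"status": "adjudicated"}},  # arrives late
]
results = [apply_event(target, e) for e in events]
replayed = [apply_event(target, e) for e in events]  # a full replay changes nothing
```

Because the comparison is on `seq` rather than arrival time, replaying the whole batch leaves the target unchanged.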

CDC APPLY LATENCY

Records / day: 500K+
End-to-end latency: 4h → 30min (▼ 88% improvement)
DLQ errors: 0
Orchestration: Airflow
Compliance: HIPAA by design
Data zones: 3 (raw · cleansed · curated)

★ Flagship · CS · 02/Personal data platform · live

Health OS — biometric data platform.

Problem Apple Health exports — no idempotency, no lineage, no way to trust derived metrics. A re-sent day double-counted; a schema change silently corrupted a trend.

Constraints Personal stack · zero-ops budget · scale-to-zero · public, cheap reads · schema must survive an iOS upgrade.

Solution A Supabase Edge Function (Deno) ingests Apple Health exports; the raw / clean / derived medallion lives entirely in Postgres — a date-keyed raw table, a range-flagging clean view, and a derived view of window functions that compute rolling 7-day trends plus recovery & momentum scores. Idempotent upsert on date, non-destructive validation, and a public, RLS-guarded read of the derived view.

Impact Sub-minute end-to-end latency on a live biometric stream; re-POSTing identical data is a provable no-op; the entire derived layer is rebuildable from raw — enabling same-day recovery decisions instead of trusting a black box.

Trade-off All derivation lives in Postgres views & window functions instead of app code — heavier SQL, but the UI computes nothing and every metric is reproducible from raw. Idempotency is keyed on date (one row per day) rather than a payload hash: simpler and partial-day friendly, at the cost of sub-daily granularity.
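The three ideas above (upsert keyed on date, flag-don't-reject validation, and a rolling 7-day trend) can be shown in a few lines of Python. In the project these live in Postgres; this is only a sketch of the logic, and the metric name `resting_hr` and its plausibility range are assumptions.

```python
from statistics import mean

RANGES = {"resting_hr": (30, 120)}  # hypothetical plausibility bounds

def upsert_day(store: dict, row: dict) -> None:
    """One row per date: re-sending a day overwrites it in place."""
    store[row["date"]] = row

def quality_flags(row: dict) -> list[str]:
    """Flag out-of-range values instead of rejecting the row."""
    flags = []
    for metric, (low, high) in RANGES.items():
        value = row.get(metric)
        if value is not None and not low <= value <= high:
            flags.append(f"{metric}_out_of_range")
    return flags

def rolling_7d(store: dict, metric: str) -> dict:
    """Trailing 7-row average per date, like AVG() OVER (ROWS 6 PRECEDING)."""
    dates = sorted(store)  # ISO date strings sort chronologically
    out = {}
    for i, d in enumerate(dates):
        window = [store[x][metric] for x in dates[max(0, i - 6): i + 1]]
        out[d] = mean(window)
    return out

store: dict = {}
for day, hr in [("2025-01-01", 50), ("2025-01-02", 54), ("2025-01-02", 54)]:
    upsert_day(store, {"date": day, "resting_hr": hr})  # third call is a re-send
trend = rolling_7d(store, "resting_hr")
flags = quality_flags({"date": "2025-01-03", "resting_hr": 300})
```

The re-sent day leaves the store with two rows, which is the no-op behaviour the date key is meant to guarantee.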

Supabase Postgres · Edge Functions (Deno) · Row-Level Security · Window functions · Medallion

Open the live app


health-os · architecture · LIVE

HealthOS · Biometric Platform · Healthy

01 Capture
Apple Watch (sensors) → Health Auto Export (JSON) · 17 metric types
ADR-001 · Export over API: Full-fidelity export beats rate-limited APIs for a personal data platform.
→ POST /ingest · per export · x-ingest-key

02 Ingest
Edge Function · Deno · idempotent ingestion · upsert on date · range-flag · never reject · scale-to-zero · cold-safe
ADR-002 · date is the key: Re-POST is a no-op because date is the primary key, so an upsert overwrites in place.
→ validated write · <60s queryable · Supabase · Postgres

03 Medallion store
RAW (health_daily table) → CLEAN (view · range-flagged) → DERIVED (view · window fns) · medallion pattern
ADR-003 · 3 layers, not 1: Raw preserved forever, so every derived metric can be rebuilt from source.
→ computed views · 17 metrics · HR · HRV · sleep · load

04 Serve
Dashboard · public read · freshness · quality flags
ADR-004 · RLS read path: Read-only data is open. RLS allows anon select on the derived view; writes require auth.
DEMO → LIVE

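    // idempotent ingestion: Supabase Edge Function (Deno)
    Deno.serve(async (req) => {
      if (req.headers.get("x-ingest-key") !== KEY) {
        // handlers must return a Response, not a bare status code
        return new Response("unauthorized", { status: 401 });
      }
      const rows = byDate(await req.json()); // Apple Health → 1 row/day
      const { error } = await sb
        .from("health_daily") // date is PK → re-POST overwrites
        .upsert(rows, { onConflict: "date" });
      return new Response(error ? error.message : "ok", { status: error ? 500 : 200 });
    });
    // clean & derived layers are Postgres views (window fns)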

Failure & recovery

health export → upsert on date → range check → quality_flags[] → backfill re-run → history rebuilt

What can fail: Duplicate uploads · Export schema drift · Out-of-range noise · Late / partial days

How it recovers: Upsert on date → no-op · Raw stored, replayable · Flag, never reject · Partial UPSERT on date

Observability · LIVE

Pipeline: Healthy
Ingest latency: <60 s
Data quality: 100%

CS · 03/Serverless · event-driven

Event-driven processing pipeline.

Problem

Hourly batch jobs were burning idle compute and pushing end-to-end latency past 4 hours for time-sensitive analytics.

Constraints

Bursty load · scale-to-zero budget · no idle pay · must survive single-AZ outages.

Solution

Serverless pipeline triggered by S3 object-created events routed through EventBridge, processed via Lambda and Glue with scale-to-zero compute and a DLQ for failed records.

Impact

End-to-end latency dropped from hours to under 2 minutes; $0 idle compute; auto-scaling under bursty load.

Trade-off

Lambda’s 15-min ceiling means anything bigger spills to Glue — happy with the seam, but it pushes complexity to the orchestration layer.
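The Lambda/Glue seam described above comes down to a routing decision made before any compute starts. The sketch below shows one possible rule: estimate runtime from object size and send anything that might approach the 15-minute ceiling to Glue. The throughput figure, the 50% safety margin, and the function names are assumptions for illustration, not the project's actual orchestration code.

```python
from typing import Optional

LAMBDA_TIMEOUT_S = 15 * 60  # Lambda's hard execution ceiling

def choose_compute(size_bytes: int, est_bytes_per_s: float, safety: float = 0.5) -> str:
    """Route to Lambda only when the estimated runtime fits well under the ceiling."""
    est_runtime_s = size_bytes / est_bytes_per_s
    return "lambda" if est_runtime_s <= LAMBDA_TIMEOUT_S * safety else "glue"

def route(event: dict, dlq: list) -> Optional[str]:
    """Pick a target for one object-created event; park malformed events in the DLQ."""
    try:
        size = event["detail"]["object"]["size"]
    except (KeyError, TypeError):
        dlq.append(event)  # keep the raw event so it can be replayed later
        return None
    # 5 MB/s is a made-up throughput; measure the real normalizer before relying on it
    return choose_compute(size, est_bytes_per_s=5_000_000)

dlq: list = []
small = route({"detail": {"object": {"key": "a.json", "size": 10_000_000}}}, dlq)
large = route({"detail": {"object": {"key": "b.parquet", "size": 50_000_000_000}}}, dlq)
broken = route({"detail": {}}, dlq)
```

Deciding up front keeps a job from starting on Lambda and timing out halfway, which would otherwise push retries into the DLQ path.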

AWS S3 · EventBridge · Lambda · Glue · SQS DLQ

View on GitHub

event-pipeline · topology · us-east-1 · LIVE

INGESTION

source-bucket · S3 · Raw events (JSON / Parquet) · us-east-1
eventbridge-rule · EventBridge · S3 Object Created (prefix filter) · healthy
raw-events · S3 · Landing zone (immutable) · us-east-1

events.rule · event pattern · healthy

    {
      "source": ["aws.s3"],
      "detail-type": ["Object Created"],
      "detail": {
        "bucket": { "name": ["source-bucket"] },
        "object": { "key": [{ "prefix": "inco/" }] }
      }
    }

ROUTE / COMPUTE

ingest.filter · CloudWatch · Ingest count & metrics · healthy
normalize.fn · Lambda · 128 MB · 12s p99 · Event normalization · healthy
enrich.job · Glue (2.2X) · Scale-to-zero enrichment · healthy
dlq.failed · SQS DLQ · Replay-ready failed records · healthy

PERSIST / ANALYTICS

curated-data · S3 · Parquet / Snappy curated data · us-east-1
events.fact · Snowflake · Clustered tables (event_id) · healthy
dashboards · QuickSight · Operational dashboards · healthy
consumers · Apps / ML / BI · Downstream consumers

Legend: event flow (data / trigger) · async / error flow (failure / retry) · data formats JSON, Parquet · region us-east-1

CS · 04/DevOps · IaC

Terraform data platform (IaC).

Problem Manual deploys across multiple environments create drift, hot-fixes, and inconsistent state between dev, staging, and prod.

Constraints Four environments · multi-team git workflow · zero-trust between dev / stage / prod · every change auditable.

Solution Terraform modules for S3, Glue, IAM; dockerized PySpark for local dev parity; Jenkins + GitHub Actions CI/CD on every merge; EKS for orchestration.

Impact Pytest coverage on every transform. Every environment reproducible from git.

Trade-off Terraform monorepo over per-service repos — faster onboarding, but a noisy terraform plan diff on every PR. Mitigated with path-scoped CI jobs.
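Path-scoped CI, as mentioned in the trade-off, means planning only the modules a PR actually touches. A minimal sketch of that mapping follows; the `terraform/modules/<name>/` layout matches the path shown on this page, while the function name and the rule "one plan per touched module directory" are assumptions.

```python
from pathlib import PurePosixPath

def affected_modules(changed_files: list[str], root: str = "terraform/modules") -> set[str]:
    """Map changed paths to the Terraform module directories that need a plan."""
    modules = set()
    root_parts = PurePosixPath(root).parts
    for name in changed_files:
        parts = PurePosixPath(name).parts
        # the file must sit inside <root>/<module>/...
        if parts[: len(root_parts)] == root_parts and len(parts) > len(root_parts) + 1:
            modules.add(parts[len(root_parts)])
    return modules

changed = [
    "terraform/modules/glue_job/main.tf",
    "terraform/modules/glue_job/variables.tf",
    "transforms/claims.py",
]
to_plan = affected_modules(changed)
```

A CI job could feed this the output of a changed-files step and run `terraform plan` once per returned module, which keeps unrelated diffs out of the PR.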

Terraform · Docker · EKS · Jenkins · GitHub Actions · pytest

View on GitHub

deploy.yml · run #1248 · main

commit a4f7c12 · feat(glue): scale workers 6.2X

PASSED · 47s

$ terraform init 1.2s

$ terraform validate 0.8s

$ pytest -q transforms/ — 84 passed 6.4s

$ terraform plan -out=tfplan 3.4s

$ terraform apply tfplan — 3 to add, 1 to change 12.7s

$ smoke: glue_job.invoke(claims-etl) 22.1s

terraform/modules/glue_job/main.tf

    resource "aws_glue_job" "pipeline" {
      name              = "${var.env}-claims-etl"
      role_arn          = aws_iam_role.glue.arn
      number_of_workers = 10 # max_capacity cannot be combined with worker_type
      worker_type       = "G.2X"
      timeout           = 2880
      # command block (script location) omitted here
    }

Coverage: 100% IaC (Terraform + Pytest on every change)
Testing: Pytest on every transform
Environments: 4 (dev · stage · prod · sandbox)
Orchestration: EKS (containerized execution)

CS · 05/AI infrastructure · RAG

Enterprise RAG pipeline — reliable document retrieval at inference time.

Problem Enterprise document corpora are operationally dark — no natural language access, no source attribution, no auditability. Policy docs, governance files, and SOPs exist but can't be queried safely at inference time.

Architecture Chunking → OpenAI embeddings → pgvector storage → lexical rerank top-5 → GPT-4o (t=0). Multi-tenant metadata filtering. FastAPI served.

Tradeoffs pgvector over a managed vector DB — keeps the stack in Postgres, avoids a separate service to operate for this scale. Lexical reranking over Cohere to avoid API dependency. temperature=0 for deterministic, auditable answers.
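A lexical reranker can be as simple as scoring each candidate by how many question tokens it contains. The sketch below is one such scorer under stated assumptions: the tokenizer, the overlap ratio, and the sample chunks are illustrative and not taken from the module.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def lexical_rerank(question: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Order candidates by the share of question tokens they contain."""
    q = tokens(question)

    def score(chunk: str) -> float:
        return len(q & tokens(chunk)) / len(q) if q else 0.0

    return sorted(candidates, key=score, reverse=True)[:top_k]

# Hypothetical chunks returned by the ANN search
chunks = [
    "Vacation requests need manager approval.",
    "PHI must be masked before export.",
    "Export jobs run nightly.",
]
best = lexical_rerank("How is PHI masked before export?", chunks, top_k=2)
```

This runs locally with no network call, which is the property the trade-off above is buying; the cost is that it cannot match paraphrases the way a learned reranker can.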

Reliability Idempotent ingestion prevents duplicate chunks across re-runs. Multi-tenant isolation via metadata filters — no data crosses tenant boundaries at the retrieval layer. Supports air-gap VPC deployment by swapping OpenAI for self-hosted sentence-transformers.
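Idempotent chunk ingestion is often done by deriving the chunk ID from its content, so a re-run maps to the same key instead of inserting a duplicate. The sketch below assumes SHA-256 content hashing and an in-memory dict in place of the pgvector table; the helper names are hypothetical.

```python
import hashlib

def chunk_id(tenant_id: str, doc_id: str, text: str) -> str:
    """Deterministic ID: the same tenant, document, and text always hash the same."""
    payload = "\x1f".join([tenant_id, doc_id, text]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def ingest(store: dict, tenant_id: str, doc_id: str, chunks: list[str]) -> int:
    """Upsert chunks by ID; return how many were new."""
    added = 0
    for text in chunks:
        cid = chunk_id(tenant_id, doc_id, text)
        if cid not in store:
            added += 1
        store[cid] = {"tenant_id": tenant_id, "doc_id": doc_id, "text": text}
    return added

def tenant_chunks(store: dict, tenant_id: str) -> list[str]:
    """Retrieval-side filter: only rows tagged with the caller's tenant."""
    return [r["text"] for r in store.values() if r["tenant_id"] == tenant_id]

store: dict = {}
first = ingest(store, "acme", "sop-1", ["Step one.", "Step two."])
second = ingest(store, "acme", "sop-1", ["Step one.", "Step two."])  # re-run
ingest(store, "globex", "sop-9", ["Other tenant text."])
acme_only = tenant_chunks(store, "acme")
```

Including the tenant in the hash means identical text in two tenants still produces two separate rows, so the metadata filter never has to share a chunk across boundaries.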

Impact Self-directed learning project — see the GitHub module for benchmark methodology and results.

OpenAI API · pgvector · LangChain · FastAPI · Docker · Python 3.11 · pytest

View module →

rag_pipeline · query flow

Source docs → (parse) → Semantic chunk → (encode) → Embed 3072d → (upsert) → pgvector ANN search → (top-20) → Rerank top-5 → (top-5) → GPT-4o t=0

    # src/retrieval.py: query pipeline
    def query(question: str, tenant_id: str) -> dict:
        q_vec = embed(question)  # 3072-dim
        hits = index.query(
            vector=q_vec,
            top_k=20,
            filter={"tenant_id": tenant_id},
        )
        ctx = rerank(question, hits.matches)[:5]
        return gpt4o_generate(question, ctx)

LAST QUERY TRACE

embed 42ms · ann top-20 183ms · rerank →5 155ms · generate 380ms · TOTAL 760ms ✓ within SLA

p50 latency: 380ms (query end-to-end)
p95 latency: 720ms (tail latency)
Context match: 98.2% (semantic confidence)
Embed latency: 42ms (text-embedding-3-large)

CS · 06/AI infrastructure · evaluation

LLM evaluation harness — production readiness gating for AI systems.

Problem Teams deploy LLMs without objective evidence they work. Hallucination, faithfulness drift, and latency go undetected until production failures.

Architecture Eval cases (question + context + expected) → model under test → dual-path scoring: LLM-as-judge (GPT-4o) for faithfulness, hallucination, and relevance + lexical groundedness check (zero API cost, microsecond latency) as first-pass filter → EvalSummary with pass rate, p95 latency, total cost.

Tradeoffs Separated faithfulness (continuous) from hallucination (boolean) — distinct failure modes requiring distinct interventions. temperature=0 for both model and judge ensures reproducible evals across runs. Judge-contestant separation (GPT-4o evaluating GPT-4o-mini) avoids self-evaluation bias.
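The lexical groundedness check and the summary roll-up can be sketched with dataclasses. This is an assumption-laden illustration: the field names, the token-overlap definition of groundedness, and the nearest-rank p95 are choices made here, and the sample results are synthetic rather than the module's benchmark numbers.

```python
import math
import re
from dataclasses import dataclass

@dataclass
class EvalResult:
    case_id: str
    passed: bool
    latency_ms: float
    cost_usd: float

@dataclass
class EvalSummary:
    pass_rate: float
    p95_latency_ms: float
    total_cost_usd: float

def groundedness(answer: str, context: str) -> float:
    """Share of answer tokens that also appear in the context (no API call)."""
    a = re.findall(r"[a-z0-9]+", answer.lower())
    c = set(re.findall(r"[a-z0-9]+", context.lower()))
    return sum(t in c for t in a) / len(a) if a else 0.0

def summarize(results: list[EvalResult]) -> EvalSummary:
    """Roll per-case results up into pass rate, p95 latency, and total cost."""
    latencies = sorted(r.latency_ms for r in results)
    rank = math.ceil(0.95 * len(latencies))  # nearest-rank percentile
    return EvalSummary(
        pass_rate=sum(r.passed for r in results) / len(results),
        p95_latency_ms=latencies[rank - 1],
        total_cost_usd=sum(r.cost_usd for r in results),
    )

# Synthetic results: eight cases, the last one failing
results = [EvalResult(f"gov-{i:03d}", i != 8, 300.0 + 10 * i, 0.0001) for i in range(1, 9)]
summary = summarize(results)
score = groundedness("claims are masked", "All claims are masked at rest")
```

Running the cheap lexical check first lets obviously ungrounded answers fail before any judge call is paid for.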

Impact gpt-4o: 100% pass, 0% hallucination, faithfulness 0.962 · gpt-4o-mini: 87.5% pass, 12.5% hallucination, 2× faster, 80× cheaper — clear cost/quality matrix for deployment decisions at the model selection layer.

OpenAI API · LLM-as-judge · Python 3.11 · pytest · dataclasses

View module →

llm_eval · model comparison output

    # evals/governance_evals.py: sample output
    Running eval: 8 cases | model=gpt-4o-mini | judge=gpt-4o
    [1/8] gov-001... PASS | faith=0.95 | 342ms | $0.0001
    [3/8] gov-003... PASS | faith=0.88 | 401ms | $0.0001
    ...

    MODEL COMPARISON
    Metric               gpt-4o-mini   gpt-4o
    pass_rate            0.875         1.000
    avg_faithfulness     0.891         0.962
    hallucination_rate   0.125         0.000
    avg_latency_ms       354.0         687.0
    cost_per_query_usd   0.0001        0.008

EVAL SCORECARD · 8 CASES

pass_rate: mini 87.5% · 4o 100%
avg_faithfulness: mini 0.891 · 4o 0.962
hallucination_rate: mini 12.5% · 4o 0%
avg_latency_ms: mini 354ms · 4o 687ms
cost_per_query_usd: mini $0.0001 · 4o $0.008
cost vs quality: mini 80× cheaper · 4o zero hallucinations

Pass rate: 100% (gpt-4o)
Hallucination: 0% (gpt-4o)
Faithfulness: 0.962 (avg score)
Cost / query: $0.008 (gpt-4o judge)

CS · 07/AI infrastructure · feature serving

Real-time feature store — streaming infrastructure for low-latency ML inference.

Problem Fraud detection needs sub-10ms feature serving and 90-day point-in-time correct training data. Most teams solve one. This solves both without skew.

Architecture Kafka → Redis (online, 1hr TTL, atomic HINCRBY, 100-entity batch) for inference; Kafka → Snowflake → dbt → Airflow (6h) for training. FastAPI serves both. Zero read/write contention.

Tradeoffs Redis over DynamoDB for sub-ms hash ops and pipeline batching — traded for single-region availability. 1hr TTL over manual expiry, so stale features self-expire. Point-in-time SQL (event_timestamp ≤ label_timestamp) prevents data leakage — the most common feature-engineering mistake.
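The point-in-time rule (event_timestamp ≤ label_timestamp) means each training row may only see the latest feature value known at label time. A small sketch of that lookup, using integer timestamps and a hypothetical `txn_count_1h` feature:

```python
from bisect import bisect_right

def point_in_time_value(feature_rows: list[tuple[int, float]], label_ts: int):
    """Latest feature value with event_timestamp <= label_timestamp, else None."""
    timestamps = [ts for ts, _ in feature_rows]  # rows must be sorted by timestamp
    idx = bisect_right(timestamps, label_ts)
    return feature_rows[idx - 1][1] if idx else None

# (event_timestamp, txn_count_1h) for one entity, sorted ascending
features = [(100, 1.0), (200, 4.0), (300, 9.0)]
training_value = point_in_time_value(features, label_ts=250)
before_any = point_in_time_value(features, label_ts=50)
```

A naive "latest value" join would have returned 9.0 for the label at 250, which is exactly the leakage the rule prevents.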

Impact p50 read 2ms · p99 8ms · 100 entities batched in 6ms · 6h offline refresh. Point-in-time joins keep future data out of training sets, so offline features reflect only what was knowable at prediction time.

Kafka · Redis · FastAPI · Snowflake · dbt · Airflow · Docker

View module →

Feature store · Dual-path architecture · Live system

WRITE PATH (STREAMING) · streams in real time

Source: Kafka events · user-events · 12.4K msg/s · ~25ms
Processor: Feature Processor · Python · FastAPI · 2.1K/s · ~18ms
Orchestrator: dbt + Airflow · 7d · 3d updates · 1.2K rows/s · ~2m

ONLINE <10MS · OFFLINE 6H REFRESH

SERVE PATH (INFERENCE) · low latency · high availability

Online store: Redis Online · 1hr TTL · p99 8ms · healthy
Offline store: Snowflake Offline · 90d · point-in-time · healthy
Model API: FastAPI · /predict · healthy

Online P50: 2ms (median latency)
Online P99: 8ms (tail latency)
Batch 100: 6ms (job duration)
Refresh: 6h (full refresh)

Engineering modules

How the pipelines stay trustworthy.

Quality, governance, performance and observability, each treated as a system of its own rather than an afterthought.

04 · modules

07 Data quality

Data quality framework

Schema-agnostic validation enforcing null checks, referential integrity, accepted values, and drift detection — versioned code, not manual spot-checks.

Accepted-value validation via accepted_values.sql

Schema-drift alerts before data reaches consumers

Pluggable across Glue, Spark, and dbt pipelines
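Two of the checks listed above, schema drift and accepted values, reduce to small set comparisons. The sketch below is a plain-Python illustration under assumed names; it does not use the Great Expectations API, and the column names and types are made up.

```python
def schema_drift(expected: dict[str, str], actual: dict[str, str]) -> dict[str, list[str]]:
    """Compare column -> type maps and report what changed."""
    return {
        "missing": sorted(set(expected) - set(actual)),
        "unexpected": sorted(set(actual) - set(expected)),
        "type_changed": sorted(
            col for col in set(expected) & set(actual) if expected[col] != actual[col]
        ),
    }

def accepted_values(rows: list[dict], column: str, allowed: set) -> list[dict]:
    """Return the rows that violate the accepted-values rule."""
    return [r for r in rows if r.get(column) not in allowed]

expected = {"claim_id": "bigint", "status": "varchar", "amount": "decimal"}
actual = {"claim_id": "bigint", "status": "int", "adjuster": "varchar"}
drift = schema_drift(expected, actual)
bad = accepted_values([{"status": "open"}, {"status": "???"}], "status", {"open", "closed"})
```

Any non-empty list in `drift` is a reason to alert before the batch moves downstream.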

Great Expectations · SQL

View module →

08 Governance

Healthcare data governance

HIPAA-compliant layer with dynamic column masking, row-level access control, PHI classification, and audit trails — built from real platform work at TouchWorld.

Dynamic masking for SSN, DOB, diagnosis codes

Role-based access via Redshift + Lake Formation

PHI lineage tracking for HIPAA audit readiness

Amazon Redshift · Lake Formation

View module →

09 Performance

Snowflake query optimization

Optimization patterns and warehouse tuning at scale — clustering keys, materialized views, result caching, and cost-aware right-sizing. Self-directed practice applying the same tuning discipline used on Redshift at Amazon.

Clustering-key selection for partition pruning

Query profiling and bottleneck identification

Cost-per-query analysis & auto-suspend tuning

Snowflake · SQL

View module →

10 Observability

Pipeline monitoring & SLA

Production observability with SLA tracking, severity-tiered alerting, and runbook docs — because a pipeline without monitoring is a future incident.

SLA-breach alerting via CloudWatch & SNS

Dead-letter queue handling for failed records

Severity tiers & documented escalation paths

CloudWatch · SNS · Airflow

View module →

Collaboration & practices

Building systems others can trust.

How I work with the teams around a pipeline: shared contracts, clear tradeoffs, and documentation that survives me leaving the room.

📋

Data Contracts

Partnered with analysts, software engineers, and business stakeholders to define shared data contracts, reducing ambiguity between producers and consumers of the same pipeline.

🏗️

Architecture Documentation

Write architecture decisions the way I'd want to inherit them — constraints, tradeoffs, and the reasoning behind the call — so the next engineer isn't guessing.

🎓

Self-Directed Growth

Built production-style RAG, LLM evaluation, and feature-store projects outside of work to understand how data engineering patterns extend into applied AI systems.

⚡

Data Quality Ownership

Implemented validation rules at the source — claim-date checks, deduplication, code standardization — catching data quality issues before they reach downstream models.

Architecture Decision Records

Why I picked these tools.

Written up the way I'd write an ADR at work: constraints first, then the tradeoff, then the call.

decisions → rationale

Design Decision

Why a raw/cleansed/curated S3 zone architecture?

Pros

Isolates data quality issues before they reach analytics

Raw zone preserves full reprocessing capability

Clear ownership boundary per zone for governance

Curated zone stays lean for fast Redshift loads

Cons

More storage and orchestration than a single-layer lake

Requires clear contracts on what moves between zones

Verdict: For claims and policy data feeding underwriting and fraud models, catching bad data early was worth the extra storage and pipeline stages.

Design Decision

Why Amazon Redshift for the warehouse?

Pros

Native fit with the rest of the AWS stack (Glue, DMS, S3, Lake Formation)

Distribution-key / sort-key tuning is a known lever for this team's workloads

IAM-based access control simplifies the HIPAA boundary

No separate vendor relationship or data-sharing agreement to manage

Cons

Manual distribution/sort-key tuning vs. automatic optimization elsewhere

Cluster resize is heavier than elastic-scaling alternatives

Verdict: Staying inside the existing AWS/IAM boundary simplified the HIPAA compliance story more than any warehouse-specific feature would have.

Design Decision

Why Redis over DynamoDB?

Redis wins

Sub-ms HINCRBY vs 10–20ms DynamoDB

Pipeline batching: 100 entities in 6ms

TTL auto-expiry — zero manual cleanup

Atomic rolling counts, no contention

DynamoDB edge

Fully managed — no node ops

Durable by default; Redis is memory-bound

Verdict: DynamoDB benchmarked at 10–20ms per request at volume. Redis pipeline batching at 6ms for 100 entities made the choice clear.

Signature architecture

Real-Time Data Platform

The pipeline that unifies policy, claims, and customer data for underwriting and fraud analytics. Handles 500K+ records/day with idempotent, replay-safe CDC ingestion. This is the diagram you remember.

Architecture

PROD · Real-Time Data Platform · aws · us-east-1 · multi-AZ

DWG SYS-001 · REV 9 · Operational

Records / day: 500K+
Ingestion latency: <30min
Latency improvement: 88%
Query speedup: ~60%

01 Source
PostgreSQL (WAL) · MySQL (binlog) · MongoDB (oplog) · policy admin + claims systems
ADR-001 · CDC over batch: Log-based capture via AWS DMS avoids full reloads and cut ingestion latency from 4+ hours to under 30 minutes.
→ AWS DMS · CDC · incremental capture · 4h → 30min latency cut

02 Transport
AWS DMS + Glue · CDC ingestion · 500K+ records/day · raw · cleansed · curated · 4h → 30min latency
ADR-002 · S3 zone architecture: Raw/cleansed/curated zones isolate data quality issues before they reach underwriting and fraud analytics.
→ Glue ETL · validation & standardization

03 Processing
PySpark · transformation · Delta Lake · claim-date validation · dedup · Airflow-orchestrated
ADR-003 · Validate at the source: Claim-date-vs-policy-period checks and code standardization run before data reaches the curated zone.
→ load · curated zone → warehouse

04 Warehouse
Amazon Redshift · analytics · SCD policy history
RAW (landed CDC) → CLEAN (validated) → CURATED (business-ready)
→ serve · 4 consumers · SLA-tiered reads

05 Consumers
BI / Dashboards · Feature Store · RAG / LLM · Serving APIs
One source, many reads: The same modeled tables feed BI, ML features and retrieval, with no divergent copies.

Platform services: IAM (access control) · Glue (catalog) · Great Expectations (quality) · Schema Registry (contracts) · CloudWatch (monitoring) · TLS + masking (PII security) · Terraform (IaC)

Failure & recovery

DMS CDC task → Glue ETL job → DLQ error zone (S3) → alert + page → reprocess from raw zone → backfill verified

What can fail: CDC task failures · Schema incompatibility · Transformation errors · Downstream outages

How it recovers: Retry with exponential backoff · DLQ for inspection · Automated alerts · Backfill after fix
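The first two recovery steps, retry with exponential backoff and then a dead-letter queue, fit in one small wrapper. This is a generic sketch rather than the platform's code: the attempt count, base delay, and the injectable `sleep` parameter (used so the example runs without waiting) are assumptions.

```python
import time

def run_with_retry(task, record, dlq: list, max_attempts: int = 4,
                   base_delay_s: float = 1.0, sleep=time.sleep):
    """Retry a task with exponential backoff; park the record in the DLQ on failure."""
    for attempt in range(max_attempts):
        try:
            return task(record)
        except Exception as exc:  # broad on purpose for the sketch
            if attempt == max_attempts - 1:
                dlq.append({"record": record, "error": repr(exc)})
                return None
            sleep(base_delay_s * 2 ** attempt)  # 1s, 2s, 4s, ...

delays: list[float] = []
dlq: list[dict] = []
calls = {"n": 0}

def flaky(record):
    """Fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient")
    return record["id"]

def always_fails(record):
    raise ValueError("bad schema")

ok = run_with_retry(flaky, {"id": 7}, dlq, sleep=delays.append)
failed = run_with_retry(always_fails, {"id": 8}, dlq, sleep=lambda s: None)
```

Records that land in the DLQ keep their original payload and error, so a backfill after the fix can replay them unchanged.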

Observability

Records / day: 500K+
End-to-end latency: <30 min
Latency improvement: 88%
Query speedup: ~60%

Technologies & tools

Production depth across the data lifecycle.

Legend: Core competency · Proficient

9 skill domains · 40+ technologies · 12 AWS services · 2× Databricks certified

02 Data Engineering · 5 core · 6 proficient

Core competency: Apache Spark · ETL / ELT · CDC · Apache Airflow
Proficient: dbt · Data Modeling (OLTP / OLAP) · Data Warehousing · Data Contracts · Data Observability · Distributed Processing

Status: operational · region us-east-1 · records/day 500K+ · latency <30min · compliance HIPAA/GDPR

Operating principles

How I work.

08 · principles

01

Real-time systems fail quietly before they fail loudly.

Lag climbs for minutes before alarms fire. Lag, DLQ depth, and consumer offsets are the vital signs — watch them first.

02

Observability ships in the first PR.

A pipeline you can trace beats a clever one you can’t explain. SLOs, lag metrics, and a DLQ — or it doesn’t ship.

03

Replay recovery beats low latency.

You can tolerate extra milliseconds; you cannot tolerate unrecoverable state. Idempotent ingestion before speed.

04

Designed for bad data, not clean data.

Clean data is a best-case assumption. Drift, null coercion, and type changes are the default — validate at every boundary.

05

Schema contracts are the API between teams.

Every undocumented schema change is a future incident. Contract validation at the CDC boundary stops silent corruption.

06

Architecture is the tradeoffs, not the diagram.

What you chose, rejected, and why — the decision log matters more than another flowchart.

07

Boring systems win.

The Airflow DAG running 18 months without a page is the goal. Reliability beats cleverness — every time.

08

AI systems are only as good as their data.

Models inherit every flaw in the pipeline beneath them. The data engineer’s job gets more important when the model starts.

Production maturity isn’t the absence of incidents. It’s how cleanly the next one is prevented.

Operating principle · 5 yrs in production

Building now

What I'm building now.

Mostly AI infrastructure. To me it's the same job as data engineering, one layer up — the systems below are being extended and shipped now.

2025–26 · active

RAG Pipelines: Self-directed RAG project with LangChain covering chunking, embedding, and retrieval quality optimization. See CS·05. (personal project · deployed)

Vector infrastructure: Production patterns for serving embeddings alongside relational feature stores, comparing pgvector, Pinecone, and Weaviate for different workloads. (prototype)

LLM data pipelines: Treating LLM inputs and outputs as structured data, with schemas, contracts, and Airflow-orchestrated batch inference at scale. (building)

LLM observability: Treating prompts, tools, and traces as data contracts, and monitoring token usage, latency, and output quality as pipeline SLOs. (notes · prototype)

Apache Iceberg: Hidden-partition pruning & manifest-rewrite economics at warehouse scale, plus time-travel for AI training data versioning. (paper · code)

Real-time AI features: Online / offline parity for fraud and risk feature serving, with sub-second freshness for model inference without sacrificing data quality. (paper · benchmark)

Open source

13 modules, one repository.

SQL and CDC pipelines through RAG, LLM evaluation, and real-time feature stores — built to mirror production.

Python 99.7% HCL 0.3%

View repository AWS → Azure mapping

data-engineer-portfolio/

01_sql/ · Advanced SQL & window functions
02_python/ · Reusable utilities w/ tests
03_spark_pyspark/ · Spark transforms & opt.
04_cloud_aws/ · Glue, Lambda, diagrams
05_end_to_end_projects/ · CDC & batch pipelines
06_devops/ · Docker, CI/CD, Terraform
07_data_quality_framework/ · Great Expectations
08_healthcare_data_governance/ · HIPAA masking & lineage
09_snowflake_optimization/ · Query tuning & clustering
10_pipeline_monitoring_sla/ · CloudWatch SLA & alerts
11_rag_pipeline/ · Enterprise RAG pipeline
12_llm_eval_harness/ · LLM evaluation harness
13_realtime_feature_store/ · Real-time feature store

MIT licensed · Python github.com/pawanyandapalli7 →

Certifications

Data platform & applied AI infrastructure.

Each one came with hands-on projects, not just an exam.

2025 — 2026

Microsoft · New · Jun 2026

Microsoft Certified: Fabric Data Engineer Associate (DP-700)

Issued Jun 2026 · Expires Jun 2027 · Fabric · Azure · Data engineering

Credential ID FD9ADBF3BEDBBCE7

Data platform

Databricks

AWS Databricks Platform Architect

Issued Aug 2025 · Expires Aug 2027

Designs Databricks lakehouse architectures on AWS — workspaces, networking, security, and cost governance.

Credential ID 159432612

Databricks

Azure Databricks Platform Architect

Issued Sep 2025

Designs Azure Databricks platform architectures — identity, networking, and governance with Unity Catalog.

Credential ID 161234702

dbt Labs

dbt Fundamentals

Issued Oct 2025

Builds tested, version-controlled transformation layers — models, sources, tests, docs, and deployments.

Credential ID 162432944

Applied AI & LLM

NVIDIA

AI for All: From Basics to GenAI Practice

Foundations through applied GenAI practice across the NVIDIA AI stack.

Jan 2026 · GenAI

DeepLearning.AI

AI For Everyone

Frames AI project scoping, feasibility, and organizational adoption — the business side of AI systems.

Oct 2025 · AI Strategy

Contact

Let’s build the data infrastructure that powers great AI products.

Open to Data Engineer, Senior Data Engineer, Data Platform Engineer, and Data Infrastructure Engineer roles — teams building at the intersection of reliable data pipelines and AI systems. Bay Area, CA — remote / hybrid friendly.

Phone: +1 779 977 7799
LinkedIn: linkedin.com/in/pawanyandapalli
GitHub: github.com/pawanyandapalli7
Location: Bay Area, CA · Remote / Hybrid
Looking for: Data Engineer · Senior DE · Data Platform · Data Infra

Data infrastructure that survives scale.

Languages

Data Engineering

Streaming & Messaging

Cloud & Warehousing

Lakehouse & Big Data

DevOps & Infrastructure

Governance & Observability

AI Infrastructure

BI Tools
