ML engineer and technical lead with 10+ years building production GenAI and ML systems at scale. Own the full GenAI stack, from fine-tuning and alignment to embedding infrastructure and production serving. Shipped systems handling 50M+ monthly queries, driving $108M+ in attributable revenue, and operating inside a $150B+ retail ecosystem. Take work from research artifact to production system end-to-end.
Core Expertise
Fine-tuning & Alignment: PEFT/LoRA fine-tuning on domain-specific corpora, quantization (GPTQ, AWQ, GGUF), safety evaluation, red-teaming, and multi-intent SLM development. Hands-on from dataset curation through deployment.
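The core LoRA idea behind this kind of fine-tuning can be sketched in a few lines: the base weight matrix W stays frozen while two small low-rank factors A and B are trained, and the effective weight is W + (alpha/r)·B·A. A minimal numpy sketch, with all names and shapes illustrative rather than any specific library's API:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=8):
    """Forward pass with a LoRA adapter: y = x @ (W + (alpha/r) * B @ A).T
    W: frozen base weight (d_out, d_in); A: (r, d_in); B: (d_out, r) are the
    only trainable parameters."""
    delta = (alpha / r) * (B @ A)  # rank-r update to the frozen weight
    return x @ (W + delta).T

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 32, 8
W = rng.standard_normal((d_out, d_in))   # frozen
A = rng.standard_normal((r, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, r))                 # zero init: adapter starts as a no-op
x = rng.standard_normal((4, d_in))
```

Because B starts at zero, the adapted model initially matches the frozen base model exactly; training only has to learn the low-rank delta, which is what keeps PEFT cheap.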
Semantic Search & Embeddings: Bi-encoder and cross-encoder architectures, ANN indexing, hybrid sparse-dense retrieval pipelines, and embedding evaluation frameworks. Experience taking dense retrieval from prototype to 50M+ query/month production traffic.
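The retrieve-then-rerank pattern above can be sketched with cosine similarity standing in for both stages. In production the bi-encoder and cross-encoder are learned models and the top-k search runs on an ANN index; everything below is an illustrative stand-in:

```python
import numpy as np

def retrieve(query_vec, doc_vecs, k=3):
    """Bi-encoder stage: cosine similarity of the query against precomputed
    document embeddings; cheap, so it can scan a large candidate pool."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return np.argsort(d @ q)[::-1][:k]  # indices of top-k by similarity

def rerank(candidates, score_fn):
    """Cross-encoder stage: re-score only the small candidate set with an
    expensive pairwise scorer (stubbed here by score_fn), sorted descending."""
    return sorted(candidates, key=score_fn, reverse=True)

# Toy corpus of 4 doc embeddings; doc 2 points the same way as the query.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [-1.0, 0.0]])
query = np.array([0.6, 0.8])
top = retrieve(query, docs, k=2)
```

The split matters for cost: the bi-encoder lets document vectors be embedded offline and indexed, while the cross-encoder's per-pair scoring only ever sees the short candidate list.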
RAG Systems: Chunking strategies, retrieval reranking, structured output generation, and hallucination evaluation. Built RAG pipelines backed by Azure OpenAI with custom retrieval layers and output validation.
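One common chunking strategy is a fixed-size sliding window with overlap, so content that straddles a boundary stays retrievable from both neighboring chunks. A minimal sketch; the sizes are arbitrary and in practice are tuned per corpus:

```python
def chunk(tokens, size=200, overlap=50):
    """Split a token sequence into overlapping windows of `size` tokens.
    Consecutive windows share `overlap` tokens, so a sentence cut by one
    chunk boundary still appears whole in the adjacent chunk."""
    if size <= overlap:
        raise ValueError("chunk size must exceed overlap")
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]
```

Larger overlap improves recall at retrieval time at the cost of index size and duplicated context in the prompt.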
Production Serving: vLLM, Triton Inference Server, OpenAI-compatible API deployment, latency optimization, model compression, and controlled A/B rollouts. Optimized GPU inference to hit P99 <200ms at scale.
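P99 here means the 99th-percentile request latency, i.e. 99% of requests complete within the target. A minimal sketch of how such a tail-latency figure is computed (nearest-rank method; the sample data is made up):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest observed value such that at
    least p% of samples are at or below it."""
    s = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(s)))
    return s[rank - 1]

latencies_ms = list(range(1, 101))  # stand-in for per-request timings
p99 = percentile(latencies_ms, 99)
```

Tail percentiles, not averages, are what serving SLOs target: a handful of slow GPU batches can leave the mean looking healthy while P99 blows past budget.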
Foundation Models: Pre-training transformer-based models from scratch on large-scale corpora: custom tokenization, training objective design, distributed training with DeepSpeed/PyTorch FSDP, and data curriculum scheduling. Applied to behavioral sequence modeling and representation learning, with fine-tuning for downstream personalization, ranking, and prediction tasks.
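Data curriculum scheduling, mentioned above, mixes data sources in proportions that change over training. A toy scheduler to illustrate the idea; the source names, mixing ratios, and linear ramp are all hypothetical:

```python
def curriculum_weights(step, total_steps):
    """Linearly shift sampling weight from a generic source ("web") to a
    harder in-domain corpus ("domain") as training progresses."""
    frac = min(step / total_steps, 1.0)
    w_domain = 0.1 + 0.6 * frac  # ramp the domain share from 10% to 70%
    return {"web": 1.0 - w_domain, "domain": w_domain}
```

The data loader would sample each batch's sources according to these weights, easing the model into the harder distribution instead of presenting it uniformly from step zero.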
Production Systems
- Large-scale semantic search: End-to-end dense retrieval system on a top-10 US e-commerce platform. Bi-encoder + cross-encoder pipeline with ANN indexing and GPU inference. 50M+ queries/month at P99 <200ms; +5% conversion lift over keyword baselines; $108M+ attributable revenue.
- Domain-adapted SLM: PEFT/LoRA fine-tuning and quantization for multi-intent query understanding. Includes safety eval and red-teaming. Deployed via Triton + vLLM as an OpenAI-compatible API.
- RAG pipeline: LLM-driven system using Azure OpenAI for natural language → structured output generation. Custom retrieval layer with output validation for structured task domains.
- Hierarchical taxonomy classifier: Hybrid LSTM/CNN predicting across 500K+ categories, deployed via TensorFlow Serving in high-throughput production traffic.
Technical Skills
LLMs & GenAI: Fine-tuning, PEFT/LoRA, quantization (GPTQ/AWQ/GGUF), RAG systems, semantic search, embedding models, vLLM, Triton Inference Server, vector databases, Transformers
ML Systems: PyTorch, TensorFlow, DeepSpeed, distributed training, model serving at scale, A/B testing, pipeline automation
Infrastructure: Python, SQL, Docker, AWS, Azure, REST/gRPC, Elasticsearch, Redis, Git, CI/CD