Contributed to the development of end-to-end ML and research pipelines at an early-stage AI startup. Designed and documented internal systems for model distillation, evaluation, and deployment, collaborating with senior engineers to streamline data workflows and improve automation and reproducibility across the stack.
- Built and maintained the model distillation framework from data generation to evaluation and deployment (the core distillation objective is sketched after this list).
- Authored two major internal technical wikis:
  - A comprehensive guide to the modern fine-tuning ecosystem (Swift, TRL, Megatron-LM, RL frameworks).
  - A deep-dive into distributed training protocols and scaling challenges beyond 30B parameters.
- Conducted performance benchmarking and API-driven model evaluations; reduced internal inference runtime from 20 minutes to 2 seconds through low-level argument tuning and process optimization.
- Automated large-scale evaluation pipelines and documented production workflows for remote development, SSH environments, and code review cycles.
- Collaborated cross-functionally on synthetic data generation for distillation pipelines, ensuring alignment between research output and production infrastructure.
- Integrated and optimized inference pipelines using vLLM and related frameworks for large-scale evaluation and model serving (see the inference sketch below).
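The distillation framework itself is internal, but the objective it optimizes is the standard one. Below is a minimal PyTorch sketch of a temperature-scaled KL distillation loss, assuming soft-label distillation over token logits; the temperature value and tensor shapes are illustrative placeholders, not details of the actual pipeline.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Temperature-scaled KL divergence between teacher and student
    token distributions (the standard soft-label distillation objective)."""
    # Soften both distributions with the same temperature.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # "batchmean" matches the mathematical definition of KL divergence;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# Toy usage: 4 token positions over a 32k-entry vocabulary (placeholder sizes).
student = torch.randn(4, 32_000, requires_grad=True)
teacher = torch.randn(4, 32_000)
loss = distillation_loss(student, teacher)
loss.backward()
```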
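For the vLLM-based pipelines, a minimal offline-inference sketch is shown below. The checkpoint name, parallelism degree, and sampling settings are assumptions for the example, not the production configuration; engine arguments such as `tensor_parallel_size`, `dtype`, and `gpu_memory_utilization` are the usual levers when tuning throughput.

```python
from vllm import LLM, SamplingParams

# Placeholder engine configuration; real values depend on the hardware and model.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical checkpoint
    tensor_parallel_size=2,
    dtype="bfloat16",
    gpu_memory_utilization=0.90,
)

sampling = SamplingParams(temperature=0.0, max_tokens=256)

prompts = [
    "Summarize the benchmark results in one sentence.",
    "List three failure modes of the distilled model.",
]

# The engine batches prompts internally (continuous batching), which is what
# makes large evaluation sweeps fast compared with one-request-at-a-time calls.
outputs = llm.generate(prompts, sampling)
for out in outputs:
    print(out.outputs[0].text.strip())
```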
This role provided extensive exposure to production-grade ML systems, distributed training, and the realities of scaling AI research infrastructure in startup environments.
Trained and benchmarked multiple model architectures (Transformers, CNN-BiLSTM hybrids, diffusion VAEs, etc.) on structured and scientific datasets.
- Used Kaggle’s GPU runtime for rapid prototyping of scientific ML experiments before scaling them to NexaCompute or Hugging Face.
- Developed and validated preprocessing pipelines (tokenization, normalization, augmentation), verifying dataset consistency before release (see the preprocessing sketch after this list).
- Logged and tracked experiment performance with Weights & Biases (W&B) for reproducibility (see the training-loop sketch after this list).
- Applied aggressive optimization strategies (e.g., efficient batching, precision tuning, caching) to minimize runtime and compute cost.
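To make the preprocessing bullet concrete, here is a minimal tokenization-and-normalization pass using Hugging Face `datasets` and `transformers`; the dataset, tokenizer, and sequence length are placeholder assumptions standing in for the actual scientific corpora.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder dataset and tokenizer; swap in the real corpus and vocabulary.
dataset = load_dataset("ag_news", split="train[:1%]")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess(batch):
    # Light normalization before tokenization: strip whitespace, lowercase.
    texts = [t.strip().lower() for t in batch["text"]]
    return tokenizer(texts, truncation=True, max_length=256)

tokenized = dataset.map(preprocess, batched=True, remove_columns=["text"])

# Consistency check before release: every example fits the length budget.
assert all(len(ids) <= 256 for ids in tokenized["input_ids"])
print(tokenized)
```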
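The last two bullets can be illustrated together with one small training-loop sketch: W&B run logging wrapped around a bfloat16 autocast step. The project name, model, batch size, and learning rate are hypothetical, and the run is opened in offline mode so the sketch does not require an account.

```python
import torch
import torch.nn as nn
import wandb

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 10).to(device)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

# Hypothetical project/config; real runs would log the full hyperparameter set.
wandb.init(project="scientific-ml-benchmarks",
           config={"lr": 3e-4, "batch_size": 64},
           mode="offline")

for step in range(100):
    x = torch.randn(64, 128, device=device)        # placeholder batch
    y = torch.randint(0, 10, (64,), device=device)

    # bfloat16 autocast trades a little precision for memory and speed;
    # it is disabled automatically when this sketch runs on CPU.
    with torch.autocast(device_type=device, dtype=torch.bfloat16,
                        enabled=(device == "cuda")):
        loss = loss_fn(model(x), y)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)

    wandb.log({"train/loss": loss.item(), "step": step})

wandb.finish()
```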