Multi-Model Inference Cluster

vLLM ROCm Docker AMD GPU

Deployed a 4-node inference cluster serving 12 concurrent LLMs with sub-200ms first-token latency on AMD Instinct GPUs using ROCm 6.0.

Internal Developer Platform

Ansible Docker Compose Systemd

Built a self-service provisioning platform enabling 80+ engineers to spin up standardized dev environments in under 90 seconds.

GPU-Automated CI/CD Pipeline

GitHub Actions PyTorch ROCm

Designed a CI/CD pipeline with automated GPU workload testing, model benchmarking, and performance regression detection.

Self-Hosted AI Gateway

Ollama Nginx Rate Limiting

Architected a self-hosted API gateway routing requests across multiple local models with authentication, rate limiting, and usage analytics.

Infrastructure Monitoring Stack

Prometheus Grafana Alertmanager

Deployed a full observability stack with custom exporters for GPU metrics, model latency, and queue depth across a distributed fleet.

Disaster Recovery Automation

Terraform Bash Systemd

Automated backup, failover, and recovery workflows for a multi-site infrastructure with RTO under 15 minutes.