Your mission
Your responsibilities:
- Design, build, and maintain Go microservices that handle AI model inference, data processing pipelines, and real-time streaming workflows.
- Architect scalable APIs (gRPC/REST) that serve as the bridge between AI models and production applications.
- Own the Kubernetes infrastructure (EKS), including deployments, autoscaling policies, service mesh, and cluster health monitoring.
- Implement service-to-service communication using gRPC and message queues (RabbitMQ/SQS) for asynchronous processing.
- Integrate with cloud AI services (AWS Bedrock, OpenAI, Anthropic) and manage model serving infrastructure.
- Build multi-tenant capabilities, including authentication (JWT/JWKS), rate limiting, usage tracking, and tenant isolation.
- Partner with the Data & AI team to productionize machine learning models, wrapping them in production-ready services with health checks, circuit breakers, and graceful degradation.
- Build comprehensive observability: structured logging, metrics (Prometheus), distributed tracing (Jaeger/Tempo), and alerting.
- Implement CI/CD pipelines and infrastructure-as-code (Terraform) for automated deployments and disaster recovery.
- Ensure high availability through proper monitoring, incident response, and post-mortem analysis.
- Optimize resource utilization for GPU workloads and develop cost-efficient scaling strategies.