Lead Support Engineer
While technology is the heart of our business, a global and diverse culture is the heart of our success. We love our people and take pride in fostering a culture built on transparency, diversity, integrity, learning, and growth.
If working in an environment that encourages you to innovate and excel, not just in your professional life but in your personal life too, interests you, you will enjoy your career with Quantiphi!
As an Application Lead for the Production Engineering (PE) and Service Delivery (SD) stream, you will take ownership of the reliability, scalability, and availability of critical healthcare AI platforms. You will move beyond day-to-day ticket resolution to define the support strategy, with a focus on Site Reliability Engineering (SRE) principles.
You will lead a team of engineers, guiding them through complex incident resolutions, orchestrating cloud automation, and bridging the gap between Data Engineering, ML Engineering, and Operations. You will serve as the primary escalation point and technical architect for support operations, ensuring high availability for multi-geography projects.
Role & Responsibilities
Technical Leadership & Strategy
- Lead and mentor senior and junior support engineers; conduct code reviews and technical sessions, and manage 24/7 shift rotations.
- Implement SRE best practices by defining and tracking SLOs, SLIs, and Error Budgets.
- Act as Major Incident Manager and primary escalation point, leading war rooms, RCAs, and corrective actions.
Automation & Reliability Engineering
- Architect and enforce IaC standards using Terraform/CloudFormation to ensure reproducible, self-healing environments.
- Design and maintain monitoring, logging, and alerting systems for proactive issue detection.
- Ensure zero-downtime releases, automated rollbacks, and mature CI/CD pipelines.
Collaboration & Innovation
- Partner with Solution Architects and engineering teams to improve system supportability and maintainability.
- Introduce modern cloud-native tools to enhance automation and operational efficiency.
Technical Skills
Cloud & Infrastructure (AWS/GCP)
- Compute & Serverless: EC2/GCE, Lambda/Cloud Functions, autoscaling with custom metrics.
- Networking: VPCs, Transit Gateways, Load Balancers, DNS, VPN/Direct Connect.
- Security: IAM (least privilege), KMS, WAF, secure public endpoints.
- Storage: S3/GCS lifecycle policies, EBS/Persistent Disk optimization.
Software Engineering
- Strong development experience in Python (Flask/Django/FastAPI) or Java (Spring Boot).
- Proven ability in deep code debugging, performance tuning, and memory leak analysis.
- Hands-on experience with PRs, code reviews, and automated testing.
Microservices & Kubernetes
- Microservices using REST, gRPC, and messaging (Kafka/SQS/Pub/Sub), with resiliency patterns.
- Distributed tracing and observability using OpenTelemetry.
- Kubernetes (EKS/GKE): cluster upgrades, ingress, autoscaling (HPA/VPA), security policies.
- Experience with Helm charts and Kubernetes Operators.
Automation, IaC & DevOps
- Terraform modules, remote state management, multi-environment setups.
- Ansible for configuration management and OS hardening.
- CI/CD with Jenkins; Gitflow or trunk-based branching strategies.
- Artifact and container registry management (Docker, Artifactory, Nexus).
Databases & Caching
- PostgreSQL performance tuning, replication, and DR strategies.
- NoSQL databases (Cassandra/MongoDB) and caching layers (Redis/Memcached).