Site Reliability Engineer • Observability • Java App Support • DevOps Engineer
Professional Summary
Results-driven Site Reliability Engineer with 4+ years of experience ensuring high availability and performance for mission-critical payment platforms on AWS. At DXC Technology:
- Reduced MTTR by 30% through Python-based automation and structured incident response workflows
- Cut alert noise by 40% via systematic Datadog monitor optimization, directly improving on-call quality and MTTD
- Sustained 99.9%+ uptime across 50+ microservices processing millions of financial transactions daily
- Reduced P1/P2 repeat incidents by 25% by eliminating root causes identified through RCA and applying permanent fixes
Deep expertise in incident command, Kubernetes, CI/CD pipelines, Terraform IaC, and production Java/Spring Boot systems.
Key Skills
- Cloud & Infrastructure: AWS (EC2, S3, VPC, IAM, Auto Scaling), Kubernetes, Docker, Terraform
- Observability & Monitoring: Datadog (APM, Logs, SLOs, Monitors), Splunk, Grafana, New Relic, Dynatrace
- SRE Practices: Incident Management, P1/P2 War Rooms, RCA, SLI/SLO, Error Budgets, Alerting, On-Call
- Programming: Python (automation, log analysis, alerting scripts), Java
- CI/CD & DevOps: Azure DevOps, Jenkins, GitHub Actions, Git, Maven
- Frameworks & Databases: Spring Boot, Spring MVC, Spring Data JPA, Spring Cloud; MySQL, PostgreSQL
- Ticketing Tools: Jira, ServiceNow
- ITIL Practices: Incident, Change, Major Incident, and Problem Management
Monitoring & Observability
Proficient in the end-to-end administration of a comprehensive APM and monitoring stack, including:
Tools: Datadog | Grafana | Kibana | New Relic | Dynatrace | Splunk
- Datadog Administration: Onboarding services, configuring agents, tuning metrics collection, and managing monitors end-to-end.
- Visualization: Designing Datadog dashboards and SLO tracking for real-time visibility across logs, metrics, and APM traces.
- Alerting: Optimizing monitor thresholds to reduce alert noise by 40%, improving MTTD and on-call quality.
Process & Framework
- Service Management: Skilled in managing SLOs, SLIs, and SLAs to align IT services with business goals.
- ITIL Practices: Well-versed in ITIL frameworks for Incident, Change, Major Incident, and Problem Management.
Key Achievements
| Achievement | Impact |
| --- | --- |
| Reduced MTTR by 30% | Python automation scripts for alert triage, log correlation & incident response at Qatar Airways |
| Cut alert noise by 40% | Systematic Datadog monitor tuning; improved on-call quality & MTTD |
| Sustained 99.9%+ uptime | Mission-critical payment infrastructure handling millions of daily international transactions |
| Reduced repeat incidents by 25% | RCA-driven root cause elimination with permanent corrective fixes |
Professional Experience
DXC Technology, Bangalore - Site Reliability Engineer (Dec 2022 – Present)
Client: Qatar Airways - Payments Platform | AWS · Datadog · Kubernetes · Python · Java/Spring Boot · Microservices
- Managed fault-tolerant AWS infrastructure (EC2, VPC, IAM, S3, Auto Scaling) underpinning 50+ microservices processing high-volume international payment transactions.
- Maintained 99.9%+ uptime for mission-critical financial services, consistently meeting all SLO targets across production environments.
- Reduced MTTR by 30% by engineering Python automation scripts for alert triage, log correlation, and incident response workflows, eliminating repetitive manual investigation steps.
- Optimized Datadog monitors and alerting thresholds, reducing alert noise and false positives by 40%, enabling faster and more accurate incident detection.
- Designed and owned Datadog dashboards and SLO tracking for end-to-end system visibility spanning logs, metrics, and APM traces.
- Led P1/P2 incident war rooms and post-incident root cause analysis (RCA); implemented permanent corrective actions that cut repeat incidents by 25%.
- Deployed and managed containerized workloads on Kubernetes: resolved CrashLoopBackOff failures, tuned resource limits/requests, and implemented HPA for cost-effective auto-scaling.
- Built and maintained CI/CD pipelines via Azure DevOps (Git, Maven), enabling reliable zero-downtime deployments with significantly reduced rollback rates.
- Provisioned and managed AWS resources using Terraform (IaC), improving environment consistency, reducing provisioning errors, and accelerating deployment velocity.
- Partnered with development teams to troubleshoot Java/Spring Boot applications by analyzing JVM metrics, heap dumps, GC logs, and API latency data to resolve production performance bottlenecks.
- Created and escalated Jira & ServiceNow tickets to development teams for faster incident resolution and tracking.
- Prepared structured incident runbooks and playbooks, shared with clients and business stakeholders for operational clarity.
Wipro Ltd - Site Reliability Engineer (Apr 2022 – Nov 2022)
Domain: Enterprise Solutions | Critical Transaction Platforms | Datadog · Grafana · Python · AWS · Java/Spring Boot
- Supported mission-critical AWS environments for international enterprise clients; drove SLO/SLI optimization using Datadog and Grafana.
- Built Python automation scripts for alert validation and monitoring health checks, improving team efficiency and reducing noise-driven false escalations.
- Analyzed system logs and cloud deployment patterns to identify recurring failure modes; implemented targeted fixes reducing incident recurrence.
- Coordinated production readiness reviews for new payment services; improved cross-team onboarding documentation and operational runbooks.
Personal Projects
- Built an end-to-end observability stack with custom Datadog dashboards, SLO tracking, log pipelines, and APM traces for a personal microservices environment.
- Replicated production-grade alerting patterns to validate and refine monitor configurations, achieving a 40% reduction in alert noise.
- Authored runbooks and incident playbooks as part of an open learning initiative, publicly available at iamdinesh.xyz.
Technical Stack
Monitoring & Observability
Ticketing Systems
Cloud & Infrastructure
Programming Languages
Databases
CI/CD
Practices & Frameworks
Java Ecosystem
Operating Systems
Education
Master of Business Administration (MBA), JNTU Anantapur (2017–2019)
Transitioned into Site Reliability Engineering through self-directed cloud study, hands-on Java/SQL lab work, and professional on-the-job experience.
Certifications
- AWS Certified Solutions Architect - Associate (In Progress; Exam Scheduled 2026)
If you'd like to collaborate, ask a question, or just say hello, feel free to drop a message!