Site Reliability Engineer • Observability • Production Support • Technical Support
🎯 Career Objective
- 3+ years of overall experience across Site Reliability Engineering, observability platforms, IT infrastructure and application production support, and Java application support.
- Experienced Observability Monitoring Engineer with over 3 years in administrative roles, specializing in providing 24/7 support for global customers in production environments.
- Proficient in APM and monitoring tools such as Datadog, Grafana, Kibana, Dynatrace, Splunk, OMI, Tidal, and SiteScope. Skilled in managing SLOs, SLIs, and SLAs, and well-versed in ITIL practices including incident, change, major incident, and problem management. Proven ability in Datadog administration, dashboard creation, and monitoring services in production environments.
- API Development: Engineered secure and robust API endpoints for CRUD operations, ensuring data integrity and correct performance.
- Debugging & Maintenance: Adept at bug fixing and debugging complex applications to maintain system health.
- Frameworks: Good working knowledge of developing and troubleshooting applications using Spring Boot and Spring MVC.
- Timely Resolution: Committed to diagnosing and resolving system issues to minimize downtime and impact.
Monitoring & Observability
Proficient in the end-to-end administration of a comprehensive APM and monitoring stack, including:

Tools: Datadog | Grafana | Kibana | New Relic
- Datadog Administration: Onboarding services, configuring agents, and tuning metrics collection.
- Visualization: Designing and building insightful dashboards tailored to SLOs/SLIs and business KPIs.
- Alerting: Implementing and managing alert policies to reduce noise and improve MTTR.
Process & Framework
- Service Management: Skilled in managing SLOs, SLIs, and SLAs to align IT services with business goals.
- ITIL Practices: Well-versed in ITIL frameworks for Incident, Change, Major Incident, and Problem Management.
🔑 Key Skills
- ITIL: Incident, Change, Major Incident, Problem Management; SLOs, SLIs, SLAs (metrics, traces, logs).
- Alerting: success/error/composite alerts, threshold tuning, refinement, noise/toil reduction.
- App monitoring: triage in production, dev collaboration via JIRA, runbooks, dashboards, reporting.
- Tooling: Grafana (error insights), Kibana (log analysis), Datadog admin (monitors, dashboards), PagerDuty (on-call).
- Process: onboarding services to monitoring, gap analysis, RCA participation, weekly/monthly reporting.
- Programming: Java, Python (custom metrics, light instrumentation).
Professional Experience
DXC Technology, Bangalore — Site Reliability Engineer (Dec 2022 – Present)
Client: Qatar Airways — Payments Monitoring Group
- Provided 24/7 support to global customers for payments applications in production environments.
- Managed and administered the full observability stack: Datadog, Grafana, Kibana, Dynatrace, Splunk, OMI, Tidal, SiteScope.
- Implemented SLOs, SLIs, SLAs to ensure performance and reliability goals were met and measured.
- Applied ITIL practices for Incident, Change, Major Incident, and Problem Management.
- Created and maintained comprehensive Datadog dashboards and monitors for real-time application performance tracking.
- Onboarded new application services into production environments and performed gap analysis to ensure monitoring coverage.
- Developed and refined alerts for KPIs such as success rate, error rate, and composite metrics to reduce noise and improve MTTR.
- Collaborated with development teams via JIRA for ticket creation, escalation, and resolution tracking.
- Configured and monitored alerts with PagerDuty to support on-call rotations and timely incident response.
- Performed advanced observability tasks: custom dashboards, widgets, and panels in Datadog; threshold tuning; alert noise reduction.
- Analyzed and exported observability data from Datadog into Google Sheets, reporting key insights and trends to business stakeholders.
- Monitored applications, services, and jobs across Datadog, Grafana, and Kibana.
- Prepared detailed incident checklists and shared structured, client-facing updates.
- Worked extensively on SLA & SLI definitions for critical payments services in production systems.
- Configured JIRA dashboards as per project requirements for enhanced visibility and reporting.
Wipro, Bangalore — Site Reliability Engineer (Apr 2022 – Nov 2022)
Client: HSBC — Payments Monitoring
- Provided 24/7 L1/L2 support to global customers for critical payments applications in production environments.
- Managed and administered the APM/Monitoring stack: Datadog, Grafana, Kibana, OMI, Tidal, SiteScope.
- Configured and tuned alert thresholds, significantly reducing noise from ineffective alerts and improving signal clarity.
- Monitored and supported applications, services, and batch jobs across multiple platforms to ensure system health.
- Created and escalated JIRA tickets to development teams for faster incident resolution and tracking.
- Prepared structured incident checklists and runbooks, sharing clear documentation with clients and business teams.
- Defined and monitored SLA/SLI metrics for payment services using Datadog to uphold service quality agreements.
- Built and customized JIRA dashboards based on project requirements to streamline workflow and visibility.
- Configured PagerDuty alerting and implemented escalation workflows to ensure on-call responsiveness.
- Performed detailed incident analysis and engaged with Root Cause Analysis (RCA) teams to drive long-term fixes.
- Generated and shared daily, weekly, and monthly status reports with business stakeholders to communicate system health and incidents.
- Conducted basic front-end troubleshooting of applications and engaged next-level support teams for complex issues.
- Provided front-line and second-level IT operations support for client service delivery.
- Supported weekend server patching activities, including comprehensive pre- and post-patching validation checks.
🛠️ Technical Stack
📊 Monitoring & Observability

🎫 Ticketing Systems


💻 Programming Languages

🗄️ Databases

🔄 CI/CD

📋 Practices & Frameworks

🖥️ Operating Systems


🎯 Java Ecosystem

If you’d like to collaborate, ask a question, or just say hello — feel free to drop a message!