AI-Augmented ITSM: Autonomous Incident Triage
DOI:
https://doi.org/10.15662/IJRAI.2024.0702001
Keywords:
AI-Augmented ITSM, Autonomous Incident Triage, Incident Classification, DeepTriage, SoftNER, DeCaf, Multi-modal Analysis, Machine Learning in ITSM, Root-Cause Analysis, Knowledge Extraction
Abstract
Modern IT environments are increasingly complex, and high incident volumes make manual triage inefficient and error-prone. AI-augmented IT service management (ITSM) systems promise to automate the classification, prioritization, routing, and resolution of incidents, reducing mean time to resolution (MTTR) and improving service reliability. This paper examines pre-2022 advances in autonomous incident triage, focusing on AI methods that support decision-making within ITSM. We analyze DeepTriage (Microsoft Azure), an ensemble of gradient-boosted trees, clustering, and deep networks deployed for cloud incident categorization, which achieves an F1 score of 82.9% in production across thousands of teams [1]. SoftNER, used at Microsoft, extracts structured knowledge (entities such as system components and error codes) from incident reports with a BiLSTM-CRF model, improving downstream triage accuracy [2]. DeCaf, another Microsoft system, automates the diagnosis and triage of KPI-based performance regressions using machine learning and pattern mining, surfacing root causes from log data [3]. Additional research has shown that multi-modal analysis, which incorporates images alongside text, improves routing and resolution outcomes in IT support [4]. We synthesize these contributions into a unified methodology that combines multi-modal input, entity extraction, predictive routing, and root-cause diagnosis in an autonomy-capable ITSM pipeline. Advantages include faster triage, greater consistency, and scalable performance under high incident loads. Disadvantages include challenges with model trust and explainability, dependence on data quality, integration hurdles, and ongoing monitoring needs. The study shows that while fully autonomous incident handling remains aspirational, AI-augmented triage systems have already delivered significant operational improvements.
Future directions involve enhancing explainability, expanding multimodal understanding, integrating real-time monitoring (AIOps), and supporting closed-loop automation.
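As a rough illustration of how the entity-extraction and predictive-routing stages of such a pipeline might fit together, consider the minimal Python sketch below. It is not drawn from DeepTriage, SoftNER, or DeCaf: the regular expressions stand in for a learned extractor, the keyword lookup stands in for a trained routing model, and every name in it (ENTITY_PATTERNS, TEAM_BY_COMPONENT, TriageResult, extract_entities, route, and the team labels) is a hypothetical chosen for this example.

```python
import re
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical patterns standing in for a learned entity extractor.
# (SoftNER uses a BiLSTM-CRF model, not regular expressions.)
ENTITY_PATTERNS = {
    "error_code": re.compile(r"\b(?:0x[0-9A-Fa-f]+|HTTP \d{3})\b"),
    "component": re.compile(r"\b(?:database|gateway|storage|network)\b", re.IGNORECASE),
}

# Hypothetical routing table: component keyword -> owning team.
TEAM_BY_COMPONENT = {
    "database": "data-platform",
    "gateway": "edge-services",
    "storage": "storage-ops",
    "network": "network-ops",
}


@dataclass
class TriageResult:
    entities: dict = field(default_factory=dict)
    team: str = "service-desk"  # fallback queue when no component is recognized
    confidence: float = 0.0


def extract_entities(text: str) -> dict:
    """Return every pattern match in the incident text, keyed by entity type."""
    return {name: pattern.findall(text) for name, pattern in ENTITY_PATTERNS.items()}


def route(text: str) -> TriageResult:
    """Route to the team owning the most frequently mentioned component.

    Confidence is the share of component mentions that point at the chosen
    team; a production system would use a trained classifier instead.
    """
    entities = extract_entities(text)
    components = [c.lower() for c in entities["component"]]
    if not components:
        return TriageResult(entities=entities)
    top_component, hits = Counter(components).most_common(1)[0]
    return TriageResult(
        entities=entities,
        team=TEAM_BY_COMPONENT[top_component],
        confidence=hits / len(components),
    )


if __name__ == "__main__":
    incident = (
        "Gateway returns HTTP 503 after storage failover; "
        "gateway pods restarting, code 0x80070005"
    )
    print(route(incident))
```

In this sketch a low confidence value, or a fall-through to the default queue, would be the natural point to hand the incident to a human reviewer, which is consistent with the augmentation-rather-than-full-autonomy position taken above.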
References
1. Pham, P., Jain, V., Dauterman, L., Ormont, J., & Jain, N. (2020). DeepTriage: Automated Transfer Assistance for Incidents in Cloud Services. arXiv.
2. Shetty, M., Bansal, C., Kumar, S., Rao, N., Nagappan, N., & Zimmermann, T. (2020). Neural Knowledge Extraction From Cloud Service Incidents. arXiv.
3. Bansal, C., Renganathan, S., Asudani, A., Midy, O., & Janakiraman, M. (2019). DeCaf: Diagnosing and Triaging Performance Issues in Large-Scale Cloud Services. arXiv.
4. Mandal, A., Agarwal, S., Malhotra, N., Sridhara, G., Ray, A., & Swarup, D. (2019). Improving IT Support by Enhancing Incident Management Process with Multi-modal Analysis. arXiv.