Self-Healing AI-Native Real-Time Data Pipelines: Autonomous Resilience For Large-Scale Streaming Systems

Yogesh Pugazhendhi Duraisamy Rajamani

doi:10.63278/jicrcr.vi.3653

Abstract

Large streaming platforms routinely face operational problems such as data drift, throughput degradation, partition imbalance, and cascading failures, all of which reduce availability and performance. Existing monitoring and rule-based automatic remediation solutions are poorly suited to workloads that demand millisecond-level latency and high availability. This article introduces a fully self-healing, AI-native real-time data pipeline that integrates machine learning into the control plane of the streaming platform. It presents an end-to-end architecture that combines graph neural networks and transformers for hybrid anomaly detection, LSTM-based predictive fault modeling, and reinforcement learning agents that autonomously select the best remediation policy (e.g., dynamic resource scaling, partition rebalancing, or dataflow rerouting). The framework implements continuous healing through a detect-diagnose-predict-decide-act-verify-learn loop. Evaluation on synthetic and real-world high-throughput streaming workloads shows improvements in downtime, latency, fault domains, and resource utilization, establishing a new model for autonomous stream processing infrastructure that can keep mission-critical workloads running in cloud, hybrid, and edge environments.
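To make the detect-diagnose-predict-decide-act-verify-learn loop concrete, the following is a minimal Python sketch of one pass through it. It is not the framework described in the article: a z-score threshold stands in for the graph neural network and transformer detector, a linear extrapolation stands in for the LSTM fault model, and an epsilon-greedy bandit with optimistic initial values stands in for the reinforcement learning agent. All names (`heal_once`, `EpsilonGreedyAgent`, the metric names, the `act` callback) are hypothetical and chosen only for illustration.

```python
import random
from statistics import mean, pstdev

# Remediation options named in the abstract; the identifiers are illustrative.
ACTIONS = ("scale_out", "rebalance_partitions", "reroute_dataflow")


def zscore(history, value):
    """Distance of value from the history mean, in standard deviations."""
    sigma = pstdev(history) or 1e-9  # guard against a flat history
    return abs(value - mean(history)) / sigma


def detect(metrics, latest, threshold=3.0):
    """Detect: names of metrics whose latest reading is an outlier (stand-in detector)."""
    return [m for m in metrics if zscore(metrics[m], latest[m]) > threshold]


def diagnose(metrics, latest, anomalous):
    """Diagnose: blame the anomalous metric that deviates most from its baseline."""
    return max(anomalous, key=lambda m: zscore(metrics[m], latest[m]))


def predict(history, latest, steps=3):
    """Predict: extrapolate the most recent change linearly (stand-in for an LSTM)."""
    return latest + steps * (latest - history[-1])


class EpsilonGreedyAgent:
    """Decide/learn: running mean reward per (cause, action) pair (stand-in for RL)."""

    def __init__(self, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.value = {}  # (cause, action) -> mean reward so far
        self.count = {}  # (cause, action) -> number of trials

    def decide(self, cause):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)  # explore
        # Untried actions default to 1.0 (optimistic) so each gets tried once.
        return max(ACTIONS, key=lambda a: self.value.get((cause, a), 1.0))

    def learn(self, cause, action, reward):
        key = (cause, action)
        self.count[key] = self.count.get(key, 0) + 1
        old = self.value.get(key, 0.0)
        self.value[key] = old + (reward - old) / self.count[key]


def heal_once(metrics, latest, agent, act):
    """Run one pass of the loop; `act` applies an action and returns new readings."""
    anomalous = detect(metrics, latest)
    if not anomalous:
        return None  # nothing to heal
    cause = diagnose(metrics, latest, anomalous)
    forecast = predict(metrics[cause], latest[cause])
    action = agent.decide(cause)
    after = act(action)                  # act
    healed = not detect(metrics, after)  # verify
    agent.learn(cause, action, 1.0 if healed else 0.0)
    return {"cause": cause, "forecast": forecast, "action": action, "healed": healed}
```

In this sketch, `metrics` maps each metric name to a list of recent baseline readings and `latest` maps it to the newest reading. Rewarding the agent only after the verify step means it learns from observed recovery rather than from the action it intended, which mirrors the ordering of the loop in the abstract.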
