Deep Learning Models for Predicting Hardware Failures Using Large-Scale Telemetry Data
Abstract
Modern computing infrastructure produces large volumes of telemetry data from system logs, sensors, and health metrics. Applying deep learning to this data can help predict hardware failures before they occur, reducing downtime and maintenance costs. This paper explores Long Short-Term Memory (LSTM) networks and autoencoder models for predicting failures from large-scale telemetry data. The proposed LSTM model learns the temporal dynamics that precede failures, whereas an LSTM-based autoencoder detects anomalous behavior by reconstructing normal sequences and flagging those that deviate from them. Both models are trained and tested on multi-source telemetry data and achieve high accuracy and precision. The LSTM model in particular performs well at detecting the precursor signals of failures and outperforms a baseline random forest classifier. The findings indicate that deep learning can exploit telemetry features (CPU usage, temperature, disk health indicators, etc.) to give early warning of hardware problems. We discuss model accuracy, precision-recall characteristics, and the trade-off between false alarms and missed detections. The paper highlights the potential of deep learning-based predictive maintenance to improve system reliability.
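The trade-off between false alarms and missed detections can be illustrated with a minimal sketch. It assumes each telemetry window has already been scored by an anomaly detector (for example, by the reconstruction error of an autoencoder) and that a window is flagged when its score exceeds a threshold. The function name `precision_recall`, the scores, the labels, and the thresholds are all hypothetical and are not taken from the paper.

```python
def precision_recall(errors, labels, threshold):
    """Flag a window as anomalous when its score exceeds `threshold`,
    then compare the flags with the true failure labels."""
    tp = fp = fn = 0
    for err, is_failure in zip(errors, labels):
        flagged = err > threshold
        if flagged and is_failure:
            tp += 1          # correctly predicted failure
        elif flagged and not is_failure:
            fp += 1          # false alarm
        elif not flagged and is_failure:
            fn += 1          # missed detection
    # Convention assumed here: with nothing to count, report 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Made-up anomaly scores and labels (1 = window preceded a failure).
errors = [0.10, 0.15, 0.20, 0.35, 0.40, 0.55, 0.70, 0.90]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

for threshold in (0.3, 0.5):
    p, r = precision_recall(errors, labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, the lower threshold catches every failure at the cost of one false alarm, while the higher threshold removes the false alarm but misses one failure.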

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.