Predictive Maintenance#
Predictive maintenance estimates when a machine component will fail, enabling maintenance scheduling after most useful life is consumed but before failure occurs. The distinction from reactive maintenance (fix after failure) and preventive maintenance (replace on a fixed schedule regardless of condition) is that predictive maintenance uses actual sensor measurements to forecast the remaining useful life (RUL) of a component, optimizing the trade-off between maintenance cost and downtime risk.
The economic motivation is straightforward. Replacing a bearing at 50% of its useful life (preventive) wastes 50% of the bearing’s value. Waiting until failure (reactive) causes unplanned downtime, potential collateral damage to adjacent components, and safety risk. Predicting failure 2โ4 weeks in advance allows scheduling maintenance during a planned downtime window, consuming 80โ95% of the component’s useful life with near-zero unplanned downtime.
Remaining Useful Life (RUL) Estimation#
RUL estimation is a regression problem: given the current sensor data and optionally the operating history, predict the time (in hours, cycles, or operational units) until the component reaches a failure threshold. The output is a scalar โ estimated hours until failure โ or a probability distribution over failure time.
Two fundamental approaches:
Physics-based models: Derive RUL from a physical degradation equation. For bearing fatigue, the L10 life calculation from ISO 281 estimates bearing life based on load and speed. Paris’ law models crack growth rate under cyclic loading. These models require accurate knowledge of operating conditions and material properties but provide interpretable, transferable predictions.
Data-driven models: Learn the relationship between sensor observations and RUL from historical run-to-failure data. No physics equations needed โ the model learns the degradation pattern empirically. This approach requires sufficient historical failure data, and the model is only valid for operating conditions represented in the training data.
In practice, the most robust systems combine both: physics-based models provide the structural form of the degradation curve, and data-driven methods calibrate the model parameters from sensor observations.
Condition-Based vs Predictive#
Condition-based maintenance (CBM) and predictive maintenance (PdM) are related but distinct:
- Condition-based: Monitor a health indicator (vibration RMS, bearing temperature, oil particulate count). When the indicator crosses a threshold, schedule maintenance. This is reactive to the current condition โ it tells that the component is degraded but not how quickly it is degrading or when it will fail.
- Predictive: Forecast when the health indicator will cross the threshold based on its trajectory. This provides lead time: “At the current degradation rate, bearing vibration will exceed the ISO 10816 alarm threshold in approximately 3 weeks.” Lead time enables scheduling maintenance during the next planned shutdown rather than triggering an immediate unplanned stop.
The additional value of prediction over condition monitoring is the lead time โ and the confidence interval around that estimate. A narrow confidence interval (“failure in 15โ20 days”) enables precise scheduling. A wide interval (“failure in 5โ60 days”) provides less scheduling value than simple condition monitoring.
Degradation Modeling#
Degradation modeling tracks a health indicator over the operational lifetime and fits a model to the degradation trajectory. The health indicator should be:
- Monotonically trending (at least on average) as degradation progresses
- Measurable from the available sensors
- Sensitive to the failure mode of interest
- Distinguishable from operating condition variation (load, speed, temperature effects)
Common degradation curve shapes:
Linear degradation: Health indicator changes at a constant rate. RUL estimation is straightforward extrapolation: RUL = (threshold - current_value) / rate. Appropriate for steady wear mechanisms like abrasive wear, corrosion, or slow thermal degradation.
Exponential degradation: Degradation accelerates as the component approaches failure. Common in fatigue failure (crack growth accelerates as the crack lengthens) and bearing degradation (increased vibration creates more damage, which increases vibration further). The health indicator follows:
H(t) = H0 * exp(beta * t)where H0 is the initial health indicator value and beta is the degradation rate. RUL estimation requires estimating beta from the observed trajectory, which improves as more data points accumulate.
Two-phase degradation: Normal operation with minimal change, followed by accelerating degradation after a transition point. Bearings often exhibit this pattern โ vibration RMS is stable for months, then increases rapidly over days to weeks as a spall develops. Detecting the transition from phase 1 to phase 2 is itself a valuable signal, as it indicates the component has entered its end-of-life region.
Health Indicators from Sensors#
Vibration#
Vibration monitoring is the most widely used and most mature technology for rotating machinery predictive maintenance. The key health indicators:
| Indicator | What It Measures | Sensitivity |
|---|---|---|
| RMS acceleration | Overall vibration energy | General health โ increases with any defect |
| Peak acceleration | Maximum vibration amplitude | Impulsive events โ bearing defects, gear tooth damage |
| Crest factor | Peak / RMS ratio | Impulsiveness โ high for early bearing faults, decreases as fault spreads |
| Kurtosis | Tailedness of amplitude distribution | Early fault detection โ more sensitive than RMS for incipient defects |
| FFT spectral bands | Energy at specific frequencies | Fault-type identification from characteristic frequencies |
| Envelope spectrum | Amplitude demodulation of high-frequency resonance | Bearing defect frequency extraction โ gold standard for bearing diagnostics |
Bearing defect frequencies are calculable from bearing geometry (number of rolling elements, pitch diameter, contact angle) and shaft speed:
- BPFO (Ball Pass Frequency Outer race): defects on outer race โ most common failure mode
- BPFI (Ball Pass Frequency Inner race): defects on inner race
- BSF (Ball Spin Frequency): defects on rolling elements
- FTF (Fundamental Train Frequency): cage defects
These frequencies are available from bearing manufacturer datasheets or calculable from dimensions. An increase in spectral energy at BPFO, for example, specifically indicates an outer race defect.
For MCU-based vibration monitoring, an ADXL356 (ยฑ40 g, 1.5 kHz bandwidth) or ADXL1002 (ยฑ50 g, 11 kHz bandwidth) provides the raw data. The MCU samples via SPI or ADC at 12.8โ25.6 kHz, computes a 1024 or 2048-point FFT using CMSIS-DSP’s arm_rfft_fast_f32 or arm_rfft_q15, extracts spectral features, and computes health indicators. The STM32F4 series (Cortex-M4F at 168 MHz) handles a 2048-point float32 FFT in approximately 1.5 ms.
Temperature#
Temperature monitoring provides a complementary view of component health:
- Absolute temperature: Bearing temperature above 80 C indicates insufficient lubrication or excessive load. Motor winding temperature above rated insulation class indicates insulation degradation risk.
- Rate of change: A sudden temperature rise (> 5 C/minute) indicates an acute failure event โ loss of lubrication, sudden increase in friction.
- Deviation from expected: At a given load and ambient temperature, a component has an expected operating temperature. Deviation from this expected value (thermal model residual) isolates degradation from operating condition variation.
Temperature sensors (thermocouples, RTDs, or infrared pyrometers) sample at low rates (0.1โ1 Hz). The health indicator is computed over windows of minutes to hours โ temperature degradation patterns evolve on the timescale of hours to days, much slower than vibration-based indicators.
Current Draw#
Motor current signature analysis (MCSA) extracts health information from the motor supply current:
- RMS current under load: Increasing RMS current at constant load indicates increased mechanical resistance (bearing wear, misalignment) or electrical degradation (winding short).
- Current spectrum: Broken rotor bars produce sidebands at (1 ยฑ 2s) ร line frequency, where s is the motor slip. Eccentricity produces sidebands at specific frequencies related to pole count and slip.
- Startup current profile: The transient current during motor startup follows a characteristic shape. Changes in this shape โ extended startup time, abnormal oscillation pattern โ indicate mechanical issues.
Current monitoring uses Hall-effect sensors (ACS712, ACS723) or current transformers on the motor supply line. Sampling at 1โ10 kHz captures the current waveform including harmonic content. An FFT of the current waveform reveals the spectral features for MCSA.
Acoustic Emission#
Ultrasonic acoustic emission sensors detect stress waves at frequencies above 20 kHz โ inaudible to human hearing but produced by crack growth, friction, and impact events inside mechanical components. Acoustic emission detects faults earlier than vibration monitoring because the ultrasonic stress waves precede the low-frequency vibration that standard accelerometers measure.
Acoustic emission monitoring requires specialized sensors (piezoelectric AE sensors with frequency response up to 100 kHz or higher) and high-speed sampling (100 kHzโ1 MHz). This pushes the data acquisition beyond typical MCU capabilities into FPGA or dedicated DSP territory.
Machine Learning Approaches to RUL#
Linear and Exponential Regression on Features#
The simplest ML approach: extract health indicators from sensor data at regular intervals, fit a linear or exponential regression to the trajectory, and extrapolate to the failure threshold. This requires minimal compute and fits on any MCU.
# Linear RUL estimation
degradation_rate = (H_current - H_initial) / (t_current - t_initial)
RUL = (H_threshold - H_current) / degradation_rateFor exponential degradation, a log-transform linearizes the problem:
log(H(t)) = log(H0) + beta * t
beta = slope of log(H) vs t regression
RUL = (log(H_threshold) - log(H_current)) / betaThese models require storing the regression coefficients and the health indicator history (or sufficient statistics for online regression). Total storage: less than 1 KB. Inference: a few arithmetic operations.
Neural RUL Models#
For SBC-class devices (Raspberry Pi, ESP32-S3 with PSRAM), neural network models operating on feature sequences provide more accurate RUL estimation than simple regression, particularly when the degradation pattern is nonlinear and varies across units.
1D-CNN for RUL: Input is a sequence of health indicator vectors โ for example, the last 50 time steps of a 20-feature health indicator vector, shaped (50, 20). A 1D-CNN processes the sequence to produce a scalar RUL estimate. Model size: 50โ200 KB. Inference: 5โ20 ms on Cortex-A53.
LSTM for RUL: An LSTM processes the health indicator sequence step by step, maintaining hidden state that captures the degradation trajectory. LSTM models are well-suited to RUL because the degradation history (not just the current state) is informative. Model size: 100โ500 KB. Inference: 10โ50 ms depending on sequence length.
The NASA C-MAPSS (Commercial Modular Aero-Propulsion System Simulation) turbofan dataset is the standard benchmark for neural RUL models. It contains run-to-failure trajectories of simulated turbofan engines with 21 sensor channels. Published neural RUL models on C-MAPSS achieve RMSE of 12โ15 cycles on the FD001 subset (single operating condition) and 20โ25 cycles on FD004 (six operating conditions) โ compared to 40+ cycles for simple linear regression.
Edge vs Cloud Split#
For fleet-level predictive maintenance systems, the computation is typically split between edge and cloud:
| Function | Location | Rationale |
|---|---|---|
| Sensor data acquisition | Edge (MCU) | Real-time, high-rate data collection |
| Feature extraction | Edge (MCU) | Reduces data volume by 100โ1000x |
| Anomaly detection | Edge (MCU/SBC) | Low-latency safety alerts, works offline |
| Health indicator trending | Edge (SBC) | Local visualization and short-term RUL |
| Neural RUL estimation | Cloud or SBC | More compute, richer model, cross-machine data |
| Fleet analytics | Cloud | Cross-machine correlation, aggregated statistics |
| Model retraining | Cloud | GPU resources for training |
The edge device publishes health indicators and anomaly scores to the cloud via MQTT. The cloud aggregates data from the fleet, runs more computationally intensive RUL models, and identifies fleet-wide patterns (e.g., all machines in building B are degrading faster than building A โ investigate environmental differences).
MQTT Alerting Integration#
MQTT is the standard messaging protocol for publishing health indicators and maintenance alerts from edge devices to a cloud or on-premises monitoring platform.
A typical MQTT topic structure for a predictive maintenance system:
plant/line1/machine3/bearing_DE/vibration_rms
plant/line1/machine3/bearing_DE/anomaly_score
plant/line1/machine3/bearing_DE/rul_estimate
plant/line1/machine3/bearing_DE/alertThe edge device publishes health indicators at regular intervals (every 1โ60 seconds depending on the degradation time scale) and publishes alerts when thresholds are crossed. An MQTT broker (Mosquitto, EMQX, HiveMQ) routes messages to subscribers โ a cloud dashboard (Grafana, Node-RED, custom web application), a rules engine for alert processing, and a time-series database (InfluxDB, TimescaleDB) for historical storage and trend analysis.
Alert messages include structured payloads:
{
"machine_id": "line1_machine3",
"component": "bearing_DE",
"timestamp_utc": "2026-02-28T14:30:00Z",
"severity": "warning",
"indicator": "vibration_rms",
"value": 4.2,
"threshold": 4.5,
"rul_hours": 312,
"message": "Vibration trending toward alarm threshold"
}Alert Design#
Effective alerting for predictive maintenance requires more nuance than a single binary threshold.
Multiple severity levels:
| Level | Condition | Action |
|---|---|---|
| Normal | All indicators within baseline | No action, log data |
| Watch | Indicator trending upward, not yet at warning threshold | Increase monitoring frequency |
| Warning | Indicator above warning threshold or RUL < 30 days | Schedule maintenance during next planned shutdown |
| Alarm | Indicator above alarm threshold or RUL < 7 days | Schedule urgent maintenance |
| Danger | Indicator above danger threshold or RUL < 24 hours | Immediate shutdown recommended |
Hysteresis: Prevent alert flapping when an indicator oscillates near a threshold. The warning activates at vibration RMS > 4.5 mm/s but deactivates only when RMS falls below 3.6 mm/s (80% of the activation threshold). This 20% dead-band prevents rapid on/off toggling from minor fluctuations.
Rate-of-change alerts: In addition to absolute thresholds, alert on the degradation rate. A vibration RMS of 3.0 mm/s (below the warning threshold) that increased from 2.0 mm/s in the last week warrants attention even though the absolute value is acceptable โ the trend predicts threshold crossing within 1โ2 weeks.
Alert suppression during known transients: Suppress alerts during machine startup (first 30โ60 seconds), shutdown, and planned maintenance activities. These transients produce indicator values that would trigger false alarms but are expected and benign. The suppression logic runs on the edge device or in the rules engine.
ISO 10816 Vibration Severity#
ISO 10816 provides vibration severity ranges for rotating machinery, widely used as initial alert thresholds:
| Group | Machine Type | Good (mm/s RMS) | Acceptable | Alert | Not Permissible |
|---|---|---|---|---|---|
| 1 | Small machines (< 15 kW) | < 0.71 | 0.71โ1.8 | 1.8โ4.5 | > 4.5 |
| 2 | Medium machines (15โ75 kW) | < 1.12 | 1.12โ2.8 | 2.8โ7.1 | > 7.1 |
| 3 | Large machines on rigid foundations | < 1.12 | 1.12โ2.8 | 2.8โ7.1 | > 7.1 |
| 4 | Large machines on flexible foundations | < 1.8 | 1.8โ4.5 | 4.5โ11.2 | > 11.2 |
These values apply to broadband velocity measured in mm/s RMS. They serve as reasonable starting thresholds for a new installation before machine-specific baselines are established from historical data.
Tips#
- Vibration RMS trending is the simplest and most effective starting point for rotating machinery predictive maintenance. A single accelerometer, an FFT for basic spectral features, and a linear trend extrapolation provide substantial value before investing in neural models. Many industrial predictive maintenance programs operate successfully on this foundation alone.
- Sample the vibration sensor at 2.5x the highest frequency of interest. For bearing analysis on a machine with shaft speed up to 3600 RPM (60 Hz), bearing defect frequencies can reach 500 Hz with harmonics to 3 kHz โ a sampling rate of 8 kHz minimum, 12.8 kHz practical, 25.6 kHz generous.
- Calculate health indicators locally on the edge device and transmit only the indicators (not raw waveforms) over MQTT. A 25.6 kHz accelerometer produces 76.8 KB/s of raw data per axis. The corresponding health indicators (RMS, kurtosis, spectral band energies) are 10โ50 scalar values per measurement window โ a compression ratio of 1000x or more.
- Use ISO 10816 vibration severity ranges as initial alert thresholds for new installations. Machine-specific thresholds should replace these generic values after 3โ6 months of baseline data collection.
- Store health indicator history on the edge device (SD card or external flash) in addition to transmitting to the cloud. This provides data continuity during network outages and enables local trend visualization for on-site maintenance technicians.
Caveats#
- Run-to-failure data is rare and expensive to collect. Machines are typically maintained before failure (by design โ that is the point of maintenance), so failure events are uncommon in historical records. Simulated or accelerated life test data can supplement, but the degradation dynamics of accelerated tests may differ from field conditions. The NASA C-MAPSS dataset is widely used for research but does not represent real-world sensor noise, missing data, or variable operating conditions.
- RUL models trained on one machine type do not transfer to different machines. Even nominally identical machines exhibit different degradation patterns due to manufacturing variation, installation differences, load profiles, and environmental conditions. Transfer learning (pre-train on fleet data, fine-tune on target machine) reduces the data requirement for individual machines but does not eliminate it.
- Operating condition changes (load, speed, temperature) affect health indicators independently of degradation. A machine running at higher load has higher vibration RMS โ this is normal operating variation, not degradation. Health indicators must be normalized for operating conditions, or the RUL model must include operating conditions as input features. Without this normalization, the RUL estimate fluctuates with every load change.
- Predictive maintenance ROI requires enough historical failure data to validate predictions. Demonstrating that a system predicted a failure 3 weeks in advance (enabling a planned repair that avoided $50,000 in unplanned downtime) requires that the failure actually occurs (or would have occurred without intervention). This creates a validation paradox: successful predictive maintenance prevents the failures that would validate the predictions.
In Practice#
- Vibration RMS increases steadily over weeks, then jumps sharply. This is the classic bearing degradation pattern โ steady wear followed by accelerating failure as a spall develops and propagates. The sharp jump indicates the transition from gradual to rapid degradation. Maintenance should be scheduled immediately upon detecting the acceleration, as the time from acceleration to failure can be days rather than weeks.
- Health indicator oscillates daily without a long-term trend. Ambient temperature variation causes thermal expansion that shifts vibration signatures and current draw. Adding temperature as a model input or normalizing the health indicator by ambient temperature eliminates the oscillation. A common sanity check is to plot the health indicator against ambient temperature โ if the correlation is high, temperature compensation is needed.
- RUL estimate fluctuates wildly between consecutive measurements. The input features are not stable enough for reliable extrapolation. Increasing the averaging window for health indicator computation (from 1-minute windows to 10-minute windows) smooths the inputs and stabilizes the RUL estimate. Alternatively, the RUL model itself may be overfit to the training data โ a simpler model with fewer input features often produces more stable predictions.
- MQTT alerts fire repeatedly, then stop, then fire again. Alert hysteresis is not configured, and the health indicator is oscillating near the threshold boundary. Adding a dead-band (alert activates at threshold T, deactivates at 0.8T) prevents the toggling. If the indicator is genuinely fluctuating around the warning level, this itself is informative โ the component is approaching the maintenance threshold.
- RUL model predicts 500 hours, but component fails in 100 hours. The model was trained on gradual degradation data and the actual failure was a sudden event (impact damage, lubrication loss, contamination) that bypasses the gradual degradation pattern. Predictive maintenance based on degradation modeling inherently cannot predict sudden failures โ anomaly detection provides the complementary capability for detecting abrupt deviations from normal.