Unified Availability Model (UAM): A Normalization-Based Framework for Measuring Availability Across Heterogeneous Information Systems
ORCID: 0009-0002-7724-5762
26 November 2025
Original language of the article: English
Abstract
The Unified Availability Model (UAM) proposes a formal, extensible framework for evaluating the availability of heterogeneous information systems whose operational characteristics, performance indicators, and failure modes differ fundamentally across layers and functional domains. Unlike traditional SRE- and API-centric approaches that assume homogeneous metric spaces and rely primarily on latency–error abstractions, UAM introduces a generalizable normalization methodology capable of transforming diverse metric types into a unified, comparable scale.
The model is based on four foundational constructs — statistical baselines, SLA thresholds, physical system limits, and acceptable deviation ranges — each represented as a mathematically defined normalization operator. By combining these operators through a weighted linear composition, UAM enables consistent availability scoring for systems as diverse as REST/gRPC microservices, payment gateways, cryptographic HSM-signing modules, batch/ETL pipelines, ERP platforms (Oracle EBS, SAP), and EDI/IDoc integration channels.
At the higher level, UAM introduces the concept of a business-process contour — a structured composite of heterogeneous subsystems — and defines contour availability as a weighted aggregation of subsystem availability scores. This provides a unified and interpretable indicator of end-to-end business operability, compatible with real-world observability tools such as Prometheus, Zabbix, Grafana, and VictoriaMetrics.
The proposed model offers both theoretical novelty — by formalizing cross-system metric normalization — and practical applicability, providing engineering teams with a consistent methodology for measuring, comparing, and governing availability across complex IT landscapes. UAM constitutes a step toward a generalized mathematical foundation for reliability assessment in multi-layered enterprise systems.
This document represents the revised and extended Version 2.0 of the Unified Availability Model (UAM), incorporating hierarchical coherence models and a formal treatment of fuzzy logic and neural-network limitations.
Introduction
Modern IT landscapes consist of heterogeneous and multi-layered systems, including (but not limited to) the following categories:
API services and microservices,
payment gateways,
banking integrations,
cryptographic and HSM services,
batch processes and ETL pipelines,
ERP systems (such as SAP S/4HANA, Oracle EBS),
reporting and analytics systems.
Each of these categories has its own operating principles, workload patterns, and metric types. For this reason, no single universal availability metric can cover them all.
However, it is possible to create a unified methodological approach in which:
each system is evaluated using its own native metrics,
these metrics are normalized to a common scale,
the final availability score is calculated as a weighted sum.
This approach makes it possible to:
fairly compare different types of systems,
build an integrated availability score for an entire process contour,
unify reporting and monitoring,
easily extend the model to new classes of systems.
Key Definitions
System Availability
System availability \(A_i\) is a normalized score that represents the state of system \(i\) over a selected observation period. It takes a value from 0 to 1, or from 0% to 100%.
Metric
A metric \(M_{ij}\) is a quantitative indicator that describes the state of system \(i\) by parameter \(j\).
Normalization
Normalization is the transformation of a raw metric \(M_{ij}\) into a normalized value \(N_{ij} \in [0,1]\) according to the rules defined for the specific system category.
Metric Weight
\(w_{ij}\) is a weight coefficient that defines the contribution of metric \(j\) to the final availability of system \(i\), where: \[\sum_j w_{ij} = 1.\]
Contour Availability
Contour availability \(A_{\text{contour}}\) is a weighted sum of the availability values of the systems that form a business-process chain.
Unified Availability Model: General Structure
Each system \(S_i\) has a set of metrics: \[\{M_{i1}, M_{i2}, \dots, M_{ik}\}.\]
Each metric is normalized according to: \[N_{ij} = f_{norm}(M_{ij}),\] where \(N_{ij} \in [0,1]\).
Each normalized metric has an assigned weight \(w_{ij}\).
System Availability Formula
\[A_i = \sum_{j=1}^{k} w_{ij} \cdot N_{ij}.\]
The final value of \(A_i\) is expressed either in the range \([0,1]\) or as a percentage.
Contour Availability Formula
Let the contour consist of \(n\) systems. Then: \[A_{\text{contour}} = \sum_{i=1}^{n} W_i \cdot A_i,\] where \(W_i\) is the weight of system \(i\) in the contour, and \(\sum_i W_i = 1\).
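Both formulas reduce to weighted dot products and are straightforward to implement. The following minimal Python sketch computes \(A_i\) and \(A_{\text{contour}}\); the metric names, scores, and weights are illustrative, not prescribed by the model:

from typing import Dict

def weighted_score(values: Dict[str, float], weights: Dict[str, float]) -> float:
    # Shared form of both UAM formulas: a sum of weight * normalized value.
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[k] * values[k] for k in weights)

# System availability A_i from normalized metrics N_ij (illustrative values).
A_api = weighted_score({"succ": 0.95, "lat": 0.88, "err": 0.90},
                       {"succ": 0.50, "lat": 0.30, "err": 0.20})

# Contour availability A_contour from system scores A_i (illustrative weights).
A_contour = weighted_score({"api": A_api, "erp": 0.91},
                           {"api": 0.60, "erp": 0.40})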
Metric Normalization
In the Unified Availability Model (UAM), each metric \(M_{ij}\) is normalized to a value \(N_{ij} \in [0,1]\). The normalization is based on four key concepts:
stable baseline,
SLA thresholds,
physical limits,
acceptable deviations.
Below are the formal definitions and normalization rules that describe how raw metrics are transformed into normalized values.
Stable Baseline
Definition. A stable baseline for metric \(M\) is a statistically stable characteristic that reflects the typical system behavior under normal conditions, without anomalies.
The baseline shows the “normal” operation mode and is used as a reference point for detecting degradation. It is derived from historical data.
Normalization rules.
Let \(B\) be the baseline, calculated using one of the statistical methods:
median: \(B = \mathrm{median}(M_{\mathrm{hist}})\),
75th percentile: \(B = \mathrm{quantile}_{0.75}(M_{\mathrm{hist}})\),
exponential smoothing: \(B = \mathrm{EWMA}(M_{\mathrm{hist}})\),
seasonal baselines (by hour, weekday, etc.).
Let \(k\) be a tolerance coefficient (usually \(1.5\)–\(2\)). Then:
\[\begin{equation} N_{\text{baseline}}(M) = \begin{cases} 1, & M \le kB, \\[6pt] \frac{kB}{M}, & M > kB. \end{cases} \end{equation}\]
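As a minimal sketch, the baseline operator translates directly into code; the default tolerance coefficient k = 1.8 below is an assumption within the suggested 1.5–2 range:

def n_baseline(m: float, b: float, k: float = 1.8) -> float:
    # Returns 1 while the metric stays within k times the baseline,
    # then decays hyperbolically as k*B / M for degraded values.
    return 1.0 if m <= k * b else (k * b) / m

For example, with B = 85 ms and k = 1.8, a measured latency of 300 ms yields N = 153/300 = 0.51.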
SLA Thresholds
Definition. An SLA is a target value that defines the boundary between “acceptable” and “unacceptable” service quality.
Normalization rules.
Formally, SLA is expressed as:
\[M \le SLA \quad \text{(for latency)}, \qquad M \ge SLA \quad \text{(for success rate)}.\]
\[\begin{equation} N_{\text{SLA}}(M) = \begin{cases} 1, & M \text{ does not violate SLA}, \\[6pt] 1 - \alpha (M - SLA), & M \text{ violates SLA}, \end{cases} \end{equation}\]
where \(\alpha\) is a penalty coefficient; in practice the result is clamped to \([0,1]\) so that \(N_{\text{SLA}}\) remains a valid normalized value.
Physical Limits
Definition. Physical limits are hardware or architectural constraints that do not depend on SLA.
Typical examples include:
network bandwidth,
maximum number of connections in a pool,
CPU or IO limits,
architectural RPS ceiling.
Normalization rules.
Let the physical range be:
\[M_{\min}^{\mathrm{phys}} \le M \le M_{\max}^{\mathrm{phys}}.\]
Then:
\[\begin{equation} N_{\text{phys}}(M) = \frac{M_{\max}^{\mathrm{phys}} - M} {M_{\max}^{\mathrm{phys}} - M_{\min}^{\mathrm{phys}}}. \end{equation}\]
This form assumes that larger values of \(M\) indicate degradation; for "higher is better" metrics the complementary ratio is used.
Acceptable Deviations
Definition. Acceptable deviations represent a range of values where small metric fluctuations are not considered degradation.
\[D_{\min} \le M \le D_{\max}.\]
Normalization rules.
\[\begin{equation} N_{\text{dev}}(M) = \begin{cases} 1, & D_{\min} \le M \le D_{\max}, \\[6pt] 1 - \beta (M - D_{\max}), & M > D_{\max}, \\[6pt] 1 - \beta (D_{\min} - M), & M < D_{\min}, \end{cases} \end{equation}\]
where \(\beta\) is the penalty coefficient; as with the SLA operator, the result is clamped to \([0,1]\).
Final Normalization
For each metric, the final normalization is a weighted composition of the approaches:
\[\begin{equation} N(M) = w_b N_{\text{baseline}}(M) + w_s N_{\text{SLA}}(M) + w_p N_{\text{phys}}(M) + w_d N_{\text{dev}}(M), \end{equation}\]
where \(w_b + w_s + w_p + w_d = 1\).
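A compact Python sketch of the remaining three operators and the weighted composition follows. The penalty coefficients and the equal weights are illustrative assumptions, and results are clamped to [0, 1] as the definitions require:

def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))

def n_sla(m: float, sla: float, alpha: float) -> float:
    # SLA operator for a "lower is better" metric.
    return 1.0 if m <= sla else clamp01(1.0 - alpha * (m - sla))

def n_phys(m: float, m_min: float, m_max: float) -> float:
    # Physical-limit operator; larger values indicate degradation.
    return clamp01((m_max - m) / (m_max - m_min))

def n_dev(m: float, d_min: float, d_max: float, beta: float) -> float:
    # Acceptable-deviation operator with linear penalties on both sides.
    if d_min <= m <= d_max:
        return 1.0
    gap = m - d_max if m > d_max else d_min - m
    return clamp01(1.0 - beta * gap)

def n_final(m: float, *, b: float, sla: float, m_min: float, m_max: float,
            d_min: float, d_max: float, k: float = 1.8,
            alpha: float = 0.01, beta: float = 0.01,
            w: tuple = (0.25, 0.25, 0.25, 0.25)) -> float:
    # Weighted composition N(M); the equal weights here are purely illustrative.
    n_b = 1.0 if m <= k * b else (k * b) / m
    w_b, w_s, w_p, w_d = w
    return (w_b * n_b + w_s * n_sla(m, sla, alpha)
            + w_p * n_phys(m, m_min, m_max) + w_d * n_dev(m, d_min, d_max, beta))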
Typical Normalization Strategies
Linear Normalization
\[N = 1 - \frac{M - M_{\min}}{M_{\max} - M_{\min}}.\]
SLA-Based Normalization
\[N = \begin{cases} 1, & M \leq SLA, \\ 1 - k (M - SLA), & M > SLA. \end{cases}\]
Normalization with a Saturating Function
\[N = e^{-\alpha M}.\]
Example: Latency Normalization
As an example, consider the normalization of a latency metric. \[N_{\text{latency}} = \begin{cases} 1, & L \leq L_{\text{baseline}},\\ \frac{L_{\text{max}} - L}{L_{\text{max}} - L_{\text{baseline}}}, & L > L_{\text{baseline}}. \end{cases}\]
Examples of Metric Normalization for Different System Types
The following examples show how normalization can be applied to four important classes of systems: API services, cryptographic services, batch processes, and ERP platforms such as Oracle EBS. These examples demonstrate how the concepts of baseline, SLA, physical limits, and acceptable deviations are used for real observability metrics.
API Services and Payment Gateways
For API services, the main metrics include:
\(M_1\): latency \(p99\),
\(M_2\): success rate,
\(M_3\): share of 5xx errors,
\(M_4\): queue depth or backlog.
Baseline example: \[B_{latency} = \mathrm{median}(L_{\mathrm{hist}}) = 85\ \mathrm{ms}.\]
SLA example: \[\mathrm{latency}_{p99} \le 300\ \mathrm{ms}.\]
Physical limit: \[L_{\max}^{\mathrm{phys}} = 2000\ \mathrm{ms} \quad (\text{e.g., the Nginx proxy timeout}).\]
Acceptable deviation: \[D_{\max} = 2.0 \cdot B_{latency}.\]
Latency normalization: \[N_{\text{API-lat}} = \begin{cases} 1, & L \le 2B_{latency},\\ \frac{2B_{latency}}{L}, & L > 2B_{latency}. \end{cases}\]
Cryptographic Services (HSM, Signing)
Key metrics:
\(M_1\): HSM availability (online/offline),
\(M_2\): document signing time,
\(M_3\): signing error rate,
\(M_4\): key slot availability.
Baseline example: \[B_{\mathrm{sign}} = \mathrm{median}(T_{\mathrm{sign}}) = 42\ \mathrm{ms}.\]
SLA: \[T_{\mathrm{sign}} \le 150\ \mathrm{ms}.\]
Physical limits: \[T_{\min}^{\mathrm{phys}} = 15\ \mathrm{ms}, \quad T_{\max}^{\mathrm{phys}} = 500\ \mathrm{ms}.\]
Acceptable deviation: \[D_{\max} = 1.5 \cdot B_{\mathrm{sign}}.\]
Normalization: \[N_{\mathrm{sign}} = \begin{cases} 1, & T \le D_{\max},\\[6pt] 1 - 0.01 (T - D_{\max}), & T > D_{\max}. \end{cases}\]
Batch Processes and Reporting
Key metrics:
\(M_1\): job execution time,
\(M_2\): completion within the time window,
\(M_3\): execution errors,
\(M_4\): backlog or queue size.
Baseline: \[B_{\mathrm{job}} = \mathrm{median}(t_{\mathrm{job, hist}}) = 17\ \mathrm{min}.\]
SLA: \[t_{\mathrm{job}} \le 30\ \mathrm{min}.\]
Physical limit: \[t_{\max}^{\mathrm{phys}} = 120\ \mathrm{min} \quad (\text{batch window overflow}).\]
Acceptable deviation: \[D_{\max} = 1.2 \cdot B_{\mathrm{job}}.\]
Normalization: \[N_{\mathrm{batch}} = \begin{cases} 1, & t \le D_{\max},\\[6pt] \frac{D_{\max}}{t}, & t > D_{\max}. \end{cases}\]
ERP / SAP S/4HANA
Key metrics:
\(M_1\): SAP Dialog Response Time (DB Time, CPU Time, Wait Time),
\(M_2\): Work Process Utilization (DIA, BTC, UPD, SPO),
\(M_3\): Queue Length (SM50/SM66),
\(M_4\): HANA DB latency and lock waits.
Baseline: \[B_{\mathrm{SAP}} = \mathrm{median}(T_{\mathrm{dialog}}) = 280\ \mathrm{ms}.\]
SAP SLA: \[T_{\mathrm{dialog}} \le 1000\ \mathrm{ms}.\]
Physical limits: \[T_{\min}^{\mathrm{phys}} = 50\ \mathrm{ms}, \qquad T_{\max}^{\mathrm{phys}} = 5000\ \mathrm{ms}.\]
Acceptable deviation: \[D_{\max} = 2.0 \cdot B_{\mathrm{SAP}}.\]
Normalization: \[N_{\mathrm{SAP}} = \begin{cases} 1, & T \le D_{\max}, \\[6pt] 1 - 0.0005\,(T - D_{\max}), & T > D_{\max}. \end{cases}\]
ERP / Oracle EBS
Key metrics:
\(M_1\): database wait events (TX, TM, enq:...),
\(M_2\): active Concurrent Managers,
\(M_3\): workflow lag depth,
\(M_4\): session pool usage.
Baseline: \[B_{\mathrm{TX}} = \mathrm{median}(\mathrm{wait\_TX}) = 8\ \mathrm{ms}.\]
SLA: \[\mathrm{wait\_TX} \le 40\ \mathrm{ms}.\]
Physical limit: \[\mathrm{wait\_TX}^{\max} = 500\ \mathrm{ms}.\]
Acceptable deviation: \[D_{\max} = 3 \cdot B_{\mathrm{TX}}.\]
Normalization: \[N_{\mathrm{EBS}} = \begin{cases} 1, & W_{\mathrm{TX}} \le D_{\max}, \\[6pt] 1 - 0.005(W_{\mathrm{TX}} - D_{\max}), & W_{\mathrm{TX}} > D_{\max}. \end{cases}\]
SAP IDoc / EDI Integrations
Key metrics:
\(M_1\): IDoc processing time (end-to-end latency),
\(M_2\): share of IDoc errors (status 51/68),
\(M_3\): inbound/outbound queue depth,
\(M_4\): lock conflicts during processing.
Baseline: \[B_{\mathrm{IDoc}} = \mathrm{median}(T_{\mathrm{idoc}}) = 1.8\ \mathrm{s}.\]
SLA: \[T_{\mathrm{idoc}} \le 5\ \mathrm{s}.\]
Physical limits: \[T_{\min}^{\mathrm{phys}} = 0.5\ \mathrm{s}, \qquad T_{\max}^{\mathrm{phys}} = 30\ \mathrm{s}.\]
Acceptable deviation: \[D_{\max} = 2.0 \cdot B_{\mathrm{IDoc}}.\]
Normalization: \[N_{\mathrm{IDoc}} = \begin{cases} 1, & T \le D_{\max}, \\[6pt] 1 - 0.02\,(T - D_{\max}), & T > D_{\max}. \end{cases}\]
Integration with Elasticsearch / OpenSearch
Elasticsearch and OpenSearch can serve as complementary data sources for UAM because many systems produce structured operational logs rather than metrics. UAM relies on normalized numerical inputs, and search engines can supply these inputs in the form of aggregated event counts, latency distributions, or error classifications extracted from logs.
Why Elasticsearch/OpenSearch are useful for UAM
Structured log fields allow deriving metrics not exposed by Prometheus (e.g., workflow errors, business failures, SAP IDoc status).
High-volume indexing supports massive enterprise landscapes.
Powerful aggregations (terms, date histogram, percentiles) allow reproducing UAM baselines and SLA distributions.
Long-term storage is ideal for multi-month baseline calculation.
Example: Extracting error ratios
A Kibana/OpenSearch DSL query can compute error counts:
GET logs-*/_search
{
"size": 0,
"query": {
"match": { "level": "ERROR" }
},
"aggs": {
"errors": { "value_count": { "field": "message" } }
}
}
Converted into a normalized metric:
\[N_{\text{err}} = \mathrm{clamp}\left(1 - \frac{\text{errors}}{\text{threshold}}, 0, 1\right)\]
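In code, this conversion might look like the following sketch, assuming a locally reachable, unauthenticated cluster and a hypothetical tolerable error count of 500 per window; the query body is the one shown above:

import requests

ES_URL = "http://localhost:9200"   # assumption: local, unauthenticated cluster
query = {
    "size": 0,
    "query": {"match": {"level": "ERROR"}},
    "aggs": {"errors": {"value_count": {"field": "message"}}},
}
resp = requests.post(f"{ES_URL}/logs-*/_search", json=query, timeout=10)
errors = resp.json()["aggregations"]["errors"]["value"]

THRESHOLD = 500                     # hypothetical tolerable error count
n_err = max(0.0, min(1.0, 1.0 - errors / THRESHOLD))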
Example: Log-derived latency distribution
"aggs": {
"latency_p95": {
"percentiles": { "field": "latency_ms", "percents": [95] }
}
}
This can feed into UAM latency normalization: \[N_{\text{lat}} = \frac{L_{\max} - L}{L_{\max} - B}\]
Baseline computation from logs
Elasticsearch/OpenSearch aggregations facilitate seasonal and multi-week baselines:
"aggs": {
"daily": {
"date_histogram": { "field": "@timestamp", "interval": "1d" },
"aggs": {
"median_latency": { "percentiles": { "field": "latency_ms",
"percents": [50] } }
}
}
}
Exporting metrics to UAM
Three export mechanisms are typical:
Elastic → Prometheus exporter (export aggregate values as Prometheus metrics)
Elastic → VictoriaMetrics via push (send normalized metrics directly)
Prometheus → Elastic agent (Elastic ingests Prometheus metrics + UAM metrics)
Once converted to metrics, Elasticsearch/OpenSearch derived values participate in the standard UAM formulas: \[A_i = \sum_j w_{ij} N_{ij}\] and \[A_{\text{contour}} = \sum_i W_i A_i.\]
Integration with Kafka and RabbitMQ
Messaging platforms such as Kafka and RabbitMQ play a critical role in enterprise contours. Many subsystems (SAP, banking, CRM, payment gateways) depend on reliable message flow. UAM can directly incorporate queue metrics, lag metrics, and throughput metrics from these brokers.
Why messaging systems matter for UAM
Message delays directly affect end-to-end contour availability.
Queue backlogs indicate partial failures or slow consumers.
Broker-level errors (rebalance storms, partition unavailability, consumer lag) become visible in normalized metrics.
Kafka/RabbitMQ provide precise throughput/latency data.
Kafka
Kafka exposes rich metrics via:
JMX exporters for Prometheus,
Client-side metrics (producer/consumer),
Broker/partition metrics.
Key Kafka metrics for UAM
Consumer Lag — \(M_1\)
Message Throughput — \(M_2\)
Rebalance Events — \(M_3\)
Failed Produce/Consume Attempts — \(M_4\)
Example normalization:
\[N_{\text{lag}} = \frac{D_{\max} - \mathrm{lag}}{D_{\max}}\]
\[N_{\text{fail}} = 1 - \frac{\mathrm{failures}}{\mathrm{threshold}}\]
Kafka-derived availability
\[A_{\text{kafka}} = 0.40 N_{\text{lag}} + 0.30 N_{\text{throughput}} + 0.20 N_{\text{fail}} + 0.10 N_{\text{rebalance}}\]
This integrates seamlessly with contour-level availability.
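As a sketch, assuming the lag, throughput, failure, and rebalance figures have already been scraped (for example, via a JMX exporter), and with purely illustrative tolerance limits:

def kafka_availability(lag: float, n_throughput: float,
                       failures: float, n_rebalance: float,
                       lag_max: float = 10_000.0,
                       fail_threshold: float = 100.0) -> float:
    # Weights follow the A_kafka formula above; lag_max and
    # fail_threshold are hypothetical operational tolerances.
    n_lag = max(0.0, (lag_max - lag) / lag_max)
    n_fail = max(0.0, 1.0 - failures / fail_threshold)
    return (0.40 * n_lag + 0.30 * n_throughput
            + 0.20 * n_fail + 0.10 * n_rebalance)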
RabbitMQ
RabbitMQ provides AMQP queue and channel metrics via Prometheus exporters or the built-in management API.
Key RabbitMQ metrics for UAM
Queue Depth — \(M_1\)
Message Rate In/Out — \(M_2\)
Unacked Messages — \(M_3\)
Connection Errors — \(M_4\)
Example normalization:
\[N_{\text{queue}} = \mathrm{clamp}\left( 1 - \frac{\mathrm{queue\ depth}}{D_{\max}}, 0, 1 \right)\]
\[N_{\text{unacked}} = \frac{U_{\max} - U}{U_{\max}}\]
RabbitMQ availability:
\[A_{\text{rabbit}} = 0.35 N_{\text{queue}} + 0.25 N_{\text{rate}} + 0.25 N_{\text{unacked}} + 0.15 N_{\text{conn}}\]
Message broker availability becomes one of the \(A_i\) values in:
\[A_{\text{contour}} = \sum_i W_i A_i.\]
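Queue depth and unacknowledged counts can be read straight from the RabbitMQ management API. A sketch, assuming the default local endpoint, a hypothetical queue named payments on the default vhost, and illustrative tolerances D_MAX and U_MAX:

import requests

# %2F is the URL-encoded default vhost "/"; credentials are assumptions.
resp = requests.get("http://localhost:15672/api/queues/%2F/payments",
                    auth=("guest", "guest"), timeout=10)
q = resp.json()
depth = q["messages"]                      # ready + unacknowledged
unacked = q["messages_unacknowledged"]

D_MAX, U_MAX = 5000.0, 500.0               # hypothetical tolerances
n_queue = max(0.0, min(1.0, 1.0 - depth / D_MAX))
n_unacked = max(0.0, (U_MAX - unacked) / U_MAX)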
Use cases in enterprise contours
SAP IDoc → Kafka → microservices → ERP posting.
Payment gateway → RabbitMQ → accounting.
Workflow engines using Kafka for orchestration.
In all such cases, broker degradation directly impacts the business contour, which UAM captures via normalized queue, lag, and error metrics.
Comparison Table of Normalization Approaches
| Parameter | Baseline | SLA | Physical Limits / Deviations |
|---|---|---|---|
| Source of value | Historical data | Contract / obligation | Architectural / hardware limits |
| Control type | Detecting degradation | Binary check of “normal” | Stability and resource limit control |
| Orientation | System behavior | Client expectations | Physical and operational limits |
| API example | p50/p75 latency baseline | latency\(_{p99} \le 300\) ms | Nginx timeout = 2000 ms |
| Batch example | median job time | job \(\le\) SLA window | max job = 120 min |
| EBS example | median TX wait | TX wait \(\le\) SLA | session pool = [0, 300] |
| Role in UAM | Basis for degradation detection | Strict penalties for violations | Soft penalty zone before failure |
System Types and Recommended Metrics
The tables below present recommended metrics and weight coefficients for different categories of information systems. These values are templates and may be adjusted for a specific IT landscape.
API Services and Payment Gateways
| Metric | Symbol | Weight \(w_j\) |
|---|---|---|
| Success Rate | \(N_{\text{succ}}\) | 0.40 |
| Latency \(p99\) | \(N_{\text{lat}}\) | 0.30 |
| 5xx Error Ratio | \(N_{\text{err}}\) | 0.20 |
| Queue Length / Backlog | \(N_{\text{queue}}\) | 0.10 |
\[A_{\text{API}} = 0.4 N_{\text{succ}} + 0.3 N_{\text{lat}} + 0.2 N_{\text{err}} + 0.1 N_{\text{queue}}.\]
Banking APIs
| Metric | Symbol | Weight |
|---|---|---|
| Success Rate | \(N_{\text{succ}}\) | 0.50 |
| Timeout Ratio | \(N_{\text{timeout}}\) | 0.20 |
| Connection Failures | \(N_{\text{conn}}\) | 0.20 |
| Backlog / Queue Size | \(N_{\text{queue}}\) | 0.10 |
Cryptographic Services
| Metric | Symbol | Weight |
|---|---|---|
| HSM Online State | \(N_{\text{hsm}}\) | 0.40 |
| Success Rate | \(N_{\text{succ}}\) | 0.30 |
| Signing Time | \(N_{\text{sign}}\) | 0.20 |
| Key Slot Availability | \(N_{\text{slot}}\) | 0.10 |
Batch Processes
| Metric | Symbol | Weight |
|---|---|---|
| Completed Within Processing Window | \(N_{\text{deadline}}\) | 0.60 |
| Job Errors | \(N_{\text{error}}\) | 0.20 |
| Backlog / Queue Size | \(N_{\text{queue}}\) | 0.20 |
ERP / SAP S/4HANA
| Metric | Symbol | Weight |
|---|---|---|
| Dialog Response Time | \(N_{\text{dialog}}\) | 0.40 |
| Work Process Utilization | \(N_{\text{wp}}\) | 0.25 |
| Queue Length (SM50/SM66) | \(N_{\text{queue}}\) | 0.20 |
| HANA DB Latency / Locks | \(N_{\text{hana}}\) | 0.15 |
ERP / Oracle E-Business Suite
| Metric | Symbol | Weight |
|---|---|---|
| Active Concurrent Managers | \(N_{\text{cm}}\) | 0.35 |
| Database Wait Events | \(N_{\text{dbwait}}\) | 0.25 |
| Session Pool Usage | \(N_{\text{sess}}\) | 0.20 |
| Workflow Lag | \(N_{\text{wf}}\) | 0.20 |
SAP IDoc / EDI Integrations
| Metric | Symbol | Weight |
|---|---|---|
| IDoc Processing Time (latency) | \(N_{\text{idoc}}\) | 0.45 |
| IDoc Error Ratio (status 51/68) | \(N_{\text{err}}\) | 0.30 |
| IDoc Queue Size | \(N_{\text{queue}}\) | 0.15 |
| Lock Conflicts / DB Blocking | \(N_{\text{lock}}\) | 0.10 |
Elasticsearch / OpenSearch (Log-Derived Metrics)
| Metric | Symbol | Weight |
|---|---|---|
| Log-Derived Error Rate | \(N_{\text{err}}\) | 0.40 |
| Log-Derived Latency (p95/p99) | \(N_{\text{lat}}\) | 0.30 |
| Event Volume Consistency | \(N_{\text{vol}}\) | 0.20 |
| Indexing/Query Lag | \(N_{\text{lag}}\) | 0.10 |
Kafka Message Broker
| Metric | Symbol | Weight |
|---|---|---|
| Consumer Lag | \(N_{\text{lag}}\) | 0.40 |
| Message Throughput (in/out) | \(N_{\text{rate}}\) | 0.30 |
| Produce/Consume Failures | \(N_{\text{fail}}\) | 0.20 |
| Rebalance / Partition Stability | \(N_{\text{rebalance}}\) | 0.10 |
Contour Availability
Definition of a Contour
In the Unified Availability Model (UAM), a contour is defined as a connected set of information systems, services, and processes that together form a single business flow, in which a user or an external partner obtains a final business result.
A contour has the following properties:
Functional integrity — all components contribute to one business function;
Logical sequence — systems are called or accessed in a defined order;
Technical interdependence — the failure of one system reduces the availability of the entire contour;
Measurability — each component has a measurable availability value \(A_i \in [0,1]\).
Thus, a contour is not an infrastructure artifact in itself, but a business-oriented chain of dependencies.
Contour Composition
Each contour consists of three layers:
Front systems (front-line layer):
API endpoints, web methods, integration gateways;
mobile or desktop user interfaces;
external REST/gRPC services.
Mid-layer services (logical layer):
CRM/ERP modules;
banking APIs, payment processors, billing nodes;
cryptographic services (signing, encryption);
workflow engines and message queues.
Back-office processes (background layer):
batch jobs;
reporting pipelines;
scheduled calculations;
ERP background transactions.
Each component of the contour has its own availability \(A_i\), calculated according to UAM normalization rules.
Weight Coefficient of a System in a Contour
Every system has a contour weight \(W_i\) — the coefficient that reflects its impact on the final operability of the contour.
Weights are determined based on:
Component criticality — can the contour work without this system?
Temporal dependency — does the system participate in the online path or only in final calculations?
Usage frequency — how many user or system requests pass through it;
Position in the chain — closer to the user or deeper in the core;
Connectivity — how many other subsystems depend on it.
Formally: \[\sum_{i=1}^{n} W_i = 1, \qquad W_i \ge 0.\]
This ensures that \(A_{\text{contour}}\) can be interpreted as the availability of the entire business process.
Contour Availability Formula
\[A_{\text{contour}} = \sum_{i=1}^{n} W_i A_i,\]
where:
\(A_i\) — availability of a subsystem (normalized in UAM),
\(W_i\) — weight of the subsystem in the contour.
Example of a Contour Structure
Consider the payment processing contour:
Payment Gateway — 0.25
Banking API — 0.25
Document Signing (HSM) — 0.15
Batch/Reporting (statement generation) — 0.20
ERP (Oracle EBS / SAP) — 0.15
The contour availability is:
\[A_{\text{pay}} = 0.25 A_{\text{gateway}} + 0.25 A_{\text{bankAPI}} + 0.15 A_{\text{sign}} + 0.20 A_{\text{batch}} + 0.15 A_{\text{ERP}}.\]
Why Contour-Level Availability Is More Important
In real business processes:
a client does not care if ERP is at 99.99% if the banking API works at 80%;
the failure of even a “non-critical” component (like batch) affects the entire chain;
weight coefficients allow a fair distribution of influence among components;
\(A_{\text{contour}}\) provides management with a business-value indicator rather than a purely technical one.
Therefore, the contour is the main aggregation level in the Unified Availability Model.
End-to-End Example of Contour Availability Calculation
This section presents a synthetic but realistic example of applying the Unified Availability Model (UAM) to a business process that consists of three heterogeneous systems:
Web site (frontend + backend),
Bank–client API (REST banking gateway),
ERP system SAP S/4HANA (document processing and posting).
Such a contour is typical for scenarios where a user creates a document or payment on a web site, the web service then communicates with the banking API, and the final data is transferred into SAP for further processing.
Subsystem Metrics
Web Site
| Metric | Symbol | Weight |
|---|---|---|
| Success Rate | \(N_{\text{succ}}\) | 0.40 |
| Latency \(p95\) | \(N_{\text{lat}}\) | 0.30 |
| 5xx Errors | \(N_{\text{err}}\) | 0.20 |
| DB/Cache Backlog | \(N_{\text{backlog}}\) | 0.10 |
Observed values: \[\mathrm{succ}=98.2\%,\quad p95=420\,\mathrm{ms},\quad 5xx=1.8\%,\quad \mathrm{backlog}=250.\]
Baseline / SLA: \[B_{\text{lat}}=210\ \mathrm{ms},\quad SLA=350\ \mathrm{ms},\quad L_{\max}=2000\ \mathrm{ms},\quad D_{\max}=1.8\,B_{\text{lat}}=378\ \mathrm{ms}.\]
Normalization: \[N_{\text{succ}}=0.82,\qquad N_{\text{lat}}=\frac{378}{420}=0.90,\] \[N_{\text{err}}\approx 0.60,\qquad N_{\text{backlog}}=0.75.\]
Final web availability: \[A_{\text{web}} = 0.40\cdot0.82 + 0.30\cdot0.90 + 0.20\cdot0.60 + 0.10\cdot0.75 = 0.79.\]
Bank–Client API
| Metric | Symbol | Weight |
|---|---|---|
| Success Rate | \(N_{\text{succ}}\) | 0.50 |
| Timeout Ratio | \(N_{\text{timeout}}\) | 0.20 |
| Connection Failures | \(N_{\text{conn}}\) | 0.20 |
| Queue / Backpressure | \(N_{\text{queue}}\) | 0.10 |
Observed values: \[\mathrm{succ}=99.1\%,\quad \mathrm{timeout}=0.7\%,\quad \mathrm{connfail}=0.3\%,\quad \mathrm{queue}=120.\]
Baseline / SLA: \[B_{\text{succ}}=99.6\%,\quad SLA_{\text{succ}}=99\%,\quad SLA_{\text{timeout}}=1.0\%.\]
Normalization (short form): \[N_{\text{succ}} = 0.93,\qquad N_{\text{timeout}} = 1,\qquad N_{\text{conn}} = 0.92,\qquad N_{\text{queue}} = 0.85.\]
API availability: \[A_{\text{api}}= 0.5\cdot0.93 + 0.2\cdot1 + 0.2\cdot0.92 + 0.1\cdot0.85 = 0.93.\]
SAP S/4HANA ERP
| Metric | Symbol | Weight |
|---|---|---|
| Dialog Response Time Issues | \(N_{\text{dialog}}\) | 0.35 |
| HANA DB Wait Events | \(N_{\text{wait}}\) | 0.25 |
| Background Job Lag | \(N_{\text{lag}}\) | 0.20 |
| IDoc/Queue Processing Lag | \(N_{\text{queue}}\) | 0.20 |
Observed values: \[\mathrm{dialog}=1260\,\mathrm{ms},\quad \mathrm{wait}=45\,\mathrm{ms},\quad \mathrm{lag}=6\,\mathrm{min},\quad \mathrm{queue}=240.\]
Baseline / SLA: \[B_{\text{wait}}=14\,\mathrm{ms},\quad SLA_{\text{wait}}=50\,\mathrm{ms},\quad T_{\max}=300\,\mathrm{ms}.\]
Simplified normalization: \[N_{\text{dialog}}=0.78,\qquad N_{\text{wait}}=0.92,\qquad N_{\text{lag}}=0.88,\qquad N_{\text{queue}}=0.75.\]
SAP availability: \[A_{\text{sap}}= 0.35\cdot0.78 + 0.25\cdot0.92 + 0.20\cdot0.88 + 0.20\cdot0.75 = 0.83.\]
Contour Availability Calculation
System weights: \[W_{\text{web}}=0.30,\quad W_{\text{api}}=0.40,\quad W_{\text{sap}}=0.30.\]
Final contour availability: \[A_{\text{contour}} = 0.30\cdot0.79 + 0.40\cdot0.93 + 0.30\cdot0.83 = 0.858.\]
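The whole example can be reproduced in a few lines; the values are copied from the tables above, and the per-system scores are rounded to two decimals before aggregation, as in the text:

A_web = 0.40*0.82 + 0.30*0.90 + 0.20*0.60 + 0.10*0.75   # 0.793 -> 0.79
A_api = 0.50*0.93 + 0.20*1.00 + 0.20*0.92 + 0.10*0.85   # 0.934 -> 0.93
A_sap = 0.35*0.78 + 0.25*0.92 + 0.20*0.88 + 0.20*0.75   # 0.829 -> 0.83
A_contour = 0.30*0.79 + 0.40*0.93 + 0.30*0.83           # = 0.858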
Interpretation
The resulting value \[A_{\text{contour}} = 0.858\] indicates a noticeable degradation of the contour, mostly driven by the Web subsystem. The bank–client API is the most stable part of the contour, while SAP S/4HANA shows moderate risk due to elevated queue and job lag.
This example demonstrates the practical applicability of UAM for systems that cannot be combined under a single SLA or metric scale in traditional models.
Implementation in Monitoring Systems
The Unified Availability Model (UAM) can be implemented using standard observability platforms without introducing custom software. This section describes practical approaches for Prometheus, Zabbix, and Grafana.
Prometheus Implementation
Prometheus is the most natural environment for UAM, because normalization, weighting, and aggregation can be implemented directly as PromQL expressions.
Key mechanisms used in UAM
Recording rules for calculating normalized metrics (\(N_{ij}\)), system availability (\(A_i\)), and contour availability (\(A_{\text{contour}}\)).
PromQL functions:
clamp(x, min, max) — bounding normalized values in [0,1];
rate() and irate() — for latency/error trends;
scalar() — converting constants into PromQL expressions;
max(), min(), avg_over_time() — for baseline calculation.
Vector arithmetic for weighted sums.
Example: Baseline normalization rule
record: uam:web_latency_norm
expr: clamp((baseline_lat * 1.8) / web_latency_p95, 0, 1)
Example: System availability
record: uam:A_web
expr: 0.4 * uam:web_success_norm +
0.3 * uam:web_latency_norm +
0.2 * uam:web_errors_norm +
0.1 * uam:web_backlog_norm
Contour availability rule
record: uam:A_contour_payment
expr: 0.30 * uam:A_web +
0.40 * uam:A_api +
0.30 * uam:A_sap
This produces a real-time contour availability score directly in Prometheus, which can be queried by Grafana or Alertmanager.
Alerting
Warning: A_contour < 0.85
Major: A_contour < 0.70
Critical: A_contour < 0.50
Zabbix Implementation
Zabbix does not support vector math directly, but UAM can be implemented using:
Dependent items — raw metrics feed “parent items”; normalization formulas are applied in dependent items.
User-defined items — using Zabbix’s expression language to compute \(N_{ij}\) and \(A_i\).
Triggers for availability degradation
Warning: \(A_i < 0.8\)
High: \(A_i < 0.6\)
Disaster: \(A_i < 0.5\)
Example: Dependent item formula
norm_latency =
iif(last("latency_p95") < baseline * 1.8,
1,
(baseline * 1.8) / last("latency_p95")
)
Example: System availability item
A_web =
0.4 * last("norm_success") +
0.3 * last("norm_latency") +
0.2 * last("norm_errors") +
0.1 * last("norm_backlog")
Dashboards
Zabbix dashboards can render:
Availability trend of systems,
Weekly availability reports,
Top-3 degraded systems by UAM score.
VictoriaMetrics Implementation
VictoriaMetrics is fully compatible with UAM because it supports PromQL and recording rules. It is particularly suitable for large installations where high-cardinality metrics are collected from many heterogeneous systems.
Advantages for UAM
High ingestion rate — suitable for enterprise landscapes with thousands of normalized metrics.
Efficient storage — reduces cost of long-term UAM history (baselines benefit from long retention).
Downsampling — useful for weekly/monthly contour reports.
Full PromQL compatibility — all UAM formulas work without changes.
Baseline computation in VictoriaMetrics
Thanks to low storage cost, VictoriaMetrics allows long retention windows (e.g., 180–365 days), which improves baseline stability.
Example: rolling median baseline:
record: uam:web_latency_baseline
expr: quantile_over_time(0.5, web_latency_p95[30d])
Seasonal baselines (per hour of day). Standard PromQL cannot group a single expression by hour of day across many days; a common approximation is to compare against the same window one week earlier using offset:
record: uam:web_latency_weekly_baseline
expr: avg_over_time(web_latency_p95[1h] offset 7d)
Contour availability in VictoriaMetrics
Same as in Prometheus:
record: uam:A_contour_payment
expr: 0.30 * uam:A_web +
0.40 * uam:A_api +
0.30 * uam:A_sap
Alerting
VictoriaMetrics Alert (vmalert) supports the same rules as Alertmanager.
Typical UAM alerts:
Warning: A_contour < 0.85
Major: A_contour < 0.70
Critical: A_contour < 0.50
Dashboards
VictoriaMetrics integrates seamlessly with Grafana, so the visualization layer described in previous sections applies without modifications.
Loki and Log-based Normalization
Loki provides log aggregation and indexing, and although Loki does not support numeric vector math like PromQL, it can supply raw event counts that feed into UAM normalization.
Loki is especially useful for systems where:
latency is not instrumented as a metric,
errors exist only in logs,
business events (e.g., “payment posted”, “HSM slot error”) appear only in log streams.
Using Loki for UAM Metrics
Loki queries (LogQL) can be transformed into numeric metrics using:
count_over_time() — error/event rates;
rate() — sliding-window frequency;
sum by(...) — aggregation across pods/components;
label_replace() — mapping log fields to metric labels.
Example: extract 5xx-equivalent failures from logs:
sum(rate(({app="web"} |= "HTTP 5") [5m]))
Transform to normalized error ratio:
record: uam:web_errors_norm
expr: clamp(1 - error_rate / 0.02, 0, 1)
Extracting Business Events
Example: normalize “payment posted” success ratio from logs:
payment_success =
sum(rate(({job="sap"} |= "PAYMENT_POSTED") [5m]))
payment_failure =
sum(rate(({job="sap"} |= "PAYMENT_FAILED") [5m]))
record: uam:sap_payment_norm
expr: clamp(payment_success /
(payment_success + payment_failure), 0, 1)
Integrating Loki with Prometheus or VM
Loki metrics are usually exported via:
Promtail → Prometheus, or
Loki ruler → VictoriaMetrics.
These derived metrics then participate in the standard UAM formulas for:
\[A_i = \sum_j w_{ij} N_{ij}\]
and
\[A_{\text{contour}} = \sum_i W_i A_i.\]
Loki Dashboards in Grafana
Recommended UAM-oriented panels:
Error-rate log timeline (for \(N_{\text{err}}\)),
Business-event throughput,
Derived metric gauges,
Correlation panels (“error spikes → SAP delay → decreased \(A_{\text{sap}}\)”).
Loki adds value where metric-based observability is incomplete — it closes the gap between logs and normalized metric availability.
Grafana Visualization
Grafana provides the most expressive layer for UAM. It displays normalized metrics, system-level availability, and contour values.
Recommended Panels
Gauge panels for \(A_i\) and \(A_{\text{contour}}\).
Composite SLA panels — combining latency, errors, and success rate in one view.
Time-series panels for each normalized metric \(N_{ij}\).
Status blocks for top priority systems.
Four-level color logic
Green — \(A \ge 0.90\) (healthy)
Yellow — \(0.80 \le A < 0.90\) (minor degradation)
Orange — \(0.60 \le A < 0.80\) (major impact)
Red — \(A < 0.60\) (critical)
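A minimal mapping of these thresholds, useful for value mappings in Grafana panels or for report generators; the function name is an assumption:

def uam_status(a: float) -> str:
    # Four-level color logic defined above.
    if a >= 0.90:
        return "green"
    if a >= 0.80:
        return "yellow"
    if a >= 0.60:
        return "orange"
    return "red"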
Contour dashboard layout
Row 1: Contour availability (large gauge)
Row 2: Three system gauges (Web, API, SAP)
Row 3: Drill-down to \(N_{ij}\) metrics for each subsystem
Row 4: Raw metrics (latency p95, timeouts, wait events)
This structure gives both technical engineers and managers a clear picture of where the degradation originates.
Conclusion
The Unified Availability Model (UAM) provides a formal and extensible approach for calculating the availability of heterogeneous information systems. The methodology makes it possible to:
compare different classes of applications in a fair and consistent way;
build a unified availability indicator for an entire business-process contour;
integrate the model with Prometheus, Zabbix, Grafana, and other observability tools;
adapt the framework to any new system type without changing its core principles.
UAM is intended for practical use in high-load and business-critical IT landscapes, where systems differ significantly in architecture, performance characteristics, and operational behaviour.
Abstract of Appendix A
Appendix A reviews alternative approaches to aggregating metrics in heterogeneous IT systems. It explains why fuzzy logic is applicable only within individual levels (Infra, App, Biz), where boundaries of “normal” behaviour are inherently vague, and why its use between levels is unacceptable due to strict hierarchical dependency: an upper level cannot be more available than the level it depends on.
The appendix also justifies the need for dynamic reweighting, introduces soft and hard hierarchical coherence models, and briefly analyses why neural-network-based approaches are unsuitable due to retraining requirements and low interpretability.
This appendix provides the methodological foundation for UAM v2 and formalises the engineering logic behind a multi-layer availability model.
Appendix A. Fuzzy Logic and Neural Networks in Availability Assessment: Scope of Application and Limitations
Note. Appendix A is introduced in Version 2.0 and formalises the methodological justification for hierarchical coherence and fuzzy-logic constraints.
A.1. Why fuzzy logic is acceptable within levels (Infra / App / Biz)
A.1.1. Motivation
Low-level metrics often have vague or ambiguous definitions of “normal”. Typical examples include:
daytime and nighttime latency may differ by a factor of 2–3 while still being considered normal;
acceptable ping values may vary across operational teams;
backlog thresholds cannot be unified across workload types.
General problem: the boundaries between “good”, “normal”, and “bad” are interval-based rather than strict.
Fuzzy logic applies naturally when there are:
soft or shifting thresholds,
expert-driven assessments,
ambiguous operating regimes.
Thus, fuzzy logic resolves primarily an epistemic (human-interpretive) problem, not a mathematical one.
A.1.2. Benefits of fuzzy logic inside a level
Reduction of subjective debates. Ambiguous thresholds become explicit formal ranges.
Smooth response. Values change continuously rather than jumping.
Noise resistance. Small fluctuations do not trigger false alarms.
Better preprocessing. Fuzzy logic is applied before computing \(N_{ij}\).
Flexibility without loss of control. UAM remains strict; fuzzy logic is only a normalisation tool.
Thus, fuzzy logic is applicable strictly as an intra-level smoothing mechanism for Infra, App, and Biz metrics.
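As an illustration, a trapezoidal membership function for "normal latency" can serve as this intra-level preprocessing step before computing \(N_{ij}\); the bounds below are illustrative assumptions, not prescribed values:

def mu_normal_latency(l_ms: float, full_hi: float = 250.0,
                      zero: float = 600.0) -> float:
    # Membership in the fuzzy set "normal latency": 1 up to full_hi,
    # linear decay to 0 at zero; the output can be used directly as N_ij.
    if l_ms <= full_hi:
        return 1.0
    if l_ms >= zero:
        return 0.0
    return (zero - l_ms) / (zero - full_hi)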
A.2. Why fuzzy logic must not be used between levels (Infra \(\rightarrow\) App \(\rightarrow\) Biz)
A.2.1. Motivation for rejection
Level hierarchy is characterised by strict dependency:
infrastructure degradation inevitably degrades applications;
application unavailability blocks business operations;
an upper level cannot exceed the availability of the lower one.
Fundamental principle:
the coherence of an upper level is bounded by the coherence of the lower level.
Applying fuzzy logic between levels leads to:
blurred dependencies,
contradictory evaluations,
a risk of “healthy-looking Biz” while Infra is degraded,
loss of interpretability,
recursive instability in recalculations.
Fuzzy logic effectively makes the levels “too equal”, which is structurally incorrect for a dependency hierarchy.
A.2.2. Benefits of rejecting fuzzy logic between levels
Strict hierarchical constraints. A level cannot override the limitations of the level below.
Business-transparent semantics. Managers do not need to understand fuzzy-membership values.
Correct propagation of degradations. If Infra is red, App and Biz cannot be green.
Honest availability model. \(\min(\cdot)\) and multiplicative operators maintain realism (see the sketch after this list).
Compatibility with UAM and COE principles.
Elimination of recursive instability. Inter-level fuzzy logic causes uncontrolled renormalisation.
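A minimal sketch of the hard coherence constraint, with hypothetical raw level scores; the \(\min(\cdot)\) form guarantees that no level reports higher availability than the level it depends on:

A_infra = 0.72                                  # hypothetical raw level scores
A_app_raw, A_biz_raw = 0.95, 0.99

A_app = min(A_app_raw, A_infra)                 # hard coherence: 0.72
A_biz = min(A_biz_raw, A_app)                   # degradation propagates: 0.72
A_app_soft = A_app_raw * A_infra                # softer multiplicative variant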
A.3. Why neural networks are not used
Neural-network-based prediction and aggregation models were evaluated, but intentionally rejected.
Reasons:
changes in workload profiles require full retraining, which is unacceptable for a governance-level availability metric;
neural networks have low interpretability;
strict repeatability and auditability are required, which is impossible with stochastic weights;
neural networks optimise prediction, not SLA responsibility.
Thus, neural networks are unsuitable for executive-level availability metrics.
A.4. Conclusion
Fuzzy logic may be applied only for normalising individual metrics within levels (Infra, App, Biz) — and must not be used for aggregating levels.
This is an intentional engineering decision:
metrics have fuzzy boundaries → fuzzy is appropriate;
levels form a strict hierarchy → fuzzy is inappropriate.
Version History
| Version | Description |
|---|---|
| v1.0 (2025-11-25) | Initial release of the Unified Availability Model (UAM). Introduced the core framework: metric normalization, system-level availability, and contour-level aggregation. Included baseline/SLA normalization rules and examples for API, batch, ERP, cryptographic services, and SAP/IDoc integrations. |
| v2.0 (2025-11-26) | Revised and extended edition. Added the Hierarchical Coherence Model (HCM) for multi-level availability. Added Appendix A on fuzzy-logic applicability and neural-network limitations. Refined definitions, improved terminology, expanded examples, and added cross-system applicability notes. |