Wire Node-RED update events to structured update log file

2026-04-13 06:12:59 +10:00
parent 9f9cfaf4be
commit 232fdfbb36
4 changed files with 367 additions and 7 deletions
@@ -0,0 +1,160 @@
+# Node-RED update logging for Grafana
+
+This guide adds structured update-event logging to your existing Node-RED + Telegraf + Prometheus + Grafana stack without introducing Loki.
+
+## Goal
+
+Track and surface (in Grafana) the latest update attempts from Node-RED, including:
+
+- when an update attempt started,
+- target container/project,
+- success/failure,
+- optional failure reason (kept in the log file only; the Telegraf config in this guide does not export it as a metric),
+- elapsed duration.
+
+## 1) Add a reusable logger function in Node-RED
+
+Create a **Function** node named `Build update log event` with the following code:
+
+```javascript
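+// Timestamps: fall back to "now" so duration_ms is 0 when no start time was set.
+const nowIso = new Date().toISOString();
+const startedAt = msg.update_started_at || Date.now();
+const durationMs = Math.max(0, Date.now() - startedAt);
+
+// Guard against a missing payload or missing labels.
+const payload = msg.payload || {};
+const labels = payload.labels || {};
+
+// Normalize the status; "started" and "unknown" leave both flags at 0.
+const status = (msg.update_status || payload.status || "unknown").toString().toLowerCase();
+const success = status === "success" ? 1 : 0;
+const failed = status === "failed" ? 1 : 0;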
+
+msg.payload = {
+  ts: nowIso,
+  flow: "docker-updates",
+  event: msg.update_event || "attempt",
+  container: msg.container || labels.container || "unknown",
+  project: labels.com_docker_compose_project || msg.project || "unknown",
+  host: msg.host || "unknown",
+  status,
+  success,
+  failed,
+  duration_ms: durationMs,
+  code: Number.isFinite(Number(payload.code)) ? Number(payload.code) : 0,
+  error: (msg.update_error || payload.error || "").toString().slice(0, 300)
+};
+
+// one JSON line per event for file output
+msg.payload = JSON.stringify(msg.payload);
+return msg;
+```
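+
+To sanity-check the event shape outside Node-RED, the same logic can be wrapped as a plain function. The sketch below is only a test harness: the name `buildUpdateLogEvent`, the injectable `nowMs` clock, and the sample message values are assumptions made for illustration, not part of the flow.
+
+```javascript
+// Hypothetical harness: mirrors the Function node logic, but takes the clock
+// as a parameter so the output is reproducible under plain Node.js.
+function buildUpdateLogEvent(msg, nowMs = Date.now()) {
+  const startedAt = msg.update_started_at || nowMs;
+  const payload = msg.payload || {};
+  const labels = payload.labels || {};
+  const status = (msg.update_status || payload.status || "unknown").toString().toLowerCase();
+  const code = Number(payload.code);
+
+  // One JSON line per event, same keys as the Function node emits.
+  return JSON.stringify({
+    ts: new Date(nowMs).toISOString(),
+    flow: "docker-updates",
+    event: msg.update_event || "attempt",
+    container: msg.container || labels.container || "unknown",
+    project: labels.com_docker_compose_project || msg.project || "unknown",
+    host: msg.host || "unknown",
+    status,
+    success: status === "success" ? 1 : 0,
+    failed: status === "failed" ? 1 : 0,
+    duration_ms: Math.max(0, nowMs - startedAt),
+    code: Number.isFinite(code) ? code : 0,
+    error: (msg.update_error || payload.error || "").toString().slice(0, 300)
+  });
+}
+
+// Made-up example: a failed update that started 1500 ms before "now".
+const sampleLine = buildUpdateLogEvent(
+  {
+    update_started_at: 1000,
+    update_status: "failed",
+    update_event: "completed",
+    update_error: "pull access denied",
+    container: "grafana",
+    payload: { code: 1 }
+  },
+  2500
+);
+console.log(sampleLine);
+```
+
+Passing the clock in keeps `ts` and `duration_ms` deterministic, which makes it easy to compare the printed line against what the File node appends.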
+
+### Wiring recommendation
+
+Use the same logger function in these branches:
+
+- before a pull/update command (`update_status=started`, `update_event=attempt`),
+- success path (`update_status=success`, `update_event=completed`),
+- failure path (`update_status=failed`, `update_event=completed`, and include `msg.update_error`).
+
+Then route each branch into a **File** node configured as:
+
+- Filename: `/data/update-events.ndjson`
+- Action: append to file
+- Add newline: enabled
+
+## 2) Make update state explicit in existing update flow
+
+In your current update flow (already present in `flows.json`), add or adjust **Change** nodes around your shell/docker nodes so that each branch sets its state explicitly:
+
+- At update start:
+  - `msg.update_started_at = $millis()` (entered as a JSONata expression)
+  - `msg.update_status = "started"`
+  - `msg.update_event = "attempt"`
+- At success:
+  - `msg.update_status = "success"`
+  - `msg.update_event = "completed"`
+- At failure:
+  - `msg.update_status = "failed"`
+  - `msg.update_event = "completed"`
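+  - `msg.update_error = msg.payload.stderr` (or whichever field carries the error text in your flow; on the core **exec** node's stderr output, the text is `msg.payload` itself)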
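+
+If you prefer a single **Function** node per branch over several Change-node rules, the three states listed above could be set with a small helper like the one below. This is a sketch only: `markUpdateState` and its `phase` argument are hypothetical names, and `Date.now()` stands in for the JSONata `$millis()` call.
+
+```javascript
+// Hypothetical helper: sets the same msg properties as the Change nodes.
+// phase is "start", "success", or "failure"; errorText is used on failure only.
+function markUpdateState(msg, phase, errorText) {
+  if (phase === "start") {
+    msg.update_started_at = Date.now();
+    msg.update_status = "started";
+    msg.update_event = "attempt";
+  } else if (phase === "success") {
+    msg.update_status = "success";
+    msg.update_event = "completed";
+  } else if (phase === "failure") {
+    msg.update_status = "failed";
+    msg.update_event = "completed";
+    // Coerce to a string so a missing error field does not break the logger.
+    msg.update_error = errorText == null ? "" : String(errorText);
+  } else {
+    throw new Error("Unknown update phase: " + phase);
+  }
+  return msg;
+}
+
+// Made-up example messages, one per branch.
+const startedMsg = markUpdateState({ container: "grafana" }, "start");
+const failedMsg = markUpdateState({ container: "grafana" }, "failure", "pull access denied");
+console.log(startedMsg.update_status, failedMsg.update_status);
+```
+
+Inside a Function node, the body would end with something like `return markUpdateState(msg, "start");`, with the helper defined above that line.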
+
+## 3) Let Telegraf ingest Node-RED event logs
+
+Append this to `monitoring/telegraf/telegraf.conf`:
+
+```toml
+[[inputs.tail]]
+  files = ["/var/log/node-red/update-events.ndjson"]
+  from_beginning = false
+  name_override = "node_red_update_event"
+  data_format = "json_v2"
+
+  [[inputs.tail.json_v2]]
+    measurement_name = "node_red_update_event"
+
+    [[inputs.tail.json_v2.tag]]
+      path = "flow"
+    [[inputs.tail.json_v2.tag]]
+      path = "event"
+    [[inputs.tail.json_v2.tag]]
+      path = "container"
+    [[inputs.tail.json_v2.tag]]
+      path = "project"
+    [[inputs.tail.json_v2.tag]]
+      path = "host"
+    [[inputs.tail.json_v2.tag]]
+      path = "status"
+
+    [[inputs.tail.json_v2.field]]
+      path = "success"
+      type = "int"
+    [[inputs.tail.json_v2.field]]
+      path = "failed"
+      type = "int"
+    [[inputs.tail.json_v2.field]]
+      path = "duration_ms"
+      type = "int"
+    [[inputs.tail.json_v2.field]]
+      path = "code"
+      type = "int"
+```
+
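+Then mount the Node-RED data directory into the Telegraf container (read-only) in `monitoring/prometheus/docker-compose.yml` under `telegraf.volumes`, so that the file Node-RED writes as `/data/update-events.ndjson` appears to Telegraf as `/var/log/node-red/update-events.ndjson`. This assumes `${PROJECT_ROOT}/monitoring/node-red/data` is the host directory mounted at `/data` in the Node-RED container: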
+
+```yaml
+      - ${PROJECT_ROOT}/monitoring/node-red/data:/var/log/node-red:ro
+```
+
+## 4) Prometheus scrape (already in place)
+
+No Prometheus scrape change is required, as long as Prometheus already scrapes Telegraf's metrics endpoint (`telegraf:9273`).
+
+## 5) Grafana queries to start with
+
+Use your Prometheus data source and try:
+
+- Latest success/failure by container:
+  - `last_over_time(node_red_update_event_success[24h])`
+  - `last_over_time(node_red_update_event_failed[24h])`
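+- Containers with at least one failed update in the last 24h:
+  - `max by (container, project) (max_over_time(node_red_update_event_failed[24h]))`
+  - Because `status` is a tag, each series holds a constant value (the `status="failed"` series is always 1), so `increase()` over it stays at 0. `max_over_time` reports whether a failure was seen rather than counting failures.
+- Average update duration in the last 24h (completed events only, so `started` events with near-zero duration do not skew the result):
+  - `avg by (container, project) (avg_over_time(node_red_update_event_duration_ms{event="completed"}[24h]))`
+
+Recommended panels:
+
+- **Table**: container, project, status (last value), duration_ms (last value)
+- **Time series**: failed flag over time (`node_red_update_event_failed`)
+- **Stat**: number of containers with a failed update in the last 24h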
+
+## 6) Validation checklist
+
+1. Trigger a known update path (including one failure if possible).
+2. Check Node-RED log file:
+   - `tail -n 20 monitoring/node-red/data/update-events.ndjson`
+3. Check Telegraf metrics endpoint for `node_red_update_event_` metrics.
+4. Confirm Grafana panel values match the latest Node-RED run.
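+
+For step 4, it can help to reduce the log file to the most recent event per container before comparing it with Grafana. The following is a minimal sketch under stated assumptions: `latestEventPerContainer` is a hypothetical helper, and the sample lines are made up in the shape the logger function produces.
+
+```javascript
+// Hypothetical helper: reduce NDJSON text to the latest event per container.
+function latestEventPerContainer(ndjsonText) {
+  const latest = new Map();
+  for (const line of ndjsonText.split("\n")) {
+    if (!line.trim()) continue; // skip blank lines
+    let event;
+    try {
+      event = JSON.parse(line);
+    } catch (err) {
+      continue; // skip a partially written or corrupt line
+    }
+    if (event === null || typeof event !== "object") continue;
+    latest.set(event.container || "unknown", event); // later lines win
+  }
+  return latest;
+}
+
+// Made-up sample data; the last line imitates a truncated write.
+const sampleLog = [
+  '{"container":"grafana","status":"started","duration_ms":0}',
+  '{"container":"grafana","status":"failed","duration_ms":1500}',
+  '{"container":"telegraf","status":"success","duration_ms":900}',
+  '{"container":"trunc'
+].join("\n");
+
+for (const [container, event] of latestEventPerContainer(sampleLog)) {
+  console.log(`${container}: ${event.status} (${event.duration_ms} ms)`);
+}
+```
+
+To run it against the real file, replace `sampleLog` with `require("fs").readFileSync("monitoring/node-red/data/update-events.ndjson", "utf8")`.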
+
+## Optional next step
+
+If you want searchable raw log text and richer log UX, add Loki + Promtail later. Keep this structured metrics path for high-signal alerting even after adding logs.