Glassmkr Documentation
From zero to monitoring in 5 minutes. One agent, 38 alert rules, no inbound ports.
#What Glassmkr Monitors
Glassmkr is a monitoring agent for bare metal and dedicated servers. The agent collects hardware and OS metrics every 5 minutes and pushes them to the Glassmkr dashboard, where 38 alert rules evaluate each snapshot automatically.
Hardware
IPMI sensors (temperature, fan speed, voltage, power draw), IPMI SEL event log, ECC memory errors, PSU redundancy status
Storage
SMART health and wear level, disk space and inodes, RAID array status, ZFS pool health and scrub errors, filesystem read-only detection, I/O errors and latency
Network
Interface errors and drops, link speed negotiation, bandwidth saturation, bond slave status, conntrack table usage
OS
CPU per-core utilization and iowait, load averages, RAM and swap, OOM kills, clock drift, NTP sync, systemd failed units, file descriptor exhaustion, unexpected reboots
Security
SSH root password authentication, firewall status, pending security updates, kernel vulnerabilities, reboot required flag, unattended upgrades configuration
38 alert rules evaluate on every collection cycle. All rules included on every plan, including Free.
#Quick Start
Docker (recommended)
# 1. Create config directory
sudo mkdir -p /etc/glassmkr
# 2. Add your collector key (get it from glassmkr.com after signing up)
sudo tee /etc/glassmkr/collector.yaml << 'EOF'
server_url: https://app.glassmkr.com
collector_key: gmk_cru_live_YOUR_KEY_HERE
interval: 300
EOF
# 3. Download and start
curl -O https://raw.githubusercontent.com/glassmkr/crucible/main/docker-compose.yml
docker compose up -d
# 4. Verify
docker compose logs glassmkr-crucible

The container runs with --privileged and network_mode: host for IPMI, SMART, and bond monitoring. See Security for details.
npm alternative
npm install -g @glassmkr/crucible
sudo glassmkr-crucible --config /etc/glassmkr/collector.yaml

Requires Node.js 24+. The system packages smartmontools, ipmitool, and dmidecode are needed for full hardware monitoring.
Your server appears in the dashboard within 5 minutes.
Sign up free to get your collector key.
#Alert Rules Reference
OS (9 rules)
| Rule | Trigger | Severity |
|---|---|---|
| ram_high | ≥ 90% used, ≥ 95% critical. Configurable threshold. | Warning / Critical |
| cpu_high | ≥ 90% utilization, ≥ 98% critical | Warning / Critical |
| load_high | Load average > 2x core count | Warning |
| cpu_iowait_high | ≥ 20% iowait. Configurable. | Warning |
| oom_kills | Any OOM kill detected | Critical |
| clock_drift | Offset > 1 second | Warning |
| swap_high | > 50% swap used | Warning |
| ntp_not_synced | NTP daemon not running or clock not synced | Warning |
| unexpected_reboot | Server restarted unexpectedly | Event |
Storage (8 rules)
| Rule | Trigger | Severity |
|---|---|---|
| disk_space_high | ≥ 85% warning, ≥ 95% critical. Configurable. | Warning / Critical |
| smart_failing | Reallocated/pending sectors or health != PASSED | Critical |
| nvme_wear_high | ≥ 85% wear warning, ≥ 95% critical. Configurable. | Warning / Critical |
| raid_degraded | Any degraded or failed RAID array | Critical |
| disk_latency_high | Average latency > 100ms | Warning |
| filesystem_readonly | Any mounted filesystem is read-only (excluding expected ones) | Critical |
| inode_high | ≥ 90% inodes used | Warning |
| disk_io_errors | I/O errors detected in dmesg | Critical |
Network (5 rules)
| Rule | Trigger | Severity |
|---|---|---|
| interface_errors | Hardware errors > 0 per interval, drops > 500 | Warning |
| link_speed_mismatch | Interface negotiated below expected speed | Warning |
| interface_saturation | > 80% of link capacity utilized | Warning |
| conntrack_exhaustion | > 80% of conntrack table used | Warning |
| bond_slave_down | A bond member interface is down | Critical |
Hardware / IPMI (5 rules)
| Rule | Trigger | Severity |
|---|---|---|
| cpu_temperature_high | > 80 °C warning, > 90 °C critical | Warning / Critical |
| ecc_errors | Correctable > 0 warning, uncorrectable > 0 critical | Warning / Critical |
| psu_redundancy_loss | PSU status not OK | Critical |
| ipmi_sel_critical | Critical SEL entries detected | Critical |
| ipmi_fan_failure | Fan speed below minimum threshold | Critical |
ZFS (2 rules)
| Rule | Trigger | Severity |
|---|---|---|
| zfs_pool_unhealthy | Pool state != ONLINE | Critical |
| zfs_scrub_errors | Scrub detected errors | Warning |
Security (6 rules)
| Rule | Trigger | Severity |
|---|---|---|
| ssh_root_password | Root login with password enabled | Warning |
| no_firewall | No active firewall detected | Warning |
| pending_security_updates | > 0 security updates pending | Info |
| kernel_vulnerabilities | Active kernel vulnerabilities | Warning |
| kernel_needs_reboot | Kernel update requires reboot | Info |
| unattended_upgrades_disabled | Auto-updates not configured | Info |
Service Health (3 rules)
| Rule | Trigger | Severity |
|---|---|---|
| systemd_service_failed | Any systemd unit in failed state | Warning |
| fd_exhaustion | > 80% of system file descriptors used | Warning |
| server_unreachable | Server missed 2+ check-ins (server-side watchdog) | Critical |
State alerts auto-resolve when the condition clears. Event alerts (unexpected_reboot) stack occurrences and have a Resolve button. Acknowledged alerts still auto-resolve.
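Most threshold rules above follow the same two-tier warning/critical pattern. A minimal sketch of that evaluation logic (illustrative only, not Glassmkr's actual engine; the function name is hypothetical):

```python
def evaluate_two_tier(value, warn_at, crit_at):
    """Two-tier threshold check, as in ram_high
    (>= 90% used = Warning, >= 95% = Critical)."""
    if value >= crit_at:
        return "critical"
    if value >= warn_at:
        return "warning"
    return None  # condition clear: a state alert would auto-resolve

# RAM at 92% used trips the warning tier
print(evaluate_two_tier(92, warn_at=90, crit_at=95))  # warning
```

Because the check runs against every snapshot, an alert resolves on the first cycle where the condition no longer holds.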
#Trend Warnings
Trend Warnings is a Pro-tier feature that surfaces early warnings when metrics show a degradation trend before the alert thresholds fire. It is separate from the 38 alert rules, which fire on current state.
Trend warnings run in a 6-hour batch process on Glassmkr's backend. They analyze up to 90 days of metric history per server, apply correlation rules that require two independent signals, and optionally consult an internal ranking model trained on Backblaze's public drive failure dataset. The result is a small number of high-confidence warnings per server, not noisy anomaly detection on every metric.
What gets monitored
| Signal | What we watch for | Example warning |
|---|---|---|
| SMART reallocated sectors (5) | Growth over 7-30 days | Drive /dev/sda SMART 5 grew from 0 to 14 over 30 days |
| SMART reported uncorrectable (187) | Any appearance above zero | Drive /dev/sda: SMART 187 is now 3 |
| SMART command timeouts (188) | Repeated growth | Drive /dev/sda: 4 command timeouts in last 7 days |
| SMART pending/offline uncorrectable (197, 198) | Step change from zero | Drive /dev/sda: pending sectors appeared 3 days ago |
| SMART high fly writes (189) | Burst patterns | Drive /dev/sda: 8 high fly write events in 24h |
| NVMe critical_warning | Any bit set | NVMe /dev/nvme0n1: critical_warning bit 2 set (reliability degraded) |
| NVMe available_spare | Approaching or below threshold | NVMe /dev/nvme0n1: available_spare now 12%, threshold is 10% |
| NVMe media_errors | Growing rapidly | NVMe /dev/nvme0n1: media_errors increased from 0 to 4 in 7 days |
| NVMe p99 latency (planned, v1.1) | Sustained drift without IO volume change | NVMe /dev/nvme0n1: p99 read latency sustained 2.3x above baseline |
| Disk space per partition | Projected fill via linear regression | /data partition projected to hit 85% in 12 days at current growth |
| ECC correctable errors (planned, v1.1) | Bursts per DIMM location | DIMM CPU1_DIMM_A2: 15 correctable errors in 24h |
| PSU rail voltages | Drift 2-3% from nominal | PSU 1: 12V rail at 11.62V (drift 3.2%) |
| Fan RPM | Decline paired with temp rise in same zone | Fan SYS_FAN2: RPM dropped 25% and chassis zone temp rising |
| NIC errors | CRC/frame errors (TCP retransmit correlation planned, v1.1) | eth0: 47 CRC errors |
| ZFS checksum/read errors | Paired with matching SMART signal on same device | Drive /dev/sda: ZFS reported 7 checksum errors corroborating SMART 5 growth |
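The disk-space row above projects partition fill with a linear regression over recent usage samples. A minimal sketch of the idea, assuming one usage-percentage sample per day (the exact window and model are Glassmkr internals, not shown here):

```python
def days_until(samples_pct, threshold=85.0):
    """Fit a least-squares line to daily usage percentages and
    project how many days until the partition crosses `threshold`.
    Returns None when usage is flat or shrinking."""
    n = len(samples_pct)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(samples_pct) / n
    slope = sum((x - x_mean) * (y - y_mean)
                for x, y in zip(xs, samples_pct)) \
        / sum((x - x_mean) ** 2 for x in xs)
    intercept = y_mean - slope * x_mean
    if slope <= 0:
        return None
    # Days from the latest sample to the projected crossing
    return max((threshold - intercept) / slope - (n - 1), 0.0)

# Seven daily samples growing ~1%/day from 70%
print(days_until([70, 71, 72, 73, 74, 75, 76]))  # 9.0
```

A projection like "9 days to 85%" is what feeds the severity tiers below the table.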
Severity tiers
- Imminent (red): projected failure within 7 days, or critical pattern (SMART 187 appearance, NVMe critical_warning). Push notification immediately.
- Soon (orange): projected within 30 days, or high-severity evidence. Push notification once.
- Scheduled (blue): projected within 90 days, or medium-severity. Dashboard only.
- Watch (grey): low confidence or more than 90 days out. Dashboard collapsed.
Correlation requirement
Where two signals exist on the same device, correlation is required before a notification. Several v1 categories fire on a single high-confidence signal because the underlying source is itself authoritative (a SMART step-change from zero, NVMe critical_warning bits, a PSU rail at 11.62V). The asymmetry is deliberate.
Multi-signal categories shipped in v1:
- Drive SMART signal + ZFS errors on the same device = storage device degradation
- Fan RPM decline + chassis temp rise in the same zone = cooling failure
Multi-signal categories planned for v1.1 once the underlying collector work lands:
- NVMe health signal + p99 latency inflation = NVMe pre-failure (fail-slow)
- NIC CRC errors + TCP retransmits on the same interface = NIC hardware failure
- ECC burst + MCE entries for the same DIMM = DIMM pre-failure
This approach trades some recall (failures that only show one signal) for high precision. Google's FAST 2007 study found roughly 40-50% of drive failures in the field show no SMART-visible warning, so trend warnings are a meaningful reduction of surprise failures, not a guarantee that every failure becomes predictable.
What we explicitly don't do
- No general-purpose anomaly detection on every metric. Netdata's own docs demote their anomaly ML to "investigation aid, not alert source." We agree.
- No per-customer model training. With 3-50 servers per account, customer-specific models are base-rate-dominated. We use global thresholds plus an offline-trained ranker on Backblaze's public dataset.
- No LLM-based trend classification. Linear regression, CUSUM, and first-differences do this job better and cheaper. We use AI only to narrate deterministic findings in plain English.
- No confident failure predictions. We say "likely within 7-14 days", never "will fail on Tuesday." The underlying signals carry real uncertainty and we surface it.
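CUSUM on first differences, mentioned above, is a standard way to detect sustained growth in a cumulative counter such as SMART attribute 5. A minimal one-sided sketch (the drift and threshold values here are illustrative, not Glassmkr's tuning):

```python
def cusum_alarm(counts, drift=0.0, threshold=5.0):
    """One-sided CUSUM over first differences of a cumulative
    counter. Returns the sample index where sustained growth
    crosses `threshold`, or None if it never does."""
    s = 0.0
    for i in range(1, len(counts)):
        delta = counts[i] - counts[i - 1]  # first difference
        s = max(0.0, s + delta - drift)    # accumulate growth only
        if s >= threshold:
            return i
    return None

# Reallocated-sector counter flat at zero, then climbing
print(cusum_alarm([0, 0, 0, 0, 2, 4, 7, 11]))  # 6
```

The drift term absorbs slow, benign growth so that only a genuine change in rate accumulates toward the alarm.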
Data requirements
- Trend warnings for SMART, NVMe, ECC, cooling, PSU, and NIC require 90-day metric retention (Pro feature).
- Disk space projection works on 7-day data (available on Free tier).
- A server needs at least 3 days of contiguous data to receive any warnings. Freshly added servers are in an observation period.
Self-audit
The dashboard shows the feature's own track record: how many warnings were sent in the last 90 days, how many users confirmed as valuable, how many were dismissed, and how many were followed by a matching alert firing within 30 days. No other monitoring tool surfaces this, and it exists so you can audit whether trend warnings are actually earning their keep for your fleet.
#Configuration Reference
# Required
server_url: https://app.glassmkr.com
collector_key: gmk_cru_live_YOUR_KEY_HERE
# Optional
interval: 300 # Collection interval in seconds (default: 300)
# hostname: my-server # Override auto-detected hostname
# modules:            # Disable specific collection modules
#   ipmi: false
#   smart: false
#   zfs: false
#   security: false

server_url - The Glassmkr ingest endpoint. Always https://app.glassmkr.com for the hosted service.

collector_key - Your server's authentication token. Generated when you add a server in the dashboard. Prefixed with gmk_cru_live_ (older keys may still use the legacy col_ prefix until rotated).

interval - How often (in seconds) the agent collects and pushes a snapshot. Default is 300 (5 minutes). Minimum is 60.

hostname - Override the auto-detected hostname. Useful when the system hostname is generic or changes between reboots.

modules - Disable individual collection modules. Set any module to false to skip it. The agent will not attempt to read sensors for disabled modules.
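A quick pre-flight check of a collector.yaml can be sketched in Python. This naive parser only handles the flat key: value form shown above (a nested modules: block would need a real YAML library), and the validator itself is an illustration, not part of Glassmkr:

```python
def validate_collector_config(text):
    """Check the documented constraints: server_url and
    collector_key are required, interval defaults to 300
    with a 60-second minimum. Returns a list of problems."""
    cfg = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" in line:
            key, _, value = line.partition(":")
            cfg[key.strip()] = value.strip()
    problems = []
    for required in ("server_url", "collector_key"):
        if not cfg.get(required):
            problems.append(f"missing required key: {required}")
    if int(cfg.get("interval", 300)) < 60:
        problems.append("interval below the 60-second minimum")
    return problems

sample = "server_url: https://app.glassmkr.com\ncollector_key: k\ninterval: 30"
print(validate_collector_config(sample))
```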
#System Requirements
- Operating System
- Linux with systemd. Tested on Debian 11/12, Ubuntu 20.04 to 24.04, Rocky 8/9, AlmaLinux 8/9.
- Runtime
- Docker (recommended) or Node.js 24+.
- Privileges
- Root access required for IPMI, SMART, and /proc system reads.
- Network
- Outbound HTTPS on port 443 to app.glassmkr.com. No inbound ports needed.
- Resource usage
- Approximately 90 MB RSS. Varies by hardware; servers with more IPMI sensors or drives use slightly more.
- Optional packages (npm install only)
- smartmontools, ipmitool, dmidecode for full hardware monitoring. Missing packages are silently skipped.
#How It Works
The agent collects a snapshot every 5 minutes and pushes it to the Glassmkr dashboard.
- The agent is MIT open source: github.com/glassmkr/crucible
- Agent pushes outbound only, opens no inbound ports
- Snapshots contain hardware metrics only, no user data
- Dashboard runs on EU dedicated servers, no cloud providers
- AI analysis runs on a self-hosted GPU, no external AI providers
#Notification Channels
Email
Free + Pro. Alerts delivered from [email protected].
Telegram
Free + Pro. Bot messages with alert details and direct links.
Slack
Pro only. Block Kit formatted messages with severity colors.
Webhooks
Pro only. POST JSON to any URL you configure.
All channels support per-priority filtering (P1 to P4). Agent update notifications send major version alerts to everyone; patch notifications are opt-in.
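Webhook deliveries are JSON POSTs, but the payload schema is not documented on this page, so the field names below (rule, severity, hostname) are hypothetical; check the dashboard's webhook settings for the real shape. A minimal sketch for turning a webhook body into a one-line summary:

```python
import json

def summarize_alert(body: bytes) -> str:
    """Condense a webhook POST body into one log line.
    Field names here are assumptions, not the documented schema."""
    alert = json.loads(body)
    return (f"[{alert.get('severity', '?')}] "
            f"{alert.get('rule', '?')} on {alert.get('hostname', '?')}")

payload = b'{"rule": "ram_high", "severity": "warning", "hostname": "db1"}'
print(summarize_alert(payload))  # [warning] ram_high on db1
```

Wire this into any HTTP framework's POST handler and return a 2xx response to acknowledge receipt.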
#Pricing
Free
- Up to 3 servers
- All 38 alert rules
- Email + Telegram notifications
- 7 days data retention
- No credit card required
Pro $3/node/month
- First 3 nodes free, then $3/node/month
- Unlimited servers
- 90 days data retention
- Slack + webhooks
- AI health analysis
- MCP API access
- Email support
Enterprise
Custom pricing and configuration. Contact [email protected].
#FAQ
Do I need to open any inbound ports?
No. The agent initiates all connections outbound over HTTPS (port 443). Your firewall rules do not need to change.
Does the agent work without IPMI?
Yes. If ipmitool is not installed or the BMC is not reachable, the IPMI module is silently skipped. All other monitoring continues normally.
What happens if connectivity is lost?
The server_unreachable rule fires after the server misses 2 consecutive check-ins, roughly 10 minutes at the default interval. When connectivity resumes, the agent continues pushing snapshots.
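The watchdog arithmetic is simple: at the default 300-second interval, two missed check-ins means no snapshot for roughly 600 seconds. A server-side sketch of that check (illustrative only, not Glassmkr's implementation):

```python
import time

def is_unreachable(last_seen_ts, interval=300, missed=2, now=None):
    """server_unreachable-style watchdog: true once the server
    has gone `missed` full intervals without a snapshot."""
    now = time.time() if now is None else now
    return (now - last_seen_ts) > missed * interval

# Last snapshot 11 minutes ago at the default 5-minute interval
print(is_unreachable(last_seen_ts=0, now=660))  # True (660s > 600s)
```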
Can I self-host the dashboard?
The agent is MIT-licensed and fully open source. The dashboard and alert evaluation engine are SaaS-only.
How does pricing work mid-month?
Proration. Add a server mid-month and you are charged proportionally for the remaining days. Remove a server and the next bill reflects the change.
Is my data stored in the EU?
Yes. All infrastructure, including the database servers and AI GPU, runs on dedicated servers in EU data centers.