Glassmkr Documentation
From zero to monitoring in 5 minutes. One agent, 38 alert rules, no inbound ports.
#What Glassmkr Monitors
Glassmkr is a monitoring agent for bare metal and dedicated servers. The agent collects hardware and OS metrics every 5 minutes and pushes them to the Glassmkr dashboard, where 38 alert rules evaluate each snapshot automatically.
Hardware
IPMI sensors (temperature, fan speed, voltage, power draw), IPMI SEL event log, ECC memory errors, PSU redundancy status
Storage
SMART health and wear level, disk space and inodes, RAID array status, ZFS pool health and scrub errors, filesystem read-only detection, I/O errors and latency
Network
Interface errors and drops, link speed negotiation, bandwidth saturation, bond slave status, conntrack table usage
OS
CPU per-core utilization and iowait, load averages, RAM and swap, OOM kills, clock drift, NTP sync, systemd failed units, file descriptor exhaustion, unexpected reboots
Security
SSH root password authentication, firewall status, pending security updates, kernel vulnerabilities, reboot required flag, unattended upgrades configuration
38 alert rules evaluate on every collection cycle. All rules included on every plan, including Free.
#Quick Start
Docker (recommended)
# 1. Create config directory
sudo mkdir -p /etc/glassmkr
# 2. Add your collector key (get it from glassmkr.com after signing up)
sudo tee /etc/glassmkr/collector.yaml << 'EOF'
server_url: https://app.glassmkr.com
collector_key: gmk_cru_live_YOUR_KEY_HERE
interval: 300
EOF
# 3. Download and start
curl -O https://raw.githubusercontent.com/glassmkr/crucible/main/docker-compose.yml
docker compose up -d
# 4. Verify
docker compose logs glassmkr-crucible

The container runs with --privileged and network_mode: host for IPMI, SMART, and bond monitoring. See Security for details.
npm alternative
npm install -g @glassmkr/crucible
sudo glassmkr-crucible --config /etc/glassmkr/collector.yaml

Requires Node.js 24+. The system packages smartmontools, ipmitool, and dmidecode are needed for full hardware monitoring.
Your server appears in the dashboard within 5 minutes.
Sign up free to get your collector key.
#Alert Rules Reference
OS (9 rules)
| Rule | Trigger | Severity |
|---|---|---|
| ram_high | ≥ 90% used, ≥ 95% critical. Configurable threshold. | Warning / Critical |
| cpu_high | ≥ 90% utilization, ≥ 98% critical | Warning / Critical |
| load_high | Load average > 2x core count | Warning |
| cpu_iowait_high | ≥ 20% iowait. Configurable. | Warning |
| oom_kills | Any OOM kill detected | Critical |
| clock_drift | Offset > 1 second | Warning |
| swap_high | > 50% swap used | Warning |
| ntp_not_synced | NTP daemon not running or clock not synced | Warning |
| unexpected_reboot | Server restarted unexpectedly | Event |
Storage (8 rules)
| Rule | Trigger | Severity |
|---|---|---|
| disk_space_high | ≥ 85% warning, ≥ 95% critical. Configurable. | Warning / Critical |
| smart_failing | Reallocated/pending sectors or health != PASSED | Critical |
| nvme_wear_high | ≥ 85% wear warning, ≥ 95% critical. Configurable. | Warning / Critical |
| raid_degraded | Any degraded or failed RAID array | Critical |
| disk_latency_high | Average latency > 100ms | Warning |
| filesystem_readonly | Any mounted filesystem is read-only (excluding expected ones) | Critical |
| inode_high | ≥ 90% inodes used | Warning |
| disk_io_errors | I/O errors detected in dmesg | Critical |
Network (5 rules)
| Rule | Trigger | Severity |
|---|---|---|
| interface_errors | Hardware errors > 0 per interval, drops > 500 | Warning |
| link_speed_mismatch | Interface negotiated below expected speed | Warning |
| interface_saturation | > 80% of link capacity utilized | Warning |
| conntrack_exhaustion | > 80% of conntrack table used | Warning |
| bond_slave_down | A bond member interface is down | Critical |
Hardware / IPMI (5 rules)
| Rule | Trigger | Severity |
|---|---|---|
| cpu_temperature_high | > 80 °C warning, > 90 °C critical | Warning / Critical |
| ecc_errors | Correctable > 0 warning, uncorrectable > 0 critical | Warning / Critical |
| psu_redundancy_loss | PSU status not OK | Critical |
| ipmi_sel_critical | Critical SEL entries detected | Critical |
| ipmi_fan_failure | Fan speed below minimum threshold | Critical |
ZFS (2 rules)
| Rule | Trigger | Severity |
|---|---|---|
| zfs_pool_unhealthy | Pool state != ONLINE | Critical |
| zfs_scrub_errors | Scrub detected errors | Warning |
Security (6 rules)
| Rule | Trigger | Severity |
|---|---|---|
| ssh_root_password | Root login with password enabled | Warning |
| no_firewall | No active firewall detected | Warning |
| pending_security_updates | > 0 security updates pending | Info |
| kernel_vulnerabilities | Active kernel vulnerabilities | Warning |
| kernel_needs_reboot | Kernel update requires reboot | Info |
| unattended_upgrades_disabled | Auto-updates not configured | Info |
Service Health (3 rules)
| Rule | Trigger | Severity |
|---|---|---|
| systemd_service_failed | Any systemd unit in failed state | Warning |
| fd_exhaustion | > 80% of system file descriptors used | Warning |
| server_unreachable | Server missed 2+ check-ins (server-side watchdog) | Critical |
State alerts auto-resolve when the condition clears. Event alerts (unexpected_reboot) stack occurrences and have a Resolve button. Acknowledged alerts still auto-resolve.
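Most threshold rules above follow the same two-tier warning/critical pattern. A minimal sketch of that evaluation logic (illustrative only, not Glassmkr's actual engine; the function name is hypothetical):

```python
def evaluate_two_tier(value, warn_at, crit_at):
    """Two-tier threshold check, as in ram_high
    (>= 90% used = Warning, >= 95% = Critical)."""
    if value >= crit_at:
        return "critical"
    if value >= warn_at:
        return "warning"
    return None  # condition clear: a state alert would auto-resolve

# RAM at 92% used trips the warning tier
print(evaluate_two_tier(92, warn_at=90, crit_at=95))  # warning
```

Because the check runs against every snapshot, an alert resolves on the first cycle where the condition no longer holds.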
#Trend Warnings
Trend Warnings is a Pro-tier feature that surfaces early warnings when metrics show a degradation trend before the alert thresholds fire. It is separate from the 38 alert rules, which fire on current state.
Trend warnings run in a 6-hour batch process on Glassmkr's backend. They analyze up to 90 days of metric history per server, apply correlation rules that require two independent signals, and optionally consult an internal ranking model trained on Backblaze's public drive failure dataset. The result is a small number of high-confidence warnings per server, not noisy anomaly detection on every metric.
What gets monitored
| Signal | What we watch for | Example warning |
|---|---|---|
| SMART reallocated sectors (5) | Growth over 7-30 days | Drive /dev/sda SMART 5 grew from 0 to 14 over 30 days |
| SMART reported uncorrectable (187) | Any appearance above zero | Drive /dev/sda: SMART 187 is now 3 |
| SMART command timeouts (188) | Repeated growth | Drive /dev/sda: 4 command timeouts in last 7 days |
| SMART pending/offline uncorrectable (197, 198) | Step change from zero | Drive /dev/sda: pending sectors appeared 3 days ago |
| SMART high fly writes (189) | Burst patterns | Drive /dev/sda: 8 high fly write events in 24h |
| NVMe critical_warning | Any bit set | NVMe /dev/nvme0n1: critical_warning bit 2 set (reliability degraded) |
| NVMe available_spare | Approaching or below threshold | NVMe /dev/nvme0n1: available_spare now 12%, threshold is 10% |
| NVMe media_errors | Growing rapidly | NVMe /dev/nvme0n1: media_errors increased from 0 to 4 in 7 days |
| NVMe p99 latency (planned, v1.1) | Sustained drift without IO volume change | NVMe /dev/nvme0n1: p99 read latency sustained 2.3x above baseline |
| Disk space per partition | Projected fill via linear regression | /data partition projected to hit 85% in 12 days at current growth |
| ECC correctable errors (planned, v1.1) | Bursts per DIMM location | DIMM CPU1_DIMM_A2: 15 correctable errors in 24h |
| PSU rail voltages | Drift 2-3% from nominal | PSU 1: 12V rail at 11.62V (drift 3.2%) |
| Fan RPM | Decline paired with temp rise in same zone | Fan SYS_FAN2: RPM dropped 25% and chassis zone temp rising |
| NIC errors | CRC/frame errors (TCP retransmit correlation planned, v1.1) | eth0: 47 CRC errors |
| ZFS checksum/read errors | Paired with matching SMART signal on same device | Drive /dev/sda: ZFS reported 7 checksum errors corroborating SMART 5 growth |
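The disk-space row above projects partition fill with a linear regression over recent usage samples. A minimal sketch of the idea, assuming one usage-percentage sample per day (the exact window and model are Glassmkr internals, not shown here):

```python
def days_until(samples_pct, threshold=85.0):
    """Fit a least-squares line to daily usage percentages and
    project how many days until the partition crosses `threshold`.
    Returns None when usage is flat or shrinking."""
    n = len(samples_pct)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(samples_pct) / n
    slope = sum((x - x_mean) * (y - y_mean)
                for x, y in zip(xs, samples_pct)) \
        / sum((x - x_mean) ** 2 for x in xs)
    intercept = y_mean - slope * x_mean
    if slope <= 0:
        return None
    # Days from the latest sample to the projected crossing
    return max((threshold - intercept) / slope - (n - 1), 0.0)

# Seven daily samples growing ~1%/day from 70%
print(days_until([70, 71, 72, 73, 74, 75, 76]))  # 9.0
```

A projection like "9 days to 85%" is what feeds the severity tiers below the table.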
Severity tiers
- Imminent (red): projected failure within 7 days, or critical pattern (SMART 187 appearance, NVMe critical_warning). Push notification immediately.
- Soon (orange): projected within 30 days, or high-severity evidence. Push notification once.
- Scheduled (blue): projected within 90 days, or medium-severity. Dashboard only.
- Watch (grey): low confidence or more than 90 days out. Dashboard collapsed.
Correlation requirement
Where two signals exist on the same device, correlation is required before a notification. Several v1 categories fire on a single high-confidence signal because the underlying source is itself authoritative (a SMART step-change from zero, NVMe critical_warning bits, a PSU rail at 11.62V). The asymmetry is deliberate.
Multi-signal categories shipped in v1:
- Drive SMART signal + ZFS errors on the same device = storage device degradation
- Fan RPM decline + chassis temp rise in the same zone = cooling failure
Multi-signal categories planned for v1.1 once the underlying collector work lands:
- NVMe health signal + p99 latency inflation = NVMe pre-failure (fail-slow)
- NIC CRC errors + TCP retransmits on the same interface = NIC hardware failure
- ECC burst + MCE entries for the same DIMM = DIMM pre-failure
This approach trades some recall (failures that only show one signal) for high precision. Google's FAST 2007 study found roughly 40-50% of drive failures in the field show no SMART-visible warning, so trend warnings are a meaningful reduction of surprise failures, not a guarantee that every failure becomes predictable.
What we explicitly don't do
- No general-purpose anomaly detection on every metric. Netdata's own docs demote their anomaly ML to "investigation aid, not alert source." We agree.
- No per-customer model training. With 3-50 servers per account, customer-specific models are base-rate-dominated. We use global thresholds plus an offline-trained ranker on Backblaze's public dataset.
- No LLM-based trend classification. Linear regression, CUSUM, and first-differences do this job better and cheaper. We use AI only to narrate deterministic findings in plain English.
- No confident failure predictions. We say "likely within 7-14 days", never "will fail on Tuesday." The underlying signals carry real uncertainty and we surface it.
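CUSUM on first differences, mentioned above, is a standard way to detect sustained growth in a cumulative counter such as SMART attribute 5. A minimal one-sided sketch (the drift and threshold values here are illustrative, not Glassmkr's tuning):

```python
def cusum_alarm(counts, drift=0.0, threshold=5.0):
    """One-sided CUSUM over first differences of a cumulative
    counter. Returns the sample index where sustained growth
    crosses `threshold`, or None if it never does."""
    s = 0.0
    for i in range(1, len(counts)):
        delta = counts[i] - counts[i - 1]  # first difference
        s = max(0.0, s + delta - drift)    # accumulate growth only
        if s >= threshold:
            return i
    return None

# Reallocated-sector counter flat at zero, then climbing
print(cusum_alarm([0, 0, 0, 0, 2, 4, 7, 11]))  # 6
```

The drift term absorbs slow, benign growth so that only a genuine change in rate accumulates toward the alarm.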
Data requirements
- Trend warnings for SMART, NVMe, ECC, cooling, PSU, and NIC require 90-day metric retention (Pro feature).
- Disk space projection works on 7-day data (available on Free tier).
- A server needs at least 3 days of contiguous data to receive any warnings. Freshly added servers are in an observation period.
Self-audit
The dashboard shows the feature's own track record: how many warnings were sent in the last 90 days, how many users confirmed as valuable, how many were dismissed, and how many were followed by a matching alert firing within 30 days. No other monitoring tool surfaces this, and it exists so you can audit whether trend warnings are actually earning their keep for your fleet.
#Configuration Reference
# Required
server_url: https://app.glassmkr.com
collector_key: gmk_cru_live_YOUR_KEY_HERE
# Optional
interval: 300 # Collection interval in seconds (default: 300)
# hostname: my-server # Override auto-detected hostname
# modules:            # Disable specific collection modules
#   ipmi: false
#   smart: false
#   zfs: false
#   security: false

server_url - The Glassmkr ingest endpoint. Always https://app.glassmkr.com for the hosted service.

collector_key - Your server's authentication token. Generated when you add a server in the dashboard. Prefixed with gmk_cru_live_ (older keys may still use the legacy col_ prefix until rotated).

interval - How often (in seconds) the agent collects and pushes a snapshot. Default is 300 (5 minutes). Minimum is 60.

hostname - Override the auto-detected hostname. Useful when the system hostname is generic or changes between reboots.

modules - Disable individual collection modules. Set any module to false to skip it. The agent will not attempt to read sensors for disabled modules.
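A quick pre-flight check of a collector.yaml can be sketched in Python. This naive parser only handles the flat key: value form shown above (a nested modules: block would need a real YAML library), and the validator itself is an illustration, not part of Glassmkr:

```python
def validate_collector_config(text):
    """Check the documented constraints: server_url and
    collector_key are required, interval defaults to 300
    with a 60-second minimum. Returns a list of problems."""
    cfg = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" in line:
            key, _, value = line.partition(":")
            cfg[key.strip()] = value.strip()
    problems = []
    for required in ("server_url", "collector_key"):
        if not cfg.get(required):
            problems.append(f"missing required key: {required}")
    if int(cfg.get("interval", 300)) < 60:
        problems.append("interval below the 60-second minimum")
    return problems

sample = "server_url: https://app.glassmkr.com\ncollector_key: k\ninterval: 30"
print(validate_collector_config(sample))
```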
#System Requirements
- Operating System
- Linux with systemd. Tested on Debian 11/12, Ubuntu 20.04 to 24.04, Rocky 8/9, AlmaLinux 8/9.
- Runtime
- Docker (recommended) or Node.js 24+.
- Privileges
- Root access required for IPMI, SMART, and /proc system reads.
- Network
- Outbound HTTPS on port 443 to app.glassmkr.com. No inbound ports needed.
- Resource usage
- Approximately 90 MB RSS. Varies by hardware; servers with more IPMI sensors or drives use slightly more.
- Optional packages (npm install only)
- smartmontools, ipmitool, dmidecode for full hardware monitoring. Missing packages are silently skipped.
#How It Works
The agent collects a snapshot every 5 minutes and pushes it to the Glassmkr dashboard.
- The agent is MIT open source: github.com/glassmkr/crucible
- Agent pushes outbound only, opens no inbound ports
- Snapshots contain hardware metrics only, no user data
- Dashboard runs on EU dedicated servers, no cloud providers
- AI analysis runs on a self-hosted GPU, no external AI providers
#Notification Channels
Email
Free + Pro. Alerts delivered from [email protected].
Telegram
Free + Pro. Bot messages with alert details and direct links.
Slack
Pro only. Block Kit formatted messages with severity colors.
Webhooks
Pro only. POST JSON to any URL you configure.
All channels support per-priority filtering (P1 to P4). Agent update notifications send major version alerts to everyone; patch notifications are opt-in.
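Webhook deliveries are JSON POSTs, but the payload schema is not documented on this page, so the field names below (rule, severity, hostname) are hypothetical; check the dashboard's webhook settings for the real shape. A minimal sketch for turning a webhook body into a one-line summary:

```python
import json

def summarize_alert(body: bytes) -> str:
    """Condense a webhook POST body into one log line.
    Field names here are assumptions, not the documented schema."""
    alert = json.loads(body)
    return (f"[{alert.get('severity', '?')}] "
            f"{alert.get('rule', '?')} on {alert.get('hostname', '?')}")

payload = b'{"rule": "ram_high", "severity": "warning", "hostname": "db1"}'
print(summarize_alert(payload))  # [warning] ram_high on db1
```

Wire this into any HTTP framework's POST handler and return a 2xx response to acknowledge receipt.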
#Pricing
Free
- Up to 3 servers
- All 38 alert rules
- Email + Telegram notifications
- 7 days data retention
- No credit card required
Pro $3/node/month
- First 3 nodes free, then $3/node/month
- Unlimited servers
- 90 days data retention
- Slack + webhooks
- AI health analysis
- MCP API access
- Email support
Enterprise
Custom pricing and configuration. Contact [email protected].
#FAQ
Do I need to open any inbound ports?
No. The agent initiates all connections outbound over HTTPS (port 443). Your firewall rules do not need to change.
Does the agent work without IPMI?
Yes. If ipmitool is not installed or the BMC is not reachable, the IPMI module is silently skipped. All other monitoring continues normally.
What happens if connectivity is lost?
The server_unreachable rule fires after the server misses 2 consecutive check-ins, roughly 10 minutes at the default interval. When connectivity resumes, the agent continues pushing snapshots.
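The watchdog arithmetic is simple: at the default 300-second interval, two missed check-ins means no snapshot for roughly 600 seconds. A server-side sketch of that check (illustrative only, not Glassmkr's implementation):

```python
import time

def is_unreachable(last_seen_ts, interval=300, missed=2, now=None):
    """server_unreachable-style watchdog: true once the server
    has gone `missed` full intervals without a snapshot."""
    now = time.time() if now is None else now
    return (now - last_seen_ts) > missed * interval

# Last snapshot 11 minutes ago at the default 5-minute interval
print(is_unreachable(last_seen_ts=0, now=660))  # True (660s > 600s)
```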
Can I self-host the dashboard?
The agent is MIT-licensed and fully open source. The dashboard and alert evaluation engine are SaaS-only.
How does pricing work mid-month?
Proration. Add a server mid-month and you are charged proportionally for the remaining days. Remove a server and the next bill reflects the change.
Is my data stored in the EU?
Yes. All infrastructure, including the database servers and AI GPU, runs on dedicated servers in EU data centers.