Terminal probe output highlighting three alert-docs gap patterns surfaced by a constrained AI run

We used an AI as a controlled probe of our alert documentation

We forbade an AI from using its training data and made it resolve real infrastructure alerts using only the guidance our own dashboard produces. Three gap patterns surfaced. All three were fixed in the same week.

Training next to Gemma: top output showing python train.py at 2200% CPU alongside llama-server at 9% CPU on l4-ams-01

Training a drive-failure model on a GPU server's CPU

We retrained our drive-failure predictor on 2 years of Backblaze data (222M drive-days) on the CPU of our L4 inference server. Gemma stayed resident in VRAM. 59 minutes, no new compute, 5.8% inference overhead. Plus the feature-importance surprise: SMART 197 beat SMART 187.

Glassmkr terminal preview: crucible fleet --status showing 3 servers, 38 rules evaluated, all healthy

Introducing Glassmkr: bare metal monitoring built by operators

Three tools, one philosophy: Dashboard (a hosted SaaS), Bench (MCP servers for AI agents), and Crucible (an open-source agent). Built by operators with a decade of bare metal experience across 67 global locations.

Qwen3.6 vs Gemma 4 benchmark: thinking mode cost 7x latency for no material quality gain

We benchmarked Qwen3.6 against our production Gemma 4 on an L4. Here's what actually mattered.

Three-way benchmark of Gemma 4 26B-A4B, Qwen3.6 35B-A3B no-think, and Qwen3.6 35B-A3B thinking on a production infrastructure health analysis prompt. Real wall-clock numbers, VRAM footprints, and the quality-latency tradeoff that matters for narration.

ipmitool sensor output showing CPU2 at 89C critical, FAN1 at 0 RPM, mixed PSU status

IPMI diagnostics for bare metal: what to monitor and how to read it

A practical guide to monitoring IPMI sensors, SEL logs, and BMC health on Dell, Supermicro, and HPE servers. Covers kipmi0 CPU issues, vendor quirks, and what to alert on.
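As a taste of the triage the guide walks through, here's a minimal sketch that flags critical sensors, assuming the classic pipe-delimited `ipmitool sensor` layout (name | reading | units | status | thresholds...). The sample readings are illustrative, not from a real BMC.

```shell
# Illustrative sensor dump; a real one comes from `ipmitool sensor`
cat > /tmp/sensors.txt <<'EOF'
CPU1 Temp        | 54.000    | degrees C | ok
CPU2 Temp        | 89.000    | degrees C | cr
FAN1             | 0.000     | RPM       | cr
FAN2             | 6120.000  | RPM       | ok
EOF

# Flag any sensor whose status column reads "cr" (critical)
awk -F'|' '$4 ~ /cr/ {gsub(/^ +| +$/, "", $1); print "CRITICAL: " $1}' /tmp/sensors.txt
```

The status column does the work here: a fan at 0 RPM and a CPU past its critical threshold both surface without hard-coding per-sensor limits.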

Gemma 4 26B-A4B (3.8B active) shipped; dense models 8B, 32B, and 70B did not make it

What We Learned Running Gemma 4 on an L4 GPU for Production Server Analysis

How we deployed Gemma 4 26B on an NVIDIA L4 for AI health analysis of bare metal servers. Covers model selection, why vLLM failed, quantization choices, and prompting for structured infrastructure output.

Priority-ordered hardware alerts: P1 SMART failing, P2 RAID degraded, P3 reallocated sectors rising, P4 ECC errors

IPMI, SMART, and RAID: The Hardware Layer Your Cloud Monitoring Tool Ignores

Most monitoring tools stop at the OS. Below it sits an entire hardware layer: disk firmware predicting its own failure, fans at 0 RPM, ECC memory correcting silent errors. Here is what to monitor and why.
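On the "disk firmware predicting its own failure" point, a minimal sketch of a SMART check: flag nonzero raw values on the attributes most associated with impending failure (5, 187, 197). The rows below follow the `smartctl -A` attribute-table shape; the sample data is illustrative, not from a real drive.

```shell
# Illustrative attribute rows; a real table comes from `smartctl -A /dev/sda`
cat > /tmp/smart.txt <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
EOF

# A nonzero raw value (last column) on any of these attributes is worth an alert
awk '$1 ~ /^(5|187|197)$/ && $NF+0 > 0 {print "WARN: " $2 " raw=" $NF}' /tmp/smart.txt
```

Raw values, not normalized values, are what matter here: the normalized column can sit at 100 while the raw pending-sector count climbs.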

Bare metal failure modes across storage, hardware, network, and OS, none of which a cloud APM sees

Why bare metal monitoring is different

Cloud monitoring tools were built for ephemeral workloads. They track HTTP latency and container restarts. But when you run physical servers, the failure modes are fundamentally different: drives wear out, DIMM slots develop bit errors, fans fail silently, and RAID arrays degrade without anyone noticing.
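On the "RAID arrays degrade without anyone noticing" point, a minimal sketch of what a check can look for in `/proc/mdstat`: a `_` in the member map marks a missing disk. The sample output is illustrative.

```shell
# Illustrative mdstat; the real thing is `cat /proc/mdstat`
cat > /tmp/mdstat.txt <<'EOF'
md0 : active raid1 sda1[0] sdb1[1](F)
      976630336 blocks super 1.2 [2/1] [U_]
EOF

# [2/1] means 1 of 2 members active; the "_" in [U_] is the missing member
grep -Eq '\[[U_]*_[U_]*\]' /tmp/mdstat.txt && echo "ALERT: degraded md array"
```

Note that `md0` still reports `active` here; "active but degraded" is exactly the state that goes unnoticed without a check like this.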