There’s an open secret in the world of DevOps: nobody trusts the CMDB. The Configuration Management Database (CMDB) is meant to be the “source of truth”: the central map of every server, service, and application in your enterprise. In theory, it’s the foundation for security audits, cost analysis, and incident response. In practice, it’s a work of fiction. The moment you populate a CMDB, it begins to rot. Engineers deploy a new microservice but forget to register it. An autoscaling group spins up 20 new nodes, but the database only records the original three…

We call this configuration drift, and for decades, our industry’s answer has been to throw more scripts at the problem. We write massive, brittle ETL (Extract-Transform-Load) pipelines that try to scrape the world and shove it into a relational database. It never works. The “world,” especially the modern cloud-native world, moves too fast.

We realized we couldn’t solve this problem by writing better scripts. We had to change the fundamental architecture of how we sync data. We stopped trying to boil the ocean and fix the entire enterprise at once. Instead, we focused on one notoriously difficult environment: Kubernetes. If we could build an autonomous agent capable of reasoning about the complex, ephemeral state of a Kubernetes cluster, we could prove a pattern that works everywhere else. This article explores how we used the newly open-sourced Codex CLI and the Model Context Protocol (MCP) to build that agent. In the process, we moved from passive code generation to active infrastructure operation, transforming the “stale CMDB” problem from a data-entry task into a logic puzzle.

The Shift: From Code Generation to Infrastructure Operation with Codex CLI and MCP

The reason most CMDB initiatives fail is ambition. They try to track every switch port, virtual machine, and SaaS license simultaneously. The result is a data swamp: too much noise, not enough signal. We took a different approach. We drew a small circle around a specific domain: Kubernetes workloads. Kubernetes is the perfect testing ground for AI agents because it’s high-velocity and declarative. Things change constantly. Pods die; deployments roll over; services change selectors. A static script struggles to distinguish between a CrashLoopBackOff (a temporary error state) and a purposeful scale-down. We hypothesized that a large language model (LLM), acting as an operator, could understand this nuance. It wouldn’t just copy data; it would interpret it.

The Codex CLI turned this hypothesis into a tangible architecture by enabling a shift from “code generation” to “infrastructure operation.” Instead of treating the LLM as a junior programmer that writes scripts for humans to review and run, Codex empowers the model to execute code itself. We provide it with tools, executable functions that act as its hands and eyes, via the Model Context Protocol. MCP defines a clear interface between the AI model and the outside world, allowing us to expose high-level capabilities like cmdb_stage_transaction without teaching the model the complex internal API of our CMDB. The model learns to use the tool, not the underlying API.
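To make that concrete, here is a minimal sketch of exposing such a capability, assuming the official MCP Python SDK; the body of cmdb_stage_transaction is a hypothetical stand-in for our real CMDB client.

```python
# Minimal sketch: exposing a CMDB staging tool to the agent over MCP.
# Assumes the official MCP Python SDK ("mcp" package); the tool body is
# a hypothetical stand-in for a real CMDB client.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("k8s-agent-tools")

@mcp.tool()
def cmdb_stage_transaction(record_id: str, operation: str, payload: dict) -> str:
    """Stage a create, update, or delete against the CMDB without committing it.

    The model sees only this signature and docstring; the CMDB's internal
    API never enters its context.
    """
    # A real implementation would write to a staging store, never to production.
    return f"Staged {operation} for {record_id}; pending review."

if __name__ == "__main__":
    mcp.run()  # serve the tools over stdio so the Codex CLI can call them
```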

The architecture of agency

Our system, which we call k8s-agent, consists of three distinct layers. This isn’t a single script running top to bottom; it’s a cognitive architecture.

The cognitive layer (Codex + contextual instructions): This is the Codex CLI running a specific system prompt. We don’t fine-tune the model weights. Infrastructure moves too fast for fine-tuning: a model trained on Kubernetes v1.25 would be hallucinating by v1.30. Instead, we use context engineering: the art of designing the environment in which the AI operates. This involves tool design (creating atomic, deterministic functions), prompt architecture (structuring the system prompt), and data architecture (deciding what information to hide or expose). We feed the model a persistent context file (AGENTS.md) that defines its persona: “You are a meticulous infrastructure auditor. Your goal is to ensure the CMDB accurately reflects the state of the Kubernetes cluster. You must prioritize safety: do not delete records unless you have positive confirmation that they are orphans.”

The tool layer: Using MCP, we expose deterministic Python functions to the agent.

  • Sensors: k8s_list_workloads, cmdb_query_service, k8s_get_deployment_spec
  • Actuators: cmdb_stage_create, cmdb_stage_update, cmdb_stage_delete

Note that we track workloads (Deployments, StatefulSets), not Pods. Pods are ephemeral; tracking them in a CMDB is an antipattern that creates noise. The agent understands this distinction, a semantic rule that’s hard to enforce in a rigid script.
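As an illustration of the sensor side, here is a sketch of what k8s_list_workloads might look like, assuming the official Kubernetes Python client; the digest fields shown are illustrative rather than our exact schema.

```python
# Hypothetical sketch of the k8s_list_workloads sensor.
# Assumes the official `kubernetes` Python client; fields are illustrative.
from kubernetes import client, config

def k8s_list_workloads(namespace: str = "default") -> list[dict]:
    """Return a compact digest of Deployments and StatefulSets (never Pods)."""
    config.load_kube_config()  # or load_incluster_config() inside the cluster
    apps = client.AppsV1Api()
    workloads = []
    for dep in apps.list_namespaced_deployment(namespace).items:
        workloads.append({
            "kind": "Deployment",
            "name": dep.metadata.name,
            "replicas": dep.spec.replicas,
            "labels": dep.metadata.labels or {},
        })
    for sts in apps.list_namespaced_stateful_set(namespace).items:
        workloads.append({
            "kind": "StatefulSet",
            "name": sts.metadata.name,
            "replicas": sts.spec.replicas,
            "labels": sts.metadata.labels or {},
        })
    return workloads
```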

The state layer (the safety net): LLMs are probabilistic; infrastructure must be deterministic. We bridge this gap with a staging pattern. The agent never writes directly to the production database. It writes to a staged diff. This allows a human (or a policy engine) to review the proposed changes before they’re committed.
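Here is a minimal sketch of that staging pattern, assuming a flat JSON file as the staging area; a real deployment could use a database table or a pull-request-style review queue instead.

```python
# Hypothetical sketch of the staging pattern: proposed changes land in a
# reviewable diff file, never in the production CMDB directly.
import json
from datetime import datetime, timezone
from pathlib import Path

STAGING_FILE = Path("cmdb_staged_diff.json")  # assumed staging location

def stage_change(operation: str, record_id: str, payload: dict, comment: str) -> None:
    """Append a proposed create/update/delete to the staged diff."""
    staged = json.loads(STAGING_FILE.read_text()) if STAGING_FILE.exists() else []
    staged.append({
        "operation": operation,      # "create" | "update" | "delete"
        "record_id": record_id,
        "payload": payload,
        "comment": comment,          # the agent's justification
        "staged_at": datetime.now(timezone.utc).isoformat(),
        "status": "pending_review",  # a human or policy engine flips this
    })
    STAGING_FILE.write_text(json.dumps(staged, indent=2))
```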

The OODA Loop in Action

How does this differ from a typical sync script? A script follows a linear path: Connect → Fetch → Write. If any step fails or returns unexpected data, the script crashes or corrupts data. Our agent follows the Observe-Orient-Decide-Act (OODA) loop, popularized by military strategists. Unlike a linear script that executes blindly, the OODA loop forces the agent to pause and synthesize information before taking action. This cycle allows it to handle incomplete data, verify assumptions, and adapt to changing circumstances: traits essential for operating in a distributed system.

Let’s walk through a real scenario we encountered during our pilot, the Ghost Deployment, to explore the benefits of using an OODA loop. A developer had deleted a deployment named payment-processor-v1 from the cluster but forgot to remove the record from the CMDB. A standard script might pull the list of deployments, see payment-processor-v1 is missing, and immediately issue a DELETE to the database. The risk is obvious: what if the API server was simply timing out? What if the script had a bug in its pagination logic? The script blindly destroys data based on the absence of evidence.

The agent approach is fundamentally different. First, it observes: calling k8s_list_workloads and cmdb_query_service and noticing the discrepancy. Second, it orients: checking its context instructions to “verify orphans before deletion” and deciding to call k8s_get_event_history. Third, it decides: seeing a “delete” event in the logs, it reasons that the resource is genuinely missing because a deletion event occurred. Finally, it acts: calling cmdb_stage_delete with a comment confirming the deletion. The agent didn’t just sync data; it investigated. It handled the ambiguity that usually breaks automation.
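Written out as plain logic, the verification step looks something like the sketch below; the tool functions are stubbed stand-ins for the MCP tools named above, so the flow can run end to end.

```python
# Hypothetical sketch of the orphan-verification step from the Ghost
# Deployment scenario. Tool functions are stubs standing in for MCP tools.

def k8s_list_workloads() -> list[dict]:
    return []  # stub: the cluster no longer contains payment-processor-v1

def cmdb_query_service(name: str) -> dict | None:
    return {"name": name}  # stub: a stale CMDB record still exists

def k8s_get_event_history(name: str) -> list[dict]:
    return [{"type": "delete", "reason": "DeploymentDeleted"}]  # stub

def cmdb_stage_delete(name: str, comment: str) -> None:
    print(f"staged delete of {name}: {comment}")  # stub

def reconcile_missing_workload(name: str) -> None:
    # Observe: gather both views of the world.
    in_cluster = any(w.get("name") == name for w in k8s_list_workloads())
    record = cmdb_query_service(name)
    if in_cluster or record is None:
        return  # no discrepancy to resolve

    # Orient: context rules demand positive confirmation before deletion.
    events = k8s_get_event_history(name)

    # Decide: absence of evidence is not enough; look for a delete event.
    if any(e.get("type") == "delete" for e in events):
        # Act: stage the delete for review, with a justification attached.
        cmdb_stage_delete(name, comment="Verified deletion event in cluster history")
    else:
        # Ambiguous: could be an API timeout or a pagination bug. Do nothing.
        print(f"Discrepancy for {name} unverified; leaving record in place.")

reconcile_missing_workload("payment-processor-v1")
```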

Solving the “Semantic Gap”

This specific Kubernetes use case highlights a broader problem in IT operations: the “semantic gap.” The data in our infrastructure (JSON, YAML, logs) is full of implicit meaning. A label env: production changes the criticality of a resource. A status of CrashLoopBackOff means “broken,” but Completed means “finished successfully.” Traditional scripts require us to hardcode every permutation of this logic, resulting in thousands of lines of unmaintainable if/else statements. With the Codex CLI, we replace those thousands of lines of code with a few sentences of English in the system prompt: “Ignore Jobs that have completed successfully. Sync failing Jobs so we can track instability.” The LLM bridges the semantic gap. It understands what “instability” implies in the context of a job status. We describe our intent, and the agent handles the implementation.
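For contrast, here is a sketch of the kind of hand-coded classifier those prompt sentences replace; the rule set is deliberately tiny and the status strings are illustrative.

```python
# Sketch of the brittle, hand-coded logic the system prompt replaces.
# Historically, every new status or label meant another branch here.

def should_sync_job(status: str, labels: dict) -> bool:
    if status == "Completed":
        return False  # finished successfully: not worth tracking
    if status in ("Failed", "CrashLoopBackOff"):
        return True   # failing: sync so we can track instability
    if labels.get("env") == "production":
        return True   # production resources are always critical
    # ...and so on, for every permutation anyone ever thought of.
    return False
```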

Scaling Beyond Kubernetes

We started with Kubernetes because it’s the “hard mode” of configuration management. In a production environment with thousands of workloads, things change constantly. A standard script sees a snapshot and often gets it wrong. An agent, however, can work through the complexity. It might run its OODA loop multiple times to resolve a single issue: checking logs, verifying dependencies, and confirming rules before it ever makes a change. This ability to chain reasoning steps allows it to handle the scale and uncertainty that break traditional automation.

But the pattern we established, agentic OODA loops via MCP, is universal. Once we proved the model worked for Pods and Services, we realized we could extend it. For legacy infrastructure, we could give the agent tools to SSH into Linux VMs. For SaaS management, we could give it access to Salesforce or GitHub APIs. For cloud governance, we can ask it to audit AWS Security Groups. The beauty of this architecture is that the “brain” (the Codex CLI) stays the same. To support a new environment, we don’t have to rewrite the engine; we just hand it a new set of tools.

However, moving to an agentic model forces us to confront new trade-offs. The most immediate is cost versus context. We learned the hard way that you shouldn’t give the AI the raw YAML of a Kubernetes deployment: it consumes too many tokens and distracts the model with irrelevant details. Instead, you create a tool that returns a digest, a simplified JSON object with only the fields that matter. This is context optimization, and it’s the secret to running agents cost-effectively.
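As a sketch of that digest idea, a tool can reduce a raw Deployment object to a handful of fields before it ever reaches the model; the field selection here is illustrative.

```python
# Hypothetical sketch of context optimization: compress a raw Deployment
# object into a token-cheap digest before handing it to the model.

def deployment_digest(dep: dict) -> dict:
    """Keep only the fields the agent actually reasons about."""
    meta = dep.get("metadata", {})
    spec = dep.get("spec", {})
    status = dep.get("status", {})
    return {
        "name": meta.get("name"),
        "namespace": meta.get("namespace"),
        "env": (meta.get("labels") or {}).get("env"),
        "desired_replicas": spec.get("replicas"),
        "ready_replicas": status.get("readyReplicas", 0),
        "images": [
            c.get("image")
            for c in spec.get("template", {}).get("spec", {}).get("containers", [])
        ],
    }
```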

Conclusion: The Human in the Cockpit

There’s a fear that AI will replace the DevOps engineer. Our experience with the Codex CLI suggests the opposite. This technology doesn’t remove the human; it elevates them. It promotes the engineer from “script writer” to “mission commander.” The stale CMDB was never really a data problem; it was a labor problem. It was simply too much work for humans to track manually and too complex for simple scripts to automate. By introducing an agent that can reason, we finally have a mechanism capable of keeping up with the cloud.

We started with a small Kubernetes cluster. But the destination is an infrastructure that is self-documenting, self-healing, and fundamentally intelligible. The era of the brittle sync script is over. The era of infrastructure as intent has begun!

