How Kemon Runs AI Agents on Live Production Servers

Kengyew Tham·April 14, 2026·6 min read

Keywords: claude code production server, ai agent devops, mcp server automation, ai-driven ops agency


Introduction

Most AI implementations treat agents as chat tools or code generators. At Kemon, we operate AI agents as production infrastructure operators. They run directly on live servers, manage deployments, fix infrastructure, and execute diagnostics that would otherwise require days of developer time. This article explains how we've built the safety protocols, operational patterns, and tooling to make it work at scale.

What It Means to Run AI Agents on Production

Running an AI agent on live infrastructure is fundamentally different from using AI to draft code or generate content. You're asking the agent to:

  • SSH into production servers and edit configuration files
  • Deploy code and restart services
  • Analyze multi-million-line production logs
  • Make decisions about infrastructure that affect customers directly

This requires accountability. Our agents operate under the Live Site Protocol — a set of risk assessments, backup procedures, approval gates, and rollback patterns that ensure no production change is irreversible.

The Infrastructure Stack

We manage three classes of production infrastructure:

Client E-Commerce Platform (DigitalOcean)

  • Rails backend with Passenger application server
  • Nginx reverse proxy with cookie-gated routing
  • PostgreSQL database with Redis caching
  • SSH access to manage deployments, logs, and infrastructure

Kemon Infrastructure (Hostinger VPS)

  • Next.js frontend with PM2 process management
  • Email infrastructure (Dovecot IMAP, Postfix SMTP)
  • Systemd services and health checks
  • SSL certificate management via certbot

Cloud Services (Browser Automation)

  • Cloudflare WAF rules and DDoS protection
  • Google Ads campaigns and bid strategy management
  • Google Analytics 4 measurement stack validation
  • Notion workspace and documentation management

Real Production Actions We've Shipped

Routing and Load Management

  • Implemented cookie-gated routing for Next.js product page A/B testing
  • Right-sized Passenger worker pools based on request queue analysis
  • Created Cloudflare WAF rules to block malicious traffic
  • Deployed autossh tunnels for infrastructure connectivity
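
The cookie-gated routing above can be sketched in Nginx. This is a minimal illustration, not our production config; the cookie name `ab_variant`, the upstream names, and the ports are placeholders:

```nginx
# Illustrative only: cookie name, upstreams, and ports are placeholders.
# Requests carrying ab_variant=b go to the Next.js variant page;
# everything else stays on the Passenger-served Rails backend.
upstream rails_backend  { server 127.0.0.1:3000; }
upstream nextjs_variant { server 127.0.0.1:3001; }

map $cookie_ab_variant $product_pool {
    default rails_backend;
    "b"     nextjs_variant;
}

server {
    listen 443 ssl;
    server_name shop.example.com;

    location /products/ {
        proxy_pass http://$product_pool;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```

Because `map` is evaluated per request, the A/B split needs no application code, and removing the `map` block reverts the experiment.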

Code Deployment

  • Deployed GTM measurement changes by editing Rails templates over SSH
  • Generated and deployed SSL certificates with DNS validation
  • Created DNS records (A, CNAME, MX) via hosting provider APIs
  • Deployed Next.js updates with zero-downtime restarts

Infrastructure Maintenance

  • Fixed Dovecot email authentication errors
  • Added systemd health-check cron jobs
  • Analyzed production logs to identify error patterns
  • Ran Rails database queries and migrations
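
Recurring health checks like these can run from cron or from a systemd timer; a timer keeps the check visible to `systemctl` alongside the services it watches. A sketch, with unit names and the script path purely illustrative:

```ini
# /etc/systemd/system/health-check.service (illustrative path)
[Unit]
Description=Periodic infrastructure health check

[Service]
Type=oneshot
ExecStart=/usr/local/bin/health-check.sh

# /etc/systemd/system/health-check.timer
[Unit]
Description=Run the health check every 5 minutes

[Timer]
OnCalendar=*:0/5
Persistent=true

[Install]
WantedBy=timers.target
```

Enable with `systemctl enable --now health-check.timer`; failures then show up in `journalctl -u health-check.service` alongside the rest of the audit trail.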

Operations and Diagnostics

  • Extracted and analyzed production logs
  • Debugged session and cookie issues
  • Validated GA4 measurement stack integrity
  • Backtested TradingView trading indicators
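
Log-pattern analysis of this kind reduces to a small aggregation: extract an error class from each line, then count. A minimal Python sketch (the log format here is invented for illustration):

```python
import re
from collections import Counter

# Invented log lines standing in for real production output.
LOG = """\
2026-04-01T10:00:01 ERROR PG::ConnectionBad could not connect
2026-04-01T10:00:02 INFO  GET /products 200
2026-04-01T10:00:05 ERROR Redis::TimeoutError read timed out
2026-04-01T10:00:09 ERROR PG::ConnectionBad could not connect
"""

def error_patterns(log_text: str) -> Counter:
    """Count occurrences of each ERROR class in a log dump."""
    pattern = re.compile(r"ERROR\s+(\S+)")
    return Counter(m.group(1) for m in pattern.finditer(log_text))

counts = error_patterns(LOG)
print(counts.most_common(1))  # → [('PG::ConnectionBad', 2)]
```

The same loop scales to multi-million-line dumps because it streams matches rather than loading structure into memory.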

The MCP Stack: Extending Agent Capabilities

Model Context Protocol (MCP) is how we extend AI agent capabilities beyond SSH and basic CLI tools:

  • TradingView MCP — Chart control, indicator management, backtesting, data extraction
  • Alpaca MCP — Trading API integration, market data, account management
  • Gmail MCP — Email management for alerts and handoffs
  • Notion MCP — Workspace and database management
  • Chrome MCP — DOM-aware browser automation for cloud services

This lets agents operate across infrastructure silos in a single coherent loop without custom integration code.
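
MCP servers are declared in a JSON config that the agent runtime reads at startup. The fragment below shows the general shape; the server names, packages, and paths are placeholders, not our actual configuration:

```json
{
  "mcpServers": {
    "notion": {
      "command": "npx",
      "args": ["-y", "@notionhq/notion-mcp-server"],
      "env": { "NOTION_TOKEN": "${NOTION_TOKEN}" }
    },
    "gmail": {
      "command": "node",
      "args": ["./mcp/gmail-server.js"]
    }
  }
}
```

Each entry spawns a local process speaking the MCP protocol over stdio; adding a capability is adding an entry, not writing integration code.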

Safety Protocols: How We Prevent Catastrophes

The Live Site Protocol has three phases:

Before Production Changes

  • Risk classification (kill-switch reversibility, blast radius)
  • Config backups and database snapshots
  • Approval gates for medium/high-risk changes
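
Risk classification can be as simple as a lookup from reversibility and blast radius to an approval requirement. A toy Python sketch; the thresholds are illustrative, not our real policy:

```python
def classify(reversible_in_seconds: int, customers_affected: int) -> str:
    """Map reversibility and blast radius to a risk tier (illustrative thresholds)."""
    if reversible_in_seconds <= 30 and customers_affected == 0:
        return "low"      # auto-approved
    if reversible_in_seconds <= 300 and customers_affected < 100:
        return "medium"   # needs one approval
    return "high"         # needs explicit sign-off

print(classify(10, 0))      # → low
print(classify(600, 5000))  # → high
```

Encoding the gate as code means the agent cannot skip it: the tier is computed before any change command is assembled.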

During Execution

  • Full logging and audit trails
  • Kill-switch patterns (revert in 30 seconds)
  • Change isolation, so a single failure cannot cascade across services

After Deployment

  • 5–10 minute monitoring period
  • Error rate and health check validation
  • Immediate rollback if anomalies emerge
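
The backup and kill-switch steps can be sketched in a few lines. This toy version works on local files; in practice the same pattern runs over SSH against the real config path:

```python
import shutil, tempfile, time
from pathlib import Path

def backup(path: Path) -> Path:
    """Copy a config aside before editing; the copy is the kill switch."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = path.with_name(path.name + f".bak.{stamp}")
    shutil.copy2(path, dest)
    return dest

def rollback(path: Path, backup_path: Path) -> None:
    """Revert the change by restoring the pre-change copy."""
    shutil.copy2(backup_path, path)

# Simulated change-and-revert in a temp directory.
workdir = Path(tempfile.mkdtemp())
cfg = workdir / "nginx.conf"
cfg.write_text("worker_processes 2;\n")
bak = backup(cfg)
cfg.write_text("worker_processes 8;\n")  # the risky change
rollback(cfg, bak)                       # kill switch: restore in seconds
print(cfg.read_text().strip())  # → worker_processes 2;
```

Rollback is a single file copy plus a service reload, which is why the revert window stays under 30 seconds.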

The Competitive Advantage

What used to take a developer 3–5 days now takes 30 minutes. Infrastructure audits that required manual investigation now run autonomously. Code deployments that required coordinating multiple stakeholders now execute with a single approval gate.

The safety protocols aren't overhead — they're what make this possible at scale. Every guardrail (backups, logging, approval gates) reduces risk, and that reduced risk is what enables the speed.

Why This Matters for Your Business

If your infrastructure operations are bottlenecked by:

  • Manual deployment steps
  • Context-switching between tools
  • Firefighting instead of proactive operations
  • Audit trails that don't exist

then AI-driven operations can solve these problems, but only with proper guardrails in place.


FAQ

Q: Isn't it risky to run AI agents on production servers?

A: Yes, if you don't have guardrails. Our Live Site Protocol makes it safe: risk assessments before changes, backups before edits, approval gates for medium/high-risk changes, and kill-switch patterns that let us revert in 30 seconds. We've caught more problems in the backup and approval phases than we've ever had to roll back.

Q: What happens if the agent makes a mistake?

A: We catch it in two places: the approval phase (for medium/high-risk changes) and the monitoring phase (5–10 minutes post-deployment, watching error rates and health checks). If an issue emerges, we roll back immediately. We never push through anomalies hoping they resolve.

Q: How long does it take to set this up?

A: The infrastructure is custom to our stack. But the pattern is portable: you need SSH access, logging, backups, approval gates, and kill-switch patterns. The MCP stack (TradingView, Alpaca, Gmail, Notion, Chrome) plugs in via API, not custom code.

Q: Can you do this on Kubernetes / serverless / my cloud provider?

A: Yes. The pattern is: SSH access (or equivalent), infrastructure as code (Terraform, CloudFormation), version control, and approval gates. Kubernetes deployments follow the same protocol as Rails deployments. Serverless functions can be managed via API (Lambda, Cloud Functions).

Q: What about compliance and audit trails?

A: Every SSH command is logged. Every config change is version-controlled. Every deployment is timestamped and annotated. This creates a complete audit trail for compliance, debugging, and post-mortem analysis. We store logs for 90 days minimum.

Q: How much does AI-driven operations cost?

A: Less than hiring a DevOps engineer, and faster than the context-switching cost of ad-hoc operations. An agent can run 24/7 health checks and diagnostics without human intervention. Deployments that take a day of developer time take 30 minutes.

Claude Code · AI Agents · DevOps · MCP · Automation