How Kemon Runs AI Agents on Live Production Servers

Kengyew Tham·April 14, 2026·6 min read

How Kemon Runs AI Agents on Live Production Servers

Keywords: claude code production server, ai agent devops, mcp server automation, ai-driven ops agency

Introduction

Most AI implementations treat agents as chat tools or code generators. At Kemon, we operate AI agents as production infrastructure operators. They run directly on live servers, manage deployments, fix infrastructure, and execute diagnostics that would otherwise require days of developer time. This article explains how we've built the safety protocols, operational patterns, and tooling to make it work at scale.

What It Means to Run AI Agents on Production

Running an AI agent on live infrastructure is fundamentally different from using AI to draft code or generate content. You're asking the agent to:

SSH into production servers and edit configuration files
Deploy code and restart services
Analyze multi-million-line production logs
Make decisions about infrastructure that affect customers directly

This requires accountability. Our agents operate under the Live Site Protocol — a set of risk assessments, backup procedures, approval gates, and rollback patterns that ensure no production change is irreversible.

The Infrastructure Stack

We manage three classes of production infrastructure:

Client E-Commerce Platform (DigitalOcean)

Rails backend with Passenger application server
Nginx reverse proxy with cookie-gated routing
PostgreSQL database with Redis caching
SSH access to manage deployments, logs, and infrastructure

Kemon Infrastructure (Hostinger VPS)

Next.js frontend with PM2 process management
Email infrastructure (Dovecot IMAP, Postfix SMTP)
Systemd services and health checks
SSL certificate management via certbot

Cloud Services (Browser Automation)

Cloudflare WAF rules and DDoS protection
Google Ads campaigns and bid strategy management
Google Analytics 4 measurement stack validation
Notion workspace and documentation management

Real Production Actions We've Shipped

Routing and Load Management

Implemented cookie-gated routing for Next.js product page A/B testing
Right-sized Passenger worker pools based on request queue analysis
Created Cloudflare WAF rules to block malicious traffic
Deployed autossh tunnels for infrastructure connectivity

Code Deployment

Deployed GTM measurement changes via SSH-editing Rails templates
Generated and deployed SSL certificates with DNS validation
Created DNS records (A, CNAME, MX) via hosting provider APIs
Deployed Next.js updates with zero-downtime restarts

Infrastructure Maintenance

Fixed Dovecot email authentication errors
Added systemd health-check cron jobs
Analyzed production logs to identify error patterns
Ran Rails database queries and migrations

Operations and Diagnostics

Extracted and analyzed production logs
Debugged session and cookie issues
Validated GA4 measurement stack integrity
Backtested TradingView trading indicators

The MCP Stack: Extending Agent Capabilities

Model Context Protocol (MCP) is how we extend AI agent capabilities beyond SSH and basic CLI tools:

TradingView MCP — Chart control, indicator management, backtesting, data extraction
Alpaca MCP — Trading API integration, market data, account management
Gmail MCP — Email management for alerts and handoffs
Notion MCP — Workspace and database management
Chrome MCP — DOM-aware browser automation for cloud services

This lets agents operate across infrastructure silos in a single coherent loop without custom integration code.

Safety Protocols: How We Prevent Catastrophes

The Live Site Protocol has three phases:

Before Production Changes

Risk classification (kill-switch reversibility, blast radius)
Config backups and database snapshots
Approval gates for medium/high-risk changes

During Execution

Full logging and audit trails
Kill-switch patterns (revert in 30 seconds)
No cascading failures

After Deployment

5–10 minute monitoring period
Error rate and health check validation
Immediate rollback if anomalies emerge

The Competitive Advantage

What used to take a developer 3–5 days now takes 30 minutes. Infrastructure audits that required manual investigation now run autonomously. Code deployments that required coordinating multiple stakeholders now execute with a single approval gate.

The safety protocols aren't overhead — they're what make this possible at scale. Every guardrail (backups, logging, approval gates) eliminates risk and enables speed.

Why This Matters for Your Business

If your infrastructure operations are bottlenecked by:

Manual deployment steps
Context-switching between tools
Firefighting instead of proactive operations
Audit trails that don't exist

AI-driven operations can solve these problems — but only with proper guardrails in place.

FAQ

Q: Isn't it risky to run AI agents on production servers?

A: Yes, if you don't have guardrails. Our Live Site Protocol makes it safe: risk assessments before changes, backups before edits, approval gates for medium/high-risk changes, and kill-switch patterns that let us revert in 30 seconds. We've caught more problems in the backup and approval phases than we've ever had to roll back.

Q: What happens if the agent makes a mistake?

A: We catch it in two places: the approval phase (for medium/high-risk changes) and the monitoring phase (5–10 minutes post-deployment, watching error rates and health checks). If an issue emerges, we roll back immediately. We never push through anomalies hoping they resolve.

Q: How long does it take to set this up?

A: The infrastructure is custom to our stack. But the pattern is portable: you need SSH access, logging, backups, approval gates, and kill-switch patterns. The MCP stack (TradingView, Alpaca, Gmail, Notion, Chrome) plugs in via API, not custom code.

Q: Can you do this on Kubernetes / serverless / my cloud provider?

A: Yes. The pattern is: SSH access (or equivalent), infrastructure as code (Terraform, CloudFormation), version control, and approval gates. Kubernetes deployments follow the same protocol as Rails deployments. Serverless functions can be managed via API (Lambda, Cloud Functions).

Q: What about compliance and audit trails?

A: Every SSH command is logged. Every config change is version-controlled. Every deployment is timestamped and annotated. This creates a complete audit trail for compliance, debugging, and post-mortem analysis. We store logs for 90 days minimum.

Q: How much does AI-driven operations cost?

A: Less than hiring a DevOps engineer, and faster than the context-switching cost of ad-hoc operations. An agent can run 24/7 health checks and diagnostics without human intervention. Deployments that take a day of developer time take 30 minutes.

Claude CodeAI AgentsDevOpsMCPAutomation

← All posts