Building a Complete CI/CD Automation Pipeline with Agent Skills: Let an AI Agent Be Your DevOps Engineer
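The previous article covered the concepts behind Agent Skills and a hands-on introduction: a Git change-summary Skill in 5 minutes and a set of code-review Skills in 30. That was still solo work.
A production-grade CI/CD pipeline involves code review, security scanning, build and test, Docker images, K8s deployment, and rollback verification: six stages and dozens of checkpoints. Until now that meant thousands of lines of YAML, a dozen or more shell scripts to maintain, and hoping that Friday deploys go smoothly.
In this article we pack the whole pipeline into 6 Agent Skills that the AI agent can load. One command takes you from `git push` to a production health check, automated end to end.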
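1. Why Is CI/CD the Most Valuable Battleground for Agent Skills?
1.1 Three Pain Points of Traditional CI/CD
| Pain point | Symptom | How an Agent Skill addresses it |
|---|---|---|
| Configuration hell | A single GitHub Actions workflow runs to 200+ lines of YAML; changing one check means digging through all of it | Each Skill owns one stage, so it is obvious where a change belongs |
| Knowledge gaps | Only one veteran knows why staging must run database migrations first; when that person leaves, the pipeline breaks | The standard operating procedure is captured in the Skill, so the knowledge stays with the project |
| Security blind spots | CI was set up without security scanning, or SAST was added but container scanning was forgotten | Each Skill carries a built-in security checklist, so required checks are not skipped by oversight |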
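1.2 A Real-World Comparison
The traditional way:
Developer pushes code
→ writes .github/workflows/ci.yml (200+ lines)
→ writes a Dockerfile (forgets the multi-stage build)
→ writes k8s/deployment.yaml (forgets resource limits)
→ writes rollback.sh (forgets health-check verification)
→ deploys to production
→ gets woken by PagerDuty at 3 a.m.
The Agent Skills way:
Developer pushes code
→ /ci-review ← reviews the CI configuration automatically
→ /security-scan ← SAST + dependency scan + secret detection
→ /docker-build ← multi-stage build + image scan + push
→ /k8s-deploy ← GitOps deployment + health checks
→ /deploy-verify ← smoke tests + rollback-readiness confirmation
→ sleeps well
1.3 Core Design Idea: A Skill Is Both the Manual and the Inspector
A traditional CI/CD config file is a "dumb pipe": it executes whatever it is given and never questions it. Forgot the security scan? It will not remind you. Used the latest tag? It deploys anyway.
An Agent Skill is a "smart pipe": it knows what should be done and what must not be done. Each Skill carries MUST DO and MUST NOT DO rules, which is like having an engineer with 10 years of DevOps experience review every step as you go.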
2. Pipeline Overview: 6 Skills Covering the Full CI/CD Flow
2.1 Architecture Overview
              ┌─────────────────────────────────┐
              │       Main Agent (Claude)       │
              │    orchestrates all 6 Skills    │
              └────────────────┬────────────────┘
                               │
        ┌──────────────────────┼──────────────────────┐
        │                      │                      │
┌───────▼───────┐      ┌───────▼───────┐      ┌───────▼───────┐
│    Stage 1    │      │    Stage 2    │      │    Stage 3    │
│   CI review   │ ───→ │ Security scan │ ───→ │ Docker build  │
│   ci-review   │      │ security-scan │      │ docker-build  │
└───────────────┘      └───────────────┘      └───────┬───────┘
                                                      │
        ┌─────────────────────────────────────────────┘
        │
┌───────▼───────┐      ┌───────────────┐
│    Stage 4    │      │    Stage 5    │
│  K8s deploy   │ ───→ │ Deploy verify │
│  k8s-deploy   │      │ deploy-verify │
└───────────────┘      └───────┬───────┘
                               │
                       ┌───────▼───────┐
                       │    Stage 6    │
                       │Rollback ready │
                       │rollback-guard │
                       └───────────────┘
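2.2 Responsibilities of the 6 Skills
| Skill | Trigger | Core responsibility | Risk blocked |
|---|---|---|---|
| `ci-review` | PR opened | Review CI configuration, check dependencies, validate build scripts | Invalid workflow configuration |
| `security-scan` | CI stage | SAST + dependency vulnerabilities + secret leaks | Code with vulnerabilities or hardcoded secrets being merged |
| `docker-build` | Build stage | Multi-stage builds, image scanning, tag policy | `latest` tags and images that run as root |
| `k8s-deploy` | Deploy stage | Generate K8s manifests, GitOps sync, canary configuration | Deployments without resource limits |
| `deploy-verify` | After deployment | Smoke tests, health checks, metric validation | Unhealthy deployments continuing to receive traffic |
| `rollback-guard` | Throughout | Rollback plan generation, rollback verification, rollback drills | Deployments without a rollback plan |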
2.3 Project Directory Structure
my-project/
├── .claude/
│   └── skills/
│       ├── ci-review/
│       │   └── SKILL.md
│       ├── security-scan/
│       │   ├── SKILL.md
│       │   └── scripts/
│       │       ├── sast-scan.sh
│       │       ├── dependency-check.sh
│       │       └── secret-detect.sh
│       ├── docker-build/
│       │   ├── SKILL.md
│       │   └── assets/
│       │       └── Dockerfile.template
│       ├── k8s-deploy/
│       │   ├── SKILL.md
│       │   ├── references/
│       │   │   └── K8S-BEST-PRACTICES.md
│       │   └── assets/
│       │       └── deployment-template.yaml
│       ├── deploy-verify/
│       │   └── SKILL.md
│       └── rollback-guard/
│           ├── SKILL.md
│           └── scripts/
│               └── rollback-test.sh
├── .github/
│   └── workflows/
│       └── ci.yml
├── k8s/
│   ├── deployment.yaml
│   ├── service.yaml
│   └── ingress.yaml
├── Dockerfile
└── src/
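3. Stage 1: CI Configuration Review Skill
3.1 Design Rationale
CI configuration is the stage most often neglected: developers tend to copy a template, change the branch name, and call it done. A poor CI configuration has real costs: broken caching slows builds, overly broad permissions can leak secrets, and missing timeouts waste runner time.
3.2 SKILL.md
---
name: ci-review
description: >
  Reviews CI/CD pipeline configurations for correctness, security,
  and performance. Use when creating or modifying GitHub Actions,
  GitLab CI, or Jenkins pipeline configs, or when the user asks
  to review their CI setup.
metadata:
  author: devops-team
  version: "1.0"
---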
# CI Configuration Review Skill
## Instructions
### Step 1: Identify CI System
Detect which CI system is in use:
- `.github/workflows/*.yml` → GitHub Actions
- `.gitlab-ci.yml` → GitLab CI
- `Jenkinsfile` → Jenkins
### Step 2: Structural Review
Check the following for **each** workflow file:
**Security Checks:**
- [ ] No hardcoded secrets (use `${{ secrets.* }}` instead)
- [ ] `permissions:` block is set to minimum required
- [ ] Third-party actions are pinned to SHA, not just tag
- [ ] `pull_request` workflows don't have `write` permissions
- [ ] No `AWS_ACCESS_KEY_ID` in env vars without secrets reference
**Performance Checks:**
- [ ] Caching is enabled (npm/pip/docker layer caching)
- [ ] Jobs run in parallel where possible (no unnecessary `needs:`)
- [ ] Timeout limits are set (`timeout-minutes: 15` default)
- [ ] Artifact retention is configured (not infinite)
**Reliability Checks:**
- [ ] Branch filters are correct (main/release only for deploy)
- [ ] Failure notifications are configured
- [ ] Retry logic for flaky steps (`retry-action` or self-authored)
- [ ] Build matrix covers all supported versions
### Step 3: Generate Report
Output a structured review:
| Category | Finding | Severity | File:Line | Fix |
|----------|---------|----------|-----------|-----|
Severity levels:
- 🔴 **Block**: Must fix before merge (security risk, will break)
- 🟡 **Warning**: Should fix soon (performance, reliability)
- 🟢 **Suggestion**: Nice to have (optimization, style)
### Step 4: Auto-Fix (if requested)
When the user says "fix it", apply all 🔴 fixes automatically
and present 🟡 fixes for confirmation.
## Current CI Configuration
!`find .github/workflows -name '*.yml' -o -name '*.yaml' 2>/dev/null | head -20`
!`cat .github/workflows/*.yml 2>/dev/null || echo "No GitHub Actions workflows found"`
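One item on the checklist above, pinning third-party actions to a commit SHA, is easy to check mechanically. The following is a minimal sketch and not part of the Skill itself: the function name `unpinned_actions`, the regular expressions, and the decision to skip GitHub's own `actions/*` namespace are all assumptions made for illustration.

```python
import re

# Matches "uses: owner/repo@ref" with an optional leading "- ".
# Local actions ("./path") and "docker://" references do not match.
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*([\w.-]+/[\w./-]+)@(\S+)")
# A full commit SHA is 40 lowercase hex characters.
SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text, trusted_owners=("actions",)):
    """Return (line_number, action, ref) for third-party actions not pinned to a SHA.

    trusted_owners is a hypothetical allowlist of owners that may use tags.
    """
    findings = []
    for lineno, line in enumerate(workflow_text.splitlines(), start=1):
        match = USES_RE.match(line)
        if not match:
            continue
        action, ref = match.group(1), match.group(2)
        owner = action.split("/")[0]
        if owner in trusted_owners:
            continue
        if not SHA_RE.match(ref):
            findings.append((lineno, action, ref))
    return findings
```

Each returned tuple maps directly onto a row of the review table in Step 3 (file line, finding, suggested fix: replace the tag with the commit SHA it points to).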
4. Stage 2: Security Scan Skill (with Scripts)
4.1 Design Rationale
Security scanning is the pipeline's security checkpoint. We use three layers: SAST static code analysis, dependency vulnerability scanning, and secret-leak detection. The three scripts run independently, and the Skill orchestrates them.
4.2 SKILL.md
---
name: security-scan
description: >
  Performs multi-layer security scanning: SAST code analysis,
  dependency vulnerability check, and secret/credential leak detection.
  Use before merging PRs, during CI pipeline, or when the user
  asks for a security audit.
metadata:
  author: security-team
  version: "1.0"
allowed-tools: Bash(bandit:*) Bash(pip-audit:*) Bash(trivy:*) Bash(gitleaks:*) Read
---
# Security Scan Skill
## Overview
Three-layer security scanning for CI/CD pipelines:
1. **SAST** — Static code analysis for vulnerability patterns
2. **Dependency** — Known CVE scanning for third-party packages
3. **Secrets** — Credential and API key leak detection
## Execution
### Layer 1: SAST (Static Application Security Testing)
```bash
bash scripts/sast-scan.sh
```
The script detects the project language and runs the appropriate scanner:
- Python → `bandit`
- JavaScript/TypeScript → `semgrep`
- Go → `gosec`
- Java → `spotbugs`
### Layer 2: Dependency Vulnerability Scan
```bash
bash scripts/dependency-check.sh
```
Scans lockfiles and manifests for known CVEs:
- Python → `pip-audit` / `safety`
- Node.js → `npm audit` / `snyk`
- Go → `nancy` / `osv-scanner`
- Java → `owasp-dependency-check`
### Layer 3: Secret Detection
```bash
bash scripts/secret-detect.sh
```
Scans for:
- AWS keys, API tokens, private keys
- Database connection strings with passwords
- Hardcoded JWT secrets
- `.env` files committed to the repo
## Output Format
For each finding:
| # | Layer | Severity | File | Line | Issue | CVE/Rule | Fix |
|---|---|---|---|---|---|---|---|
Severity levels:
- 🔴 **Critical**: Exploitable vulnerability, leaked credentials → BLOCK merge
- 🟡 **High**: Known CVE with patch available → Warn, suggest fix
- 🟢 **Medium/Low**: Best practice, informational → Document only
## Decision Logic
```
if Critical findings > 0:
  → BLOCK: "Security scan failed. Fix critical issues before merging."
  → Generate fix suggestions for each critical finding
elif High findings > 3:
  → WARN: "Multiple high-severity issues found. Review recommended."
  → List all high findings with fix suggestions
else:
  → PASS: "Security scan passed."
  → Summary of low/info findings
```
## Edge Cases
- If no scanner is installed, provide installation instructions
- If scan times out (>5 min), report partial results and note which layers are incomplete
- For false positives, document the suppression rule (not just ignore)
- Never skip a layer entirely: if one fails, note it and continue with others
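The decision logic in this Skill is small enough to express directly in code, which can help when the same gate also has to run in CI. This is a sketch under assumptions: the finding format (a dict with a lowercase `severity` key) and the function name `decide` are hypothetical, and only the thresholds come from the Skill text.

```python
from collections import Counter


def decide(findings, high_warn_threshold=3):
    """Map a list of findings to BLOCK / WARN / PASS.

    Each finding is assumed to be a dict such as
    {"layer": "sast", "severity": "critical", "file": "app.py"}.
    """
    counts = Counter(f["severity"] for f in findings)
    if counts["critical"] > 0:
        return "BLOCK"   # any critical finding blocks the merge
    if counts["high"] > high_warn_threshold:
        return "WARN"    # more than three high findings triggers a warning
    return "PASS"
```

Note that with the thresholds as written, up to three high-severity findings still result in PASS; a team that wants a stricter gate could lower `high_warn_threshold`.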
### 4.3 SAST Scan Script
```bash
#!/bin/bash
# scripts/sast-scan.sh: static code security scan
set -euo pipefail
echo "=== SAST Scan Starting ==="
echo "Detecting project language..."
# Detect the project language and run the matching scanner
if [ -f "requirements.txt" ] || [ -f "pyproject.toml" ]; then
echo "→ Python project detected"
if command -v bandit &>/dev/null; then
bandit -r src/ -f json -o /tmp/sast-results.json 2>/dev/null || true
bandit -r src/ -f txt 2>/dev/null || true
else
echo "⚠ bandit not installed. Install: pip install bandit"
fi
elif [ -f "package.json" ]; then
echo "→ JavaScript/TypeScript project detected"
if command -v semgrep &>/dev/null; then
semgrep --config auto --json src/ -o /tmp/sast-results.json 2>/dev/null || true
semgrep --config auto src/ 2>/dev/null || true
else
echo "⚠ semgrep not installed. Install: pip install semgrep"
fi
elif [ -f "go.mod" ]; then
echo "→ Go project detected"
if command -v gosec &>/dev/null; then
gosec -fmt=json -out=/tmp/sast-results.json ./... 2>/dev/null || true
gosec -fmt=text ./... 2>/dev/null || true
else
echo "⚠ gosec not installed. Install: go install github.com/securego/gosec/v2/cmd/gosec@latest"
fi
elif [ -f "pom.xml" ] || [ -f "build.gradle" ]; then
echo "→ Java project detected"
echo "ℹ Java SAST requires SpotBugs. Consider CI integration."
fi
echo "=== SAST Scan Complete ==="
```
4.4 Secret Detection Script
#!/bin/bash
# scripts/secret-detect.sh: secret and credential leak detection
set -euo pipefail
echo "=== Secret Detection Starting ==="
# gitleaks scan
if command -v gitleaks &>/dev/null; then
  echo "→ Running gitleaks scan..."
  gitleaks detect --source . --no-banner --report-format json \
    --report-path /tmp/secret-results.json 2>/dev/null || true
  # Count findings in the JSON report (an array with one entry per leak)
  if [ -s /tmp/secret-results.json ]; then
    FINDING_COUNT=$(python3 -c "
import json
try:
    data = json.load(open('/tmp/secret-results.json'))
    print(len(data))
except Exception:
    print(0)
" 2>/dev/null || echo "0")
    if [ "$FINDING_COUNT" -gt 0 ]; then
      echo "🔴 CRITICAL: $FINDING_COUNT secret(s) detected!"
      gitleaks detect --source . --no-banner 2>/dev/null || true
    else
      echo "✅ No secrets detected"
    fi
  else
    echo "✅ No secrets detected"
  fi
else
  echo "⚠ gitleaks not installed. Install: brew install gitleaks"
  echo "  Alternative: docker run --rm -v $(pwd):/repo zricethezav/gitleaks detect --source /repo"
fi
# Extra check: are any .env files tracked by git?
# `git ls-files` prints only tracked paths. (`--error-unmatch` is avoided here:
# it fails as soon as one pattern has no match, which would hide a tracked .env.)
TRACKED_ENV=$(git ls-files -- .env '.env.*' 2>/dev/null || true)
if [ -n "$TRACKED_ENV" ]; then
  echo "🔴 CRITICAL: .env file(s) tracked in git:"
  echo "$TRACKED_ENV"
  echo "   Remove with: git rm --cached <file>"
fi
echo "=== Secret Detection Complete ==="
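The tracked-`.env` check at the end of the script comes down to filtering a list of tracked paths by file name. The sketch below shows that filter on its own; the function name `tracked_env_files` and the `allow` parameter (for committed sample files such as `.env.example`) are assumptions for illustration, and the input is assumed to be the output of `git ls-files` split into lines.

```python
from fnmatch import fnmatch


def tracked_env_files(tracked_paths, allow=(".env.example",)):
    """Return tracked paths whose file name is .env or .env.<suffix>.

    allow lists file names that are expected to be committed
    (hypothetical default: a template file without real values).
    """
    hits = []
    for path in tracked_paths:
        name = path.rsplit("/", 1)[-1]
        if name in allow:
            continue
        if name == ".env" or fnmatch(name, ".env.*"):
            hits.append(path)
    return hits
```

Matching on the file name rather than the full path also catches `.env` files in subdirectories, which the root-only pattern in a shell check can miss.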
5. Stage 3: Docker Build Skill (with Template)
5.1 Design Rationale
The Docker image is the pipeline's deliverable. A bad Dockerfile can bloat an image to 2 GB or hand an attacker root access. This Skill enforces multi-stage builds, a non-root user, pinned tags, and image scanning.
5.2 SKILL.md
---
name: docker-build
description: >
  Builds production-grade Docker images with multi-stage builds,
  security scanning, and proper tagging. Use when writing or
  reviewing Dockerfiles, building Docker images, or pushing to
  container registries.
metadata:
  author: devops-team
  version: "1.0"
disable-model-invocation: false
allowed-tools: Bash(docker:*) Bash(trivy:*) Read Write
---
# Docker Build Skill
## Pre-Build Checklist
Before writing or reviewing any Dockerfile, verify:
- [ ] Uses multi-stage build (builder + runtime)
- [ ] Base image uses specific tag, NOT `latest`
- [ ] Non-root user is set (`USER nonroot`)
- [ ] HEALTHCHECK instruction is present
- [ ] `.dockerignore` exists and excludes: `.git`, `node_modules`, `__pycache__`, `.env`
- [ ] No secrets are COPY'd into the image
- [ ] Build context is minimal (no unnecessary files)
## Build Process
### Step 1: Review or Generate Dockerfile
If a Dockerfile exists, review it against the checklist above.
If not, generate one using the template at [Dockerfile.template](assets/Dockerfile.template).
### Step 2: Build the Image
```bash
# Build with build args for version injection
docker build \
--build-arg VERSION=$(git describe --tags --always) \
--build-arg COMMIT_SHA=$(git rev-parse HEAD) \
-t myapp:$(git rev-parse --short HEAD) \
-t myapp:latest-build \
.
```
**Tag Policy:**
- Git SHA tag (e.g., `myapp:abc1234`) → For exact version tracking
- Branch tag (e.g., `myapp:main`) → For latest on branch
- ❌ NEVER push a `latest` tag to production registry
### Step 3: Scan the Image
```bash
# Trivy vulnerability scan
trivy image --exit-code 1 --severity HIGH,CRITICAL myapp:$(git rev-parse --short HEAD)
# If critical vulnerabilities found:
# 1. Check if base image update fixes them
# 2. If not, document the risk and get explicit approval
# 3. Never silently ignore critical vulnerabilities
```
### Step 4: Push to Registry
```bash
docker tag myapp:abc1234 ghcr.io/org/myapp:abc1234
docker push ghcr.io/org/myapp:abc1234
```
## MUST NOT DO
- ❌ Use `FROM node:latest` or any `latest` tag
- ❌ Run as root user in production
- ❌ COPY .env or any secret file
- ❌ Push `latest` tag to production registry
- ❌ Skip image scanning before push
- ❌ Use `ADD` instead of `COPY` (unless extracting tar)
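Several of the MUST NOT DO rules can be checked with a few lines of text processing, which is roughly what the agent does when it reviews a Dockerfile against the checklist. The sketch below is illustrative only: the function name `lint_dockerfile` and the rule codes are hypothetical, and it deliberately ignores line continuations and the JSON form of `COPY`.

```python
def lint_dockerfile(text):
    """Return a list of rule codes violated by a Dockerfile's text."""
    problems = []
    stages = set()      # names introduced with "FROM ... AS <name>"
    last_user = None    # argument of the final USER instruction, if any
    for raw in text.splitlines():
        parts = raw.strip().split()
        if not parts or parts[0].startswith("#"):
            continue
        instr, args = parts[0].upper(), parts[1:]
        if instr == "FROM":
            positional = [a for a in args if not a.startswith("--")]
            if not positional:
                continue
            image = positional[0]
            if len(positional) >= 3 and positional[1].upper() == "AS":
                stages.add(positional[2])
            # scratch, earlier build stages and digest-pinned images are fine
            if image == "scratch" or image in stages or "@sha256:" in image:
                continue
            tag = image.rsplit("/", 1)[-1].partition(":")[2]
            if tag in ("", "latest"):   # no tag means an implicit latest
                problems.append("latest-tag")
        elif instr == "USER" and args:
            last_user = args[0]
        elif instr == "ADD":
            problems.append("add-instead-of-copy")
        elif instr == "COPY":
            sources = [a for a in args[:-1] if not a.startswith("--")]
            if any(s.rsplit("/", 1)[-1].startswith(".env") for s in sources):
                problems.append("secret-file-copied")
    if last_user is None or last_user.split(":")[0] in ("root", "0"):
        problems.append("runs-as-root")
    return problems
```

An empty result means none of these four rules fired; it does not replace the image scan in Step 3.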
### 5.3 Dockerfile Template
```dockerfile
# assets/Dockerfile.template: multi-stage build template
# Written for a Python project; adapt as needed for Node.js/Go/Java
# ============ Stage 1: Builder ============
FROM python:3.12-slim AS builder
WORKDIR /app
# Copy the dependency manifest first (makes use of Docker layer caching)
COPY requirements.txt .
# Install dependencies into a separate prefix
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt
# ============ Stage 2: Runtime ============
FROM python:3.12-slim
# Metadata
ARG VERSION=dev
ARG COMMIT_SHA=unknown
LABEL version="${VERSION}" \
      commit-sha="${COMMIT_SHA}" \
      maintainer="dev-team"
WORKDIR /app
# Copy installed dependencies from the builder stage
COPY --from=builder /install /usr/local
# Copy application code
COPY . .
# Create a non-root user
RUN groupadd -r appuser && useradd -r -g appuser appuser
USER appuser
# Health check. The slim base image does not ship curl, so use Python itself;
# urlopen raises on connection errors and HTTP error statuses.
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health', timeout=3)" || exit 1
# Expose the service port
EXPOSE 8080
# Start command
CMD ["python", "main.py"]
```
6. Stage 4: Kubernetes Deploy Skill
6.1 SKILL.md
---
name: k8s-deploy
description: >
  Generates and reviews Kubernetes deployment manifests following
  production best practices. Use when deploying to K8s, writing
  deployment manifests, or setting up GitOps with ArgoCD/Flux.
metadata:
  author: devops-team
  version: "1.0"
allowed-tools: Bash(kubectl:*) Bash(helm:*) Read Write
---
# Kubernetes Deploy Skill
## Pre-Deploy Checklist
Every deployment MUST satisfy:
- [ ] Resource limits set (CPU + Memory for requests and limits)
- [ ] Liveness probe configured
- [ ] Readiness probe configured
- [ ] Image tag is specific SHA, NOT `latest`
- [ ] Secrets come from Secret objects (`secretRef`, `secretKeyRef`, or mounted volumes), NOT literal values in the manifest
- [ ] Pod Disruption Budget defined (for production)
- [ ] Network policy configured (if cluster requires)
- [ ] Anti-affinity rules set (for HA, ≥2 replicas)
## Deployment Flow
### Step 1: Generate/Review Manifests
Check the following files:
- `k8s/deployment.yaml` — Main deployment
- `k8s/service.yaml` — Service exposure
- `k8s/ingress.yaml` — Ingress routing
- `k8s/hpa.yaml` — Horizontal Pod Autoscaler (recommended)
Refer to [K8S-BEST-PRACTICES.md](references/K8S-BEST-PRACTICES.md)
for the complete reference.
### Step 2: Validate Manifests
```bash
# Dry-run validation
kubectl apply --dry-run=client -f k8s/
# If using Helm:
helm template myapp ./chart | kubeval --strict
```
### Step 3: GitOps Sync (ArgoCD)
```bash
# Push manifests to GitOps repo
git add k8s/
git commit -m "deploy: myapp $(git rev-parse --short HEAD)"
git push origin main
# ArgoCD auto-syncs. Monitor:
argocd app get myapp --refresh
```
### Step 4: Deploy (Non-GitOps Fallback)
```bash
kubectl apply -f k8s/
kubectl rollout status deployment/myapp -n production --timeout=300s
```
## Deployment Strategy Selection
| Strategy | When to Use | Risk Level |
|---|---|---|
| Rolling Update | Default, low-risk changes | Low |
| Blue-Green | Major version upgrades, DB schema changes | Medium |
| Canary | High-traffic services, gradual validation | Medium |
## MUST NOT DO
- ❌ Deploy without resource limits
- ❌ Use `imagePullPolicy: Always` with `latest` tag
- ❌ Expose secrets as plain env vars
- ❌ Deploy single-replica workloads to production
- ❌ Deploy on Friday without on-call coverage
### 6.2 K8s Best Practices Reference
<!-- references/K8S-BEST-PRACTICES.md -->
# Kubernetes Production Best Practices
## Resource Management
```yaml
resources:
  requests:
    cpu: "100m"       # guaranteed minimum
    memory: "128Mi"   # guaranteed minimum
  limits:
    cpu: "500m"       # upper bound (CPU is throttled above this)
    memory: "512Mi"   # upper bound (the container is OOM-killed above this)
```
Key principles:
- Always set requests; otherwise the Pod may be scheduled onto a node without enough free resources
- Always set limits, so that one misbehaving Pod cannot consume the whole node
- A requests-to-limits ratio between 1:2 and 1:5 is suggested
## Probes
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
```
## Security Context
```yaml
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
```
### 6.3 Deployment Template
```yaml
# assets/deployment-template.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: production
  labels:
    app: myapp
    version: v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: myapp
        version: v1
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      containers:
        - name: myapp
          image: ghcr.io/org/myapp:PLACEHOLDER_SHA
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          env:
            - name: VERSION
              value: "PLACEHOLDER_VERSION"
          envFrom:
            - secretRef:
                name: myapp-secrets
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir: {}
```
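The pre-deploy checklist can be applied to a manifest like this template once it has been parsed into a dictionary. The sketch below assumes the manifest is already loaded (for example with a YAML parser, which is not shown) and covers only four checklist items; the function name `check_deployment` and the message strings are hypothetical.

```python
def check_deployment(manifest):
    """Return a list of checklist violations for a parsed Deployment manifest."""
    problems = []
    spec = manifest.get("spec", {})
    if spec.get("replicas", 1) < 2:
        problems.append("fewer than 2 replicas")
    pod_spec = spec.get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name", "?")
        image = container.get("image", "")
        # Tag is whatever follows ":" in the last path segment of the image
        tag = image.rsplit("/", 1)[-1].partition(":")[2]
        if "@sha256:" not in image and tag in ("", "latest"):
            problems.append(f"{name}: image tag missing or latest")
        resources = container.get("resources", {})
        for kind in ("requests", "limits"):
            for resource in ("cpu", "memory"):
                if resource not in resources.get(kind, {}):
                    problems.append(f"{name}: resources.{kind}.{resource} not set")
        for probe in ("livenessProbe", "readinessProbe"):
            if probe not in container:
                problems.append(f"{name}: {probe} not set")
    return problems
```

Run against the template above with `PLACEHOLDER_SHA` replaced by a real tag, this returns an empty list; a manifest that uses `myapp:latest` with no `resources` block produces one message per missing field.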
7. Stage 5: Deploy Verification Skill
7.1 Design Rationale
Deployment is not the finish line; verification is. Too many teams click "deploy" and assume all is well, only to see the service return 503 five minutes later. After each deployment this Skill runs smoke tests, health checks, metric validation, and log inspection.
7.2 SKILL.md
---
name: deploy-verify
description: >
  Verifies deployment success through smoke tests, health checks,
  metric validation, and log inspection. Use after deploying to
  any environment, especially staging and production.
metadata:
  author: devops-team
  version: "1.0"
allowed-tools: Bash(kubectl:*) Bash(curl:*) Read
---
# Deploy Verify Skill
## Verification Sequence
After any deployment, execute these checks **in order**:
### Phase 1: Infrastructure Health (0-30s)
```bash
# Pod status
kubectl get pods -n production -l app=myapp
# Wait for the rollout to finish
kubectl rollout status deployment/myapp -n production --timeout=120s
# List pods that are not in the Running phase. Pods in CrashLoopBackOff
# usually still report phase Running, so also check the RESTARTS column above.
kubectl get pods -n production -l app=myapp \
  --field-selector=status.phase!=Running
```
### Phase 2: Service Health (30s-2min)
```bash
# Health endpoint
curl -sf https://myapp.example.com/health | jq .
# Readiness endpoint
curl -sf https://myapp.example.com/ready | jq .
# Metrics endpoint (if /metrics is exposed)
curl -sf https://myapp.example.com/metrics | grep "up 1"
```
### Phase 3: Smoke Tests (2-5min)
Execute critical path tests:
```bash
# Core API functionality
curl -sf -X POST https://myapp.example.com/api/v1/login \
  -H "Content-Type: application/json" \
  -d '{"user":"smoke-test","pass":"test"}' | jq .
# Database connectivity
curl -sf https://myapp.example.com/api/v1/ping | jq .
```
### Phase 4: Metric Validation (5-10min)
Check for anomalies in the first 10 minutes:
| Metric | Threshold | Action if Breached |
|---|---|---|
| Error Rate (5xx) | < 1% | Alert + consider rollback |
| Latency P99 | < 500ms | Alert + investigate |
| Pod Restarts | 0 in 10min | Alert + check logs |
| CPU Usage | < 80% | Alert + scale up |
### Phase 5: Log Inspection
```bash
# Recent ERROR log lines
kubectl logs -n production -l app=myapp --tail=100 | grep -i "error"
# OOM events
kubectl get events -n production --field-selector reason=OOMKilling
```
## Decision Matrix
```
if Phase 1 fails:
  → 🔴 CRITICAL: Pods not running. Check logs immediately.
  → Do NOT proceed to Phase 2.
if Phase 2 fails:
  → 🔴 CRITICAL: Health checks failing. Consider rollback.
  → Run: /rollback-guard
if Phase 3 fails:
  → 🟡 WARNING: Functional test failures. Investigate before allowing full traffic.
if Phase 4 fails:
  → 🟡 WARNING: Performance anomaly detected. Monitor closely.
  → If persists >15min, consider rollback.
if all phases pass:
  → ✅ DEPLOYMENT VERIFIED
  → Schedule 1-hour monitoring check
```
## Output
═══════════════════════════════════
DEPLOYMENT VERIFICATION REPORT
═══════════════════════════════════
App: myapp
Version: abc1234
Env: production
Timestamp: 2026-05-19T10:30:00Z
Phase 1 - Infrastructure: ✅ PASS
Phase 2 - Service Health: ✅ PASS
Phase 3 - Smoke Tests: ✅ PASS
Phase 4 - Metrics: 🟡 WARNING (latency P99: 620ms)
Phase 5 - Logs: ✅ PASS
Overall: 🟡 DEPLOYED WITH WARNINGS
Action: Monitor latency for next 15 minutes
═══════════════════════════════════
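The Phase 4 thresholds and the decision matrix can be combined into two small functions. This is a sketch under assumptions: the phase keys, function names, and return format are hypothetical, and Phase 5 (log inspection) is left out because the decision matrix does not assign it an outcome.

```python
CRITICAL_PHASES = ("infrastructure", "service_health")   # Phase 1, Phase 2
WARNING_PHASES = ("smoke_tests", "metrics")              # Phase 3, Phase 4


def metrics_ok(error_rate, p99_ms, restarts, cpu_pct):
    """Apply the Phase 4 thresholds: <1% errors, <500 ms P99, 0 restarts, <80% CPU."""
    return error_rate < 0.01 and p99_ms < 500 and restarts == 0 and cpu_pct < 80


def verdict(results):
    """Return (level, failing_phase) for a dict of phase name -> passed.

    Phases are checked in order; a missing phase counts as failed, and the
    first critical failure stops evaluation, as the decision matrix requires.
    """
    for phase in CRITICAL_PHASES:
        if not results.get(phase, False):
            return ("CRITICAL", phase)
    for phase in WARNING_PHASES:
        if not results.get(phase, False):
            return ("WARNING", phase)
    return ("VERIFIED", None)
```

With the numbers from the sample report (P99 latency of 620 ms), `metrics_ok` returns `False` and `verdict` reports a warning on the metrics phase, which matches the "DEPLOYED WITH WARNINGS" outcome.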
---
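## 8. Stage 6: Rollback Guard Skill
### 8.1 Design Rationale
Rollback is the last line of defense. Many teams' rollback plan amounts to "if something breaks, run `kubectl rollout undo`". But when an incident actually happens, people often cannot even find the rollback command, or they forget to verify after rolling back.
This Skill does three things: **generate a rollback plan before deployment**, **verify rollback readiness during deployment**, and **roll back and verify in one step when something goes wrong**.
### 8.2 SKILL.md
```yaml
---
name: rollback-guard
description: >
  Generates rollback plans, verifies rollback readiness before
  deployment, and executes verified rollback with health checks.
  Use before deploying to production or when a deployment fails.
disable-model-invocation: true
metadata:
  author: devops-team
  version: "1.0"
allowed-tools: Bash(kubectl:*) Bash(curl:*) Read Write
---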
# Rollback Guard Skill
## Overview
This skill ensures **no deployment goes to production without a
tested rollback plan**. It operates in three modes:
1. **Plan mode**: Generate a rollback plan before deploying
2. **Verify mode**: Confirm rollback is possible before deploying
3. **Execute mode**: Rollback and verify the rollback succeeded
## Mode 1: Plan (Before Deployment)
Generate a rollback plan:
### Step 1: Record Current State
```bash
# Currently deployed image
CURRENT_IMAGE=$(kubectl get deployment myapp -n production \
  -o jsonpath='{.spec.template.spec.containers[0].image}')
echo "Current image: $CURRENT_IMAGE"
# Current replica count
CURRENT_REPLICAS=$(kubectl get deployment myapp -n production \
  -o jsonpath='{.spec.replicas}')
echo "Current replicas: $CURRENT_REPLICAS"
# Current revision
kubectl rollout history deployment/myapp -n production
```
### Step 2: Generate Rollback Plan
Save to `ROLLBACK-PLAN.md`:
```markdown
# Rollback Plan: myapp
**Generated**: {timestamp}
**Current Version**: {current_image}
**Target Version**: {new_image}
## Rollback Command
kubectl rollout undo deployment/myapp -n production
## Verification
kubectl rollout status deployment/myapp -n production --timeout=120s
curl -sf https://myapp.example.com/health
## Full Reset (if undo fails)
kubectl set image deployment/myapp myapp={current_image} -n production
kubectl rollout status deployment/myapp -n production --timeout=120s
## Database Rollback (if applicable)
{migration_rollback_command}
```
## Mode 2: Verify (Before Deployment)
Confirm rollback readiness:
- Previous revision exists in rollout history
- Rollback command tested in staging (or has been used before)
- Database migration has a reversible `down` migration
- Feature flags can disable new functionality
- On-call engineer is available
If any check fails → **BLOCK** deployment until resolved.
## Mode 3: Execute (When Things Go Wrong)
### Step 1: Rollback
```bash
kubectl rollout undo deployment/myapp -n production
```
### Step 2: Verify Rollback
```bash
# Wait for the rollback to finish
kubectl rollout status deployment/myapp -n production --timeout=120s
# Confirm the image version was reverted
CURRENT=$(kubectl get deployment myapp -n production \
  -o jsonpath='{.spec.template.spec.containers[0].image}')
echo "Rolled back to: $CURRENT"
# Health check
curl -sf https://myapp.example.com/health | jq .
```
### Step 3: Post-Rollback
- Confirm error rate returned to baseline
- Notify team in Slack/Feishu: "Production rollback executed"
- Create incident ticket for root cause analysis
- Do NOT redeploy until root cause is identified
## Critical Rules
- Never deploy without a rollback plan
- Never skip rollback verification after executing
- Never auto-rollback more than once (second failure = human intervention)
- Always log the rollback reason and timestamp
### 8.3 Rollback Test Script
```bash
#!/bin/bash
# scripts/rollback-test.sh: verify the rollback procedure in staging
set -euo pipefail
NAMESPACE="${1:-staging}"
APP="${2:-myapp}"
echo "=== Rollback Readiness Test ==="
echo "Namespace: $NAMESPACE"
echo "App: $APP"
# 1. Check that rollout history has enough revisions.
#    Count only rows that start with a revision number: the output also
#    contains a resource-name line and a REVISION header, so `wc -l` overcounts.
echo ""
echo "→ Checking rollout history..."
HISTORY=$(kubectl rollout history "deployment/$APP" -n "$NAMESPACE" 2>/dev/null \
  | grep -cE '^[0-9]+' || true)
HISTORY="${HISTORY:-0}"
if [ "$HISTORY" -lt 2 ]; then
  echo "🔴 FAIL: Insufficient rollout history (need ≥2 revisions, have $HISTORY)"
  exit 1
fi
echo "✅ Rollout history: $HISTORY revisions"
# 2. Check current pod health
echo ""
echo "→ Checking current pod health..."
READY=$(kubectl get deployment "$APP" -n "$NAMESPACE" \
  -o jsonpath='{.status.readyReplicas}' 2>/dev/null || echo "0")
READY="${READY:-0}"   # readyReplicas is omitted when no pod is ready
DESIRED=$(kubectl get deployment "$APP" -n "$NAMESPACE" \
  -o jsonpath='{.spec.replicas}' 2>/dev/null || echo "0")
if [ "$READY" != "$DESIRED" ]; then
  echo "🟡 WARNING: Not all pods ready ($READY/$DESIRED)"
else
  echo "✅ All pods ready ($READY/$DESIRED)"
fi
# 3. Simulate the rollback
echo ""
echo "→ Simulating rollback (dry-run)..."
kubectl rollout undo "deployment/$APP" -n "$NAMESPACE" --dry-run=client
echo "✅ Dry-run rollback succeeded"
echo ""
echo "=== Rollback Readiness: PASS ==="
echo "This deployment has a viable rollback path."
```
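The revision check in step 1 depends on how the `kubectl rollout history` output is counted: besides one row per revision, the output carries a resource-name line and a column header. The sketch below shows the counting on a captured sample; the function name `count_revisions` is hypothetical and the sample text is illustrative, not taken from a real cluster.

```python
def count_revisions(history_output):
    """Count revision rows in `kubectl rollout history` output.

    Only lines whose first field is a number are revisions; the resource
    name, the REVISION/CHANGE-CAUSE header and blank lines are skipped.
    """
    count = 0
    for line in history_output.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit():
            count += 1
    return count


# Illustrative sample of the command's output for a deployment with two revisions
SAMPLE = """deployment.apps/myapp
REVISION  CHANGE-CAUSE
1         <none>
2         <none>
"""
```

A plain line count of `SAMPLE` gives 4, which would pass a "need at least 2" check even for a deployment with a single revision; counting numbered rows gives 2.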
9. Full GitHub Actions Pipeline Integration
9.1 Orchestrating the Skills as One CI/CD Workflow
Below is a complete GitHub Actions workflow whose jobs mirror the Skills described above:
# .github/workflows/ci.yml
name: CI/CD Pipeline
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}
jobs:
  # ========== Stage 1: CI review + security scan ==========
  review-and-scan:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write
    steps:
      - uses: actions/checkout@v4
      # SAST scan (security-scan Skill, Layer 1)
      - name: Run Semgrep SAST
        uses: semgrep/semgrep-action@v1
        with:
          config: auto
      # Dependency vulnerability scan (security-scan Skill, Layer 2)
      - name: Run pip-audit
        run: |
          pip install pip-audit
          pip-audit -r requirements.txt
      # Secret leak detection (security-scan Skill, Layer 3)
      - name: Run Gitleaks
        uses: gitleaks/gitleaks-action@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  # ========== Stage 2: Build and test ==========
  build-and-test:
    needs: review-and-scan
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
          cache: 'pip'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run tests
        run: |
          pytest --cov=src --cov-report=xml --junitxml=test-results.xml
      - name: Upload coverage
        uses: codecov/codecov-action@v4
        if: always()
  # ========== Stage 3: Docker build ==========
  docker-build:
    needs: build-and-test
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    permissions:
      contents: read   # required by checkout once any permission is listed
      packages: write
    outputs:
      # One full-SHA image reference. steps.meta.outputs.tags is a multi-line
      # list and cannot be passed to `kubectl set image` as-is.
      image_tag: ghcr.io/${{ github.repository }}:${{ github.sha }}
    steps:
      - uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Login to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          # format=long tags the image with the full commit SHA, so the tag
          # matches ${{ github.sha }} used by the scan and deploy steps
          tags: |
            type=sha,prefix=,format=long
            type=ref,event=branch
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
          build-args: |
            VERSION=${{ github.ref_name }}
            COMMIT_SHA=${{ github.sha }}
      # Container security scan (docker-build Skill, Step 3).
      # Note: this runs after the push, so it gates the deploy job, not the push.
      - name: Scan image with Trivy
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
          exit-code: '1'
          severity: 'HIGH,CRITICAL'
  # ========== Stage 4: K8s deployment ==========
  deploy:
    needs: docker-build
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    environment: production
    steps:
      - uses: actions/checkout@v4
      # Assumes cluster credentials (kubeconfig) are already available to kubectl
      - name: Deploy to Kubernetes
        run: |
          # Update the deployment's image
          kubectl set image deployment/myapp myapp=${{ needs.docker-build.outputs.image_tag }} \
            -n production
          # Wait for the rollout to finish
          kubectl rollout status deployment/myapp -n production --timeout=300s
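  # ========== Stage 5: Deploy verification ==========
  verify:
    needs: deploy
    runs-on: ubuntu-latest
    steps:
      - name: Health Check
        run: |
          for i in $(seq 1 10); do
            if curl -sf https://myapp.example.com/health; then
              echo "✅ Health check passed (attempt $i)"
              exit 0
            fi
            echo "⏳ Waiting for health check... (attempt $i/10)"
            sleep 10
          done
          echo "🔴 Health check failed after 10 attempts"
          exit 1
      - name: Smoke Test
        run: |
          curl -sf https://myapp.example.com/api/v1/ping | jq .
      - name: Verify Metrics
        run: |
          # Share of 5xx responses over the last 5 minutes (a ratio, to match the 1% threshold)
          QUERY='sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
          # --data-urlencode keeps the braces, brackets and quotes of the query intact;
          # jq -r prints the value without surrounding quotes so bc can parse it
          ERROR_RATE=$(curl -sf --get 'http://prometheus:9090/api/v1/query' \
            --data-urlencode "query=${QUERY}" \
            | jq -r '.data.result[0].value[1] // "0"' || echo "1")
          ERROR_RATE="${ERROR_RATE:-1}"   # fail closed if the query returned nothing
          if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
            echo "🔴 Error rate too high: $ERROR_RATE"
            exit 1
          fi
          echo "✅ Error rate acceptable: $ERROR_RATE"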
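The metrics step has to turn a Prometheus instant-query response into a single number before comparing it with the threshold, and the edge cases (no matching series, a `NaN` ratio when there is no traffic) are easier to see in code than in a shell one-liner. This is a sketch: the function names are hypothetical, and treating "no data" as an error rate of 0 is an assumption that a stricter team might reverse.

```python
import json
import math


def error_rate_from_response(body, default=0.0):
    """Extract the scalar from a Prometheus /api/v1/query JSON response.

    Sample values arrive as strings, e.g. "value": [1716114600, "0.004"].
    An empty result or NaN (division by zero traffic) yields `default`.
    """
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    result = payload["data"]["result"]
    if not result:
        return default
    value = float(result[0]["value"][1])
    return default if math.isnan(value) else value


def passes_gate(rate, threshold=0.01):
    """The deployment passes when the 5xx ratio does not exceed the threshold."""
    return rate <= threshold
```

A failed query raises instead of returning a default, so a broken monitoring endpoint cannot silently pass the gate.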
10. One Command to Run the Whole Pipeline: A Skill That Orchestrates Skills
10.1 The Meta-Skill: One-Command Pipeline
We now have 6 independent Skills, but the real leverage comes from chaining them with a single "meta-Skill" that dispatches the others:
---
name: pipeline
description: >
  Executes the complete CI/CD pipeline: review, security scan,
  docker build, K8s deploy, verification, and rollback readiness.
  Use when the user wants to deploy, run the full pipeline,
  or says "ship it".
disable-model-invocation: true
metadata:
  author: devops-team
  version: "1.0"
---
# Pipeline Skill: One-Command CI/CD
## Execution Flow
When this skill is invoked, execute the following stages **in order**.
If any stage fails, STOP and report the failure. Do not proceed.
### Stage 1: CI Review
Load and execute the `ci-review` skill.
Review all CI/CD configurations for correctness and security.
**Gate**: Zero 🔴 findings required to proceed.
### Stage 2: Security Scan
Load and execute the `security-scan` skill.
Run all three scan layers (SAST, dependency, secrets).
**Gate**: Zero 🔴 findings required to proceed.
### Stage 3: Docker Build
Load and execute the `docker-build` skill.
Build the production image with security scanning.
**Gate**: Trivy scan passes (no HIGH/CRITICAL CVEs).
### Stage 4: K8s Deploy
Load and execute the `k8s-deploy` skill.
Deploy to the target environment with proper manifests.
**Gate**: `kubectl rollout status` succeeds.
### Stage 5: Deploy Verify
Load and execute the `deploy-verify` skill.
Run all five verification phases.
**Gate**: All phases PASS or WARNING (not CRITICAL).
### Stage 6: Rollback Guard
Load and execute the `rollback-guard` skill (plan + verify mode).
Confirm rollback readiness for this deployment.
**Gate**: Rollback plan exists and verified.
## Output Summary
After all stages complete:
╔══════════════════════════════════════════╗
║ CI/CD PIPELINE EXECUTION ║
╠══════════════════════════════════════════╣
║ Stage 1: CI Review ✅ PASS ║
║ Stage 2: Security Scan ✅ PASS ║
║ Stage 3: Docker Build ✅ PASS ║
║ Stage 4: K8s Deploy ✅ PASS ║
║ Stage 5: Deploy Verify 🟡 WARN ║
║ Stage 6: Rollback Guard ✅ READY ║
╠══════════════════════════════════════════╣
║ OVERALL: ✅ DEPLOYED WITH WARNINGS ║
║ Rollback: READY (revision 47) ║
╚══════════════════════════════════════════╝
## Error Handling
If any stage fails:
1. Print the full failure details
2. If Stage 4+ fails, ask: "Do you want to rollback?"
3. If user says yes, execute `rollback-guard` in execute mode
4. Generate an incident ticket with timeline
## Fast Path
For non-production deployments (dev/staging), skip:
- Stage 6 (Rollback Guard)
- Trivy critical-only gate → downgrade to warning
User can say: `/pipeline --env=staging` for the fast path.
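The orchestration rule in this meta-Skill (run the stages in order, stop at the first failed gate, optionally skip stages on the fast path) can be sketched as a small runner. The function name `run_pipeline`, the stage callables, and the `skip` parameter are assumptions for illustration; in practice the agent follows the instructions in the SKILL.md rather than executing code like this.

```python
def run_pipeline(stages, skip=()):
    """Run (name, check) pairs in order and stop at the first failure.

    Each check is a zero-argument callable returning True when its gate
    passes. Stages named in `skip` are recorded but not executed.
    """
    report = []
    for name, check in stages:
        if name in skip:
            report.append((name, "SKIPPED"))
            continue
        passed = check()
        report.append((name, "PASS" if passed else "FAIL"))
        if not passed:
            break   # later stages must not run after a failed gate
    return report
```

For the staging fast path described above, a caller would pass `skip=("rollback-guard",)`; stages after a failure simply do not appear in the report, which makes it clear where the pipeline stopped.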
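10.2 Usage
# Full pipeline in one command (production)
> /pipeline
# Fast path (staging, skips the rollback guard)
> /pipeline --env=staging
# Security scan only
> /security-scan
# Build the Docker image only
> /docker-build
# Deploy and verify
> /k8s-deploy
# Something went wrong: roll back
> /rollback-guard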
11. Advanced: Deeper Integration with GitHub Actions
11.1 Invoking Agent Skills Automatically in CI
With the Claude Code GitHub Action, the CI pipeline can run Skills on its own:
# .github/workflows/ai-review.yml
name: AI Agent Review
on:
  pull_request:
    types: [opened, synchronize]
jobs:
  ai-review:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - name: Run AI Code Review
        uses: anthropics/claude-code-action@v1
        with:
          claude_code_oauth_token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
          # Load the project's Skills. Check the action's README to confirm that
          # `skills` is a supported input; GitHub only warns about unknown inputs.
          skills: ci-review,security-scan
          prompt: |
            Review this PR using the ci-review and security-scan skills.
            Post findings as PR comments.
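11.2 Skills as Documentation: Making CI/CD Configuration Explainable
A long-standing problem with traditional CI/CD is that YAML is not written for people to read. A 200-line workflow file is daunting for a newcomer.
An Agent Skill, by contrast, is explainable by nature:
Newcomer: "Why does CI run SAST before the dependency scan?"
Agent: According to the design of the security-scan Skill:
Layer 1 (SAST) runs first because it is pure static analysis and the fastest check (~30s),
so it can quickly catch obvious code-injection vulnerabilities.
Layer 2 (dependency scan) has to query a CVE database over the network and is slower (~2min).
If SAST already blocks the change, there is no need to spend time on the later scans.
This is the fail-fast principle: put the fastest checks first.
You no longer have to maintain a Wiki that is always out of date. The Skill itself is living documentation.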
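12. Enterprise Deployment: Security Red Lines and Team Conventions
12.1 Security Red-Line Checklist
| Red line | Check | Enforced in |
|---|---|---|
| No hardcoded secrets | No AWS keys, GitHub tokens, or database passwords | security-scan Layer 3 |
| No `latest` tags | No `latest` in Docker images or K8s manifests | docker-build + k8s-deploy |
| No root containers | `USER nonroot` in the Dockerfile | docker-build |
| No unbounded resources | requests and limits set in K8s manifests | k8s-deploy |
| No deployment without rollback | A rollback plan exists and is verified before deploying | rollback-guard |
| No unscanned images | Trivy scan reports no HIGH/CRITICAL findings | docker-build |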
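12.2 Team Collaboration Model
Tech Lead:
→ Maintains the .claude/skills/ directory (code review, version control)
→ Defines security red lines and company conventions
DevOps Engineer:
→ Writes and tunes Skill scripts
→ Monitors pipeline health
Developer:
→ Uses /pipeline to deploy in one command
→ Uses /security-scan for self-checks
→ Uses /rollback-guard in emergencies
Newcomer:
→ Reads SKILL.md to understand the CI/CD flow
→ Learns best practices through /ci-review
→ Treats the Skills as interactive onboarding documentation
12.3 Skill Versioning and Distribution
# Internal company Skill registry (recommended)
npx skills add https://gitlab.internal.com/devops/agent-skills --skill pipeline
# Team sharing: commit the Skills to the project repository
git add .claude/skills/
git commit -m "feat: add CI/CD pipeline skills v2.0"
# New members get all Skills automatically after cloning
git clone https://github.com/org/my-project.git
cd my-project
# .claude/skills/ arrives with the project; nothing else to install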
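13. Comparison: Traditional CI/CD vs. Agent Skills CI/CD
| Dimension | Traditional | Agent Skills |
|---|---|---|
| Setup time | 2-3 days of hand-written YAML | 30 minutes to write 6 Skills |
| Security coverage | Depends on engineer experience; easy to miss things | Checklists built into the Skills, so checks are not skipped by oversight |
| Knowledge transfer | Leaves with the person | Stays with the project |
| Incident response | Dig through the Wiki, ask senior staff | Ask the agent and get an answer in seconds |
| Onboarding | 1-2 weeks to understand the pipeline | Read the SKILL.md files; about 1 day |
| Explainability | YAML is hard to read | Natural-language instructions anyone can read |
| Cross-project reuse | Copy and paste YAML | Reuse at the Skill-directory level |
| Rollback plan | Often forgotten | Generated and verified as a required step |
| Friday deploys | Nerve-racking | Backed by the rollback guard |
14. Summary: A Key Step from DevOps to AIOps
14.1 Core Insight
Agent Skills plus CI/CD is not simply "using AI to write YAML"; it is a change in how the work is divided:
Traditional DevOps:
People write configuration → machines execute → people investigate when something breaks
Agent Skills DevOps:
People define rules → the agent executes and reviews → the agent responds first when something breaks
You are no longer "the person who writes the pipeline"; you are "the person who teaches the agent how to write the pipeline".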
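14.2 What Each Skill Really Is
| Skill | Role | Analogy |
|---|---|---|
| `ci-review` | CI configuration reviewer | Quality inspector |
| `security-scan` | Security checkpoint | Airport security |
| `docker-build` | Artifact factory | Production floor |
| `k8s-deploy` | Deployment commander | Traffic control |
| `deploy-verify` | Acceptance inspector | Final factory inspection |
| `rollback-guard` | Safety net | Safety rope |
| `pipeline` | Overall dispatcher | Plant manager |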
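14.3 Roadmap
Step 1: Install the ci-review and security-scan Skills
→ Improves CI security right away
↓
Step 2: Add docker-build and k8s-deploy
→ Standardizes the build and deployment process
↓
Step 3: Add deploy-verify and rollback-guard
→ Completes post-deployment verification and the safety net
↓
Step 4: Write the pipeline meta-Skill
→ Full pipeline in one command
↓
Step 5: Connect the Claude Code Action in GitHub Actions
→ Skills run automatically in CI and PRs are reviewed automatically
↓
Step 6: Roll out to the team → company Skill registry
→ From "one person uses it" to "standard for the whole team"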
Appendix: Key Resources
| Resource | Link |
|---|---|
| Agent Skills open standard | https://agentskills.io/specification |
| Previous article: The Complete Guide to Agent Skills | From Prompt to Skill: A Paradigm Shift |
| devops-engineer Skill | https://github.com/Jeffallan/claude-skills |
| Pulumi Agent Skills | https://github.com/pulumi/agent-skills |
| GitHub Actions Claude Code Action | https://github.com/anthropics/claude-code-action |
| Trivy container scanning | https://trivy.dev |
| Semgrep SAST | https://semgrep.dev |
| Gitleaks secret detection | https://gitleaks.io |
| ArgoCD GitOps | https://argoproj.github.io/cd |
| skills-ref validation tool | https://github.com/agentskills/agentskills |
| awesome-agent-skills | https://github.com/VoltAgent/awesome-agent-skills |
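Closing thoughts: CI/CD is one of the least forgiving parts of software engineering. It asks you to understand security and operations, and to write YAML on top of that. Agent Skills package this hard-won knowledge into reusable skill bundles, so every developer can have the equivalent of a DevOps engineer with 10 years of experience checking their work.
From today on, your CI/CD pipeline no longer has to be a cold, mechanical pipe. It can be an assistant with judgment, security awareness, and a rollback plan.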