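Building a Complete CI/CD Automation Pipeline with Agent Skills: Let an AI Agent Be Your DevOps Engineer

In the previous post we covered the concepts behind Agent Skills and a hands-on introduction: a Git change-summary Skill written in 5 minutes and a code-review Skill set built in 30. But that was only "solo combat".

A real production-grade CI/CD pipeline involves code review, security scanning, build and test, Docker images, K8s deployment, rollback verification... six stages and dozens of checkpoints. In the past you had to write thousands of lines of YAML, maintain a dozen shell scripts, and pray that Friday's deployment would not cause an incident.

Today we use 6 Agent Skills to load the whole pipeline into the AI Agent's brain. One command automates the full chain, from git push to the production health check.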


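1. Why Is CI/CD the Most Valuable Battlefield for Agent Skills?

1.1 Three Pain Points of Traditional CI/CD

| Pain point | Symptom | How Agent Skills address it |
|------------|---------|-----------------------------|
| Configuration hell | A single GitHub Actions workflow runs to 200+ lines of YAML, and finding the one checkpoint you need to change takes ages | Each Skill owns one stage, so it is obvious where to make a change |
| Knowledge gaps | Only one veteran knows why staging must run database migrations first; when that person leaves, the pipeline breaks | Standardized SOPs are packaged into Skills, so knowledge does not leave with people |
| Security blind spots | CI is written but security scanning is forgotten, or SAST is added but container scanning is not | Each Skill carries a built-in security checklist, so checks are not skipped by oversight |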

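1.2 A Realistic Side-by-Side Scenario

The traditional way

Developer pushes code
  → writes .github/workflows/ci.yml (200+ lines)
  → writes a Dockerfile (forgets the multi-stage build)
  → writes k8s/deployment.yaml (forgets resource limits)
  → writes rollback.sh (forgets to verify health after rollback)
  → deploys to production
  → gets woken by PagerDuty at 3 a.m.

The Agent Skills way

Developer pushes code
  → /ci-review        ← reviews the CI configuration automatically
  → /security-scan    ← SAST + dependency scan + container scan
  → /docker-build     ← multi-stage build + image scan + push
  → /k8s-deploy       ← GitOps deployment + health checks
  → /deploy-verify    ← smoke tests + rollback readiness confirmation
  → gets a good night's sleep

1.3 Core Design Idea: A Skill Is Both the "Manual" and the "Inspector"

A traditional CI/CD configuration file is a "dumb pipe": it executes and never judges. Forgot to add security scanning? It will not remind you. Used the latest tag? It deploys anyway.

An Agent Skill is a "smart pipe": it knows what to do and what not to do. Each Skill carries built-in MUST DO and MUST NOT DO rules, as if a DevOps engineer with 10 years of experience were reviewing every step in real time.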


2. Pipeline Overview: 6 Skills Covering the Full CI/CD Flow

2.1 Architecture at a Glance

                 ┌────────────────────────────────┐
                 │      Main Agent (Claude)       │
                 │   Orchestrates the 6 Skills    │
                 └────────────────┬───────────────┘
                                  │
          ┌───────────────────────┼───────────────────────┐
          │                       │                       │
  ┌───────▼────────┐      ┌───────▼────────┐      ┌───────▼────────┐
  │    Stage 1     │      │    Stage 2     │      │    Stage 3     │
  │   CI review    │ ───→ │ Security scan  │ ───→ │  Docker build  │
  │   ci-review    │      │ security-scan  │      │  docker-build  │
  └────────────────┘      └────────────────┘      └───────┬────────┘
          ┌───────────────────────────────────────────────┘
          │
  ┌───────▼────────┐      ┌────────────────┐      ┌────────────────┐
  │    Stage 4     │      │    Stage 5     │      │    Stage 6     │
  │   K8s deploy   │ ───→ │ Deploy verify  │ ───→ │ Rollback ready │
  │   k8s-deploy   │      │ deploy-verify  │      │ rollback-guard │
  └────────────────┘      └────────────────┘      └────────────────┘

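2.2 Responsibilities of the 6 Skills

| Skill | Trigger | Core responsibilities | Risks blocked |
|-------|---------|-----------------------|---------------|
| ci-review | PR submission | Review CI configuration, check dependencies, validate build scripts | Blocks invalid workflow configurations |
| security-scan | CI stage | SAST + dependency vulnerabilities + secret leaks + container scanning | Blocks code with vulnerabilities or hardcoded secrets from being merged |
| docker-build | Build stage | Multi-stage builds, image scanning, tagging policy | Blocks latest tags and images that run as root |
| k8s-deploy | Deploy stage | Generate K8s manifests, GitOps sync, canary configuration | Blocks deployments without resource limits |
| deploy-verify | After deployment | Smoke tests, health checks, metric validation | Blocks a faulty deployment from receiving more traffic |
| rollback-guard | Throughout | Rollback plan generation, rollback verification, rollback drills | Blocks deployments without a rollback plan |

2.3 Project Directory Structure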

my-project/
├── .claude/
│   └── skills/
│       ├── ci-review/
│       │   └── SKILL.md
│       ├── security-scan/
│       │   ├── SKILL.md
│       │   └── scripts/
│       │       ├── sast-scan.sh
│       │       ├── dependency-check.sh
│       │       └── secret-detect.sh
│       ├── docker-build/
│       │   ├── SKILL.md
│       │   └── assets/
│       │       └── Dockerfile.template
│       ├── k8s-deploy/
│       │   ├── SKILL.md
│       │   ├── references/
│       │   │   └── K8S-BEST-PRACTICES.md
│       │   └── assets/
│       │       └── deployment-template.yaml
│       ├── deploy-verify/
│       │   └── SKILL.md
│       └── rollback-guard/
│           ├── SKILL.md
│           └── scripts/
│               └── rollback-test.sh
├── .github/
│   └── workflows/
│       └── ci.yml
├── k8s/
│   ├── deployment.yaml
│   ├── service.yaml
│   └── ingress.yaml
├── Dockerfile
└── src/

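3. Stage 1: CI Configuration Review Skill

3.1 Design Rationale

CI configuration is the most easily overlooked stage: developers often copy and paste a template, change the branch name, and call it done. But a poor CI configuration leads to cache misses that slow down builds, overly broad permissions that leak secrets, and missing timeouts that waste resources.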

3.2 SKILL.md

---
name: ci-review
description: >
  Reviews CI/CD pipeline configurations for correctness, security,
  and performance. Use when creating or modifying GitHub Actions,
  GitLab CI, or Jenkins pipeline configs, or when the user asks
  to review their CI setup.
metadata:
  author: devops-team
  version: "1.0"
---

# CI Configuration Review Skill

## Instructions

### Step 1: Identify CI System

Detect which CI system is in use:
- `.github/workflows/*.yml` → GitHub Actions
- `.gitlab-ci.yml` → GitLab CI
- `Jenkinsfile` → Jenkins

### Step 2: Structural Review

Check the following for **each** workflow file:

**Security Checks:**
- [ ] No hardcoded secrets (use `${{ secrets.* }}` instead)
- [ ] `permissions:` block is set to minimum required
- [ ] Third-party actions are pinned to SHA, not just tag
- [ ] `pull_request` workflows don't have `write` permissions
- [ ] No `AWS_ACCESS_KEY_ID` in env vars without secrets reference

**Performance Checks:**
- [ ] Caching is enabled (npm/pip/docker layer caching)
- [ ] Jobs run in parallel where possible (no unnecessary `needs:`)
- [ ] Timeout limits are set (`timeout-minutes: 15` default)
- [ ] Artifact retention is configured (not infinite)

**Reliability Checks:**
- [ ] Branch filters are correct (main/release only for deploy)
- [ ] Failure notifications are configured
- [ ] Retry logic for flaky steps (`retry-action` or self-authored)
- [ ] Build matrix covers all supported versions

### Step 3: Generate Report

Output a structured review:

| Category | Finding | Severity | File:Line | Fix |
|----------|---------|----------|-----------|-----|

Severity levels:
- 🔴 **Block**: Must fix before merge (security risk, will break)
- 🟡 **Warning**: Should fix soon (performance, reliability)
- 🟢 **Suggestion**: Nice to have (optimization, style)

### Step 4: Auto-Fix (if requested)

When the user says "fix it", apply all 🔴 fixes automatically
and present 🟡 fixes for confirmation.

## Current CI Configuration

!`find .github/workflows -name '*.yml' -o -name '*.yaml' 2>/dev/null | head -20`

!`cat .github/workflows/*.yml 2>/dev/null || echo "No GitHub Actions workflows found"`
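
One of the Security Checks above, pinning third-party actions to a commit SHA, lends itself to a mechanical check. The sketch below is an illustration and not part of the original Skill: the function name `find_unpinned_actions` is hypothetical, and the text-based matching is an assumption that treats any remote `uses:` reference not ending in a 40-character hex SHA as unpinned (local `./` actions and `docker://` references are skipped).

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helper): list `uses:` lines in a workflow file that are
# not pinned to a full 40-character commit SHA.
find_unpinned_actions() {
  local file="$1"
  # 1) keep `uses:` lines that do not start with "." (local actions)
  # 2) drop docker:// references
  # 3) drop references that end in @<40 hex chars>
  grep -nE '^[[:space:]]*(-[[:space:]]*)?uses:[[:space:]]*[^.[:space:]]' "$file" \
    | grep -vE 'uses:[[:space:]]*docker://' \
    | grep -vE '@[0-9a-f]{40}([[:space:]]|$)' \
    || true   # an empty result is not an error
}
```

Calling `find_unpinned_actions .github/workflows/ci.yml` prints each offending line with its line number; empty output means every remote action in that file is pinned. A real reviewer would parse the YAML rather than grep it, so treat this only as a quick first pass.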

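4. Stage 2: Security Scanning Skill (with Scripts)

4.1 Design Rationale

Security scanning is the "security checkpoint" of the CI/CD pipeline. We design three layers of scanning: SAST static code analysis, dependency vulnerability scanning, and secret leak detection. The three scripts run independently, and the Skill orchestrates them.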

4.2 SKILL.md

---
name: security-scan
description: >
  Performs multi-layer security scanning: SAST code analysis,
  dependency vulnerability check, and secret/credential leak detection.
  Use before merging PRs, during CI pipeline, or when the user
  asks for a security audit.
metadata:
  author: security-team
  version: "1.0"
allowed-tools: Bash(bandit:*) Bash(pip-audit:*) Bash(trivy:*) Bash(gitleaks:*) Read
---

# Security Scan Skill

## Overview

Three-layer security scanning for CI/CD pipelines:
1. **SAST** — Static code analysis for vulnerability patterns
2. **Dependency** — Known CVE scanning for third-party packages
3. **Secrets** — Credential and API key leak detection

## Execution

### Layer 1: SAST (Static Application Security Testing)

```bash
bash scripts/sast-scan.sh
```

The script detects the project language and runs the appropriate scanner:

- Python → bandit
- JavaScript/TypeScript → semgrep
- Go → gosec
- Java → spotbugs

### Layer 2: Dependency Vulnerability Scan

```bash
bash scripts/dependency-check.sh
```

Scans lockfiles and manifests for known CVEs:

- Python → pip-audit / safety
- Node.js → npm audit / snyk
- Go → nancy / osv-scanner
- Java → owasp-dependency-check

### Layer 3: Secret Detection

```bash
bash scripts/secret-detect.sh
```

Scans for:

- AWS keys, API tokens, private keys
- Database connection strings with passwords
- Hardcoded JWT secrets
- .env files committed to the repo

## Output Format

For each finding:

| # | Layer | Severity | File | Line | Issue | CVE/Rule | Fix |
|---|-------|----------|------|------|-------|----------|-----|

Severity levels:

- 🔴 **Critical**: Exploitable vulnerability, leaked credentials → BLOCK merge
- 🟡 **High**: Known CVE with patch available → Warn, suggest fix
- 🟢 **Medium/Low**: Best practice, informational → Document only

## Decision Logic

```
if Critical findings > 0:
    → BLOCK: "Security scan failed. Fix critical issues before merging."
    → Generate fix suggestions for each critical finding
elif High findings > 3:
    → WARN: "Multiple high-severity issues found. Review recommended."
    → List all high findings with fix suggestions
else:
    → PASS: "Security scan passed."
    → Summary of low/info findings
```

## Edge Cases

- If no scanner is installed, provide installation instructions
- If scan times out (>5 min), report partial results and note which layers are incomplete
- For false positives, document the suppression rule (not just ignore)
- Never skip a layer entirely — if one fails, note it and continue with others
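
The Decision Logic in this Skill is written for the Agent to interpret, but the same gate can be expressed as a few lines of shell if you want a CI job to enforce it too. This is a minimal sketch under assumptions: the function name `scan_decision` is hypothetical, and it assumes the Critical and High counts have already been extracted from the scanner reports.

```bash
# Sketch (hypothetical helper): map finding counts to BLOCK / WARN / PASS.
# Arguments: <critical_count> <high_count>
scan_decision() {
  local critical="$1" high="$2"
  if [ "$critical" -gt 0 ]; then
    echo "BLOCK"
    return 1          # non-zero exit lets a CI step fail the merge
  elif [ "$high" -gt 3 ]; then
    echo "WARN"       # more than 3 High findings: warn, but do not block
  else
    echo "PASS"
  fi
}
```

Only the BLOCK branch returns a non-zero status, which mirrors the Skill: Critical findings stop the merge, while High findings produce a warning for review.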

4.3 SAST Scan Script

```bash
#!/bin/bash
# scripts/sast-scan.sh: static code security scan
set -euo pipefail

echo "=== SAST Scan Starting ==="
echo "Detecting project language..."

# Detect the project language and run the matching scanner
if [ -f "requirements.txt" ] || [ -f "pyproject.toml" ]; then
    echo "→ Python project detected"
    if command -v bandit &>/dev/null; then
        bandit -r src/ -f json -o /tmp/sast-results.json 2>/dev/null || true
        bandit -r src/ -f txt 2>/dev/null || true
    else
        echo "⚠ bandit not installed. Install: pip install bandit"
    fi
elif [ -f "package.json" ]; then
    echo "→ JavaScript/TypeScript project detected"
    if command -v semgrep &>/dev/null; then
        semgrep --config auto --json src/ -o /tmp/sast-results.json 2>/dev/null || true
        semgrep --config auto src/ 2>/dev/null || true
    else
        echo "⚠ semgrep not installed. Install: pip install semgrep"
    fi
elif [ -f "go.mod" ]; then
    echo "→ Go project detected"
    if command -v gosec &>/dev/null; then
        gosec -fmt=json -out=/tmp/sast-results.json ./... 2>/dev/null || true
        gosec -fmt=text ./... 2>/dev/null || true
    else
        echo "⚠ gosec not installed. Install: go install github.com/securego/gosec/v2/cmd/gosec@latest"
    fi
elif [ -f "pom.xml" ] || [ -f "build.gradle" ]; then
    echo "→ Java project detected"
    echo "ℹ Java SAST requires SpotBugs. Consider CI integration."
fi

echo "=== SAST Scan Complete ==="
```

4.4 Secret Leak Detection Script

```bash
#!/bin/bash
# scripts/secret-detect.sh: secret and credential leak detection
set -euo pipefail

echo "=== Secret Detection Starting ==="

# gitleaks scan
if command -v gitleaks &>/dev/null; then
    echo "→ Running gitleaks scan..."
    gitleaks detect --source . --no-banner --report-format json \
        --report-path /tmp/secret-results.json 2>/dev/null || true

    # Check whether anything was found
    if [ -s /tmp/secret-results.json ]; then
        FINDING_COUNT=$(python3 -c "
import json
try:
    data = json.load(open('/tmp/secret-results.json'))
    print(len(data))
except Exception:
    print(0)
" 2>/dev/null || echo "0")

        if [ "$FINDING_COUNT" -gt 0 ]; then
            echo "🔴 CRITICAL: $FINDING_COUNT secret(s) detected!"
            gitleaks detect --source . --no-banner 2>/dev/null || true
        else
            echo "✅ No secrets detected"
        fi
    else
        echo "✅ No secrets detected"
    fi
else
    echo "⚠ gitleaks not installed. Install: brew install gitleaks"
    echo "  Alternative: docker run --rm -v $(pwd):/repo zricethezav/gitleaks detect --source /repo"
fi

# Extra check: are any .env files tracked by git?
# `git ls-files --error-unmatch` fails as soon as one pattern has no match,
# so list the matches and test for non-empty output instead.
TRACKED_ENV=$(git ls-files -- .env '.env.*' 2>/dev/null || true)
if [ -n "$TRACKED_ENV" ]; then
    echo "🔴 CRITICAL: .env file(s) tracked in git: $TRACKED_ENV"
    echo "   Remove with: git rm --cached <file>"
fi

echo "=== Secret Detection Complete ==="
```
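
The Skill also calls `scripts/dependency-check.sh`, which is not listed in this article. The following is a hypothetical sketch of what such a script could look like, not the original: it covers only Python and Node.js, the function names `detect_ecosystem` and `run_dependency_scan` are made up for illustration, and it assumes `pip-audit` and `npm` are the scanners of choice.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of scripts/dependency-check.sh (Python and Node.js only).

# Print the ecosystem detected from manifest files in a directory.
detect_ecosystem() {
  local dir="${1:-.}"
  if [ -f "$dir/requirements.txt" ] || [ -f "$dir/pyproject.toml" ]; then
    echo "python"
  elif [ -f "$dir/package.json" ]; then
    echo "node"
  else
    echo "unknown"
  fi
}

# Run the matching scanner if it is installed; otherwise print a hint.
run_dependency_scan() {
  local dir="${1:-.}"
  case "$(detect_ecosystem "$dir")" in
    python)
      if command -v pip-audit >/dev/null 2>&1 && [ -f "$dir/requirements.txt" ]; then
        pip-audit -r "$dir/requirements.txt"
      else
        echo "⚠ pip-audit not installed (pip install pip-audit) or no requirements.txt"
      fi
      ;;
    node)
      if command -v npm >/dev/null 2>&1; then
        # Fail only on high or critical advisories
        (cd "$dir" && npm audit --audit-level=high)
      else
        echo "⚠ npm not installed"
      fi
      ;;
    *)
      echo "⚠ No supported manifest found; report this layer as incomplete"
      ;;
  esac
}
```

Keeping detection separate from execution follows the Skill's edge-case rule: when no scanner applies, the layer is reported as incomplete instead of being silently skipped.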

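5. Stage 3: Docker Build Skill (with Template)

5.1 Design Rationale

The Docker image is the "deliverable" of CI/CD. A bad Dockerfile can bloat the image to 2 GB or hand an attacker root privileges. This Skill enforces multi-stage builds, a non-root user, pinned tags, and image scanning.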

5.2 SKILL.md

---
name: docker-build
description: >
  Builds production-grade Docker images with multi-stage builds,
  security scanning, and proper tagging. Use when writing or
  reviewing Dockerfiles, building Docker images, or pushing to
  container registries.
metadata:
  author: devops-team
  version: "1.0"
disable-model-invocation: false
allowed-tools: Bash(docker:*) Bash(trivy:*) Read Write
---

# Docker Build Skill

## Pre-Build Checklist

Before writing or reviewing any Dockerfile, verify:

- [ ] Uses multi-stage build (builder + runtime)
- [ ] Base image uses specific tag, NOT `latest`
- [ ] Non-root user is set (`USER nonroot`)
- [ ] HEALTHCHECK instruction is present
- [ ] `.dockerignore` exists and excludes: `.git`, `node_modules`, `__pycache__`, `.env`
- [ ] No secrets are COPY'd into the image
- [ ] Build context is minimal (no unnecessary files)

## Build Process

### Step 1: Review or Generate Dockerfile

If a Dockerfile exists, review it against the checklist above.
If not, generate one using the template at [Dockerfile.template](assets/Dockerfile.template).

### Step 2: Build the Image

```bash
# Build with build args for version injection
docker build \
  --build-arg VERSION=$(git describe --tags --always) \
  --build-arg COMMIT_SHA=$(git rev-parse HEAD) \
  -t myapp:$(git rev-parse --short HEAD) \
  -t myapp:latest-build \
  .
```

**Tag Policy:**

- Git SHA tag (e.g., `myapp:abc1234`) → For exact version tracking
- Branch tag (e.g., `myapp:main`) → For latest on branch
- ❌ NEVER push a `latest` tag to production registry

### Step 3: Scan the Image

```bash
# Trivy vulnerability scan
trivy image --exit-code 1 --severity HIGH,CRITICAL myapp:$(git rev-parse --short HEAD)

# If critical vulnerabilities found:
# 1. Check if base image update fixes them
# 2. If not, document the risk and get explicit approval
# 3. Never silently ignore critical vulnerabilities
```

### Step 4: Push to Registry

```bash
docker tag myapp:abc1234 ghcr.io/org/myapp:abc1234
docker push ghcr.io/org/myapp:abc1234
```

## MUST NOT DO

- ❌ Use `FROM node:latest` or any `latest` tag
- ❌ Run as root user in production
- ❌ COPY `.env` or any secret file
- ❌ Push `latest` tag to production registry
- ❌ Skip image scanning before push
- ❌ Use ADD instead of COPY (unless extracting tar)
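
Several of these rules can be pre-checked with plain text matching before the Agent does its fuller review. The sketch below is an assumption-laden illustration: `lint_dockerfile` is a hypothetical name, it only looks for an explicit `:latest` base tag, a missing `USER` instruction, and a missing `HEALTHCHECK`, and it does not catch a `FROM` line with no tag at all (which also resolves to `latest`).

```bash
# Sketch (hypothetical helper): flag three checklist violations in a Dockerfile.
# Returns 1 if any blocking issue is found, 0 otherwise.
lint_dockerfile() {
  local file="$1" status=0
  # Base image with an explicit :latest tag
  if grep -qiE '^[[:space:]]*FROM[[:space:]]+[^[:space:]]+:latest([[:space:]]|$)' "$file"; then
    echo "BLOCK: base image uses the latest tag"
    status=1
  fi
  # Without a USER instruction the image usually runs as root
  if ! grep -qiE '^[[:space:]]*USER[[:space:]]+' "$file"; then
    echo "BLOCK: no USER instruction (image likely runs as root)"
    status=1
  fi
  # A missing HEALTHCHECK is reported but does not block
  if ! grep -qiE '^[[:space:]]*HEALTHCHECK[[:space:]]+' "$file"; then
    echo "WARN: no HEALTHCHECK instruction"
  fi
  return "$status"
}
```

Running `lint_dockerfile Dockerfile` before `docker build` gives a fast fail; the Skill's checklist still covers what text matching cannot see, such as secrets copied in through a broad `COPY . .`.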

5.3 Dockerfile Template

```dockerfile
# assets/Dockerfile.template: multi-stage build template
# Written for Python projects; adapt as needed for Node.js/Go/Java

# ============ Stage 1: Builder ============
FROM python:3.12-slim AS builder

WORKDIR /app

# Copy the dependency manifest first (to benefit from Docker layer caching)
COPY requirements.txt .

# Install dependencies into a separate prefix
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# ============ Stage 2: Runtime ============
FROM python:3.12-slim

# Metadata
ARG VERSION=dev
ARG COMMIT_SHA=unknown
LABEL version="${VERSION}" \
      commit-sha="${COMMIT_SHA}" \
      maintainer="dev-team"

WORKDIR /app

# Copy the installed dependencies from the builder stage
COPY --from=builder /install /usr/local

# Copy the application code
COPY . .

# Create a non-root user
RUN groupadd -r appuser && useradd -r -g appuser appuser
USER appuser

# Health check (python:3.12-slim does not ship curl, so use the standard library)
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health', timeout=3)" || exit 1

# Expose the port
EXPOSE 8080

# Start command
CMD ["python", "main.py"]
```

6. Stage 4: Kubernetes Deployment Skill

6.1 SKILL.md

---
name: k8s-deploy
description: >
  Generates and reviews Kubernetes deployment manifests following
  production best practices. Use when deploying to K8s, writing
  deployment manifests, or setting up GitOps with ArgoCD/Flux.
metadata:
  author: devops-team
  version: "1.0"
allowed-tools: Bash(kubectl:*) Bash(helm:*) Read Write
---

# Kubernetes Deploy Skill

## Pre-Deploy Checklist

Every deployment MUST satisfy:

- [ ] Resource limits set (CPU + Memory for requests and limits)
- [ ] Liveness probe configured
- [ ] Readiness probe configured
- [ ] Image tag is specific SHA, NOT `latest`
- [ ] Secrets come from Secret objects (`secretRef` or volume mounts), NOT literal values in the manifest
- [ ] Pod Disruption Budget defined (for production)
- [ ] Network policy configured (if cluster requires)
- [ ] Anti-affinity rules set (for HA, ≥2 replicas)

## Deployment Flow

### Step 1: Generate/Review Manifests

Check the following files:
- `k8s/deployment.yaml` — Main deployment
- `k8s/service.yaml` — Service exposure
- `k8s/ingress.yaml` — Ingress routing
- `k8s/hpa.yaml` — Horizontal Pod Autoscaler (recommended)

Refer to [K8S-BEST-PRACTICES.md](references/K8S-BEST-PRACTICES.md)
for the complete reference.

### Step 2: Validate Manifests

```bash
# Dry-run validation
kubectl apply --dry-run=client -f k8s/

# If using Helm:
helm template myapp ./chart | kubeval --strict
```

### Step 3: GitOps Sync (ArgoCD)

```bash
# Push manifests to GitOps repo
git add k8s/
git commit -m "deploy: myapp $(git rev-parse --short HEAD)"
git push origin main

# ArgoCD auto-syncs. Monitor:
argocd app get myapp --refresh
```

### Step 4: Deploy (Non-GitOps Fallback)

```bash
kubectl apply -f k8s/
kubectl rollout status deployment/myapp -n production --timeout=300s
```

## Deployment Strategy Selection

| Strategy | When to Use | Risk Level |
|----------|-------------|------------|
| Rolling Update | Default, low-risk changes | Low |
| Blue-Green | Major version upgrades, DB schema changes | Medium |
| Canary | High-traffic services, gradual validation | Medium |

## MUST NOT DO

- ❌ Deploy without resource limits
- ❌ Use `imagePullPolicy: Always` with `latest` tag
- ❌ Expose secrets as plain env vars
- ❌ Deploy single-replica workloads to production
- ❌ Deploy on Friday without on-call coverage

6.2 K8s Best Practices Reference

<!-- references/K8S-BEST-PRACTICES.md -->

# Kubernetes Production Best Practices

## Resource Management

```yaml
resources:
  requests:
    cpu: "100m"      # guaranteed minimum
    memory: "128Mi"  # guaranteed minimum
  limits:
    cpu: "500m"      # upper bound
    memory: "512Mi"  # upper bound (OOM kill threshold)
```

**Key principles:**

- Requests must be set; otherwise the scheduler may place the Pod on a node without enough resources
- Limits must be set to stop one misbehaving Pod from consuming the whole node's resources
- A requests-to-limits ratio between 1:2 and 1:5 is recommended

## Probes

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3

startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
```

## Security Context

```yaml
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
```

6.3 Deployment Template

```yaml
# assets/deployment-template.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: production
  labels:
    app: myapp
    version: v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: myapp
        version: v1
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      containers:
        - name: myapp
          image: ghcr.io/org/myapp:PLACEHOLDER_SHA
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          env:
            - name: VERSION
              value: "PLACEHOLDER_VERSION"
          envFrom:
            - secretRef:
                name: myapp-secrets
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir: {}
```
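
The template leaves `PLACEHOLDER_SHA` and `PLACEHOLDER_VERSION` to be filled in at deploy time. The article does not show how that substitution happens, so the following is only one possible sketch: `render_manifest` is a hypothetical helper that uses `sed`, and it assumes the SHA and version strings contain no `|` or `&` characters (true for Git SHAs and typical tags).

```bash
# Sketch (hypothetical helper): fill the two placeholders in the template
# and write the rendered manifest to stdout.
# Arguments: <template_file> <image_sha_tag> <version>
render_manifest() {
  local template="$1" sha="$2" version="$3"
  sed -e "s|PLACEHOLDER_SHA|${sha}|g" \
      -e "s|PLACEHOLDER_VERSION|${version}|g" \
      "$template"
}
```

A typical use would pipe the result into the dry-run validation from Step 2, for example `render_manifest assets/deployment-template.yaml "$(git rev-parse --short HEAD)" "$(git describe --tags --always)" | kubectl apply --dry-run=client -f -`, so that an unrendered placeholder never reaches the cluster.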

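7. Stage 5: Deployment Verification Skill

7.1 Design Rationale

Deployment is not the finish line; verification is. Too many teams click "deploy" and assume all is well, only to see the service return 503 five minutes later. After a deployment, this Skill automatically runs smoke tests, health checks, metric validation, and log inspection.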

7.2 SKILL.md

---
name: deploy-verify
description: >
  Verifies deployment success through smoke tests, health checks,
  metric validation, and log inspection. Use after deploying to
  any environment, especially staging and production.
metadata:
  author: devops-team
  version: "1.0"
allowed-tools: Bash(kubectl:*) Bash(curl:*) Read
---

# Deploy Verify Skill

## Verification Sequence

After any deployment, execute these checks **in order**:

### Phase 1: Infrastructure Health (0-30s)

```bash
# Pod status
kubectl get pods -n production -l app=myapp

# Wait for the rollout to finish
kubectl rollout status deployment/myapp -n production --timeout=120s

# List pods that are not in the Running phase.
# Pods in CrashLoopBackOff usually still report phase Running,
# so also check the RESTARTS column from the first command.
kubectl get pods -n production -l app=myapp \
  --field-selector=status.phase!=Running
```

### Phase 2: Service Health (30s-2min)

```bash
# Health check endpoint
curl -sf https://myapp.example.com/health | jq .

# Readiness check
curl -sf https://myapp.example.com/ready | jq .

# Metrics endpoint (if /metrics is exposed)
curl -sf https://myapp.example.com/metrics | grep "up 1"
```

### Phase 3: Smoke Tests (2-5min)

Execute critical path tests:

```bash
# Core API functionality test
curl -sf -X POST https://myapp.example.com/api/v1/login \
  -H "Content-Type: application/json" \
  -d '{"user":"smoke-test","pass":"test"}' | jq .

# Database connectivity check
curl -sf https://myapp.example.com/api/v1/ping | jq .
```

### Phase 4: Metric Validation (5-10min)

Check for anomalies in the first 10 minutes:

| Metric | Threshold | Action if Breached |
|--------|-----------|--------------------|
| Error Rate (5xx) | < 1% | Alert + consider rollback |
| Latency P99 | < 500ms | Alert + investigate |
| Pod Restarts | 0 in 10min | Alert + check logs |
| CPU Usage | < 80% | Alert + scale up |

### Phase 5: Log Inspection

```bash
# Check recent ERROR logs
kubectl logs -n production -l app=myapp --tail=100 | grep -i "error"

# Check for OOM events
kubectl get events -n production --field-selector reason=OOMKilling
```

## Decision Matrix

```
if Phase 1 fails:
    → 🔴 CRITICAL: Pods not running. Check logs immediately.
    → Do NOT proceed to Phase 2.

if Phase 2 fails:
    → 🔴 CRITICAL: Health checks failing. Consider rollback.
    → Run: /rollback-guard

if Phase 3 fails:
    → 🟡 WARNING: Functional test failures. Investigate before allowing full traffic.

if Phase 4 fails:
    → 🟡 WARNING: Performance anomaly detected. Monitor closely.
    → If persists >15min, consider rollback.

if all phases pass:
    → ✅ DEPLOYMENT VERIFIED
    → Schedule 1-hour monitoring check
```

## Output

```
═══════════════════════════════════
  DEPLOYMENT VERIFICATION REPORT
═══════════════════════════════════
App:        myapp
Version:    abc1234
Env:        production
Timestamp:  2026-05-19T10:30:00Z

Phase 1 - Infrastructure:  ✅ PASS
Phase 2 - Service Health:  ✅ PASS
Phase 3 - Smoke Tests:     ✅ PASS
Phase 4 - Metrics:         🟡 WARNING (latency P99: 620ms)
Phase 5 - Logs:            ✅ PASS

Overall: 🟡 DEPLOYED WITH WARNINGS
Action: Monitor latency for next 15 minutes
═══════════════════════════════════
```
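
The Phase 4 thresholds boil down to comparing a measured value against a limit, and the values are often fractional (an error rate of 0.004, for instance), which shell integer tests cannot handle. A minimal sketch, assuming the metric value has already been fetched from your monitoring system; `within_threshold` and `check_metric` are hypothetical names, and `awk` is used only for the floating-point comparison.

```bash
# Sketch (hypothetical helpers): compare a metric against its Phase 4 limit.

# Exit 0 when value <= max, 1 otherwise (awk handles decimals).
within_threshold() {
  local value="$1" max="$2"
  awk -v v="$value" -v m="$max" 'BEGIN { exit !(v + 0 <= m + 0) }'
}

# Print a PASS/WARN line for one metric; returns 1 on a breach.
# Arguments: <name> <value> <max>
check_metric() {
  local name="$1" value="$2" max="$3"
  if within_threshold "$value" "$max"; then
    echo "PASS ${name}=${value} (max ${max})"
  else
    echo "WARN ${name}=${value} exceeds ${max}"
    return 1
  fi
}
```

With the figures from the sample report, `check_metric latency_p99_ms 620 500` reports a warning, which matches the 🟡 line for Phase 4 above.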

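8. Stage 6: Rollback Guard Skill

8.1 Design Rationale

Rollback is the last line of defense. Many teams' rollback plan amounts to "if something breaks, run kubectl rollout undo". But when an incident actually happens, people often cannot even find the rollback command, or they roll back and forget to verify the result.

This Skill does three things: it generates a rollback plan before deployment, verifies rollback readiness at deployment time, and performs a one-step rollback plus verification when something goes wrong.

8.2 SKILL.md
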
---
name: rollback-guard
description: >
  Generates rollback plans, verifies rollback readiness before
  deployment, and executes verified rollback with health checks.
  Use before deploying to production or when a deployment fails.
disable-model-invocation: true
metadata:
  author: devops-team
  version: "1.0"
allowed-tools: Bash(kubectl:*) Bash(curl:*) Read Write
---

# Rollback Guard Skill

## Overview

This skill ensures **no deployment goes to production without a
tested rollback plan**. It operates in three modes:

1. **Plan mode**: Generate a rollback plan before deploying
2. **Verify mode**: Confirm rollback is possible before deploying
3. **Execute mode**: Rollback and verify the rollback succeeded

## Mode 1: Plan (Before Deployment)

Generate a rollback plan:

### Step 1: Record Current State

```bash
# Currently deployed image
CURRENT_IMAGE=$(kubectl get deployment myapp -n production \
  -o jsonpath='{.spec.template.spec.containers[0].image}')
echo "Current image: $CURRENT_IMAGE"

# Current replica count
CURRENT_REPLICAS=$(kubectl get deployment myapp -n production \
  -o jsonpath='{.spec.replicas}')
echo "Current replicas: $CURRENT_REPLICAS"

# Current revision
kubectl rollout history deployment/myapp -n production
```

### Step 2: Generate Rollback Plan

Save to `ROLLBACK-PLAN.md`:

```markdown
# Rollback Plan: myapp

**Generated**: {timestamp}
**Current Version**: {current_image}
**Target Version**: {new_image}

## Rollback Command
kubectl rollout undo deployment/myapp -n production

## Verification
kubectl rollout status deployment/myapp -n production --timeout=120s
curl -sf https://myapp.example.com/health

## Full Reset (if undo fails)
kubectl set image deployment/myapp myapp={current_image} -n production
kubectl rollout status deployment/myapp -n production --timeout=120s

## Database Rollback (if applicable)
{migration_rollback_command}
```

## Mode 2: Verify (Before Deployment)

Confirm rollback readiness:

- Previous revision exists in rollout history
- Rollback command tested in staging (or has been used before)
- Database migration has a reversible down migration
- Feature flags can disable new functionality
- On-call engineer is available

If any check fails → BLOCK deployment until resolved.

## Mode 3: Execute (When Things Go Wrong)

### Step 1: Rollback

```bash
kubectl rollout undo deployment/myapp -n production
```

### Step 2: Verify Rollback

```bash
# Wait for the rollback to finish
kubectl rollout status deployment/myapp -n production --timeout=120s

# Confirm the image version was rolled back
CURRENT=$(kubectl get deployment myapp -n production \
  -o jsonpath='{.spec.template.spec.containers[0].image}')
echo "Rolled back to: $CURRENT"

# Health check
curl -sf https://myapp.example.com/health | jq .
```

### Step 3: Post-Rollback

- Confirm error rate returned to baseline
- Notify team in Slack/Feishu: "Production rollback executed"
- Create incident ticket for root cause analysis
- Do NOT redeploy until root cause is identified

## Critical Rules

- Never deploy without a rollback plan
- Never skip rollback verification after executing
- Never auto-rollback more than once (second failure = human intervention)
- Always log the rollback reason and timestamp
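
The rule "never auto-rollback more than once" needs some state to be enforceable, because the second attempt has to know about the first. The sketch below shows one simple way to keep that state; it is an assumption, not the article's implementation: `may_auto_rollback` is a hypothetical helper, and `STATE_FILE` is a made-up counter location that you would reset for each new release.

```bash
# Sketch (hypothetical helper): allow one automatic rollback, then require a human.
# STATE_FILE is an assumed location for the per-release rollback counter.
STATE_FILE="${STATE_FILE:-/tmp/rollback-count}"

may_auto_rollback() {
  local count=0
  if [ -f "$STATE_FILE" ]; then
    count=$(cat "$STATE_FILE")
  fi
  if [ "$count" -ge 1 ]; then
    # A rollback already happened for this release: stop and escalate
    echo "HUMAN_REQUIRED"
    return 1
  fi
  echo $((count + 1)) > "$STATE_FILE"
  # Log the timestamp, as the Critical Rules require
  echo "ROLLBACK_ALLOWED $(date -u +%Y-%m-%dT%H:%M:%SZ)"
}
```

An Execute-mode wrapper could run `may_auto_rollback && kubectl rollout undo deployment/myapp -n production`, so a second failure falls through to human intervention instead of another automatic undo.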

8.3 Rollback Test Script

```bash
#!/bin/bash
# scripts/rollback-test.sh: verify the rollback procedure in a staging environment
set -euo pipefail

NAMESPACE="${1:-staging}"
APP="${2:-myapp}"

echo "=== Rollback Readiness Test ==="
echo "Namespace: $NAMESPACE"
echo "App: $APP"

# 1. Check that rollout history exists
echo ""
echo "→ Checking rollout history..."
# Count only revision rows (lines starting with a number), not the header lines
HISTORY=$(kubectl rollout history "deployment/$APP" -n "$NAMESPACE" 2>/dev/null \
    | grep -cE '^[0-9]+' || true)
HISTORY="${HISTORY:-0}"
if [ "$HISTORY" -lt 2 ]; then
    echo "🔴 FAIL: Insufficient rollout history (need ≥2 revisions, have $HISTORY)"
    exit 1
fi
echo "✅ Rollout history: $HISTORY revisions"

# 2. Check current pod health
echo ""
echo "→ Checking current pod health..."
READY=$(kubectl get deployment "$APP" -n "$NAMESPACE" \
    -o jsonpath='{.status.readyReplicas}' 2>/dev/null || echo "0")
READY="${READY:-0}"   # readyReplicas is omitted when no pod is ready
DESIRED=$(kubectl get deployment "$APP" -n "$NAMESPACE" \
    -o jsonpath='{.spec.replicas}' 2>/dev/null || echo "0")
if [ "$READY" != "$DESIRED" ]; then
    echo "🟡 WARNING: Not all pods ready ($READY/$DESIRED)"
else
    echo "✅ All pods ready ($READY/$DESIRED)"
fi

# 3. Simulate the rollback
echo ""
echo "→ Simulating rollback (dry-run)..."
kubectl rollout undo "deployment/$APP" -n "$NAMESPACE" --dry-run=client
echo "✅ Dry-run rollback succeeded"

echo ""
echo "=== Rollback Readiness: PASS ==="
echo "This deployment has a viable rollback path."
```

9. Full GitHub Actions Pipeline Integration

9.1 Orchestrating the 6 Skills as One CI/CD Workflow

Below is a complete GitHub Actions workflow that corresponds to our 6 Skills:

```yaml
# .github/workflows/ci.yml
name: CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  # ========== Stage 1: CI review + security scan ==========
  review-and-scan:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write
    steps:
      - uses: actions/checkout@v4

      # SAST scan (maps to security-scan Skill Layer 1)
      - name: Run Semgrep SAST
        uses: semgrep/semgrep-action@v1
        with:
          config: auto

      # Dependency vulnerability scan (maps to security-scan Skill Layer 2)
      - name: Run pip-audit
        run: |
          pip install pip-audit
          pip-audit -r requirements.txt

      # Secret leak detection (maps to security-scan Skill Layer 3)
      - name: Run Gitleaks
        uses: gitleaks/gitleaks-action@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

  # ========== Stage 2: Build and test ==========
  build-and-test:
    needs: review-and-scan
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
          cache: 'pip'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run tests
        run: |
          pytest --cov=src --cov-report=xml --junitxml=test-results.xml

      - name: Upload coverage
        uses: codecov/codecov-action@v4
        if: always()

  # ========== Stage 3: Docker build ==========
  docker-build:
    needs: build-and-test
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    permissions:
      contents: read   # needed by checkout once a permissions block is set
      packages: write
    outputs:
      # A single image reference; the metadata `tags` output is a multi-line list
      image_tag: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
    steps:
      - uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          # format=long tags the image with the full commit SHA,
          # so the tag matches github.sha used by the steps below
          tags: |
            type=sha,prefix=,format=long
            type=ref,event=branch

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
          build-args: |
            VERSION=${{ github.ref_name }}
            COMMIT_SHA=${{ github.sha }}

      # Container security scan (maps to docker-build Skill Step 3)
      - name: Scan image with Trivy
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
          exit-code: '1'
          severity: 'HIGH,CRITICAL'

  # ========== Stage 4: K8s deployment ==========
  deploy:
    needs: docker-build
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    environment: production
    steps:
      - uses: actions/checkout@v4

      - name: Deploy to Kubernetes
        run: |
          # Update the deployment's image tag
          kubectl set image deployment/myapp myapp=${{ needs.docker-build.outputs.image_tag }} \
            -n production

          # Wait for the rollout to finish
          kubectl rollout status deployment/myapp -n production --timeout=300s

  # ========== Stage 5: Deployment verification ==========
  verify:
    needs: deploy
    runs-on: ubuntu-latest
    steps:
      - name: Health Check
        run: |
          for i in $(seq 1 10); do
            if curl -sf https://myapp.example.com/health; then
              echo "✅ Health check passed (attempt $i)"
              exit 0
            fi
            echo "⏳ Waiting for health check... (attempt $i/10)"
            sleep 10
          done
          echo "🔴 Health check failed after 10 attempts"
          exit 1

      - name: Smoke Test
        run: |
          curl -sf https://myapp.example.com/api/v1/ping | jq .

      - name: Verify Metrics
        run: |
          # Check the share of 5xx responses over the last 5 minutes.
          # --data-urlencode keeps curl from misreading the {} and [] in the query.
          QUERY='sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
          ERROR_RATE=$(curl -sfG 'http://prometheus:9090/api/v1/query' \
            --data-urlencode "query=${QUERY}" | jq -r '.data.result[0].value[1] // "0"')
          ERROR_RATE="${ERROR_RATE:-1}"   # fail closed if Prometheus is unreachable
          if awk -v r="$ERROR_RATE" 'BEGIN { exit !(r + 0 > 0.01) }'; then
            echo "🔴 Error rate too high: $ERROR_RATE"
            exit 1
          fi
          echo "✅ Error rate acceptable: $ERROR_RATE"
```

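10. One Command to Trigger the Whole Pipeline: A Skill That Orchestrates Skills

10.1 The Meta Skill: A One-Command Pipeline

We now have 6 independent Skills, but the real power comes from chaining them together, with one "meta Skill" doing the orchestration: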

---
name: pipeline
description: >
  Executes the complete CI/CD pipeline: review, security scan,
  docker build, K8s deploy, verification, and rollback readiness.
  Use when the user wants to deploy, run the full pipeline,
  or says "ship it".
disable-model-invocation: true
metadata:
  author: devops-team
  version: "1.0"
---

# Pipeline Skill: One-Command CI/CD

## Execution Flow

When this skill is invoked, execute the following stages **in order**.
If any stage fails, STOP and report the failure. Do not proceed.

### Stage 1: CI Review

Load and execute the `ci-review` skill.
Review all CI/CD configurations for correctness and security.

**Gate**: Zero 🔴 findings required to proceed.

### Stage 2: Security Scan

Load and execute the `security-scan` skill.
Run all three scan layers (SAST, dependency, secrets).

**Gate**: Zero 🔴 findings required to proceed.

### Stage 3: Docker Build

Load and execute the `docker-build` skill.
Build the production image with security scanning.

**Gate**: Trivy scan passes (no HIGH/CRITICAL CVEs).

### Stage 4: K8s Deploy

Load and execute the `k8s-deploy` skill.
Deploy to the target environment with proper manifests.

**Gate**: `kubectl rollout status` succeeds.

### Stage 5: Deploy Verify

Load and execute the `deploy-verify` skill.
Run all five verification phases.

**Gate**: All phases PASS or WARNING (not CRITICAL).

### Stage 6: Rollback Guard

Load and execute the `rollback-guard` skill (plan + verify mode).
Confirm rollback readiness for this deployment.

**Gate**: Rollback plan exists and verified.

## Output Summary

After all stages complete:

╔══════════════════════════════════════════╗
║         CI/CD PIPELINE EXECUTION         ║
╠══════════════════════════════════════════╣
║ Stage 1: CI Review         ✅ PASS       ║
║ Stage 2: Security Scan     ✅ PASS       ║
║ Stage 3: Docker Build      ✅ PASS       ║
║ Stage 4: K8s Deploy        ✅ PASS       ║
║ Stage 5: Deploy Verify     🟡 WARN       ║
║ Stage 6: Rollback Guard    ✅ READY      ║
╠══════════════════════════════════════════╣
║ OVERALL: ✅ DEPLOYED WITH WARNINGS       ║
║ Rollback: READY (revision 47)            ║
╚══════════════════════════════════════════╝


## Error Handling

If any stage fails:

1. Print the full failure details
2. If Stage 4+ fails, ask: "Do you want to rollback?"
3. If user says yes, execute `rollback-guard` in execute mode
4. Generate an incident ticket with timeline

## Fast Path

For non-production deployments (dev/staging), relax the pipeline:
- Skip Stage 6 (Rollback Guard)
- Downgrade the Stage 3 Trivy gate (HIGH/CRITICAL) from blocking to a warning

User can say: `/pipeline --env=staging` for the fast path.
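
In practice the Agent follows the SKILL.md instructions directly, but the control flow they describe (run stages in order, stop at the first failure, offer a rollback once Stage 4 or later has failed) can be modelled in a few lines. The sketch below is only an illustration of that flow; `Stage`, `Status`, and `run_pipeline` are hypothetical names, and each `run` callable stands in for "load and execute the Skill".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class Status(Enum):
    PASS = "PASS"
    WARN = "WARN"
    FAIL = "FAIL"


@dataclass
class Stage:
    number: int
    name: str
    run: Callable[[], Status]  # hypothetical hook standing in for the Skill


def run_pipeline(stages: List[Stage]) -> Tuple[str, List[Tuple[str, Status]], bool]:
    """Run stages in order and stop at the first failure.

    Returns (overall verdict, per-stage results, offer_rollback).
    """
    results: List[Tuple[str, Status]] = []
    for stage in stages:
        status = stage.run()
        results.append((stage.name, status))
        if status is Status.FAIL:
            # From Stage 4 onward the cluster has been touched,
            # so the caller should ask whether to roll back.
            return "FAILED", results, stage.number >= 4
    if any(status is Status.WARN for _, status in results):
        return "DEPLOYED WITH WARNINGS", results, False
    return "DEPLOYED", results, False
```

A failing Stage 2 stops the run before any image is built, while a failing Stage 5 returns `offer_rollback=True`, which corresponds to the "Do you want to rollback?" prompt in the Error Handling section.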

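10.2 Usage

# Full pipeline in one command (production)
> /pipeline

# Fast path (staging; skips the rollback guard)
> /pipeline --env=staging

# Security scan only
> /security-scan

# Build the Docker image only
> /docker-build

# Deploy and verify
> /k8s-deploy

# Something broke: roll back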
> /rollback-guard
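
To make the difference between `/pipeline` and `/pipeline --env=staging` concrete, here is a small Python sketch that turns such an invocation into an execution plan. It is an assumption-laden illustration: the Agent interprets the command from the SKILL.md text rather than through a parser, and the `plan` function, the accepted `--env` values, and the `trivy_gate` field are hypothetical.

```python
import argparse
import shlex
from typing import Dict, List, Union

ALL_STAGES = [
    "ci-review",
    "security-scan",
    "docker-build",
    "k8s-deploy",
    "deploy-verify",
    "rollback-guard",
]


def plan(command: str) -> Dict[str, Union[str, List[str]]]:
    """Translate a '/pipeline [--env=...]' invocation into a stage plan."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "/pipeline":
        raise ValueError("expected a /pipeline invocation")
    parser = argparse.ArgumentParser(prog="/pipeline", add_help=False)
    parser.add_argument("--env", default="production",
                        choices=("production", "staging", "dev"))
    args = parser.parse_args(tokens[1:])

    fast_path = args.env != "production"
    # Fast path: drop the rollback guard and relax the Trivy gate.
    stages = [s for s in ALL_STAGES if not (fast_path and s == "rollback-guard")]
    return {
        "env": args.env,
        "stages": stages,
        "trivy_gate": "warn" if fast_path else "block",
    }
```

Calling `plan("/pipeline")` yields all six stages with a blocking Trivy gate, while `plan("/pipeline --env=staging")` yields five stages and a warning-only gate.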

11. Advanced: Deep Integration with GitHub Actions

11.1 Invoking Agent Skills Automatically in CI

Use the Claude Code GitHub Action so that the CI pipeline runs Skills automatically:

# .github/workflows/ai-review.yml
name: AI Agent Review

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  ai-review:
    runs-on: ubuntu-latest
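    permissions:
      contents: read
      pull-requests: write
      # Lets the action request an OIDC token for its GitHub App auth (unless a github_token is supplied)
      id-token: write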
    steps:
      - uses: actions/checkout@v4

      - name: Run AI Code Review
        uses: anthropics/claude-code-action@v1
        with:
          claude_code_oauth_token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
          # Skills under .claude/skills/ arrive with the checkout above; the prompt names the ones to use
          prompt: |
            Review this PR using the ci-review and security-scan skills.
            Post findings as PR comments.

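11.2 Skill as Documentation: Explainable CI/CD Configuration

A long-standing problem with traditional CI/CD is that YAML is not written for people to read. A 200-line workflow file overwhelms a newcomer.

An Agent Skill, by contrast, is explainable by nature:

Newcomer: "Why does CI run SAST before the dependency scan?"

Agent: According to the design of the security-scan Skill:
  Layer 1 (SAST) runs first because it is pure static analysis and the fastest layer (~30s),
  so it can quickly block obvious code-injection vulnerabilities.
  Layer 2 (dependency scan) has to query CVE databases over the network and is slower (~2min).
  If SAST already blocks the change, there is no point spending time on the later scans.

  This is the fail-fast principle: put the fastest checks first.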
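
The fail-fast ordering the Agent describes can be sketched in Python as "sort checks by estimated cost, stop at the first failure". The `Check` tuple and `run_fail_fast` function are hypothetical names for illustration; the durations mirror the rough figures quoted in the dialogue.

```python
from typing import Callable, List, NamedTuple, Optional


class Check(NamedTuple):
    name: str
    est_seconds: float          # rough expected duration
    run: Callable[[], bool]     # True means the check passed


def run_fail_fast(checks: List[Check]) -> Optional[str]:
    """Run checks cheapest-first and stop at the first failure.

    Returns the name of the failing check, or None if all passed.
    """
    for check in sorted(checks, key=lambda c: c.est_seconds):
        if not check.run():
            return check.name
    return None
```

With a 30-second SAST check and a 2-minute dependency scan, a SAST failure returns immediately and the slower scan never starts, regardless of the order in which the checks were registered.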

You no longer need to maintain a Wiki that is always out of date. The Skill itself is the living documentation.


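12. Enterprise Deployment: Security Red Lines and Team Conventions

12.1 Security Red-Line Checklist

| Red line | Checkpoint | Enforced in Skill |
|---|---|---|
| Zero hardcoded secrets | No AWS keys, GitHub tokens, or database passwords | security-scan Layer 3 |
| Zero latest tags | No latest in Docker images or K8s manifests | docker-build + k8s-deploy |
| Zero root containers | USER nonroot in the Dockerfile | docker-build |
| Zero unbounded resources | requests + limits set in K8s manifests | k8s-deploy |
| Zero deployments without rollback | A rollback plan exists and is verified before deploying | rollback-guard |
| Zero unscanned images | Trivy scan reports no HIGH/CRITICAL findings | docker-build |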
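
Two of these red lines (no latest tags, no root containers) are simple enough to check mechanically. The following Python sketch shows one way such checks might look; it is a line-based heuristic over manifest and Dockerfile text, not how the Skills themselves are implemented, and the function names are hypothetical.

```python
import re
from typing import List

# Matches "image: <ref>" lines in a YAML manifest, optionally quoted.
IMAGE_LINE = re.compile(r"^\s*(?:-\s*)?image:\s*['\"]?([^'\"\s#]+)", re.MULTILINE)


def uses_latest_tag(image: str) -> bool:
    """True if the reference resolves to 'latest' (explicitly or by default)."""
    if "@" in image:  # pinned by digest
        return False
    last_segment = image.rsplit("/", 1)[-1]  # ignore any registry:port prefix
    if ":" not in last_segment:
        return True  # no tag means the runtime falls back to latest
    return last_segment.rsplit(":", 1)[1] == "latest"


def find_latest_images(manifest_text: str) -> List[str]:
    """Return every image reference in the manifest that violates the red line."""
    return [img for img in IMAGE_LINE.findall(manifest_text) if uses_latest_tag(img)]


def dockerfile_runs_as_root(dockerfile_text: str) -> bool:
    """True if the final build stage has no USER, or sets USER to root."""
    user = None
    for line in dockerfile_text.splitlines():
        parts = line.strip().split()
        if not parts:
            continue
        instruction = parts[0].upper()
        if instruction == "FROM":
            user = None  # each stage starts with the base image's default user
        elif instruction == "USER" and len(parts) > 1:
            user = parts[1]
    return user is None or user.split(":")[0] in {"root", "0"}
```

The Dockerfile check is deliberately conservative: a base image may already default to a non-root user, but the red line above asks for an explicit `USER` instruction, so its absence is reported.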

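12.2 Team Collaboration Model

Tech Lead:
  → Maintains the .claude/skills/ directory (code review, version control)
  → Defines security red lines and company conventions

DevOps Engineer:
  → Writes and tunes Skill scripts
  → Monitors pipeline health

Developer:
  → Uses /pipeline for one-command deployment
  → Uses /security-scan for self-checks
  → Uses /rollback-guard in emergencies

Newcomer:
  → Reads SKILL.md to understand the CI/CD flow
  → Learns best practices through /ci-review
  → Treats Skills as interactive onboarding documentation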

12.3 Skill Versioning and Distribution

# Internal enterprise Skill Registry (recommended)
npx skills add https://gitlab.internal.com/devops/agent-skills --skill pipeline

# Team sharing: commit the Skills to the project repository
git add .claude/skills/
git commit -m "feat: add CI/CD pipeline skills v2.0"

# New members get every Skill automatically after cloning
git clone https://github.com/org/my-project.git
cd my-project
# .claude/skills/ ships with the project; nothing else to install
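
Because the Skills travel with the repository, a malformed SKILL.md reaches every teammate on the next pull. A lightweight pre-commit check can catch the most basic mistake, missing `name` or `description` in the frontmatter. The Python sketch below is a simplified illustration that only looks at top-level keys and does not parse YAML; `frontmatter_keys` and `validate_skills` are hypothetical helpers, not part of any official tooling.

```python
import re
from pathlib import Path
from typing import List

REQUIRED_KEYS = ("name", "description")


def frontmatter_keys(text: str) -> List[str]:
    """Return the top-level keys of the leading '---' block, or [] if absent."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []
    keys: List[str] = []
    for line in lines[1:]:
        if line.strip() == "---":
            return keys
        match = re.match(r"^([A-Za-z0-9_-]+):", line)  # unindented "key:" only
        if match:
            keys.append(match.group(1))
    return []  # closing '---' never found


def validate_skills(root: Path) -> List[str]:
    """Check every <root>/<skill>/SKILL.md and list the missing required keys."""
    problems: List[str] = []
    for skill_md in sorted(root.glob("*/SKILL.md")):
        keys = frontmatter_keys(skill_md.read_text(encoding="utf-8"))
        for key in REQUIRED_KEYS:
            if key not in keys:
                problems.append(f"{skill_md.parent.name}: missing '{key}'")
    return problems
```

Running `validate_skills(Path(".claude/skills"))` before `git commit` returns an empty list when every Skill declares both fields.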

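13. Results: Traditional CI/CD vs. Agent Skills CI/CD

| Dimension | Traditional approach | Agent Skills approach |
|---|---|---|
| Setup time | 2-3 days of hand-written YAML | 30 minutes to write 6 Skills |
| Security coverage | Depends on engineer experience; easy to miss things | Checklist built into each Skill, so checks are not forgotten |
| Knowledge transfer | Leaves with the person | Stays with the project |
| Incident response | Dig through the Wiki, ask senior staff | Ask the Agent and get an answer in seconds |
| Onboarding | 1-2 weeks to understand the pipeline | Read SKILL.md; about 1 day |
| Explainability | YAML is hard to read | Natural-language instructions anyone can read |
| Cross-project reuse | Copy-paste YAML | Reuse at the Skill-directory level |
| Rollback plan | Often forgotten | Generated and verified as a mandatory step |
| Friday deploys | Nerve-racking | Backed by the rollback guard |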

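14. Summary: A Key Step from DevOps to AIOps

14.1 Core Insight

Agent Skills + CI/CD is not simply "using AI to write YAML"; it is a fundamental paradigm shift:

Traditional DevOps:
  People write config → machines execute → people investigate when something breaks

Agent Skills DevOps:
  People define rules → the Agent executes and reviews → the Agent responds first when something breaks

You are no longer "the person who writes pipelines"; you are "the person who teaches the Agent how to write pipelines".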

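14.2 The Essence of the Six Skills (Plus the Meta-Skill)

| Skill | Essence | Analogy |
|---|---|---|
| ci-review | CI configuration reviewer | Quality inspector |
| security-scan | Security checkpoint | Airport security |
| docker-build | Artifact factory manager | Production floor |
| k8s-deploy | Deployment commander | Traffic dispatcher |
| deploy-verify | Acceptance inspector | Final factory inspection |
| rollback-guard | Safety net | Safety rope |
| pipeline | Chief dispatcher | Plant director |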

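14.3 Action Roadmap

Step 1: Install the ci-review and security-scan Skills
        → Immediate improvement in CI security
        ↓
Step 2: Add docker-build and k8s-deploy
        → Standardized build and deployment
        ↓
Step 3: Add deploy-verify and rollback-guard
        → Post-deployment verification and a safety net
        ↓
Step 4: Write the pipeline meta-Skill
        → One-command full pipeline
        ↓
Step 5: Wire in the Claude Code Action for GitHub Actions
        → Skills invoked automatically in CI; PRs reviewed automatically
        ↓
Step 6: Roll out to the team → enterprise Skill Registry
        → From "one person uses it" to "team-wide standard"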

Appendix: Key Resources

| Resource | Link |
|---|---|
| Agent Skills open standard | https://agentskills.io/specification |
| Previous article: The Complete Guide to Agent Skills | The Paradigm Revolution from Prompt to Skill |
| devops-engineer Skill | https://github.com/Jeffallan/claude-skills |
| Pulumi Agent Skills | https://github.com/pulumi/agent-skills |
| GitHub Actions Claude Code Action | https://github.com/anthropics/claude-code-action |
| Trivy container scanning | https://trivy.dev |
| Semgrep SAST | https://semgrep.dev |
| Gitleaks secret detection | https://gitleaks.io |
| ArgoCD GitOps | https://argoproj.github.io/cd |
| skills-ref validation tool | https://github.com/agentskills/agentskills |
| awesome-agent-skills | https://github.com/VoltAgent/awesome-agent-skills |

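Closing thoughts: CI/CD is one of the most "anti-human" parts of software engineering. It demands that you understand security and operations while also writing YAML, a language that is hostile to humans. Agent Skills package this hard-won knowledge into reusable "skill packs", so every developer can have the equivalent of a DevOps engineer with 10 years of experience checking their work.

From today, your CI/CD pipeline is no longer a cold pipe. It is an intelligent assistant with judgment, security awareness, and a rollback plan.

Like, bookmark, and follow. Next time we will look at building a multi-Agent microservice governance platform with Agent Skills.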
