Kubernetes Operator Pattern
2025-06-16
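The Operator pattern packages operational knowledge as software: a custom controller encodes what a human operator knows, so that complex stateful applications can be managed automatically. This article covers how Operators work, the frameworks for building them, and some widely used examples.
Operator Core Concepts
An Operator is essentially a custom resource, defined by a CustomResourceDefinition (CRD), plus a custom controller. It extends the Kubernetes API and encodes domain-specific operational logic in the controller.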
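To make the "custom resource plus custom controller" split concrete, here is a minimal sketch in plain Go with no Kubernetes dependencies. Everything in it is hypothetical and simplified for illustration: `fakeCluster` stands in for the real cluster, and the trimmed-down `MySQLCluster` type only mimics the shape of a custom resource (a user-declared spec and a controller-reported status).

```go
package main

import "fmt"

// MySQLClusterSpec is the desired state a user declares (hypothetical, simplified).
type MySQLClusterSpec struct {
	Replicas int
}

// MySQLClusterStatus is what the controller observed and reports back.
type MySQLClusterStatus struct {
	ReadyReplicas int
}

// MySQLCluster plays the role of a custom resource instance.
type MySQLCluster struct {
	Name   string
	Spec   MySQLClusterSpec
	Status MySQLClusterStatus
}

// fakeCluster stands in for the real world: the number of pods per resource name.
type fakeCluster struct {
	pods map[string]int
}

// reconcile is the "custom controller" half: it reads the spec, changes the
// world until it matches, then records what it observed in the status.
func reconcile(world *fakeCluster, cr *MySQLCluster) {
	for world.pods[cr.Name] < cr.Spec.Replicas {
		world.pods[cr.Name]++ // scale up; real operational logic would live here
	}
	for world.pods[cr.Name] > cr.Spec.Replicas {
		world.pods[cr.Name]-- // scale down
	}
	cr.Status.ReadyReplicas = world.pods[cr.Name]
}

func main() {
	world := &fakeCluster{pods: map[string]int{}}
	cr := &MySQLCluster{Name: "orders-db", Spec: MySQLClusterSpec{Replicas: 3}}

	reconcile(world, cr)
	fmt.Println(cr.Status.ReadyReplicas) // prints 3

	cr.Spec.Replicas = 1 // the user edits the desired state
	reconcile(world, cr)
	fmt.Println(cr.Status.ReadyReplicas) // prints 1
}
```

In a real Operator the API server stores the resource and the controller talks to it through a client, but the division of labor is the same: users write the spec, the controller writes the status.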
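The Controller Reconciliation Loop
At the heart of a controller is the Reconcile function, which continually drives the current state toward the desired state.
The basic structure of a Reconcile function (Go example):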
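import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log"

	v1alpha1 "github.com/example/mysql-operator/api/v1alpha1"
)

func (r *MySQLClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	logger := log.FromContext(ctx)

	// 1. Fetch the CR instance.
	cluster := &v1alpha1.MySQLCluster{}
	if err := r.Get(ctx, req.NamespacedName, cluster); err != nil {
		if apierrors.IsNotFound(err) {
			return ctrl.Result{}, nil // the CR was deleted; nothing to do
		}
		return ctrl.Result{}, err
	}

	// 2. Handle deletion (finalizer cleanup logic).
	if !cluster.DeletionTimestamp.IsZero() {
		return r.handleDeletion(ctx, cluster)
	}

	// 3. Ensure the StatefulSet exists.
	if err := r.ensureStatefulSet(ctx, cluster); err != nil {
		return ctrl.Result{}, err
	}

	// 4. Ensure the Service exists.
	if err := r.ensureService(ctx, cluster); err != nil {
		return ctrl.Result{}, err
	}

	// 5. Check cluster health. A non-nil error would override RequeueAfter,
	// so log the failure and ask for a retry in 30 seconds instead.
	healthy, err := r.checkClusterHealth(ctx, cluster)
	if err != nil {
		logger.Error(err, "cluster health check failed")
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}

	// 6. Update the status subresource.
	cluster.Status.Ready = healthy
	// Simplification: a production controller would report the observed
	// replica count (for example from the StatefulSet status), not the spec.
	cluster.Status.Replicas = cluster.Spec.Replicas
	if err := r.Status().Update(ctx, cluster); err != nil {
		return ctrl.Result{}, err
	}

	// 7. Reconcile again periodically.
	return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil
}

Key Design Principles for Reconcile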
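- Idempotency: calling Reconcile repeatedly for the same input produces the same result
- Level-triggered rather than edge-triggered: ask "what is the current state?", not "which event just happened?"
- Desired-state driven: only the final state matters, not the intermediate steps
- Retry on error: returning an error automatically requeues the request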
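The principles above can be illustrated with a small, self-contained simulation. This is a sketch in plain Go, not controller-runtime code: the `queue`, `state`, `reconcile`, and `drain` names are hypothetical, and the retry is immediate, whereas a real work queue would apply a backoff delay.

```go
package main

import (
	"errors"
	"fmt"
)

// state is a toy stand-in for the cluster: desired and actual replicas per key.
type state struct {
	desired map[string]int
	actual  map[string]int
}

// queue deduplicates keys, so several events for one object collapse
// into a single pending reconcile.
type queue struct {
	pending map[string]bool
	order   []string
}

func newQueue() *queue { return &queue{pending: map[string]bool{}} }

func (q *queue) add(key string) {
	if q.pending[key] {
		return // already queued; the event itself carries no extra information
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

func (q *queue) pop() (string, bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key := q.order[0]
	q.order = q.order[1:]
	delete(q.pending, key)
	return key, true
}

var errTransient = errors.New("transient failure")

// reconcile looks only at the current state (level-triggered) and returns how
// many changes it made. With no new input, a second call changes nothing.
func reconcile(s *state, key string, failOnce *bool) (int, error) {
	if *failOnce {
		*failOnce = false
		return 0, errTransient
	}
	changes := 0
	for s.actual[key] != s.desired[key] {
		if s.actual[key] < s.desired[key] {
			s.actual[key]++
		} else {
			s.actual[key]--
		}
		changes++
	}
	return changes, nil
}

// drain processes the queue and re-adds a key whenever reconcile fails.
func drain(q *queue, s *state, failOnce *bool) (runs, changes int) {
	for {
		key, ok := q.pop()
		if !ok {
			return runs, changes
		}
		runs++
		n, err := reconcile(s, key, failOnce)
		if err != nil {
			q.add(key) // retry on error
			continue
		}
		changes += n
	}
}

func main() {
	s := &state{desired: map[string]int{"db": 3}, actual: map[string]int{}}
	q := newQueue()
	q.add("db")
	q.add("db")
	q.add("db") // three events, one pending item

	fail := true // make the first reconcile fail to exercise the retry path
	runs, changes := drain(q, s, &fail)
	fmt.Println(runs, changes, s.actual["db"]) // prints: 2 3 3

	q.add("db") // a later resync finds nothing to do
	runs, changes = drain(q, s, &fail)
	fmt.Println(runs, changes) // prints: 1 0
}
```

Because `reconcile` never inspects which event arrived, losing or duplicating events is harmless: the next run still converges on the desired state.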
Operator SDK
Operator SDK offers three ways to build an Operator: Go, Helm, and Ansible.
Go-based Operator
The most powerful approach, with full control over the reconciliation logic:

# Initialize the project
operator-sdk init --domain example.com --repo github.com/example/mysql-operator
# Create the API and controller
operator-sdk create api --group database --version v1alpha1 --kind MySQLCluster --resource --controller

The resulting CRD type definitions:
// api/v1alpha1/mysqlcluster_types.go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// MySQLClusterSpec is the desired state declared by the user.
// StorageSpec and BackupSpec are defined elsewhere in this package (not shown).
type MySQLClusterSpec struct {
	// Replicas is the number of MySQL instances
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:validation:Maximum=7
	Replicas int32 `json:"replicas"`

	// Version is the MySQL version
	// +kubebuilder:validation:Enum={"8.0","8.1","8.2"}
	Version string `json:"version"`

	// Storage defines the storage configuration
	Storage StorageSpec `json:"storage"`

	// Backup defines the backup configuration
	// +optional
	Backup *BackupSpec `json:"backup,omitempty"`
}

// MySQLClusterStatus is the observed state reported by the controller.
type MySQLClusterStatus struct {
	// Ready indicates whether the cluster is ready
	Ready bool `json:"ready"`

	// Replicas is the current number of replicas
	Replicas int32 `json:"replicas"`

	// Conditions represent the latest available observations
	// +optional
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:printcolumn:name="Replicas",type="integer",JSONPath=".spec.replicas"
// +kubebuilder:printcolumn:name="Ready",type="boolean",JSONPath=".status.ready"
// +kubebuilder:printcolumn:name="Version",type="string",JSONPath=".spec.version"
// +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
type MySQLCluster struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   MySQLClusterSpec   `json:"spec,omitempty"`
	Status MySQLClusterStatus `json:"status,omitempty"`
}

Helm-based Operator
Wraps an existing Helm chart as an Operator; suited to simple scenarios:

operator-sdk init --plugins helm --domain example.com
operator-sdk create api --group apps --version v1alpha1 --kind NginxIngress --helm-chart nginx-ingress

# watches.yaml
- group: apps.example.com
  version: v1alpha1
  kind: NginxIngress
  chart: helm-charts/nginx-ingress
  watchDependentResources: true
  overrideValues:
    controller.replicaCount: $SPEC.replicas
    controller.service.type: $SPEC.serviceType

Ansible-based Operator
Implements the operational logic with Ansible playbooks or roles:

operator-sdk init --plugins ansible --domain example.com
operator-sdk create api --group cache --version v1alpha1 --kind Redis

# watches.yaml
- version: v1alpha1
  group: cache.example.com
  kind: Redis
  role: redis
  reconcilePeriod: 5m

Kubebuilder
Kubebuilder is the framework underlying the Go flavor of Operator SDK; it provides project scaffolding and code generation.
Key marker comments (used for code generation):
// +kubebuilder:validation:Minimum=1 - field validation
// +kubebuilder:validation:Enum={"a","b"} - allowed enum values
// +kubebuilder:default:=3 - default value
// +kubebuilder:subresource:status - enable the status subresource
// +kubebuilder:printcolumn - kubectl output columns
// +kubebuilder:rbac:groups=...,verbs=... - RBAC permission declaration
// +optional - optional field

OLM (Operator Lifecycle Manager)
OLM manages the installation, upgrade, and lifecycle of Operators:

# Subscription example
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: prometheus-operator
  namespace: operators
spec:
  channel: stable
  name: prometheus
  source: operatorhubio-catalog
  sourceNamespace: olm
  installPlanApproval: Automatic # approve installs and upgrades automatically

Common Operator Examples
Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  replicas: 2
  serviceAccountName: prometheus
  serviceMonitorSelector:
    matchLabels:
      team: backend
  ruleSelector:
    matchLabels:
      role: alert-rules
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 100Gi
  retention: 30d

Cert-Manager
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app-tls
spec:
  secretName: app-tls-secret
  duration: 2160h # 90 days
  renewBefore: 360h # renew 15 days before expiry
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - app.example.com
    - "*.app.example.com"

Strimzi (Kafka Operator)
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: production-kafka
spec:
  kafka:
    version: 3.6.0
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    storage:
      type: persistent-claim
      size: 500Gi
      class: fast-ssd
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      default.replication.factor: 3
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 50Gi

Summary
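The core value of the Operator pattern lies in codifying operational knowledge. The three development approaches compare as follows:
| Approach | Best suited for | Complexity |
|---|---|---|
| Go (Kubebuilder) | Complex stateful applications | High |
| Helm | Simple wrapping of an existing Helm chart | Low |
| Ansible | Operations teams working from playbooks | Medium |
Best practices for developing an Operator:
- Keep the Reconcile function idempotent, so repeated calls yield the same result
- Use finalizers to handle resource cleanup
- Use Status and Conditions appropriately to report state
- Degrade gracefully: do not crash when an external dependency is unavailable
- Write thorough integration tests using the envtest framework
- Distribute the Operator through OLM to simplify installation and upgrades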
Last updated: 2026/3/14 13:09