[Solved] After adding a local minikube cluster, the nfs-server Service is missing, so the related volumes can never be mounted

Please fill in the following information so that we can help troubleshoot the problem

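Rainbond version: rainbond:v5.3.0-release-allinone
Operating system: Ubuntu 18.04.5 LTS
Kernel version: 5.4.0-66-generic #74~18.04.2-Ubuntu
Environment (cloud provider, virtual machine, etc.): physical machine
Node configuration: 8 cores, 16 GB
Installation type: allinone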

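How to reproduce:

- Start an allinone Rainbond service with docker
- Create a k8s cluster with minikube
- Add that k8s cluster in Rainbond and initialize it
- The k8s dashboard shows that the Rainbond services rbd-hub and rbd-node fail to initialize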

Attempted fixes:
Screenshots:


Was the installation re-run?
Retried several times, with the same problem each time

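For cluster or application problems, please also provide the following:

  1. Whether the cluster is healthy (grctl cluster)
  2. Whether the application is healthy (grctl service get <app alias> -t <tenant>)
  3. Whether the application listens on the correct port, whether health checks are enabled, and whether the persistent directories are set correctly
  4. The steps you took, and whether the problem can be reproduced
  5. Whether you tried updating the images of some components, and whether that helped
  6. If the console reports an error or exception, identify which API call fails (browser developer tools, F12)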

CC: @dazuimao1990

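In some special cases the nfs-server Service resource (named nfs-provisioner in the manifest below) is not created correctly, and this does block the installation from continuing. The steps to create this resource manually are: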

  • First, look up the uid of the rainbondvolumerwx storageclass
kubectl get sc rainbondvolumerwx -n rbd-system -o yaml | grep uid

uid: 5cc5173b-e031-4894-99e9-2d1191814629
  • Next, create a configuration file named nfs-provisioner-service.yaml that defines this resource, with the following content:
apiVersion: v1
kind: Service
metadata:
  labels:
    belongTo: rainbond-operator
    creator: Rainbond
    name: nfs-provisioner
    manager: manager
    operation: Update
  name: nfs-provisioner
  namespace: rbd-system
  ownerReferences:
  - apiVersion: rainbond.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: RainbondVolume
    name: rainbondvolumerwx
    uid: <fill in the uid obtained in the previous step>
  resourceVersion: "837"
  selfLink: /api/v1/namespaces/rbd-system/services/nfs-provisioner
  uid: f98d2472-75fd-4c89-b127-69ecc4940342
spec:
  clusterIP: 10.43.0.139
  ports:
  - name: nfs
    port: 2049
    protocol: TCP
    targetPort: nfs
  - name: nfs-udp
    port: 2049
    protocol: UDP
    targetPort: nfs-udp
  - name: nlockmgr
    port: 32803
    protocol: TCP
    targetPort: nlockmgr
  - name: nlockmgr-udp
    port: 32803
    protocol: UDP
    targetPort: nlockmgr-udp
  - name: mountd
    port: 20048
    protocol: TCP
    targetPort: mountd
  - name: mountd-udp
    port: 20048
    protocol: UDP
    targetPort: mountd-udp
  - name: rquotad
    port: 875
    protocol: TCP
    targetPort: rquotad
  - name: rquotad-udp
    port: 875
    protocol: UDP
    targetPort: rquotad-udp
  - name: rpcbind
    port: 111
    protocol: TCP
    targetPort: rpcbind
  - name: rpcbind-udp
    port: 111
    protocol: UDP
    targetPort: rpcbind-udp
  - name: statd
    port: 662
    protocol: TCP
    targetPort: statd
  - name: statd-udp
    port: 662
    protocol: UDP
    targetPort: statd-udp
  selector:
    belongTo: rainbond-operator
    creator: Rainbond
    name: nfs-provisioner
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
  • Finally, create the Service
kubectl apply -f nfs-provisioner-service.yaml
  • Verify
kubectl get service nfs-provisioner -n rbd-system

NAME              TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                                                                                                     AGE
nfs-provisioner   ClusterIP   10.43.0.139   <none>        2049/TCP,2049/UDP,32803/TCP,32803/UDP,20048/TCP,20048/UDP,875/TCP,875/UDP,111/TCP,111/UDP,662/TCP,662/UDP   5d12h
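  • Continue the installation

In theory the installation continues automatically. If, however, pods in the rbd-system namespace stay in the Pending state for too long, delete those pods manually so that they are restarted. The commands are omitted here.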
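
Since the commands were omitted, here is a rough sketch of how the manual steps and the pod cleanup could be scripted. It is not an official Rainbond procedure. The function names, the overridable KUBECTL variable, and the `__OWNER_UID__` token are all assumptions made for illustration: the token stands for whatever placeholder you leave in the ownerReferences uid field of nfs-provisioner-service.yaml.

```shell
#!/bin/sh
# Sketch only. KUBECTL can be overridden (for example with "echo" for a dry run).
KUBECTL="${KUBECTL:-kubectl}"
NS="rbd-system"

# Print the uid of the rainbondvolumerwx StorageClass (step 1 above).
# jsonpath returns only metadata.uid, so no grep is needed.
owner_uid() {
  "$KUBECTL" get sc rainbondvolumerwx -o jsonpath='{.metadata.uid}'
}

# Read a manifest on stdin and replace the hypothetical token
# __OWNER_UID__ with the uid passed as the first argument.
fill_owner_uid() {
  sed "s/__OWNER_UID__/$1/"
}

# Delete every pod in the namespace that is still in the Pending phase,
# so that its controller creates a fresh one.
delete_pending_pods() {
  "$KUBECTL" delete pod -n "$NS" --field-selector=status.phase=Pending
}
```

With these defined, `fill_owner_uid "$(owner_uid)" < nfs-provisioner-service.yaml | kubectl apply -f -` would cover the create step, and `delete_pending_pods` the cleanup. Check the list of Pending pods with `kubectl get pods -n rbd-system` before deleting anything.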


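After following the steps above, nfs-provisioner started successfully and the corresponding volumes were mounted. However, initialization of the Rainbond cluster still fails.
The main failing conditions are: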

  - lastHeartbeatTime: "2021-03-10T02:37:53Z"
    lastTransitionTime: "2021-03-09T11:38:33Z"
    message: rbdcomponent(rbd-chaos) not ready
    reason: RbdComponentNotReady
    status: "False"
    type: Running

Searching the k8s dashboard for rbd-chaos returns nothing.

  - lastHeartbeatTime: "2021-03-10T02:37:53Z"
    lastTransitionTime: "2021-03-09T11:38:53Z"
    message: 'push image: error detail: Get https://goodrain.me/v2/: dial tcp: lookup
      goodrain.me on 192.168.122.1:53: no such host'
    reason: DefaultImageRepoFailed
    status: "False"
    type: ImageRepository
  gatewayAvailableNodes: {}

Try deleting the rainbond-operator-xxxx pod in the rbd-system namespace to retrigger the whole initialization process.
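
A minimal sketch of that suggestion follows. It matches the pod by its name prefix because the xxxx suffix differs per cluster. The function names and the KUBECTL override are assumptions for illustration, and the sketch relies on the pod being recreated automatically after deletion, as the advice above implies.

```shell
#!/bin/sh
# Sketch only. KUBECTL can be overridden for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Filter `kubectl get pods -o name` output (stdin) down to the
# rainbond-operator pods. "|| true" keeps the exit status zero when
# nothing matches.
operator_pods() {
  grep '^pod/rainbond-operator-' || true
}

# Delete each rainbond-operator pod found in rbd-system.
restart_operator() {
  "$KUBECTL" get pods -n rbd-system -o name | operator_pods | while read -r pod; do
    "$KUBECTL" delete -n rbd-system "$pod"
  done
}
```

Run `restart_operator`, then watch `kubectl get pods -n rbd-system` until a new rainbond-operator pod is Running.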

After reinitializing once, initialization succeeded, but the final step, checkAPIHealthy, fails. I will check the certificate first.

time="2021-03-10T18:45:35+08:00" level=info msg="cluster rainbond states: Push Images:5/5\tKubernetesVersion=>True;\tStorage=>True;\tDNS=>True;\tMemory=>True;\tRunning=>True;\tImageRepository=>True;\t"
time="2021-03-10T18:45:35+08:00" level=error msg="ping region api failure: Get \"https://192.168.39.215:8443/v2/health\": x509: certificate signed by unknown authority"
[GIN] 2021/03/10 - 18:45:40 | 200 |     307.512µs |       127.0.0.1 | GET      "/enterprise-server/api/v1/enterprises/13d6cc80f5434e41a225a04dcd42c78b/tasks/f8359bbaf5f84e16b33d5851aa0d932e/events"
time="2021-03-10T18:45:40+08:00" level=info msg="cluster rainbond states: Push Images:5/5\tKubernetesVersion=>True;\tStorage=>True;\tDNS=>True;\tMemory=>True;\tRunning=>True;\tImageRepository=>True;\t"
time="2021-03-10T18:45:40+08:00" level=error msg="ping region api failure: Get \"https://192.168.39.215:8443/v2/health\": x509: certificate signed by unknown authority"
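
One way to check the certificate is to look at what the endpoint in the log actually presents. This is only a sketch, assuming openssl is installed on the machine running the console. The function names are made up for illustration, and the address is the one from the error message above.

```shell
#!/bin/sh
# Summarize a PEM certificate read from stdin: subject, issuer and
# validity period.
cert_summary() {
  openssl x509 -noout -subject -issuer -dates
}

# Fetch the certificate chain presented by host:port ($1) and summarize
# the first certificate. stdin is redirected so s_client exits right
# after the handshake.
remote_cert_summary() {
  openssl s_client -connect "$1" -showcerts </dev/null 2>/dev/null | cert_summary
}
```

For example, `remote_cert_summary 192.168.39.215:8443` prints the issuer of the served certificate. If that issuer is not the CA the console was configured with, that would be consistent with the "certificate signed by unknown authority" error.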

The resolution of this issue is in the linked post.