37 | kube-scheduler Source Code Analysis (Part 2)

Kong Lingfei (孔令飞)

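Hello, I'm Kong Lingfei.
In the previous lesson, I covered the scheduling plugins and scheduling policies that kube-scheduler loads at startup; you can think of these as the static part of kube-scheduler. The scheduling queue management and Pod scheduling flow covered next are its dynamic part, and they make up the core logic of the scheduler.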
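Scheduling queue management

kube-scheduler takes the Pods that are waiting to be scheduled from the scheduling queue, together with the list of candidate Nodes, and runs them through the scheduling flow so that each Pod ends up on a suitable Node.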
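Creating the scheduling queue

Let's first look at how the scheduling queue is created.
In scheduler.New, the following code creates the priority scheduling queue podQueue: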
func New(ctx context.Context,
         client clientset.Interface,
         informerFactory informers.SharedInformerFactory,
         dynInformerFactory dynamicinformer.DynamicSharedInformerFactory,
         recorderFactory profile.RecorderFactory,
         opts ...Option) (*Scheduler, error) {
    ...
    // First, iterate over the profiles and collect each profile's registered PreEnqueue plugins; these plugins are called before a Pod is added to activeQ.
    preEnqueuePluginMap := make(map[string][]framework.PreEnqueuePlugin)
    queueingHintsPerProfile := make(internalqueue.QueueingHintMapPerProfile)
    for profileName, profile := range profiles {
        preEnqueuePluginMap[profileName] = profile.PreEnqueuePlugins()
        queueingHintsPerProfile[profileName] = buildQueueingHintMap(profile.EnqueueExtensions())
    }
    // Initialize a priority queue to serve as the scheduling queue
    podQueue := internalqueue.NewSchedulingQueue(
        profiles[options.profiles[0].SchedulerName].QueueSortFunc(),
        informerFactory,
        // Initial backoff duration for a Pod
        internalqueue.WithPodInitialBackoffDuration(time.Duration(options.podInitialBackoffSeconds)*time.Second),
        // Maximum backoff duration
        internalqueue.WithPodMaxBackoffDuration(time.Duration(options.podMaxBackoffSeconds)*time.Second),
        internalqueue.WithPodLister(podLister),
        // Maximum time a Pod may stay in the unschedulablePods queue
        internalqueue.WithPodMaxInUnschedulablePodsDuration(options.podMaxInUnschedulablePodsDuration),
        internalqueue.WithPreEnqueuePluginMap(preEnqueuePluginMap),
        internalqueue.WithQueueingHintMapPerProfile(queueingHintsPerProfile),
        // Metrics-related options
        internalqueue.WithPluginMetricsSamplePercent(pluginMetricsSamplePercent),
        internalqueue.WithMetricsRecorder(*metricsRecorder),
    )
    ...
    sched := &Scheduler{
        Cache:                    schedulerCache,
        client:                   client,
        nodeInfoSnapshot:         snapshot,
        percentageOfNodesToScore: options.percentageOfNodesToScore,
        Extenders:                extenders,
        StopEverything:           stopEverything,
        SchedulingQueue:          podQueue,
        Profiles:                 profiles,
        logger:                   logger,
    }
    sched.NextPod = podQueue.Pop
    sched.applyDefaultHandlers()
    if err = addAllEventHandlers(sched, informerFactory, dynInformerFactory, unionedGVKs(queueingHintsPerProfile)); err != nil {
        return nil, fmt.Errorf("adding event handlers: %w", err)
    }
    return sched, nil
}

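Summary

1. Scheduling queue management is part of the dynamic side of `kube-scheduler` and belongs to the scheduler's core logic.
2. The scheduling queue consists of the priority scheduling queue `podQueue` and three important sub-queues: `activeQ`, `backoffQ`, and `unschedulableQ`.
3. `activeQ` is sorted by Pod priority; the Scheduler Pipeline takes Pods from it and runs them through the scheduling flow.
4. `backoffQ` is the backoff queue. It holds Pods whose scheduling attempt failed and retries them periodically.
5. `unschedulableQ` holds Pods that were tried and found to be unschedulable; they are periodically flushed into `activeQ` or `backoffQ`.
6. The scheduling queue's settings include the Pod sort function, the backoff intervals, the PreEnqueue plugins, the queueing hints, and the sampling rate for plugin metrics.
7. Pods are enqueued through Kubernetes EventHandlers: when a Pod or Node changes, the affected Pods are added to the appropriate scheduling queue and to the cache.
8. The `schedulingCycle` function runs the scheduling extension points in order. This includes `sched.SchedulePod` to schedule the Pod, `sched.findNodesThatFitPod` to find the nodes that fit the given Pod, `prioritizeNodes` to rank the feasible nodes, and `selectHost` to pick the final node.
9. `sched.findNodesThatFitPod` finds the nodes that fit the given Pod and returns the list of feasible nodes, diagnostic information, and any error.
10. `prioritizeNodes` ranks the feasible nodes and returns a priority list.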
