Pod Scheduling for Multi-cluster Deployments
SUMMARY Juniper Cloud-Native Contrail Networking (CN2) release 23.2 supports network-aware pod scheduling for multi-cluster deployments. This article provides information about the changes to this feature from the previous release (23.1), and additional components for multi-cluster support.
Pod Scheduling in CN2
This article provides information about network-aware pod scheduling for multi-cluster deployments. For information about this feature for single-cluster deployments, see Pod Scheduling.
The contrail-scheduler schedules pods based on the following network metrics:
- Number of active ingress/egress traffic flows
- Bandwidth utilization
- Number of virtual machine interfaces (VMIs)
Network-Aware Pod Scheduling Overview
Many high-performance applications have bandwidth or network interface requirements as well as the typical CPU or VMI requirements. If contrail-scheduler assigns a pod to a node with low bandwidth availability, that application cannot run optimally. CN2 release 23.1 addressed this issue with the introduction of a metrics-collector, a central-collector, and custom scheduler plugins. These components collect, store, and process network metrics so that the contrail-scheduler schedules pods based on these metrics. CN2 release 23.2 enables multi-cluster deployments to take advantage of this feature.
Custom Controllers
In 23.2, CN2 introduces a metrics controller CR and a central collector CR. The following two new controllers manage the CRs:
- MetricsConfig controller: This controller reconciles and monitors the MetricsConfig custom resource (CR). It writes configuration information for the metrics-collectors in the central and distributed clusters. It also listens to create, read, update, delete (CRUD) events of kubemanager resources on the central cluster. The config file of a metrics-collector contains the following information:
  - The writeLocation field, which references the metrics-collector configMap details for mounting: the name, namespace, and key of the metrics-collector's configMap for writing its configuration.
  - The receiving service's IP address and port. The metrics-collector uses this service IP and port to forward requested metrics data.
  - Transport Layer Security (TLS) configuration data, if required.
  - A metrics section, which defines the types of metrics the receiving service is requesting.
  The following is an example of a metrics-collector config:
    receivers:
    - encoding: json
      metrics:
      - <relevant_metrics>
      - <relevant_metrics>
      serviceName: <service_name_or_ip>
      port: <receiver_listener_port>
    writeLocation:
      name: mc_configmap
      namespace: contrail
      key: config.yaml
- CentralCollector controller: This controller reconciles and monitors the central collector CR. The CentralCollector controller manages the life cycle management (LCM) of the central-collector. Additionally, it creates a MetricsConfig CR, which in turn configures the metrics-collector. The CentralCollector controller also creates the clusterIP service for the central-collector. Unlike 23.1, you do not need to create a central-collector Deployment or configMap. Create and apply the CR, and the CentralCollector controller manages all configuration, deployment, and LCM functions. The following is an example CentralCollector CR:
    apiVersion: collectors.juniper.net/v1
    kind: CentralCollector
    metadata:
      name: central-collector
      namespace: contrail
    spec:
      common:
        containers:
        - image: <image-repository>/central-collector:<TAG>
          name: central-collector
      metricsCollectorConfigmapLoc:
        name: "metrics-collector-configmap"
        namespace: contrail
Multi-cluster Pod Scheduling Components
Aside from the CentralCollector controller and the MetricsConfig controller, the following components comprise CN2's network-aware pod scheduling solution for multi-cluster deployments:
- Metrics-collector: This component runs in a container alongside the vRouter agent, in the vRouter pod that runs on each node in the central cluster and distributed clusters. The metrics-collector forwards requested data to the sinks specified in its configuration. The central collector is one of the configured sinks and receives this data from the metrics collector. This release adds a field to the metrics collector config file that designates the cluster name. The field identifies the cluster that the metrics collector collects data from, and the metrics collector sends this name to the receiver as metadata.
- Central-collector: This component acts as an aggregator and stores data received from all of the nodes in a cluster via the metrics collector. The central collector exposes gRPC endpoints, which consumers use to request this data for nodes in a cluster.
- Contrail-scheduler: This custom scheduler introduces the following three custom plugins:
  - VMICapacity plugin (available from release 22.4 onwards): Implements the Filter, Score, and NormalizeScore extension points in the scheduler framework. The contrail-scheduler uses these extension points to determine the best node to assign a pod to, based on active VMIs.
  - FlowsCapacity plugin: Determines the best node to schedule a pod on, based on the number of active flows in a node. Too many traffic flows on a node means more competition for new pod traffic. Pods and nodes with a lower flow count are ranked higher by the scheduler.
  - BandwidthUsage plugin: Determines the best node to assign a pod to, based on the bandwidth usage of a node. The node with the least bandwidth usage (ingress and egress traffic) per second is ranked highest.
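In Kubernetes, a pod selects a non-default scheduler through the standard spec.schedulerName field. The following is a minimal sketch of a workload that asks to be placed by the custom scheduler. It assumes a scheduler profile named contrail-scheduler, as in the example scheduler configuration later in this article; the pod name, namespace, container name, and image are hypothetical placeholders.

```yaml
# Hypothetical pod manifest: name, namespace, and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: example-workload
  namespace: default
spec:
  # Standard Kubernetes field that selects which scheduler places this pod.
  # "contrail-scheduler" is assumed to match a schedulerName profile in the
  # scheduler configuration. A pod without this field uses the default scheduler.
  schedulerName: contrail-scheduler
  containers:
  - name: app
    image: <image-repository>/<application-image>:<TAG>
```

To score on a single metric only, the same field could instead name one of the single-plugin profiles from the example configuration, such as vmi-scheduler, flows-scheduler, or bandwidth-scheduler.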
Metrics Collector Deployment
CN2 includes the metrics collector in vRouter pod deployments by default. The agent: default: field of the vRouter spec contains a collectors: field, which is configured with the metrics collector receiver address. The example below shows the value collectors: - localhost:6700. Because the metrics collector runs in the same pod as the vRouter agent, the two can communicate over localhost. Note that port 6700 is fixed as the metrics collector receiver port and cannot be changed. The vRouter agent sends metrics data to this address.
The following is a section of a default vRouter deployment with the collector enabled:
apiVersion: dataplane.juniper.net/v1
kind: Vrouter
metadata:
  name: contrail-vrouter-nodes
  namespace: contrail
spec:
  agent:
    default:
      collectors:
      - localhost:6700
      xmppAuthEnable: true
    sandesh:
      introspectSslEnable: true
Create Permissions
After you configure the vRouter to send metrics to the metrics-collector over port 6700, you must apply a role-based access control (RBAC) manifest. Applying this manifest creates the required permissions for the contrail-telemetry-controller. The contrail-telemetry-controller reconciles the MetricsConfig CR and creates the configMap for the metrics-collector.
The following is an example RBAC manifest. Note that this manifest also creates the namespace (contrail-analytics) for the metrics-collector.
apiVersion: v1
kind: Namespace
metadata:
  name: contrail-analytics
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: contrail-telemetry-controller
  namespace: contrail-analytics
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  creationTimestamp: null
  name: contrail-telemetry-controller-role
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - ""
  resources:
  - configmaps/status
  verbs:
  - get
  - update
  - patch
- apiGroups:
  - ""
  resources:
  - namespaces
  - pods
  - pods/status
  verbs:
  - get
  - list
  - watch
  - patch
- apiGroups:
  - apps
  resources:
  - deployments
  verbs:
  - get
  - update
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - secrets
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - telemetry.juniper.net
  resources:
  - metricsconfigs
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - telemetry.juniper.net
  resources:
  - metricsconfigs/status
  verbs:
  - get
  - patch
  - update
- apiGroups:
  - configplane.juniper.net
  resources:
  - kubemanagers
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - configplane.juniper.net
  resources:
  - kubemanagers/status
  verbs:
  - get
  - patch
  - update
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: contrail-telemetry-controller-role-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: contrail-telemetry-controller-role
subjects:
- kind: ServiceAccount
  name: contrail-telemetry-controller
  namespace: contrail-analytics
After applying the RBAC manifest, you must create a Deployment for the contrail-telemetry-controller. The following is an example Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: contrail-telemetry-controller
  namespace: contrail-analytics
  labels:
    app: contrail-telemetry-controller
spec:
  replicas: 1
  selector:
    matchLabels:
      app: contrail-telemetry-controller
  template:
    metadata:
      labels:
        app: contrail-telemetry-controller
    spec:
      serviceAccount: contrail-telemetry-controller
      securityContext:
        fsGroup: 2000
        runAsGroup: 3000
        runAsNonRoot: true
        runAsUser: 1000
      containers:
      - name: contrail-scheduler
        image: <image-repository>/contrail-telemetry-controller:<TAG>
        command:
        - /contrail-telemetry-controller
Central Collector Deployment
Apply the CentralCollector CR. Once applied, the CentralCollector controller creates all of the necessary objects for the central-collector. The following is an example CentralCollector CR.
apiVersion: collectors.juniper.net/v1
kind: CentralCollector
metadata:
  name: central-collector
  namespace: contrail
spec:
  common:
    containers:
    - image: <image-repository>/central-collector:<TAG>
      name: central-collector
  metricsCollectorConfigmapLoc:
    name: "metrics-collector-configmap"
    namespace: contrail
Create Configmaps
Perform the following steps to create the configMaps and the other objects required by the multi-cluster pod scheduling components. Perform these steps in each of the clusters of your multi-cluster environment.
- Create a configMap called cluster-details: Applying the CentralCollector CR in the previous section also creates a configMap called cluster-details. The CR creates this configMap in the same namespace as the CR. You must replicate this configMap in the namespace where you intend to deploy contrail-scheduler. The configMap includes the following information:
  - Central-collector's service clusterIP
  - Metrics gRPC service port: Used by the contrail-scheduler to retrieve and process network metrics and schedule pods accordingly.
  - Name of the cluster where the configMap is located: Used to identify the cluster where contrail-scheduler is running.
  The following is an example cluster-details configMap:
    clustername: <name_of_the_cluster>
    centralcollectoraddress: <ip:port>
- Create a vmi-config configMap: This configMap defines the maximum VMI count allowed on DPDK nodes. The following is an example configMap:
    nodeLabels:
      "agent-mode": "dpdk"
    maxVMICount: 64
- Create a configMap for the contrail-scheduler: This configMap defines the contrail-scheduler configuration. The following is an example configMap with VMICapacity, FlowsCapacity, and BandwidthUsage plugin information:
    apiVersion: kubescheduler.config.k8s.io/v1
    clientConnection:
      acceptContentTypes: ""
      burst: 100
      contentType: application/vnd.kubernetes.protobuf
      kubeconfig: /tmp/config/kubeconfig
      qps: 50
    enableContentionProfiling: true
    enableProfiling: true
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: false
    profiles:
    - schedulerName: no-plugins-scheduler
    - schedulerName: vmi-scheduler
      pluginConfig:
      - args:
          apiVersion: kubescheduler.config.k8s.io/v1
          kind: VMICapacityArgs
          config: /tmp/vmi/config.yaml
          clusterConfig: /tmp/cluster/config.yaml
        name: VMICapacity
      plugins:
        multiPoint:
          enabled:
          - name: VMICapacity
    - schedulerName: flows-scheduler
      pluginConfig:
      - args:
          apiVersion: kubescheduler.config.k8s.io/v1
          kind: FlowsCapacityArgs
          clusterConfig: /tmp/cluster/config.yaml
        name: FlowsCapacity
      plugins:
        multiPoint:
          enabled:
          - name: FlowsCapacity
    - schedulerName: bandwidth-scheduler
      pluginConfig:
      - args:
          apiVersion: kubescheduler.config.k8s.io/v1
          kind: BandwidthUsageArgs
          clusterConfig: /tmp/cluster/config.yaml
        name: BandwidthUsage
      plugins:
        multiPoint:
          enabled:
          - name: BandwidthUsage
    - schedulerName: contrail-scheduler
      pluginConfig:
      - args:
          apiVersion: kubescheduler.config.k8s.io/v1
          kind: VMICapacityArgs
          config: /tmp/vmi/config.yaml
          clusterConfig: /tmp/cluster/config.yaml
        name: VMICapacity
      - args:
          apiVersion: kubescheduler.config.k8s.io/v1
          kind: FlowsCapacityArgs
          clusterConfig: /tmp/cluster/config.yaml
        name: FlowsCapacity
      - args:
          apiVersion: kubescheduler.config.k8s.io/v1
          kind: BandwidthUsageArgs
          clusterConfig: /tmp/cluster/config.yaml
        name: BandwidthUsage
      plugins:
        multiPoint:
          enabled:
          - name: VMICapacity
            weight: 50
          - name: FlowsCapacity
            weight: 1
          - name: BandwidthUsage
            weight: 20
- Create a ServiceAccount object (required) and configure the ClusterRoles for the ServiceAccount. A ServiceAccount assigns a role to a pod or component within a cluster. In the example below, the ServiceAccount is granted the same permissions as the default Kubernetes scheduler (kube-scheduler). The following is an example ServiceAccount:
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: contrail-scheduler
      namespace: contrail-scheduler
  The following example binds the ServiceAccount to the required ClusterRoles and Role:
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: contrail-scheduler
    subjects:
    - kind: ServiceAccount
      name: contrail-scheduler
      namespace: contrail-scheduler
    roleRef:
      kind: ClusterRole
      name: system:kube-scheduler
      apiGroup: rbac.authorization.k8s.io
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: contrail-scheduler-as-volume-scheduler
    subjects:
    - kind: ServiceAccount
      name: contrail-scheduler
      namespace: contrail-scheduler
    roleRef:
      kind: ClusterRole
      name: system:volume-scheduler
      apiGroup: rbac.authorization.k8s.io
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: contrail-scheduler-extension-apiserver-authentication-reader
      namespace: contrail-scheduler
    roleRef:
      kind: Role
      name: extension-apiserver-authentication-reader
      apiGroup: rbac.authorization.k8s.io
    subjects:
    - kind: ServiceAccount
      name: contrail-scheduler
      namespace: contrail-scheduler
- Create a kubeconfig Secret. Mount the kubeconfig file within the scheduler container when you apply the contrail-scheduler Deployment.
    kubectl create secret generic kubeconfig -n contrail-scheduler --from-file=kubeconfig=<path-to-kubeconfig-file>
- Create a Deployment for the contrail-scheduler. The following is an example Deployment:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: contrail-scheduler
      namespace: contrail-scheduler
      labels:
        app: scheduler
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: scheduler
      template:
        metadata:
          labels:
            app: scheduler
        spec:
          serviceAccountName: contrail-scheduler
          securityContext:
            fsGroup: 2000
            runAsGroup: 3000
            runAsNonRoot: true
            runAsUser: 1000
          containers:
          - name: contrail-scheduler
            image: <image-repository>/contrail-scheduler:<TAG>
            command:
            - /contrail-scheduler
            - --authentication-kubeconfig=/tmp/config/kubeconfig
            - --authorization-kubeconfig=/tmp/config/kubeconfig
            - --config=/tmp/scheduler/scheduler-config
            - --secure-port=10271
            imagePullPolicy: Always
            livenessProbe:
              failureThreshold: 8
              httpGet:
                path: /healthz
                port: 10271
                scheme: HTTPS
              initialDelaySeconds: 30
              periodSeconds: 10
              timeoutSeconds: 30
            resources:
              requests:
                cpu: 100m
            startupProbe:
              failureThreshold: 24
              httpGet:
                path: /healthz
                port: 10271
                scheme: HTTPS
              initialDelaySeconds: 30
              periodSeconds: 10
              timeoutSeconds: 30
            volumeMounts:
            - mountPath: /tmp/config
              name: kubeconfig
              readOnly: true
            - mountPath: /tmp/scheduler
              name: scheduler-config
              readOnly: true
            - mountPath: /tmp/vmi
              name: vmi-config
              readOnly: true
            - mountPath: /tmp/cluster
              name: cluster-details
              readOnly: true
          hostPID: false
          volumes:
          - name: kubeconfig
            secret:
              secretName: kubeconfig
          - name: scheduler-config
            configMap:
              name: scheduler-config
          - name: vmi-config
            configMap:
              name: vmi-config
          - name: cluster-details
            configMap:
              name: cluster-details
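The example Deployment mounts the vmi-config and cluster-details configMaps at /tmp/vmi and /tmp/cluster, and the example scheduler configuration reads /tmp/vmi/config.yaml and /tmp/cluster/config.yaml. One way to satisfy both is to store each configuration under a config.yaml key. The following is a minimal sketch under that assumption. The contrail-scheduler namespace is taken from the example Deployment, the placeholder values are the same as in the earlier steps, and the key used by the controller-generated cluster-details configMap is not confirmed here, so check it before replicating.

```yaml
# Sketch only: assumes each file is stored under the key "config.yaml" so that
# it appears as <mountPath>/config.yaml inside the scheduler container.
apiVersion: v1
kind: ConfigMap
metadata:
  name: vmi-config
  namespace: contrail-scheduler
data:
  config.yaml: |
    nodeLabels:
      "agent-mode": "dpdk"
    maxVMICount: 64
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-details
  namespace: contrail-scheduler
data:
  config.yaml: |
    clustername: <name_of_the_cluster>
    centralcollectoraddress: <ip:port>
```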
Note: When creating Secrets or configMaps, ensure that the key used during creation matches the key used when mounting them in the Deployment. For example, in the Deployment above, the path /tmp/scheduler/scheduler-config is provided in the command section, and /tmp/scheduler is provided in the volume mount section. In this case, the key is "scheduler-config". If the keys do not match, you must explicitly specify a custom key in the volume mounts for custom file names. If no file name is given explicitly, use "config.yaml" as the key name.
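When a configMap key does not match the file name that the scheduler expects, the standard Kubernetes items list on a configMap volume can map the key to a custom file name. The following is a minimal sketch. It assumes, for illustration only, that the scheduler-config configMap stores its content under the key config.yaml while the Deployment still passes --config=/tmp/scheduler/scheduler-config.

```yaml
# Fragment of the Deployment's pod spec (spec.template.spec); not a complete manifest.
volumes:
- name: scheduler-config
  configMap:
    name: scheduler-config
    items:
    # key: entry name inside the configMap (assumed here to be config.yaml)
    - key: config.yaml
      # path: file name created under the volume's mountPath (/tmp/scheduler)
      path: scheduler-config
```

With this mapping, the file appears in the container as /tmp/scheduler/scheduler-config, which matches the --config flag. Note that when items is set, only the listed keys are projected into the volume.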