Thursday, September 28 • 11:50am - 12:25pm
How We Scale up to 2k Nodes for Batch Jobs Using Cluster Autoscaler - Lei Qian, ByteDance

Batch jobs are created and deleted in bulk, and the cloud provides strong elasticity, so batch jobs and the cloud make a perfect match. In the cloud-native world, we can use Kubernetes and Cluster Autoscaler to reduce costs. Unlike microservices, however, batch jobs place higher demands on cluster elasticity, which poses more challenges for Cluster Autoscaler. In our scenario, users create up to 16,000 pods within a short period, and once that batch of tasks completes, the cluster needs to scale down quickly. In this talk, we will share the problems we encountered, and the solutions we found, when using Cluster Autoscaler in bulk creation and deletion scenarios: for example, why the cluster fails to scale up, why pod creation takes so long, and why idle nodes are not deleted promptly. By solving these problems, we were able to scale the cluster to 2,000 nodes in production.

Speakers
Lei Qian (钱磊)

Software Engineer, Volcano Engine
A Kubernetes developer at Volcano Engine, focused on building a stable Kubernetes engine on the public cloud.



Thursday September 28, 2023 11:50am - 12:25pm CST
2F Room 1
  Operations + Performance