[MUSIC] Okay, hello everyone, welcome to this session. Today I'd like to share some experience and insight on how to run artificial intelligence and machine learning applications on Kubernetes in a cloud native way. My name is Kai Zhang, and I'm working as a staff engineer on the Alibaba Cloud Container Service team. In the past several years, my team has worked on container-based AI solutions and customer support. We have released a bunch of AI, machine learning, and heterogeneous computing related products and solutions on Alibaba Cloud, and we have also contributed a lot of projects to the open source communities. I will introduce a lot of the details in this session.

Okay, now let's look at today's topics; I will cover several topics here. Firstly, I will talk about the challenges of running large-scale AI and machine learning jobs, especially on the cloud. Secondly, I will talk about how we built a container-based solution on Alibaba Cloud to help clients address those challenges. Thirdly, I will go deeper into the key features and use cases of our solution. Finally, I will wrap up with some helpful best practices and solution materials, which I will share with the audience as a reference for your follow-up actions or our future talks.

Okay, so let's start with the first part: how do data scientists run their AI and machine learning work day by day? Basically, they spend a lot of time preparing data. They do some coding to build their models, using statistical methods or neural networks, and sometimes they distribute jobs across a cluster to do real training with much larger datasets and parameter combinations, repeating the training iteratively. They also need to care about how to distribute jobs onto multiple machines to use more compute resources in parallel and accelerate their training jobs, right? Even so, a normal training job still takes hours, days, or even weeks to run; think about BERT. Once you get a satisfactory model, it will be deployed online to provide a specific inference service and support the business logic. The online service may collect new data, which will be used for a new training iteration to continuously improve the AI model's performance.

From this workflow, we found several challenges in running AI and machine learning workloads, especially at large scale. The first one is highly efficient heterogeneous resource management: we need a heterogeneous resource management system, or capability, to manage all those different types of resources in a unified way. The second challenge we found is end-to-end support for AI and machine learning experiments: all the experiment history, records, and assets should be persisted so that others can analyze and reproduce them in the future, right? And the third challenge is scale. It is really challenging to continuously train or serve model services at large scale. We need to make sure that jobs can scale out on demand while still getting an optimized cost, because, you know, GPUs are always expensive, so cost is always a consideration we need to care about, right? It's not just the performance but also the cost.

So those are the three top challenges when we talk about running AI and machine learning at scale. Then how do we accelerate AI and machine learning? We summarize it from three perspectives. The first one is composability for all kinds of jobs.
Data scientists need the flexibility to compose a specific workflow for different models or targets by selecting different steps, and later they also need to adjust some steps to continuously optimize the training pipeline. So an AI or machine learning system needs to automate these steps as much as possible to relieve the data scientists and let them focus on their real algorithms and data processing.

Secondly, we think scalability is another way to accelerate AI and machine learning. The system should easily scale a model training job out from maybe one node to hundreds or even thousands of nodes, without any code changes.

And thirdly is portability. There are many types of devices that can accelerate AI and machine learning jobs, and the system needs to support these diverse accelerators with a unified resource abstraction and a programmable interface. There are also so many computation frameworks, runtimes, libraries, and dependencies used by different models; think about TensorFlow, PyTorch, Spark, Flink, and even classic MPI. All of those frameworks can be used to train different types of models at different scales, and the system needs to support all of them in the same manner. We call it an immutable environment for AI and machine learning execution, and it can be provided either on premises in your data center or on the public cloud. So those are the three ways we consider to accelerate AI and machine learning jobs: composability, scalability, and portability.

Okay, with that, let's see how we address those challenges and provide those acceleration methods on Alibaba Cloud, in a cloud native way. Alibaba Cloud provides a cloud native AI and machine learning solution built on container and Kubernetes technologies. Basically, our solution helps to fix problems on two layers. On the lower layer, we provide unified resource management and scheduling capabilities to manage all types of heterogeneous infrastructure services. On the upper layer, heterogeneous types of applications are all containerized and managed by a single Kubernetes system. A data scientist just needs to choose which framework to use to train his job and how many GPUs to use, but he doesn't need to care about how to make that happen on the machines. All the detailed complexity across devices and different frameworks is completely hidden by our solution.

Let's see how our solution is implemented. Before that, let me give a very quick introduction to our container service product. Our cloud native AI solution is built on top of our Kubernetes service, which is called Alibaba Cloud Container Service for Kubernetes, or ACK in short. I will not go into more details of ACK itself today; you just need to know that ACK is a product that helps users operate Kubernetes clusters and leverage Alibaba Cloud's infrastructure-as-a-service capabilities on the public cloud, in private clouds, or in some edge computing environments.

So here is a reference architecture of our cloud native AI and machine learning solution on ACK. The underlying infrastructure layer provides all types of computation, storage, and high-bandwidth network capabilities. In the middle, we build different services for the different stages of the AI and machine learning job lifecycle. For development, Jupyter notebooks with Git and TensorBoard integration are used for model coding and debugging. For training, the popular open source computation and deep learning frameworks are supported, including TensorFlow, Caffe, MXNet, PyTorch, Horovod, and Spark.
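To make this concrete, here is a minimal sketch of how one of these containerized frameworks can request a GPU on plain Kubernetes. This is a generic illustration, assuming the NVIDIA device plugin is installed in the cluster; the image and training script are placeholders, not anything specific to our product.

```bash
# Minimal sketch: run a containerized training script on one whole GPU.
# Assumes the NVIDIA device plugin exposes the nvidia.com/gpu resource;
# the image and command are placeholders.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: tf-train-demo
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: tensorflow/tensorflow:2.4.1-gpu   # any GPU-enabled framework image works the same way
    command: ["python", "/workspace/train.py"]
    resources:
      limits:
        nvidia.com/gpu: 1                    # extended resource advertised by the device plugin
EOF
```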
Supporting these frameworks is not only about packaging their libraries into container images, but also about how to deploy a training task to a single machine and then extend it to a cluster. For inference, TensorFlow Serving, TensorRT Inference Server, and some other community serving offerings are supported. Meanwhile, we use a service mesh to control the inference services; it helps with model A/B testing and canary releases, routing, traffic control, and hybrid cloud connectivity with different types of policies. For operations, load balancing, autoscaling, and monitoring are pre-built in the system, and Hadoop and Spark services are also integrated for data preparation, feature engineering, and so on. We create an orchestration layer and tooling on top of all those underlying services to hide all the complexity of this reference architecture. As a result, an end user only needs to use one single command line tool or an SDK to do all his work, with no need to care about how to set up and configure all those services in the architecture.

Okay, how do you build an AI or machine learning application with GPUs on ACK? Actually, it's super easy; you just need four steps to cover most of the scenarios. The first one is for the cluster administrator: he needs to create an ACK cluster and add some GPU machines to the cluster. Then the administrator can choose to install our AI solution add-ons into the ACK cluster. For example, he can install the GPU sharing feature, so that later on his GPU devices can be shared dynamically by multiple models with different memory resource requests at the same time. Thirdly, the algorithm developer just needs to use our AI tooling to submit his model training job to the cluster and simply tell ACK how many resources, or how many GPUs, he wants to use for that model training. And after the code or model changes, the operator or the ML engineer can decide when to wrap the model into a package, with version control, then deploy it online as an inference service and connect and integrate it with the other online services.

Okay, so far I have introduced our cloud native AI and machine learning solution on Alibaba Cloud. Now let's go through more details of the key features and use cases of our solution, to see what exactly it can do for data scientists and algorithm developers. I summarize the top six feature categories of the ACK cloud native AI solution. The first one is for the end users: productivity tools. The solution provides tools to simplify AI and machine learning job lifecycle management. Secondly, for backend cluster resource utilization, it provides an optimized scheduler to match AI jobs with cluster nodes properly, while considering performance, cluster resource utilization, and cost. Thirdly, for the cluster administrators, the solution provides unified resource abstraction and management to make the physical differences transparent to the platform users. The fourth category is for the developers: it provides a distributed cache service to accelerate the data loading of training jobs. The cache is a general-purpose service, which can be commonly used for deep learning training, machine learning jobs, or even big data analytics jobs. Again for the developers, the solution also provides autoscaling capabilities, which are super important for accelerating the whole AI or machine learning workload.
Finally, for operations and maintenance, the solution provides built-in monitoring and problem determination systems to observe the real-time status of your GPUs and other accelerators, as well as the system's health and other status, and it can easily integrate with your alerting system and even your team chat service.

Okay, firstly, let's look at AI and machine learning job lifecycle management. In the community, Kubeflow is the most popular project for building portable machine learning solutions in a cloud native way, and in ACK we have already integrated with the Kubeflow project. My team developed an AI job lifecycle management tool called Arena and contributed it to the Kubeflow community; currently, it's a subproject of Kubeflow. It's a command line tool, and it also provides SDK support. Arena is open source and is used by many customers, who wrap it and build it into their own machine learning and AI platforms.

Let's quickly go through some very important Arena commands. The most used Arena command is arena submit, which helps to start a training job. The user doesn't need to care about how this job is created in the cluster. He just needs to give parameters to specify where the model code is located, where the data is, where to output the results, and how many workers or GPUs he wants to use for this job's execution. And of course, he can specify which framework he wants to use to run his code. This command will trigger many Kubernetes-internal workflows; however, the user does not need to understand Kubernetes at all. The user can also list all jobs in the cluster, get details of each job, and check the real-time job logs, and the GPU device usage and monitoring status can also be queried from the Arena command line in real time. If you are training a job with TensorFlow, we will integrate TensorBoard automatically for you, so you can observe and validate your training process in real time. arena top job and arena top node will help you understand your GPU usage in real time from different angles, either from the job perspective or from the node perspective. On the left side, I listed some more Arena commands; if you are interested, just go to GitHub and look into them.

Now let me turn to resource management. ACK fully supports GPU setup, scheduling, monitoring, and management. Basically, the developer or administrator just needs to let the ACK control plane know which type of GPU and how many GPU nodes he wants, and ACK will set up all the GPU nodes and their environments automatically. As an enhancement, ACK not only supports GPU device-level scheduling, but can also schedule multiple job containers onto one single GPU device and let them share that GPU device's memory. It's actually the first GPU sharing solution for Kubernetes in the industry, and it has helped many customers save at least 15% of GPU cost for model inference services. Again, we open sourced our GPU sharing scheduler on GitHub as well, and some new features are coming; for example, we will add GPU memory isolation to the GPU sharing scheduler to guarantee that jobs sharing the same GPU will not disturb each other. The vGPU, on the other hand, is a virtualized GPU provided by Alibaba Cloud and NVIDIA, based on hypervisor virtualization technology; actually, you can consider it another kind of GPU sharing solution.
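Before comparing the two approaches further, here is a minimal sketch of what scheduler-level GPU sharing looks like from the user's side. The resource name follows our open-source gpushare scheduler extender project, but the exact name can differ by version, and the memory amount, image, and object names are placeholders.

```bash
# Minimal sketch: two inference replicas sharing physical GPUs by requesting
# GPU memory (in GiB) instead of whole devices. Assumes the gpushare scheduler
# extender and its device plugin are installed; image and names are placeholders.
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: small-inference
spec:
  replicas: 2
  selector:
    matchLabels: {app: small-inference}
  template:
    metadata:
      labels: {app: small-inference}
    spec:
      containers:
      - name: server
        image: registry.example.com/resnet-serving:latest   # placeholder image
        resources:
          limits:
            aliyun.com/gpu-mem: 4   # request 4 GiB of GPU memory, not a whole card
EOF
```

Replicas like these can land on the same physical GPU as long as its free memory is sufficient, which is how small inference workloads end up packed onto fewer cards.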
The vGPU approach, however, requires real virtualization technology, which means it provides better isolation and security, but you pay more overhead. ACK can support virtualized GPU devices as well: as long as the user adds virtual GPUs into an ACK cluster, ACK can keep scheduling jobs onto a virtualized GPU as long as it has free resources to allocate to more jobs.

Okay, I think a lot of people already know that Alibaba has its own AI chip, named Hanguang 800. Hanguang 800 is one of the fastest chips in the world for AI model inference. ACK provides first-class citizen support for Hanguang 800: it can automatically set up, monitor, scale, and manage Hanguang 800 devices in a cluster. One Hanguang 800 device can be divided into multiple cores, so a job can run on one or multiple cores. In ACK, we also support scheduling multiple AI models to share multiple Hanguang 800 cores, no matter whether they are located on one single device or across many devices. So in ACK you can share a Hanguang device, or a Hanguang core, among multiple jobs, just like what we do for GPUs. In that way, we maximize this super powerful AI chip's capacity.

Okay, there are more heterogeneous accelerators, right? There are a lot of ASIC chips, and FPGA is the most popular option in that category. With a similar architecture and user experience, ACK can support FPGA device operation and scheduling in a similar way. And for high-bandwidth networking, RDMA is very important for large-scale distributed AI and HPC (high performance computing) jobs. In ACK, we also support treating the RDMA network card as a resource, which can be scheduled and requested just like a generic computation resource such as CPU or GPU. And together with NVIDIA GPUs, users can enable NCCL, NVIDIA's high-performance collective communication library, to program over GPU and RDMA together and dramatically accelerate distributed training jobs.

Okay. Basically, as I mentioned, ACK has a built-in monitoring and problem determination solution for GPUs. Users can monitor their GPUs from both the node view and the application view. The supported metrics include GPU duty cycle, memory usage, device temperature, and some other important metrics, and you can add your own customized metrics as well. Since the GPU resources and AI jobs are monitored, based on those metrics ACK provides autoscaling capabilities with several scaling policies you can choose from. You can scale AI and machine learning workloads out or in on two levels: the application instance level and the GPU node level. If your job's performance metric crosses a specified threshold, it will automatically scale out more job instances, and if there is not enough GPU resource to run these new instances, it will automatically add more GPU nodes into the cluster. And when there are not many jobs to run, it will scale in some nodes, so that performance and cost keep the best balance while the jobs still run with high performance and efficiency. Actually, there are many autoscaling patterns and policies users can choose from; if you are interested in the autoscaling topic, feel free to contact me to discuss which policy is best for your applications and clusters. Okay?
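To give a flavor of the instance-level scaling I just described, here is a minimal sketch of a Horizontal Pod Autoscaler driven by a custom per-pod GPU metric. It assumes a custom-metrics adapter already exposes a metric that I'll hypothetically call gpu_duty_cycle, and it reuses the placeholder small-inference deployment from the earlier sketch; the replica bounds and target value are illustrative only, and older clusters would use the autoscaling/v2beta2 API version instead.

```bash
# Minimal sketch: scale an inference deployment between 2 and 10 replicas based
# on a custom per-pod GPU metric. Assumes a custom-metrics adapter (for example,
# a Prometheus adapter) exposes `gpu_duty_cycle`; the name and target are placeholders.
cat <<'EOF' | kubectl apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: small-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: small-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_duty_cycle
      target:
        type: AverageValue
        averageValue: "80"       # aim for roughly 80% duty cycle per replica
EOF
```

When the scaled-out replicas cannot be placed because the GPUs run out, the cluster autoscaler can then add GPU nodes, which is the node-level scaling mentioned above.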
Okay, scheduling. Scheduling is one of the most important sets of features and capabilities in ACK for supporting AI, machine learning, and other batch computing jobs. Now let's look at how those jobs are scheduled to execute in the cluster. We consider scheduling from two perspectives. The first is the cluster GPU resource perspective. There are several resource allocation patterns in ACK. The first one we call gang scheduling. The idea of gang scheduling can be simply explained as "all or nothing": basically, we allocate resources to a job only when the requirements of all of its subtasks can be satisfied together; otherwise, no subtask is started and no resource is occupied by any task, which helps to avoid wasting the cluster's GPU resources on partially started jobs.

Secondly, there is topology-aware scheduling. ACK can allocate multiple GPU devices to a job together with the highest P2P connection bandwidth; NUMA topology, NVLink connections, and even RDMA connections are considered in the scheduling process. To be honest, topology-aware scheduling is a very complicated scheduling policy; however, for some specific scenarios, especially distributed GPU training, it will give the best performance optimization for the job. Then there are the spread and binpack patterns, two different options for deciding whether to allocate a job's subtasks to one node or to many. Spread means it will deploy the subtasks across many nodes, which helps to improve the job's availability, right? Binpack, on the other hand, will increase performance, since it avoids the remote network communication cost between the subtasks of one job. Okay, and if all those automated scheduling patterns are still not satisfactory, the user can specify exactly which nodes run his job; we call it the bundle pattern. That's all from the cluster GPU resource perspective.

On the other hand, from the AI or machine learning job perspective, different scheduling policies are supported. Actually, ACK provides batch job scheduling capabilities similar to Hadoop YARN, so users can manage both web applications and AI, machine learning, or big data batch jobs on one single platform, in one single cluster. I suppose people are already familiar with the Hadoop YARN scheduler, so the classic scheduling policies are supported. The first one is FIFO, meaning first in, first out: the jobs run in time sequence, so job submission time takes precedence. The second one is capacity scheduling: different jobs can borrow and return resources from one another, so job performance is maximized and cluster resource utilization takes precedence. The third policy is fair-share scheduling, so all jobs run with a fair chance; on average, every job gets a fair share of the cluster resources. All those scheduling policies are already supported in ACK's scheduler, and we are trying to contribute some of them back to the upstream Kubernetes community.

As I mentioned, in most GPU training jobs remote data loading is always an issue for the overall performance. We leverage a distributed data cache service to accelerate data loading for distributed data processing jobs, especially for training and machine learning jobs running on top of GPUs. This cache service provides a unified FUSE interface, so you can use the POSIX API, and it supports different storage backends, including object storage (OSS), HDFS, and the NAS service on Alibaba Cloud.
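The cache service's own configuration objects are product specific, so I won't reproduce them here; from the training job's side, the cached dataset is typically just consumed as a Kubernetes volume. Here is a minimal sketch, assuming the cache has been exposed through a hypothetical PersistentVolumeClaim named imagenet-cache; all names and paths are placeholders.

```bash
# Minimal sketch: a training pod reading its dataset through a PVC that is
# backed by the distributed cache service. PVC name, image, and paths are placeholders.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: resnet-train
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: tensorflow/tensorflow:2.4.1-gpu
    command: ["python", "/workspace/train.py", "--data_dir=/data/imagenet"]
    resources:
      limits:
        nvidia.com/gpu: 1
    volumeMounts:
    - name: dataset
      mountPath: /data/imagenet
      readOnly: true
  volumes:
  - name: dataset
    persistentVolumeClaim:
      claimName: imagenet-cache   # hypothetical claim provisioned by the cache service
EOF
```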
The cache service itself can be scaled out, and it can utilize RAM, SSD, or even HDD, or a mixture of them, as different cache layers. Here is a quick sample of how it performs: when we run the TensorFlow ResNet-50 image classification benchmark with this distributed cache, we can get over 14% performance improvement, and the acceleration keeps growing nearly linearly while extending from one GPU node to at least thirty-two GPU devices.

Okay, we have supported Flink on ACK for quite a while. Actually, Alibaba's enterprise-level Flink product, called Blink, is already running on ACK on the public cloud, and Spark and other engines are also running on it already. I don't think I will spend much time going through this part, but with the underlying features I already mentioned, resource management, batch job scheduling, and job lifecycle management, users can run Spark, Flink, or other big data workloads on ACK just like running other types of applications.

Okay, let me quickly share some user cases of ACK's cloud native AI and machine learning solution. The first one is Weibo, which I think most people know pretty well, right? It is the largest microblog application and social network service provider in China. We built its machine learning platform on top of our solution, including ACK itself, Kubeflow, Arena, and our scheduling. The platform can process real-time data samples and train models with over ten million features in just one day, and the number is still growing; the platform is built on top of over 400 GPU nodes. The second user case is about what we do to accelerate distributed AI model training, together with Alibaba's optimized framework and ACK's support for GPU scheduling. For distributed training of an image classification model, we can get over 90% speedup scaling up to 64 GPU devices, which is actually 45% better than native TensorFlow.

Now let me wrap up today's content and share some more useful references for you to get a quick start. To summarize in one page: ACK provides a solution that helps AI developers and service providers build their own elastic heterogeneous computing, AI, and machine learning platforms. With ACK's solution, users can manage clusters of CPUs, GPUs, NPUs, RDMA, and other accelerators with a few simple clicks, and all those resources are well monitored and can be scaled out or in automatically based on workload changes and policies. The solution integrates all types of Alibaba Cloud infrastructure-as-a-service layer services and exposes them to developers with unified concepts and APIs, so developers can use the Arena tooling to submit and control their AI jobs' lifecycle. In the backend, the ACK scheduler keeps dynamically allocating the proper GPU resources to those jobs, with consideration for performance, resource efficiency, and cost. And with the data cache service, most training jobs on GPUs can be accelerated in a general way. In that way, developers and data scientists can run their TensorFlow, PyTorch, Spark, Flink, or whatever data computation jobs on Kubernetes in a cloud native way, very easily and with high efficiency.

Okay, so here are the best practices documentation and quick-start assets I mentioned before for our cloud native AI and machine learning solution on ACK.
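And as a quick recap of the Arena workflow from earlier, a typical end-to-end session on the command line might look like the sketch below. The job name, image, repository, and flag values are illustrative only, and the exact flags depend on the Arena version you install.

```bash
# Minimal sketch of the Arena job lifecycle; names, paths, and flag values are placeholders.
# Submit a distributed TensorFlow training job with 2 workers, 1 GPU each,
# pulling the code from a Git repository and enabling TensorBoard.
arena submit tf \
  --name=mnist-demo \
  --workers=2 \
  --gpus=1 \
  --image=tensorflow/tensorflow:2.4.1-gpu \
  --syncMode=git \
  --syncSource=https://github.com/example/models.git \
  --tensorboard \
  "python code/models/mnist/train.py"

arena list              # list all jobs in the cluster
arena get mnist-demo    # details and status of one job
arena logs mnist-demo   # check the training logs
arena top job           # GPU usage from the job perspective
arena top node          # GPU usage from the node perspective
arena delete mnist-demo # clean up when the job is done
```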
Please feel free to look into those materials and talk to me if you have any interest in running your AI or machine learning workloads on Kubernetes with containers. So with that, I finish my sharing for today. Thank you for listening, and I am looking forward to your follow-up questions, queries, and interest.

>> Thank you for presenting. I hope you found it useful and learned something new. Feel free to post your questions in the Q&A box if you have any, or you can put your questions in the chat group; our solution architects or the speaker will respond to you shortly. In the meanwhile, as mentioned at the beginning of the event, there is a Clouder certificate assigned for each session. Before we move to the next session, we highly recommend you take a couple of minutes and complete the quiz. You can access the quiz by clicking the banner below the screen. If you pass the quiz, you will be entitled to a Clouder certificate as a token of appreciation. If you get three Clouders across all the sessions today and post them on a social network, you will get access to free ACA DevOps certificate training and the exam. If you get six Clouders across all eight sessions today and post them on a social network, apart from the free ACA certificate training and exam, you'll receive an Alibaba Cloud DevOps t-shirt; this is limited to the first 200 winners. And if you get the full stack of eight Clouders, wow, that's a great achievement, congratulations: in addition to all the benefits mentioned above, you will stand a chance to win a free TOC opportunity worth up to 1,000 US dollars. Terms and conditions apply.

At the beginning of the event, we mentioned that we have moved our activities online, and especially in May we had extensive training. So how do we conduct the training? We actually moved our training sessions, especially the technology sessions, onto DingTalk. DingTalk is a mobile workspace with large-scale video call and networking capabilities, developed by Alibaba Group. So far we have had more than 500 members across customers, partners, and communities join us on DingTalk in a couple of weeks. If you join us on DingTalk, you are able to get free training and exams, including cloud computing and security, on a monthly basis, and you can also attend regular product and solution enablement training on a weekly basis and engage with our solution architects. Last but not least, since we already have more than 500 members, you are able to engage with fellow community members and build your own social network. So how do you join us on DingTalk Lite? Here are the steps listed on the slide; you can take a screenshot or a photo of the slide if you're not able to finish the process now. I look forward to meeting you all on DingTalk Lite. In short, two action points: number one, complete the quiz, and number two, join us on DingTalk Lite. Thank you so much for attending the session, and I will see you in the next session soon.