For AI enterprises, GPUs and other computing resources are expensive. How can resource utilization be improved to protect the investment in compute? How can resource contention be resolved so that resources are shared fairly? How can waiting times be cut to make model training more efficient? These questions all bear on the pace of R&D and innovation. Here is how the Inspur AIStation AI development platform breaks through the compute bottleneck with a three-part combination of measures that accelerates enterprise AI development.
The AI development problems one enterprise faced
The enterprise had four 8-GPU servers serving 50 developers, a classic case of too many people and too few resources. Specifically, it faced these problems:
With less than one GPU per person, GPU usage had to be coordinated manually and development efficiency was low;
Each group of roughly ten people shared one GPU node, so one group's resources could sit idle while another group had none, creating resource islands;
There was no priority mechanism, so important jobs could not be submitted in time;
During the day almost all GPUs were occupied by development environments, so developers could only submit training jobs at night, and the number of models trained was very limited.
How AIStation solves the enterprise's compute problem in three moves
AIStation is an AI resource platform for enterprise AI development scenarios. Through a three-part combination of resource quotas, GPU sharing, and queued job hosting, it allocates GPU computing resources intelligently, raises resource utilization, and helps users develop more efficiently.
First, AIStation consolidates scattered computing resources into pooled, cluster-level management and applies resource quota policies so that multiple users share resources fairly and evenly.
AIStation divides the developers into 5 user groups of 10 users each and sets usage quotas per group and per user according to business needs, for example 6 GPUs and 40 CPU cores per group. It also limits how long each user's development environment may run and how many jobs a user may submit concurrently.
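As a rough illustration of how such per-group and per-user quotas might be enforced, here is a minimal Python sketch. The group size and the 6-GPU/40-core quota come from the numbers above; the per-user limits, names, and helper function are hypothetical and are not AIStation APIs.

```python
# Minimal sketch of per-group / per-user quota checks (hypothetical, not the
# AIStation API): 5 groups of 10 users, 6 GPUs and 40 CPU cores per group.

GROUP_QUOTA = {"gpus": 6, "cpu_cores": 40}                     # per-group limit (from the text)
USER_LIMITS = {"max_env_hours": 8, "max_concurrent_jobs": 2}   # per-user limits (example values)

class GroupUsage:
    def __init__(self):
        self.gpus = 0
        self.cpu_cores = 0

def can_allocate(group: GroupUsage, req_gpus: int, req_cpus: int) -> bool:
    """Allow a request only if the group stays within its quota."""
    return (group.gpus + req_gpus <= GROUP_QUOTA["gpus"]
            and group.cpu_cores + req_cpus <= GROUP_QUOTA["cpu_cores"])

group = GroupUsage()
print(can_allocate(group, req_gpus=2, req_cpus=16))   # True: within 6 GPUs / 40 cores
group.gpus, group.cpu_cores = 5, 32
print(can_allocate(group, req_gpus=2, req_cpus=16))   # False: would exceed the 6-GPU quota
```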
Second, AIStation's GPU sharing policy lets several users share one GPU without interfering with one another.
AIStation manages the 4 GPU nodes as one cluster. The 16 GPUs on 2 of the nodes are configured as a development resource group used to create development environments; the remaining 16 GPUs form a training resource group used for model training. Under the sharing policy, each GPU in the development resource group is split into 8 shares of 4 GB of GPU memory each, so the original 16 GPUs effectively become 128. A CPU hyper-threading policy expands the number of available CPU cores, allowing all 50 users to create development environments at the same time. Users can also choose the batch size and the amount of GPU memory to use based on their own model.
GPU sharing mode
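From the user's side, the effect of an 8-way share can be approximated by capping a training process to roughly 4 GB of GPU memory. The snippet below is only an illustration using PyTorch's per-process memory fraction on an assumed 32 GB Tesla V100; it is not AIStation's internal sharing mechanism.

```python
# Illustration only: cap this process to ~4 GB so that eight such processes
# can coexist on a 32 GB card. Not AIStation's implementation.
import torch

GPU_TOTAL_GB = 32          # assumed card size (e.g., Tesla V100 32 GB)
SHARE_GB = 4               # one of the 8 shares per card

if torch.cuda.is_available():
    fraction = SHARE_GB / GPU_TOTAL_GB                       # 4 / 32 = 0.125
    torch.cuda.set_per_process_memory_fraction(fraction, device=0)
    # Allocations beyond ~4 GB in this process now raise an out-of-memory
    # error instead of starving the other seven tenants of the card.
```

In practice the user then picks a batch size that fits inside the 4 GB share.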
Finally, AIStation hosts queued jobs and lets users define job priorities, so idle time is fully used for training and queued jobs are scheduled in priority order.
Users can submit several training jobs at once; when resources are insufficient the jobs wait in a queue, and when a running job finishes its resources are released automatically to the next waiting job. Training therefore continues through nights and weekends, extending GPU usage time. Users can also set priorities so that important jobs are trained first.
Job hosting for development users
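A minimal sketch of this kind of priority-based queuing, assuming a single pool of training GPUs and a smaller number meaning higher priority; the job names and helper functions are hypothetical, not AIStation code.

```python
# Sketch of priority-based job queuing: jobs wait in a priority queue and
# start as soon as a GPU is released by a finished job.
import heapq
import itertools

counter = itertools.count()      # tie-breaker keeps FIFO order within a priority
queue = []                       # entries: (priority, seq, job_name); lower = more urgent
free_gpus = 0                    # all training GPUs busy at the moment

def submit(job_name: str, priority: int = 10) -> None:
    heapq.heappush(queue, (priority, next(counter), job_name))

def on_job_finished() -> None:
    """A running job released its GPU; hand it to the highest-priority waiter."""
    global free_gpus
    free_gpus += 1
    if queue and free_gpus > 0:
        _, _, job = heapq.heappop(queue)
        free_gpus -= 1
        print(f"starting {job}")

submit("nightly-resnet50", priority=10)
submit("urgent-release-run", priority=1)   # more important job jumps the queue
on_job_finished()                          # -> starting urgent-release-run
```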
AIStation delivers clear results
GPU usage time up by nearly 60%. Originally each GPU node was dedicated to one user group, and each card was in use for an average of only 14.4 hours per day. By eliminating resource islands through GPU sharing and job hosting, AIStation raises the average to 22.8 hours per card per day.
Cluster GPU usage over one day
GPU utilization up by 50 percentage points. Originally a developer had exclusive use of one GPU during development, when utilization was only about 10%; during training it could reach 90%, and the average utilization per card per day was 30%. With AIStation, 8 developers share one GPU during development and its utilization rises to about 80%; with 90% during training, the average utilization per card per day reaches about 80%.
Comparison of GPU usage
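These daily averages are consistent with a simple time-weighted calculation over a 24-hour day. The hour splits below are hypothetical values chosen only to match the 14.4-hour and 22.8-hour usage figures above; they are not given in the source.

```python
# Time-weighted daily utilization, averaged over 24 hours.
# Hour splits are hypothetical but sum to the stated 14.4 h and 22.8 h per day.
def daily_util(dev_hours, train_hours, dev_util, train_util):
    return (dev_hours * dev_util + train_hours * train_util) / 24

print(daily_util(7.2, 7.2, 0.10, 0.90))    # ~0.30 -> the original 30% per card per day
print(daily_util(13.2, 9.6, 0.80, 0.90))   # ~0.80 -> about 80% with AIStation (13.2 + 9.6 = 22.8 h)
```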
The number of training jobs completed per week more than doubled. Assume each job trains a ResNet-50 model on the ImageNet dataset on one Tesla V100 GPU and takes about 12 hours.
Originally, with less than one card per person, the GPUs were used for development environments during the day and could only train at night, so at most 32 jobs could be completed per working day, or 160 jobs per week.
With AIStation's job queuing, the GPUs are used to the fullest: 368 jobs can be completed per week, 2.3 times as many as before. If a development team needs to train about 50 jobs per project on average, the number of projects completed per month rises from 3 to 7.
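The 160 and 368 weekly totals come from the text. One plausible schedule that reproduces them, assuming 12-hour jobs, is that the 16 training GPUs run around the clock while the 16 development GPUs pick up queued jobs at night on weekdays and all day on weekends; this breakdown is our assumption, not stated in the source.

```python
# One plausible breakdown of the weekly task counts (12 h per task on one GPU).
# The schedule assumptions are ours; only the totals 160 and 368 come from the text.
TASK_HOURS = 12

# Before: all 32 GPUs train only at night on the 5 working days (1 task/card/night).
before = 32 * 1 * 5                                   # 160 tasks/week

# After: 16 dedicated training GPUs run 24/7 (2 tasks/card/day); the 16
# development GPUs join at night on weekdays and all day on weekends.
after = 5 * (16 * 2 + 16 * 1) + 2 * (32 * 2)          # 240 + 128 = 368 tasks/week
print(before, after, round(after / before, 1))        # 160 368 2.3
```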
In short, by managing and scheduling computing resources efficiently, Inspur AIStation delivers large gains over the original setup in GPU usage time, utilization, and the number of training jobs, making the most of the available resources.
The Inspur AIStation AI resource platform targets enterprise AI development scenarios and helps enterprises build an integrated AI development platform. It gives AI development engineers efficient computing power, precise resource management and scheduling, agile data integration and acceleration, and streamlined integration of the AI development workflow, helping AI enterprises improve development efficiency and time to market and strengthen their competitiveness.
Beyond efficient resource management, AIStation also performs well in development environment creation, data management, and development workflow management. We will cover these topics in detail in follow-up articles based on real application scenarios; stay tuned.