baizectl 命令行工具使用指南¶
baizectl
是在 DCE 5.0 AI Lab 模块中专门服务于模型开发者与数据科学家们使用的命令行工具。 它提供了一系列命令来帮助用户管理分布式训练作业、查看任务状态、管理数据集等操作,同时支持连接 Kubernetes 工作集群和 DCE 5.0 工作空间,帮助用户更高效地使用和管理 Kubernetes 平台资源。
安装¶
目前,baizectl
已经集成在 DCE 5.0 AI Lab 中。 你在创建 Notebook 后,即可在 Notebook 中直接使用 baizectl
。
快速上手¶
基本信息¶
baizectl
命令的基本格式如下:
jovyan@19d0197587cc:/$ baizectl
AI platform management tool
Usage:
baizectl [command]
Available Commands:
completion Generate the autocompletion script for the specified shell
data Management datasets
help Help about any command
job Manage jobs
login Login to the platform
version Show cli version
Flags:
--cluster string Cluster name to operate
-h, --help help for baizectl
--mode string Connection mode: auto, api, notebook (default "auto")
-n, --namespace string Namespace to use for the operation. If not set, the default Namespace will be used.
-s, --server string DCE5 access base url
--skip-tls-verify Skip TLS certificate verification
--token string DCE5 access token
-w, --workspace int32 Workspace ID to use for the operation
Use "baizectl [command] --help" for more information about a command.
以上是 baizectl
的基本信息,用户可以通过 baizectl --help
查看帮助信息, 或者通过 baizectl [command] --help
查看具体命令的帮助信息。
查看版本信息¶
baizectl
支持通过 version
命令查看版本信息。
命令格式¶
baizectl
命令的基本格式如下:
其中,[command]
是具体的操作命令,如 data
、job
等,[flags]
是可选的参数,用于指定操作的详细信息。
常用选项¶
--cluster string
:指定要操作的集群名称。-h, --help
:显示帮助信息。--mode string
:连接模式,可选值为auto
、api
、notebook
(默认值为auto
)。-n, --namespace string
:指定操作的命名空间。如果未设置,将使用默认命名空间。-s, --server string
:DCE5 访问基础 URL。--skip-tls-verify
:跳过 TLS 证书验证。--token string
:DCE5 访问令牌。-w, --workspace int32
:指定操作的工作区 ID。
功能介绍¶
任务管理¶
baizectl 提供了一系列命令来管理分布式训练任务,包含了查看任务列表,提交任务、查看日志、重启任务、删除任务等。
jovyan@19d0197587cc:/$ baizectl job
Manage jobs
Usage:
baizectl job [command]
Available Commands:
delete Delete a job
logs Show logs of a job
ls List jobs
restart restart a job
submit Submit a job
Flags:
-h, --help help for job
-o, --output string Output format. One of: table, json, yaml (default "table")
--page int Page number (default 1)
--page-size int Page size (default -1)
--search string Search query
--sort string Sort order
--truncate int Truncate output to the given length, 0 means no truncation (default 50)
Use "baizectl job [command] --help" for more information about a command.
提交训练任务¶
baizectl
支持使用 submit
命令提交一个任务,用户可以通过 baizectl job submit --help
查看详细信息。
(base) jovyan@den-0:~$ baizectl job submit --help
Submit a job
Usage:
baizectl job submit [flags] -- command ...
Aliases:
submit, create
Examples:
# Submit a job to run the command "torchrun python train.py"
baizectl job submit -- torchrun python train.py
# Submit a job with 2 workers(each pod use 4 gpus) to run the command "torchrun python train.py" and use the image "pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime"
baizectl job submit --image pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime --workers 2 --requests-resources nvidia.com/gpu=4 -- torchrun python train.py
# Submit a tensorflow job to run the command "python train.py"
baizectl job submit --tensorflow -- python train.py
Flags:
--annotations stringArray The annotations of the job, the format is key=value
--auto-load-env It only takes effect when executed in Notebook, the environment variables of the current environment will be automatically read and set to the environment variables of the Job, the specific environment variables to be read can be specified using the BAIZE_MAPPING_ENVS environment variable, the default is PATH,CONDA_*,*PYTHON*,NCCL_*, if set to false, the environment variables of the current environment will not be read. (default true)
--commands stringArray The default command of the job
-d, --datasets stringArray The dataset bind to the job, the format is datasetName:mountPath, e.g. mnist:/data/mnist
-e, --envs stringArray The environment variables of the job, the format is key=value
-x, --from-notebook string Define whether to read the configuration of the current Notebook and directly create tasks, including images, resources, Dataset, etc.
auto: Automatically determine the mode according to the current environment. If the current environment is a Notebook, it will be set to notebook mode.
false: Do not read the configuration of the current Notebook.
true: Read the configuration of the current Notebook. (default "auto")
-h, --help help for submit
--image string The image of the job, it must be specified if fromNotebook is false.
-t, --job-type string Job type: PYTORCH, TENSORFLOW, PADDLE (default "PYTORCH")
--labels stringArray The labels of the job, the format is key=value
--max-retries int32 number of retries before marking this job failed
--max-run-duration int Specifies the duration in seconds relative to the startTime that the job may be active before the system tries to terminate it
--name string The name of the job, if empty, the name will be generated automatically.
--paddle PaddlePaddle Job, has higher priority than --job-type
--priority string The priority of the job, current support baize-medium-priority, baize-low-priority, baize-high-priority
--pvcs stringArray The pvcs bind to the job, the format is pvcName:mountPath, e.g. mnist:/data/mnist
--pytorch Pytorch Job, has higher priority than --job-type
--queue string The queue to used
--requests-resources stringArray Similar to resources, but sets the resources of requests
--resources stringArray The resources of the job, it is a string in the format of cpu=1,memory=1Gi,nvidia.com/gpu=1, it will be set to the limits and requests of the container.
--restart-policy string The job restart policy (default "on-failure")
--runtime-envs baizectl data ls --runtime-env The runtime environment to use for the job, you can use baizectl data ls --runtime-env to get the runtime environment
--shm-size int32 The shared memory size of the job, default is 0, which means no shared memory, if set to more than 0, the job will use the shared memory, the unit is MiB
--tensorboard-log-dir string The tensorboard log directory, if set, the job will automatically start tensorboard, else not. The format is /path/to/log, you can use relative path in notebook.
--tensorflow Tensorflow Job, has higher priority than --job-type
--workers int The workers of the job, default is 1, which means single worker, if set to more than 1, the job will be distributed. (default 1)
--working-dir string The working directory of job container, if in notebook mode, the default is the directory of the current file
Note
提交任务的命令参数说明:
- --name: 任务名称,如果为空,则会自动生成
- --image: 镜像名称,必须指定
- --priority: 任务优先级,支持 高=
baize-high-priority
、中=baize-medium-priority
、低=baize-low-priority
- --requests-resources: requests 的任务资源,格式为
cpu=1 memory=1Gi,nvidia.com/gpu=1
- --resources: limits 的任务资源,格式为
cpu=1 memory=1Gi,nvidia.com/gpu=1
- --workers: 任务工作节点数,默认为 1,当设置大于 1 时,任务将会分布式运行
- --queue: 任务队列,需要提前创建队列资源
- --working-dir: 工作目录,如果在 Notebook 模式下,会默认使用当前文件目录
- --datasets: 数据集,格式为
datasetName:mountPath
,例如mnist:/data/mnist
- --shm-size: 共享内存大小,在分布式训练任务时,可以启用,表示使用共享内存,单位为 MiB
- --labels: 任务标签,格式为
key=value
- --max-retries: 最大重试次数,任务失败后重试次数,失败后会重启任务,默认不限制
- --max-run-duration: 最大运行时间,任务运行时间超过指定时间后,会被系统终止,默认不限制
- --restart-policy: 重启策略,支持
on-failure
、never
、always
,默认为on-failure
- --from-notebook: 是否从 Notebook 中读取配置,支持
auto
、true
、false
,默认为auto
PyTorch 单机任务示例¶
提交训练任务示例,用户可以根据实际需求修改参数,以下为创建一个 PyTorch 任务的示例:
baizectl job submit --name demojob-v2 -t PYTORCH \
--image release.daocloud.io/baize/baize-notebook:v0.5.0 \
--priority baize-high-priority \
--requests-resources cpu=1,memory=1Gi \
--workers 1 \
--queue default \
--working-dir /data \
--datasets fashion-mnist:/data/mnist \
--labels job_type=pytorch \
--max-retries 3 \
--max-run-duration 60 \
--restart-policy on-failure \
-- sleep 1000
PyTorch 分布式任务示例¶
提交训练任务示例,用户可以根据实际需求修改参数,以下为创建一个 PyTorch 任务的示例:
baizectl job submit --name demojob-v2 -t PYTORCH \
--image release.daocloud.io/baize/baize-notebook:v0.5.0 \
--priority baize-high-priority \
--requests-resources cpu=1,memory=1Gi \
--workers 2 \ # 多任务副本会自动创建分布式任务
--shm-size 1024 \
--queue default \
--working-dir /data \
--datasets fashion-mnist:/data/mnist \
--labels job_type=pytorch \
--max-retries 3 \
--max-run-duration 60 \
--restart-policy on-failure \
-- sleep 1000
Tensorflow 任务示例¶
使用 -t
参数指定任务类型,以下为创建一个 Tensorflow 任务的示例:
baizectl job submit --name demojob-v2 -t TENSORFLOW \
--image release.daocloud.io/baize/baize-notebook:v0.5.0 \
--priority baize-high-priority \
--from-notebook auto \
--workers 1 \
--queue default \
--working-dir /data \
--datasets fashion-mnist:/data/mnist \
--labels job_type=pytorch \
--max-retries 3 \
--max-run-duration 60 \
--restart-policy on-failure \
-- sleep 1000
也可以使用 --job-type
或者 --tensorflow
参数指定任务类型
Paddle 任务示例¶
baizectl job submit --name demojob-v2 -t PADDLE \
--image release.daocloud.io/baize/baize-notebook:v0.5.0 \
--priority baize-high-priority \
--queue default \
--working-dir /data \
--datasets fashion-mnist:/data/mnist \
--labels job_type=pytorch \
--max-retries 3 \
--max-run-duration 60 \
--restart-policy on-failure \
-- sleep 1000
查看任务列表¶
baizectl job
支持通过 ls
命令查看任务列表,默认显示 pytroch
任务,用户可以通过 -t
指定任务类型。
(base) jovyan@den-0:~$ baizectl job ls # 默认查看 pytorch 任务
NAME TYPE PHASE DURATION COMMAND
demong PYTORCH SUCCEEDED 1m2s sleep 60
demo-sleep PYTORCH RUNNING 1h25m28s sleep 7200
(base) jovyan@den-0:~$ baizectl job ls demo-sleep # 查看指定任务
NAME TYPE PHASE DURATION COMMAND
demo-sleep PYTORCH RUNNING 1h25m28s sleep 7200
(base) jovyan@den-0:~$ baizectl job ls -t TENSORFLOW # 查看 tensorflow 任务
NAME TYPE PHASE DURATION COMMAND
demotfjob TENSORFLOW CREATED 0s sleep 1000
任务列表默认情况下使用 table
作为展示形式,如果希望查看更多信息,可使用 json
或 yaml
格式展示,可以通过 -o
参数指定。
(base) jovyan@den-0:~$ baizectl job ls -t TENSORFLOW -o yaml
- baseConfig:
args:
- sleep
- "1000"
image: release.daocloud.io/baize/baize-notebook:v0.5.0
labels:
app: den
podConfig:
affinity: {}
kubeEnvs:
- name: CONDA_EXE
value: /opt/conda/bin/conda
- name: CONDA_PREFIX
value: /opt/conda
- name: CONDA_PROMPT_MODIFIER
value: '(base) '
- name: CONDA_SHLVL
value: "1"
- name: CONDA_DIR
value: /opt/conda
- name: CONDA_PYTHON_EXE
value: /opt/conda/bin/python
- name: CONDA_PYTHON_EXE
value: /opt/conda/bin/python
- name: CONDA_DEFAULT_ENV
value: base
- name: PATH
value: /opt/conda/bin:/opt/conda/condabin:/command:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
priorityClass: baize-high-priority
queue: default
creationTimestamp: "2024-06-16T07:47:27Z"
jobSpec:
runPolicy:
suspend: true
tfReplicaSpecs:
Worker:
replicas: 1
restartPolicy: OnFailure
template:
metadata:
creationTimestamp: null
spec:
affinity: {}
containers:
- args:
- sleep
- "1000"
env:
- name: CONDA_EXE
value: /opt/conda/bin/conda
- name: CONDA_PREFIX
value: /opt/conda
- name: CONDA_PROMPT_MODIFIER
value: '(base) '
- name: CONDA_SHLVL
value: "1"
- name: CONDA_DIR
value: /opt/conda
- name: CONDA_PYTHON_EXE
value: /opt/conda/bin/python
- name: CONDA_PYTHON_EXE
value: /opt/conda/bin/python
- name: CONDA_DEFAULT_ENV
value: base
- name: PATH
value: /opt/conda/bin:/opt/conda/condabin:/command:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
image: release.daocloud.io/baize/baize-notebook:v0.5.0
name: tensorflow
resources:
limits:
memory: 1Gi
requests:
cpu: "1"
memory: 2Gi
workingDir: /home/jovyan
priorityClassName: baize-high-priority
name: demotfjob
namespace: ns-chuanjia-ndx
phase: CREATED
roleConfig:
TF_WORKER:
replicas: 1
resources:
limits:
memory: 1Gi
requests:
cpu: "1"
memory: 2Gi
totalResources:
limits:
memory: "1073741824"
requests:
cpu: "1"
memory: "2147483648"
trainingConfig:
restartPolicy: RESTART_POLICY_ON_FAILURE
trainingMode: SINGLE
type: TENSORFLOW
查看任务日志¶
baizectl job
支持使用 logs
命令查看任务日志,用户可以通过 baizectl job logs --help
查看详细信息。
(base) jovyan@den-0:~$ baizectl job logs --help
Show logs of a job
Usage:
baizectl job logs <job-name> [pod-name] [flags]
Aliases:
logs, log
Flags:
-f, --follow Specify if the logs should be streamed.
-h, --help help for logs
-t, --job-type string Job type: PYTORCH, TENSORFLOW, PADDLE (default "PYTORCH")
--paddle PaddlePaddle Job, has higher priority than --job-type
--pytorch Pytorch Job, has higher priority than --job-type
--tail int Lines of recent log file to display.
--tensorflow Tensorflow Job, has higher priority than --job-type
--timestamps Show timestamps
Note
--follow
参数实时查看日志--tail
参数指定查看日志的行数,默认为 50 行--timestamps
参数显示时间戳
示例查看任务日志:
(base) jovyan@den-0:~$ baizectl job log -t TENSORFLOW tf-sample-job-v2-202406161632-evgrbrhn -f
2024-06-16 08:33:06.083766: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-06-16 08:33:06.086189: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-06-16 08:33:06.132416: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-06-16 08:33:06.132903: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-16 08:33:07.223046: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
Conv1 (Conv2D) (None, 13, 13, 8) 80
flatten (Flatten) (None, 1352) 0
Softmax (Dense) (None, 10) 13530
=================================================================
Total params: 13610 (53.16 KB)
Trainable params: 13610 (53.16 KB)
Non-trainable params: 0 (0.00 Byte)
...
删除任务¶
baizectl job
支持使用 delete
命令删除任务,并且同时支持删除多个任务。
(base) jovyan@den-0:~$ baizectl job delete --help
Delete a job
Usage:
baizectl job delete [flags]
Aliases:
delete, del, remove, rm
Flags:
-h, --help help for delete
-t, --job-type string Job type: PYTORCH, TENSORFLOW, PADDLE (default "PYTORCH")
--paddle PaddlePaddle Job, has higher priority than --job-type
--pytorch Pytorch Job, has higher priority than --job-type
--tensorflow Tensorflow Job, has higher priority than --job-type
示例删除任务:
(base) jovyan@den-0:~$ baizectl job ls
NAME TYPE PHASE DURATION COMMAND
demong PYTORCH SUCCEEDED 1m2s sleep 60
demo-sleep PYTORCH RUNNING 1h20m51s sleep 7200
demojob PYTORCH FAILED 16m46s sleep 1000
demojob-v2 PYTORCH RUNNING 3m13s sleep 1000
demojob-v3 PYTORCH CREATED 0s sleep 1000
(base) jovyan@den-0:~$ baizectl job delete demojob # 删除单个任务
Delete job demojob in ns-chuanjia-ndx successfully
(base) jovyan@den-0:~$ baizectl job delete demojob-v2 demojob-v3 # 删除多个任务
Delete job demojob-v2 in ns-chuanjia-ndx successfully
Delete job demojob-v3 in ns-chuanjia-ndx successfully
重启任务¶
baizectl job
支持使用 restart
命令重启任务,用户可以通过 baizectl job restart --help
查看详细信息。
(base) jovyan@den-0:~$ baizectl job restart --help
restart a job
Usage:
baizectl job restart [flags] job
Aliases:
restart, rerun
Flags:
-h, --help help for restart
-t, --job-type string Job type: PYTORCH, TENSORFLOW, PADDLE (default "PYTORCH")
--paddle PaddlePaddle Job, has higher priority than --job-type
--pytorch Pytorch Job, has higher priority than --job-type
--tensorflow Tensorflow Job, has higher priority than --job-type
数据集管理¶
baizectl
支持管理数据集,目前支持查看数据集列表,方便在任务训练时,快速绑定数据集。
(base) jovyan@den-0:~$ baizectl data
Management datasets
Usage:
baizectl data [flags]
baizectl data [command]
Aliases:
data, dataset, datasets, envs, runtime-envs
Available Commands:
ls List datasets
Flags:
-h, --help help for data
-o, --output string Output format. One of: table, json, yaml (default "table")
--page int Page number (default 1)
--page-size int Page size (default -1)
--search string Search query
--sort string Sort order
--truncate int Truncate output to the given length, 0 means no truncation (default 50)
Use "baizectl data [command] --help" for more information about a command.
查看数据集列表¶
baizectl data
支持通过 ls
命令查看数据集列表,默认显示 table
格式,用户可以通过 -o
参数指定输出格式。
(base) jovyan@den-0:~$ baizectl data ls
NAME TYPE URI PHASE
fashion-mnist GIT https://gitee.com/samzong_lu/fashion-mnist.git READY
sample-code GIT https://gitee.com/samzong_lu/training-sample-code.... READY
training-output PVC pvc://training-output READY
在提交训练任务时,可以使用 -d
或者 --datasets
参数指定数据集,例如:
baizectl job submit --image release.daocloud.io/baize/baize-notebook:v0.5.0 \
--datasets sample-code:/home/jovyan/code \
-- sleep 1000
同时挂载多个数据集,可以按照如下格式:
baizectl job submit --image release.daocloud.io/baize/baize-notebook:v0.5.0 \
--datasets sample-code:/home/jovyan/code fashion-mnist:/home/jovyan/data \
-- sleep 1000
查看依赖库(环境)¶
环境 runtime-env
是 DCE 的特色环境管理能力,通过将模型开发、训练任务以及推理中所需的依赖库解耦, 提供了一种更加灵活的依赖库管理方式,无需重复构建复杂的 Docker 镜像,只需选择合适的环境即可。
同时 runtime-env
支持热更新,动态升级,无需重新构建镜像,即可更新环境依赖库。
baizectl data
支持通过 runtime-env
命令查看环境列表,默认显示 table
格式,用户可以通过 -o
参数指定输出格式。
(base) jovyan@den-0:~$ baizectl data ls --runtime-env
NAME TYPE URI PHASE
fashion-mnist GIT https://gitee.com/samzong_lu/fashion-mnist.git READY
sample-code GIT https://gitee.com/samzong_lu/training-sample-code.... READY
training-output PVC pvc://training-output READY
tensorflow-sample CONDA conda://python?version=3.12.3 PROCESSING
在提交训练任务时,可以使用 --runtime-env
参数指定环境,例如:
baizectl job submit --image release.daocloud.io/baize/baize-notebook:v0.5.0 \
--runtime-env tensorflow-sample \
-- sleep 1000
高级用法¶
baizectl 支持更多高级用法,例如自动补全脚本生成、使用特定集群和命名空间、使用特定工作空间等。
自动补全脚本生成¶
上述命令生成 bash
的自动补全脚本,并将其保存到 /etc/bash_completion.d/baizectl
目录中,用户可以通过 source /etc/bash_completion.d/baizectl
加载自动补全脚本。
使用特定集群和命名空间¶
该命令将列出 my-cluster
集群中 my-namespace
命名空间下的所有作业。
使用特定工作空间¶
常见问题¶
-
问题:为什么无法连接到服务器?
解决方法:检查
--server
参数是否正确设置,并确保网络连接正常。 如果服务器使用自签名证书,可以使用--skip-tls-verify
跳过 TLS 证书验证。 -
问题:如何解决权限不足的问题?
解决方法:确保使用正确的
--token
参数登录,并检查当前用户是否具有相应的操作权限。 -
问题:为什么无法列出数据集?
解决方法:检查命名空间和工作区是否正确设置,确保当前用户有权限访问这些资源。
结语¶
通过以上指南,用户可以快速上手 baizectl
命令,并在实际应用中高效地管理 AI 平台资源。 如果有任何疑问或问题,建议参考 baizectl [command] --help
获取更多详细信息。