baizectl CLI Usage Guide¶
baizectl is a command-line tool designed for model developers and data scientists who work in the DCE 5.0 AI Lab module. It provides commands to manage distributed training jobs, check job status, manage datasets, and more. It can also connect to Kubernetes worker clusters and DCE 5.0 workspaces, helping users use and manage Kubernetes platform resources efficiently.
Installation¶
Currently, baizectl is integrated within DCE 5.0 AI Lab. Once you create a Notebook, you can use baizectl directly within it.
Getting Started¶
Basic Information¶
Running baizectl without arguments prints its basic usage information:
jovyan@19d0197587cc:/$ baizectl
AI platform management tool
Usage:
baizectl [command]
Available Commands:
completion Generate the autocompletion script for the specified shell
data Management datasets
help Help about any command
job Manage jobs
login Login to the platform
version Show cli version
Flags:
--cluster string Cluster name to operate
-h, --help help for baizectl
--mode string Connection mode: auto, api, notebook (default "auto")
-n, --namespace string Namespace to use for the operation. If not set, the default Namespace will be used.
-s, --server string DCE5 access base url
--skip-tls-verify Skip TLS certificate verification
--token string DCE5 access token
-w, --workspace int32 Workspace ID to use for the operation
Use "baizectl [command] --help" for more information about a command.
The output above provides basic information about baizectl. You can view the help information using baizectl --help, or view the help for a specific command using baizectl [command] --help.
View Versions¶
baizectl supports viewing version information using the version command.
baizectl version
Command Format¶
The basic format of the baizectl command is as follows:
baizectl [command] [flags]
Here, [command] refers to the specific operation command, such as data and job, and [flags] are optional parameters used to specify detailed information about the operation.
Common Options¶
- --cluster string: Specify the name of the cluster to operate on.
- -h, --help: Display help information.
- --mode string: Connection mode. Valid values are auto, api, and notebook (the default is auto).
- -n, --namespace string: Specify the namespace for the operation. If not set, the default namespace will be used.
- -s, --server string: Base URL for accessing DCE5.
- --skip-tls-verify: Skip TLS certificate verification.
- --token string: Access token for DCE5.
- -w, --workspace int32: Specify the workspace ID for the operation.
Features¶
Job Management¶
baizectl provides a series of commands to manage distributed training jobs, including viewing job lists, submitting jobs, viewing logs, restarting jobs, deleting jobs, and more.
jovyan@19d0197587cc:/$ baizectl job
Manage jobs
Usage:
baizectl job [command]
Available Commands:
delete Delete a job
logs Show logs of a job
ls List jobs
restart restart a job
submit Submit a job
Flags:
-h, --help help for job
-o, --output string Output format. One of: table, json, yaml (default "table")
--page int Page number (default 1)
--page-size int Page size (default -1)
--search string Search query
--sort string Sort order
--truncate int Truncate output to the given length, 0 means no truncation (default 50)
Use "baizectl job [command] --help" for more information about a command.
Submit Training Jobs¶
baizectl supports submitting a job using the submit command. You can view detailed information by using baizectl job submit --help.
(base) jovyan@den-0:~$ baizectl job submit --help
Submit a job
Usage:
baizectl job submit [flags] -- command ...
Aliases:
submit, create
Examples:
# Submit a job to run the command "torchrun python train.py"
baizectl job submit -- torchrun python train.py
# Submit a job with 2 workers(each pod use 4 gpus) to run the command "torchrun python train.py" and use the image "pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime"
baizectl job submit --image pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime --workers 2 --resources nvidia.com/gpu=4 -- torchrun python train.py
# Submit a tensorflow job to run the command "python train.py"
baizectl job submit --tensorflow -- python train.py
Flags:
--annotations stringArray The annotations of the job, the format is key=value
--auto-load-env It only takes effect when executed in Notebook, the environment variables of the current environment will be automatically read and set to the environment variables of the Job, the specific environment variables to be read can be specified using the BAIZE_MAPPING_ENVS environment variable, the default is PATH,CONDA_*,*PYTHON*,NCCL_*, if set to false, the environment variables of the current environment will not be read. (default true)
--commands stringArray The default command of the job
-d, --datasets stringArray The dataset bind to the job, the format is datasetName:mountPath, e.g. mnist:/data/mnist
-e, --envs stringArray The environment variables of the job, the format is key=value
-x, --from-notebook string Define whether to read the configuration of the current Notebook and directly create tasks, including images, resources, and dataset.
auto: Automatically determine the mode according to the current environment. If the current environment is a Notebook, it will be set to notebook mode.
false: Do not read the configuration of the current Notebook.
true: Read the configuration of the current Notebook. (default "auto")
-h, --help help for submit
--image string The image of the job, it must be specified if fromNotebook is false.
-t, --job-type string Job type: PYTORCH, TENSORFLOW, PADDLE (default "PYTORCH")
--labels stringArray The labels of the job, the format is key=value
--max-retries int32 number of retries before marking this job failed
--max-run-duration int Specifies the duration in seconds relative to the startTime that the job may be active before the system tries to terminate it
--name string The name of the job, if empty, the name will be generated automatically.
--paddle PaddlePaddle Job, has higher priority than --job-type
--priority string The priority of the job, current support baize-medium-priority, baize-low-priority, baize-high-priority
--pvcs stringArray The pvcs bind to the job, the format is pvcName:mountPath, e.g. mnist:/data/mnist
--pytorch Pytorch Job, has higher priority than --job-type
--queue string The queue to used
--requests-resources stringArray Similar to resources, but sets the resources of requests
--resources stringArray The resources of the job, it is a string in the format of cpu=1,memory=1Gi,nvidia.com/gpu=1, it will be set to the limits and requests of the container.
--restart-policy string The job restart policy (default "on-failure")
--runtime-envs baizectl data ls --runtime-env The runtime environment to use for the job, you can use baizectl data ls --runtime-env to get the runtime environment
--shm-size int32 The shared memory size of the job, default is 0, which means no shared memory, if set to more than 0, the job will use the shared memory, the unit is MiB
--tensorboard-log-dir string The tensorboard log directory, if set, the job will automatically start tensorboard, else not. The format is /path/to/log, you can use relative path in notebook.
--tensorflow Tensorflow Job, has higher priority than --job-type
--workers int The workers of the job, default is 1, which means single worker, if set to more than 1, the job will be distributed. (default 1)
--working-dir string The working directory of job container, if in notebook mode, the default is the directory of the current file
Note
Explanation of command parameters for submitting jobs:
- --name: Job name. If empty, it will be auto-generated.
- --image: Image name. This must be specified if --from-notebook is false.
- --priority: Job priority. Supported values are baize-high-priority (high), baize-medium-priority (medium), and baize-low-priority (low).
- --resources: Job resources, formatted as cpu=1,memory=1Gi,nvidia.com/gpu=1.
- --workers: Number of job worker nodes. The default is 1. When set to greater than 1, the job will run in a distributed manner.
- --queue: Job queue. Queue resources need to be created in advance.
- --working-dir: Working directory. In Notebook mode, the current file directory will be used by default.
- --datasets: Dataset, formatted as datasetName:mountPath, for example mnist:/data/mnist.
- --shm-size: Shared memory size in MiB. The default is 0, which means no shared memory. It can be enabled for distributed training jobs.
- --labels: Job labels, formatted as key=value.
- --max-retries: Maximum retry count, that is, the number of times the job is restarted after a failure. Default is unlimited.
- --max-run-duration: Maximum run duration in seconds. The system terminates the job if it runs longer than this. Default is unlimited.
- --restart-policy: Restart policy. Supported values are on-failure, never, and always. The default is on-failure.
- --from-notebook: Whether to read configurations from the Notebook. Supported values are auto, true, and false. The default is auto.
Example of a PyTorch Single-Node Job¶
The following example submits a single-node PyTorch training job. Modify the parameters to suit your needs:
baizectl job submit --name demojob-v2 -t PYTORCH \
--image release.daocloud.io/baize/baize-notebook:v0.5.0 \
--priority baize-high-priority \
--resources cpu=1,memory=1Gi \
--workers 1 \
--queue default \
--working-dir /data \
--datasets fashion-mnist:/data/mnist \
--labels job_type=pytorch \
--max-retries 3 \
--max-run-duration 60 \
--restart-policy on-failure \
-- sleep 1000
Example of a Distributed PyTorch Job¶
The following example submits a distributed PyTorch training job. Modify the parameters to suit your needs:
# Setting --workers to more than 1 automatically creates a distributed job.
baizectl job submit --name demojob-v2 -t PYTORCH \
--image release.daocloud.io/baize/baize-notebook:v0.5.0 \
--priority baize-high-priority \
--resources cpu=1,memory=1Gi \
--workers 2 \
--shm-size 1024 \
--queue default \
--working-dir /data \
--datasets fashion-mnist:/data/mnist \
--labels job_type=pytorch \
--max-retries 3 \
--max-run-duration 60 \
--restart-policy on-failure \
-- sleep 1000
Example of a TensorFlow Job¶
Use the -t parameter to specify the job type. Below is an example of creating a TensorFlow job:
baizectl job submit --name demojob-v2 -t TENSORFLOW \
--image release.daocloud.io/baize/baize-notebook:v0.5.0 \
--priority baize-high-priority \
--from-notebook auto \
--workers 1 \
--queue default \
--working-dir /data \
--datasets fashion-mnist:/data/mnist \
--labels job_type=tensorflow \
--max-retries 3 \
--max-run-duration 60 \
--restart-policy on-failure \
-- sleep 1000
Instead of -t, you can use its long form --job-type or the --tensorflow flag to specify the job type.
Example of a Paddle Job¶
baizectl job submit --name demojob-v2 -t PADDLE \
--image release.daocloud.io/baize/baize-notebook:v0.5.0 \
--priority baize-high-priority \
--queue default \
--working-dir /data \
--datasets fashion-mnist:/data/mnist \
--labels job_type=paddle \
--max-retries 3 \
--max-run-duration 60 \
--restart-policy on-failure \
-- sleep 1000
View Job List¶
baizectl job supports viewing the job list using the ls command. By default, it displays PYTORCH jobs, but you can specify another job type using the -t parameter.
(base) jovyan@den-0:~$ baizectl job ls # View pytorch jobs by default
NAME TYPE PHASE DURATION COMMAND
demong PYTORCH SUCCEEDED 1m2s sleep 60
demo-sleep PYTORCH RUNNING 1h25m28s sleep 7200
(base) jovyan@den-0:~$ baizectl job ls demo-sleep # View a specific job
NAME TYPE PHASE DURATION COMMAND
demo-sleep PYTORCH RUNNING 1h25m28s sleep 7200
(base) jovyan@den-0:~$ baizectl job ls -t TENSORFLOW # View tensorflow jobs
NAME TYPE PHASE DURATION COMMAND
demotfjob TENSORFLOW CREATED 0s sleep 1000
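The PHASE column of the table output can also be read from a script, for example to wait until a job finishes. The following is a minimal sketch rather than a baizectl feature: the helper names phase_of and wait_for_job are hypothetical, the 10-second polling interval is arbitrary, and the code assumes the table layout shown above (a header row followed by NAME TYPE PHASE DURATION COMMAND) and that SUCCEEDED and FAILED are the final phases.

```shell
# Hypothetical helper: read the table printed by `baizectl job ls <name>`
# on stdin and print the PHASE (third column) of the first data row.
phase_of() {
  awk 'NR == 2 { print $3 }'
}

# Hypothetical helper: poll a job until it reaches SUCCEEDED or FAILED.
# Usage: wait_for_job <job-name> [job-type]
wait_for_job() {
  name="$1"
  job_type="${2:-PYTORCH}"
  while :; do
    phase=$(baizectl job ls -t "$job_type" "$name" | phase_of)
    case "$phase" in
      SUCCEEDED) return 0 ;;
      FAILED)    return 1 ;;
      "")        echo "job $name not found or could not be listed" >&2
                 return 2 ;;
    esac
    sleep 10   # arbitrary polling interval
  done
}
```

For example, `wait_for_job demo-sleep && echo "job finished"` would block until the job leaves the CREATED or RUNNING phase.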
The job list uses table as the default display format. If you want to view more information, you can use the json or yaml format, which can be specified using the -o parameter.
(base) jovyan@den-0:~$ baizectl job ls -t TENSORFLOW -o yaml
- baseConfig:
args:
- sleep
- "1000"
image: release.daocloud.io/baize/baize-notebook:v0.5.0
labels:
app: den
podConfig:
affinity: {}
kubeEnvs:
- name: CONDA_EXE
value: /opt/conda/bin/conda
- name: CONDA_PREFIX
value: /opt/conda
- name: CONDA_PROMPT_MODIFIER
value: '(base) '
- name: CONDA_SHLVL
value: "1"
- name: CONDA_DIR
value: /opt/conda
- name: CONDA_PYTHON_EXE
value: /opt/conda/bin/python
- name: CONDA_PYTHON_EXE
value: /opt/conda/bin/python
- name: CONDA_DEFAULT_ENV
value: base
- name: PATH
value: /opt/conda/bin:/opt/conda/condabin:/command:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
priorityClass: baize-high-priority
queue: default
creationTimestamp: "2024-06-16T07:47:27Z"
jobSpec:
runPolicy:
suspend: true
tfReplicaSpecs:
Worker:
replicas: 1
restartPolicy: OnFailure
template:
metadata:
creationTimestamp: null
spec:
affinity: {}
containers:
- args:
- sleep
- "1000"
env:
- name: CONDA_EXE
value: /opt/conda/bin/conda
- name: CONDA_PREFIX
value: /opt/conda
- name: CONDA_PROMPT_MODIFIER
value: '(base) '
- name: CONDA_SHLVL
value: "1"
- name: CONDA_DIR
value: /opt/conda
- name: CONDA_PYTHON_EXE
value: /opt/conda/bin/python
- name: CONDA_PYTHON_EXE
value: /opt/conda/bin/python
- name: CONDA_DEFAULT_ENV
value: base
- name: PATH
value: /opt/conda/bin:/opt/conda/condabin:/command:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
image: release.daocloud.io/baize/baize-notebook:v0.5.0
name: tensorflow
resources:
limits:
memory: 1Gi
requests:
cpu: "1"
memory: 2Gi
workingDir: /home/jovyan
priorityClassName: baize-high-priority
name: demotfjob
namespace: ns-chuanjia-ndx
phase: CREATED
roleConfig:
TF_WORKER:
replicas: 1
resources:
limits:
memory: 1Gi
requests:
cpu: "1"
memory: 2Gi
totalResources:
limits:
memory: "1073741824"
requests:
cpu: "1"
memory: "2147483648"
trainingConfig:
restartPolicy: RESTART_POLICY_ON_FAILURE
trainingMode: SINGLE
type: TENSORFLOW
View Job Logs¶
baizectl job supports viewing job logs using the logs command. You can view detailed information by using baizectl job logs --help.
(base) jovyan@den-0:~$ baizectl job logs --help
Show logs of a job
Usage:
baizectl job logs <job-name> [pod-name] [flags]
Aliases:
logs, log
Flags:
-f, --follow Specify if the logs should be streamed.
-h, --help help for logs
-t, --job-type string Job type: PYTORCH, TENSORFLOW, PADDLE (default "PYTORCH")
--paddle PaddlePaddle Job, has higher priority than --job-type
--pytorch Pytorch Job, has higher priority than --job-type
--tail int Lines of recent log file to display.
--tensorflow Tensorflow Job, has higher priority than --job-type
--timestamps Show timestamps
Note
- The --follow parameter allows for real-time log viewing.
- The --tail parameter specifies the number of log lines to view, with a default of 50 lines.
- The --timestamps parameter displays timestamps.
Example of viewing job logs:
(base) jovyan@den-0:~$ baizectl job log -t TENSORFLOW tf-sample-job-v2-202406161632-evgrbrhn -f
2024-06-16 08:33:06.083766: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-06-16 08:33:06.086189: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-06-16 08:33:06.132416: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-06-16 08:33:06.132903: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-16 08:33:07.223046: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
Conv1 (Conv2D) (None, 13, 13, 8) 80
flatten (Flatten) (None, 1352) 0
Softmax (Dense) (None, 10) 13530
=================================================================
Total params: 13610 (53.16 KB)
Trainable params: 13610 (53.16 KB)
Non-trainable params: 0 (0.00 Byte)
...
Delete Jobs¶
baizectl job supports deleting jobs using the delete command, and it can delete multiple jobs in a single invocation.
(base) jovyan@den-0:~$ baizectl job delete --help
Delete a job
Usage:
baizectl job delete [flags]
Aliases:
delete, del, remove, rm
Flags:
-h, --help help for delete
-t, --job-type string Job type: PYTORCH, TENSORFLOW, PADDLE (default "PYTORCH")
--paddle PaddlePaddle Job, has higher priority than --job-type
--pytorch Pytorch Job, has higher priority than --job-type
--tensorflow Tensorflow Job, has higher priority than --job-type
Here is an example of deleting jobs:
(base) jovyan@den-0:~$ baizectl job ls
NAME TYPE PHASE DURATION COMMAND
demong PYTORCH SUCCEEDED 1m2s sleep 60
demo-sleep PYTORCH RUNNING 1h20m51s sleep 7200
demojob PYTORCH FAILED 16m46s sleep 1000
demojob-v2 PYTORCH RUNNING 3m13s sleep 1000
demojob-v3 PYTORCH CREATED 0s sleep 1000
(base) jovyan@den-0:~$ baizectl job delete demojob # delete a job
Delete job demojob in ns-chuanjia-ndx successfully
(base) jovyan@den-0:~$ baizectl job delete demojob-v2 demojob-v3 # delete several jobs
Delete job demojob-v2 in ns-chuanjia-ndx successfully
Delete job demojob-v3 in ns-chuanjia-ndx successfully
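Because delete accepts several job names, the job list can be filtered and passed to it in one step. The sketch below is an illustration under assumptions: jobs_in_phase and delete_jobs_in_phase are hypothetical helpers, they rely on the table layout shown above, and they pass --truncate 0 to baizectl job ls on the assumption that this keeps long job names intact (the help text states that 0 means no truncation).

```shell
# Hypothetical helper: read `baizectl job ls` table output on stdin and
# print the names of all jobs whose PHASE column equals the first argument.
jobs_in_phase() {
  awk -v phase="$1" 'NR > 1 && $3 == phase { print $1 }'
}

# Hypothetical helper: delete every job of the default type (PYTORCH)
# that is in the given phase. Does nothing when no job matches.
delete_jobs_in_phase() {
  names=$(baizectl job ls --truncate 0 | jobs_in_phase "$1")
  if [ -n "$names" ]; then
    # Unquoted on purpose: each name becomes a separate argument.
    baizectl job delete $names
  fi
}
```

Calling `delete_jobs_in_phase FAILED` would then remove all failed PyTorch jobs in the current namespace. Review the output of `baizectl job ls | jobs_in_phase FAILED` first, since deletion cannot be undone from this script.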
Restart Jobs¶
baizectl job supports restarting jobs using the restart command. You can view detailed information by using baizectl job restart --help.
(base) jovyan@den-0:~$ baizectl job restart --help
restart a job
Usage:
baizectl job restart [flags] job
Aliases:
restart, rerun
Flags:
-h, --help help for restart
-t, --job-type string Job type: PYTORCH, TENSORFLOW, PADDLE (default "PYTORCH")
--paddle PaddlePaddle Job, has higher priority than --job-type
--pytorch Pytorch Job, has higher priority than --job-type
--tensorflow Tensorflow Job, has higher priority than --job-type
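The usage line above implies an invocation of the form baizectl job restart [flags] job. A minimal sketch follows. The wrapper restart_jobs and the BAIZECTL variable are hypothetical conveniences, and the loop restarts one job per call because the usage line names a single job.

```shell
# Hypothetical helper: restart several jobs of the same type, one per call.
# Usage: restart_jobs <job-type> <job-name>...
# BAIZECTL can be set to another program (for example `echo`) for a dry run.
restart_jobs() {
  job_type="$1"
  shift
  for name in "$@"; do
    "${BAIZECTL:-baizectl}" job restart -t "$job_type" "$name"
  done
}
```

For example, `restart_jobs TENSORFLOW demotfjob` restarts the TensorFlow job from the earlier listing, and `BAIZECTL=echo restart_jobs TENSORFLOW demotfjob` only prints the arguments that would be passed.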
Dataset Management¶
baizectl supports managing datasets. Currently, it supports viewing the dataset list, which makes it convenient to bind datasets quickly when submitting training jobs.
(base) jovyan@den-0:~$ baizectl data
Management datasets
Usage:
baizectl data [flags]
baizectl data [command]
Aliases:
data, dataset, datasets, envs, runtime-envs
Available Commands:
ls List datasets
Flags:
-h, --help help for data
-o, --output string Output format. One of: table, json, yaml (default "table")
--page int Page number (default 1)
--page-size int Page size (default -1)
--search string Search query
--sort string Sort order
--truncate int Truncate output to the given length, 0 means no truncation (default 50)
Use "baizectl data [command] --help" for more information about a command.
View Datasets¶
baizectl data supports viewing the datasets using the ls command. By default, it displays in table format, but you can specify the output format using the -o parameter.
(base) jovyan@den-0:~$ baizectl data ls
NAME TYPE URI PHASE
fashion-mnist GIT https://gitee.com/samzong_lu/fashion-mnist.git READY
sample-code GIT https://gitee.com/samzong_lu/training-sample-code.... READY
training-output PVC pvc://training-output READY
When submitting a training job, you can specify the dataset using the -d or --datasets parameter, for example:
baizectl job submit --image release.daocloud.io/baize/baize-notebook:v0.5.0 \
--datasets sample-code:/home/jovyan/code \
-- sleep 1000
To mount multiple datasets simultaneously, you can use the following format:
baizectl job submit --image release.daocloud.io/baize/baize-notebook:v0.5.0 \
--datasets sample-code:/home/jovyan/code fashion-mnist:/home/jovyan/data \
-- sleep 1000
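The help output lists --datasets as a stringArray flag, and flags of that type conventionally accept one value per occurrence. Repeating the flag for each dataset is therefore a common alternative spelling. Whether baizectl treats it identically to the form above is an assumption. The helper dataset_flags is hypothetical and only prints the flags so that they can be inspected or spliced into a command.

```shell
# Hypothetical helper: print one "--datasets name:mountPath" pair for each
# argument, separated by spaces.
dataset_flags() {
  for ds in "$@"; do
    printf '%s %s ' --datasets "$ds"
  done
}

# Show the flags for the two datasets used in the example above.
dataset_flags sample-code:/home/jovyan/code fashion-mnist:/home/jovyan/data
echo
```

The printed flags can be spliced into a submit command with an unquoted `$(dataset_flags ...)` placed before the `--` separator. This relies on shell word splitting, so it only works for mount paths that contain no spaces.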
View Dependencies (Environment)¶
The runtime environment (runtime-env) is a unique environment management capability of DCE. By decoupling the dependencies required for model development, training tasks, and inference, it offers a more flexible way to manage dependencies without the need to repeatedly build complex Docker images. You simply need to select the appropriate environment.
Additionally, runtime-env supports hot updates and dynamic upgrades, allowing you to update environment dependencies without rebuilding the image.
baizectl data supports viewing the environment list by passing the --runtime-env flag to the ls command. By default, it displays in table format, but you can specify the output format using the -o parameter.
(base) jovyan@den-0:~$ baizectl data ls --runtime-env
NAME TYPE URI PHASE
fashion-mnist GIT https://gitee.com/samzong_lu/fashion-mnist.git READY
sample-code GIT https://gitee.com/samzong_lu/training-sample-code.... READY
training-output PVC pvc://training-output READY
tensorflow-sample CONDA conda://python?version=3.12.3 PROCESSING
When submitting a training job, you can specify the environment using the --runtime-envs parameter, which is the flag name listed by baizectl job submit --help:
baizectl job submit --image release.daocloud.io/baize/baize-notebook:v0.5.0 \
--runtime-envs tensorflow-sample \
-- sleep 1000
Advanced Usage¶
baizectl supports more advanced usage, such as generating auto-completion scripts, using specific clusters and namespaces, and using specific workspaces.
Generating Auto-Completion Scripts¶
baizectl completion bash > /etc/bash_completion.d/baizectl
The above command generates an auto-completion script for bash and saves it to the file /etc/bash_completion.d/baizectl. You can load the auto-completion script by using source /etc/bash_completion.d/baizectl.
Using Specific Clusters and Namespaces¶
baizectl job ls --cluster my-cluster --namespace my-namespace
This command will list all jobs in the my-namespace namespace within the my-cluster cluster.
Using Specific Workspaces¶
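Workspaces are selected with the global -w / --workspace flag, which takes the numeric workspace ID. The following is a minimal sketch, assuming that global flags may precede the subcommand, as is usual for CLIs of this style. The ID 7, the function name bz_ws, and the BAIZECTL variable are hypothetical.

```shell
# Hypothetical workspace ID; replace it with the ID of your own workspace.
WORKSPACE_ID=7

# Hypothetical wrapper: run any baizectl subcommand against that workspace.
# BAIZECTL can be set to another program (for example `echo`) for a dry run.
bz_ws() {
  "${BAIZECTL:-baizectl}" --workspace="$WORKSPACE_ID" "$@"
}
```

With this wrapper, `bz_ws job ls` and `bz_ws data ls` would list the jobs and datasets of the selected workspace without repeating the flag.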
Frequently Asked Questions¶
- Question: Why can't I connect to the server?
  Solution: Check whether the --server parameter is set correctly and ensure that the network connection is stable. If the server uses a self-signed certificate, you can use --skip-tls-verify to skip TLS certificate verification.
- Question: How can I resolve insufficient permissions issues?
  Solution: Ensure that you are using the correct --token parameter to log in, and check whether the current user has the necessary permissions for the operation.
- Question: Why can't I list the datasets?
  Solution: Check whether the namespace and workspace are set correctly, and ensure that the current user has permission to access these resources.
Conclusion¶
With this guide, you can quickly get started with baizectl commands and efficiently manage AI platform resources in practical applications. If you have any questions or issues, use baizectl [command] --help to check more detailed information.