Submit a DeepSpeed Training Task

According to the official DeepSpeed documentation, it is recommended to modify your code slightly to implement the training task.

Specifically, use deepspeed.init_distributed() in place of torch.distributed.init_process_group(...), then launch the script with torchrun and submit it as a PyTorch distributed task. This is all that is needed to run a DeepSpeed task.
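
For example, a minimal sketch of this change (assuming torchrun is used as the launcher, so the rendezvous environment variables are already set):

    import deepspeed

    # Before: native PyTorch initialization
    # torch.distributed.init_process_group(backend="nccl")

    # After: let DeepSpeed initialize the distributed backend.
    # deepspeed.init_distributed() picks up the environment variables
    # (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) set by torchrun.
    deepspeed.init_distributed()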

You can use torchrun to run your DeepSpeed training script. torchrun is the distributed training launcher provided by PyTorch, and it can be combined with the DeepSpeed API to start your training task.

Below is an example of running a DeepSpeed training script using torchrun:

  1. Write the training script:

    train.py
    import deepspeed
    from torch.utils.data import DataLoader
    
    # Load model and data; YourModel and YourDataset are placeholders
    # for your own model and dataset classes
    model = YourModel()
    train_dataset = YourDataset()
    train_dataloader = DataLoader(train_dataset, batch_size=32)
    
    # Path to the DeepSpeed configuration file
    deepspeed_config = "deepspeed_config.json"
    
    # Create the DeepSpeed training engine; the optimizer is built
    # from the configuration file
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=deepspeed_config
    )
    
    # Training loop
    for batch in train_dataloader:
        # Move the batch to the device assigned to this process
        batch = batch.to(model_engine.device)
        # Assumes the model's forward pass returns the loss
        loss = model_engine(batch)
        model_engine.backward(loss)  # DeepSpeed-managed backward pass
        model_engine.step()          # optimizer step and gradient zeroing
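
    When the script is launched with torchrun, deepspeed.initialize() detects the existing distributed environment and initializes the process group itself, so no explicit call to deepspeed.init_distributed() is required here.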
    
  2. Create the DeepSpeed configuration file:

    deepspeed_config.json
    {
      "train_batch_size": 32,
      "gradient_accumulation_steps": 1,
      "fp16": {
        "enabled": true,
        "loss_scale": 0
      },
      "optimizer": {
        "type": "Adam",
        "params": {
          "lr": 0.00015,
          "betas": [0.9, 0.999],
          "eps": 1e-08,
          "weight_decay": 0
        }
      }
    }
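
    Note that DeepSpeed checks the batch-size arithmetic: train_batch_size must equal the micro batch size per GPU × gradient_accumulation_steps × the number of processes. With this file and two workers, each process gets a micro batch of 16 (32 = 16 × 1 × 2), so keep the DataLoader batch_size consistent with that value.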
    
  3. Run the training script using torchrun or baizectl:

    torchrun train.py
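
    By default, torchrun launches a single process. To start one process per GPU on the node, add the --nproc_per_node flag:

    torchrun --nproc_per_node=2 train.py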
    

    In this way, you combine PyTorch's distributed launching capabilities with DeepSpeed's optimization technologies for more efficient training. From a notebook, you can use the baizectl command to submit the same script as a job:

    baizectl job submit --pytorch --workers 2 -- torchrun train.py
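
    Here, --pytorch submits the job as a PyTorch distributed task, --workers 2 requests two workers, and everything after -- is the command each worker runs.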
    