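Large Model Training

Prepare the dataset.

The datasets required for training are already present under the "./data" path of the LLaMA-Factory repository.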

data/
├── alpaca_data_en_52k.json
├── alpaca_data_zh_51k.json
├── alpaca_gpt4_data_en.json
├── alpaca_gpt4_data_zh.json
├── belle_multiturn
│   └── belle_multiturn.py
├── comparison_gpt4_data_en.json
├── comparison_gpt4_data_zh.json
├── dataset_info.json
├── example_dataset
│   ├── example_dataset.py
│   └── examples.json
├── hh_rlhf_en
│   └── hh_rlhf_en.py
├── lima.json
├── oaast_rm.json
├── oaast_rm_zh.json
├── oaast_sft.json
├── oaast_sft_zh.json
├── README.md
├── README_zh.md
├── self_cognition.json
├── sharegpt_zh_27k.json
├── ultra_chat
│   └── ultra_chat.py
└── wiki_demo.txt
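Before training, it can help to confirm that the dataset name passed to --dataset is actually registered in dataset_info.json. The sketch below is a minimal check, assuming the file maps dataset names to entries as top-level keys; `dataset_registered` is a hypothetical helper, not part of LLaMA-Factory, and it does a coarse text match rather than real JSON parsing.

```shell
# Hypothetical helper: succeeds if "<name>": appears in dataset_info.json.
# This is a plain text match, so it cannot tell top-level keys from nested ones.
dataset_registered() {
    name=$1
    info_file=${2:-data/dataset_info.json}
    grep -q "\"$name\"[[:space:]]*:" "$info_file"
}

# Example (the dataset name here is an assumption):
#   dataset_registered alpaca_gpt4_zh && echo "registered"
```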

Configure the two-node communication environment.

Install pdsh.

wget https://github.com/chaos/pdsh/archive/refs/tags/pdsh-2.29.tar.gz

tar -zxvf pdsh-2.29.tar.gz
cd pdsh-pdsh-2.29
./configure --with-ssh --with-rsh --with-mrsh --with-mqshell --with-qshell --with-dshgroups --with-machines=/etc/pdsh/machines --without-pam

make
make install

On an ARM environment, append the --host=arm-linux --build=arm-linux parameters to the ./configure command above.
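The ARM-specific parameters can be selected automatically from the machine type. The following is a minimal sketch; `arm_extra_flags` is a hypothetical helper, and matching on `aarch64` or `arm*` from `uname -m` is an assumption about how the ARM hosts report themselves.

```shell
# Hypothetical helper: print the extra configure flags needed on ARM,
# or nothing on other architectures.
arm_extra_flags() {
    arch=${1:-$(uname -m)}
    case "$arch" in
        aarch64|arm*) printf '%s\n' "--host=arm-linux --build=arm-linux" ;;
    esac
}

# Example: append the result to the configure command from this guide.
#   ./configure <flags from this guide> $(arm_extra_flags)
```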

After installation, run the pdsh -h command. If the following information is displayed, the installation succeeded.

Usage: pdsh [-options] command ...
-S                return largest of remote command return values
-h                output usage menu and quit
-V                output version information and quit
-q                list the option settings and quit
-b                disable ^C status feature (batch mode)
-d                enable extra debug information from ^C status
-l user           execute remote commands as user
-t seconds        set connect timeout (default is 10 sec)
-u seconds        set command timeout (no default)
-f n              use fanout of n nodes
-w host,host,...  set target node list on command line
-x host,host,...  set node exclusion list on command line
-R name           set rcmd module to name
-M name,...       select one or more misc modules to initialize first
-N                disable hostname: labels on output lines
-L                list info on all loaded modules and exit
-g groupname      target hosts in dsh group "groupname"
-X groupname      exclude hosts in dsh group "groupname"
-a                target all nodes
available rcmd modules: ssh,rsh,exec (default: rsh)
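Note that the last line of this output reports rsh as the default rcmd module, while the rest of this guide sets up SSH. A minimal sketch for confirming that the ssh module was built is shown below; `has_rcmd_module` is a hypothetical helper that parses the help text from standard input.

```shell
# Hypothetical helper: succeeds if the named rcmd module appears in the
# "available rcmd modules:" line of `pdsh -h` output read from stdin.
has_rcmd_module() {
    module=$1
    sed -n 's/^available rcmd modules: *\([^ ]*\).*/\1/p' \
        | tr ',' '\n' \
        | grep -qx "$module"
}

# Example:
#   pdsh -h 2>&1 | has_rcmd_module ssh && echo "ssh module available"
```

If ssh is available but not the default, it can be selected per invocation with `-R ssh` or through the `PDSH_RCMD_TYPE=ssh` environment variable.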

Configure communication between the two nodes.
1. Edit the /etc/hosts file on both servers and add the IP addresses of both servers.
```
vim /etc/hosts
```
  Replace ip1 and ip2 with the actual IP addresses of the two servers.
```
ip1 node1 
ip2 node2
```
  node1 and node2 are aliases for the nodes; change them to suit your environment.
2. Run the following command to generate an SSH key.
```
ssh-keygen -t rsa
```
3. Copy the SSH key to every node, including the local machine.
```
ssh-copy-id root@ip1 
ssh-copy-id root@ip2
```
4. Run the following commands on each node. If no password is requested, the configuration succeeded. Then run exit to log out.
```
ssh root@ip1 
ssh root@ip2
```
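The manual check in step 4 can also be scripted. The sketch below is one possible approach, not part of the original procedure: it relies on the OpenSSH `BatchMode=yes` option, which makes ssh fail instead of prompting for a password. The function names and the overridable `SSH_CMD` variable are hypothetical conveniences.

```shell
# SSH_CMD can be overridden for dry runs; defaults to the real ssh client.
SSH_CMD=${SSH_CMD:-ssh}

# Succeeds only if root@<host> accepts key-based login without a prompt.
check_passwordless() {
    $SSH_CMD -o BatchMode=yes -o ConnectTimeout=5 "root@$1" true
}

# Check every node given as an argument; stop at the first failure.
check_all_nodes() {
    for node in "$@"; do
        if check_passwordless "$node"; then
            echo "$node: ok"
        else
            echo "$node: FAILED"
            return 1
        fi
    done
}

# Example (run on each server):
#   check_all_nodes node1 node2
```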

Start training.

Open ModelZoo-Pytorch and copy the run_baichuan_sft_2m.sh, ds_config_zero2.json, and hostfile files from that directory into the LLaMA-Factory repository directory.
```
cp ../run_baichuan_sft_2m.sh .
cp ../ds_config_zero2.json .
cp ../hostfile .
```
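The contents of hostfile are not shown in this guide. DeepSpeed hostfiles list one node per line in the form `<hostname> slots=<device count>`, so the copied file will likely need to match the node names configured in /etc/hosts. The sketch below generates such a file; `make_hostfile` is a hypothetical helper, and 8 devices per node is an assumption derived from 16 devices across two nodes.

```shell
# Hypothetical helper: print a DeepSpeed-style hostfile,
# one "<host> slots=<n>" line per node.
make_hostfile() {
    slots=$1
    shift
    for node in "$@"; do
        printf '%s slots=%s\n' "$node" "$slots"
    done
}

# Example (assumes 8 devices per node and the aliases node1/node2):
#   make_hostfile 8 node1 node2 > hostfile
```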

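Launch the script.

This model is fine-tuned on two nodes with 16 devices in total. Run the following command to start training. Table 1 describes some of the parameters.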

sh run_baichuan_sft_2m.sh

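Table 1 Parameter description
Parameter	Description
--deepspeed	Enables DeepSpeed distributed training; takes the path to a DeepSpeed configuration file such as ds_config_zero2.json.
--dataset	Training dataset to use.
--finetuning_type	Fine-tuning type.
--output_dir	Output directory.
--per_device_train_batch_size	Training batch size per device.
--gradient_accumulation_steps	Number of gradient accumulation steps.
--learning_rate	Learning rate.
--num_train_epochs	Number of training epochs.
--fp16	Train with fp16 floating-point precision.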
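Three of these settings interact: with data-parallel training, the number of samples consumed per optimizer step is the per-device batch size multiplied by the gradient accumulation steps and the number of devices. The sketch below works through that arithmetic; the values 4 and 8 are illustrative assumptions, not the values used in run_baichuan_sft_2m.sh.

```shell
# Effective (global) batch size per optimizer step:
#   per_device_train_batch_size * gradient_accumulation_steps * device count
global_batch_size() {
    echo $(( $1 * $2 * $3 ))
}

# Example with assumed values: batch size 4, accumulation 8, 16 devices.
#   global_batch_size 4 8 16
```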

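For two-node training to succeed, the environment and paths must be identical on both machines, including the project path, conda environment, CANN, and drivers. When training finishes, the weight files are saved under the path specified by the --output_dir parameter, and training information is printed.
NPUs do not support the tf32 data type, so do not set --tf32.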
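Since pdsh is already installed, it can be used to compare a setting across both machines. The sketch below is one way to do this, under the assumption that pdsh output uses the usual `host: value` line format; `nodes_agree` is a hypothetical helper that succeeds only when every node reports the same value.

```shell
# Hypothetical helper: read pdsh output ("host: value" lines) from stdin
# and succeed only if all nodes reported exactly one distinct value.
nodes_agree() {
    distinct=$(sed 's/^[^:]*: //' | sort -u | wc -l | tr -d ' ')
    [ "$distinct" -eq 1 ]
}

# Example (node names are the aliases from /etc/hosts):
#   pdsh -R ssh -w node1,node2 'python -c "import torch; print(torch.__version__)"' \
#       | nodes_agree || echo "PyTorch versions differ between nodes"
```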

The training results are shown in the following table.

Table 2 Training results
Device	Torch_Version	total epochs	train loss	train samples per second	train steps per second
16p-NPUs	2.0.1	10.0	0.903	11.378	0.022
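As a rough sanity check on these figures, dividing samples per second by steps per second estimates the number of samples processed per optimizer step. The sketch below does that division with awk; because the reported steps-per-second value is rounded to three decimals, the result is only approximate. `samples_per_step` is a hypothetical helper.

```shell
# Estimate samples per optimizer step from the two throughput figures.
samples_per_step() {
    awk -v sps="$1" -v stps="$2" 'BEGIN { printf "%.1f\n", sps / stps }'
}

# With the values from Table 2:
#   samples_per_step 11.378 0.022    # prints 517.2
```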
