The datasets required for pre-training are already provided under the "./data" path of the LLaMA-Factory repository.
```
data/
├── alpaca_data_en_52k.json
├── alpaca_data_zh_51k.json
├── alpaca_gpt4_data_en.json
├── alpaca_gpt4_data_zh.json
├── belle_multiturn
│   └── belle_multiturn.py
├── comparison_gpt4_data_en.json
├── comparison_gpt4_data_zh.json
├── dataset_info.json
├── example_dataset
│   ├── example_dataset.py
│   └── examples.json
├── hh_rlhf_en
│   └── hh_rlhf_en.py
├── lima.json
├── oaast_rm.json
├── oaast_rm_zh.json
├── oaast_sft.json
├── oaast_sft_zh.json
├── README.md
├── README_zh.md
├── self_cognition.json
├── sharegpt_zh_27k.json
├── ultra_chat
│   └── ultra_chat.py
└── wiki_demo.txt
```
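Dataset names used for training are registered in `data/dataset_info.json`. As an optional sanity check (run from the repository root; the grep pattern below is only an example), you can confirm that the entry you intend to use exists:

```shell
# Pretty-print the beginning of the dataset registry, then look for a specific entry;
# "alpaca_gpt4" is an example pattern -- replace it with the dataset you plan to use.
python -m json.tool data/dataset_info.json | head -n 20
grep -n "alpaca_gpt4" data/dataset_info.json
```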
```shell
wget https://github.com/chaos/pdsh/archive/refs/tags/pdsh-2.29.tar.gz
tar -zxvf pdsh-2.29.tar.gz
cd pdsh-pdsh-2.29
./configure --with-ssh --with-rsh --with-mrsh --with-mqshell --with-qshell --with-dshgroups --with-machines=/etc/pdsh/machines --without-pam
make
make install
```
On an Arm environment, when running the `./configure --with-ssh --with-rsh --with-mrsh --with-mqshell --with-qshell --with-dshgroups --with-machines=/etc/pdsh/machines --without-pam` command, you also need to add the `--host=arm-linux --build=arm-linux` options.
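For reference, the complete configure command on Arm then becomes:

```shell
# Same flags as above, plus the host/build triplets required on Arm
./configure --with-ssh --with-rsh --with-mrsh --with-mqshell --with-qshell --with-dshgroups --with-machines=/etc/pdsh/machines --without-pam --host=arm-linux --build=arm-linux
```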
```
Usage: pdsh [-options] command ...
-S                return largest of remote command return values
-h                output usage menu and quit
-V                output version information and quit
-q                list the option settings and quit
-b                disable ^C status feature (batch mode)
-d                enable extra debug information from ^C status
-l user           execute remote commands as user
-t seconds        set connect timeout (default is 10 sec)
-u seconds        set command timeout (no default)
-f n              use fanout of n nodes
-w host,host,...  set target node list on command line
-x host,host,...  set node exclusion list on command line
-R name           set rcmd module to name
-M name,...       select one or more misc modules to initialize first
-N                disable hostname: labels on output lines
-L                list info on all loaded modules and exit
-g groupname      target hosts in dsh group "groupname"
-X groupname      exclude hosts in dsh group "groupname"
-a                target all nodes
available rcmd modules: ssh,rsh,exec (default: rsh)
```
vim /etc/hosts
```
ip1 node1
ip2 node2
```
node1 and node2 are the aliases assigned to the nodes; modify them according to your actual environment.
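A quick way to verify that the new entries resolve correctly is a name lookup, for example:

```shell
# Both aliases should resolve to the IP addresses configured above
getent hosts node1 node2
```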
ssh-keygen -t rsa
```shell
ssh-copy-id root@ip1
ssh-copy-id root@ip2
```
```shell
ssh root@ip1
ssh root@ip2
```
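Once passwordless login succeeds, a quick sanity check is to run a trivial command on both nodes through pdsh. The `-R ssh` option selects the ssh rcmd module (the usage output above shows rsh as the default), and `-l root` matches the account used for key distribution:

```shell
# Should print each node's hostname without prompting for a password
pdsh -R ssh -l root -w node1,node2 hostname
```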
```shell
cp ../run_baichuan_sft_2m.sh .
cp ../ds_config_zero2.json .
cp ../hostfile .
```
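The hostfile tells the DeepSpeed launcher which nodes to use and how many devices (slots) each one provides. A minimal sketch for the two-node setup above, assuming 8 NPUs per node (16 devices in total), might look like the following; if the shipped hostfile already exists, adjust it rather than overwrite it:

```shell
# Hypothetical hostfile for a 2-node x 8-device setup; adjust the slot counts
# to the actual number of devices available on each node.
cat > hostfile << 'EOF'
node1 slots=8
node2 slots=8
EOF
```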
| Parameter | Description |
|---|---|
| --deepspeed | Use the DeepSpeed distributed training framework. |
| --dataset | Specify the training dataset. |
| --finetuning_type | Specify the fine-tuning type. |
| --output_dir | Specify the output directory. |
| --per_device_train_batch_size | Training batch size per device. |
| --gradient_accumulation_steps | Number of gradient accumulation steps. |
| --learning_rate | Learning rate. |
| --num_train_epochs | Number of training epochs. |
| --fp16 | Train with fp16 floating-point precision. |
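For orientation, the sketch below shows how these options typically come together in a launch script such as run_baichuan_sft_2m.sh. It is illustrative only: the entry point, model path, dataset name, and numeric values are placeholders, and the script shipped with the repository should be treated as the reference.

```shell
# Illustrative sketch only -- entry point, model path, dataset name,
# and all numeric values are placeholders, not the shipped configuration.
deepspeed --hostfile hostfile src/train_bash.py \
    --deepspeed ds_config_zero2.json \
    --model_name_or_path /path/to/Baichuan-7B \
    --dataset alpaca_gpt4_zh \
    --finetuning_type full \
    --output_dir ./output_baichuan_sft \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 5e-5 \
    --num_train_epochs 10.0 \
    --fp16
```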
| Device | Torch_Version | total epochs | train loss | train samples per second | train steps per second |
|---|---|---|---|---|---|
| 16p-NPUs | 2.0.1 | 10.0 | 0.903 | 11.378 | 0.022 |