数据处理

使用方式

支持的本地数据格式：json、jsonl、parquet、txt、csv。

运行脚本tools/preprocess_data.py。

以下例子进行说明

huggingface上有完整数据：（比如THUCNews）

获取数据集：

wget https://huggingface.co/datasets/spiritx2023/ThuCnews/resolve/main/cnews.train.txt

运行脚本：

python tools/preprocess_data.py \
    --input /your/path/to/cnews.train.txt \
    --output-prefix thucnews \
    --dataset-impl mmap \
    --tokenizer-type GPT2BPETokenizer \
    --merge-file /your/path/to/gpt2-merges.txt \
    --vocab /your/path/to/gpt2-vocab.json \
    --append-eod \
    --workers 2

huggingface仅提供数据解析脚本：（比如wiki）

huggingface上没有wiki的完整数据，只有一个运行脚本，该运行脚本会根据用户传入的参数在线下载，或读取本地路径的数据。此时需要配置参数：

hf_config_json="./hf_ds_json.json"
cat <<EOT > $hf_config_json
{
    "path": "wikipedia",
    "name": "20220301.en"
}
EOT

运行脚本：

python tools/preprocess_data.py \
    --input /home/to/data/wikipedia \
    --output-prefix wikipedia \
    --hf-datasets-params ${hf_config_json} \
    --dataset-impl mmap \
    --tokenizer-type GPT2BPETokenizer \
    --merge-file /your/path/to/gpt2-merges.txt \
    --vocab /your/path/to/gpt2-vocab.json \
    --append-eod \
    --workers 2

微调数据制作：（比如alpaca）

获取数据集：

wget https://huggingface.co/datasets/c-s-ale/alpaca-gpt4-data-zh/tree/main

运行脚本：

python tools/preprocess_data.py \
    --input /your/path/to/alpaca_cn/data \
    --handler-name GeneralInstructionHandler \
    --output-prefix alpaca_cn \
    --dataset-impl mmap \
    --tokenizer-type PretrainedFromHF \
    --tokenizer-name-or-path your/path/to/llama_model \
    --tokenizer-not-use-fast \
    --append-eod \
    --workers 8

主要参数说明：

--input：文件路径，如/your/path/to/data.json或者数据文件夹/your/path/to/data。
--handler-name：数据处理类，默认使用：GeneralPretrainHandler（适用于处理预训练数据）, 其余可选GeneralInstructionHandler（适用于处理alpaca等指令微调数据集）、BelleMultiTurnInstructionHandler（用于处理BelleMultiTurn微调数据集）。
--json-keys：预训练数据的要处理的数据key值。
--split-sentences：是否对文档进行句子切分，默认不处理。
--tokenizer-type：tokenizer的类型，可选 BERT、GPT2或者PretrainedFromHF加载huggingface预训练tokenizer。
--tokenizer-name-or-path：tokenizer路径，当指定了从huggingface加载tokenizer时生效。
--append-eod：是否在文档末尾添加eod token符号。
--vocab-file：词表文件，不使用huggingface预训练tokenizer时需指定。
--merge-file：使用GPT2 tokenizer时需指定。
--dataset-impl：数据序列化方式，一般采用mmap。

概述

针对huggingface公开数据集，进行的数据预处理流程如下：

数据加载：支持本地加载数据集或者从huggingface上进行数据的加载。raw_datasets的加载统一使用如下的格式进行：

raw_datasets = load_dataset(
    args.input,
    split=split_flag,
    num_proc=None if args.streaming else args.workers,
    cache_dir=cache_dir,
    streaming=args.streaming
)

如果是从本地上进行数据的加载，还需要判断数据的格式（支持的是json还是txt）并且对数据进行相应的filter处理，之后再进行load_dataset生成raw_dataset：

data_files = [args.input] if os.path.isfile(args.input) else \
                glob.glob(os.path.join(args.input, '*'))
ext, data_format = _get_data_format(data_files)
filtered_data_files = list(filter(lambda x: x.split('.')[-1] == ext, data_files))

Prompt数据处理：不同类型的数据集需要进行不同的处理，比如Prompt数据，需要在文本内容中增加一些指令和标签。代码发布的时候，会注册常用数据的处理方法，同时也支持注册自定义处理方法。针对Prompt的制作也会提供相应的模板，以下以alpaca数据集进行举例：

class AlpacaTemplate:
    system_token = ""
    user_token = "### Instruction:"
    assistant_token = "### Response:"
    end_token = ""
    system = "Below is an instruction that describes a task, paired with an input that provides further context. "
    "Write a response that appropriately completes the request. "
    "Please note that you need to think through your response logically and step by step."

代码实现主要体现在以下过程：

def generate_training_prompt(self, messages) -> str:
    prompt = self.template.system_token + "\n" + self.template.system + self.template.end_token + "\n"
    for message in messages:
        if message["role"] == self.user_role:
            prompt += self.template.user_token + "\n" + message["content"] + self.template.end_token + "\n"
        else:
            prompt += self.template.assistant_token + "\n" + message["content"] \
            + self.template.end_token + "\n"
        
    return prompt

最终呈现的prompt的形式是：instruction+content+end_token的结构。

tokenizer id化：将处理后的文本id化，用于输入模型。id化中使用的tokenizer也会进行更新，兼容huggingface模型库上的tokenizer，与开源保持一致。

def get_tokenized_data(self):
    """get tokenized(and prompted) data"""
    columns = next(iter(self.raw_datasets)).keys()
    remove_columns = list(set(columns) - set(self.args.json_keys))
    proc_kwargs = {} if self.args.streaming else {"num_proc": self.args.workers}
    return self.raw_datasets.map(self._filter, remove_columns=remove_columns, **proc_kwargs)

数据dump：将数据落盘，预训练或者微调可以直接加载数据。需要考虑序列化后数据大小和数据加载的速度不产生明显劣化，特别是数据加载过程。

父主题： 解决方案