transformers pretrained models

Author|huggingface
Translator|VK
Source|GitHub

This is the complete list of the pretrained models currently provided, with a short description of each model.

For a list that also includes community-uploaded models, see https://huggingface.co/models
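
As a minimal sketch of how the names in the table are typically used: each name is the identifier passed to `from_pretrained` in the `transformers` library. The helper name `load_pretrained` below is hypothetical and exists only for illustration. It assumes that `transformers` and a backend such as PyTorch are installed and that the weights can be downloaded or are already cached.

```python
def load_pretrained(name="bert-base-uncased"):
    """Return (tokenizer, model) for one of the names listed in the table.

    `load_pretrained` is a hypothetical helper, not part of the library.
    """
    # Imported lazily so that defining this helper does not itself require
    # the `transformers` package to be installed.
    from transformers import AutoModel, AutoTokenizer

    # Both calls fetch the files on first use and read from the local
    # cache afterwards.
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    return tokenizer, model
```

After `tokenizer, model = load_pretrained("bert-base-uncased")`, the values `model.config.num_hidden_layers` and `model.config.hidden_size` can be compared with the layer and hidden-unit counts given for that model in the table.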

Architecture	Name	Model details
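BERT	bert-base-uncased	12 layers, 768 hidden units, 12 heads, 110M parameters. Trained on lower-cased English text.
	bert-large-uncased	24 layers, 1024 hidden units, 16 heads, 340M parameters. Trained on lower-cased English text.
	bert-base-cased	12 layers, 768 hidden units, 12 heads, 110M parameters. Trained on cased English text.
	bert-large-cased	24 layers, 1024 hidden units, 16 heads, 340M parameters. Trained on cased English text.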
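	bert-base-multilingual-uncased	(Original, not recommended) 12 layers, 768 hidden units, 12 heads, 110M parameters. Trained on lower-cased text in the top 102 languages of Wikipedia (see details: https://github.com/google-research/bert/blob/master/multilingual.md).
	bert-base-multilingual-cased	(New, recommended) 12 layers, 768 hidden units, 12 heads, 110M parameters. Trained on cased text in the top 104 languages of Wikipedia (see details: https://github.com/google-research/bert/blob/master/multilingual.md).
	bert-base-chinese	12 layers, 768 hidden units, 12 heads, 110M parameters. Trained on Simplified and Traditional Chinese text.
	bert-base-german-cased	12 layers, 768 hidden units, 12 heads, 110M parameters. Trained on cased German text by Deepset.ai (see details: https://deepset.ai/german-bert).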
	bert-large-uncased-whole-word-masking	24 layers, 1024 hidden units, 16 heads, 340M parameters. Trained on lower-cased English text using Whole-Word-Masking (see details: https://github.com/google-research/bert/#bert).
	bert-large-cased-whole-word-masking	24 layers, 1024 hidden units, 16 heads, 340M parameters. Trained on cased English text using Whole-Word-Masking (see details: https://github.com/google-research/bert/#bert).
	bert-large-uncased-whole-word-masking-finetuned-squad	24 layers, 1024 hidden units, 16 heads, 340M parameters. The bert-large-uncased-whole-word-masking model fine-tuned on SQuAD (see details: https://github.com/huggingface/transformers/tree/master/examples).
	bert-large-cased-whole-word-masking-finetuned-squad	24 layers, 1024 hidden units, 16 heads, 340M parameters. The bert-large-cased-whole-word-masking model fine-tuned on SQuAD (see details: https://github.com/huggingface/transformers/tree/master/examples).
	bert-base-cased-finetuned-mrpc	12 layers, 768 hidden units, 12 heads, 110M parameters. The bert-base-cased model fine-tuned on MRPC (see details: https://huggingface.co/transformers/examples.html).
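	bert-base-german-dbmdz-cased	12 layers, 768 hidden units, 12 heads, 110M parameters. Trained on cased German text by DBMDZ (see details: https://github.com/dbmdz/berts).
	bert-base-german-dbmdz-uncased	12 layers, 768 hidden units, 12 heads, 110M parameters. Trained on lower-cased German text by DBMDZ (see details: https://github.com/dbmdz/berts).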
	bert-base-japanese	12 layers, 768 hidden units, 12 heads, 110M parameters. Japanese model; text is tokenized with MeCab and WordPiece (see details: https://github.com/cl-tohoku/bert-japanese).
	bert-base-japanese-whole-word-masking	12 layers, 768 hidden units, 12 heads, 110M parameters. Trained on Japanese text using Whole-Word-Masking; text is tokenized with MeCab and WordPiece (see details: https://github.com/cl-tohoku/bert-japanese).
	bert-base-japanese-char	12 layers, 768 hidden units, 12 heads, 110M parameters. Japanese model trained at the character level (see details: https://github.com/cl-tohoku/bert-japanese).
	bert-base-japanese-char-whole-word-masking	12 layers, 768 hidden units, 12 heads, 110M parameters. Japanese model trained at the character level using Whole-Word-Masking (see details: https://github.com/cl-tohoku/bert-japanese).
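	bert-base-finnish-cased-v1	12 layers, 768 hidden units, 12 heads, 110M parameters. Trained on cased Finnish text (see details: turkunlp.org).
	bert-base-finnish-uncased-v1	12 layers, 768 hidden units, 12 heads, 110M parameters. Trained on lower-cased Finnish text (see details: turkunlp.org).
	bert-base-dutch-cased	12 layers, 768 hidden units, 12 heads, 110M parameters. Trained on cased Dutch text (see details: https://github.com/wietsedv/bertje).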
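GPT	openai-gpt	12 layers, 768 hidden units, 12 heads, 110M parameters. OpenAI GPT English model.
GPT-2	gpt2	12 layers, 768 hidden units, 12 heads, 117M parameters. OpenAI GPT-2 English model.
	gpt2-medium	24 layers, 1024 hidden units, 16 heads, 345M parameters. OpenAI GPT-2 English model.
	gpt2-large	36 layers, 1280 hidden units, 20 heads, 774M parameters. OpenAI GPT-2 English model.
	gpt2-xl	48 layers, 1600 hidden units, 25 heads, 1558M parameters. OpenAI GPT-2 English model.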
Transformer-XL	transfo-xl-wt103	18 layers, 1024 hidden units, 16 heads, 257M parameters. English model trained on WikiText-103.
XLNet	xlnet-base-cased	12 layers, 768 hidden units, 12 heads, 110M parameters. XLNet English model.
	xlnet-large-cased	24 layers, 1024 hidden units, 16 heads, 340M parameters. XLNet large English model.
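XLM	xlm-mlm-en-2048	12 layers, 2048 hidden units, 16 heads. XLM English model.
	xlm-mlm-ende-1024	6 layers, 1024 hidden units, 8 heads. XLM English-German model trained on English and German Wikipedia text.
	xlm-mlm-enfr-1024	6 layers, 1024 hidden units, 8 heads. XLM English-French model trained on English and French Wikipedia text.
	xlm-mlm-enro-1024	6 layers, 1024 hidden units, 8 heads. XLM English-Romanian multilingual model.
	xlm-mlm-xnli15-1024	12 layers, 1024 hidden units, 8 heads. XLM model pretrained with MLM (masked language modeling) on the 15 XNLI languages.
	xlm-mlm-tlm-xnli15-1024	12 layers, 1024 hidden units, 8 heads. XLM model pretrained with MLM + TLM (translation language modeling) on the 15 XNLI languages.
	xlm-clm-enfr-1024	6 layers, 1024 hidden units, 8 heads. XLM English-French model trained with CLM (causal language modeling) on English and French Wikipedia text.
	xlm-clm-ende-1024	6 layers, 1024 hidden units, 8 heads. XLM English-German model trained with CLM (causal language modeling) on English and German Wikipedia text.
	xlm-mlm-17-1280	16 layers, 1280 hidden units, 16 heads. XLM model trained with MLM on 17 languages.
	xlm-mlm-100-1280	16 layers, 1280 hidden units, 16 heads. XLM model trained with MLM on 100 languages.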
RoBERTa	roberta-base	12 layers, 768 hidden units, 12 heads, 125M parameters. RoBERTa using the BERT-base architecture (see details: https://github.com/pytorch/fairseq/tree/master/examples/roberta).
	roberta-large	24 layers, 1024 hidden units, 16 heads, 335M parameters. RoBERTa using the BERT-large architecture (see details: https://github.com/pytorch/fairseq/tree/master/examples/roberta).
	roberta-large-mnli	24 layers, 1024 hidden units, 16 heads, 335M parameters. roberta-large fine-tuned on MNLI (see details: https://github.com/pytorch/fairseq/tree/master/examples/roberta).
	distilroberta-base	6 layers, 768 hidden units, 12 heads, 82M parameters. Distilled from the roberta-base model (see details: https://github.com/huggingface/transformers/tree/master/examples/distillation).
	roberta-base-openai-detector	12 layers, 768 hidden units, 12 heads, 125M parameters. roberta-base fine-tuned by OpenAI on the outputs of the 1.5B-parameter GPT-2 model (see details: https://github.com/openai/gpt-2-output-dataset/tree/master/detector).
	roberta-large-openai-detector	24 layers, 1024 hidden units, 16 heads, 335M parameters. roberta-large fine-tuned by OpenAI on the outputs of the 1.5B-parameter GPT-2 model (see details: https://github.com/openai/gpt-2-output-dataset/tree/master/detector).
DistilBERT	distilbert-base-uncased	6 layers, 768 hidden units, 12 heads, 66M parameters. Distilled from the bert-base-uncased model (see details: https://github.com/huggingface/transformers/tree/master/examples/distillation).
	distilbert-base-uncased-distilled-squad	6 layers, 768 hidden units, 12 heads, 66M parameters. Distilled from the bert-base-uncased model, with an additional linear layer (see details: https://github.com/huggingface/transformers/tree/master/examples/distillation).
	distilbert-base-cased	6 layers, 768 hidden units, 12 heads, 65M parameters. Distilled from the bert-base-cased model (see details: https://github.com/huggingface/transformers/tree/master/examples/distillation).
	distilbert-base-cased-distilled-squad	6 layers, 768 hidden units, 12 heads, 65M parameters. Distilled from the bert-base-cased model, with an additional question-answering layer (see details: https://github.com/huggingface/transformers/tree/master/examples/distillation).
	distilgpt2	6 layers, 768 hidden units, 12 heads, 82M parameters. Distilled from the gpt2 model (see details: https://github.com/huggingface/transformers/tree/master/examples/distillation).
	distilbert-base-german-cased	6 layers, 768 hidden units, 12 heads, 66M parameters. Distilled from the bert-base-german-dbmdz-cased model (see details: https://github.com/huggingface/transformers/tree/master/examples/distillation).
	distilbert-base-multilingual-cased	6 layers, 768 hidden units, 12 heads, 134M parameters. Distilled from the bert-base-multilingual-cased model (see details: https://github.com/huggingface/transformers/tree/master/examples/distillation).
CTRL	ctrl	48 layers, 1280 hidden units, 16 heads, 1.6B parameters. Salesforce's large CTRL English model.
CamemBERT	camembert-base	12 layers, 768 hidden units, 12 heads, 110M parameters. CamemBERT using the BERT-base architecture (see details: https://github.com/pytorch/fairseq/tree/master/examples/camembert).
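ALBERT	albert-base-v1	12 repeating layers, embedding dimension 128, 768 hidden units, 12 heads, 11M parameters. ALBERT base model (see details: https://github.com/google-research/ALBERT).
	albert-large-v1	24 repeating layers, embedding dimension 128, 1024 hidden units, 16 heads, 17M parameters. ALBERT large model (see details: https://github.com/google-research/ALBERT).
	albert-xlarge-v1	24 repeating layers, embedding dimension 128, 2048 hidden units, 16 heads, 58M parameters. ALBERT xlarge model (see details: https://github.com/google-research/ALBERT).
	albert-xxlarge-v1	12 repeating layers, embedding dimension 128, 4096 hidden units, 64 heads, 223M parameters. ALBERT xxlarge model (see details: https://github.com/google-research/ALBERT).
	albert-base-v2	12 repeating layers, embedding dimension 128, 768 hidden units, 12 heads, 11M parameters. ALBERT base model with no dropout, additional training data and longer training (see details: https://github.com/google-research/ALBERT).
	albert-large-v2	24 repeating layers, embedding dimension 128, 1024 hidden units, 16 heads, 17M parameters. ALBERT large model with no dropout, additional training data and longer training (see details: https://github.com/google-research/ALBERT).
	albert-xlarge-v2	24 repeating layers, embedding dimension 128, 2048 hidden units, 16 heads, 58M parameters. ALBERT xlarge model with no dropout, additional training data and longer training (see details: https://github.com/google-research/ALBERT).
	albert-xxlarge-v2	12 repeating layers, embedding dimension 128, 4096 hidden units, 64 heads, 223M parameters. ALBERT xxlarge model with no dropout, additional training data and longer training (see details: https://github.com/google-research/ALBERT).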
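T5	t5-small	6 layers, 512 hidden units, 2048 feed-forward hidden units, 8 heads, 60M parameters. Trained on the English text of the Colossal Clean Crawled Corpus (C4).
	t5-base	12 layers, 768 hidden units, 3072 feed-forward hidden units, 12 heads, 220M parameters. Trained on the English text of the Colossal Clean Crawled Corpus (C4).
	t5-large	24 layers, 1024 hidden units, 4096 feed-forward hidden units, 16 heads, 770M parameters. Trained on the English text of the Colossal Clean Crawled Corpus (C4).
	t5-3B	24 layers, 1024 hidden units, 16384 feed-forward hidden units, 32 heads, 2.8B parameters. Trained on the English text of the Colossal Clean Crawled Corpus (C4).
	t5-11B	24 layers, 1024 hidden units, 65536 feed-forward hidden units, 128 heads, 11B parameters. Trained on the English text of the Colossal Clean Crawled Corpus (C4).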
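XLM-RoBERTa	xlm-roberta-base	12 layers, 768 hidden units, 3072 feed-forward hidden units, 8 heads, 125M parameters. Trained on 2.5 TB of newly created CommonCrawl data in 100 languages.
	xlm-roberta-large	24 layers, 1024 hidden units, 4096 feed-forward hidden units, 16 heads, 355M parameters. Trained on 2.5 TB of newly created CommonCrawl data in 100 languages.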
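FlauBERT	flaubert-small-cased	6 layers, 512 hidden units, 512 feed-forward hidden units, 8 heads, 54M parameters. FlauBERT small architecture (see details: https://github.com/getalp/Flaubert).
	flaubert-base-uncased	12 layers, 768 hidden units, 12 heads, 137M parameters. FlauBERT base architecture, trained on uncased text (see details: https://github.com/getalp/Flaubert).
	flaubert-base-cased	12 layers, 768 hidden units, 12 heads, 138M parameters. FlauBERT base architecture, trained on cased text (see details: https://github.com/getalp/Flaubert).
	flaubert-large-cased	24 layers, 1024 hidden units, 16 heads, 373M parameters. FlauBERT large architecture (see details: ).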
Bart	bart-large	12 layers, 1024 hidden units, 16 heads, 406M parameters (see details: https://github.com/pytorch/fairseq/tree/master/examples/bart).
	bart-large-mnli	Adds a 2-layer classification head with 1 million parameters. The bart-large architecture with a classification head.
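
The parameter counts in the table follow largely from the layer count and hidden size. As a rough sketch, the function below counts the weights of a BERT-style encoder. The vocabulary size (30522, the bert-base-uncased WordPiece vocabulary), the 512 positions, the 2 segment types, the feed-forward width of 4 times the hidden size, and the final pooler layer are assumptions that do not appear in the table, and `bert_param_count` is a hypothetical name used only for illustration.

```python
def bert_param_count(layers, hidden, vocab_size=30522, max_positions=512,
                     type_vocab_size=2, ffn_mult=4):
    """Count weights and biases of a BERT-style encoder with a pooler.

    All default values are assumptions, not figures taken from the table.
    """
    layer_norm = 2 * hidden  # one scale and one bias vector
    # Token, position and segment embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab_size) * hidden + layer_norm
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two linear layers: hidden -> inner and inner -> hidden.
    inner = ffn_mult * hidden
    feed_forward = (hidden * inner + inner) + (inner * hidden + hidden)
    # Each encoder layer has one LayerNorm after attention and one after the FFN.
    per_layer = attention + layer_norm + feed_forward + layer_norm
    # Dense layer applied to the first token's final hidden state.
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler


base = bert_param_count(layers=12, hidden=768)
large = bert_param_count(layers=24, hidden=1024)
print(f"base:  {base / 1e6:.1f}M")
print(f"large: {large / 1e6:.1f}M")
```

Under these assumptions the count is about 109.5M for the base configuration and about 335.1M for the large one, close to the 110M and 340M listed above. The number of heads does not enter the count, because the heads only split the hidden dimension into equal slices.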