Transformers Pretrained Models
Author | huggingface
Compiled by | VK
Source | GitHub
Below is the full list of currently provided pretrained models, together with a short description of each one.
For a list that also includes community-uploaded models, see https://huggingface.co/models
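Before the table, here is a minimal sketch of how these shortcut names are used, assuming `transformers` is installed with a PyTorch backend; the checkpoint `bert-base-uncased` is just one example entry from the table and can be swapped for any other name listed below:

```python
from transformers import AutoModel, AutoTokenizer

# Any shortcut name from the table can be passed to from_pretrained();
# the weights and vocabulary are downloaded and cached on first use.
model_name = "bert-base-uncased"  # example entry from the table below

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode a sentence and run it through the model to get the hidden states.
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model(**inputs)
print(outputs[0].shape)  # (batch_size, sequence_length, hidden_size)
```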
| Architecture | Shortcut name | Details of the model |
|---|---|---|
| BERT | bert-base-uncased | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on lower-cased English text. |
| | bert-large-uncased | 24-layer, 1024-hidden, 16-heads, 340M parameters. Trained on lower-cased English text. |
| | bert-base-cased | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased English text. |
| | bert-large-cased | 24-layer, 1024-hidden, 16-heads, 340M parameters. Trained on cased English text. |
| | bert-base-multilingual-uncased | (Original, not recommended) 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on lower-cased text in the top 102 languages of Wikipedia (see details: https://github.com/google-research/bert/blob/master/multilingual.md). |
| | bert-base-multilingual-cased | (New, recommended) 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased text in the top 104 languages of Wikipedia (see details: https://github.com/google-research/bert/blob/master/multilingual.md). |
| | bert-base-chinese | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on Simplified and Traditional Chinese text. |
| | bert-base-german-cased | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased German text by Deepset.ai (see details: https://deepset.ai/german-bert). |
| | bert-large-uncased-whole-word-masking | 24-layer, 1024-hidden, 16-heads, 340M parameters. Trained on lower-cased English text using Whole-Word-Masking (see details: https://github.com/google-research/bert/#bert). |
| | bert-large-cased-whole-word-masking | 24-layer, 1024-hidden, 16-heads, 340M parameters. Trained on cased English text using Whole-Word-Masking (see details: https://github.com/google-research/bert/#bert). |
| | bert-large-uncased-whole-word-masking-finetuned-squad | 24-layer, 1024-hidden, 16-heads, 340M parameters. The bert-large-uncased-whole-word-masking model fine-tuned on SQuAD (see details: https://github.com/huggingface/transformers/tree/master/examples). |
| | bert-large-cased-whole-word-masking-finetuned-squad | 24-layer, 1024-hidden, 16-heads, 340M parameters. The bert-large-cased-whole-word-masking model fine-tuned on SQuAD (see details: https://github.com/huggingface/transformers/tree/master/examples). |
| | bert-base-cased-finetuned-mrpc | 12-layer, 768-hidden, 12-heads, 110M parameters. The bert-base-cased model fine-tuned on MRPC (see details: https://huggingface.co/transformers/examples.html). |
| | bert-base-german-dbmdz-cased | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased German text by DBMDZ (see details: https://github.com/dbmdz/berts). |
| | bert-base-german-dbmdz-uncased | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on uncased German text by DBMDZ (see details: https://github.com/dbmdz/berts). |
| | bert-base-japanese | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on Japanese text; text is tokenized with MeCab and WordPiece (see details: https://github.com/cl-tohoku/bert-japanese). |
| | bert-base-japanese-whole-word-masking | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on Japanese text using Whole-Word-Masking; text is tokenized with MeCab and WordPiece (see details: https://github.com/cl-tohoku/bert-japanese). |
| | bert-base-japanese-char | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on Japanese text, tokenized at the character level (see details: https://github.com/cl-tohoku/bert-japanese). |
| | bert-base-japanese-char-whole-word-masking | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on Japanese text using Whole-Word-Masking, tokenized at the character level (see details: https://github.com/cl-tohoku/bert-japanese). |
| | bert-base-finnish-cased-v1 | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased Finnish text (see details: turkunlp.org). |
| | bert-base-finnish-uncased-v1 | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on uncased Finnish text (see details: turkunlp.org). |
| | bert-base-dutch-cased | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased Dutch text (see details: https://github.com/wietsedv/bertje). |
| GPT | openai-gpt | 12-layer, 768-hidden, 12-heads, 110M parameters. OpenAI GPT English model. |
| GPT-2 | gpt2 | 12-layer, 768-hidden, 12-heads, 117M parameters. OpenAI GPT-2 English model. |
| | gpt2-medium | 24-layer, 1024-hidden, 16-heads, 345M parameters. OpenAI GPT-2 English model. |
| | gpt2-large | 36-layer, 1280-hidden, 20-heads, 774M parameters. OpenAI GPT-2 English model. |
| | gpt2-xl | 48-layer, 1600-hidden, 25-heads, 1558M parameters. OpenAI GPT-2 English model. |
| Transformer-XL | transfo-xl-wt103 | 18-layer, 1024-hidden, 16-heads, 257M parameters. English model trained on WikiText-103. |
| XLNet | xlnet-base-cased | 12-layer, 768-hidden, 12-heads, 110M parameters. XLNet English model. |
| | xlnet-large-cased | 24-layer, 1024-hidden, 16-heads, 340M parameters. XLNet large English model. |
| XLM | xlm-mlm-en-2048 | 12-layer, 2048-hidden, 16-heads. XLM English model. |
| | xlm-mlm-ende-1024 | 6-layer, 1024-hidden, 8-heads. XLM English-German model trained on English and German Wikipedia text. |
| | xlm-mlm-enfr-1024 | 6-layer, 1024-hidden, 8-heads. XLM English-French model trained on English and French Wikipedia text. |
| | xlm-mlm-enro-1024 | 6-layer, 1024-hidden, 8-heads. XLM English-Romanian multi-language model. |
| | xlm-mlm-xnli15-1024 | 12-layer, 1024-hidden, 8-heads. XLM model pre-trained with MLM on the 15 XNLI languages. |
| | xlm-mlm-tlm-xnli15-1024 | 12-layer, 1024-hidden, 8-heads. XLM model pre-trained with MLM + TLM on the 15 XNLI languages. |
| | xlm-clm-enfr-1024 | 6-layer, 1024-hidden, 8-heads. XLM English-French model trained with CLM on English and French Wikipedia text. |
| | xlm-clm-ende-1024 | 6-layer, 1024-hidden, 8-heads. XLM English-German model trained with CLM on English and German Wikipedia text. |
| | xlm-mlm-17-1280 | 16-layer, 1280-hidden, 16-heads. XLM model trained with MLM on 17 languages. |
| | xlm-mlm-100-1280 | 16-layer, 1280-hidden, 16-heads. XLM model trained with MLM on 100 languages. |
| RoBERTa | roberta-base | 12-layer, 768-hidden, 12-heads, 125M parameters. RoBERTa using the BERT-base architecture (see details: https://github.com/pytorch/fairseq/tree/master/examples/roberta). |
| | roberta-large | 24-layer, 1024-hidden, 16-heads, 355M parameters. RoBERTa using the BERT-large architecture (see details: https://github.com/pytorch/fairseq/tree/master/examples/roberta). |
| | roberta-large-mnli | 24-layer, 1024-hidden, 16-heads, 355M parameters. roberta-large fine-tuned on MNLI (see details: https://github.com/pytorch/fairseq/tree/master/examples/roberta). |
| | distilroberta-base | 6-layer, 768-hidden, 12-heads, 82M parameters. The DistilRoBERTa model distilled from roberta-base (see details: https://github.com/huggingface/transformers/tree/master/examples/distillation). |
| | roberta-base-openai-detector | 12-layer, 768-hidden, 12-heads, 125M parameters. roberta-base fine-tuned on the outputs of the 1.5B-parameter OpenAI GPT-2 model (see details: https://github.com/openai/gpt-2-output-dataset/tree/master/detector). |
| | roberta-large-openai-detector | 24-layer, 1024-hidden, 16-heads, 355M parameters. roberta-large fine-tuned on the outputs of the 1.5B-parameter OpenAI GPT-2 model (see details: https://github.com/openai/gpt-2-output-dataset/tree/master/detector). |
| DistilBERT | distilbert-base-uncased | 6-layer, 768-hidden, 12-heads, 66M parameters. The DistilBERT model distilled from bert-base-uncased (see details: https://github.com/huggingface/transformers/tree/master/examples/distillation). |
| | distilbert-base-uncased-distilled-squad | 6-layer, 768-hidden, 12-heads, 66M parameters. Distilled from bert-base-uncased, with an additional linear layer (see details: https://github.com/huggingface/transformers/tree/master/examples/distillation). |
| | distilbert-base-cased | 6-layer, 768-hidden, 12-heads, 65M parameters. The DistilBERT model distilled from bert-base-cased (see details: https://github.com/huggingface/transformers/tree/master/examples/distillation). |
| | distilbert-base-cased-distilled-squad | 6-layer, 768-hidden, 12-heads, 65M parameters. Distilled from bert-base-cased, with an additional question-answering layer (see details: https://github.com/huggingface/transformers/tree/master/examples/distillation). |
| | distilgpt2 | 6-layer, 768-hidden, 12-heads, 82M parameters. The DistilGPT2 model distilled from gpt2 (see details: https://github.com/huggingface/transformers/tree/master/examples/distillation). |
| | distilbert-base-german-cased | 6-layer, 768-hidden, 12-heads, 66M parameters. Distilled from bert-base-german-dbmdz-cased (see details: https://github.com/huggingface/transformers/tree/master/examples/distillation). |
| | distilbert-base-multilingual-cased | 6-layer, 768-hidden, 12-heads, 134M parameters. Distilled from bert-base-multilingual-cased (see details: https://github.com/huggingface/transformers/tree/master/examples/distillation). |
| CTRL | ctrl | 48-layer, 1280-hidden, 16-heads, 1.6B parameters. Salesforce's large-sized CTRL English model. |
| CamemBERT | camembert-base | 12-layer, 768-hidden, 12-heads, 110M parameters. CamemBERT using the BERT-base architecture (see details: https://github.com/pytorch/fairseq/tree/master/examples/camembert). |
| ALBERT | albert-base-v1 | 12 repeating layers, 128 embedding dimension, 768-hidden, 12-heads, 11M parameters. ALBERT base model (see details: https://github.com/google-research/ALBERT). |
| | albert-large-v1 | 24 repeating layers, 128 embedding dimension, 1024-hidden, 16-heads, 17M parameters. ALBERT large model (see details: https://github.com/google-research/ALBERT). |
| | albert-xlarge-v1 | 24 repeating layers, 128 embedding dimension, 2048-hidden, 16-heads, 58M parameters. ALBERT xlarge model (see details: https://github.com/google-research/ALBERT). |
| | albert-xxlarge-v1 | 12 repeating layers, 128 embedding dimension, 4096-hidden, 64-heads, 223M parameters. ALBERT xxlarge model (see details: https://github.com/google-research/ALBERT). |
| | albert-base-v2 | 12 repeating layers, 128 embedding dimension, 768-hidden, 12-heads, 11M parameters. ALBERT base model with no dropout, additional training data, and longer training (see details: https://github.com/google-research/ALBERT). |
| | albert-large-v2 | 24 repeating layers, 128 embedding dimension, 1024-hidden, 16-heads, 17M parameters. ALBERT large model with no dropout, additional training data, and longer training (see details: https://github.com/google-research/ALBERT). |
| | albert-xlarge-v2 | 24 repeating layers, 128 embedding dimension, 2048-hidden, 16-heads, 58M parameters. ALBERT xlarge model with no dropout, additional training data, and longer training (see details: https://github.com/google-research/ALBERT). |
| | albert-xxlarge-v2 | 12 repeating layers, 128 embedding dimension, 4096-hidden, 64-heads, 223M parameters. ALBERT xxlarge model with no dropout, additional training data, and longer training (see details: https://github.com/google-research/ALBERT). |
| T5 | t5-small | 6-layer, 512-hidden, 2048 feed-forward hidden-state, 8-heads, 60M parameters. Trained on English text from the Colossal Clean Crawled Corpus (C4). |
| | t5-base | 12-layer, 768-hidden, 3072 feed-forward hidden-state, 12-heads, 220M parameters. Trained on English text from the Colossal Clean Crawled Corpus (C4). |
| | t5-large | 24-layer, 1024-hidden, 4096 feed-forward hidden-state, 16-heads, 770M parameters. Trained on English text from the Colossal Clean Crawled Corpus (C4). |
| | t5-3B | 24-layer, 1024-hidden, 16384 feed-forward hidden-state, 32-heads, 2.8B parameters. Trained on English text from the Colossal Clean Crawled Corpus (C4). |
| | t5-11B | 24-layer, 1024-hidden, 65536 feed-forward hidden-state, 128-heads, 11B parameters. Trained on English text from the Colossal Clean Crawled Corpus (C4). |
| XLM-RoBERTa | xlm-roberta-base | 12-layer, 768-hidden, 3072 feed-forward hidden-state, 8-heads, 125M parameters. Trained on 2.5 TB of newly created CommonCrawl data in 100 languages. |
| | xlm-roberta-large | 24-layer, 1024-hidden, 4096 feed-forward hidden-state, 16-heads, 355M parameters. Trained on 2.5 TB of newly created CommonCrawl data in 100 languages. |
| FlauBERT | flaubert-small-cased | 6-layer, 512-hidden, 512 feed-forward hidden-state, 8-heads, 54M parameters. FlauBERT small architecture (see details: https://github.com/getalp/Flaubert). |
| | flaubert-base-uncased | 12-layer, 768-hidden, 12-heads, 137M parameters. FlauBERT base architecture, trained on uncased text (see details: https://github.com/getalp/Flaubert). |
| | flaubert-base-cased | 12-layer, 768-hidden, 12-heads, 138M parameters. FlauBERT base architecture, trained on cased text (see details: https://github.com/getalp/Flaubert). |
| | flaubert-large-cased | 24-layer, 1024-hidden, 16-heads, 373M parameters. FlauBERT large architecture (see details: https://github.com/getalp/Flaubert). |
| Bart | bart-large | 12-layer, 1024-hidden, 16-heads, 406M parameters. (see details: https://github.com/pytorch/fairseq/tree/master/examples/bart) |
| | bart-large-mnli | Adds a 2-layer classification head with 1 million parameters; the bart-large architecture with a classification head, fine-tuned on MNLI. |
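The checkpoints that are already fine-tuned on a downstream task (for example bert-large-uncased-whole-word-masking-finetuned-squad above) can be used directly. Below is a minimal sketch with the pipeline API, again assuming a PyTorch backend; the question and context strings are invented for illustration:

```python
from transformers import pipeline

# Question answering with the SQuAD fine-tuned checkpoint from the table.
qa = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)

result = qa(
    question="How many layers does bert-large have?",
    context="bert-large-uncased has 24 layers, 1024 hidden units and 16 attention heads.",
)
print(result["answer"])  # the answer span extracted from the context, e.g. "24"
```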
Original documentation: https://huggingface.co/transformers/pretrained_models.html
Original article by pytorch. When reposting, please cite the source: https://pytorchchina.com/2020/03/04/transformers-%e9%a2%84%e8%ae%ad%e7%bb%83%e6%a8%a1%e5%9e%8b/