This post follows on from our two previous posts.
In this post, we look at a taxonomy of TPTLMs (Transformer-based pre-trained language models).
The post is based on a paper which covers this topic extensively.
Transformer-based pre-trained language models (TPTLMs) are a complex and fast-growing area of AI, so I recommend this paper as a good way to understand and navigate the landscape.
We can classify TPTLMs from four perspectives:
- Pretraining Corpus
- Model Architecture
- Type of SSL (self-supervised learning)
- Extensions
Pretraining Corpus-based models
General pretraining: Models like GPT-1 and BERT are pretrained on general corpora. For example, GPT-1 is pretrained on the Books corpus, while BERT and UniLM are pretrained on English Wikipedia and the Books corpus. This form of training is more general because it draws on multiple sources of information.
Social Media-based: models could also be pretrained on social media corpora.
Language-based: models could be pretrained on either monolingual or multilingual corpora.
TPTLMs can also be classified by their architecture. A T-PTLM can be pretrained using a stack of encoders, a stack of decoders, or both.
Hence, you could have architectures that are:
- Encoder-based
- Decoder-based
- Encoder-Decoder based
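The practical difference between encoder and decoder stacks comes down to the self-attention mask. As a minimal sketch (not from the paper): encoder stacks like BERT let every token attend to every other token, while decoder stacks like GPT use a causal mask so each token only sees itself and earlier tokens.

```python
import numpy as np

def attention_mask(seq_len: int, causal: bool) -> np.ndarray:
    """Build a self-attention mask: 1 = position may be attended to.

    Encoder stacks (BERT-style) use a full bidirectional mask;
    decoder stacks (GPT-style) use a causal, lower-triangular mask
    so each token only attends to itself and earlier positions.
    """
    if causal:
        return np.tril(np.ones((seq_len, seq_len), dtype=int))
    return np.ones((seq_len, seq_len), dtype=int)

encoder_mask = attention_mask(4, causal=False)  # all ones
decoder_mask = attention_mask(4, causal=True)   # lower triangular
```

An encoder-decoder model such as T5 combines both: a bidirectional mask over the input plus a causal mask over the output.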
SSL (self-supervised learning)
Self-supervised learning (SSL) is one of the key ingredients in building T-PTLMs.
A T-PTLM can be developed by pretraining using Generative, Contrastive, Adversarial, or Hybrid SSL. Hence, based on the type of SSL, you could have:
- Generative SSL
- Contrastive SSL
- Adversarial SSL
- Hybrid SSL
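To make generative SSL concrete, here is a minimal sketch (my own illustration, not code from the paper) of BERT-style masked-language-model corruption: roughly 15% of tokens are selected, and each selected token is replaced by `[MASK]` 80% of the time, by a random vocabulary token 10% of the time, and left unchanged 10% of the time; the model is then trained to reconstruct the originals. The tiny `VOCAB` list is a placeholder.

```python
import random

MASK = "[MASK]"
VOCAB = ["cat", "dog", "sat", "mat", "the", "on"]  # toy placeholder vocabulary

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """BERT-style masked-LM corruption (generative SSL).

    Returns (corrupted_tokens, targets) where targets[i] is the
    original token if position i must be predicted, else None.
    """
    rng = rng or random.Random(0)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            targets.append(tok)             # model must reconstruct this token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)      # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.choice(VOCAB))  # 10%: random token
            else:
                corrupted.append(tok)       # 10%: keep unchanged
        else:
            targets.append(None)            # no loss at this position
            corrupted.append(tok)
    return corrupted, targets
```

Contrastive SSL instead trains the model to pull representations of related pairs together, and adversarial SSL (as in ELECTRA) trains a discriminator to detect replaced tokens.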
Based on extensions, you can classify TPTLMs into the following categories:
- Compact T-PTLMs: aim to reduce the size of the T-PTLMs and make them faster using a variety of model compression techniques like pruning, parameter sharing, knowledge distillation, and quantization.
- Character-based T-PTLMs: work directly on characters rather than a subword vocabulary. For example, CharacterBERT uses a CharCNN + Highway layer to generate word representations from character embeddings and then applies transformer encoder layers; AlphaBERT is another example.
- Green T-PTLMs: focus on environmentally friendly methods
- Sentence-based T-PTLMs: extend T-PTLMs like BERT to generate quality sentence embeddings.
- Tokenization-Free T-PTLMs: avoid explicit tokenizers for splitting input sequences, to cater for languages such as Chinese or Thai that do not use white space or punctuation as word separators.
- Large-Scale T-PTLMs: the performance of T-PTLMs is strongly related to the scale rather than the depth or width of the model, so these models aim to increase the number of parameters.
- Knowledge-Enriched T-PTLMs: T-PTLMs are developed by pretraining over large volumes of text data, from which the model learns general linguistic knowledge; knowledge-enriched T-PTLMs additionally inject external knowledge, such as knowledge graphs, into the model.
- Long-Sequence T-PTLMs: self-attention variants like sparse self-attention and linearized self-attention are proposed to reduce its quadratic complexity and hence extend T-PTLMs to long input sequences.
- Efficient T-PTLMs: improve the core architecture itself, e.g., DeBERTa, which improves on BERT using a disentangled attention mechanism and an enhanced mask decoder.
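Knowledge distillation, one of the compression techniques behind compact T-PTLMs such as DistilBERT, can be sketched as follows (a minimal NumPy illustration under my own assumptions, not code from the paper): the student is trained to match the teacher's temperature-softened output distribution.

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax with temperature T."""
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """Soft-target loss for knowledge distillation.

    A temperature T > 1 softens the teacher's distribution, exposing
    its 'dark knowledge' about relative class similarities; the T**2
    factor keeps gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, T)   # teacher soft targets
    q = softmax(student_logits, T)   # student predictions
    # KL(p || q): zero when the student exactly matches the teacher
    return float(np.sum(p * (np.log(p) - np.log(q))) * T**2)
```

In practice this soft-target loss is combined with the ordinary hard-label loss; pruning, parameter sharing, and quantization attack model size from other angles.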
This is a complex area, and I hope the taxonomy above is useful. The paper I referred to provides more detail and makes a great effort at explaining such a complex landscape.
The images in this post are also sourced from the paper.