Open-Source LLMs

August 30, 2023

LLMs

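1) GPT is often described as the right-hand (decoder) half of the Transformer, but GPT actually omits the middle Multi-Head Attention module of that decoder, i.e. the cross-attention over the encoder output. Only the masked self-attention and the feed-forward sublayer remain.

2) LLaMA builds on the GPT architecture and optimizes several of its components.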


https://dugas.ch/artificial_curiosity/GPT_architecture.html

https://zhuanlan.zhihu.com/p/636784644

Figure: comparison of the attention patterns of the three mainstream architectures.

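Causal Decoder // unidirectional attention

1) LLaMA
- 1.1 LLaMA architecture trained from scratch: Baichuan-7B/13B (Baichuan Intelligence), Qwen-7B (Alibaba Cloud)
- 1.2 Incremental fine-tuning of LLaMA: BELLE-7B/13B (Lianjia)
2) GPT, ChatGPT, BLOOM

Prefix Decoder // bidirectional attention over the prefix (encoder role), unidirectional attention over the generated tokens (decoder role), encoder and decoder share parameters

1) ChatGLM-6B, ChatGLM2-6B (Tsinghua University, Zhipu AI)

Encoder-Decoder // bidirectional attention in the encoder, unidirectional attention in the decoder

1) Flan-T5, the original Transformer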
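
The three attention patterns can be made concrete as boolean masks. The following is a minimal sketch in plain Python, not code taken from any of the models listed; the function names and the `prefix_len` parameter are illustrative assumptions. `mask[i][j]` is `True` when position `i` is allowed to attend to position `j`.

```python
def causal_mask(n):
    """Causal decoder: position i sees only positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def prefix_mask(n, prefix_len):
    """Prefix decoder: full (bidirectional) attention inside the prefix,
    causal attention for every position after it."""
    return [[j <= i or j < prefix_len for j in range(n)] for i in range(n)]


def encoder_decoder_masks(src_len, tgt_len):
    """Encoder-decoder: three separate masks.
    enc   : encoder self-attention, fully bidirectional
    dec   : decoder self-attention, causal
    cross : decoder-to-encoder attention, every target sees every source
    """
    enc = [[True] * src_len for _ in range(src_len)]
    dec = causal_mask(tgt_len)
    cross = [[True] * src_len for _ in range(tgt_len)]
    return enc, dec, cross


def show(mask):
    """Render a mask as text: 'x' = may attend, '.' = masked out."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


print(show(causal_mask(4)))
print()
print(show(prefix_mask(4, prefix_len=2)))
```

In the prefix case the first `prefix_len` rows are identical, which is what "bidirectional over the prefix" means in practice: every prefix token sees the whole prefix, while generated tokens still see only what came before them.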

https://www.cnblogs.com/heyjjjjj/p/17488423.html

https://zhuanlan.zhihu.com/p/626310493

https://arxiv.org/pdf/2303.18223.pdf


https://zhuanlan.zhihu.com/p/651747035

https://zhuanlan.zhihu.com/p/644815089

Positional encodings: Learned, Relative, RoPE, ALiBi
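
To illustrate two of these schemes, here is a rough sketch of RoPE and ALiBi in plain Python. It assumes the commonly used base of 10000 for RoPE and the power-of-two slope schedule for ALiBi; the function names are hypothetical and the code is not taken from any model's implementation. RoPE rotates each pair of dimensions by a position-dependent angle, so the dot product of a rotated query and key depends only on their relative offset. ALiBi leaves the vectors untouched and instead adds a distance-proportional penalty to the attention logits.

```python
import math


def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs (x[i], x[i+1]) by the angle pos * base**(-i/d)."""
    d = len(x)
    assert d % 2 == 0, "RoPE needs an even dimension"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def alibi_slopes(n_heads):
    """Per-head slopes 2^(-8/n), 2^(-16/n), ... (assumes n_heads is a power of two)."""
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]


def alibi_bias(n, slope):
    """Bias added to the logits: -slope * (i - j) for j <= i.
    Future positions are set to -inf here to fold in the causal mask."""
    return [[-slope * (i - j) if j <= i else float("-inf") for j in range(n)]
            for i in range(n)]


q, k = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]
# Same relative offset (2) at different absolute positions gives the same score.
print(dot(rope(q, 5), rope(k, 3)), dot(rope(q, 12), rope(k, 10)))
```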

https://zhuanlan.zhihu.com/p/650469278

Activation functions: SwiGLU, ReLU, GeLU, GeGLU
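
The difference between the plain activations and the GLU variants is easier to see in code. This is a minimal sketch in plain Python, assuming a bias-free feed-forward layer; the weight names `W_gate`, `W_up`, and `W_down` are illustrative, not taken from any specific codebase. ReLU and GeLU act elementwise on a single projection, while SwiGLU and GeGLU multiply an activated "gate" projection with a second linear projection.

```python
import math


def relu(x):
    return max(0.0, x)


def gelu(x):
    """Exact GeLU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x):
    """SiLU / Swish with beta = 1: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def linear(x, W):
    """x has length d_in; W is a d_in x d_out matrix stored as a list of rows."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]


def glu_ffn(x, W_gate, W_up, W_down, act):
    """Gated feed-forward: W_down applied to (act(x W_gate) * (x W_up))."""
    gate = [act(v) for v in linear(x, W_gate)]
    up = linear(x, W_up)
    hidden = [g * u for g, u in zip(gate, up)]
    return linear(hidden, W_down)


def swiglu_ffn(x, W_gate, W_up, W_down):
    return glu_ffn(x, W_gate, W_up, W_down, silu)


def geglu_ffn(x, W_gate, W_up, W_down):
    return glu_ffn(x, W_gate, W_up, W_down, gelu)


I = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, only for demonstration
print(swiglu_ffn([1.0, -2.0], I, I, I))
print(geglu_ffn([1.0, -2.0], I, I, I))
```

A gated layer has three weight matrices instead of two, so implementations that switch to a GLU variant usually shrink the hidden width to keep the parameter count comparable.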

Normalization: Pre RMSNorm, Pre LayerNorm, Post LayerNorm
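
These options combine two independent choices: which normalization to use (LayerNorm or RMSNorm) and where to place it relative to the residual connection (Pre or Post). A small sketch in plain Python, with illustrative function names and default `eps` values chosen here as assumptions:

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Subtract the mean, divide by the standard deviation, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]


def rms_norm(x, gamma, eps=1e-6):
    """Divide by the root mean square only: no mean subtraction, no shift."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gamma)]


def pre_norm_block(x, sublayer, norm):
    """Pre-norm residual: x + sublayer(norm(x))."""
    return [a + b for a, b in zip(x, sublayer(norm(x)))]


def post_norm_block(x, sublayer, norm):
    """Post-norm residual: norm(x + sublayer(x))."""
    return norm([a + b for a, b in zip(x, sublayer(x))])


x = [3.0, 4.0]
ones = [1.0, 1.0]
double = lambda h: [2.0 * v for v in h]  # stand-in for attention or FFN
print(pre_norm_block(x, double, lambda h: rms_norm(h, ones)))
print(post_norm_block(x, double, lambda h: layer_norm(h, ones, [0.0, 0.0])))
```

With pre-norm the residual path carries `x` through unnormalized, which is generally regarded as easier to train in deep stacks; post-norm normalizes the sum itself, as in the original Transformer.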

Attention mechanisms: Full attention, Sparse attention, Multi-query attention, FlashAttention
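
As an illustration of how multi-query attention differs from full multi-head attention, here is a small sketch in plain Python using nested lists. The shapes and function names are assumptions made for this example, and the input/output projections are omitted. In multi-query attention every query head attends over one shared set of keys and values, whereas full multi-head attention keeps a separate K and V per head. FlashAttention is not shown: it computes the same exact attention but reorganizes the memory access pattern at the kernel level.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
    return [sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))]


def multi_query_attention(Q, K, V, causal=True):
    """Q: [n_heads][seq][d_head]; K and V: [seq][d_head], shared by all heads."""
    out = []
    for q_head in Q:
        head_out = []
        for i, q in enumerate(q_head):
            end = i + 1 if causal else len(K)  # causal: only positions <= i
            head_out.append(attend(q, K[:end], V[:end]))
        out.append(head_out)
    return out


def multi_head_attention(Q, K, V, causal=True):
    """Baseline: K and V are [n_heads][seq][d_head], one set per head."""
    return [multi_query_attention([q], k, v, causal)[0]
            for q, k, v in zip(Q, K, V)]


Q = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [2.0, -1.0]]]  # 2 heads, 2 tokens
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
print(multi_query_attention(Q, K, V))
```

Because K and V are stored once instead of once per head, the key/value cache kept during generation shrinks by a factor equal to the number of heads, which is the main motivation for this variant.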

https://zhuanlan.zhihu.com/p/644815089

https://zhuanlan.zhihu.com/p/642112610

https://zhuanlan.zhihu.com/p/639276066