🔍 What is GPT?

GPT stands for "Generative Pre-trained Transformer".

Word | Meaning
Generative | A model that generates text (sentences)
Pre-trained | Trained in advance on large-scale data, then applied to many different tasks
Transformer | Based on the Transformer deep-learning architecture (GPT uses only the decoder part)
 

🔧 GPT's architecture: a decoder-only Transformer

GPT uses only the decoder side of the Transformer architecture.
It is made up of the following components:

📌 Components of the GPT decoder

 

  • Input Tokens: the input sentence is tokenized and converted into a sequence of numbers (token IDs).
  • Token Embedding: each token is mapped to a high-dimensional vector that encodes its meaning.
  • Positional Encoding: information about each token's position is added so the model can use word order and context.
  • Transformer Decoder Blocks: a stack of decoder blocks, each containing:
    • Masked Multi-Head Self-Attention: self-attention with a mask so that the token at the current position can only attend to earlier tokens.
    • Feed Forward Network (FFN): a fully connected network applied independently at each token position.
    • Residual Connections & Layer Normalization: used for training stability and better performance.
  • Linear Layer: projects the decoder output to a vector with one dimension per word in the vocabulary.
  • Softmax: converts those scores into a probability distribution used to predict the next word.
  • Output Probabilities: the probability of each word being the next one; the highest-probability word is chosen (under greedy decoding).

 


🔁 This block structure is repeated across many layers (e.g., GPT-2 has 12 layers, GPT-3 has 96); a minimal code sketch of the pipeline above follows below.
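To make the component list concrete, here is a minimal PyTorch sketch of a decoder-only forward pass. The names and tiny sizes (MiniGPT, d_model=64, 2 layers, a 1,000-word vocabulary) are illustrative assumptions rather than the real GPT configuration; what matters is the order of operations: embeddings plus positions, masked self-attention and FFN blocks with residual connections and layer normalization, then the final linear layer and softmax.

import torch
import torch.nn as nn

# Illustrative decoder-only sketch; not the actual OpenAI implementation.
class MiniGPTBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        T = x.size(1)
        # Causal mask: True marks positions a token is NOT allowed to attend to,
        # i.e. everything to its right (the future).
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)   # masked multi-head self-attention
        x = x + attn_out                                   # residual connection
        x = x + self.ffn(self.ln2(x))                      # FFN + residual connection
        return x

class MiniGPT(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64, n_layers=2, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # token embedding
        self.pos_emb = nn.Embedding(max_len, d_model)      # (learned) positional encoding
        self.blocks = nn.ModuleList([MiniGPTBlock(d_model) for _ in range(n_layers)])
        self.lm_head = nn.Linear(d_model, vocab_size)      # linear layer -> vocabulary size

    def forward(self, token_ids):
        T = token_ids.size(1)
        pos = torch.arange(T, device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(pos)    # embeddings + position info
        for block in self.blocks:                          # stacked decoder blocks
            x = block(x)
        logits = self.lm_head(x)                           # (batch, T, vocab_size)
        return torch.softmax(logits, dim=-1)               # output probabilities

model = MiniGPT()
probs = model(torch.randint(0, 1000, (1, 10)))  # one "sentence" of 10 random token ids
print(probs.shape)                              # torch.Size([1, 10, 1000])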


🔍 How does GPT work?

🧠 Training objective: language modeling

GPT is trained with a single goal: look at the words that came before and predict the next word. The diagram below shows the pipeline an input passes through to produce those next-word probabilities:

 

graph TD
  A[Input Tokens] --> B[Token Embedding]
  B --> C[Positional Encoding]
  C --> D[Transformer Decoder Blocks]
  D --> E[Linear Layer]
  E --> F[Softmax]
  F --> G[Output Probabilities]

 

Example:

 
Input: The weather today
  → Output: looks (the predicted next word)
  → Extended input: The weather today looks
  → Output: great
  → repeat…

 

 

By training this way on enormous amounts of text,
the model naturally picks up word meanings, grammar, sentence structure, and even context.


🧪 How GPT is trained

1. Pre-training

  • Trained on vast amounts of text from the internet, Wikipedia, news articles, and more
  • Objective: predict the next word (see the sketch below)
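
In code terms, the pre-training objective is just a shifted cross-entropy loss: the target at each position is the token that actually comes next. A minimal sketch with random stand-in tensors (no real corpus or model is loaded here):

import torch
import torch.nn.functional as F

vocab_size = 1000
token_ids = torch.randint(0, vocab_size, (2, 16))  # stand-in batch: 2 sequences of 16 tokens
logits = torch.randn(2, 16, vocab_size)            # stand-in model output (scores per position)

pred = logits[:, :-1, :]    # predictions for positions 0 .. T-2
target = token_ids[:, 1:]   # the "next word" at each of those positions

loss = F.cross_entropy(pred.reshape(-1, vocab_size), target.reshape(-1))
print(loss)  # pre-training minimizes this loss over enormous amounts of text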

2. Fine-tuning

  • Additional training tailored to a specific task (e.g., translation, summarization, dialogue); see the sketch below
  • Example: ChatGPT is tuned to specialize in conversation
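
Fine-tuning usually keeps exactly the same next-token loss but runs it over task-specific text. The sketch below pairs the public GPT-2 model with a made-up two-example "dialogue" dataset purely for illustration; it is not how ChatGPT itself was trained:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Hypothetical task data: a couple of dialogue-formatted strings.
dialogues = [
    "User: Hello! Assistant: Hi, how can I help you today?",
    "User: Please summarize this article. Assistant: Sure, here is a short summary...",
]

model.train()
for text in dialogues:                               # real fine-tuning: many examples, many passes
    input_ids = tokenizer.encode(text, return_tensors="pt")
    outputs = model(input_ids, labels=input_ids)     # labels=input_ids -> shifted next-token loss
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()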

3. Reinforcement Learning from Human Feedback (RLHF)

(applied from GPT-3.5 onward)

  • Human ratings of model outputs are used to refine the model so its responses come across as more "human-like"

📈 Evolution of the GPT series

Version | Year | Key features
GPT-1 | 2018 | Proof of concept, 117M parameters
GPT-2 | 2019 | Much stronger text generation, 1.5B parameters
GPT-3 | 2020 | 175B parameters, handles many tasks from a single model
GPT-3.5 | 2022 | Used in ChatGPT, optimized for dialogue
GPT-4 | 2023 | Better reasoning, multimodal support (including images)
GPT-4o | 2024 | "Omni" model, real-time integration of audio, vision, and text
 

 


🧩 GPT vs BERT vs T5

Aspect | GPT | BERT | T5
Architecture | Decoder only | Encoder only | Encoder + decoder
Directionality | Unidirectional (left→right) | Bidirectional | Bidirectional encoder, autoregressive decoder
Purpose | Generation | Understanding | Transformation (text in → text out)
Examples | Writing, dialogue, summarization | Classification, QA | Translation, summarization
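
The "directionality" row is easiest to see as attention masks. In the sketch below (a sequence length of 5 is an arbitrary choice), a GPT-style decoder uses a lower-triangular causal mask, while a BERT-style encoder lets every token attend to every other token:

import torch

T = 5  # example sequence length

causal_mask = torch.tril(torch.ones(T, T))  # GPT: row i can only see columns 0..i (the past)
full_mask = torch.ones(T, T)                # BERT-style encoder: every token sees the whole sentence

print(causal_mask)
# tensor([[1., 0., 0., 0., 0.],
#         [1., 1., 0., 0., 0.],
#         [1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 0.],
#         [1., 1., 1., 1., 1.]])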
 

📚 Where GPT is used

Area | Example applications
Document generation | Writing, report drafting, fiction
Code generation | Copilot, code autocompletion
Conversational AI | ChatGPT, AI counseling assistants
Search / summarization | Summarizing long texts, summarizing retrieved documents
Creative work | Poetry, ad copy, idea generation
 

✅ Summary

Item | Description
What is GPT? | A Transformer-based language model that generates text
Architecture | Uses only the Transformer decoder, with masked attention
How it works | Predicts the next word from the preceding words (language modeling)
Strengths | Fluent text generation, context awareness, applicable to many tasks
Use cases | Dialogue, writing, summarization, search, coding, and more

 

References
• The Illustrated GPT-2 (Visualizing Transformer Language Models): https://jalammar.github.io/illustrated-gpt2/
• The GPT-3 Architecture, on a Napkin: https://dugas.ch/artificial_curiosity/GPT_architecture.html

 
