📌 1. What is Ragas?

RAGAS is an open-source framework for evaluating Retrieval-Augmented Generation (RAG) systems. It supports evaluation at every level, from a single question up to the system as a whole.
Notably, many of its metrics are reference-free: they measure performance automatically, without anyone first having to build a hand-labeled evaluation dataset.

Main goals:

  • Measure RAG system performance with fine-grained metrics
  • Assess the quality of the retrieval (retriever) and generation (generator) modules
  • Diagnose and improve LLM pipelines

🔍 2. Evaluation Metrics

RAGAS provides four main metrics:

① Faithfulness

  • Measures how well the generated answer is supported by the retrieved context
  • Answers that contain false information or claims not found in the context score low
  • Technique: LLM-based judging or natural language inference (NLI)
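
As a rough illustration of how such a score can be formed: split the answer into individual claims, have a judge (an LLM or an NLI model) mark each claim as supported or unsupported by the retrieved context, and take the supported fraction. The sketch below assumes the verdicts already exist. `faithfulness_score` and the sample verdicts are hypothetical names and data for illustration, not part of the RAGAS API.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0  # assumption: an answer with no claims scores 0
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts for an answer split into four claims;
# in practice an LLM or NLI judge would produce these.
verdicts = [True, True, False, True]
score = faithfulness_score(verdicts)
print(f"faithfulness = {score:.2f}")
```

One unsupported claim out of four pulls the score down to 0.75, which is why a single hallucinated detail is visible in this metric.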

② Answer Relevancy

  • Evaluates whether the generated answer actually addresses the question
  • Answers unrelated to the question score low
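
One way to turn this into a number, sketched here under stated assumptions, is to generate a few questions back from the answer and check how close they are to the original question in embedding space. The vectors below are toy 2-dimensional stand-ins for real embeddings, and `cosine_similarity` and `answer_relevancy_score` are hypothetical helpers, not RAGAS functions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_relevancy_score(question_vec, generated_question_vecs) -> float:
    """Mean similarity between the real question and questions
    reconstructed from the answer."""
    sims = [cosine_similarity(question_vec, v) for v in generated_question_vecs]
    return sum(sims) / len(sims) if sims else 0.0


# Toy embeddings (assumption): one reconstructed question matches the
# original exactly, the other is unrelated (orthogonal).
question_vec = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0]]
relevancy = answer_relevancy_score(question_vec, generated)
print(f"answer_relevancy = {relevancy:.2f}")
```

An answer that drifts off topic yields reconstructed questions far from the original, so the mean similarity drops.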

③ Context Precision

  • Evaluates how much of the retrieved material is actually useful for answering the question
  • Used to measure the quality of the retriever
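
A minimal sketch of one rank-aware way to compute such a score, assuming each retrieved chunk has already been judged useful or not: average precision@k over the ranks that hold a useful chunk. `context_precision_score` and the relevance flags are hypothetical and only illustrate the idea.

```python
def context_precision_score(relevant_flags: list[bool]) -> float:
    """Average of precision@k, taken at every rank k that holds a useful chunk."""
    hits = 0
    precision_sum = 0.0
    for k, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / k  # precision among the top-k chunks
    return precision_sum / hits if hits else 0.0


# Hypothetical judgments for three retrieved chunks, in ranked order.
ranked = [True, False, True]
precision = context_precision_score(ranked)
print(f"context_precision = {precision:.3f}")
```

Because the score depends on rank, the same useful chunk counts for more when the retriever places it near the top.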

④ Context Recall

  • Evaluates whether the retrieval results include all of the documents needed to generate the answer
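
To make the idea concrete, here is a simplified sketch that treats recall as set overlap between the documents needed for the answer and the documents actually retrieved. This ID-based version is an assumption made for illustration; a framework such as RAGAS judges coverage with an LLM rather than by comparing IDs. `context_recall_score` and the document IDs are hypothetical.

```python
def context_recall_score(needed_ids: set[str], retrieved_ids: set[str]) -> float:
    """Share of the needed documents that appear in the retrieval results."""
    if not needed_ids:
        return 1.0  # assumption: nothing was needed, so nothing was missed
    return len(needed_ids & retrieved_ids) / len(needed_ids)


# Hypothetical IDs: three documents are needed, two of them were retrieved.
needed = {"doc-1", "doc-2", "doc-3"}
retrieved = {"doc-1", "doc-3", "doc-9"}
recall = context_recall_score(needed, retrieved)
print(f"context_recall = {recall:.3f}")
```

Note that the extra, unneeded `doc-9` does not lower recall; it would show up in context precision instead.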

[Note: metric grouping]

  • Context Recall and Context Precision focus on the retrieval stage
  • Faithfulness and Answer Relevancy focus on the generation stage

⚙️ 3. RAGAS Evaluation Components

An evaluation needs the following information:

Field                    Description
question                 The user's question
contexts                 List of retrieved documents
answer                   The answer generated by the model
ground_truth (optional)  Reference answer; needed only by reference-based metrics such as context_recall
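
The layout of these fields can be sketched as a column-oriented dictionary in which every column has one entry per sample and each `contexts` entry is itself a list of strings. The sample row and the `validate_rows` helper below are hypothetical, written only to show the expected shape before the data is handed to an evaluation library.

```python
REQUIRED_FIELDS = ("question", "contexts", "answer")


def validate_rows(rows: dict) -> int:
    """Check the column layout described above and return the row count."""
    missing = [f for f in REQUIRED_FIELDS if f not in rows]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    lengths = {name: len(column) for name, column in rows.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"columns differ in length: {lengths}")
    for ctx in rows["contexts"]:
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            raise TypeError("each contexts entry must be a list of strings")
    return len(rows["question"])


# Hypothetical single-sample dataset.
sample = {
    "question": ["What does RAG stand for?"],
    "contexts": [["RAG stands for Retrieval-Augmented Generation."]],
    "answer": ["RAG stands for Retrieval-Augmented Generation."],
    "ground_truth": ["Retrieval-Augmented Generation"],  # optional column
}
n_rows = validate_rows(sample)
print(f"{n_rows} row(s) look well-formed")
```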
 

🧪 4. Using RAGAS (Python example)

Install:

pip install ragas

Example evaluation code:

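from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# Build the evaluation dataset: every column holds one entry per sample.
# The column names follow the schema used in this post; other ragas
# releases may expect different names.
data = Dataset.from_dict({
    "question": [...],
    "contexts": [...],      # one list of strings per question
    "answer": [...],        # model-generated answers
    "ground_truth": [...],  # reference answers (optional for reference-free metrics)
})

# Run the evaluation. The metrics rely on an LLM judge, so the API key
# for the configured provider typically has to be set beforehand.
results = evaluate(
    data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

print(results)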
 
 

 

Example output:

{
    'faithfulness': 0.83,
    'answer_relevancy': 0.91,
    'context_precision': 0.75,
    'context_recall': 0.65
}
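
One way to read such a result is to average the scores per pipeline stage and see which stage is weaker. The sketch below treats the result as a plain dictionary holding the example numbers above, which is an assumption about its shape. The `COMPONENT` mapping and the `weakest_component` helper are hypothetical.

```python
# Which stage each metric speaks to (retrieval vs. generation).
COMPONENT = {
    "faithfulness": "generation",
    "answer_relevancy": "generation",
    "context_precision": "retrieval",
    "context_recall": "retrieval",
}


def weakest_component(scores: dict) -> tuple[str, dict]:
    """Average the scores per stage and return the lowest-scoring stage."""
    grouped: dict[str, list[float]] = {}
    for metric, value in scores.items():
        grouped.setdefault(COMPONENT[metric], []).append(value)
    averages = {stage: sum(v) / len(v) for stage, v in grouped.items()}
    return min(averages, key=averages.get), averages


scores = {
    "faithfulness": 0.83,
    "answer_relevancy": 0.91,
    "context_precision": 0.75,
    "context_recall": 0.65,
}
stage, averages = weakest_component(scores)
print(stage, averages)
```

With these numbers the retrieval stage averages lower than the generation stage, which suggests looking at the retriever first.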

 

 


💡 5. Advantages of RAGAS

Advantage                    Description
Reference-free evaluation    Usable even for QA systems that have no gold answers
Per-component evaluation     Retrieval and generation performance can be analyzed separately
Extensibility                Works with a range of pipelines, including HuggingFace, LangChain, and LlamaIndex
Visualization and debugging  Problems with specific queries are easy to trace
 

📚 6. Main Use Cases

  • Diagnosing the performance of ChatGPT plugins or LangChain agents
  • Analyzing answer quality in internal QA chatbots
  • Comparing a RAG pipeline before and after improvements
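
For the before/after comparison, a small helper that subtracts the two sets of scores is often enough. In this sketch the "before" values reuse two numbers from the example output in section 4, while the "after" values are made up purely for illustration. `compare_runs` is a hypothetical helper.

```python
def compare_runs(before: dict, after: dict) -> dict:
    """Per-metric change (after minus before) for metrics present in both runs."""
    return {m: round(after[m] - before[m], 4) for m in before if m in after}


before = {"faithfulness": 0.83, "context_recall": 0.65}
after = {"faithfulness": 0.85, "context_recall": 0.78}  # made-up values

deltas = compare_runs(before, after)
for metric, delta in deltas.items():
    print(f"{metric}: {delta:+.2f}")
```

Keeping the evaluation dataset fixed between the two runs is what makes the deltas comparable.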
