[Paper Review] A PARADIGM SHIFT IN MACHINE TRANSLATION: BOOSTING TRANSLATION PERFORMANCE OF LARGE LANGUAGE MODELS (23.09)


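Generative LLMs have made remarkable progress on a wide range of NLP tasks. That progress has not carried over to translation, however: moderate-sized models (7B or 13B parameters) in particular still lag behind conventional supervised encoder-decoder translation models. Previous studies have tried to improve the translation ability of these moderate-sized models, but their gains were limited. In this study, we propose a new fine-tuning approach for LLMs, designed specifically for the translation task, that does not need the abundant parallel data traditional translation models depend on. The approach consists of two fine-tuning stages: initial fine-tuning on monolingual data, followed by fine-tuning on a small set of high-quality parallel data. We introduce the LLM developed with this strategy as ALMA (Advanced Language Model-based trAnslator). With LLaMA-2 as the base model, our results show an average improvement of more than 12 BLEU and 12 COMET over its zero-shot performance across 10 translation directions from the WMT’21 (2 directions) and WMT’22 (8 directions) test sets. With only 7B or 13B parameters, our models outperform all prior work and even the NLLB-54B model (NLLB TEAM et al., 2022) and GPT-3.5-text-davinci-003. This method lays the foundation for a new training paradigm in machine translation. (Code)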

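* parallel data: text data that corresponds across two or more languages or language versions
e.g., English: "Hello, how are you?", Spanish: "Hola, ¿cómo estás?"

* WMT'21 (2 directions): the test set from the 2021 WMT shared task (2 directions = a dataset covering both directions of one language pair)

e.g., data that can evaluate both directions of a pair, such as English→French and French→English (in this paper, the WMT’21 pair is is↔en)

* WMT'22 (8 directions): the test set from the 2022 WMT shared task (a dataset covering eight translation directions)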

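* BLEU (Bilingual Evaluation Understudy): a metric that scores a translation by its n-gram overlap with a reference translation

* COMET (Crosslingual Optimized Metric for Evaluation of Translation): a learned, neural-network-based metric that scores a translation given the source sentence and a reference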
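
As a rough illustration of what BLEU measures, here is a minimal sentence-level sketch: clipped n-gram precision up to 4-grams, combined by a geometric mean and multiplied by a brevity penalty. It assumes a single reference, whitespace tokenization, and no smoothing, and the function names are made up for this example. Published scores are normally computed with a standard tool such as sacreBLEU, so the numbers from this toy version will not match reported results.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis, reference, max_n=4):
    """Toy single-reference BLEU on whitespace tokens, scaled to 0-100."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any empty precision zeroes the score
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty punishes hypotheses shorter than the reference.
    if len(hyp) > len(ref):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref) / len(hyp))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    print(simple_bleu("the cat sat on the mat", "the cat sat on a mat"))
```

An exact match scores 100 and a hypothesis sharing no words with the reference scores 0; everything else falls in between.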

1. Introduction

Translation performance of contemporary decoder-only LLM translation systems based on LLaMA (Yang et al., 2023; Zhang et al., 2023c), and zero-shot performance of LLaMA, for the WMT’22 test data across 8 directions (translating to or from English for German, Czech, Chinese, and Russian). Benchmark comparisons also include two leading translation models, NLLB-54B and GPT-3.5-text-davinci-003. Our systems, developed on LLaMA-2 with 7B and 13B parameters, surpass previous models by an impressive margin of nearly 10 BLEU and 7 COMET. Furthermore, they even slightly outperform GPT-3.5 and NLLB-54B on average.

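Outperforms NLLB-54B (the SOTA translation model) on the BLEU and COMET metrics
Outperforms LLMs such as GPT-3.5-text-davinci-003 on BLEU and COMET despite a much smaller model size
ALMA takes LLaMA-2 as its base model and fine-tunes it with a new two-stage strategy
(the two stages: initial fine-tuning on monolingual data + fine-tuning on a small set of high-quality parallel data)
Fine-tuning on just 1B monolingual tokens (followed by the parallel-data stage) already gives performance comparable to NLLB-54B across the 10 translation directions,
and this takes about 18 hours of training on 16 MI200 GPUs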

2. Preliminary

2.1 Task Definition

log-likelihood loss of the parallel sentence (x, y)

The prompt used for training and evaluation. [source language] and [target language] represent the full name of the language, e.g., Translate this from German to English. Note that we do not compute loss for the prompt.
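
The caption's note that no loss is computed for the prompt can be sketched as follows. This is a toy illustration, not the authors' code: only the first line of the template ("Translate this from ... to ...") is quoted in the text, so the rest of the layout is an assumption, and `build_prompt`, `toy_tokenize`, `build_training_example`, and the `MASKED` marker are hypothetical names. A real setup would use the model's subword tokenizer and the training framework's own ignore value for masked labels.

```python
# Toy sketch: build the translation prompt and mask it out of the loss.
# Only the "Translate this from X to Y" wording comes from the text;
# the remaining template layout is an assumption for illustration.

MASKED = None  # hypothetical marker meaning "no loss at this position"


def build_prompt(src_lang, tgt_lang, src_sentence):
    """Fill the prompt template with full language names."""
    return (
        f"Translate this from {src_lang} to {tgt_lang}:\n"
        f"{src_lang}: {src_sentence}\n"
        f"{tgt_lang}:"
    )


def toy_tokenize(text):
    """Whitespace tokenizer standing in for a real subword tokenizer."""
    return text.split()


def build_training_example(src_lang, tgt_lang, src_sentence, tgt_sentence):
    """Return (input_tokens, labels) with every prompt position masked."""
    prompt_tokens = toy_tokenize(build_prompt(src_lang, tgt_lang, src_sentence))
    target_tokens = toy_tokenize(tgt_sentence)
    input_tokens = prompt_tokens + target_tokens
    # Labels keep only the target tokens, so the loss ignores the prompt.
    labels = [MASKED] * len(prompt_tokens) + target_tokens
    return input_tokens, labels


tokens, labels = build_training_example(
    "German", "English", "Guten Morgen .", "Good morning ."
)
```

The model still reads the whole sequence as input; masking only changes which positions contribute to the loss.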

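The loss above (the log-likelihood loss of the parallel sentence pair) uses the following symbols.
x: source sentence
y: target sentence
I: prompt template
θ: model parameters
T: target sentence length
yt: t-th target token
According to the paper, CLM (Causal Language Modeling) is better suited to the translation task than other modeling approaches.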
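
The formula itself is not reproduced in this post. Read against the symbols listed in this section, the loss is presumably the standard negative log-likelihood over target tokens, conditioned on the prompt and the source sentence. The notation below is a reconstruction under that assumption, not a copy of the paper's equation:

```latex
\mathcal{L}(x, y; \theta)
  = -\log P(y \mid x, \mathcal{I}; \theta)
  = -\sum_{t=1}^{T} \log P(y_t \mid y_{<t}, x, \mathcal{I}; \theta)
```

Here y_{<t} stands for the target tokens before position t. The sum runs only over the T target tokens, which is consistent with the note that no loss is computed for the prompt.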

2.2 A Backbone LLM For Translation

Averaged zero-shot translation performance on 10 directions: cs↔en, de↔en, is↔en, zh↔en, ru↔en, where is↔en is from WMT’21 test data and the others from WMT’22 test data.

When choosing the base model, zero-shot translation performance is evaluated first
Zero-shot evaluation: run on 5 English-centric language pairs (test data: WMT’21, WMT’22)
Based on the BLEU and COMET results, LLaMA-2 and MPT-7B are selected

3. DO LLMS HAVE AN APPETITE FOR PARALLEL DATA?

3.1 EXPERIMENTAL DESIGN

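Experiments focus on LLaMA-2 and MPT-7B
They concentrate on the English → Russian (en→ru) direction
The training data is drawn from 75 million (75M) preprocessed parallel sentences and split into five sizes (10K, 100K, 1M, 5M, and 20M)
For each of the five sizes, the model parameters are updated (fine-tuned) using the prompt template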

3.2 OBSERVATIONS

BLEU and COMET scores obtained during the fine-tuning of MPT-7B and LLaMA-2-7B across each data step for en→ru. Additionally, we present the results for NLLB-54B and a 7B model trained from scratch. A notable decline in LLaMA-2-7B’s COMET score suggests that substantial parallel data might dilute its pre-existing knowledge.

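The experiments gave the results shown in the figure.
- LLaMA-2-7B peaks at 10K and 100K examples, whereas MPT-7B keeps improving as data is added
LLaMA-2 needs only about 10K to 100K training examples to reach its peak
- More data tends to erase existing knowledge instead (performance drops as the data grows)
Too much parallel data wipes out the model's existing knowledge.
- A model with no prior knowledge, trained from scratch on 20M examples, reaches the performance marked by the yellow triangle
(That is, the from-scratch model and the model fine-tuned on 20M examples differ little, which suggests that fine-tuning on that much data erased the LLM's existing knowledge)
So LLMs should not be adapted to translation by training on massive amounts of parallel data.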

4. A NEW TRAINING RECIPE

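New training recipe: monolingual data fine-tuning + fine-tuning on high-quality parallel data -> ALMA
Monolingual Data Fine-tuning
- LLMs such as LLaMA are trained mostly on English-centric data (so their multilingual translation performance is low)
- Fine-tuning on monolingual data in the non-English languages improves the model's proficiency in those languages
- English monolingual data is also included during fine-tuning so that the model does not forget English
High-Quality Data Fine-tuning
- Only a small amount of high-quality parallel data is needed
- Human-written datasets (from WMT test data) and the Flores-200 dataset are used
- Both full-weight fine-tuning and lightweight LoRA fine-tuning are considered.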

5. EXPERIMENTS

5.1 Data

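Parallel training data (58K): human-written test sets (from WMT’17 to WMT’20) + development and test sets (from Flores-200)
Test data: 10 translation directions (5 pairs): cs↔en, de↔en, is↔en, zh↔en, ru↔en
is↔en data: WMT’21; all other data: WMT’22
Parallel validation data (8K): WMT’21 test data, excluding the part already used as test data above
Monolingual dataset: OSCAR (monolingual data is sampled at random according to the following per-language ratios)
20% (de), 14% (cs), 8% (is), 19% (zh), 22% (ru), 17% (en)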
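
A minimal sketch of ratio-based sampling with the percentages listed above. The text does not say whether the ratios apply per document, per sequence, or per token, so drawing one language per training sequence is an assumption here, and `sample_languages` is a hypothetical helper rather than part of the released code.

```python
import random
from collections import Counter

# Per-language sampling ratios as listed in the post (they sum to 100%).
SAMPLING_RATIOS = {
    "de": 0.20,
    "cs": 0.14,
    "is": 0.08,
    "zh": 0.19,
    "ru": 0.22,
    "en": 0.17,
}


def sample_languages(num_draws, seed=0):
    """Draw a language for each training sequence according to the ratios."""
    rng = random.Random(seed)
    languages = list(SAMPLING_RATIOS)
    weights = [SAMPLING_RATIOS[lang] for lang in languages]
    return rng.choices(languages, weights=weights, k=num_draws)


if __name__ == "__main__":
    counts = Counter(sample_languages(100_000))
    for lang, count in counts.most_common():
        print(lang, round(count / 100_000, 3))
```

With enough draws the empirical shares approach the listed ratios, and English keeps a 17% share so that the model continues to see English text during this stage.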

5.2 TRAINING SETUP

ALMA-7B/ALMA-13B
- (starting from LLaMA-2-7B or LLaMA-2-13B)
Full-weight fine-tuning on monolingual data, followed by full-weight fine-tuning on high-quality parallel data
ALMA-7B-LoRA/ALMA-13B-LoRA
- (starting from LLaMA-2-7B or LLaMA-2-13B)
Full-weight fine-tuning on monolingual data, followed by LoRA fine-tuning on high-quality parallel data
- LoRA rank: 16; only about 0.1% of the parameters are updated (7.7M for 7B, 12M for 13B)
Batch size: 256, warm-up ratio: 0.01, maximum sequence length: 512 tokens
Monolingual data fine-tuning: 20B tokens for LLaMA-2-7B, 12B tokens for LLaMA-2-13B
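
The figures in this setup can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below assumes nominal model sizes of exactly 7e9 and 13e9 parameters, and assumes that the batch size counts sequences that are each filled to the 512-token maximum. Neither assumption is stated in the text, so the step counts are lower bounds rather than reported values.

```python
# Back-of-the-envelope arithmetic for the training setup.
# Assumptions (not from the text): nominal sizes of exactly 7e9 and 13e9
# parameters, and 256 sequences per batch, each filled to 512 tokens.

NOMINAL_PARAMS = {"7B": 7_000_000_000, "13B": 13_000_000_000}
LORA_TRAINABLE = {"7B": 7_700_000, "13B": 12_000_000}
STAGE1_TOKENS = {"7B": 20_000_000_000, "13B": 12_000_000_000}

BATCH_SIZE = 256
MAX_SEQ_LEN = 512


def trainable_fraction(size):
    """Share of parameters updated by LoRA for the given model size."""
    return LORA_TRAINABLE[size] / NOMINAL_PARAMS[size]


def max_tokens_per_step():
    """Upper bound on tokens consumed by one optimizer step."""
    return BATCH_SIZE * MAX_SEQ_LEN


def min_steps(size):
    """Fewest steps needed to consume the stage-1 token budget."""
    per_step = max_tokens_per_step()
    return -(-STAGE1_TOKENS[size] // per_step)  # ceiling division


if __name__ == "__main__":
    print("max tokens per step:", max_tokens_per_step())
    for size in ("7B", "13B"):
        print(size, f"LoRA share: {trainable_fraction(size):.2%}",
              f"steps >= {min_steps(size)}")
```

Both LoRA shares come out close to the quoted 0.1%, and a full batch holds at most 131,072 tokens.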

5.3 BASELINES

Table 1: The overall results in en→xx. ALMA models significantly outperform all prior similar studies and are comparable to SoTA models. We categorize BLEU and COMET scores into three groups: scores that are more than 10 points below the higher value of GPT-4/GPT-3.5-T are emphasized in deep red boxes, those that are more than 5 points below are emphasized in shallow red boxes, and all other scores are emphasized in green boxes. Bold numbers represent the highest scores among ALMA models and prior similar studies.

Translation results (en→xx) are compared against Prior Similar Studies and SoTA Models.

5.4 RESULTS

Table 2: The overall results in xx→en. ALMA models significantly outperform all prior similar studies and are comparable to SoTA models. The color and boldface are the same in Table 1.

In conclusion, ALMA outperforms NLLB-54B and GPT-3.5-D, but falls slightly short of GPT-3.5-T and GPT-4.

6. ANALYSIS

6.1 HOW MUCH MONOLINGUAL DATA TO USE?

The best ALMA setting is fine-tuning on 20B tokens (7B model) or 12B tokens (13B model).
A checkpoint is saved and evaluated after every 1B monolingual tokens of fine-tuning

6.2 THE EFFECT OF MONOLINGUAL DATA AND PARALLEL DATA QUALITY

Figure 5: The average performance of ALMA-7B at the completion of each 1B-token fine-tuning. The scores in the figure are averaged across 10 directions

Table 3: Ablation study on the effect of monolingual data and parallel data quality. The backbone model is LLaMA-2-7B. A red cross (✘) in the table denotes the omission of monolingual data finetuning or parallel data (indicative of zero-shot translation). A green check (✔) signifies that the model undergoes fine-tuning with monolingual data.

6.3 IS MORE HUMAN-WRITTEN PARALLEL DATA BETTER?

Table 4: The performance of LLaMA-2-7B (post stage 1 fine-tuning) when fine-tuned exclusively on Flores versus when fine-tuned on both WMT and Flores.

6.4 PARALLEL DATA FINE-TUNING VS. IN-CONTEXT LEARNING

Table 5: The performance between 5-shot ICL and stage 2 fine-tuning using the LLaMA-2-13B model post stage 1 as the backbone. Our findings indicate that the quality of shots affects ICL performance. Notably, stage 2 fine-tuning markedly surpasses the 5-shot ICL and ICL does not help more on stage 2.

7. CONCLUSION

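Shows that, unlike conventional translation models, there is no need to collect extensive parallel data
Proposes a new training recipe with a two-stage process to raise LLM performance on non-English languages
Across the 10 translation directions, ALMA outperforms NLLB-54B and GPT-3.5-D on both BLEU and COMET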
