Trying Out OpenAI Fine-Tuning

Posted by ズィスト on February 9, 2023 - 11:59

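What Is OpenAI?

OpenAI is an artificial intelligence research organization founded in 2015 as a non-profit. OpenAI provides an API, and the following are examples of what that API can do.

Text completion
- Come up with catchphrases for a shop or similar business
- Sort text into categories
- Extract keywords from text
Code completion
- Generate source code from natural language
- Explain source code
- Point out mistakes in source code
Fine-tuning
- OpenAI provides several models, but those models alone sometimes do not give the expected answers
- Training a model with fine-tuning makes it possible to get the expected answers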

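What We Will Do

In this article we check how much fine-tuning improves accuracy. First we measure the accuracy of text completion with a model provided by OpenAI. Then we run fine-tuning and measure the accuracy of text completion again.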

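Data Used for Measurement

We use the imdb_reviews dataset from the Python package tensorflow_datasets. imdb_reviews contains movie reviews together with their rating (positive or negative). It is split into training data and test data: we use the test data to measure accuracy and the training data for fine-tuning.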

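Definition of Accuracy

For each movie review from imdb_reviews, we run a text completion and classify the review as positive or negative. We then count how many results match the rating in the test data and how many do not. Accuracy is the number of matches divided by the number of reviews tested, so more matches means higher accuracy.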
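The counting described above can be sketched in a few lines of Python. This is only an illustration: the function name accuracy and the toy labels in the usage note are assumptions, not part of the original experiment.

```python
def accuracy(predicted, expected):
    """Return (matches, mismatches, accuracy) for two equal-length label lists."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("at least one example is required")
    # Count positions where the predicted label equals the expected label.
    matches = sum(p == e for p, e in zip(predicted, expected))
    mismatches = len(expected) - matches
    return matches, mismatches, matches / len(expected)
```

For example, accuracy(["Positive", "Negative"], ["Positive", "Positive"]) counts one match and one mismatch, for an accuracy of 0.5.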

Accuracy Before Fine-Tuning

Measurement method

We use text completion to measure the accuracy before fine-tuning. The measurement conditions are as follows.

model: curie
prompt

Classify following sentence as Positive or Negative
#####
There are films that make careers. For George Romero, it was NIGHT OF THE LIVING DEAD; for Kevin Smith, CLERKS; for Robert Rodriguez, EL MARIACHI. Add to that list Onur Tukel's absolutely amazing DING-A-LING-LESS. Flawless film-making, and as assured and as professional as any of the aforementioned movies. I haven't laughed this hard since I saw THE FULL MONTY. (And, even then, I don't think I laughed quite this hard... So to speak.) Tukel's talent is considerable: DING-A-LING-LESS is so chock full of double entendres that one would have to sit down with a copy of this script and do a line-by-line examination of it to fully appreciate the, uh, breadth and width of it. Every shot is beautifully composed (a clear sign of a sure-handed director), and the performances all around are solid (there's none of the over-the-top scenery chewing one might've expected from a film like this). DING-A-LING-LESS is a film whose time has come. -> Positive
#####
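{movie review from imdb_reviews} ->

max_tokens: 1
stop: tive

The prompt starts with the instruction "Classify following sentence as Positive or Negative", followed by a separator. Next comes one worked example, written so that Positive or Negative appears after the ->. After another separator come the text to be classified and a trailing ->. Because we expect the model to return positive or negative, the maximum number of response tokens is set to 1, and generation stops as soon as the string tive is output.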
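As a rough sketch, the prompt layout described above could be assembled like this. The function and constant names are hypothetical, and joining the parts with single newlines is an assumption based on how the prompt is laid out in this article.

```python
INSTRUCTION = "Classify following sentence as Positive or Negative"
SEPARATOR = "#####"
ARROW = " ->"


def build_few_shot_prompt(example_review, example_label, target_review):
    """Build instruction + one labeled example + the review to classify."""
    lines = [
        INSTRUCTION,
        SEPARATOR,
        # The worked example shows the label after the arrow.
        f"{example_review}{ARROW} {example_label}",
        SEPARATOR,
        # The target review ends with the arrow so the model fills in the label.
        f"{target_review}{ARROW}",
    ]
    return "\n".join(lines)
```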

Suppose we measure the following test data.

                                              prompt completion
0  This was an absolutely terrible movie. Don't b...   Negative

In this case the prompt is as follows.

Classify following sentence as Positive or Negative
#####
There are films that make careers. For George Romero, it was NIGHT OF THE LIVING DEAD; for Kevin Smith, CLERKS; for Robert Rodriguez, EL MARIACHI. Add to that list Onur Tukel's absolutely amazing DING-A-LING-LESS. Flawless film-making, and as assured and as professional as any of the aforementioned movies. I haven't laughed this hard since I saw THE FULL MONTY. (And, even then, I don't think I laughed quite this hard... So to speak.) Tukel's talent is considerable: DING-A-LING-LESS is so chock full of double entendres that one would have to sit down with a copy of this script and do a line-by-line examination of it to fully appreciate the, uh, breadth and width of it. Every shot is beautifully composed (a clear sign of a sure-handed director), and the performances all around are solid (there's none of the over-the-top scenery chewing one might've expected from a film like this). DING-A-LING-LESS is a film whose time has come. -> Positive
#####
This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it. ->

We run the above and check whether the completion result matches the rating recorded in the test data. Repeating this 50 times gives the accuracy before fine-tuning.
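The measurement loop can be sketched as below. The actual API call is deliberately left out: complete is a hypothetical stand-in for whatever function sends a prompt to the model and returns the completion text, and build_prompt is any function that turns a review into a prompt. The article does not show the exact text returned when stop is set to tive, so matching on the prefix of the completion is an assumption.

```python
def normalize_label(completion_text):
    """Map raw completion text to 'Positive', 'Negative', or None."""
    text = completion_text.strip().lower()
    # Prefix matching tolerates output truncated by the stop sequence.
    if text.startswith("pos"):
        return "Positive"
    if text.startswith("neg"):
        return "Negative"
    return None


def evaluate(examples, complete, build_prompt):
    """Count matches and mismatches over (review, expected_label) pairs."""
    matches = 0
    mismatches = 0
    for review, expected in examples:
        predicted = normalize_label(complete(build_prompt(review)))
        if predicted == expected:
            matches += 1
        else:
            # Unrecognized completions (None) are counted as mismatches.
            mismatches += 1
    return matches, mismatches
```

Passing the completion function in as an argument keeps the counting logic testable without network access or an API key.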

Measurement results

The completion result matched the rating in the test data in 31 cases and did not match in 19 cases, giving an accuracy of 62% (31 out of 50).

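Fine-Tuning Procedure

Fine-tuning is carried out in the following steps. Each step is explained in detail in the sections below.

1. Prepare the training data
2. Format the training data with the CLI
3. Run fine-tuning using the output of step 2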

1. Preparing the training data

The training data must be prepared as a JSONL document like the following.

{"prompt":"{prompt text}","completion":"{expected result}"}
{"prompt":"{prompt text}","completion":"{expected result}"}
{"prompt":"{prompt text}","completion":"{expected result}"}

The code to extract the training data from imdb_reviews is as follows.

import pandas as pd
import tensorflow_datasets as tfds

# Load the IMDB reviews, split into training data and test data
train_data, test_data = tfds.load(name="imdb_reviews", split=['train', 'test'], as_supervised=True)
# Number of training examples to use
n_train = 100
# Take the first n_train examples as one batch of (review, label) tensors
train_examples_batch, train_labels_batch = next(iter(train_data.batch(n_train)))
# Decode the review texts from bytes to str
x_train_decoded = [x.decode() for x in train_examples_batch.numpy()]
# Convert the numeric ratings to label strings
label = {0: 'Negative', 1: 'Positive'}
y_train_decoded = [label[int(y)] for y in train_labels_batch.numpy()]
df_train = pd.DataFrame({'prompt': x_train_decoded, 'completion': y_train_decoded})
# Write the training data to a JSONL file
file_name = 'IMDB_train_100_for_fine_tune.jsonl'
train_jsonl = df_train.to_json(orient='records', force_ascii=False, lines=True)
# The with block closes the file, so no explicit close() is needed
with open(file_name, mode='w', encoding='utf-8') as f:
    f.write(train_jsonl)
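Before handing the file to the CLI, it can be worth checking that every line is valid JSON with exactly the two expected keys. The following is a minimal sketch using only the standard library; the function name validate_jsonl is hypothetical.

```python
import json

REQUIRED_KEYS = {"prompt", "completion"}


def validate_jsonl(path):
    """Return the number of records; raise ValueError on a malformed line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines such as a trailing newline
            record = json.loads(line)  # raises a ValueError subclass on bad JSON
            if not isinstance(record, dict) or set(record) != REQUIRED_KEYS:
                raise ValueError(
                    f"line {line_number}: expected keys {sorted(REQUIRED_KEYS)}"
                )
            count += 1
    return count
```

For the file written above, validate_jsonl('IMDB_train_100_for_fine_tune.jsonl') should return the number of training examples that were written.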

2. Formatting the training data with the CLI

We format the training data created above with the openai CLI tool. Note that this is run in a shell, not in Python.

$ openai tools fine_tunes.prepare_data -f <JSONL file>
Analyzing...

- Your file contains 100 prompt-completion pairs
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples. If you intend to do open-ended generation, then you should leave the prompts empty
- The completion should start with a whitespace character (` `). This tends to produce better results due to the tokenization we use. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details

Based on the analysis we will perform the following actions:
- [Recommended] Add a suffix separator ` ->` to all prompts [Y/n]: Y
- [Recommended] Add a whitespace character to the beginning of the completion [Y/n]: Y
- [Recommended] Would you like to split into training and validation set? [Y/n]: Y


Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified files to `IMDB_train_100_for_fine_tune_prepared_train.jsonl` and `IMDB_train_100_for_fine_tune_prepared_valid.jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "IMDB_train_100_for_fine_tune_prepared_train.jsonl" -v "IMDB_train_100_for_fine_tune_prepared_valid.jsonl" --compute_classification_metrics --classification_positive_class " Positive"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string ` ->` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=["tive"]` so that the generated texts ends at the expected place.
Once your model starts training, it'll approximately take 4.73 minutes to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.

Running the above creates two files.

IMDB_train_100_for_fine_tune_prepared_train.jsonl
IMDB_train_100_for_fine_tune_prepared_valid.jsonl

IMDB_train_100_for_fine_tune_prepared_train.jsonl contains 80 training examples.
IMDB_train_100_for_fine_tune_prepared_valid.jsonl contains 20 validation examples.
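To confirm the size and class balance of the two prepared files, the completions can be counted per label. This is a sketch under the assumption that each line is a JSON object with a completion field, as in the JSONL format shown earlier; the function name count_completions is hypothetical. Surrounding whitespace is stripped because the CLI adds a leading space to each completion.

```python
import json
from collections import Counter


def count_completions(path):
    """Count how many records carry each completion label in a JSONL file."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                # strip() removes the leading space added by prepare_data
                counts[json.loads(line)["completion"].strip()] += 1
    return counts
```

Calling sum(count_completions(path).values()) on each file gives the number of examples it contains.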

3. Running fine-tuning with the output of step 2

We run fine-tuning with the CLI. The API key must be set in an environment variable first, so run the following.

$ export OPENAI_API_KEY="<OPENAI_API_KEY>"

Then run the following to start fine-tuning.

$ openai api fine_tunes.create -t "IMDB_train_100_for_fine_tune_prepared_train.jsonl" -m curie -v "IMDB_train_100_for_fine_tune_prepared_valid.jsonl" --compute_classification_metrics --classification_positive_class " Positive"

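The options specified here are as follows.

-t
- Specifies the training file
-m
- Specifies the base model
-v
- Specifies the validation file
- When a validation file is given, the validation data is used periodically during training to check how well the fine-tuning is working, and the results are written to a file
--compute_classification_metrics
- For classification tasks, specifying this option writes additional information to the results file
--classification_positive_class
- For binary classification tasks, the positive class must be specified
- Here " Positive" (with a leading space) is specified as the positive class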
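If the job is launched from a script rather than typed by hand, the same command can be assembled as an argument list. This is only a sketch: the function name fine_tune_command is hypothetical, and the flags are copied from the command shown in this article rather than from a full CLI reference.

```python
import shlex


def fine_tune_command(train_file, valid_file, model="curie",
                      positive_class=" Positive"):
    """Build the argument list for the fine_tunes.create command shown above."""
    return [
        "openai", "api", "fine_tunes.create",
        "-t", train_file,          # training file
        "-m", model,               # base model
        "-v", valid_file,          # validation file
        "--compute_classification_metrics",
        # The positive class keeps its leading space, matching the prepared data.
        "--classification_positive_class", positive_class,
    ]
```

shlex.join(fine_tune_command(...)) gives a shell-quoted string for logging, and the list itself can be passed to subprocess.run.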

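Running the above does the following.

Uploads the training file and the validation file
Creates the fine-tuning job
Runs the fine-tuning
Uploads the resulting model
Uploads the results file (only if an option that produces a results file was specified)

While fine-tuning is running, the terminal shows output like the following.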

Upload progress: 100% 102k/102k [00:00<00:00, 40.5Mit
Uploaded file from IMDB_train_100_for_fine_tune_prepared_train.jsonl: file-vPKxxxxxxxxxxxxxxxxxx
Upload progress: 100% 24.9k/24.9k [00:00<00:00, 9.94M
Uploaded file from IMDB_train_100_for_fine_tune_prepared_valid.jsonl: file-4oBxxxxxxxxxxxxxxxxxx
Created fine-tune: ft-M3Ixxxxxxxxxxxxxxxxxxxxx
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-02-07 08:19:12] Created fine-tune: ft-M3Ixxxxxxxxxxxxxxxxxxxxx
[2023-02-07 08:23:47] Fine-tune costs $0.27
[2023-02-07 08:23:47] Fine-tune enqueued. Queue number: 18
==== (omitted) ====
[2023-02-07 08:56:15] Fine-tune is in the queue. Queue number: 0
[2023-02-07 08:57:06] Fine-tune started
[2023-02-07 08:58:07] Completed epoch 1/4
[2023-02-07 08:58:32] Completed epoch 2/4
[2023-02-07 08:58:55] Completed epoch 3/4
[2023-02-07 08:59:18] Completed epoch 4/4
[2023-02-07 08:59:38] Uploaded model: curie:ft-personal-yyyy-mm-dd-hh-mm-ss
[2023-02-07 08:59:39] Uploaded result file: file-xxxxxxxxxxxxxxxxxxxxxxxx
[2023-02-07 08:59:39] Fine-tune succeeded

Job complete! Status: succeeded
Try out your fine-tuned model:

openai api completions.create -m curie:ft-personal-yyyy-mm-dd-hh-mm-ss -p <YOUR_PROMPT>

From the output above, we can see that the following were created.

Fine-tuned model
- curie:ft-personal-yyyy-mm-dd-hh-mm-ss
Results file
- file-xxxxxxxxxxxxxxxxxxxxxxxx
- Note that the results file is shown as a file ID

To run text completion with this model, specify the model name above.
The results file can be retrieved with a request like the following.

curl https://api.openai.com/v1/files/{file_id}/content \
  -H 'Authorization: Bearer YOUR_API_KEY' > file
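The same request can be made from Python with the standard library. The sketch below mirrors the curl command above (same URL and Authorization header); the function names are hypothetical and error handling is omitted.

```python
import urllib.request


def result_file_request(file_id, api_key):
    """Build the GET request for the contents of an uploaded file."""
    url = f"https://api.openai.com/v1/files/{file_id}/content"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}
    )


def download_result_file(file_id, api_key, destination):
    """Download the file contents and save them to destination."""
    with urllib.request.urlopen(result_file_request(file_id, api_key)) as response:
        data = response.read()
    with open(destination, "wb") as f:
        f.write(data)
```

Building the request in its own function lets the URL and header be checked without making a network call.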

Accuracy After Fine-Tuning

Measurement method

After fine-tuning, we measure under the following conditions.

model: curie:ft-personal-yyyy-mm-dd-hh-mm-ss
- The model fine-tuned from curie
prompt

{movie review from imdb_reviews} ->

max_tokens: 1
stop: tive

Before fine-tuning, the prompt included an instruction and an example, but after fine-tuning these are no longer needed. An actual prompt looks like the following.

This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it. ->

We run the above and check whether the completion result matches the rating recorded in the test data. Repeating this 50 times gives the accuracy after fine-tuning.
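The conditions for the fine-tuned model can be collected into one set of request parameters, as sketched below. The helper names are hypothetical. The keys model, prompt, max_tokens and stop are the completion request parameters that correspond to the conditions listed above, and sending the request is left to whichever client is in use.

```python
PROMPT_SUFFIX = " ->"  # separator added to every prompt by prepare_data
STOP_SEQUENCE = "tive"


def build_fine_tuned_prompt(review):
    """Append the separator the fine-tuned model was trained to expect."""
    return review.rstrip() + PROMPT_SUFFIX


def completion_params(model, review):
    """Collect the measurement conditions into one dictionary."""
    return {
        "model": model,
        "prompt": build_fine_tuned_prompt(review),
        "max_tokens": 1,
        "stop": [STOP_SEQUENCE],
    }
```

Unlike the prompt used before fine-tuning, no instruction line or worked example is included; only the review and the trailing separator are sent.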

Measurement results

The completion result matched the rating in the test data in all 50 cases, giving an accuracy of 100%.

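Summary

First, to confirm how to run text completion, we ran it with a model provided by OpenAI. The accuracy was 62%.
The reference reports an accuracy of 60% with the same model and the same measurement method, so we were also able to confirm that the result is reproducible.
We then ran fine-tuning and repeated the text completion, and the accuracy was 100%.
A total of 80 training examples (positive: 40, negative: 40) were used for fine-tuning. The official documentation recommends at least 100 training examples per class for classification tasks, but fine-tuning turned out to be effective even with fewer examples than that.
We can also see that fine-tuning improved accuracy substantially, from 62% to 100% on the 50 test reviews.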

References
