boostcampaitech7 · github-classroom · Sep 6, 2024 · Sep 10, 2024 · Sep 10, 2024 · Sep 10, 2024
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,16 @@
+checkpoint/*
+!checkpoint/.gitkeep
+data/*
+!data/.gitkeep
+experiments/*
+!experiments/.gitkeep
+output/*
+!output/.gitkeep
+tb_logs/*
+!tb_logs/.gitkeep
+lightning_logs/*
+!lightning_logs/.gitkeep
+# Ignore all __pycache__ folders
+**/__pycache__/
+.idea/*
+.DS_Store
diff --git a/README.md b/README.md
@@ -0,0 +1,122 @@
+![alt text](banner.png)
+
+# Lv.1 NLP Basics Project: Semantic Textual Similarity (STS)
+
+</div>
+
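+## **Project Overview**
+> Duration: September 10, 2024 to September 26, 2024
+
+> Dataset:
+> - Training set: 9,324 examples
+> - Validation set: 550 examples
+> - Evaluation set: 1,100 examples
+>
+> Half of the evaluation data was used to compute the Public score shown on the live leaderboard; the other half was used to compute the Private result.
+
+This project is the NLP basics competition of the Level 1 course of Boostcamp AI Tech (7th cohort). The topic is sentence similarity measurement: we worked on Semantic Textual Similarity (STS), an N21 (many-to-one) NLP task that quantifies how similar two sentences are. Using the sentence pairs and similarity scores in the training data, we built a model that predicts the similarity of each sentence pair in the evaluation data as a value between 0 and 5.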
+
+
+## **Project Structure**
+```
+📦project1
+ ┣ 📂config
+ ┃ ┗ 📜config.yaml
+ ┣ 📂data
+ ┣ 📂model
+ ┃ ┗ 📜model.py
+ ┣ 📂output
+ ┣ 📂tb_logs
+ ┣ 📂utils
+ ┃ ┣ 📂ensemble
+ ┃ ┗ 📂preprocess
+ ┣ 📜README.md
+ ┣ 📜inference.py
+ ┣ 📜requirements.txt
+ ┗ 📜train.py
+```
+
+## **Contributors**
+
+<table align='center'>
+  <tr>
+    <td align="center">
+      <img src="https://github.com/yeseoLee.png" alt="이예서" width="100" height="100" style="border-radius: 50%;"/><br>
+      <a href="https://github.com/yeseoLee">
+        <img src="https://img.shields.io/badge/%EC%9D%B4%EC%98%88%EC%84%9C-grey?style=for-the-badge&logo=github" alt="badge 이예서"/>
+      </a>    
+    </td>
+    <td align="center">
+      <img src="https://github.com/Sujinkim-625.png" alt="김수진" width="100" height="100" style="border-radius: 50%;"/><br>
+      <a href="https://github.com/Sujinkim-625">
+        <img src="https://img.shields.io/badge/%EA%B9%80%EC%88%98%EC%A7%84-grey?style=for-the-badge&logo=github" alt="badge 김수진"/>
+      </a>    
+    </td>
+    <td align="center">
+      <img src="https://github.com/nevertmr.png" alt="김민서" width="100" height="100" style="border-radius: 50%;"/><br>
+      <a href="https://github.com/nevertmr">
+        <img src="https://img.shields.io/badge/%EA%B9%80%EB%AF%BC%EC%84%9C-grey?style=for-the-badge&logo=github" alt="badge 김민서"/>
+      </a>
+    </td>
+    <td align="center">
+      <img src="https://github.com/koreannn.png" alt="홍성재" width="100" height="100" style="border-radius: 50%;"/><br>
+      <a href="https://github.com/koreannn">
+        <img src="https://img.shields.io/badge/%ED%99%8D%EC%84%B1%EC%9E%AC-grey?style=for-the-badge&logo=github" alt="badge 홍성재"/>
+      </a>
+    </td>
+    <td align="center">
+      <img src="https://github.com/Effyee.png" alt="양가연" width="100" height="100" style="border-radius: 50%;"/><br>
+      <a href="https://github.com/Effyee">
+        <img src="https://img.shields.io/badge/%EC%96%91%EA%B0%80%EC%97%B0-grey?style=for-the-badge&logo=github" alt="badge 양가연"/>
+      </a>
+    </td>
+    <td align="center">
+      <img src="https://github.com/hsmin9809.png" alt="홍성민" width="100" height="100" style="border-radius: 50%;"/><br>
+      <a href="https://github.com/hsmin9809">
+        <img src="https://img.shields.io/badge/%ED%99%8D%EC%84%B1%EB%AF%BC-grey?style=for-the-badge&logo=github" alt="badge 홍성민"/>
+      </a> 
+    </td>
+  </tr>
+</table>
+
+## Roles
+
+| Name   | Role |
+| ------ | ---- |
+| 김민서 | Baseline code implementation, TensorBoard integration, model search on Hugging Face, modeling and tuning (`klue/roberta-large`, `klue/roberta-base`, `team-lucid/deberta-v3-base-korean`, `deliciouscat/kf-deberta-base-cross-sts`, `upskyy/kf-deberta-multitask`, `kakaobank/kf-deberta-base`, `klue/bert-base`), ensembling (`soft voting`, `weighted voting`) |
+| 김수진 | Search for models suited to the task, data augmentation (`swap`), data splitting, modeling and tuning (`snunlp/KR-ELECTRA-discriminator`), ensembling (`weighted voting`) |
+| 양가연 | Data preprocessing (`hanspell`, `soynlp`), data augmentation (`copied_sentence`, `swap`, `synonym replacement`, `undersampling`, `masking`), modeling and tuning (`kykim/electra-kor-base`, `snunlp/KR-ELECTRA-discriminator`, `klue/roberta-large`, `WandB`), ensembling (`weighted voting`) |
+| 이예서 | EDA (`label distribution`, `source distribution`, `sentence length analysis`), data preprocessing (`special character removal`, `choseong (initial-consonant) replacement`, `spacing/spelling correction`), data augmentation (`sentence swap`, `sentence copy`, `korEDA(SR, RI, RS)`, `K-TACC(BERT_RMR, ADVERB)`), ensembling (`weighted voting`) |
+| 홍성민 | Modeling and tuning (`kykim/KR-ELECTRA-Base`), ensembling (`weighted voting`), baseline code modifications and feature additions |
+| 홍성재 | Hyperparameter tuning (`BS`, `Epoch`, `LR`), model optimization and ensembling (`Koelectra-base-v3-discriminator`, `roberta-small`, `bert-base-multilingual-cased` / `Soft voting`) |
+
+## Dependencies
+* torch==2.1.0
+* transformers==4.35.2
+* pytorch-lightning==2.1.2
+
+## Usage
+1. Setting
+```
+$ pip install -r requirements.txt
+```
+2. Training
+```
+$ python3 train.py
+```
+3. Inference
+```
+$ python3 inference.py
+```
+
+## Project Timeline
+
+<img width="2715" alt="Gantt chart template (Community) (3)" src="https://github.com/user-attachments/assets/3a300753-f0f4-4d86-81ea-df66ed29ad9a">
+
+## Project Results
+
+<img width="3456" alt="Gantt chart template (Community) (4)" src="https://github.com/user-attachments/assets/02560fce-076e-4b82-b3a7-c35539615da1">
+
+## Leaderboard Results
+![image](https://github.com/user-attachments/assets/e666e639-3bfe-4bed-95b1-4fd3a93ed745)
+
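Review note on README.md: the role table mentions `soft voting` and `weighted voting` ensembles, but the code under `utils/ensemble` is not part of this diff. For a regression task such as STS, weighted voting usually reduces to a weighted average of the per-model predictions. A minimal sketch under that assumption; the function name `weighted_vote` and the example scores and weights are hypothetical and not taken from the repository:

```python
from typing import Sequence


def weighted_vote(
    predictions: Sequence[Sequence[float]], weights: Sequence[float]
) -> list[float]:
    """Return the weighted average of per-model predictions.

    predictions[m][i] is model m's score for sample i. Weights are
    normalized by their sum, so equal weights give plain soft voting.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must score the same samples")
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n_samples)
    ]


# Hypothetical scores from two models on three sentence pairs.
model_a = [4.0, 1.0, 2.5]
model_b = [3.0, 2.0, 2.5]
print(weighted_vote([model_a, model_b], weights=[3, 1]))  # [3.75, 1.25, 2.5]
```

In practice the weights would likely be chosen from validation performance, for example each model's validation Pearson score.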
diff --git a/banner.png b/banner.png
diff --git a/checkpoint/.gitkeep b/checkpoint/.gitkeep
diff --git a/config/config.yaml b/config/config.yaml
@@ -0,0 +1,27 @@
+user_name: seongmin # experimenter name
+model:
+  model_name: klue/roberta-small
+early_stopping:
+  min_delta: 0.0
+  mode: max
+  monitor: pearson/val
+  patience: 5
+  verbose: False
+train:
+    batch_size: 8
+    learning_rate: 1e-5
+    max_epoch: 10
+    LossF: torch.nn.MSELoss
+    optim: torch.optim.AdamW
+    ## LossF and optim must be written with the full torch.nn / torch.optim prefix
+    shuffle: True
+data:
+    train_path: ./data/raw/train_01.csv
+    dev_path: ./data/raw/dev.csv
+    test_path: ./data/raw/dev.csv
+    predict_path: ./data/raw/test.csv
+    checkpoint_path: ./checkpoint/
+    output_path: ./output/
+    submission_path: ./data/sample_submission.csv
+    val_path: ./data/dev.csv
+seed: 42
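Review note on config/config.yaml: `LossF` and `optim` are stored as dotted strings and `model/model.py` turns them into objects with `eval`. One possible alternative is to import the name explicitly and accept only the prefixes the config comment requires. The helper `resolve_dotted_name` below is a hypothetical sketch, not code from this repository; it is demonstrated with a standard-library name so it runs without torch installed:

```python
import importlib
from typing import Any, Tuple


def resolve_dotted_name(
    dotted: str, allowed_prefixes: Tuple[str, ...] = ("torch.nn.", "torch.optim.")
) -> Any:
    """Import and return the object named by a dotted path.

    Only names under `allowed_prefixes` are accepted, so a config value
    cannot run arbitrary code the way a string passed to eval() can.
    """
    if not dotted.startswith(allowed_prefixes):
        raise ValueError(f"{dotted!r} is not under {allowed_prefixes}")
    module_name, _, attr = dotted.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Demonstrated with a standard-library name so the sketch runs without torch.
ordered_dict_cls = resolve_dotted_name(
    "collections.OrderedDict", allowed_prefixes=("collections.",)
)
print(ordered_dict_cls.__name__)  # OrderedDict
```

With torch installed, `resolve_dotted_name("torch.nn.MSELoss")()` would give a loss instance and `resolve_dotted_name("torch.optim.AdamW")` the optimizer class, matching what the two `eval` calls produce today.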
diff --git a/data/.gitkeep b/data/.gitkeep
diff --git a/inference.py b/inference.py
@@ -0,0 +1,68 @@
+import argparse
+import yaml
+import pandas as pd
+import os
+from tqdm.auto import tqdm
+
+import torch
+
+# import transformers
+# import pandas as pd
+
+import pytorch_lightning as pl
+
+# import wandb
+##############################
+from utils import data_pipeline
+
+
+def get_latest_experiment_folder(base_path="./experiments"):
+    # Collect the sub-folders of base_path
+    experiment_folders = [
+        f for f in os.listdir(base_path) if os.path.isdir(os.path.join(base_path, f))
+    ]
+
+    # Fail early with a clear message when there is nothing to load
+    if not experiment_folders:
+        raise FileNotFoundError(f"No experiment folders found in {base_path}")
+
+    # Sort by modification time, most recent first
+    experiment_folders.sort(
+        key=lambda x: os.path.getmtime(os.path.join(base_path, x)), reverse=True
+    )
+
+    # Return the most recently modified folder
+    return experiment_folders[0]
+
+
+if __name__ == "__main__":
+
+    # Load the baseline config
+    with open("./config/config.yaml", encoding="utf-8") as f:
+        CFG = yaml.safe_load(f)
+
+    # Use the most recent experiment folder
+    exp_name = get_latest_experiment_folder()
+
+    # Set up the dataloader and the model
+    dataloader = data_pipeline.Dataloader(CFG)
+    model = torch.load(f"./experiments/{exp_name}/model.pt")
+    # Create the trainer instance
+    trainer = pl.Trainer(
+        accelerator="gpu",
+        devices=1,
+        max_epochs=CFG["train"]["max_epoch"],
+        log_every_n_steps=1,
+    )
+
+    # Inference part
+    predictions = trainer.predict(model=model, datamodule=dataloader)
+    ## Calls predict_dataloader on the datamodule
+
+    # Round the predictions to one decimal place to match the submission format.
+    predictions = list(round(float(i), 1) for i in torch.cat(predictions))
+
+    # Load the submission template, fill in the predictions, and write the output CSV.
+    output = pd.read_csv("./data/raw/sample_submission.csv")
+    output["target"] = predictions
+    output.to_csv(f"./output/output_({exp_name}).csv", index=False)
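Review note on inference.py: the script rounds predictions to one decimal place but does not constrain them to the 0 to 5 range described in the README, and a regression head can produce values slightly outside it. A minimal post-processing sketch, assuming clipping is acceptable for the submission; the helper `postprocess_scores` is hypothetical and not part of this diff:

```python
from typing import Iterable, List


def postprocess_scores(
    raw_scores: Iterable[float], low: float = 0.0, high: float = 5.0
) -> List[float]:
    """Clip raw regression outputs to [low, high], then round to one decimal."""
    return [round(min(max(float(s), low), high), 1) for s in raw_scores]


print(postprocess_scores([-0.2, 2.449, 5.3]))  # [0.0, 2.4, 5.0]
```

Clipping changes only out-of-range values, so it may lower the squared error on those samples while leaving in-range predictions untouched.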
diff --git a/model/model.py b/model/model.py
@@ -0,0 +1,108 @@
+import torch
+import transformers
+import torchmetrics
+import pytorch_lightning as pl
+
+
+class Model(pl.LightningModule):
+    def __init__(self, CFG):
+        super().__init__()
+        self.save_hyperparameters()
+
+        # Turn the loss and optimizer given as strings into callables
+        self.model_name = CFG["model"]["model_name"]
+        self.lr = float(CFG["train"]["learning_rate"])
+        self.loss_func = eval(CFG["train"]["LossF"])()
+        # self.optim is used in configure_optimizers
+        self.optim = eval(CFG["train"]["optim"])
+
+        ## Load the pretrained model named by model_name in CFG
+        self.plm = transformers.AutoModelForSequenceClassification.from_pretrained(
+            pretrained_model_name_or_path=self.model_name, num_labels=1
+        )
+
+    def forward(self, x):
+        x = self.plm(x)["logits"]
+
+        return x
+
+    def training_step(self, batch, batch_idx):
+        x, y = batch
+        logits = self(x)
+        loss = self.loss_func(logits, y.float())
+
+        # Compute the Pearson correlation coefficient
+        pearson = torchmetrics.functional.pearson_corrcoef(
+            logits.squeeze(), y.squeeze()
+        )
+
+        # Previous code
+        # self.log("train_loss", loss)
+
+        # Log per step and aggregated per epoch
+        self.log("loss/train", loss, on_step=True, on_epoch=True)
+        self.log("pearson/train", pearson, on_step=True, on_epoch=True)
+
+        # Also write to TensorBoard with the epoch index as the x-axis (one point per batch)
+        self.logger.experiment.add_scalar("loss/train_epoch", loss, self.current_epoch)
+        self.logger.experiment.add_scalar(
+            "pearson/train_epoch", pearson, self.current_epoch
+        )
+
+        return loss
+
+    def validation_step(self, batch, batch_idx):
+        x, y = batch
+        logits = self(x)
+        loss = self.loss_func(logits, y.float())
+
+        # Compute the Pearson correlation coefficient
+        pearson = torchmetrics.functional.pearson_corrcoef(
+            logits.squeeze(), y.squeeze()
+        )
+
+        # Previous code
+        # self.log("val_loss", loss)
+        # self.log("val_pearson", torchmetrics.functional.pearson_corrcoef(logits.squeeze(), y.squeeze()))
+
+        # Log the loss per epoch; log Pearson per step and aggregated per epoch
+        self.log("loss/val", loss, on_step=False, on_epoch=True)
+        self.log("pearson/val", pearson, on_step=True, on_epoch=True)
+
+        # Also write to TensorBoard with the epoch index as the x-axis (one point per batch)
+        self.logger.experiment.add_scalar("loss/val_epoch", loss, self.current_epoch)
+        self.logger.experiment.add_scalar(
+            "pearson/val_epoch", pearson, self.current_epoch
+        )
+
+        return loss
+
+    def test_step(self, batch, batch_idx):
+        x, y = batch
+        logits = self(x)
+
+        # Compute the Pearson correlation coefficient
+        pearson = torchmetrics.functional.pearson_corrcoef(
+            logits.squeeze(), y.squeeze()
+        )
+
+        # Previous code
+        # self.log("test_pearson", torchmetrics.functional.pearson_corrcoef(logits.squeeze(), y.squeeze()))
+
+        # Log per step and aggregated per epoch
+        self.log("pearson/test", pearson, on_step=True, on_epoch=True)
+
+        # Also write to TensorBoard with the epoch index as the x-axis (one point per batch)
+        self.logger.experiment.add_scalar(
+            "pearson/test_epoch", pearson, self.current_epoch
+        )
+
+    def predict_step(self, batch, batch_idx):
+        x = batch
+        logits = self(x)
+
+        return logits.squeeze(-1)  # keep the batch dim so torch.cat works for a batch of one
+
+    def configure_optimizers(self):
+        optimizer = self.optim(self.parameters(), lr=self.lr)
+        return optimizer
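Review note on model/model.py: the model is trained with the loss named in the config (`torch.nn.MSELoss`) while Pearson correlation is logged as the metric. The two can disagree, because Pearson is unchanged by a constant shift or positive rescaling of the predictions whereas MSE is not. A small pure-Python illustration; this `pearson_corrcoef` is a stand-in written for the example, not the torchmetrics implementation:

```python
import math
from typing import Sequence


def pearson_corrcoef(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of std devs."""
    if len(preds) != len(targets) or len(preds) < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    mean_p = sum(preds) / len(preds)
    mean_t = sum(targets) / len(targets)
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(preds, targets))
    var_p = sum((p - mean_p) ** 2 for p in preds)
    var_t = sum((t - mean_t) ** 2 for t in targets)
    if var_p == 0 or var_t == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_t)


targets = [0.0, 1.0, 2.0, 3.0, 4.0]
shifted = [t + 1.0 for t in targets]  # every prediction is off by exactly 1.0

mse = sum((p - t) ** 2 for p, t in zip(shifted, targets)) / len(targets)
print(mse)  # 1.0
print(pearson_corrcoef(shifted, targets))  # 1.0
```

Here the predictions rank and space the pairs perfectly (correlation 1.0) yet still carry a nonzero squared error, which is one reason to track both quantities.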
diff --git a/output/.gitkeep b/output/.gitkeep
diff --git a/requirements.txt b/requirements.txt
@@ -0,0 +1,16 @@
+transformers==4.35.2
+wandb==0.18.0
+torchmetrics==1.2.0
+torch==2.1.0
+tokenizers==0.15.0
+seaborn==0.13.2
+pytorch-lightning==2.1.2
+pandas==2.1.3
+matplotlib==3.9.2
+hydra-core==1.3.2
+huggingface-hub==0.19.4
+scikit-learn==1.2.2
+scipy==1.10.1
+numpy==1.24.3
+joblib==1.2.0
+tqdm
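Review note on requirements.txt: most entries are pinned with `==`, while `tqdm` is not. A quick way to check an environment against such pins is to parse the file and compare with the installed versions via the standard-library `importlib.metadata`. The helpers `parse_pins` and `find_mismatches` are hypothetical names for this sketch, and it handles only plain `name==version` lines:

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def parse_pins(lines: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map package name -> pinned version (None when the line has no `==`)."""
    pins: Dict[str, Optional[str]] = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        name, sep, version = line.partition("==")
        pins[name.strip()] = version.strip() if sep else None
    return pins


def find_mismatches(pins: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Return {package: problem} for pins the current environment violates."""
    problems: Dict[str, str] = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if wanted is not None and installed != wanted:
            problems[name] = f"installed {installed}, pinned {wanted}"
    return problems


pins = parse_pins(["torch==2.1.0", "pytorch-lightning==2.1.2", "tqdm"])
print(pins)  # {'torch': '2.1.0', 'pytorch-lightning': '2.1.2', 'tqdm': None}
```

Calling `find_mismatches(parse_pins(open("requirements.txt")))` would then list any package that is missing or installed at a different version.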
diff --git a/tb_logs/.gitkeep b/tb_logs/.gitkeep