LLM-Resilient Bibliometrics: Factual Consistency Through Entity Triplet Extraction

This GitHub repository provides the code accompanying the paper "LLM-Resilient Bibliometrics: Factual Consistency Through Entity Triplet Extraction". The code provides the full pipeline from raw arXiv PDFs to processed entity triplets of the form (subject, predicate, object).

Structure 📚

-- src
    |--- load_data.py: This file loads the PDFs and converts them to text files
    |--- preprocessing.py: This file preprocesses the text files
    |--- extract_claims.py: This file extracts the core claims from the text files
    |--- extract_triplets.py: This file extracts the triplets from the text files
    |--- clustering.py: This file clusters the triplets based on the subject, object, or both
    |--- helpers.py: This file provides a helper function for logging

-- requirements.txt: File with environment requirements

Requirements 🔍

Environment

The file requirements.txt lists the required packages, which are compatible with Python 3.11.7. An environment can be created with the following code snippet:

conda create --name <env_name> python=3.11.7
conda activate <env_name>
pip install -r requirements.txt

Data

Before using the code, you need to obtain the data and the claim extraction model.

  • The code is designed for the extraction of triplets from arXiv papers. These articles are publicly available in a Google Cloud bucket; for more information, see this Kaggle page.
  • The claim extraction model originates from the paper by Wei et al., "ClaimDistiller: Scientific Claim Extraction with Supervised Contrastive Learning". The trained WC-BiLSTM model is available in this drive. Place this model in a folder "models/claim_model" in the root directory (see the snippet after this list).
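
For example, the expected model folder can be created as follows; the exact file names inside models/claim_model depend on the downloaded WC-BiLSTM checkpoint and are not fixed by this README:

mkdir -p models/claim_model        # create the expected folder in the repository root
# copy the downloaded WC-BiLSTM model files from the drive link into models/claim_model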

Usage of the code 📝

The code is designed for the extraction of triplets from arXiv papers. There are several things to take into account when using the code:

  • First, define the set of target papers from which you want to extract triplets. Put these papers, together with their metadata, in a folder. You are then ready to load the data with load_data.py and preprocess the text with preprocessing.py.
  • Next, the claims can be extracted with extract_claims.py; make sure that the claim extraction model is in the correct place. Afterwards, the triplets can be extracted with extract_triplets.py.
  • Finally, you can cluster the triplets with clustering.py based on the embedding of the subject, the object, or both. An example end-to-end invocation is given after this list.
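
A minimal end-to-end run could then look like the sketch below. This assumes the scripts are executed from the repository root and locate their input and output folders themselves; check each script for the exact arguments it expects:

python src/load_data.py          # convert the arXiv PDFs to text files
python src/preprocessing.py      # preprocess the text files
python src/extract_claims.py     # extract the core claims (requires models/claim_model)
python src/extract_triplets.py   # extract (subject, predicate, object) triplets
python src/clustering.py         # cluster the triplets by subject, object, or both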
