We evaluated the performance of various Large Language Models (LLMs) in generating Data Management Plans (DMPs) that comply with National Institutes of Health (NIH) requirements. This evaluation drew on two analysis datasets: DMP_Automatic_Evaluation_Analysis and DMP_Human_Evaluation_Analysis.
In Phase 1, we will explore strategies for improving LLM performance on this task, such as Retrieval-Augmented Generation (RAG) and prompt engineering. We will also begin building dmpchef.org, which in this phase will support only requests for NIH DMP drafts. By the end of Phase 1, we expect to learn how much LLM output quality can improve when the system is tuned toward a specific DMP-generation task (here, NIH DMPs).
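The RAG strategy mentioned above can be sketched as follows: retrieve the guideline snippets most relevant to a user's request, then prepend them to the prompt given to the LLM. This is a minimal illustration only; the snippet texts, function names, and keyword-overlap retriever are placeholder assumptions (a production system would use embedding-based retrieval and the actual NIH guidance text), not the dmpchef.org implementation.

```python
from collections import Counter

# Illustrative stand-ins for retrievable NIH DMP guidance passages.
GUIDELINE_SNIPPETS = [
    "Element 1: Data Type. Describe the data types and estimated amounts of scientific data to be generated.",
    "Element 2: Related Tools, Software and/or Code. State whether specialized tools are needed to access the data.",
    "Element 4: Data Preservation, Access, and Associated Timelines. Name the repository where data will be archived.",
]

def score(query: str, doc: str) -> int:
    """Crude keyword-overlap score; a real system would use embeddings."""
    q = Counter(query.lower().split())
    d = Counter(doc.lower().split())
    return sum((q & d).values())

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k snippets most relevant to the query."""
    ranked = sorted(GUIDELINE_SNIPPETS, key=lambda s: score(query, s), reverse=True)
    return ranked[:k]

def build_prompt(request: str) -> str:
    """Assemble the augmented prompt: retrieved guidance + user request."""
    context = "\n".join(retrieve(request))
    return f"NIH DMP guidance:\n{context}\n\nDraft a DMP section for: {request}"

prompt = build_prompt("Which repository will preserve the genomic data")
```

Here the request about repositories pulls in the data-preservation element before the prompt is assembled, so the model drafts against the relevant guidance rather than from memory alone.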