NLS Fellowship 2024

This repository contains a series of Jupyter Notebooks developed during the National Librarian’s Research Fellowship in Digital Scholarship 2024-25 to interact with the Archive of Tomorrow - Talking about Health web archive collection metadata. The notebooks guide users through various data processing and analysis steps, from metadata conversion to visualization.

Project Structure

Step_1_JSON_metadata_to_CSV.ipynb: Converts JSON metadata files into CSV format for easier analysis.
Step_2_Link_checker.ipynb: Validates URLs in the dataset to ensure they are accessible.
Step_3_Enhancing the metadata.ipynb: Enriches the existing metadata with additional information using the Newspaper3k package.
Step_4_LLM_Summaries.ipynb: Generates summaries using LLMs.
Step_5_keyword_wordcloud.ipynb: Creates word clouds based on extracted keywords.
Step_6_Text analysis and visualisation: Coming soon.
Experiments: Code and outputs used for performance evaluation.
AoT_data_enhanced_labelled.csv: The final enhanced dataset. It contains a URL valid at the time of uploading, scraped information (title, keywords, summaries), LLM-generated summaries, manual category labels, predicted category labels, and the confidence score of each prediction.
Data: The data directory contains datasets used across the notebooks. Ensure that the required data files are present before running the notebooks.
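
To illustrate what the first two steps do, here is a minimal sketch using only the Python standard library. It is not the code from the notebooks: the function names `records_to_csv` and `is_reachable` and the sample field names (`title`, `url`, `language`) are assumptions made for this example, and the real metadata may be nested or use different keys.

```python
import csv
import io
import json
import urllib.error
import urllib.request


def records_to_csv(json_text: str) -> str:
    """Convert a JSON array of flat metadata records into CSV text."""
    records = json.loads(json_text)
    # Collect every key that appears, in first-seen order, so no field is dropped.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # keys missing from a record become empty cells
    return buffer.getvalue()


def is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with a non-error HTTP status."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        # Covers HTTP errors, DNS failures, timeouts, and malformed URLs.
        return False


# Hypothetical sample records; the real field names may differ.
sample = json.dumps([
    {"title": "Example page", "url": "https://example.org/"},
    {"title": "Second page", "url": "https://example.org/2", "language": "en"},
])
csv_text = records_to_csv(sample)
print(csv_text)
```

Some servers reject HEAD requests, so a fuller link checker would probably fall back to GET before marking a URL as unreachable.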

Setup Instructions

To set up the environment and run the notebooks locally:

Clone the repository:
```
git clone https://github.com/aurigandrea/NLS-Fellowship-2024.git
cd NLS-Fellowship-2024
```
Create a virtual environment (optional):
```
python -m venv venv
source venv/bin/activate
```
On Windows: venv\Scripts\activate


Install requirements:
```
pip install -r requirements.txt
```
Launch Jupyter Notebook:
```
jupyter notebook
```

🔗 Launch on Binder

Or launch a specific notebook directly (make sure to grab the relevant data or use your own):

👤 Author and license

Andrea Kocsis

Developed during the NLS Fellowship 2024.

Feel free to explore, use, or build upon this work! MIT License.

Data owner: National Library of Scotland, CC BY 4.0. Find out more about the project here: Find the original data and the Datasheet for Dataset here:
