
Books dataset #21


Open: wants to merge 1 commit into base: master


originalankur
Contributor

Extracted from the Open Library dataset.

# Books Dataset

The `books.json` file is a subset of the Open Library [books datasets](https://openlibrary.org/developers/dumps).

Contributor


I think we would need to add the CC0 1.0 Universal license here: https://openlibrary.org/help/faq/using#ownership

Contributor Author


@Haroenv To the best of my knowledge, the following rules apply under the CC0 1.0 Universal license:

  • You may use the dataset for commercial purposes.
  • No need to cite or reference the license.
  • Attribution is optional, not required.

Contributor Author


@Haroenv If you insist, I will add a copy of the license to the folder. Please advise.

Haroenv requested review from pixelastic and chuckmeyer on July 2, 2025, 13:57
@pixelastic
Contributor

Hey @originalankur, thanks for the PR.

I had a look at the content of the file, and I'm afraid some of the books might contain sensitive content (at least one suspicious case of doxxing, and mentions of child pornography) that we don't really want in our public list of data.

I cleaned the list, which reduced the number of books from ~33k to ~24k (and also puts the file size at 49MB, just below the suggested 50MB GitHub limit).
You can find my clean version in the books-clean branch.

Can you pull it in to replace your version, please?
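For anyone who wants to repeat this kind of cleanup, here is a rough sketch of a first-pass filter. It is not the process actually used for the books-clean branch (that is not described in this thread). The field names `title` and `description`, the helper names, and the blocked-term list are all assumptions made for illustration. A keyword filter like this only narrows down candidates, so a manual review is still needed afterwards.

```python
import json

# 50MB guideline mentioned in the comment above.
SIZE_LIMIT_BYTES = 50 * 1024 * 1024


def is_clean(book, blocked_terms, fields=("title", "description")):
    """Return True if no blocked term appears in the checked fields.

    The field names are hypothetical; adjust them to the real schema.
    """
    text = " ".join(str(book.get(field, "")) for field in fields).lower()
    return not any(term in text for term in blocked_terms)


def clean_books(books, blocked_terms):
    """Keep only the records that pass the keyword check (case-insensitive)."""
    terms = [term.lower() for term in blocked_terms]
    return [book for book in books if is_clean(book, terms)]


def serialized_size(books):
    """Size in bytes of the records once serialized as UTF-8 JSON."""
    return len(json.dumps(books, ensure_ascii=False).encode("utf-8"))


if __name__ == "__main__":
    # Tiny made-up sample standing in for the real books.json content.
    sample = [
        {"title": "Gardening Basics"},
        {"title": "Cooking", "description": "contains a FLAGGED-TERM here"},
    ]
    cleaned = clean_books(sample, ["flagged-term"])
    print(len(cleaned), "records kept")
    print("under size guideline:", serialized_size(cleaned) < SIZE_LIMIT_BYTES)
```

Checking the serialized size before committing makes it easy to confirm the file stays below the size guideline after each filtering pass.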

@originalankur
Contributor Author

@pixelastic Thank you for cleaning the data. I should have thought of this. I will update the PR. Thanks, Tim.


3 participants