Commit c81f36c

feat: add upload and download modules for external services
Add upload support for Figshare, Zenodo, Internet Archive, and triplestore. Add download support for Figshare.
1 parent 5915b8d commit c81f36c

12 files changed: +993 -26 lines changed

README.md

Lines changed: 76 additions & 12 deletions
@@ -1,28 +1,95 @@
 # Piccione
 
-[![Run tests](https://github.com/arcangelo7/piccione/actions/workflows/tests.yml/badge.svg)](https://github.com/arcangelo7/piccione/actions/workflows/tests.yml)
+Pronounced *Py-ccione*.
+
+[![Run tests](https://github.com/opencitations/piccione/actions/workflows/tests.yml/badge.svg)](https://github.com/opencitations/piccione/actions/workflows/tests.yml)
 [![License: ISC](https://img.shields.io/badge/License-ISC-blue.svg)](https://opensource.org/licenses/ISC)
 
-A Python toolkit for uploading and downloading data to external repositories and cloud services..
+**PICCIONE** - Python Interface for Cloud Content Ingest and Outbound Network Export
+
+A Python toolkit for uploading and downloading data to external repositories and cloud services.
 
 ## Installation
 
 ```bash
 pip install piccione
 ```
 
-## Usage
+## Modules
+
+### Upload
+
+#### Figshare
+Upload files to Figshare.
+
+```bash
+python -m piccione.upload.on_figshare config.yaml
+```
+
+Configuration file format:
+```yaml
+TOKEN: your_figshare_token
+ARTICLE_ID: 12345678
+files_to_upload:
+  - /path/to/file1.zip
+  - /path/to/file2.zip
+```
+
+#### Zenodo
+Upload files to Zenodo.
+
+```bash
+python -m piccione.upload.on_zenodo config.yaml
+```
+
+Configuration file format:
+```yaml
+access_token: your_zenodo_token
+project_id: 12345678
+zenodo_url: https://zenodo.org
+files:
+  - /path/to/file1.zip
+  - /path/to/file2.zip
+```
+
+#### Internet Archive
+Upload files to the Internet Archive.
+
+```bash
+python -m piccione.upload.on_internet_archive config.yaml
+```
+
+Configuration file format:
+```yaml
+identifier: my-archive-item
+access_key: your_access_key
+secret_key: your_secret_key
+file_paths:
+  - /path/to/file1.zip
+metadata:
+  title: My Archive Item
+  description: Description of the item
+```
 
-```python
-from piccione import example_function
+#### Triplestore
+Execute SPARQL UPDATE queries on a triplestore.
 
-result = example_function()
-print(result)
+```bash
+python -m piccione.upload.on_triplestore http://localhost:8890/sparql /path/to/sparql/folder
+```
+
+### Download
+
+#### Figshare
+Download all files from a Figshare article.
+
+```bash
+python -m piccione.download.from_figshare 12345678 -o /output/directory
 ```
 
 ## Documentation
 
-Full documentation is available at: https://arcangelo7.github.io/piccione/
+Full documentation is available at: https://opencitations.github.io/piccione/
 
 ## Development
 
@@ -31,11 +98,8 @@ This project uses [UV](https://docs.astral.sh/uv/) for dependency management.
 ### Setup
 
 ```bash
-# Clone the repository
-git clone https://github.com/arcangelo7/piccione.git
+git clone https://github.com/opencitations/piccione.git
 cd piccione
-
-# Install dependencies
 uv sync --all-extras --dev
 ```
 
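
The README changes above list one configuration layout per upload module. As a rough illustration only, the sketch below checks an already-parsed configuration mapping against those key names. `EXPECTED_KEYS` and `missing_keys` are hypothetical names that do not appear in this commit, and treating every listed key as required is an assumption, since the diff does not show how the upload modules validate their input.

```python
# Hypothetical helper, not part of piccione. The key names are copied from the
# README examples; treating all of them as required is an assumption.
EXPECTED_KEYS = {
    "figshare": {"TOKEN", "ARTICLE_ID", "files_to_upload"},
    "zenodo": {"access_token", "project_id", "zenodo_url", "files"},
    "internet_archive": {
        "identifier", "access_key", "secret_key", "file_paths", "metadata",
    },
}


def missing_keys(service: str, config: dict) -> list[str]:
    """Return the README-listed keys absent from a parsed config, sorted."""
    return sorted(EXPECTED_KEYS[service] - set(config))


if __name__ == "__main__":
    # A Figshare config that forgot its file list.
    config = {"TOKEN": "your_figshare_token", "ARTICLE_ID": 12345678}
    print(missing_keys("figshare", config))
```

A check like this could run right after the YAML file is loaded, so that a misspelled key is reported before any network request is made.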

pyproject.toml

Lines changed: 11 additions & 4 deletions
@@ -17,12 +17,19 @@ classifiers = [
     "Programming Language :: Python :: 3.12",
     "Programming Language :: Python :: 3.13",
 ]
-dependencies = []
+dependencies = [
+    "internetarchive>=5.7.1",
+    "pyyaml>=6.0.3",
+    "redis>=7.1.0",
+    "requests>=2.32.5",
+    "sparqlite>=1.0.0",
+    "tqdm>=4.67.1",
+]
 
 [project.urls]
-Homepage = "https://github.com/arcangelo7/piccione"
-Documentation = "https://arcangelo7.github.io/piccione/"
-Repository = "https://github.com/arcangelo7/piccione"
+Homepage = "https://github.com/opencitations/piccione"
+Documentation = "https://opencitations.github.io/piccione/"
+Repository = "https://github.com/opencitations/piccione"
 
 [dependency-groups]
 dev = [

src/piccione/__init__.py

Lines changed: 1 addition & 10 deletions
@@ -1,12 +1,3 @@
-"""A Python toolkit for uploading and downloading data to external repositories and cloud services.."""
+"""PICCIONE - Python Interface for Cloud Content Ingest and Outbound Network Export."""
 
 __version__ = "0.1.0"
-
-
-def example_function() -> str:
-    """Return a greeting message.
-
-    Returns:
-        A greeting string.
-    """
-    return "Hello from piccione!"

src/piccione/download/__init__.py

Whitespace-only changes.
Lines changed: 126 additions & 0 deletions
@@ -0,0 +1,126 @@
+#!/usr/bin/python
+# -*- coding: utf-8 -*-
+# Copyright (c) 2025, Arcangelo Massari <[email protected]>
+#
+# Permission to use, copy, modify, and/or distribute this software for any purpose
+# with or without fee is hereby granted, provided that the above copyright notice
+# and this permission notice appear in all copies.
+#
+# THE SOFTWARE IS PROVIDED 'AS IS' AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH
+# REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
+# FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT,
+# OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE,
+# DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS
+# ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS
+# SOFTWARE.
+
+"""
+Download files from a Figshare article using the Figshare API.
+
+This script downloads all files associated with a Figshare article ID.
+It uses the public Figshare API which works reliably unlike direct wget/curl
+on Figshare URLs.
+"""
+
+import argparse
+import hashlib
+import sys
+from pathlib import Path
+
+import requests
+from tqdm import tqdm
+
+BASE_URL = "https://api.figshare.com/v2"
+CHUNK_SIZE = 8192
+
+
+def get_article_metadata(article_id):
+    """Retrieve article metadata from Figshare API."""
+    url = f"{BASE_URL}/articles/{article_id}"
+    response = requests.get(url)
+    response.raise_for_status()
+    article_data = response.json()
+
+    # Figshare API has a default limit of 10 files. We need to fetch files separately with pagination.
+    files_url = f"{BASE_URL}/articles/{article_id}/files"
+    files_response = requests.get(files_url, params={"page_size": 1000})
+    files_response.raise_for_status()
+    article_data['files'] = files_response.json()
+
+    return article_data
+
+
+def download_file(download_url, output_path, expected_size, expected_md5=None):
+    """Download a file from URL with progress bar and optional MD5 verification."""
+    response = requests.get(download_url, stream=True)
+    response.raise_for_status()
+
+    md5_hash = hashlib.md5()
+
+    with open(output_path, 'wb') as f:
+        with tqdm(total=expected_size, unit='B', unit_scale=True, unit_divisor=1024) as pbar:
+            for chunk in response.iter_content(chunk_size=CHUNK_SIZE):
+                f.write(chunk)
+                md5_hash.update(chunk)
+                pbar.update(len(chunk))
+
+    if expected_md5:
+        actual_md5 = md5_hash.hexdigest()
+        if actual_md5 != expected_md5:
+            raise ValueError(f"MD5 mismatch: expected {expected_md5}, got {actual_md5}")
+        print(f" MD5 checksum verified: {actual_md5}")
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Download files from a Figshare article"
+    )
+    parser.add_argument(
+        "article_id",
+        type=int,
+        help="Figshare article ID"
+    )
+    parser.add_argument(
+        "-o", "--output-dir",
+        type=Path,
+        default=Path("."),
+        help="Output directory for downloaded files (default: current directory)"
+    )
+
+    args = parser.parse_args()
+
+    args.output_dir.mkdir(parents=True, exist_ok=True)
+
+    print(f"Fetching metadata for article {args.article_id}...")
+    metadata = get_article_metadata(args.article_id)
+
+    files = metadata.get("files", [])
+    if not files:
+        print("No files found in this article")
+        return 1
+
+    print(f"\nFound {len(files)} file(s) to download:")
+    for f in files:
+        size_mb = f['size'] / (1024 * 1024)
+        print(f" - {f['name']} ({size_mb:.2f} MB)")
+
+    print(f"\nDownloading to: {args.output_dir.absolute()}\n")
+
+    for file_info in files:
+        filename = file_info['name']
+        download_url = file_info['download_url']
+        size = file_info['size']
+        md5 = file_info.get('supplied_md5')
+
+        output_path = args.output_dir / filename
+
+        print(f"Downloading {filename}...")
+        download_file(download_url, output_path, size, md5)
+        print(f" Saved to {output_path}\n")
+
+    print("All files downloaded successfully")
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
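
The `download_file` function above updates an MD5 digest chunk by chunk while the response is streamed to disk, then compares it with the checksum Figshare supplied. The same technique can be tried without any network access. The following is a minimal sketch, assuming an in-memory `io.BytesIO` stream in place of the HTTP response; `md5_of_stream` and `verify_md5` are hypothetical names that are not part of this commit.

```python
import hashlib
import io

CHUNK_SIZE = 8192  # same chunk size as the module above


def md5_of_stream(stream, chunk_size=CHUNK_SIZE):
    """Hash a binary stream chunk by chunk, without loading it all at once."""
    digest = hashlib.md5()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty read signals end of stream
            break
        digest.update(chunk)
    return digest.hexdigest()


def verify_md5(stream, expected_md5):
    """Raise ValueError on mismatch, mirroring the check in download_file."""
    actual_md5 = md5_of_stream(stream)
    if actual_md5 != expected_md5:
        raise ValueError(f"MD5 mismatch: expected {expected_md5}, got {actual_md5}")
    return actual_md5


if __name__ == "__main__":
    payload = b"piccione" * 5000  # 40,000 bytes, so it spans several chunks
    expected = hashlib.md5(payload).hexdigest()
    print(verify_md5(io.BytesIO(payload), expected) == expected)
```

Hashing incrementally keeps memory use bounded by the chunk size, which matters for the multi-gigabyte archives a Figshare article may hold.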

src/piccione/upload/__init__.py

Whitespace-only changes.
Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+from typing import Set
+
+import redis
+from redis.exceptions import ConnectionError as RedisConnectionError
+
+
+class CacheManager:
+    REDIS_KEY = "processed_files"
+
+    def __init__(
+        self,
+        redis_host: str = "localhost",
+        redis_port: int = 6379,
+        redis_db: int = 4,
+    ):
+        self._redis = None
+        self.redis_host = redis_host
+        self.redis_port = redis_port
+        self.redis_db = redis_db
+        self.processed_files: Set[str] = set()
+
+        self._init_cache()
+
+    def _init_redis(self) -> None:
+        """Initialize Redis connection."""
+        try:
+            self._redis = redis.Redis(
+                host=self.redis_host,
+                port=self.redis_port,
+                db=self.redis_db,
+                decode_responses=True,
+            )
+            self._redis.ping()
+        except RedisConnectionError:
+            raise RuntimeError("Redis is not available. Cache requires Redis.")
+
+    def _init_cache(self) -> None:
+        """Initialize cache from Redis."""
+        self._init_redis()
+        existing_redis_files = self._redis.smembers(self.REDIS_KEY)
+        self.processed_files.update(existing_redis_files)
+
+    def add(self, filename: str) -> None:
+        """Add a file to the cache."""
+        self.processed_files.add(filename)
+        self._redis.sadd(self.REDIS_KEY, filename)
+
+    def __contains__(self, filename: str) -> bool:
+        """Check if a file is in the cache."""
+        return filename in self.processed_files
+
+    def get_all(self) -> Set[str]:
+        """Return all files in the cache."""
+        self.processed_files.update(self._redis.smembers(self.REDIS_KEY))
+        return self.processed_files
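
`CacheManager` above keeps a local Python set and mirrors every addition into a Redis set, so that a later run can pick up where an earlier one stopped. The sketch below illustrates that mirroring pattern without a Redis server. `FakeRedisSet` and `MirroredCache` are hypothetical stand-ins, not part of this commit, and the fake backend implements only `sadd` and `smembers` under simplified semantics.

```python
from typing import Dict, Set


class FakeRedisSet:
    """Hypothetical in-memory stand-in exposing only sadd and smembers."""

    def __init__(self) -> None:
        self._sets: Dict[str, Set[str]] = {}

    def sadd(self, key: str, *values: str) -> int:
        members = self._sets.setdefault(key, set())
        before = len(members)
        members.update(values)
        return len(members) - before  # number of newly added members

    def smembers(self, key: str) -> Set[str]:
        return set(self._sets.get(key, set()))


class MirroredCache:
    """Local set mirrored into a backend set, as CacheManager does with Redis."""

    KEY = "processed_files"

    def __init__(self, backend) -> None:
        self._backend = backend
        # Load whatever an earlier run already recorded in the backend.
        self.processed_files: Set[str] = set(backend.smembers(self.KEY))

    def add(self, filename: str) -> None:
        self.processed_files.add(filename)
        self._backend.sadd(self.KEY, filename)

    def __contains__(self, filename: str) -> bool:
        return filename in self.processed_files


if __name__ == "__main__":
    backend = FakeRedisSet()
    MirroredCache(backend).add("file1.zip")  # first run records a file
    resumed = MirroredCache(backend)         # second run reloads the state
    print("file1.zip" in resumed, "file2.zip" in resumed)
```

Membership tests hit only the local set, so they cost no network round trip; the backend is consulted at start-up and on each write.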
