The Open Research Converter (ORC) is a tool that allows users to convert proprietary and licensed bibliometric datasets to a shareable format via OpenAlex's API (see OpenAlex's API documentation).
The Open Research Converter has a demo running at orc-demo.gesis.org where you can try out its functionality. This URL may be subject to change or to removal after a period of time.
Bibliometrics, and scientometrics in particular, suffer from a lack of reproducibility: the databases used to perform bibliometrics are often proprietary and therefore bound by copyright and access agreements that forbid sharing the underlying data behind the scientific insights published in papers.
OpenAlex, released in 2022, is an open-source bibliometric database compiled by OurResearch, which releases its data with a maximally permissive copyright (specifically under the CC0 1.0 Universal deed), allowing free sharing of all data. This has allowed bibliometric researchers to download and interrogate the data as they see fit, and enables sharing of data.
However, dealing with OpenAlex data can be cumbersome. The methods of access are currently the website, the API, or a data dump, each of which poses challenges for researchers: the website limits the amount of information that can be displayed and may require downloading and then processing the data further to achieve the desired insights; the API requires a level of technical knowledge and is rate-limited by OpenAlex; and the data dumps are very large (approximately 300 GB at the time of writing) and also require technical knowledge to process and interrogate.
Easing the barrier of access to OpenAlex is a current theme of work in the bibliometrics community. For example, @massimo_2024 have created openalexR, a tool in the R programming language capable of bulk collection of OpenAlex data and of converting it from OpenAlex's JSON-based data format to a tabular format. Similarly, OpenAlex Networks is a Python library for generating OpenAlex datasets and processing citation and coauthorship networks, and OpenAlexNet is a C# wrapper for OpenAlex that enables searching OpenAlex.
Currently OpenAlex has no easy method for researchers to convert their datasets from proprietary formats to OpenAlex. It is possible to manually convert smaller datasets using OpenAlex's website, or to download the OpenAlex data dump and process it to enable matching, but neither approach scales well.
We provide here the Open Research Converter, a tool utilising the OpenAlex API that enables simple bulk conversion of bibliometric data to a shareable format.
If you wish to use the ORC without installing it locally:
- Navigate to https://orc-demo.gesis.org
- Enter your email address into the email box
- This is so that OpenAlex can monitor traffic; it places your requests in the "polite pool", where responses are faster and more consistent.
- Input your DOI data:
- The ORC expects a comma separated list of DOIs in the text box
- The ORC does not mind whether DOIs are prefaced with "https://doi.org/"
- Via csv file
- Browse to select a csv file; this will be read into the text box
- The ORC expects a single column of DOIs with a header
- Therefore if the first row contains a DOI it will not be parsed into the text box
- Via copy and paste into the text box
- You can also manually copy and paste your DOI data into the text box
- The ORC can accept thousands of DOIs, though this may take a few minutes.
- Click Submit
- A waiting animation should play in the right-hand output box. If this flashes and then disappears, your query may have been unsuccessful; please try once more, and then check your input.
- Wait for Output
- If your query is successful, the output box will show the first 50 OpenAlex IDs corresponding to your DOIs.
- If you have submitted more than 50 DOIs, click "download CSV" to download a csv file with the DOI in the first column and the corresponding OpenAlex ID in the second column.
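If your DOIs live in a spreadsheet, a small script can produce the comma-separated list the text box expects. A minimal sketch, assuming a single-column CSV with a header row as described above (the function name and sample data are illustrative):

```python
import csv
import io

def dois_to_input_string(csv_text: str) -> str:
    """Read a single-column CSV of DOIs (with a header row) and
    return the comma-separated list the ORC text box expects."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row; a DOI in the first row would be dropped
    # The ORC accepts DOIs with or without the https://doi.org/ prefix,
    # so no normalisation is needed here.
    return ", ".join(row[0].strip() for row in reader if row)

example = "doi\n10.1038/nature12373\nhttps://doi.org/10.1126/science.1231143\n"
print(dois_to_input_string(example))
# 10.1038/nature12373, https://doi.org/10.1126/science.1231143
```

The resulting string can be pasted straight into the text box.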
Should you wish to run the ORC locally using Docker, please follow these steps:
Prerequisites: Docker and Docker Compose installed
Step 1: Set up environment variables
The root .env file is required as it configures which nginx configuration to use.
Via makefile (Linux/macOS):
make set_envs
Or manually:
# IMPORTANT: Copy the root .env.template first
cp .env.template .env
# Then copy the service-specific env files
cp src/env_templates/backend.env.template src/env/backend.env
cp src/env_templates/frontend.env.template src/env/frontend.env
cp src/env_templates/js.env.template src/env/js.env
cp src/env_templates/nginx.env.template src/env/nginx.env
Note: The root .env file sets LOCAL_OR_PRODUCTION=local, which tells Docker which nginx config to use (local.default.conf vs prod.default.conf). Without this file, docker-compose will fail with ".default.conf: not found".
Step 2: Build and run
docker compose up --build -d
# Or via makefile: make run
Step 3: Access the application
Navigate to http://localhost or http://127.0.0.1
(Note: if your browser reports that a secure connection is not available, please check you are not using https)
For development without Docker, you can run the backend and frontend separately.
Prerequisites:
- Python 3.11+
- Node.js 16+ (or 22 for latest)
- Poetry (Python package manager)
Backend Setup:
# From project root
poetry install
# Run the backend server on port 8001
poetry run python -m quart --app src.orc.backend.orc_backend.app run --port 8001
Frontend Setup:
cd src/orc/frontend/orc-demo
# Install dependencies
npm install
# Configure API URL for local development
# Edit .env or create one with:
echo "REACT_APP_DEV_URL=http://localhost:8001" > .env
echo "REACT_APP_ENV=dev" >> .env
# Start the development server
npm start
CORS Configuration:
When running frontend and backend separately, you may encounter CORS issues. Two solutions:
- Add a proxy to package.json (recommended for development):
  { "proxy": "http://localhost:8001" }
  Then change REACT_APP_DEV_URL to an empty string or /.
- Add CORS headers to the backend (for testing only - not recommended for production)
- The ORC is still in development and may contain bugs, for example:
- If items are not found in OpenAlex, they may not be returned leading to a smaller number of items in the output
- If an error happens on the backend it may not be reported to the frontend properly, leading to a failure (the waiting ring disappears) without informing the user as to why.
The ORC exposes a REST API for programmatic access. Full OpenAPI specification is available at src/orc/backend/orc_backend/openapi.yaml.
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/ | API information page |
| GET | /api/healthcheck | Check OpenAlex API connectivity (returns 418 if healthy) |
| POST | /api/start_processing | Convert DOIs to OpenAlex IDs |
| POST | /api/process_all | Convert DOIs and get full OpenAlex metadata |
curl -X POST https://orc-demo.gesis.org/api/start_processing \
-H "Content-Type: application/json" \
-d '{"email": "your@email.com", "input_data": "10.1038/nature12373, 10.1126/science.1231143"}'
[{
"job_id": "uuid-string",
"output_data": ["https://openalex.org/W2102245935", "https://openalex.org/W2015936098"],
"output_full": "doi, oa_id\n...",
"submitted_count": 2,
"found_count": 2,
"missing_dois": [],
"invalid_dois": ["not-a-doi"]
}]
The response includes:
- submitted_count: Number of valid DOIs submitted for processing
- found_count: Number of DOIs found in OpenAlex
- missing_dois: List of valid DOIs not found in OpenAlex
- invalid_dois: List of input strings that failed DOI format validation
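These counts let a client verify that nothing was silently dropped: submitted_count should equal found_count plus the number of missing DOIs. A minimal sketch of such a consistency check on a response of the shape shown above (the sample values and the summarise helper are illustrative, not part of the ORC):

```python
def summarise(response: dict) -> str:
    """Summarise a /api/start_processing response and check that
    every valid DOI is accounted for (found or reported missing)."""
    found = response["found_count"]
    submitted = response["submitted_count"]
    missing = response["missing_dois"]
    assert submitted == found + len(missing), "some DOIs are unaccounted for"
    return (f"Found {found}/{submitted}; "
            f"{len(missing)} missing, {len(response['invalid_dois'])} invalid")

sample = {
    "submitted_count": 2,
    "found_count": 1,
    "missing_dois": ["https://doi.org/10.0000/does-not-exist"],  # hypothetical DOI
    "invalid_dois": ["not-a-doi"],
}
print(summarise(sample))  # Found 1/2; 1 missing, 1 invalid
```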
This section describes the complete flow from when a user submits DOIs to when results are returned.
┌─────────────────────────────────────────────────────────────────────────────┐
│ USER INTERFACE │
│ 1. User enters email and DOIs (via text input or CSV upload) │
│ 2. User clicks "Submit" │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ FRONTEND (React) │
│ 3. Validates email format (regex check) │
│ 4. Sends POST request to /api/start_processing with email and DOI list │
│ 5. Displays loading animation while waiting │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ BACKEND API (app.py) │
│ 6. Receives request at /start_processing endpoint │
│ 7. Creates OpenResearchConverter instance │
│ 8. Calls process() method with email and input data │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ ORCHESTRATOR (open_research_converter.py) │
│ 9. generate_new_job() - Creates unique job ID (UUID) │
│ 10. _receive_data() - Stores raw input in job dictionary │
│ 11. _validate_input_data() - Validates: │
│ • Job ID exists │
│ • Email is present and valid │
│ • Partitions DOIs into valid and invalid (Step 11a) │
│ • Invalid DOIs are stored separately and reported to the user │
│ • Processing continues with valid DOIs only │
│ 12. Normalizes DOIs to standard format (https://doi.org/...) │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ REQUESTER (requester.py) │
│ 13. _chunk_input_data() - Splits DOIs into chunks of 50 │
│ 14. _prepare_chunks() - Formats each chunk into OpenAlex API query │
│ • Creates filter query: works?filter=doi:DOI1|DOI2|DOI3... │
│ • Adds email to "polite pool" for better rate limits │
│ 15. _process_aio() - Sends concurrent requests using aiometer │
│ • Respects rate limits (max 10 requests/second) │
│ • Implements exponential backoff on failures │
│ 16. Collects responses and extracts DOI → OpenAlex ID pairs │
│ 17. Compares returned DOIs against submitted DOIs │
│ 18. Tracks missing DOIs (submitted but not found in OpenAlex) │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ RESPONSE ASSEMBLY │
│ 19. return_data() - Formats final response: │
│ • output_data: List of OpenAlex IDs │
│ • output_full: CSV string (doi, oa_id) │
│ • submitted_count: Valid DOIs submitted for processing │
│ • found_count: DOIs successfully matched │
│ • missing_dois: DOIs not found in OpenAlex │
│ • invalid_dois: Input strings that failed DOI format validation │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ FRONTEND (React) │
│ 20. Receives JSON response │
│ 21. Displays counter: "Found X/Y" (found_count/submitted_count) │
│ 22. Shows first 50 OpenAlex IDs in output box │
│ 23. Enables "Download CSV" button for full results │
│ 24. If invalid DOIs exist, shows expandable section to view/download them │
│ 25. If missing DOIs exist, shows expandable section to view/download them │
└─────────────────────────────────────────────────────────────────────────────┘
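Steps 13-14 in the flow above can be sketched concisely: DOIs are split into chunks of 50, and each chunk becomes one OpenAlex filter query with the user's email appended so requests join the polite pool. A sketch under the query format shown in the diagram (the exact request construction in requester.py may differ; the example DOIs are hypothetical):

```python
from urllib.parse import quote

CHUNK_SIZE = 50  # DOIs per request, as in _chunk_input_data (Step 13)

def chunk(dois: list[str], size: int = CHUNK_SIZE) -> list[list[str]]:
    """Split the DOI list into chunks of at most `size`."""
    return [dois[i:i + size] for i in range(0, len(dois), size)]

def chunk_to_query(dois: list[str], email: str) -> str:
    """Format one chunk as an OpenAlex works filter query (Step 14).
    DOIs are OR-ed with '|'; mailto= opts the request into the polite pool."""
    doi_filter = "|".join(dois)
    return f"https://api.openalex.org/works?filter=doi:{doi_filter}&mailto={quote(email)}"

dois = [f"10.0000/example.{i}" for i in range(120)]  # hypothetical DOIs
chunks = chunk(dois)
print(len(chunks))  # 120 DOIs -> 3 chunks (50 + 50 + 20)
print(chunk_to_query(chunks[0][:2], "you@example.com"))
```

Batching 50 DOIs into one filter query keeps the request count (and hence the impact of OpenAlex's rate limits) roughly 50 times lower than querying each DOI individually.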
The ORC runs in a containerised environment. To run it using the makefile, type make run.
Three containers are initialised: an nginx container that acts as a reverse proxy, a frontend container that serves a JavaScript-based website, and a backend container that provides the processing and the API interface.
Acts as a reverse proxy for the front- and back-end containers. Copies in the robots and 404 HTML pages and has two potential configurations, local and prod; which of these is used is selected by the .env file in the top-level directory.
local.default.conf is a simpler configuration designed for running the ORC locally. If you wish to deploy the ORC to a server and enable SSL, prod.default.conf allows for this configuration using certbot. The commands to trial and run the certbot authentication are in the makefile targets certificates_dry_run and certificates_create_and_load respectively. Further certbot configuration is found in docker-compose.yml.
A separate README detailing the Frontend container can be found at src/orc/frontend/orc-demo/README.md
Exposes port 8001 for app traffic.
Utilises Gunicorn for serving the app with hard-coded parameters (assistance with injecting these parameters into the entrypoint command without using shell-style bash -c ... would be appreciated). These can be changed in the ENTRYPOINT command in the Dockerfile.
- app.py
- Contains async API to interface with the JavaScript Application
- Route / - hello_world - Returns root HTML with noindex Robots
- Route /healthcheck - Queries OpenAlex to check there is a working connection
- Route /start_processing (Steps 6-8) - Queries OpenAlex for WorkIDs
- Route /process_all (Steps 6-8) - Queries OpenAlex for full bibliographic records
- open_research_converter.py
- OpenResearchConverter
- Contains code to coordinate processing of the input DOIs (data) and the values returned from OpenAlex (subclass of OpenAlexRequester)
- generate_new_job (Step 9)
- Creates UUID for job and assigns memory in dictionary for data
- process
- Checks input data is correctly formatted and begins querying OpenAlex for WorkIDs
- process_all
- Checks input data is correctly formatted and begins querying OpenAlex for full bibliometric data
- return_data (Step 19)
- Formats and returns data to frontend
- Private Functions:
- _recieve_data (Step 10)
- Stores input data with best effort to reformat correctly
- _validate_input_data (Step 11)
- Checks job exists, email exists and is correctly formatted, and partitions DOIs into valid and invalid
- _partition_dois (Step 11a)
- Separates input strings into valid and invalid DOIs; invalid DOIs are stored and reported, valid DOIs proceed to processing
- _validate_uuid
- Checks the UUID is in the job dictionary
- _validate_email
- Checks the email is a string. (Email regex exists on the frontend to check it is correctly formatted)
- _validate_data
- Checks the data is a list of valid DOIs (with or without the https://doi.org/ prefix)
- _doi_list_formatter (Step 12)
- Normalizes DOIs to include the https://doi.org/ prefix
- _check_ready
- Checks the formatted data (post _validate_data) is in the dictionary
- requester.py
- OpenAlexRequester
- Base class for accessing the OpenAlex API using an asynchronous httpx client, with exponential backoff in case of rate-limit breaking.
- health_check
- Tests connection to OpenAlex API
- Private Functions
- _process_aio (Steps 15-18)
- Coordinates processing the data (chunking, formatting requests) and sending requests to OpenAlex to return WorkIDs with aiometer. Collects responses, compares returned DOIs against submitted, and tracks missing DOIs.
- _process_all (Steps 15-18)
- Coordinates processing the data (chunking, formatting requests) and sending requests to OpenAlex to return full bibliographic records with aiometer. Collects responses, compares returned DOIs against submitted, and tracks missing DOIs.
- _prepare_chunks (Step 14)
- Takes DOI chunk and formats into a request to OpenAlex API for WorkIDs
- _prepare_chunks_full (Step 14)
- Takes DOI chunk and formats into a request to OpenAlex API for full bibliographic data
- _chunk_input_data (Step 13)
- Splits data into 'chunks' of 50 DOIs
- _doi_str_formatter (Step 12)
- Regularises DOIs to https prefix and lowercase
- _fetch
- Sends requests to OpenAlex API using aioclient and implements exponential backoff
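The exponential backoff used by _fetch can be illustrated without the async httpx machinery: the wait roughly doubles after each failed attempt, up to a cap. A generic sketch with illustrative parameters (the ORC's actual base delay, cap, and retry count are not documented here):

```python
def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Delay schedule for exponential backoff: base * 2**attempt,
    capped so repeated failures do not produce unbounded waits.
    (Illustrative values; the ORC's actual parameters may differ.)"""
    return [min(base * 2 ** attempt, cap) for attempt in range(retries)]

print(backoff_delays(6))  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
```

Doubling the delay quickly backs off a client that has breached OpenAlex's rate limit, while the cap bounds the worst-case wait per retry.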
The ORC was built with a GitLab CI/CD pipeline specific to GESIS. In v1.1.0 we have included a thinner GitHub CI/CD template. The majority of commands and testing used can be replicated via the Makefile. We include the structure of the current GitLab CI/CD here:
- Build
- Lint
- Ruff
ruff check ./src
- Test
- Coverage using Pytest
poetry run coverage run -m pytest -m "" ./tests
- Bandit
bandit -c pyproject.toml -r ./src/ --format txt > bandit.txt
- Pyright
pyright ./src --outputjson > report_raw.json
- Deploy
Frontend Testing must be run from src/orc/frontend/orc-demo/ with npm test.
Backend Tests can be found in tests/. A csv of DOIs from Jason Priem (founder of OpenAlex) and the associated OpenAlex WorkIDs can be found in tests/fixtures/priem.csv. Similarly in test_requester.py and test_open_research_converter.py in tests/ one may find lists of DOIs and associated WorkIDs used for testing the ORC. A guide for creating your own test set is found in tests/fixtures/extraction.md.
All dependency management for the backend is handled by poetry. For the frontend, dependencies are captured in package.json and package-lock.json.
Following PEP621, configuration for core project metadata is stored in the pyproject.toml where possible.
- [B1] - Error handling is currently not performed on the frontend, leading to the process occasionally stopping without informing the user
- [B2] - Reports that DOI input strings ending in a comma fail.
- [M1] - Extending the ORC's capability to match items that may exist in other databases without a DOI but which contain enough information to confidently match (e.g. author names, title, publishing date, &c.).
- [m1] - Better handling of items which do not exist in OpenAlex (return "Not found" or similar rather than dropping)
- [m2] - Improving test coverage and quality
- [m3] - Reinstating Typecheck for the backend
- [m4] - Implement frontend testing
- [m5] - Standardising .env variable names and values (local/dev/prod/production)
- [m6] - Implement frontend logging
- [m7] - Change the bind mount for certbot to a docker volume
- [m8] - Adding ability to change gunicorn parameters via ARG/ENV in the backend container (see Functionality/Backend Container)
Please raise GitHub issues for bugs. Any frontend development experience would be greatly appreciated.
- This project was configured for use in a development container - this will automatically install the project and its development dependencies. (A template version of this project will shortly be publicly released)
- To add dependencies to the python module use poetry add
- To enable production, change:
  - In src/env/js.env: REACT_APP_ENV from "dev" to "production"
  - In .env: LOCAL_OR_PRODUCTION from "local" to "prod"
- Most useful commands have been captured in the makefile, which can also assist with figuring out what fits where
- When docker compose up is run, the logs are captured in a newly created folder /logs/, which is bind-mounted to your filesystem.
If you are having difficulties using the ORC locally or at orc-demo.gesis.org please reach out to Jack Culbert at jack.culbert@gesis.org
- Jack H. Culbert - Lead Developer - ORCID, LinkedIn, Github
- Muhammad Ahsan Shahid - Frontend Developer - ORCID, LinkedIn, Github
- Philipp Mayr - Team Lead - ORCID
This work was funded by the Federal Ministry of Education and Research via funding numbers: 16WIK2301B / 16WIK2301E, The OpenBib project. We acknowledge support by Federal Ministry of Education and Research, Germany under grant number 01PQ17001, the Competence Network for Bibliometrics.
Jack Culbert, and Philipp Mayr received additional funding by the European Union under the Horizon Europe grant OMINO – Overcoming Multilevel INformation Overload under grant number 101086321
As of release of v1.1.0 on the 5th of November 2024: This software has been submitted to JOSS, citation details pending.
Please remember to also cite the OpenAlex work:
@article{priem2022openalex,
title={OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts},
author={Priem, Jason and Piwowar, Heather and Orr, Richard},
journal={arXiv preprint arXiv:2205.01833},
year={2022}
}
This code is licenced under GPL-3.0, or later.