ATRIUM PDF Text Extraction

Converts PDFs into the ATRIUM JSON format described below:

#########################################################
# T4.1.2 PDF report text extraction -
# suggested JSON output per PDF report.
# Reports may be represented by separate JSON files, 
# or as separate records in a JSONL format file.
# Note - we don't need to include text for individual 
# identified sections as we have full text plus start/end 
# character positions of these sections
#########################################################
{
	"meta": {
		(any metadata about the source document)
	},
	"text": (full text contents of the document),
	"sections": [
		{ 
			"start": (number),	<-- character position
			"end": (number),	<-- character position
			"type": (string) 	<-- e.g. "title", "subtitle", "abstract", "conclusions", "references" etc.
		}
	]
	
}

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pdf_to_json.py		pdf_to_json.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

ATRIUM PDF Text Extraction

About

Uh oh!

Releases

Packages

Languages

License

GateNLP/atrium-text_extraction

Folders and files

Latest commit

History

Repository files navigation

ATRIUM PDF Text Extraction

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages