91 commits
e023c8f
Update build approach for Alpine
nickschuch Feb 16, 2021
e206af3
Build for multiple architectures
nickschuch Jan 6, 2022
17f880f
Update CirlceCI
nickschuch Jan 6, 2022
7f3ba7a
Enable Docker experimental features
nickschuch Jan 6, 2022
d6270b0
Updated config.yml
nickschuch Jan 6, 2022
a9576a2
Enable Docker experimental features
nickschuch Jan 6, 2022
f1ccc41
Merge branch 'master' of github.com:skpr/docconv
nickschuch Jan 6, 2022
d29c8a2
Fix amend
nickschuch Jan 6, 2022
b39674f
Fix bin path
nickschuch Jan 6, 2022
91069b5
Fix bin path
nickschuch Jan 6, 2022
74eea62
Adds Alpine 3.16 and 3.17
nickschuch Jun 11, 2023
910d62c
Rebuild
nickschuch Jun 11, 2023
ce125ac
Only 3.16
nickschuch Jun 11, 2023
07406fd
Bump to Alpine 3.17 (#2)
nickschuch Jun 19, 2023
b708913
Reinstate Alpine 3.15
nickschuch Jun 19, 2023
8f23247
Reinstate Golang 1.19
nickschuch Jun 19, 2023
4204205
Update config.yml
nickschuch Sep 29, 2023
99d4c89
Bump deps. Only build/push Alpine 3.17/3.18
nickschuch Sep 29, 2023
a9501f1
Bump to Alpine 3.19 (#3)
nickschuch Apr 16, 2024
a2eca43
Bump dependencies
nickschuch Jun 4, 2024
20cf302
Bump to Alpine 3.20 (Remove 3.17/3.18) (#4)
nickschuch Jul 25, 2024
9c08bbe
Update config.yml (#5)
nickschuch Aug 22, 2024
a29d773
Remove Docker version
nickschuch Sep 9, 2024
26d304a
chore: update go version to 1.23 and all dependencies (#6)
parthpnx Nov 12, 2024
b30a39c
Alpine 3.21 (#7)
nickschuch May 20, 2025
091ed48
Bump to Go 1.24 (#8)
nickschuch Aug 12, 2025
340e95f
readme: fix typo
jonathaningram Mar 28, 2021
78a31fd
use GitHub actions instead of Travis
jonathaningram Mar 28, 2021
7a8bdb0
add Sourcegraph badge to README
jonathaningram Mar 28, 2021
2d300dd
remove path separator from ioutil.TempFile prefix
jonathaningram Apr 9, 2021
3c9fbdc
docd: remove unused convertPath function
jonathaningram Apr 9, 2021
d1836ad
add note about ignored error check
jonathaningram Apr 9, 2021
69fa1c7
pptx_test: check returned error before deferring f.Close()
jonathaningram Apr 9, 2021
4c3bc02
rtf: don't ignore lines less than 5 characters long (#91)
senekor Apr 12, 2021
6792221
actions: stop building for Go 1.13
jonathaningram Apr 26, 2021
8f2bbf3
add test for TestConvertHTML (#93)
jonathaningram Apr 27, 2021
ba68c29
doc: improve metadata parsing so that titles can be reliably extracte…
dhowden Aug 17, 2021
4ea2c56
Updated dependency for poppler and removed bash arg check (#100)
justinkoke Aug 18, 2021
520b719
docd: refactor Dockerfile and publish to DockerHub (#101)
jsok Aug 18, 2021
14931be
.github: add workflow_dispatch trigger to docd
jsok Aug 19, 2021
d9d6bdb
.github: tag docd image with git SHA
jsok Aug 19, 2021
8787d35
docd/appengine: use 1.2.0 release
jsok Aug 20, 2021
aad8d24
README: add an example using curl (#102)
jsok Aug 20, 2021
1b86309
update: generate iWork proto files with latest version of protoc (#107)
helenamariano Jun 28, 2022
a02864a
client: improve client error messages (#113)
dhowden Jun 28, 2022
ac91322
Fix remote code execution vulnerability in the PDF OCR converter (#110)
helenamariano Jul 7, 2022
b8f9330
fix unbounded memory consumption vulnerability (#111)
helenamariano Jul 18, 2022
6009400
convert-doc: don't use doc if there was an error (#117)
jonathaningram Sep 5, 2022
2a8f9c9
docd/appengine: use 1.2.1 release (#120)
jonathaningram Sep 20, 2022
aedf291
.github: add Go 1.19 to matrix (#122)
jonathaningram Sep 20, 2022
4ae3532
docd: use Go 1.19 to build base image (#123)
jonathaningram Sep 20, 2022
79ad5cb
docd: add recovery middleware and allow reporting errors to GCP (#121)
jonathaningram Sep 20, 2022
1a220ca
docd: fix Dockerfile after change to Go 1.19 (#124)
jonathaningram Sep 20, 2022
7cf17a2
docd: fix binary paths (#125)
jonathaningram Sep 20, 2022
f1e3618
docd/appengine: use 1.3.2 release (#126)
jonathaningram Sep 20, 2022
b31fe59
docd: return JSON errors when things fail (#127)
jonathaningram Sep 20, 2022
5fd9b83
docd: fallback to AppEngine env vars when flags not provided (#128)
jonathaningram Sep 20, 2022
2b906ef
docd/appengine: enable error reporting and bump version (#129)
jonathaningram Sep 20, 2022
39bb88a
docd: install ca-certificates (#130)
jonathaningram Sep 20, 2022
e0e9efe
docd/appengine: use 1.3.5 release (#131)
jonathaningram Sep 20, 2022
ef634ba
increase min Go version to 1.19 and bump golang.org/x/net@0.7.0 (#135)
dependabot[bot] Mar 6, 2023
446da91
readme: update code.sajari.com/docconv/docd installation instructions…
rhaist Mar 6, 2023
ffe195a
doc: add simple test for ConvertDoc for .doc files (#143)
jonathaningram Sep 13, 2023
a08dc9b
doc: channel response before return (#134)
pixge Sep 13, 2023
fcf9650
all: remove refs to deprecated io/ioutil (#140)
testwill Sep 13, 2023
e7d2930
.github: test against Go 1.21 (#144)
jonathaningram Sep 13, 2023
5dbac9e
docd: bump to Go 1.21 and debian:12 (#145)
jonathaningram Sep 13, 2023
533191d
docd/appengine: bump to 1.3.7 (#146)
jonathaningram Sep 13, 2023
05986be
docd: add slog and JSON logging (#149)
jonathaningram Oct 30, 2023
3e8927b
docd/appengine: bump to 1.3.8 (#150)
jonathaningram Oct 30, 2023
8a9cc2b
readme: add macOS instructions for dependencies section (#151)
jonathaningram Oct 30, 2023
7b61cfa
v2: initial commit (#153)
jonathaningram Oct 31, 2023
32cfc7d
docd: don't panic on broken pipe errors (#155)
jonathaningram Nov 20, 2023
6c4e072
go.mod: use github.com/sajari/msoleps@race-fix (#156)
jonathaningram Nov 20, 2023
f75b8ae
docd/appengine: double memory (#157)
jonathaningram Nov 20, 2023
111413c
docd/appengine: use 2 instances (#158)
jonathaningram Nov 20, 2023
7e58553
add linux/arm64 platform to docd image (#159)
jonathaningram Nov 20, 2023
b7d56fb
docd: fix Docker workflow (#160)
jonathaningram Nov 20, 2023
8d1d7f3
Merge pull request from GHSA-3qm5-5hmp-8c6w
jupenur Jan 18, 2024
7b30bf6
go.mod: use the latest upstream richardlehane/msoleps dep (#164)
agcom Mar 3, 2024
d0fb907
Update build approach for Alpine
nickschuch Feb 16, 2021
048f2e4
Rebuild
nickschuch Jun 11, 2023
3e5f2b8
Bump to Alpine 3.17 (#2)
nickschuch Jun 19, 2023
a2d64de
Bump deps. Only build/push Alpine 3.17/3.18
nickschuch Sep 29, 2023
3839885
chore: update go version to 1.23 and all dependencies (#6)
parthpnx Nov 12, 2024
339a820
Bump to Go 1.24 (#8)
nickschuch Aug 12, 2025
fdec8d1
Delete vendor
kimpepper Dec 4, 2025
7bc9c03
Update workflow triggers
kimpepper Dec 4, 2025
78777fd
Add goreleaser and mise
kimpepper Dec 5, 2025
c061061
Install mise
kimpepper Dec 5, 2025
d4fb12d
Add more space
kimpepper Dec 5, 2025
34 changes: 34 additions & 0 deletions .github/workflows/build.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
name: 🏗️ Build

on:
  push:
    branches: [main]
  pull_request: ~

env:
  GO_VERSION: "1.25"

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6

      - name: Set up Go ${{ env.GO_VERSION }}
        uses: actions/setup-go@v6
        with:
          go-version: ${{ env.GO_VERSION }}

      - name: 📦 Install Mise
        run: |
          curl https://mise.run | sh
          mise install

      - name: Install dependencies
        run: sudo apt install wv unrtf tidy

      - name: Build
        run: mise build

      - name: Test
        run: mise test
31 changes: 31 additions & 0 deletions .github/workflows/release.yml
@@ -0,0 +1,31 @@
name: 🚀 Publish Release

on:
  push:
    tags:
      - 'v?[0-9]+.[0-9]+.[0-9]+(-alpha[0-9]+|-beta[0-9]+)?'

env:
  GO_VERSION: "1.25"

jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6

      - name: 📦 Install Mise
        run: |
          curl https://mise.run | sh
          mise install

      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Run GoReleaser Snapshot
        run: mise release
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
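
The tag filter in `release.yml` reads like a regular expression. The sketch below shows which tags it would accept under that reading. It assumes the dots are meant literally and the pattern is anchored at both ends; `tagPattern` and `isReleaseTag` are hypothetical names, not part of the repository. GitHub Actions documents tag filters as glob patterns rather than regular expressions, so whether the grouped `(...|...)?` alternation behaves this way in the workflow is worth checking against the filter pattern documentation.

```go
package main

import (
	"fmt"
	"regexp"
)

// tagPattern is the workflow's tag filter read as an anchored regular
// expression. Escaping the dots is an assumption about the intent.
var tagPattern = regexp.MustCompile(`^v?[0-9]+\.[0-9]+\.[0-9]+(-alpha[0-9]+|-beta[0-9]+)?$`)

// isReleaseTag reports whether a tag would trigger a release under that reading.
func isReleaseTag(tag string) bool {
	return tagPattern.MatchString(tag)
}

func main() {
	tags := []string{"v1.2.3", "1.2.3", "v1.2.3-alpha1", "v1.2.3-beta2", "v1.2.3-rc1", "v1.2"}
	for _, tag := range tags {
		fmt.Printf("%-14s %v\n", tag, isReleaseTag(tag))
	}
}
```

Under this reading, plain versions and `-alphaN`/`-betaN` suffixes are accepted, while other suffixes such as `-rc1` and two-part versions are not.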
34 changes: 34 additions & 0 deletions .github/workflows/snapshot.yml
@@ -0,0 +1,34 @@
name: 👨‍🔧 Build Snapshot Release

on:
  workflow_dispatch: ~

env:
  GO_VERSION: "1.25"

jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6

      - name: Set up Go
        uses: actions/setup-go@v6
        with:
          go-version: ${{ env.GO_VERSION }}

      - name: 📦 Install Mise
        run: |
          curl https://mise.run | sh
          mise install

      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Run GoReleaser Snapshot
        run: mise snapshot
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,3 +1,5 @@

sajari-convert
*tests/
/dist
/vendor
50 changes: 50 additions & 0 deletions .goreleaser.yml
@@ -0,0 +1,50 @@
project_name: docconv

version: 2

builds:
  - id: docconv-build
    main: ./docd
    binary: docconv
    ldflags:
      - -extldflags '-static'
    env:
      - CGO_ENABLED=0
    goos: [ linux ]
    goarch: [ amd64, arm64 ]

release:
  prerelease: auto
  name_template: "docconv {{.Version}}"
  github:
    owner: skpr
    name: docconv

nfpms:
  - id: docconv-package
    package_name: docconv
    license: MIT
    maintainer: Skpr
    homepage: https://github.com/skpr/docconv
    description: Fork of https://github.com/sajari/docconv to add additional package formats.
    formats: [ apk ]
    dependencies:
      - tesseract-ocr
      - poppler-utils
    overrides:
      apk:
        dependencies:
          - tesseract-ocr
          - tesseract-ocr-dev
          - poppler-utils

dockers_v2:
  - images:
      - "docker.io/skpr/docconv"
      - "ghcr.io/skpr/docconv"
    tags:
      - "{{ .Version }}"
      - "latest"
    platforms:
      - linux/amd64
      - linux/arm64
25 changes: 25 additions & 0 deletions .mise.toml
@@ -0,0 +1,25 @@
[tools]
go = "1.25"
"ubi:goreleaser/goreleaser" = "latest"

[env]
CGO_ENABLED=0
LGFLAGS="-extldflags '-static'"

[tasks.build]
description = "Build the application"
run = '''
go build -v -o ./dist/docconv ./docd
'''

[tasks.test]
description = "Run tests"
run = "go test -v ./..."

[tasks.snapshot]
description = "Create a snapshot release with Goreleaser"
run = "goreleaser release --snapshot --clean"

[tasks.release]
description = "Create a release with Goreleaser"
run = "goreleaser release --clean"
10 changes: 0 additions & 10 deletions .travis.yml

This file was deleted.

8 changes: 8 additions & 0 deletions Dockerfile
@@ -0,0 +1,8 @@
ARG ALPINE_VERSION=3.21
FROM alpine:$ALPINE_VERSION

RUN apk add --no-cache tesseract-ocr tesseract-ocr-dev poppler-utils

ARG TARGETPLATFORM
COPY $TARGETPLATFORM/docconv /usr/bin/docconv
ENTRYPOINT ["/usr/bin/docconv"]
96 changes: 63 additions & 33 deletions README.md
@@ -1,42 +1,69 @@
# docconv

[![GoDoc](https://godoc.org/code.sajari.com/docconv?status.svg)](https://godoc.org/code.sajari.com/docconv)
[![Build Status](https://travis-ci.org/sajari/docconv.svg?branch=master)](https://travis-ci.org/sajari/docconv)
[![Go reference](https://pkg.go.dev/badge/code.sajari.com/docconv/v2.svg)](https://pkg.go.dev/code.sajari.com/docconv/v2)
[![Build status](https://github.com/sajari/docconv/workflows/Go/badge.svg?branch=master)](https://github.com/sajari/docconv/actions)
[![Report card](https://goreportcard.com/badge/code.sajari.com/docconv/v2)](https://goreportcard.com/report/code.sajari.com/docconv/v2)
[![Sourcegraph](https://sourcegraph.com/github.com/sajari/docconv/v2/-/badge.svg)](https://sourcegraph.com/github.com/sajari/docconv/v2)

A Go wrapper library to convert PDF, DOC, DOCX, XML, HTML, RTF, ODT, Pages documents and images (see optional dependencies below) to plain text.

> **Note for returning users:** the Go import path for this package has been moved to `code.sajari.com/docconv`.

## Installation

If you haven't set up Go before, you first need to [install Go](https://golang.org/doc/install).

To fetch and build the code:

$ go get code.sajari.com/docconv/...
```console
$ go install code.sajari.com/docconv/v2/docd@latest
```

See `go help install` for details on the installation location of the installed `docd` executable. Make sure that the full path to the executable is in your `PATH` environment variable.

This will also build the command line tool `docd` into `$GOPATH/bin`. Make sure that `$GOPATH/bin` is in your `PATH` environment variable.
## Build

```
docker run -it -v $(pwd):/go/src/github.com/skpr/docconv -w /go/src/github.com/skpr/docconv golang:1.23-alpine3.11 /bin/sh -c "./build.sh 3.11"
docker run -it -v $(pwd):/go/src/github.com/skpr/docconv -w /go/src/github.com/skpr/docconv golang:1.23-alpine3.12 /bin/sh -c "./build.sh 3.12"
docker run -it -v $(pwd):/go/src/github.com/skpr/docconv -w /go/src/github.com/skpr/docconv golang:1.23-alpine3.13 /bin/sh -c "./build.sh 3.13"
```

## Dependencies

tidy, wv, popplerutils, unrtf, https://github.com/JalfResi/justext
- tidy
- wv
- popplerutils
- unrtf
- https://github.com/JalfResi/justext

### Debian-based Linux

Example install of dependencies (not all systems):
```console
$ sudo apt-get install poppler-utils wv unrtf tidy
$ go get github.com/JalfResi/justext
```

$ sudo apt-get install poppler-utils wv unrtf tidy
$ go get github.com/JalfResi/justext
### macOS

```console
$ brew install poppler-qt5 wv unrtf tidy-html5
$ go get github.com/JalfResi/justext
```

### Optional dependencies

To add image support to the `docconv` library you first need to [install and build gosseract](https://github.com/otiai10/gosseract/tree/v2.2.4).

Now you can add `-tags ocr` to any `go` command when building/fetching/testing `docconv` to include support for processing images:

$ go get -tags ocr code.sajari.com/docconv/...
```console
$ go get -tags ocr code.sajari.com/docconv/v2/...
```

This may complain on macOS, which you can fix by installing [tesseract](https://tesseract-ocr.github.io) via brew:

$ brew install tesseract
```console
$ brew install tesseract
```

## docd tool

@@ -48,27 +75,27 @@ The `docd` tool runs as either:

2. a service exposed from within a Docker container

This also runs as a service, but from within a Docker container. There are three build scripts:
This also runs as a service, but from within a Docker container.
Official images are published at https://hub.docker.com/r/sajari/docd.

- [./docd/debian.sh](./docd/debian.sh)
- [./docd/alpine.sh](./docd/alpine.sh)
- [./docd/appengine.sh](./docd/appengine.sh)
Optionally you can build it yourself:

The `debian` version uses the Debian package repository which can vary with builds. The `alpine` version uses a very cut down Linux distribution to produce a container ~40MB. It also locks the dependency versions for consistency, but may miss out on future updates. The `appengine` version is a flex based custom runtime for Google Cloud.
```console
$ cd docd
$ docker build -t docd .
```

3. via the command line.

Documents can be sent as an argument, e.g.

$ docd -input document.pdf
```console
$ docd -input document.pdf
```

### Optional flags

- `addr` - the bind address for the HTTP server, default is ":8888"
- `log-level`
  - 0: errors & critical info
  - 1: includes 0 and logs each request as well
  - 2: includes 1 and logs the response payloads
- `readability-length-low` - sets the readability length low if the ?readability=1 parameter is set
- `readability-length-high` - sets the readability length high if the ?readability=1 parameter is set
- `readability-stopwords-low` - sets the readability stopwords low if the ?readability=1 parameter is set
@@ -79,11 +106,10 @@ The `docd` tool runs as either:

### How to start the service

$ # This will only log errors and critical info
$ docd -log-level 0

$ # This will run on port 8000 and log each request
$ docd -addr :8000 -log-level 1
```console
$ # This runs on port 8000
$ docd -addr :8000
```

## Example usage (code)

@@ -100,15 +126,14 @@ package main

import (
	"fmt"
	"log"

	"code.sajari.com/docconv"
	"code.sajari.com/docconv/v2"
)

func main() {
	res, err := docconv.ConvertPath("your-file.pdf")
	if err != nil {
		log.Fatal(err)
		// TODO: handle
	}
	fmt.Println(res)
}
@@ -121,9 +146,8 @@ package main

import (
	"fmt"
	"log"

	"code.sajari.com/docconv/client"
	"code.sajari.com/docconv/v2/client"
)

func main() {
@@ -132,8 +156,14 @@ func main() {

	res, err := client.ConvertPath(c, "your-file.pdf")
	if err != nil {
		log.Fatal(err)
		// TODO: handle
	}
	fmt.Println(res)
}
```

Alternatively, via `curl`:

```console
$ curl -s -F input=@your-file.pdf http://localhost:8888/convert
```
6 changes: 6 additions & 0 deletions build.sh
@@ -0,0 +1,6 @@
#!/bin/sh

VERSION=$1

apk add make cmake gcc g++ poppler-utils wv lynx tesseract-ocr-dev
go build -o dist/docconv-${VERSION} -tags ocr github.com/skpr/docconv/docd