File type	Extension	MIME type
Word document	.doc	application/msword
Word document	.docx	application/vnd.openxmlformats-officedocument.wordprocessingml.document
OpenDocument text	.odt	application/vnd.oasis.opendocument.text
Pages document	.pages	application/vnd.apple.pages
PDF document	.pdf	application/pdf
PowerPoint presentation	.pptx	application/vnd.openxmlformats-officedocument.presentationml.presentation
RTF document	.rtf	application/rtf
XML file	.xml	text/xml
HTML file	.xhtml/.html/.htm	text/html
JPEG image	.jpg/.jpeg/.jpe/.jfif/.jfif-tbnl	image/jpeg
PNG image	.png	image/png
TIFF image	.tif	image/tif
TIFF image	.tiff	image/tiff
Text file	.txt	text/plain
Excel spreadsheet	.xls	application/vnd.ms-excel
Excel spreadsheet	.xlsx	application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Compressed archive	.zip/.7z/.rar	application/zip
Other files	*	application/octet-stream
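
The mapping in the table above can be sketched as a simple lookup by file extension. This is a minimal illustration, not the library's own code: `mimeByExt` and `mimeFromExt` are hypothetical names, and the values are copied from the table row for row (including `image/tif` for `.tif`).

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// mimeByExt mirrors the table above; it is an illustrative copy,
// not the library's internal data structure.
var mimeByExt = map[string]string{
	".doc":       "application/msword",
	".docx":      "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
	".odt":       "application/vnd.oasis.opendocument.text",
	".pages":     "application/vnd.apple.pages",
	".pdf":       "application/pdf",
	".pptx":      "application/vnd.openxmlformats-officedocument.presentationml.presentation",
	".rtf":       "application/rtf",
	".xml":       "text/xml",
	".xhtml":     "text/html",
	".html":      "text/html",
	".htm":       "text/html",
	".jpg":       "image/jpeg",
	".jpeg":      "image/jpeg",
	".jpe":       "image/jpeg",
	".jfif":      "image/jpeg",
	".jfif-tbnl": "image/jpeg",
	".png":       "image/png",
	".tif":       "image/tif",
	".tiff":      "image/tiff",
	".txt":       "text/plain",
	".xls":       "application/vnd.ms-excel",
	".xlsx":      "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
	".zip":       "application/zip",
	".7z":        "application/zip",
	".rar":       "application/zip",
}

// mimeFromExt (hypothetical helper) returns the MIME type for a file name,
// matching the extension case-insensitively and falling back to
// application/octet-stream for anything not listed in the table.
func mimeFromExt(filename string) string {
	ext := strings.ToLower(filepath.Ext(filename))
	if mime, ok := mimeByExt[ext]; ok {
		return mime
	}
	return "application/octet-stream"
}

func main() {
	for _, name := range []string{"report.DOCX", "scan.tif", "notes.txt", "data.bin"} {
		fmt.Printf("%s -> %s\n", name, mimeFromExt(name))
	}
}
```

Matching on the lower-cased extension is a choice made for this sketch; whether the library itself ignores case is not stated here.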

docconv

A Go wrapper library to convert PDF, DOC, DOCX, XML, HTML, RTF, ODT, Pages documents and images (see optional dependencies below) to plain text.

Installation

If you haven't set up Go before, you need to install Go first.

To fetch and build the code:

$ go install github.com/dev4mobile/mupdf/v2/docd@latest

See go help install for details on the installation location of the installed docd executable. Make sure that the full path to the executable is in your PATH environment variable.

Dependencies

tidy
wv
poppler-utils
unrtf
https://github.com/JalfResi/justext

Debian-based Linux

$ sudo apt-get install poppler-utils wv unrtf tidy
$ go get github.com/JalfResi/justext

macOS

$ brew install poppler-qt5 wv unrtf tidy-html5
$ go get github.com/JalfResi/justext

Optional dependencies

To add image support to the docconv library you first need to install and build gosseract.

Now you can add -tags ocr to any go command when building/fetching/testing docconv to include support for processing images:

$ go get -tags ocr github.com/dev4mobile/mupdf/v2/...

The build may fail on macOS, which you can fix by installing tesseract via brew:

$ brew install tesseract

docd tool

The docd tool runs in one of three ways:

a service on port 8888 (by default)

Documents can be sent as a multipart POST request, and the plain text (body) and meta information are returned as a JSON object.

a service exposed from within a Docker container

This also runs as a service, but from within a Docker container. Official images are published at https://hub.docker.com/r/sajari/docd.

Optionally you can build it yourself:
```
$ cd docd
$ docker build -t docd .
```

via the command line

Documents can be sent as an argument, e.g.
```
$ docd -input document.pdf
```

Optional flags

addr - the bind address for the HTTP server, default is ":8888"
readability-length-low - sets the readability length low if the ?readability=1 parameter is set
readability-length-high - sets the readability length high if the ?readability=1 parameter is set
readability-stopwords-low - sets the readability stopwords low if the ?readability=1 parameter is set
readability-stopwords-high - sets the readability stopwords high if the ?readability=1 parameter is set
readability-max-link-density - sets the readability max link density if the ?readability=1 parameter is set
readability-max-heading-distance - sets the readability max heading distance if the ?readability=1 parameter is set
readability-use-classes - comma separated list of readability classes to use if the ?readability=1 parameter is set
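
The readability flags above only take effect when a request carries the `?readability=1` parameter. As a minimal sketch of how a caller might build such a request URL, the hypothetical helper `convertURL` below appends that parameter to the `/convert` endpoint; the helper name and the base addresses are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"log"
	"net/url"
)

// convertURL (hypothetical helper) builds the URL of the /convert endpoint
// for a docd service at base, optionally adding readability=1 so that the
// readability-* flags of the server apply to the request.
func convertURL(base string, readability bool) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/convert"
	if readability {
		q := u.Query()
		q.Set("readability", "1")
		u.RawQuery = q.Encode()
	}
	return u.String(), nil
}

func main() {
	u, err := convertURL("http://localhost:8888", true)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(u) // prints "http://localhost:8888/convert?readability=1"
}
```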

How to start the service

$ # This runs on port 8000
$ docd -addr :8000

Example usage (code)

Some basic code is shown below, but normally you would accept the file by HTTP or open it from the file system.

This should be enough to get you started though.

Use case 1: run locally

Note: this assumes you have the dependencies installed.

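package main

import (
	"fmt"
	"log"

	// The package is named docconv, so the import is aliased explicitly.
	docconv "github.com/dev4mobile/mupdf/v2"
)

func main() {
	// Convert the file at the given path to plain text.
	res, err := docconv.ConvertPath("your-file.pdf")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)
}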

Use case 2: request over the network

package main

import (
	"fmt"
	"log"

	"github.com/dev4mobile/mupdf/v2/client"
)

func main() {
	// Create a new client, using the default endpoint (localhost:8888)
	c := client.New()

	// Send the file to the docd service and wait for the converted result.
	res, err := client.ConvertPath(c, "your-file.pdf")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)
}

Alternatively, via curl:

$ curl -s -F input=@your-file.pdf http://localhost:8888/convert