Module: github.com/dev4mobile/mupdf/v2
Version: 2.0.0-20250217040219-9c7962d519d4
Repository: https://github.com/dev4mobile/mupdf.git
Documentation: pkg.go.dev

# README

File extensions and their MIME types

| File type | Extension | MIME type |
| --- | --- | --- |
| Word document | .doc | application/msword |
| Word document | .docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document |
| OpenDocument text | .odt | application/vnd.oasis.opendocument.text |
| Pages document | .pages | application/vnd.apple.pages |
| PDF document | .pdf | application/pdf |
| PowerPoint presentation | .pptx | application/vnd.openxmlformats-officedocument.presentationml.presentation |
| RTF document | .rtf | application/rtf |
| XML file | .xml | text/xml |
| HTML file | .xhtml/.html/.htm | text/html |
| JPEG image | .jpg/.jpeg/.jpe/.jfif/.jfif-tbnl | image/jpeg |
| PNG image | .png | image/png |
| TIFF image | .tif | image/tif |
| TIFF image | .tiff | image/tiff |
| Text file | .txt | text/plain |
| Excel spreadsheet | .xls | application/vnd.ms-excel |
| Excel spreadsheet | .xlsx | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet |
| Archive | .zip/.7z/.rar | application/zip |
| Other files | * | application/octet-stream |
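
The mapping in the table above can be expressed as a small lookup. The sketch below is illustrative only: `mimeTypes` and `lookupMIME` are hypothetical names, not the package's own `MimeTypeByExtension`, whose implementation may differ.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// mimeTypes is an illustrative copy of the extension table above.
// It is not the library's own data structure.
var mimeTypes = map[string]string{
	".doc":       "application/msword",
	".docx":      "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
	".odt":       "application/vnd.oasis.opendocument.text",
	".pages":     "application/vnd.apple.pages",
	".pdf":       "application/pdf",
	".pptx":      "application/vnd.openxmlformats-officedocument.presentationml.presentation",
	".rtf":       "application/rtf",
	".xml":       "text/xml",
	".xhtml":     "text/html",
	".html":      "text/html",
	".htm":       "text/html",
	".jpg":       "image/jpeg",
	".jpeg":      "image/jpeg",
	".jpe":       "image/jpeg",
	".jfif":      "image/jpeg",
	".jfif-tbnl": "image/jpeg",
	".png":       "image/png",
	".tif":       "image/tif",
	".tiff":      "image/tiff",
	".txt":       "text/plain",
	".xls":       "application/vnd.ms-excel",
	".xlsx":      "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
	".zip":       "application/zip",
	".7z":        "application/zip",
	".rar":       "application/zip",
}

// lookupMIME (hypothetical helper) returns the MIME type for a file name,
// matching the extension case-insensitively and falling back to
// application/octet-stream when the extension is not in the table.
func lookupMIME(filename string) string {
	ext := strings.ToLower(filepath.Ext(filename))
	if mt, ok := mimeTypes[ext]; ok {
		return mt
	}
	return "application/octet-stream"
}

func main() {
	for _, name := range []string{"report.PDF", "scan.tif", "notes.unknown"} {
		fmt.Printf("%s -> %s\n", name, lookupMIME(name))
	}
}
```

Note that `.tif` and `.tiff` map to different strings in the table, so the sketch keeps them distinct rather than normalizing both to `image/tiff`.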

docconv


A Go wrapper library to convert PDF, DOC, DOCX, XML, HTML, RTF, ODT, Pages documents and images (see optional dependencies below) to plain text.

Installation

If you haven't set up Go before, install it first.

To fetch and build the code:

$ go install github.com/dev4mobile/mupdf/v2/docd@latest

See go help install for details on the installation location of the installed docd executable. Make sure that the full path to the executable is in your PATH environment variable.

Dependencies

Debian-based Linux

$ sudo apt-get install poppler-utils wv unrtf tidy
$ go get github.com/JalfResi/justext

macOS

$ brew install poppler-qt5 wv unrtf tidy-html5
$ go get github.com/JalfResi/justext

Optional dependencies

To add image support to the docconv library you first need to install and build gosseract.

Now you can add -tags ocr to any go command when building/fetching/testing docconv to include support for processing images:

$ go get -tags ocr github.com/dev4mobile/mupdf/v2/...

This may complain on macOS, which you can fix by installing tesseract via brew:

$ brew install tesseract

docd tool

The docd tool runs as either:

  1. a service on port 8888 (by default)

    Documents can be sent as a multipart POST request and the plain text (body) and meta information are then returned as a JSON object.

  2. a service exposed from within a Docker container

    This also runs as a service, but from within a Docker container. Official images are published at https://hub.docker.com/r/sajari/docd.

    Optionally you can build it yourself:

    $ cd docd
    $ docker build -t docd .
    
  3. via the command line.

    Documents can be sent as an argument, e.g.

    $ docd -input document.pdf
    

Optional flags

  • addr - the bind address for the HTTP server, default is ":8888"
  • readability-length-low - sets the readability length low if the ?readability=1 parameter is set
  • readability-length-high - sets the readability length high if the ?readability=1 parameter is set
  • readability-stopwords-low - sets the readability stopwords low if the ?readability=1 parameter is set
  • readability-stopwords-high - sets the readability stopwords high if the ?readability=1 parameter is set
  • readability-max-link-density - sets the readability max link density if the ?readability=1 parameter is set
  • readability-max-heading-distance - sets the readability max heading distance if the ?readability=1 parameter is set
  • readability-use-classes - comma separated list of readability classes to use if the ?readability=1 parameter is set
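
The readability flags only take effect for requests that carry the `?readability=1` query parameter. As a minimal sketch, the request URL could be built as follows; `convertURL` is a hypothetical helper, and the `/convert` path is taken from the curl example in this README rather than from a documented API contract.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
)

// convertURL (hypothetical helper) returns the conversion endpoint for a
// docd server at base, optionally adding the readability=1 query parameter.
func convertURL(base string, readability bool) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/convert" // assumed endpoint path
	if readability {
		q := u.Query()
		q.Set("readability", "1")
		u.RawQuery = q.Encode()
	}
	return u.String(), nil
}

func main() {
	s, err := convertURL("http://localhost:8888", true)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(s)
}
```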

How to start the service

$ # This runs on port 8000
$ docd -addr :8000

Example usage (code)

Some basic code is shown below, but normally you would accept the file by HTTP or open it from the file system.

This should be enough to get you started though.

Use case 1: run locally

Note: this assumes you have the dependencies installed.

package main

import (
	"fmt"
	"log"

	docconv "github.com/dev4mobile/mupdf/v2"
)

func main() {
	// ConvertPath converts a local path to text.
	res, err := docconv.ConvertPath("your-file.pdf")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)
}

Use case 2: request over the network

package main

import (
	"fmt"
	"log"

	"github.com/dev4mobile/mupdf/v2/client"
)

func main() {
	// Create a new client, using the default endpoint (localhost:8888)
	c := client.New()

	// Send the file to the docd server and wait for the converted result.
	res, err := client.ConvertPath(c, "your-file.pdf")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)
}

Alternatively, via curl:

$ curl -s -F input=@your-file.pdf http://localhost:8888/convert

# Packages

Package client defines types and functions for interacting with docconv HTTP servers.
Package snappy implements the snappy block-based compression format.

# Functions

Convert a file to plain text.
ConvertDoc converts an MS Word .doc to text.
ConvertDocx converts an MS Word docx file to text.
ConvertHTML converts HTML into text.
ConvertImage converts images to text.
ConvertODT converts an ODT file to text.
ConvertPages converts a Pages file to text.
ConvertPath converts a local path to text.
ConvertPathReadability converts a local path to text, with the given readability option.
ConvertPDFText parses a PDF using the go-fitz library, extracts the text content of all pages, and returns the PDF content and metadata.
ConvertPptx converts an MS PowerPoint pptx file to text.
ConvertRTF converts RTF files to text.
ConvertURL fetches the HTML page at the URL given in the io.Reader.
ConvertXls converts an Excel xls file to text.
ConvertXlsx converts an MS Excel xlsx file to text.
ConvertXML converts an XML file to text.
ConvertZip converts an archive file to text.
DocxXMLToText converts Docx XML into plain text.
HTMLReadability extracts the readable text in an HTML document.
HTMLToText converts HTML to plain text.
MimeTypeByExtension returns a mimetype for the given extension, or application/octet-stream if none can be determined.
NewLocalFile ensures that there is a file which contains the data provided by r.
SetImageLanguages sets the languages parameter passed to gosseract.
Tidy attempts to tidy up XML.
XlsxXMLToText converts Xlsx XML into plain text.
XMLToMap converts XML to a nested string map.
XMLToText converts XML to plain text given how to treat elements.

# Variables

HTMLReadabilityOptionsValues are the global settings used for HTMLReadability.

# Structs

BodyResult represents the PDF content.
HTMLReadabilityOptions is a type which defines parameters that are passed to the justext package.
LocalFile is a type which wraps an *os.File.
Response payload sent back to the requestor.
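
The README states that the service returns the plain text (body) and meta information as a JSON object. A minimal sketch of decoding such a payload follows. The `convertResponse` type and `decodeResponse` function are hypothetical, the JSON keys `body` and `meta` are assumptions based on that description, and the sample payload is made up for illustration; the package's actual response type may carry additional fields.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// convertResponse is a hypothetical mirror of the JSON payload.
// The keys "body" and "meta" are assumed, not confirmed by the source.
type convertResponse struct {
	Body string            `json:"body"`
	Meta map[string]string `json:"meta"`
}

// decodeResponse (hypothetical helper) parses a JSON reply into convertResponse.
func decodeResponse(data []byte) (convertResponse, error) {
	var r convertResponse
	err := json.Unmarshal(data, &r)
	return r, err
}

func main() {
	// Made-up sample payload, for illustration only.
	sample := []byte(`{"body":"Hello, world","meta":{"Author":"Jane"}}`)

	r, err := decodeResponse(sample)
	if err != nil {
		fmt.Fprintln(os.Stderr, "decode failed:", err)
		os.Exit(1)
	}
	fmt.Println("text:", r.Body)
	for k, v := range r.Meta {
		fmt.Printf("meta %s = %s\n", k, v)
	}
}
```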