# README
Extract information from PDF documents
Dossier is a Go library for extracting textual information from PDF documents.
Currently PDF is the only supported format (using MuPDF). Other formats can be implemented using custom parsers or by amending the library.
Sketches provide a declarative approach to locating information as an alternative to imperative/procedural access.
## Sketches
A sketch is defined using protocol buffers, and the sketch protobuf definition documents the available configuration options. Sketches are usually written in the textproto format.
A web-based viewer is included in the command line utility. Screenshot of the viewer with an example sketch for invoices:
Invocation:
```
$ dossiercli web ./invoice.pdf ./sketch.textproto
2023/12/31 00:00:00 HTTP server listening on http://[::1]:8080
```
## Installation
```
go get github.com/hansmi/dossier
```
Command line utility:
```
go install github.com/hansmi/dossier/cmd/dossiercli@latest
```
# Functions
AsPageElementVisitor returns a visitor function wrapper filtering for elements of type T.
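The type-filtering idea behind such a visitor wrapper can be sketched with Go generics. This is a minimal illustration under stated assumptions, not the library's implementation: `Element`, `TextBlock`, `Line`, `asVisitor` and `collectLines` are hypothetical names, and the real function's signature may differ.

```go
package main

import "fmt"

// Element is a hypothetical stand-in for a page element type.
type Element any

// TextBlock and Line are hypothetical element kinds used for illustration.
type TextBlock struct{ Text string }
type Line struct{ Text string }

// asVisitor wraps a typed callback in a visitor that accepts any Element.
// Elements that are not of type T are skipped without an error.
func asVisitor[T any](fn func(T) error) func(Element) error {
	return func(e Element) error {
		if typed, ok := e.(T); ok {
			return fn(typed)
		}
		return nil
	}
}

// collectLines visits all elements and returns the text of every *Line.
func collectLines(elements []Element) ([]string, error) {
	var out []string
	visit := asVisitor(func(l *Line) error {
		out = append(out, l.Text)
		return nil
	})
	for _, e := range elements {
		if err := visit(e); err != nil {
			return nil, err
		}
	}
	return out, nil
}

func main() {
	elements := []Element{
		&TextBlock{Text: "a"},
		&Line{Text: "b"},
		&Line{Text: "c"},
	}
	lines, err := collectLines(elements)
	if err != nil {
		panic(err)
	}
	fmt.Println(lines)
}
```

The callback only ever sees values of the requested type, so callers do not need their own type switches when walking a mixed collection of elements.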
NewDocument constructs a new document.
Configure a custom factory function to create parser instances.
Use a fixed parser for all documents without considering the content type.
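The two descriptions above read like configuration options, which Go libraries commonly model as functional options. The sketch below shows that general pattern only: `parser`, `parserFactory`, `withParserFactory`, `withFixedParser` and `pickParser` are hypothetical names chosen for illustration and are not taken from the library.

```go
package main

import "fmt"

// parser is a hypothetical parser interface.
type parser interface{ Name() string }

// namedParser is a trivial parser implementation for the example.
type namedParser string

func (p namedParser) Name() string { return string(p) }

// parserFactory creates a parser for the given content type.
type parserFactory func(contentType string) (parser, error)

// config holds the settings that options may change.
type config struct{ factory parserFactory }

// option mutates a config; this is the functional options pattern.
type option func(*config)

// withParserFactory installs a custom factory function (hypothetical name).
func withParserFactory(f parserFactory) option {
	return func(c *config) { c.factory = f }
}

// withFixedParser always returns p, ignoring the content type
// (hypothetical name).
func withFixedParser(p parser) option {
	return withParserFactory(func(string) (parser, error) { return p, nil })
}

// defaultFactory only knows about PDF in this sketch.
func defaultFactory(contentType string) (parser, error) {
	if contentType == "application/pdf" {
		return namedParser("pdf"), nil
	}
	return nil, fmt.Errorf("unsupported content type %q", contentType)
}

// pickParser applies the options and asks the resulting factory for a parser.
func pickParser(contentType string, opts ...option) (parser, error) {
	cfg := config{factory: defaultFactory}
	for _, o := range opts {
		o(&cfg)
	}
	return cfg.factory(contentType)
}

func main() {
	p, err := pickParser("text/plain", withFixedParser(namedParser("fixed")))
	if err != nil {
		panic(err)
	}
	fmt.Println(p.Name())
}
```

Expressing the fixed parser as a special case of the factory option keeps a single code path for parser selection.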