github.com/endfirstcorp/pdflib
Version: 0.0.0-20170818034626-99d89874ebf5
Repository: https://github.com/endfirstcorp/pdflib.git
Documentation: pkg.go.dev

# README

pdflib: a golang pdf processor

Package pdflib is a simple PDF processing library written in Go. It provides both an API and a command line tool, and supports all PDF versions up to 1.7 (ISO 32000).

Motivation

The goal is to reduce the size of large PDF files for mass mailings by optimizing them down to the bare minimum. This is achieved by analyzing a PDF's cross reference table, removing redundant embedded resources such as font files or images, and always writing the file back with maximum PDF compression.
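
As an illustration of the redundancy removal described above, the following Go sketch collapses byte-identical embedded resources by content hash. The resource type, the dedupe function and the choice of SHA-256 are assumptions made for this example; the README does not spell out how pdflib detects duplicates, so this should not be read as its actual implementation.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// resource is a hypothetical stand-in for an embedded PDF object such as a
// font file or an image stream. It is not a pdflib type.
type resource struct {
	ObjNr int    // object number in the cross reference table
	Data  []byte // raw stream bytes
}

// dedupe returns the resources to keep, plus a map from the object number of
// each dropped duplicate to the object number of the retained copy.
func dedupe(resources []resource) (kept []resource, redirect map[int]int) {
	seen := make(map[[sha256.Size]byte]int) // content hash -> retained object number
	redirect = make(map[int]int)
	for _, r := range resources {
		sum := sha256.Sum256(r.Data)
		if first, ok := seen[sum]; ok {
			// Same bytes seen before: drop this copy and remember where
			// references to it should point instead.
			redirect[r.ObjNr] = first
			continue
		}
		seen[sum] = r.ObjNr
		kept = append(kept, r)
	}
	return kept, redirect
}

func main() {
	res := []resource{
		{ObjNr: 4, Data: []byte("font-A")},
		{ObjNr: 9, Data: []byte("font-A")}, // duplicate of object 4
		{ObjNr: 12, Data: []byte("image-B")},
	}
	kept, redirect := dedupe(res)
	fmt.Println(len(kept), redirect)
}
```

In a real optimizer the redirect map would then be used to rewrite every page resource dictionary that still points at a dropped object.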

I also wanted my own Swiss Army knife for PDFs, written entirely in Go, that lets me trim, split and merge PDF content.

Features

  • Validate (validates PDF files up to version 1.7)
  • Read (builds xref table from PDF file)
  • Write (writes xref table to PDF file)
  • Optimize (gets rid of redundancies like duplicate fonts, images)
  • Split (split a multi page PDF file into single page PDF files)
  • Merge (merge a set of PDF files into one consolidated PDF file)
  • Trim (generate a custom version of a PDF file)
  • Extract Images (extract all embedded images of a PDF file into a given dir)
  • Extract Fonts (extract all embedded fonts of a PDF file into a given dir)
  • Extract Pages (extract specific pages into a given dir)
  • Extract Content (extract the PDF source into a given dir)
  • Extract Text (extract the text of the PDF to an io.Reader)
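
To make the Read feature in the list above more concrete, here is a minimal Go sketch that parses a classic cross reference section as laid out in the PDF specification (an "xref" keyword, subsection headers of the form "first count", then one "offset generation n|f" entry per object). The names xrefEntry and parseXRef are hypothetical and not part of pdflib's API; cross reference streams and incremental updates are deliberately left out.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// xrefEntry mirrors one line of a classic cross reference table.
// The type is illustrative and not taken from pdflib.
type xrefEntry struct {
	Offset     int64 // byte offset of the object (or next free object number)
	Generation int
	InUse      bool // true for "n" entries, false for "f" (free) entries
}

// parseXRef reads a classic xref section and returns its entries keyed by
// object number. It uses the default line scanner, so it handles \n and \r\n
// line endings but not a lone \r.
func parseXRef(section string) (map[int]xrefEntry, error) {
	sc := bufio.NewScanner(strings.NewReader(section))
	if !sc.Scan() || strings.TrimSpace(sc.Text()) != "xref" {
		return nil, fmt.Errorf("missing xref keyword")
	}
	table := make(map[int]xrefEntry)
	objNr, remaining := 0, 0
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 {
			continue
		}
		if fields[0] == "trailer" {
			break
		}
		if remaining == 0 {
			// Subsection header: first object number and entry count.
			if len(fields) != 2 {
				return nil, fmt.Errorf("bad subsection header %q", sc.Text())
			}
			first, err1 := strconv.Atoi(fields[0])
			count, err2 := strconv.Atoi(fields[1])
			if err1 != nil || err2 != nil {
				return nil, fmt.Errorf("bad subsection header %q", sc.Text())
			}
			objNr, remaining = first, count
			continue
		}
		if len(fields) != 3 || (fields[2] != "n" && fields[2] != "f") {
			return nil, fmt.Errorf("bad xref entry %q", sc.Text())
		}
		offset, err1 := strconv.ParseInt(fields[0], 10, 64)
		gen, err2 := strconv.Atoi(fields[1])
		if err1 != nil || err2 != nil {
			return nil, fmt.Errorf("bad xref entry %q", sc.Text())
		}
		table[objNr] = xrefEntry{Offset: offset, Generation: gen, InUse: fields[2] == "n"}
		objNr++
		remaining--
	}
	return table, sc.Err()
}

func main() {
	section := "xref\n0 3\n0000000000 65535 f \n0000000017 00000 n \n0000000081 00000 n \ntrailer\n"
	table, err := parseXRef(section)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%d entries, object 1: %+v\n", len(table), table[1])
}
```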

Installation

go get github.com/hhrutter/pdflib/cmd/...

Usage

pdflib is a tool for PDF manipulation written in Go.

Usage:

pdflib command [arguments]

The commands are:

validate	validate PDF against PDF 32000-1:2008 (PDF 1.7)
optimize	optimize PDF by getting rid of redundant page resources
split		split multi-page PDF into several single-page PDFs
merge		concatenate 2 or more PDFs
extract		extract images, fonts, content, pages out of a PDF
trim		create trimmed version of a PDF
version		print pdflib version

Single-letter Unix-style abbreviations are supported for commands and flags.

Use "pdflib help [command]" for more information about a command.

pdflib validate [-verbose] [-mode strict|relaxed] inFile
pdflib optimize [-verbose] [-stats csvFile] inFile [outFile]
pdflib split [-verbose] inFile outDir
pdflib merge [-verbose] outFile inFile1 inFile2 ...
pdflib extract [-verbose] -mode image|font|content|page [-pages pageSelection] inFile outDir
pdflib trim [-verbose] -pages pageSelection inFile outFile
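
For callers that drive the command line tool from another Go program, the validate synopsis above can be turned into an argument list as sketched below. The helper validateArgs and the file name in.pdf are hypothetical; only the validate command and the -verbose and -mode flags come from the usage text, and the sketch assumes a pdflib binary is on the PATH if the command is actually run.

```go
package main

import (
	"fmt"
	"os/exec"
)

// validateArgs assembles the arguments for
//   pdflib validate [-verbose] [-mode strict|relaxed] inFile
// as given in the usage synopsis. It is a helper written for this example.
func validateArgs(inFile, mode string, verbose bool) ([]string, error) {
	if mode != "strict" && mode != "relaxed" {
		return nil, fmt.Errorf("mode must be strict or relaxed, got %q", mode)
	}
	args := []string{"validate"}
	if verbose {
		args = append(args, "-verbose")
	}
	// Flags go before the positional inFile argument, as in the synopsis.
	args = append(args, "-mode", mode, inFile)
	return args, nil
}

func main() {
	args, err := validateArgs("in.pdf", "relaxed", true)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	cmd := exec.Command("pdflib", args...)
	fmt.Println(cmd.Args)
	// cmd.Run() would execute the tool; it is left out so the sketch also
	// runs on machines where pdflib is not installed.
}
```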

Please read the documentation

Status

Version: 0.0.1-beta

The extraction code for font files and images is experimental and serves as proof of concept only.

To Do

  • validation of the less used page entry "PresSteps"
  • validation of the less used root entries "SpiderInfo", "Permissions", "Legal", "Collection"

I am looking for test PDFs that use one of these features. If you have one and can share it, let me know. I am also accepting PRs, but right now only for the items on the To Do list.

Disclaimer

Usage of pdflib assumes you know about and respect all copyrights of any PDF content you may be processing. This applies to the PDF files as such, their content and in particular all embedded resources like font files or images.

License

MIT

# Packages

Package bufio extends the stdlib bufio with additional support for the \r eol marker.
No description provided by the author
Package extract provides methods for extracting fonts, images, pages and page content.
Package filter contains implementations for PDF filters.
Package merge provides code for merging two PDFContexts.
Package optimize contains code for optimizing the resources of a PDF file.
Package read provides methods for parsing PDF files into memory.
Package types provides the PDFContext, representing an ecosystem for PDF processing.
Package validate contains validation code for ISO 32000-1:2008.
Package write contains code that writes PDF data from memory to a file.
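
The bufio package in the list above is described as adding support for the \r end-of-line marker. A rough idea of what that involves can be shown with the standard library alone: the sketch below is a bufio.SplitFunc that ends a line at \n, \r\n or a lone \r. The names scanPDFLines and splitLines are assumptions for this example and are not taken from pdflib's bufio package, whose actual approach may differ.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// scanPDFLines is a bufio.SplitFunc that treats \n, \r\n and a lone \r as
// line terminators. The terminator is not included in the returned token.
func scanPDFLines(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}
	if i := bytes.IndexAny(data, "\r\n"); i >= 0 {
		if data[i] == '\n' {
			return i + 1, data[:i], nil
		}
		// data[i] is \r: one more byte is needed to tell \r from \r\n.
		if i+1 < len(data) {
			if data[i+1] == '\n' {
				return i + 2, data[:i], nil
			}
			return i + 1, data[:i], nil
		}
		if atEOF {
			return i + 1, data[:i], nil
		}
		return 0, nil, nil // ask the scanner for more data
	}
	if atEOF {
		// Final line without a terminator.
		return len(data), data, nil
	}
	return 0, nil, nil
}

// splitLines collects all lines of s using scanPDFLines.
func splitLines(s string) []string {
	sc := bufio.NewScanner(strings.NewReader(s))
	sc.Split(scanPDFLines)
	var lines []string
	for sc.Scan() {
		lines = append(lines, sc.Text())
	}
	return lines
}

func main() {
	fmt.Printf("%q\n", splitLines("%PDF-1.7\r1 0 obj\r\n<< >>\nendobj"))
}
```

The default bufio.ScanLines would return the first two lines of this input as one token, because it does not split on a lone \r.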

# Functions

ExtractContent dumps "PDF source" files from fileIn into dirOut for selected pages.
ExtractContentCommand creates a new ExtractContentCommand.
ExtractFonts dumps embedded font files from fileIn into dirOut for selected pages.
ExtractFontsCommand creates a new ExtractFontsCommand.
ExtractImages dumps embedded image resources from fileIn into dirOut for selected pages.
ExtractImagesCommand creates a new ExtractImagesCommand.
ExtractPages generates single page PDF files from fileIn in dirOut for selected pages.
ExtractPagesCommand creates a new ExtractPagesCommand.
ExtractText converts PDF into text.
Merge some PDF files together and write the result to fileOut.
MergeCommand creates a new MergeCommand.
Optimize reads in fileIn, does validation, optimization and writes the result to fileOut.
OptimizeCommand creates a new OptimizeCommand.
ParsePageSelection ensures a correct page selection expression.
Process executes a pdflib command.
Read reads in a PDF file and builds an internal structure holding its cross reference table aka the PDFContext.
Split generates a sequence of single page PDF files in dirOut creating one file for every page of inFile.
SplitCommand creates a new SplitCommand.
Trim generates a trimmed version of fileIn containing all pages selected.
TrimCommand creates a new TrimCommand.
Validate validates a PDF file against ISO-32000-1:2008.
ValidateCommand creates a new ValidateCommand.
Verbose controls logging output.
Write generates a PDF file for a given PDFContext.
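
Several of the functions above operate on "selected pages", and ParsePageSelection checks the selection expression. The expression syntax is not documented in this listing, so the following Go sketch assumes a simple hypothetical grammar of comma-separated page numbers and closed ranges such as "1,3-5". The function selectPages is invented for this example and may not match what ParsePageSelection accepts.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// selectPages expands an expression such as "1,3-5" into a set of 1-based
// page numbers. The grammar is an assumption for this sketch, not pdflib's.
func selectPages(expr string, pageCount int) (map[int]bool, error) {
	selected := make(map[int]bool)
	for _, part := range strings.Split(expr, ",") {
		part = strings.TrimSpace(part)
		lo, hi := part, part
		if i := strings.Index(part, "-"); i >= 0 {
			lo, hi = part[:i], part[i+1:]
		}
		from, err := strconv.Atoi(lo)
		if err != nil {
			return nil, fmt.Errorf("invalid page selection %q", part)
		}
		to, err := strconv.Atoi(hi)
		if err != nil {
			return nil, fmt.Errorf("invalid page selection %q", part)
		}
		if from < 1 || to > pageCount || from > to {
			return nil, fmt.Errorf("selection %q is outside pages 1..%d", part, pageCount)
		}
		for p := from; p <= to; p++ {
			selected[p] = true
		}
	}
	return selected, nil
}

func main() {
	pages, err := selectPages("1,3-5", 6)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for p := 1; p <= 6; p++ {
		fmt.Printf("page %d selected: %v\n", p, pages[p])
	}
}
```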

# Constants

The available commands for the CLI.

# Structs

Command represents an execution context.