# README

Pure Go Full Text Search of PDF Files

This library implements full text search for PDF files.

The public APIs are in index_search.go.

There are some command-line programs that demonstrate the library's functionality.

examples/pdf_search_demo.go demonstrates the main APIs.
examples/pdf_search_verify.go verifies the consistency of the in-memory and on-disk APIs.
examples/index.go builds an index over a set of PDFs.
examples/search.go searches the index built by examples/index.go.

Installation

git clone https://github.com/PaperCutSoftware/pdfsearch
cd pdfsearch/examples
go build pdf_search_demo.go
go build pdf_search_verify.go
go build index.go
go build search.go

examples/pdf_search_demo.go

Usage: ./pdf_search_demo -f <PDF path> <search term>

Example: ./pdf_search_demo -f PDF32000_2008.pdf cubic Bézier curve

The example will search PDF32000_2008.pdf for cubic Bézier curve.

pdf_search_demo.go shows how to use the APIs in index_search.go to

create indexes over PDF files,
search those indexes using full-text search, and
mark up PDF files with the locations of the search matches on pages.

It supports three types of index:

On-disk. These can be as large as your disk but are slower.
In-memory with the index stored in a Go struct. Faster but limited to (virtual) memory size.
In-memory with the index serialized to a []byte. Useful for non-Go callers such as web apps.
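The difference between the two in-memory forms can be illustrated with a toy example. The sketch below does not use pdfsearch: `toyIndex`, `buildToyIndex`, `toBytes` and `fromBytes` are hypothetical names, and `encoding/gob` is chosen only for brevity. It is not the serialization format the library uses. It shows one index held as a Go struct and the same index round-tripped through a `[]byte`.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"strings"
)

// toyIndex is a hypothetical stand-in for a real full-text index.
// It maps each lower-cased word to the page numbers it appears on.
type toyIndex struct {
	Postings map[string][]int
}

// buildToyIndex indexes pages, where pages[i] is the text of page i+1.
func buildToyIndex(pages []string) *toyIndex {
	idx := &toyIndex{Postings: map[string][]int{}}
	for i, text := range pages {
		seen := map[string]bool{}
		for _, w := range strings.Fields(strings.ToLower(text)) {
			if !seen[w] {
				seen[w] = true
				idx.Postings[w] = append(idx.Postings[w], i+1)
			}
		}
	}
	return idx
}

// toBytes serializes the index into an opaque byte slice.
func (idx *toyIndex) toBytes() ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(idx); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// fromBytes rebuilds the struct form from its serialized form.
func fromBytes(data []byte) (*toyIndex, error) {
	idx := &toyIndex{}
	if err := gob.NewDecoder(bytes.NewReader(data)).Decode(idx); err != nil {
		return nil, err
	}
	return idx, nil
}

func main() {
	idx := buildToyIndex([]string{"cubic Bézier curve", "quadratic curve"})
	data, err := idx.toBytes()
	if err != nil {
		panic(err)
	}
	restored, err := fromBytes(data)
	if err != nil {
		panic(err)
	}
	// Print the pages on which "curve" appears, read from the restored index.
	fmt.Println(restored.Postings["curve"])
}
```

The struct form can be queried directly, while the byte form must be decoded first but can be stored or handed across a process boundary as a single blob.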

examples/index.go

Usage: ./index <file pattern>

Example: ./index ~/climate/**/*.pdf

The example creates an on-disk index over the PDFs in ~/climate/ and its subdirectories.

examples/search.go

Usage: ./search <search term>

Example: ./search integrated assessment model

The example searches the on-disk index created by examples/index.go for integrated assessment model.

Libraries

index_search.go uses UniDoc for PDF parsing and bleve for search.

# Packages

examples


# Functions

AddImageToPdf

AddImageToPdf adds an image to a specific page of a PDF.

ExposeErrors

ExposeErrors turns off recovery from panics in called libraries.
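The general pattern behind a switch like this can be sketched as follows. This is a minimal illustration, not the library's implementation: the `exposeErrors` variable and the `safeCall` wrapper are hypothetical names. By default a panic in a called function is recovered and turned into an error; with the switch on, the panic propagates, which gives a full stack trace while debugging.

```go
package main

import "fmt"

// exposeErrors is a hypothetical package-level switch. When true,
// panics are no longer recovered and propagate to the caller.
var exposeErrors bool

// safeCall runs fn. Unless exposeErrors is set, a panic inside fn is
// recovered and returned as an error instead of crashing the program.
func safeCall(fn func()) (err error) {
	if !exposeErrors {
		defer func() {
			if r := recover(); r != nil {
				err = fmt.Errorf("recovered from panic: %v", r)
			}
		}()
	}
	fn()
	return nil
}

func main() {
	// With recovery enabled, the panic is reported as an ordinary error.
	err := safeCall(func() { panic("malformed PDF object") })
	fmt.Println(err)
}
```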

FromBytes

FromBytes extracts a PdfIndex from the bytes in `data`.

IndexPdfFiles

IndexPdfFiles returns an index for the PDF files in `pathList`.

IndexPdfMem

IndexPdfMem returns a byte array that contains an index for the PDFs read by the io.ReadSeekers in `rsList`.

IndexPdfReaders

IndexPdfReaders returns a PdfIndex over the PDF contents read by the io.ReadSeekers in `rsList`.

MarkupPdfResults

MarkupPdfResults adds rectangles to the text positions of all matches on their PDF pages, combines these pages together and writes the resulting PDF to `outPath`.

ReuseIndex

ReuseIndex returns an existing on-disk PdfIndex with directory `persistDir`.

SearchMem

SearchMem does a full-text search over the PdfIndex in `data` for `term` and returns up to `maxResults` matches.