Module: github.com/papercutsoftware/pdfsearch
Version: 0.0.0
Repository: https://github.com/papercutsoftware/pdfsearch.git
Documentation: pkg.go.dev

# README

Pure Go Full Text Search of PDF Files

This library implements full text search for PDF files.

There are some command-line programs that demonstrate the library's functionality.

Installation

git clone https://github.com/PaperCutSoftware/pdfsearch
cd pdfsearch/examples
go build pdf_search_demo.go
go build pdf_search_verify.go
go build index.go
go build search.go

examples/pdf_search_demo.go

Usage: ./pdf_search_demo -f <PDF path> <search term>

Example: ./pdf_search_demo -f PDF32000_2008.pdf cubic Bézier curve

The example will search PDF32000_2008.pdf for cubic Bézier curve.
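The usage line above passes the search term as several separate words. A minimal sketch of how a command-line program could read the `-f` flag and join the remaining arguments into one phrase is shown here. `parseArgs` is a hypothetical helper written for illustration. It is not taken from pdf_search_demo.go, which may handle its arguments differently.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// parseArgs is a hypothetical helper, not part of pdfsearch. It reads the
// -f flag and joins the remaining arguments into a single search term.
func parseArgs(args []string) (pdfPath, term string, err error) {
	fs := flag.NewFlagSet("pdf_search_demo", flag.ContinueOnError)
	fs.StringVar(&pdfPath, "f", "", "path of the PDF file to search")
	if err = fs.Parse(args); err != nil {
		return "", "", err
	}
	// Everything after the flags is treated as one space-separated phrase.
	term = strings.Join(fs.Args(), " ")
	if pdfPath == "" || term == "" {
		return "", "", fmt.Errorf("usage: pdf_search_demo -f <PDF path> <search term>")
	}
	return pdfPath, term, nil
}

func main() {
	// Fixed arguments so the sketch runs without any command-line input.
	args := []string{"-f", "PDF32000_2008.pdf", "cubic", "Bézier", "curve"}
	pdfPath, term, err := parseArgs(args)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("searching %q for %q\n", pdfPath, term)
}
```

Joining the trailing arguments means the caller does not have to quote a multi-word search term in the shell.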

pdf_search_demo.go shows how to use the APIs in index_search.go to

  • create indexes over PDF files,
  • search those indexes using full-text search, and
  • mark up PDF files with the locations of the search matches on pages.

It supports three types of index:

  • On-disk. These can be as large as your disk but are slower.
  • In-memory with the index stored in a Go struct. Faster but limited to (virtual) memory size.
  • In-memory with the index serialized to a []byte. Useful for non-Go callers such as web apps.

examples/index.go

Usage: ./index <file pattern>

Example: ./index ~/climate/**/*.pdf

The example creates an on-disk index over the PDFs in ~/climate/ and its subdirectories.

examples/search.go

Usage: ./search <search term>

Example: ./search integrated assessment model

The example searches the on-disk index created by examples/index.go for integrated assessment model.

Libraries

index_search.go uses UniDoc for PDF parsing and bleve for search.

# Packages

No description provided by the author

# Functions

AddImageToPdf adds an image to a specific page of a PDF.
ExposeErrors turns off recovery from panics in called libraries.
from2Bufs extracts a PdfIndex from the bytes in `data`.
IndexPdfFiles returns an index for the PDF files in `pathList`.
IndexPdfMem returns a byte array that contains an index for the PDFs read by the io.ReadSeekers in `rsList`.
IndexPdfReaders returns a PdfIndex over the PDF contents read by the io.ReadSeekers in `rsList`.
MarkupPdfResults adds rectangles to the text positions of all matches on their PDF pages, combines these pages together and writes the resulting PDF to `outPath`.
ReuseIndex returns an existing on-disk PdfIndex with directory `persistDir`.
SearchMem does a full-text search over the PdfIndex in `data` for `term` and returns up to `maxResults` matches.

# Constants

DefaultMaxResults is the default maximum number of results returned.
DefaultPersistRoot is the default root for on-disk indexes.
The remaining ten constants have no description provided by the author.

# Structs

ImageLocation specifies the location of a square image on a page.
PdfIndex is an opaque struct that describes an index over some PDF files.

# Type aliases

PagePosition is an enumerated position on a page.
PdfMatchSet makes doclib.PdfMatchSet public.