github.com/papercutsoftware/pdfsearch
Repository: https://github.com/papercutsoftware/pdfsearch.git
Documentation: pkg.go.dev

# README

Pure Go Full Text Search of PDF Files

This library implements full text search for PDF files.

There are some command-line programs that demonstrate the library's functionality.

Installation

git clone https://github.com/PaperCutSoftware/pdfsearch
cd pdfsearch/examples
go build pdf_search_demo.go
go build pdf_search_verify.go
go build index.go
go build search.go

examples/pdf_search_demo.go

Usage: ./pdf_search_demo -f <PDF path> <search term>

Example: ./pdf_search_demo -f PDF32000_2008.pdf cubic Bézier curve

The example will search PDF32000_2008.pdf for cubic Bézier curve.
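
The usage line suggests that every argument after -f <PDF path> is treated as part of one search term. The sketch below shows one way to parse such a command line with Go's standard flag package. It is not taken from pdf_search_demo.go: the helper name parseArgs and the choice to join the remaining arguments with spaces are assumptions made for illustration.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
	"strings"
)

// parseArgs is a hypothetical helper. It reads the -f flag and joins all
// remaining arguments with spaces to form a single search term.
func parseArgs(args []string) (pdfPath, term string, err error) {
	fs := flag.NewFlagSet("pdf_search_demo", flag.ContinueOnError)
	fs.StringVar(&pdfPath, "f", "", "path of the PDF file to search")
	if err = fs.Parse(args); err != nil {
		return "", "", err
	}
	// fs.Args() holds everything that follows the parsed flags.
	term = strings.Join(fs.Args(), " ")
	if pdfPath == "" || term == "" {
		return "", "", errors.New("usage: pdf_search_demo -f <PDF path> <search term>")
	}
	return pdfPath, term, nil
}

func main() {
	pdfPath, term, err := parseArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("searching %q for %q\n", pdfPath, term)
}
```

With this approach the search term does not need to be quoted on the command line, which matches the form of the example above.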

pdf_search_demo.go shows how to use the APIs in index_search.go to

  • create indexes over PDF files,
  • search those indexes using full-text search, and
  • mark up PDF files with the locations of the search matches on pages.

It supports three types of index:

  • On-disk. These can be as large as your disk but are slower.
  • In-memory with the index stored in a Go struct. Faster but limited to (virtual) memory size.
  • In-memory with the index serialized to a []byte. Useful for non-Go callers such as web apps.
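
To make the difference between an index held in a Go struct and one serialized to a []byte more concrete, here is a minimal, standard-library-only sketch. It does not use the pdfsearch or bleve APIs. The toyIndex type, the helper names, and the use of encoding/gob are all assumptions chosen for illustration, and the real library's serialized format is likely to differ.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"strings"
)

// toyIndex is a hypothetical inverted index: it maps a lower-cased word to
// the 1-based numbers of the pages that contain it.
type toyIndex map[string][]int

// buildIndex creates the in-memory (Go value) form of the index.
func buildIndex(pages []string) toyIndex {
	idx := toyIndex{}
	for i, text := range pages {
		seen := map[string]bool{} // record each word once per page
		for _, w := range strings.Fields(strings.ToLower(text)) {
			if !seen[w] {
				seen[w] = true
				idx[w] = append(idx[w], i+1)
			}
		}
	}
	return idx
}

// toBytes serializes the index to a []byte that can be handed to a caller
// that cannot hold a Go value directly.
func (idx toyIndex) toBytes() ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(idx); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// fromBytes restores an index from its serialized form.
func fromBytes(data []byte) (toyIndex, error) {
	var idx toyIndex
	err := gob.NewDecoder(bytes.NewReader(data)).Decode(&idx)
	return idx, err
}

func main() {
	idx := buildIndex([]string{"Cubic Bézier curve", "A curve is a path"})
	data, err := idx.toBytes()
	if err != nil {
		panic(err)
	}
	restored, err := fromBytes(data)
	if err != nil {
		panic(err)
	}
	fmt.Println("pages containing \"curve\":", restored["curve"])
}
```

The serialized form costs an encode and decode step on each round trip, which is the trade-off for being able to pass the index across a boundary as opaque bytes. Writing the same bytes to a file would give a very rough analogue of the on-disk variant.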

examples/index.go

Usage: ./index <file pattern>

Example: ./index ~/climate/**/*.pdf

The example creates an on-disk index over the PDFs in ~/climate/ and its subdirectories.

examples/search.go

Usage: ./search <search term>

Example: ./search integrated assessment model

The example searches the on-disk index created by examples/index.go for integrated assessment model.

Libraries

index_search.go uses UniDoc for PDF parsing and bleve for search.