github.com/danlock/gogosseract
0.0.11-0ad3421
Repository: https://github.com/danlock/gogosseract.git
Documentation: pkg.go.dev

# README

gogosseract


A reimplementation of https://github.com/otiai10/gosseract without CGo. It runs Tesseract, compiled to WASM with Emscripten, via Wazero.

The WASM is generated from my personal fork of robertknight's well-written tesseract-wasm project.

Note that Tesseract is only compiled with support for the Tesseract LSTM neural network OCR engine, and not for "classic" Tesseract.

Training Data

Tesseract requires training data in order to accurately recognize text. The official training data is published by the Tesseract project in its tessdata repositories. You can either download the file at runtime, or embed it within your Go binary at compile time using go:embed.

Accuracy

Tesseract can work better if the input images are preprocessed. See https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html for tips.

Examples

Using Tesseract to parse text from an image.

    cfg := gogosseract.Config{
        Language: "eng",
        TrainingData: trainingDataFile,
    }
    // While Tesseract's output is very useful for debugging, you have the option to silence or redirect it
    cfg.Stderr = io.Discard
    cfg.Stdout = io.Discard
    // Compile the Tesseract WASM and run it, loading in the TrainingData and setting any Config Variables provided
    tess, err := gogosseract.New(ctx, cfg)
    handleErr(err)
    // Load the image, without parsing it.
    err = tess.LoadImage(ctx, imageFile, gogosseract.LoadImageOptions{})
    handleErr(err)

    text, err := tess.GetText(ctx, func(progress int32) { log.Printf("Tesseract parsing is %d%% complete.", progress) })
    handleErr(err)
    // Closing the Tesseract instance will clean up everything used by Tesseract and its WASM module
    handleErr(tess.Close(ctx))

Using a Pool of Tesseract workers for thread safe concurrent image parsing.

    cfg := gogosseract.Config{
        Language: "eng",
        TrainingData: trainingDataFile,
    }
    // Create 10 Tesseract instances that can process image requests concurrently.
    pool, err := gogosseract.NewPool(ctx, 10, gogosseract.PoolConfig{Config: cfg})
    handleErr(err)
    defer pool.Close()

    // ParseImage loads the image and waits until the Tesseract worker sends back your result.
    hocr, err := pool.ParseImage(ctx, img, gogosseract.ParseImageOptions{
        IsHOCR: true,
    })
    handleErr(err)