Module: github.com/intelligencex/fileconversion
Version: 0.0.0-20191030112448-1b64e2d06ace
Repository: https://github.com/intelligencex/fileconversion.git
Documentation: pkg.go.dev

# README

fileconversion

This is a Go library to convert various file formats into plaintext and provide related useful functions.

This library is used for https://intelx.io and was successfully tested on over 184 million individual files. It is partly written from scratch, partly forked from open source and partly a rewrite of existing code. Many existing libraries lack stability and functionality, and this library solves that.

We welcome any contributions - please open issues for any feature requests, bugs, and other related issues.

It supports the following file formats for plaintext conversion:

  • Word: DOC, DOCX, RTF, ODT
  • Excel: XLS, XLSX, ODS
  • PowerPoint: PPTX
  • PDF
  • Ebook: EPUB, MOBI
  • Website: HTML

Functions for compressed and container files:

  • Decompress files: GZ, BZ, BZ2, XZ
  • Extract files from containers: ZIP, RAR, 7Z, TAR

Picture related functions:

  • Check if pictures are excessively large
  • Compress (and convert) pictures to JPEG: GIF, JPEG, PNG, BMP, TIFF
  • Resize and compress pictures
  • Extract pictures from PDF files

To download this library:

go get -u github.com/IntelligenceX/fileconversion

And then use it like:

package main

import (
	"bytes"
	"fmt"
	"os"

	"github.com/IntelligenceX/fileconversion"
)

const sizeLimit = 2 * 1024 * 1024 // 2 MB

func main() {
	// extract text from an XLSX file
	file, err := os.Open("Test.xlsx")
	if err != nil {
		fmt.Printf("Error opening file: %s\n", err)
		return
	}

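	defer file.Close()

	stat, err := file.Stat()
	if err != nil {
		fmt.Printf("Error reading file info: %s\n", err)
		return
	}

	// collect the extracted text in memory, pre-allocating sizeLimit bytes
	buffer := bytes.NewBuffer(make([]byte, 0, sizeLimit))

	// the last two arguments are the limit and rowLimit parameters
	if _, err := fileconversion.XLSX2Text(file, stat.Size(), buffer, sizeLimit, -1); err != nil {
		fmt.Printf("Error converting file: %s\n", err)
	}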

	fmt.Println(buffer.String())
}

Functions

The package exports the following functions:

DOCX2Text(file io.ReaderAt, size int64) (string, error)
EPUB2Text(file io.ReaderAt, size int64, limit int64) (string, error)
HTML2Text(reader io.Reader) (pageText string, err error)
HTML2TextAndLinks(reader io.Reader, baseURL string) (pageText string, links []string, err error)
Mobi2Text(file io.ReadSeeker) (string, error)
ODS2Text(file io.ReaderAt, size int64, writer io.Writer, limit int64) (written int64, err error)
ODT2Text(file io.ReaderAt, size int64, writer io.Writer, limit int64) (written int64, err error)
PDFListContentStreams(f io.ReadSeeker, w io.Writer, size int64) (written int64, err error)
PPTX2Text(file io.ReaderAt, size int64) (string, error)
RTF2Text(inputRtf string) string
XLS2Text(reader io.ReadSeeker, writer io.Writer, size int64) (written int64, err error)
XLSX2Text(file io.ReaderAt, size int64, writer io.Writer, limit int64, rowLimit int) (written int64, err error)

Picture functions:

IsExcessiveLargePicture(Picture []byte) (excessive bool, err error)
CompressJPEG(Picture []byte, quality int) (compressed []byte)
ResizeCompressPicture(Picture []byte, Quality int, MaxWidth, MaxHeight uint) 
PDFExtractImages(input io.ReadSeeker) (images []ImageResult, err error)
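
The idea behind a check such as IsExcessiveLargePicture can be sketched with the standard library alone: image.DecodeConfig reads only the picture header, so the declared width and height can be inspected before any pixel buffer is allocated. The following is a minimal sketch under assumptions, not the library's implementation. The name tooLarge and the maxPixels threshold are hypothetical and chosen only for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	_ "image/gif"  // register GIF header parsing
	_ "image/jpeg" // register JPEG header parsing
	"image/png"    // also registers PNG header parsing
)

// maxPixels is an assumed limit for this sketch, not a value from the library.
const maxPixels = 50 * 1000 * 1000

// tooLarge reports whether the declared dimensions of the picture exceed
// maxPixels. Only the header is parsed, no pixel data is decoded.
func tooLarge(picture []byte) (bool, error) {
	cfg, _, err := image.DecodeConfig(bytes.NewReader(picture))
	if err != nil {
		return false, err
	}
	return int64(cfg.Width)*int64(cfg.Height) > maxPixels, nil
}

// smallPNG encodes a tiny 4x3 PNG so the example needs no external file.
func smallPNG() ([]byte, error) {
	img := image.NewRGBA(image.Rect(0, 0, 4, 3))
	var buf bytes.Buffer
	if err := png.Encode(&buf, img); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := smallPNG()
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	big, err := tooLarge(data)
	fmt.Println(big, err) // a 4x3 picture is far below the assumed limit
}
```

Running the check before a full decode keeps a crafted header from triggering a large allocation.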

Compression and container file functions:

DecompressFile(data []byte) (decompressed []byte, valid bool)
ContainerExtractFiles(data []byte, callback func(name string, size int64, date time.Time, data []byte))
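
To illustrate the shape of DecompressFile, here is a minimal standard-library sketch that returns the same pair of results (decompressed data and a validity flag). It is an assumption about how such a function could work, not the library's code. It recognizes only GZ and BZ2 by their leading magic bytes, because XZ needs a third-party package. The names decompress and gzipBytes are hypothetical.

```go
package main

import (
	"bytes"
	"compress/bzip2"
	"compress/gzip"
	"fmt"
	"io"
)

// decompress is a hypothetical stand-in with the same result shape as
// DecompressFile. It returns valid=false for unknown or corrupt input.
func decompress(data []byte) (decompressed []byte, valid bool) {
	var r io.Reader
	switch {
	case bytes.HasPrefix(data, []byte{0x1F, 0x8B}): // gzip magic number
		gz, err := gzip.NewReader(bytes.NewReader(data))
		if err != nil {
			return nil, false
		}
		defer gz.Close()
		r = gz
	case bytes.HasPrefix(data, []byte("BZh")): // bzip2 magic number
		r = bzip2.NewReader(bytes.NewReader(data))
	default:
		return nil, false
	}
	out, err := io.ReadAll(r)
	if err != nil {
		return nil, false
	}
	return out, true
}

// gzipBytes compresses s so the example can run without input files.
func gzipBytes(s string) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write([]byte(s)); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	packed, err := gzipBytes("hello")
	if err != nil {
		fmt.Println("compress error:", err)
		return
	}
	out, ok := decompress(packed)
	fmt.Println(string(out), ok)
}
```

The sketch reads the whole stream into memory. For untrusted input, wrapping the reader in io.LimitReader would be a sensible addition to bound the output size.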

Dependencies

This library uses other Go packages. Run the following commands to download them:

go get -u github.com/nwaples/rardecode
go get -u github.com/saracen/go7z
go get -u github.com/ulikunitz/xz
go get -u github.com/mattetti/filebuffer
go get -u github.com/richardlehane/mscfb
go get -u github.com/taylorskalyo/goreader/epub
go get -u github.com/PuerkitoBio/goquery
go get -u github.com/ssor/bom
go get -u github.com/levigross/exp-html
go get -u github.com/neofight/mobi/convert
go get -u github.com/neofight/mobi/headers
go get -u github.com/unidoc/unipdf
go get -u github.com/nfnt/resize
go get -u github.com/tealeg/xlsx
go get -u gopkg.in/xmlpath.v2

Tests

There are no functional tests. The only test functions are used manually for debugging.

Forks

Other packages were tested and found to be either insufficient or unstable. Many of the packages listed below were found to be unstable, to cause crashes, and to exhaust memory due to bad programming, bad input sanitizing and bad memory management.

License

This is free and unencumbered software released into the public domain.

Note that this package includes, or consists partly of, forks or rewrites of existing open source code. Use at your own risk. Intelligence X does not provide any warranty for this library or any parts of it.

# Functions

CompressJPEG compresses a JPEG picture according to the input. Warning: If the image claims to be large (in terms of width & height), this may use a lot of memory.
ContainerExtractFiles extracts files from supported containers: ZIP, RAR, 7Z, TAR.
DecompressFile decompresses data.
DOC2Text converts a standard io.Reader from a Microsoft Word .doc binary file and returns a reader (actually a bytes.Buffer) which will output the plain text found in the .doc file.
DOCX2Text extracts text of a Word document. Size is the full size of the input file.
EPUB2Text converts an EPUB ebook to text.
HTML2Text extracts the text from the HTML.
HTML2TextAndLinks extracts the text from the HTML and all links from its <a> and <img> tags. If the base URL is provided, relative links will be converted to absolute ones.
InitPDFLicense initializes the PDF license.
IsExcessiveLargePicture checks if the picture has reasonable width and height, preventing potential DoS when decoding it. This protects against the following problem: if the image claims to be large (in terms of width & height), jpeg.Decode may use a lot of memory, see https://github.com/golang/go/issues/10532.
IsFileDOC checks if the data indicates a DOC file. DOC has multiple signatures according to https://filesignatures.net/index.php?search=doc&mode=EXT, among them D0 CF 11 E0 A1 B1 1A E1.
IsFileDOCX checks if the data indicates a DOCX file. DOCX has a signature of 50 4B 03 04.
IsFileMOBI checks if the data indicates a MOBI file.
IsFilePPT checks if the data indicates a PPT file. PPT has multiple signatures according to https://www.filesignatures.net/index.php?page=search&search=PPT&mode=EXT, among them D0 CF 11 E0 A1 B1 1A E1.
IsFilePPTX checks if the data indicates a PPTX file. PPTX has a signature of 50 4B 03 04. Warning: This collides with ZIP, DOCX and other zip-based files.
IsFileRTF checks if the data indicates an RTF file. RTF has a signature of 7B 5C 72 74 66 31, or as a string "{\rtf1".
IsFileXLS checks if the data indicates an XLS file. XLS has a signature of D0 CF 11 E0 A1 B1 1A E1.
IsFileXLSX checks if the data indicates an XLSX file. XLSX has a signature of 50 4B 03 04. Warning: This collides with ZIP, DOCX and other zip-based files.
IsFileZIP checks if the data indicates a ZIP file.
Mobi2Text converts a MOBI ebook to text.
ODS2Cells converts an ODS file to individual cells. Size is the full size of the input file.
ODS2Text extracts text of an OpenDocument Spreadsheet. Size is the full size of the input file.
ODT2Text extracts text of an OpenDocument Text file. Size is the full size of the input file.
PDFExtractImages extracts all images from a PDF file.
PDFGetCreationDate tries to get the creation date.
PDFListContentStreams writes all text streams in a PDF to the writer. It returns the number of characters it attempted to write (excluding "Page N" and new-lines) and an error, if any.
PPTX2Text extracts text of a PowerPoint document. Size is the full size of the input file.
ResizeCompressPicture scales a picture down and compresses it.
RTF2Text removes RTF characters from the string and returns the new string.
WordParse parses a Word file.
XLS2Cells converts an XLS file to individual cells.
XLS2Text extracts text from an Excel sheet.
XLSX2Cells converts an XLSX file to individual cells. Size is the full size of the input file.
XLSX2Text extracts text of an Excel sheet. Size is the full size of the input file.
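
The IsFile... checks listed above are described in terms of leading signature bytes. The following sketch shows that kind of prefix test using only the signatures quoted in those descriptions. It is an illustration, not the library's implementation, and the function name classify and its return labels are hypothetical. As the warnings above point out, a ZIP signature alone cannot tell DOCX, XLSX, PPTX and plain ZIP files apart, and likewise the shared D0 CF 11 E0 A1 B1 1A E1 signature cannot tell DOC, XLS and PPT apart.

```go
package main

import (
	"bytes"
	"fmt"
)

// Signatures as quoted in the function descriptions.
var (
	sigOLE = []byte{0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1} // DOC, XLS, PPT
	sigZIP = []byte{0x50, 0x4B, 0x03, 0x04}                         // ZIP, DOCX, XLSX, PPTX
	sigRTF = []byte(`{\rtf1`)                                       // 7B 5C 72 74 66 31
)

// classify returns a coarse label based on the leading bytes only.
// Hypothetical helper for illustration.
func classify(data []byte) string {
	switch {
	case bytes.HasPrefix(data, sigOLE):
		return "ole"
	case bytes.HasPrefix(data, sigZIP):
		return "zip"
	case bytes.HasPrefix(data, sigRTF):
		return "rtf"
	}
	return "unknown"
}

func main() {
	fmt.Println(classify([]byte(`{\rtf1\ansi Hello}`)))         // rtf
	fmt.Println(classify([]byte{0x50, 0x4B, 0x03, 0x04, 0x14})) // zip
	fmt.Println(classify([]byte("plain text")))                 // unknown
}
```

Telling the colliding formats apart requires looking inside the container, for example at the entry names of the ZIP archive.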

# Structs

ImageResult contains an extracted image.
PPTXDocument is a PPTX document loaded into memory.
PPTXSlide is a single PPTX slide.
WordDocument is a full Word document.
WordParagraph is a single paragraph.
WordRow ...
WordStyle ...

# Type aliases

SlideNumberSorter is used for sorting.