Package: github.com/instill-ai/operator/pkg/text/v0

Version: 0.10.0-beta

Repository: https://github.com/instill-ai/operator.git

Documentation: pkg.go.dev

# README

---
title: "Text"
lang: "en-US"
draft: false
description: "Learn how to set up a VDP Text operator https://github.com/instill-ai/vdp"
---

The Text component is an operator that allows users to extract and manipulate text from different sources. It can carry out the following tasks:

- Convert To Text
- Split By Token

## Release Stage

Alpha

## Configuration

The component configuration is defined and maintained here.

## Supported Tasks

### Convert To Text

Convert document to text.

| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_CONVERT_TO_TEXT` |
| Document (required) | `doc` | string | Base64 encoded document (PDF, DOC, DOCX, XML, HTML, RTF, etc.) to be converted to plain text |
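
As an illustration of the input fields above, the following Go sketch builds a JSON payload for this task. The `convertInput` struct and the `newConvertInput` helper are hypothetical names, not part of the operator package, and the use of the standard padded Base64 alphabet is an assumption, since the table only states that the document is Base64 encoded.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// convertInput mirrors the `task` and `doc` input fields of the table.
// It is a hypothetical helper type, not one exported by the operator package.
type convertInput struct {
	Task string `json:"task"`
	Doc  string `json:"doc"`
}

// newConvertInput encodes the raw document bytes and sets the task ID.
// Assumption: standard Base64 with padding (base64.StdEncoding).
func newConvertInput(raw []byte) convertInput {
	return convertInput{
		Task: "TASK_CONVERT_TO_TEXT",
		Doc:  base64.StdEncoding.EncodeToString(raw),
	}
}

func main() {
	in := newConvertInput([]byte("<html><body>Hello</body></html>"))
	payload, err := json.Marshal(in)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(payload))
}
```

In practice the raw bytes would come from a file or an upstream component rather than from a string literal.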

| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Body | `body` | string | Plain text converted from the document |
| Meta | `meta` | object | Metadata extracted from the document |
| MSecs | `msecs` | number | Time taken to convert the document |
| Error | `error` | string | Error message if any during the conversion process |
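
A consumer of this task will usually want to check `error` before using `body`. The Go sketch below decodes an output object with the four fields listed above. The `convertOutput` type, the `parseConvertOutput` helper, and the sample JSON are all made up for illustration; treating a non-empty `error` string as a failure is an assumption about how the field is meant to be used.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// convertOutput mirrors the output fields of the table.
// Hypothetical helper type, not part of the operator package.
type convertOutput struct {
	Body  string                 `json:"body"`
	Meta  map[string]interface{} `json:"meta"`
	MSecs float64                `json:"msecs"`
	Error string                 `json:"error"`
}

// parseConvertOutput decodes the JSON output and reports a non-empty
// `error` field as a Go error (an assumed convention).
func parseConvertOutput(data []byte) (convertOutput, error) {
	var out convertOutput
	if err := json.Unmarshal(data, &out); err != nil {
		return convertOutput{}, fmt.Errorf("decode output: %w", err)
	}
	if out.Error != "" {
		return out, fmt.Errorf("conversion failed: %s", out.Error)
	}
	return out, nil
}

func main() {
	// Invented sample output, for illustration only.
	sample := []byte(`{"body":"Hello","meta":{"title":"Greeting"},"msecs":3,"error":""}`)
	out, err := parseConvertOutput(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(out.Body, out.Meta["title"], out.MSecs)
}
```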

### Split By Token

Split text into chunks of a given token size.

| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_SPLIT_BY_TOKEN` |
| Text (required) | `text` | string | Text to be split |
| Model (required) | `model` | string | ID of the model to use for tokenization |
| Chunk Token Size | `chunk_token_size` | integer | Number of tokens per text chunk |

| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Token Count | `token_count` | integer | Total count of tokens in the input text |
| Text Chunks | `text_chunks` | array[string] | Text chunks after splitting |
| Number of Text Chunks | `chunk_num` | integer | Total number of output text chunks |
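
To show how the three output fields relate to `chunk_token_size`, here is a minimal Go sketch of fixed-size chunking. It is not the operator's implementation: whitespace-separated words stand in for model tokens (the real task tokenizes with the model named in `model`), and `splitOutput` and `splitByToken` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// splitOutput mirrors the output fields of the table.
// Hypothetical helper type, not part of the operator package.
type splitOutput struct {
	TokenCount int      `json:"token_count"`
	TextChunks []string `json:"text_chunks"`
	ChunkNum   int      `json:"chunk_num"`
}

// splitByToken groups tokens into chunks of at most chunkTokenSize tokens.
// ASSUMPTION: whitespace-separated words stand in for model tokens.
func splitByToken(text string, chunkTokenSize int) (splitOutput, error) {
	if chunkTokenSize <= 0 {
		return splitOutput{}, fmt.Errorf("chunk_token_size must be positive, got %d", chunkTokenSize)
	}
	tokens := strings.Fields(text)
	out := splitOutput{TokenCount: len(tokens), TextChunks: []string{}}
	for start := 0; start < len(tokens); start += chunkTokenSize {
		end := start + chunkTokenSize
		if end > len(tokens) {
			end = len(tokens) // the last chunk may be shorter
		}
		out.TextChunks = append(out.TextChunks, strings.Join(tokens[start:end], " "))
	}
	out.ChunkNum = len(out.TextChunks)
	return out, nil
}

func main() {
	out, err := splitByToken("the quick brown fox jumps", 2)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d tokens, %d chunks: %q\n", out.TokenCount, out.ChunkNum, out.TextChunks)
}
```

With five stand-in tokens and a chunk size of 2, the sketch yields three chunks, the last holding the single leftover token. A model tokenizer would generally produce different token boundaries and counts than whitespace splitting.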