github.com/fljdin/fragment

Version: 0.0.0-20240411145556-0120a268c021
Repository: https://github.com/fljdin/fragment.git
Documentation: pkg.go.dev


# README

Fragment

Fragment is a Go package designed to split text into fragments. It provides a convenient way to extract meaningful units of text from a larger body, with support for rules that define contexts (such as comments or string literals) in which delimiters and other rules are ignored.

Installation

Install the package using the following command:

go get github.com/fljdin/fragment

Predefined languages

The fragment/languages package offers several predefined languages to help you split text into fragments based on language-specific rules.

  • The PgSQL language defines delimiters for SQL statements and special PostgreSQL commands (like \g or \gexec). It also specifies rules to handle comments, single-quoted and double-quoted string literals, BEGIN-END transaction blocks, and dollar-quoted strings.

package main

import (
    "fmt"
    "github.com/fljdin/fragment/languages"
)

func main() {
    // Define a PostgreSQL query containing multiple 
    // SQL statements
    queries := `
        SELECT * FROM employees;
        -- This is a comment
        INSERT INTO products VALUES (1, 'Product 1');
        BEGIN;
        UPDATE orders SET status = 'Shipped' 
         WHERE order_id = 100;
        COMMIT;
        $custom_tag$
        This is a custom tag.
        $custom_tag$
    `

    // Split an SQL file script into fragments using 
    // the PgSQL language configuration
    fragments := languages.PgSQL.Split(queries)

    // Print the extracted fragments
    for i, fragment := range fragments {
        fmt.Printf("Fragment %d:\n%s\n\n", i+1, fragment)
    }
}
  • Shell defines semicolon and newline delimiters, allowing Shell code to be split accurately into meaningful fragments. The configuration also includes rules for handling single-quoted and double-quoted string literals, line continuations with backslashes, Shell-style comments, and Here-Documents (Heredocs). See the example after this list.

  • XML defines a newline delimiter to split documents. The primary rule in this configuration identifies XML tags, both opening and closing tags, in a case-insensitive manner.
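
For example, a script can be split with the Shell configuration in the same way as with PgSQL above. This is a minimal sketch and assumes the language is exposed as languages.Shell, by analogy with languages.PgSQL.

package main

import (
    "fmt"
    "github.com/fljdin/fragment/languages"
)

func main() {
    // A small script mixing a comment, a quoted string
    // containing a semicolon, and a line continuation
    script := `
        echo "hello; world"
        # this is a comment
        ls -l; pwd
        tar -czf backup.tar.gz \
            /etc /home
    `

    // Split the script using the Shell language configuration
    for i, fragment := range languages.Shell.Split(script) {
        fmt.Printf("Fragment %d:\n%s\n\n", i+1, fragment)
    }
}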

These predefined languages make it easy to split text into fragments based on language-specific rules, which can be useful in various text processing applications.

Usage

import "github.com/fljdin/fragment"

Every text input is split according to the delimiters and rules of a predefined Language.

  • The Language struct defines the delimiters and rules required for text splitting.
  • A new fragment is built as soon as a Delimiter is detected when reading the text.
  • Each Rule consists of a start and stop condition that defines a context in which delimiters and other rules should be ignored.

Delimiter

The Delimiter struct determines whether a fragment is completed when a simple string is encountered or when a regular expression matches. This distinction exists mainly for performance reasons.

// define a newline delimiter
newline := fragment.Delimiter{
    String: "\n",
}

// define a psql's meta-command delimiter
command := fragment.Delimiter{
    Regex: `\\g.*\n`,
}

StringRule

The StringRule struct defines simple string-based rules to detect the start and stop of fragments. Here's a concise example of newline-separated fragments with an exception rule when an escape character precedes a newline (inspired by shell syntax).

// define a basic escape newline rule
escape := fragment.StringRule{
    Start: "\\",
    Stop:  "\n",
}

// define a new language to split lines from a text
text := fragment.Language{
    Delimiters: []fragment.Delimiter{newline},
    Rules:      []fragment.Rule{escape},
}

In some cases, the delimiter that ends a rule should not be ignored. The following example shows how to define a comment that ends with the newline delimiter: the Stop field is replaced by StopAtDelim.

// define a one-line comment rule, inspired from shell 
comment := fragment.StringRule{
    Start:       "#",
    StopAtDelim: true,
}

// define a new language to split commands from a script
shell := fragment.Language{
    Delimiters: []fragment.Delimiter{newline},
    Rules:      []fragment.Rule{comment},
}
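
As a quick, hypothetical usage sketch of the shell language defined above: since the comment rule stops at the delimiter, the newline that ends the comment still terminates the fragment.

// split a two-line script; the trailing comment ends at the
// newline delimiter, which still separates the two commands
commands := shell.Split(`
    echo first # a trailing comment
    echo second
`)

for _, command := range commands {
    fmt.Println("----", command)
}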

RegexRule

The RegexRule struct allows the use of regular expressions to define rules for detecting the start and stop of fragments. Capture groups are supported: placeholders in the Stop pattern (such as \1) are dynamically replaced by the values captured when the Start pattern matched.

The following example uses a regular expression to find the start of a dollar-quoted string, in which delimiters should be ignored. To handle capture groups, the RegexRule must be passed by pointer.

// define a postgresql dollar-quoted expression rule
dollar := &fragment.RegexRule{
    Start: `\$([a-zA-Z0-9_]*)\$`,
    Stop:  `\$\1\$`,
}

// define a new language to split queries from a text
pgsql := fragment.Language{
    Delimiters: []fragment.Delimiter{{String: ";"}},
    Rules:      []fragment.Rule{dollar},
}
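
For illustration, a minimal sketch of splitting with the pgsql language above: the semicolon inside the dollar-quoted body should be ignored by the rule, while the outer semicolons end each fragment.

// split a script containing a dollar-quoted function body;
// the semicolon inside $body$ ... $body$ does not end a fragment
queries := pgsql.Split(`
    CREATE FUNCTION one() RETURNS integer AS $body$
        SELECT 1;
    $body$ LANGUAGE sql;
    SELECT one();
`)

for _, query := range queries {
    fmt.Println("----", query)
}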

To perform case-insensitive searches, use the (?i) flag within your regular expression pattern; matching is then performed without regard to letter case. Here's an example of using the case-insensitive flag to create a rule for XML markup tags:

// define a markup rule with capture group and placeholder
markup := &fragment.RegexRule{
    Start: `(?i)<(\w+)>`,
    Stop:  `(?i)</\1>`,
}

// define a new language to split XML documents from a file
xml := fragment.Language{
    Delimiters: []fragment.Delimiter{newline},
    Rules:      []fragment.Rule{markup},
}

Splitting Text

You can use the Split method of the Language struct to split text into fragments based on the defined rules and delimiters. Leading and trailing whitespace is removed from each fragment.

// split the source text into fragments
fragments := text.Split(`
    Line 1
    Line 2 \
      on multiple lines
    Line 3
`)

for _, fragment := range fragments {
    fmt.Print("---- ")
    fmt.Println(fragment)
}

Will print:

---- Line 1
---- Line 2 \
      on multiple lines
---- Line 3

When dealing with an input stream, prefer the Read method and consume a string channel like this:

// open a string channel
ch := make(chan string)
go text.Read(ch, os.Stdin)

for fragment := range ch {
    fmt.Print("---- ")
    fmt.Println(fragment)
}

Testing

Unit tests are provided under the languages package.

go test ./languages

Contributing

Contributions are welcome! Feel free to fork the repository, make changes, and submit pull requests. If you find any bugs or have suggestions for improvements, please create an issue.

License

This project is licensed under the MIT License.