# README
go-extract
Secure file decompression and extraction of the following types:
- 7-Zip
- Brotli
- Bzip2
- GZip
- LZ4
- Snappy
- Tar
- Xz
- Zip
- Zlib
- Zstandard
Code Example
Add to go.mod:
GOPRIVATE=github.com/hashicorp/go-extract go get github.com/hashicorp/go-extract
Usage in code:
import (
...
"github.com/hashicorp/go-extract"
"github.com/hashicorp/go-extract/config"
"github.com/hashicorp/go-extract/telemetry"
...
)
...
// open archive
archive, _ := os.Open(...)
// prepare context with timeout
ctx, cancel := context.WithTimeout(context.Background(), (time.Second * time.Duration(MaxExtractionTime)))
defer cancel()
// prepare logger
logger := slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{
Level: slog.LevelInfo,
}))
// setup telemetry hook
telemetryToLog := func(ctx context.Context, td telemetry.Data) {
logger.Info("extraction finished", "telemetryData", td)
}
// prepare config (these are the default values)
config := config.NewConfig(
config.WithCacheInMemory(false), // cache to disk if input is a zip in a stream
config.WithContinueOnError(false), // fail on error
config.WithContinueOnUnsupportedFiles(false), // don't continue on unsupported files
config.WithCreateDestination(false), // do not try to create specified destination
config.WithCustomCreateDirMode(0750), // for folders not listed in the archive (respecting umask), default: drwxr-x---
config.WithCustomDecompressFileMode(0640), // for decompressed files (respecting umask), default: -rw-r-----
config.WithDenySymlinkExtraction(false), // allow symlink creation
config.WithExtractType("<ext>"), // explicitly specify a file extension to determine the extractor
config.WithFollowSymlinks(false), // do not follow symlinks during creation
config.WithLogger(logger), // adjust logger (default: io.Discard)
config.WithMaxExtractionSize(1 << (10 * 3)), // limit to 1 GiB (disable check: -1)
config.WithMaxFiles(1000), // only 1k files (including folders and symlinks) maximum (disable check: -1)
config.WithMaxInputSize(1 << (10 * 3)), // limit to 1 GiB (disable check: -1)
config.WithNoUntarAfterDecompression(false), // extract tar.gz combined
config.WithOverwrite(false), // don't replace existing files
config.WithPatterns("*.tf","modules/*.tf"), // normally, no patterns predefined
config.WithTelemetryHook(telemetryToLog), // adjust hook to receive telemetry from extraction
)
// extract archive
if err := extract.Unpack(ctx, archive, destinationPath, config); err != nil {
// handle error
}
...
[!TIP] If the library is used in a cgroup memory-limited execution environment to extract Zip archives that are cached in memory (config.WithCacheInMemory(true)), make sure that GOMEMLIMIT is set in the execution environment to avoid an OOM error. Example:
$ export GOMEMLIMIT=1GiB
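Where setting an environment variable is not practical, the same soft limit can be applied from inside the program through the Go runtime (Go 1.19 or later). This is a minimal sketch, assuming the 1 GiB value from the example above; `applyMemoryLimit` and `currentMemoryLimit` are hypothetical helper names and are not part of go-extract.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// applyMemoryLimit sets the runtime's soft memory limit in bytes and
// returns the limit that was configured before the call.
func applyMemoryLimit(limitBytes int64) int64 {
	return debug.SetMemoryLimit(limitBytes)
}

// currentMemoryLimit reports the active limit. A negative argument to
// debug.SetMemoryLimit leaves the limit unchanged and only returns it.
func currentMemoryLimit() int64 {
	return debug.SetMemoryLimit(-1)
}

func main() {
	applyMemoryLimit(1 << 30) // 1 GiB, comparable to GOMEMLIMIT=1GiB
	fmt.Println("memory limit in bytes:", currentMemoryLimit())
}
```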
CLI Tool
You can use this library on the command line with the goextract command.
Installation
GOPRIVATE=github.com/hashicorp/go-extract go install github.com/hashicorp/go-extract/cmd/goextract@latest
Manual Build and Installation
git clone git@github.com:hashicorp/go-extract.git
cd go-extract
make
make test
make install
Usage
$ goextract -h
Usage: goextract <archive> [<destination>] [flags]
A secure extraction utility
Arguments:
<archive> Path to archive. ("-" for STDIN)
[<destination>] Output directory/file.
Flags:
-h, --help Show context-sensitive help.
-C, --continue-on-error Continue extraction on error.
-S, --continue-on-unsupported-files Skip extraction of unsupported files.
-c, --create-destination Create destination directory if it does not exist.
--custom-create-dir-mode=750 File mode for created directories, which are not listed in the archive. (respecting umask)
--custom-decompress-file-mode=640 File mode for decompressed files. (respecting umask)
-D, --deny-symlinks Deny symlink extraction.
-F, --follow-symlinks [Dangerous!] Follow symlinks to directories during extraction.
--max-files=1000 Maximum files that are extracted before stop. (disable check: -1)
--max-extraction-size=1073741824 Maximum extraction size that allowed is (in bytes). (disable check: -1)
--max-extraction-time=60 Maximum time that an extraction should take (in seconds). (disable check: -1)
--max-input-size=1073741824 Maximum input size that allowed is (in bytes). (disable check: -1)
-N, --no-untar-after-decompression Disable combined extraction of tar.gz.
-O, --overwrite Overwrite if exist.
-P, --pattern=PATTERN,... Extracted objects need to match shell file name pattern.
-T, --telemetry Print telemetry data to log after extraction.
-t, --type="" Type of archive. (7z, br, bz2, gz, lz4, sz, tar, tgz, xz, zip, zst, zz)
-v, --verbose Verbose logging.
-V, --version Print release version information.
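The `--pattern` flag (like `config.WithPatterns` in the library) restricts extraction to entries whose names match shell file name patterns. The sketch below shows how such patterns behave using `path.Match` from the standard library. Whether go-extract evaluates patterns in exactly this way is an assumption, and `matchesAny` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"path"
)

// matchesAny reports whether name matches at least one of the patterns.
// An empty pattern list matches everything (assumption: no patterns
// means no filtering).
func matchesAny(patterns []string, name string) (bool, error) {
	if len(patterns) == 0 {
		return true, nil
	}
	for _, p := range patterns {
		ok, err := path.Match(p, name)
		if err != nil {
			return false, err // malformed pattern
		}
		if ok {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	patterns := []string{"*.tf", "modules/*.tf"}
	names := []string{"main.tf", "modules/iam.tf", "README.md", "modules/sub/iam.tf"}
	for _, name := range names {
		ok, _ := matchesAny(patterns, name)
		fmt.Printf("%-20s %v\n", name, ok)
	}
}
```

With `path.Match`, `*` never matches a `/`, so `*.tf` selects only top-level files and `modules/*.tf` selects only files directly inside `modules`.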
Telemetry data
It is possible to collect telemetry data either by specifying a telemetry hook via the config option config.WithTelemetryHook(telemetryToLog) or by passing the CLI flag -T, --telemetry.
Here is an example of the telemetry data collected for the extraction of terraform-aws-iam-5.34.0.tar.gz:
{
"LastExtractionError": "",
"ExtractedDirs": 51,
"ExtractionDuration": 48598584,
"ExtractionErrors": 0,
"ExtractedFiles": 241,
"ExtractionSize": 539085,
"ExtractedSymlinks": 0,
"ExtractedType": "tar+gzip",
"InputSize": 81477,
"PatternMismatches": 0,
"UnsupportedFiles": 0,
"LastUnsupportedFile": ""
}
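To consume this output programmatically, the JSON can be decoded into a struct. The sketch below assumes the field names shown above and assumes that `ExtractionDuration` is a `time.Duration` serialized as nanoseconds (the default JSON encoding of that type). The `telemetryRecord` type is a hypothetical mirror of a subset of the fields and is not the library's `telemetry.Data` type.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// telemetryRecord is a hypothetical mirror of some fields in the example.
type telemetryRecord struct {
	ExtractedDirs      int64
	ExtractedFiles     int64
	ExtractionDuration time.Duration // assumed to be nanoseconds
	ExtractionErrors   int64
	ExtractionSize     int64
	ExtractedType      string
	InputSize          int64
}

// parseTelemetry decodes one JSON telemetry object; unknown fields are ignored.
func parseTelemetry(raw []byte) (telemetryRecord, error) {
	var rec telemetryRecord
	err := json.Unmarshal(raw, &rec)
	return rec, err
}

func main() {
	raw := []byte(`{
		"ExtractedDirs": 51,
		"ExtractionDuration": 48598584,
		"ExtractedFiles": 241,
		"ExtractionSize": 539085,
		"ExtractedType": "tar+gzip",
		"InputSize": 81477
	}`)
	rec, err := parseTelemetry(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println("type:", rec.ExtractedType)
	fmt.Println("duration:", rec.ExtractionDuration)
	// Ratio of extracted bytes to input bytes.
	fmt.Printf("expansion ratio: %.2f\n", float64(rec.ExtractionSize)/float64(rec.InputSize))
}
```

Under the nanosecond assumption, the duration in the example corresponds to roughly 48.6 milliseconds.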
Feature collection
- Filetypes
  - zip (/jar)
  - tar
  - gzip
  - tar.gz
  - brotli
  - bzip2
  - flate
  - xz
  - snappy
  - rar
  - 7zip
  - zstandard
  - zlib
  - lz4
- extraction size check
- max num of extracted files
- extraction time exhaustion
- input file size limitations
- context-based cancellation
- option pattern for configuration
- io.Reader as source
- symlink inside archive
- symlink to outside is detected
- symlink with absolute path is detected
- file with path traversal is detected
- file with absolute path is detected
- filetype detection based on magic bytes
- windows support
- tests for gzip
- function documentation
- check for windows
- Allow/deny symlinks in general
- Telemetry call back function
- Extraction filter with unix file name patterns
- Cache input on disk (only relevant if <archive> is a zip archive that is read from a stream)
- Optionally cache input in memory instead (similar to caching on disk, only relevant for zip archives that are consumed from a stream)
- Handle passwords
- recursive extraction
- virtual fs as target
References
# Functions
GetUnpackFunction identifies the correct extractor based on magic bytes.
GetUnpackFunctionByFileName identifies the correct extractor based on file extension.
IsKnownArchiveFileExtension checks if the given file extension is a known archive file extension.
Unpack reads data from src and identifies if it's a known archive type.
ValidTypes returns a string with all available types.
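Magic-byte detection, as described for `GetUnpackFunction`, inspects the first bytes of the input instead of trusting the file name. The sketch below illustrates the technique for a few formats with well-known signatures. The `detectType` function and its table are hypothetical and are not the library's detection logic; the returned names follow the type identifiers listed in the CLI help.

```go
package main

import (
	"bytes"
	"fmt"
)

// signature pairs a type name with the bytes a file of that type starts with.
type signature struct {
	name  string
	magic []byte
}

// signatures lists a few widely documented file signatures.
var signatures = []signature{
	{"7z", []byte{0x37, 0x7a, 0xbc, 0xaf, 0x27, 0x1c}},
	{"bz2", []byte("BZh")},
	{"gz", []byte{0x1f, 0x8b}},
	{"xz", []byte{0xfd, 0x37, 0x7a, 0x58, 0x5a, 0x00}},
	{"zip", []byte("PK\x03\x04")},
	{"zst", []byte{0x28, 0xb5, 0x2f, 0xfd}},
}

// detectType returns the name of the first signature that header starts
// with, or an empty string if none matches.
func detectType(header []byte) string {
	for _, s := range signatures {
		if bytes.HasPrefix(header, s.magic) {
			return s.name
		}
	}
	return ""
}

func main() {
	fmt.Println(detectType([]byte{0x1f, 0x8b, 0x08, 0x00}))
	fmt.Println(detectType([]byte("PK\x03\x04 more data")))
	fmt.Println(detectType([]byte("plain text")) == "")
}
```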
# Constants
Available file types.
# Interfaces
Extractor is an interface that defines all functions that need to be implemented by an extraction engine.