github.com/shenwei356/seqkit/v2

# README

SeqKit - a cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Features

  • Easy to install (download)
    • Providing statically linked executable binaries for multiple platforms (Linux/Windows/macOS, amd64/arm64)
    • Lightweight and out-of-the-box: no dependencies, no compilation, no configuration
    • conda install -c bioconda seqkit
  • Easy to use
    • Ultrafast (see technical-details and benchmark)
    • Seamlessly parsing both FASTA and FASTQ formats
    • Supporting (gzip/xz/zstd/bzip2 compressed) STDIN/STDOUT and input/output files, easily integrated into pipelines
    • Reproducible results (configurable random seed in sample and shuffle)
    • Supporting custom sequence ID via regular expression
    • Supporting Bash/Zsh autocompletion
  • Versatile commands (usages and examples)
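
The "reproducible results" feature rests on a general idea: a random sampler driven by a fixed seed makes the same choices on every run. The following is a minimal Python sketch of that idea only. It is not SeqKit's implementation; the function name `sample_by_proportion`, the record IDs and the seed value are hypothetical and chosen for illustration.

```python
import random


def sample_by_proportion(records, proportion, seed=11):
    """Keep each record with probability `proportion`, using a seeded generator.

    Assumption: records are processed in a fixed order, so the same seed
    always yields the same subset.
    """
    rng = random.Random(seed)  # private generator; does not touch global state
    return [rec for rec in records if rng.random() < proportion]


# Hypothetical record IDs standing in for sequences read from a file.
ids = [f"read{i}" for i in range(1000)]

first = sample_by_proportion(ids, 0.1, seed=11)
second = sample_by_proportion(ids, 0.1, seed=11)
print(first == second)  # the two runs select identical records
```

Changing the seed gives a different but equally repeatable subset, which is why exposing the seed as an option makes downstream analyses easier to reproduce.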

Installation

Method 1: Download binaries

Go to the Download Page, which provides download links for various platforms.

Method 2: Install via Pixi

pixi global install -c bioconda seqkit

Method 3: Install via conda

conda install -c bioconda seqkit

Method 4: Install via homebrew

brew install seqkit

Subcommands

| Category | Command | Function | Input | Strand-sensitivity | Multi-threads |
|---|---|---|---|---|---|
| Basic operation | seq | Transform sequences: extract ID/seq, filter by length/quality, remove gaps… | FASTA/Q | | |
| | stats | Simple statistics: #seqs, min/max_len, N50, Q20%, Q30%… | FASTA/Q | | |
| | subseq | Get subsequences by region/gtf/bed, including flanking sequences | FASTA/Q | + or/and - | |
| | sliding | Extract subsequences in sliding windows | FASTA/Q | + only | |
| | faidx | Create the FASTA index file and extract subsequences (with more features than samtools faidx) | FASTA | + or/and - | |
| | translate | Translate DNA/RNA to protein sequence | FASTA/Q | + or/and - | |
| | watch | Monitoring and online histograms of sequence features | FASTA/Q | | |
| | scat | Real time concatenation and streaming of fastx files | FASTA/Q | | |
| Format conversion | fq2fa | Convert FASTQ to FASTA format | FASTQ | | |
| | fx2tab | Convert FASTA/Q to tabular format | FASTA/Q | | |
| | fa2fq | Retrieve corresponding FASTQ records by a FASTA file | FASTA/Q | + only | |
| | tab2fx | Convert tabular format to FASTA/Q format | TSV | | |
| | convert | Convert FASTQ quality encoding between Sanger, Solexa and Illumina | FASTA/Q | | |
| Searching | grep | Search sequences by ID/name/sequence/sequence motifs, mismatch allowed | FASTA/Q | + and - | partly, -m |
| | locate | Locate subsequences/motifs, mismatch allowed | FASTA/Q | + and - | partly, -m |
| | amplicon | Extract amplicon (or specific region around it), mismatch allowed | FASTA/Q | + and - | partly, -m |
| | fish | Look for short sequences in larger sequences | FASTA/Q | + and - | |
| Set operation | sample | Sample sequences by number or proportion | FASTA/Q | | |
| | rmdup | Remove duplicated sequences by ID/name/sequence | FASTA/Q | + and - | |
| | common | Find common sequences of multiple files by id/name/sequence | FASTA/Q | + and - | |
| | duplicate | Duplicate sequences N times | FASTA/Q | | |
| | split | Split sequences into files by id/seq region/size/parts (mainly for FASTA) | FASTA preferred | | |
| | split2 | Split sequences into files by size/parts (FASTA, PE/SE FASTQ) | FASTA/Q | | |
| | head | Print first N FASTA/Q records | FASTA/Q | | |
| | head-genome | Print sequences of the first genome with common prefixes in name | FASTA/Q | | |
| | range | Print FASTA/Q records in a range (start:end) | FASTA/Q | | |
| | pair | Patch up paired-end reads from two fastq files | FASTA/Q | | |
| Edit | replace | Replace name/sequence by regular expression | FASTA/Q | + only | |
| | rename | Rename duplicated IDs | FASTA/Q | | |
| | concat | Concatenate sequences with same ID from multiple files | FASTA/Q | + only | |
| | restart | Reset start position for circular genome | FASTA/Q | + only | |
| | mutate | Edit sequence (point mutation, insertion, deletion) | FASTA/Q | + only | |
| | sana | Sanitize broken single line FASTQ files | FASTQ | | |
| Ordering | sort | Sort sequences by id/name/sequence/length | FASTA preferred | | |
| | shuffle | Shuffle sequences | FASTA preferred | | |
| BAM processing | bam | Monitoring and online histograms of BAM record features | BAM | | |
| Miscellaneous | sum | Compute message digest for all sequences in FASTA/Q files | FASTA/Q | | |
| | merge-slides | Merge sliding windows generated from seqkit sliding | TSV | | |

Notes:

  • Strand-sensitivity:
    • + only: only processing on the positive/forward strand.
    • + and -: searching on both strands.
    • + or/and -: depends on users' flags/options/arguments.
  • Multi-threads: the default of 4 threads is fast enough for most commands; some commands can benefit from extra threads.
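
To make the strand-sensitivity labels concrete, here is a minimal Python sketch of what searching "+ only" versus "+ and -" means for an exact motif. It is an illustration under simplifying assumptions (plain A/C/G/T bases, no mismatches, no IUPAC codes) and not SeqKit's implementation; `reverse_complement` and `find_motif` are hypothetical helper names.

```python
# Complement table for unambiguous DNA bases (assumption: no IUPAC codes).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_motif(seq, motif, both_strands=True):
    """Return (strand, 0-based start on the forward strand) for exact matches.

    both_strands=False mimics "+ only"; both_strands=True mimics "+ and -".
    """
    targets = [("+", motif)]
    if both_strands:
        rc = reverse_complement(motif)
        if rc != motif:  # palindromic motif: skip to avoid duplicate hits
            targets.append(("-", rc))

    hits = []
    for strand, pattern in targets:
        start = seq.find(pattern)
        while start != -1:
            hits.append((strand, start))
            start = seq.find(pattern, start + 1)  # allow overlapping matches
    return hits


# The motif occurs only as its reverse complement in this sequence.
print(find_motif("TTACGTACGT", "ACGTACGTAA"))                      # [('-', 0)]
print(find_motif("TTACGTACGT", "ACGTACGTAA", both_strands=False))  # []
```

A match on the minus strand is found by searching the forward sequence for the motif's reverse complement, which is why a "+ only" search can miss records that a "+ and -" search reports.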

Citation

Wei Shen*, Botond Sipos, and Liuyang Zhao. 2024. SeqKit2: A Swiss Army Knife for Sequence and Alignment Processing. iMeta e191. doi:10.1002/imt2.191.

Acknowledgements

We thank all users for their valuable feedback and suggestions. We thank all contributors for improving the code and documentation.

We thank Klaus Post for his fantastic packages (compress and pgzip), which accelerate gzip file reading and writing.

Contact

Create an issue to report bugs, propose new functions or ask for help.

License

MIT License
