Categorygithub.com/pbenner/gonetics
repositorypackage
1.0.0
Repository: https://github.com/pbenner/gonetics.git
Documentation: pkg.go.dev

# Packages

No description provided by the author
No description provided by the author

# README

Gonetics

Gonetics is a bioinformatics library for the Go programming language (golang). It provides native data structures for handling genetic data and methods for handling common file formats such as BAM, GTF, BED, BigWig, and Wig. The documentation is available here.

Tools

Executables are available here.

ToolDescription
bamCheckBincheck bin records of a bam file
bamGenomeprint the genome (sequence table) of a bam file
bamToBigWigconvert bam to bigWig (estimate fragment length if required)
bamViewprint contents of a bam file
bigWigEditChromNamesedit chromosome names of a bigWig file (i.e. replace chr1 by just 1)
bigWigExtractextract regions from a bigWig file and save them as table or bigWig file
bigWigExtractChromsextract a subset of the chromosomes from a bigWig file
bigWigGenomeprint the genome (sequence table) of a bigWig file
bigWigHistogramcompute a histogram of the values in a bigWig file
bigWigNilread bigWig and output it to a new file
bigWigMapapply a function to a set of bigWig files
bigWigPositivesimple peak finding (i.e. every region with a value above a threshold)
bigWigQuantileNormalizequantile normalize a bigWig file to a reference
bigWigQueryretrieve data from a bigWig file
bigWigQuerySequenceretrieve sequences from a bigWig file
bigWigStatisticscompute summary statistics of a bigWig file
chromHmmTablesToBigWigconvert chromHmm output (posteriors / binariezed bams) to bigWig
countKmerscount kmers in a set of DNA sequences
drawGenomicRegionsdraw random genomic regions
fastaExtractextract regions from a fasta file
fastaUnresolvedRegionsidentify regions that are not resolved (i.e. stretches of 'NNNN...')
gtfToBedconvert GTF files to Bed6 format
memeExtractextract PWM/PPM motifs from MEME/DREME xml files
observedOverExpectedCpGcompute CpG scores as defined by Gardiner-Garden and Frommer (1987)
pwmScanSequencesscan sequences for PWM hits
pwmScanRegionsscan regions for multiple PWMs
segmentationDifferentialextract differential regions from multiple chromatin segmentations

GRanges

Create a GRanges object with three ranges on the first chromosome:

  seqnames := []string{"chr1", "chr1", "chr1"}
  from     := []int{100000266, 100000271, 100000383}
  to       := []int{100000291, 100000296, 100000408}
  strand   := []byte{'+', '+', '-'}

  granges  := NewGRanges(seqnames, from, to, strand)
  fmt.Println(granges)
  seqnames                 ranges strand
1     chr1 [100000266, 100000291)      +
2     chr1 [100000271, 100000296)      +
3     chr1 [100000383, 100000408)      -

Add some meta data to the GRanges object:

  granges.AddMeta("data", []float64{1.0, 2.0, 3.0})
  seqnames                 ranges strand |          data
1     chr1 [100000266, 100000291)      + |      1.000000
2     chr1 [100000271, 100000296)      + |      2.000000
3     chr1 [100000383, 100000408)      - |      3.000000

Find overlaps of two GRanges objects:

  rSubjects := NewGRanges(
    []string{"chr4", "chr4", "chr4", "chr4"},
    []int{100, 200, 300, 400},
    []int{150, 250, 350, 450},
    []byte{})
  rQuery := NewGRanges(
    []string{"chr1", "chr4", "chr4", "chr4", "chr4", "chr4"},
    []int{100, 110, 190, 340, 390, 450},
    []int{150, 120, 220, 360, 400, 500},
    []byte{})

  queryHits, subjectHits := FindOverlaps(rQuery, rSubjects)
  queryHits: [1 2 3 4 5]
subjectHits: [0 1 2 3 3]

Genes

Download gene list from UCSC and export it to file:

  genes := ImportGenesFromUCSC("hg19", "ensGene")
  genes.WriteTable("hg19.knownGene.txt", true, false)
  fmt.Println(genes)
                 names seqnames          transcripts                  cds strand
     1 ENST00000456328     chr1 [   11868,    14409) [   14409,    14409)      +
     2 ENST00000515242     chr1 [   11871,    14412) [   14412,    14412)      +
     3 ENST00000518655     chr1 [   11873,    14409) [   14409,    14409)      +
     4 ENST00000450305     chr1 [   12009,    13670) [   13670,    13670)      +
     5 ENST00000423562     chr1 [   14362,    29370) [   29370,    29370)      -
                   ...      ...                  ...                  ...       
204936 ENST00000420810     chrY [28695571, 28695890) [28695890, 28695890)      +
204937 ENST00000456738     chrY [28732788, 28737748) [28737748, 28737748)      -
204938 ENST00000435945     chrY [28740997, 28780799) [28780799, 28780799)      -
204939 ENST00000435741     chrY [28772666, 28773306) [28773306, 28773306)      -
204940 ENST00000431853     chrY [59001390, 59001635) [59001635, 59001635)      +

Import expression data from a GTF file:

  genes.ImportGTF("genesExpr_test.gtf.gz", "transcript_id", "FPKM", false)
                 names seqnames          transcripts                  cds strand |          expr
     1 ENST00000456328     chr1 [   11868,    14409) [   14409,    14409)      + |      0.073685
     2 ENST00000515242     chr1 [   11871,    14412) [   14412,    14412)      + |      0.000000
     3 ENST00000518655     chr1 [   11873,    14409) [   14409,    14409)      + |      0.000000
     4 ENST00000450305     chr1 [   12009,    13670) [   13670,    13670)      + |      0.000000
     5 ENST00000423562     chr1 [   14362,    29370) [   29370,    29370)      - |     10.413931
                   ...      ...                  ...                  ...        |           ...
204936 ENST00000420810     chrY [28695571, 28695890) [28695890, 28695890)      + |      0.000000
204937 ENST00000456738     chrY [28732788, 28737748) [28737748, 28737748)      - |      0.000000
204938 ENST00000435945     chrY [28740997, 28780799) [28780799, 28780799)      - |      0.000000
204939 ENST00000435741     chrY [28772666, 28773306) [28773306, 28773306)      - |      0.000000
204940 ENST00000431853     chrY [59001390, 59001635) [59001635, 59001635)      + |      0.000000

Peaks

Import peaks from a MACS xls file:

  peaks := ImportXlsPeaks("peaks_test.xls")
   seqnames             ranges strand |  abs_summit     pileup -log10(pvalue) fold_enrichment -log10(qvalue)
 1       2L [   5757,    6001)      * |        5865  33.000000      19.809300        6.851880      17.722200
 2       2L [  47233,   47441)      * |       47354  36.000000      19.648200        6.263150      17.566200
 3       2L [  66379,   67591)      * |       66957 252.000000     350.151250       50.986050     346.525450
 4       2L [  72305,   72838)      * |       72525 170.000000     208.558240       34.460930     205.734390
 5       2L [  72999,   73218)      * |       73130  25.000000      12.711700        5.239670      10.700880
        ...                ...        |         ...        ...            ...             ...            ...
12       2R [3646319, 3646794)      * |     3646442  37.000000      23.176910        7.455710      21.063850
13       2R [3666770, 3668041)      * |     3667119 215.000000     279.229060       41.551060     276.108090
14       2R [3668231, 3668441)      * |     3668363  22.000000       9.943110        4.476950       7.976070
15       2R [3670063, 3670393)      * |     3670180  38.000000      19.474590        5.901360      17.393440
16       2R [3670470, 3670927)      * |     3670719 227.000000     305.243350       45.831180     301.974760

Track

Import ChIP-seq reads from bed files and create a track with the normalized signal:

  fmt.Fprintf(os.Stderr, "Parsing reads (treatment) ...\n")
  treatment1 := GRanges{}
  treatment1.ImportBed6("SRR094207.bed")
  treatment2 := GRanges{}
  treatment2.ImportBed6("SRR094208.bed")
  fmt.Fprintf(os.Stderr, "Parsing reads (control)   ...\n")
  control1   := GRanges{}
  control1.ImportBed6("SRR094215.bed")
  control2   := GRanges{}
  control2.ImportBed6("SRR094216.bed")

  genome  := Genome{}
  genome.Import("Data/hg19.genome")
  d       := 200 // d=200 (see *_peaks.xls)
  binsize := 100 // binsize of the track
  pcounts := 1   // pseudocounts
  track := NormalizedTrack("H3K4me3",
    []GRanges{treatment1, treatment2}, []GRanges{control1, control2},
    genome, d, binsize, pcounts, pcounts, false)

Export track to wig or bigWig:

  track.WriteWiggle("track.wig", "track description")
  track.WriteBigWig("track.bw",  "track description")

BigWig Files

BigWig files contain data in a binary format optimized for fast random access. In addition to the raw data, bigWig files typically contain several zoom levels for which the data has been summarized. The BigWigReader class allows to query data and it automatically selects an appropriate zoom level for the given binsize:

  reader, err := NewBigWigReader("test.bw")
  if err != nil {
    log.Fatal(err)
  }
  // query details
  seqname := "chr4" // (regular expression)
  from    := 11774000
  to      := 11778000
  binsize := 20

  for record := range reader.Query(seqname, from, to, binsize) {
    if record.Error != nil {
      log.Fatalf("reading bigWig failed: %v", record.Error)
    }
    fmt.Println(record)
  }