Module: github.com/philipjkim/goreadability
Version: 0.0.0-20190422094628-0f3b4a11b312
Repository: https://github.com/philipjkim/goreadability.git
Documentation: pkg.go.dev

# README

goreadability

goreadability is a tool for extracting the primary readable content of a webpage. It is a Go port of arc90's readability project, based on ruby-readability.

Since v2.0, goreadability uses Open Graph tag values when they exist. You can disable the Open Graph lookup and fall back to the traditional readability rules by setting Option.LookupOpenGraphTags to false.
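To make "uses Open Graph tag values when they exist" concrete, here is a minimal standard-library sketch of such a lookup with a fallback. It is not goreadability's implementation: the names openGraphValues and preferOpenGraph, the regular expression, and the sample HTML are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// metaTag matches <meta property="og:NAME" content="VALUE"> with the
// attributes in exactly this order and double-quoted. Real pages vary,
// so a production lookup should use an HTML parser instead of a regexp.
var metaTag = regexp.MustCompile(`<meta\s+property="og:([a-z:_]+)"\s+content="([^"]*)"`)

// openGraphValues (hypothetical helper) maps each Open Graph property
// name, without the "og:" prefix, to its first content value.
func openGraphValues(html string) map[string]string {
	values := make(map[string]string)
	for _, m := range metaTag.FindAllStringSubmatch(html, -1) {
		if _, seen := values[m[1]]; !seen {
			values[m[1]] = m[2]
		}
	}
	return values
}

// preferOpenGraph (hypothetical helper) returns the Open Graph value when
// it exists and the fallback from other extraction rules otherwise.
func preferOpenGraph(ogValue, fallback string) string {
	if ogValue != "" {
		return ogValue
	}
	return fallback
}

func main() {
	// Made-up sample page; only the title has an Open Graph tag.
	page := `<html><head>
<meta property="og:title" content="Lego">
<title>Lego - Wikipedia</title>
</head><body></body></html>`

	og := openGraphValues(page)
	fmt.Println(preferOpenGraph(og["title"], "Lego - Wikipedia"))
	fmt.Println(preferOpenGraph(og["description"], "description from page text"))
}
```

With the sample page above, the title comes from the og:title tag, while the description falls back because no og:description tag is present.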

Install

go get github.com/philipjkim/goreadability

Example

package main

import (
    "log"

    readability "github.com/philipjkim/goreadability"
)

func main() {
    // URL to extract contents from (title, description, images, ...)
    url := "https://en.wikipedia.org/wiki/Lego"

    // Start from the default options.
    opt := readability.NewOption()

    // Modify option values if needed.
    opt.ImageRequestTimeout = 3000 // ms

    content, err := readability.Extract(url, opt)
    if err != nil {
        log.Fatal(err)
    }

    log.Println(content.Title)
    log.Println(content.Description)
    log.Println(content.Images)
}

Testing

go test

# or if you want to see verbose logs:
DEBUG=true go test -v

Command Line Tool

TODO

Related Projects

  • ruby-readability is the base of this project.
  • fastimage finds the type and/or size of a remote image given its uri, by fetching as little as needed.
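The idea behind fastimage, learning an image's type and size while reading as little of it as possible, can be sketched with the Go standard library's image.DecodeConfig, which decodes only the image header. This is an illustration of the technique, not fastimage's or goreadability's code; encodeSamplePNG and probe are hypothetical helpers, and the image is generated in memory so the example needs no network access.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/png" // also registers the PNG decoder used by image.DecodeConfig
	"log"
)

// encodeSamplePNG (hypothetical helper) builds a blank PNG in memory.
func encodeSamplePNG(width, height int) ([]byte, error) {
	var buf bytes.Buffer
	img := image.NewRGBA(image.Rect(0, 0, width, height))
	if err := png.Encode(&buf, img); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// probe (hypothetical helper) reports the format name, width, and height
// of an encoded image without decoding its pixel data.
func probe(data []byte) (string, int, int, error) {
	cfg, format, err := image.DecodeConfig(bytes.NewReader(data))
	if err != nil {
		return "", 0, 0, err
	}
	return format, cfg.Width, cfg.Height, nil
}

func main() {
	data, err := encodeSamplePNG(640, 480)
	if err != nil {
		log.Fatal(err)
	}

	format, width, height, err := probe(data)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s %dx%d\n", format, width, height) // prints "png 640x480"
}
```

For a remote image, the same call could take an HTTP response body as its reader, so that only the leading bytes are consumed before the connection is closed. How many bytes are actually transferred depends on the HTTP client's buffering.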

Potential Issues

TODO

License

MIT

# Functions

Debug enables debug logging of the operations done by the library.
Extract sends a request to reqURL and returns the content extracted from the response.
ExtractFromDocument returns a Content when extraction succeeds, and an error otherwise.
NewOption returns the default option.

# Structs

Content contains the primary readable content of a webpage.
Image contains a URL and a Size (width and height in pixels).
OpenGraph contains Open Graph meta tag values.
Option contains a variety of options for extracting page content and images.