# README

goreadability

goreadability is a tool for extracting the primary readable content of a webpage. It is a Go port of arc90's readability project, based on ruby-readability.

Since v2.0, goreadability uses opengraph tag values when they exist. You can disable the opengraph lookup and follow the traditional readability rules by setting Option.LookupOpenGraphTags to false.
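The selection rule described above can be pictured with a small standalone sketch. It is not the library's implementation: the `pageMeta` type and the `chooseTitle` function are hypothetical names invented for illustration, and only the `lookupOpenGraphTags` flag mirrors the documented Option.LookupOpenGraphTags setting.

```go
package main

import "fmt"

// pageMeta is a hypothetical container for two candidate titles of a page.
// It is NOT part of goreadability.
type pageMeta struct {
	OGTitle        string // value of the og:title meta tag; empty if the tag is absent
	ExtractedTitle string // title found by the traditional readability rules
}

// chooseTitle prefers the opengraph value when lookup is enabled and the tag
// exists; otherwise it falls back to the traditionally extracted title.
func chooseTitle(m pageMeta, lookupOpenGraphTags bool) string {
	if lookupOpenGraphTags && m.OGTitle != "" {
		return m.OGTitle
	}
	return m.ExtractedTitle
}

func main() {
	m := pageMeta{OGTitle: "Lego (opengraph)", ExtractedTitle: "Lego (extracted)"}

	fmt.Println(chooseTitle(m, true))  // lookup enabled: opengraph value wins
	fmt.Println(chooseTitle(m, false)) // lookup disabled: traditional rules only

	// With no opengraph tag present, the extracted title is used either way.
	fmt.Println(chooseTitle(pageMeta{ExtractedTitle: "Lego (extracted)"}, true))
}
```

In the real library the equivalent switch is the Option.LookupOpenGraphTags field on the value returned by NewOption; how the library combines the two sources internally may differ from this sketch.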

Install

go get github.com/philipjkim/goreadability

Example

package main

import (
	"log"

	readability "github.com/philipjkim/goreadability"
)

func main() {
	// URL to extract contents (title, description, images, ...)
	url := "https://en.wikipedia.org/wiki/Lego"

	// Default option
	opt := readability.NewOption()

	// You can modify some option values if needed.
	opt.ImageRequestTimeout = 3000 // ms

	content, err := readability.Extract(url, opt)
	if err != nil {
		log.Fatal(err)
	}

	log.Println(content.Title)
	log.Println(content.Description)
	log.Println(content.Images)
}

Testing

go test

# or if you want to see verbose logs:
DEBUG=true go test -v

Command Line Tool

TODO

Related Projects

ruby-readability is the base of this project.
fastimage finds the type and/or size of a remote image given its URI, fetching as little data as needed.

Potential Issues

TODO

License

MIT

# Functions

Debug

Debug enables debug logging of the operations done by the library.

Extract

Extract sends a request to reqURL and returns the content extracted from the response.

ExtractFromDocument

ExtractFromDocument returns a Content when extraction succeeds; otherwise it returns an error.

NewOption

NewOption returns the default option.

# Structs

Content

Content contains the primary readable content of a webpage.

Image

Image contains the URL and Size (width and height in pixels) of an image.
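To make the idea of pairing a URL with pixel dimensions concrete, here is a minimal standalone sketch that reads the width and height from an image header using the standard library's image.DecodeConfig, which does not decode the pixel data. The `imageInfo` type and the `probeSize` and `samplePNG` helpers are hypothetical names for illustration; they are not the library's Image struct, and the library may determine sizes differently.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/png" // importing this package also registers the PNG decoder
)

// imageInfo is a hypothetical stand-in for a URL plus its size in pixels.
// It is NOT the goreadability Image struct.
type imageInfo struct {
	URL    string
	Width  int // pixels
	Height int // pixels
}

// probeSize reads only the image header to find its dimensions.
// It returns an error if the data is not in a registered image format.
func probeSize(url string, data []byte) (imageInfo, error) {
	cfg, _, err := image.DecodeConfig(bytes.NewReader(data))
	if err != nil {
		return imageInfo{}, err
	}
	return imageInfo{URL: url, Width: cfg.Width, Height: cfg.Height}, nil
}

// samplePNG builds a blank w-by-h PNG in memory so the example needs no network.
func samplePNG(w, h int) []byte {
	var buf bytes.Buffer
	if err := png.Encode(&buf, image.NewRGBA(image.Rect(0, 0, w, h))); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	info, err := probeSize("https://example.com/a.png", samplePNG(4, 3))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s: %dx%d\n", info.URL, info.Width, info.Height)
}
```

The example uses an in-memory PNG and a placeholder URL so it runs offline; with a remote image, the same call could be fed from a response body so that only the leading bytes need to be read.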

OpenGraph

OpenGraph contains opengraph meta values.

Option

Option contains a variety of options for extracting page content and images.