github.com/Above-Os/article-dynamic-extractor

# README

Article-Extractor

Article-Extractor is a Go package that finds the main readable content and the metadata of an HTML page. It works by removing clutter such as buttons, ads, background images, and scripts.

This package is based on [Readability.js] by [Mozilla] and on [omnivore].

For some websites, site-specific configuration templates are used to improve the accuracy of the extractor.

Installation

To install this package, run go get:

go get github.com/beclab/article-extractor

Usage

To get the readable content from a URL, use processor.ArticleReadabilityExtractor. It fetches the web page from the specified URL, checks whether it is readable, then parses the response to find the readable content.

| Input parameter | Description |
| --- | --- |
| rawContent | Raw content of the page |
| entryUrl | URL of the entry |
| feedUrl | Feed URL; pass "" if no value is available |
| rules | Custom parsing rules |
| isrecommend | Reserved parameter, not used yet |

| Output parameter | Description |
| --- | --- |
| content | Content of the page |
| pureContent | Pure content |
| publishedDate | Published date, parsed by readability |
| image | Cover image of the page |
| title | Title of the page |
| author | Author of the page, parsed by templates |
| byline | Byline, parsed by readability |
| publishedAtTimeStamp | Published timestamp, parsed by templates |

To get the published date, use the publishedAtTimeStamp field first if its value is not empty. Likewise, to get the author of the article, use the author field first if its value is not empty.
