Module: github.com/advancedlogic/GoOse
Version: 0.0.0-20231203033844-ae6b36caf275
Repository: https://github.com/advancedlogic/goose.git
Documentation: pkg.go.dev

# README

GoOse

HTML Content / Article Extractor in Golang


Description

This is a Golang port of "Goose", originally licensed to Gravity.com under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership.

The Golang port was written by Antonio Linari.

Gravity.com licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

INSTALL

go get github.com/advancedlogic/GoOse

HOW TO USE IT

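package main

import (
	"fmt"
	"log"

	// The package name is goose, so alias the import explicitly.
	goose "github.com/advancedlogic/GoOse"
)

func main() {
	g := goose.New()
	// Fetch the page and extract the article; check the error before using the result.
	article, err := g.ExtractFromURL("http://edition.cnn.com/2012/07/08/opinion/banzi-ted-open-source/index.html")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("title", article.Title)
	fmt.Println("description", article.MetaDescription)
	fmt.Println("keywords", article.MetaKeywords)
	fmt.Println("content", article.CleanedText)
	fmt.Println("url", article.FinalURL)
	fmt.Println("top image", article.TopImage)
}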

Development - Getting started

This application is written in Go; please refer to the guides at https://golang.org to get started.

This project includes a Makefile that lets you test and build the project with simple commands. To see all available options:

make help

Before committing code, please check that it passes all tests by running:

make deps
make qa

TODO

  • better organize code
  • improve "xpath" like queries
  • add other image extraction techniques (imagemagick)

THANKS TO

@Martin Angers for goquery
@Fatih Arslan for set
GoLang team for the amazing language and net/html

# Functions

GetDefaultConfiguration returns safe default configuration options.
New returns a new instance of the article extractor.
NewCleaner returns a new instance of a Cleaner.
NewCrawler returns a crawler object initialised with the URL and the [optional] raw HTML body.
NewExtractor returns a configured HTML parser.
NewParser returns an HTML parser.
NewStopwords returns an instance of a stop words detector.
NewVideoExtractor returns a new instance of an HTML video extractor.
NewWithConfig returns a new instance of the article extractor with configuration.
NormaliseCharset overrides/fixes charset names to something we can parse.
OpenGraphResolver returns OpenGraph properties.
ReadLinesOfFile returns the lines from a file as a slice of strings.
UTF8encode converts a string from the source character set to UTF-8, skipping invalid byte sequences (see http://stackoverflow.com/questions/32512500/ignore-illegal-bytes-when-decoding-text-with-go).
WebPageImageResolver fetches all candidate images from the HTML page.
WebPageResolver fetches the main image from the HTML page.
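
The "skipping invalid byte sequences" behaviour described for UTF8encode can be illustrated with the standard library alone. The sketch below is not the package's implementation: dropInvalidUTF8 is a hypothetical helper, and it assumes the input is already meant to be UTF-8, so it performs no conversion from another character set.

```go
package main

import (
	"fmt"
	"strings"
)

// dropInvalidUTF8 is a hypothetical helper (not part of GoOse).
// It removes every byte sequence that is not valid UTF-8 and
// leaves the valid parts of the string untouched.
func dropInvalidUTF8(s string) string {
	// An empty replacement means invalid runs are simply dropped.
	return strings.ToValidUTF8(s, "")
}

func main() {
	// "\xff\xfe" is not valid UTF-8; "\xc3\xa9" is the encoding of "é".
	raw := "caf\xc3\xa9 \xff\xfeok"
	fmt.Println(dropInvalidUTF8(raw))
}
```

Dropping bytes silently loses information, so a caller that needs to see where the damage was could pass "\uFFFD" as the replacement instead of an empty string.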

# Structs

Article is a collection of properties extracted from the HTML body.
Cleaner removes menus, ads, sidebars, etc.
Configuration is a wrapper for various config options.
ContentExtractor can parse the HTML and fetch various properties.
Crawler can fetch the target HTML page.
Goose is the main entry point of the program.
Parser is an HTML parser specialised in extraction of main content and other properties.
StopWords implements a simple language detector.
VideoExtractor can extract the main video from an HTML page.
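
To give a rough idea of how a stop-word-based language detector such as StopWords might work, here is a minimal sketch that counts stop-word hits per language and picks the highest count. Everything in it is an assumption made for illustration: the word lists, the languages slice and the guessLanguage function are hypothetical and do not reflect the package's actual data or API.

```go
package main

import (
	"fmt"
	"strings"
)

// stopWords is a tiny illustrative table, not the library's word lists.
var stopWords = map[string]map[string]bool{
	"en": {"the": true, "and": true, "is": true, "of": true, "to": true},
	"es": {"el": true, "la": true, "y": true, "es": true, "de": true},
}

// languages fixes the iteration order so that ties resolve deterministically.
var languages = []string{"en", "es"}

// guessLanguage (hypothetical) returns the code of the language whose
// stop words occur most often in text, or "" when none match.
func guessLanguage(text string) string {
	words := strings.Fields(strings.ToLower(text))
	best, bestCount := "", 0
	for _, lang := range languages {
		count := 0
		for _, w := range words {
			// Strip common punctuation before the lookup.
			if stopWords[lang][strings.Trim(w, ".,;:!?\"'()")] {
				count++
			}
		}
		if count > bestCount {
			best, bestCount = lang, count
		}
	}
	return best
}

func main() {
	fmt.Println(guessLanguage("The cat is on the roof of the house."))
	fmt.Println(guessLanguage("El gato es de la casa y el perro."))
}
```

With word lists this small the result is only reliable for clear-cut input; a realistic detector would need much larger lists per language.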

# Interfaces

No description provided by the author