github.com/advancedlogic/GoOse
Category: repository
Version: 0.0.0-20250803031130-717927370fc8
Repository: https://github.com/advancedlogic/goose.git
Documentation: pkg.go.dev

# README

GoOse

HTML Content / Article Extractor in Go

Description

GoOse is a Go library and command-line tool that extracts article content and metadata from HTML pages. It is a Go port of the original "Goose" library, rewritten around Go modules and the standard Go project layout.

Key Features:

  • 🚀 Extract clean article text from web pages
  • 📰 Extract article metadata (title, description, keywords, images)
  • 🖼️ Advanced image extraction and top image detection
  • 🎥 Video content detection and extraction
  • 🌐 Multi-language support with stopwords
  • 🔧 Command-line interface for easy integration
  • 📦 Clean library API for programmatic use
  • ⚡ High performance with concurrent processing support

Originally licensed to Gravity.com under the Apache License 2.0. Go port written by Antonio Linari.

Installation

As a Library

go get github.com/advancedlogic/GoOse

As a CLI Tool

# Install directly
go install github.com/advancedlogic/GoOse/cmd/goose@latest

# Or build from source
git clone https://github.com/advancedlogic/GoOse.git
cd GoOse
make build
# Binary will be available at ./bin/goose

Quick Start

Command Line Usage

# Extract article from URL (text output)
goose convert https://example.com/article

# Extract article with JSON output
goose convert https://example.com/article --format json

# Save output to file
goose convert https://example.com/article --output article.txt

# Show version
goose version

# Show help
goose help
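
To call the CLI from another Go program, one option is to shell out with os/exec. The sketch below is an illustration rather than part of GoOse: convertArgs and runConvert are hypothetical helper names, and only the convert subcommand and the --format and --output flags shown above are assumed to exist. It also assumes the goose binary is on PATH.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// convertArgs builds the argument list for `goose convert`, using only the
// flags shown in the commands above. Empty format or output values are omitted.
// (Hypothetical helper, not part of GoOse.)
func convertArgs(url, format, output string) []string {
	args := []string{"convert", url}
	if format != "" {
		args = append(args, "--format", format)
	}
	if output != "" {
		args = append(args, "--output", output)
	}
	return args
}

// runConvert runs the goose binary and returns whatever it writes to stdout.
// (Hypothetical helper; assumes `goose` is installed and on PATH.)
func runConvert(ctx context.Context, url, format string) ([]byte, error) {
	cmd := exec.CommandContext(ctx, "goose", convertArgs(url, format, "")...)
	return cmd.Output()
}

func main() {
	const url = "https://example.com/article"

	// Without the binary installed, just show the command that would run.
	if _, err := exec.LookPath("goose"); err != nil {
		fmt.Println("goose not found on PATH; would run: goose", convertArgs(url, "json", ""))
		return
	}

	// Bound the whole invocation so a slow site cannot hang the caller.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	out, err := runConvert(ctx, url, "json")
	if err != nil {
		fmt.Println("goose failed:", err)
		return
	}
	fmt.Printf("%s", out)
}
```

The structure of the JSON output is not described here, so the sketch returns the raw bytes instead of decoding them into a struct.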

Library Usage

package main

import (
	"fmt"
	"log"

	"github.com/advancedlogic/GoOse/pkg/goose"
)

func main() {
	// Create a new GoOse instance
	g := goose.New()
	
	// Extract from URL
	article, err := g.ExtractFromURL("https://edition.cnn.com/2012/07/08/opinion/banzi-ted-open-source/index.html")
	if err != nil {
		log.Fatal(err)
	}

	// Print extracted content
	fmt.Println("Title:", article.Title)
	fmt.Println("Description:", article.MetaDescription)
	fmt.Println("Keywords:", article.MetaKeywords)
	fmt.Println("Content:", article.CleanedText)
	fmt.Println("URL:", article.FinalURL)
	fmt.Println("Top Image:", article.TopImage)
	fmt.Println("Authors:", article.Authors)
	fmt.Println("Publish Date:", article.PublishDate)
}

Advanced Configuration

package main

import (
	"log"

	"github.com/advancedlogic/GoOse/pkg/goose"
)

func main() {
	// Create configuration
	config := goose.Configuration{
		Debug:          false,
		TargetLanguage: "en",
		UserAgent:      "MyApp/1.0",
		Timeout:        30, // seconds
	}

	// Create GoOse with custom configuration
	g := goose.NewWithConfig(config)

	// Extract from raw HTML, passing the URL the HTML belongs to
	html := "<html><body><article><h1>Title</h1><p>Content...</p></article></body></html>"
	article, err := g.ExtractFromRawHTML(html, "https://example.com")
	if err != nil {
		// Stop on extraction errors instead of silently ignoring them
		log.Fatal(err)
	}

	// Use the extracted article
	log.Println("Title:", article.Title)
}
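
ExtractFromRawHTML is useful when you want to control the download yourself, for example to apply your own timeout, User-Agent, or size limit. The following is a minimal standard-library sketch of that fetch step. fetchHTML and demo are hypothetical names, and the 1 MiB cap is an arbitrary assumption. The demo serves the page from a local test server so the code runs without network access.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// fetchHTML downloads url with the given User-Agent and returns the body as a
// string. Non-2xx responses are reported as errors, and at most maxBytes of
// the body are read. (Hypothetical helper, not part of GoOse.)
func fetchHTML(ctx context.Context, client *http.Client, url, userAgent string, maxBytes int64) (string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("User-Agent", userAgent)

	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return "", fmt.Errorf("fetch %s: unexpected status %s", url, resp.Status)
	}
	body, err := io.ReadAll(io.LimitReader(resp.Body, maxBytes))
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// demo serves a fixed page from a local test server and fetches it back.
func demo() (string, error) {
	page := "<html><body><article><h1>Title</h1><p>Content...</p></article></body></html>"
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Reject requests that do not carry the expected User-Agent.
		if r.Header.Get("User-Agent") != "MyApp/1.0" {
			http.Error(w, "unexpected user agent", http.StatusForbidden)
			return
		}
		fmt.Fprint(w, page)
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 30 * time.Second}
	return fetchHTML(context.Background(), client, srv.URL, "MyApp/1.0", 1<<20)
}

func main() {
	html, err := demo()
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	// html can now be handed to g.ExtractFromRawHTML(html, url).
	fmt.Println(len(html), "bytes fetched")
}
```

The returned string is what you would pass as the first argument to ExtractFromRawHTML, together with the URL it was fetched from.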

Project Structure

GoOse follows standard Go project layout:

├── cmd/goose/          # CLI application
├── pkg/goose/          # Public library API
├── internal/           # Private application code
│   ├── crawler/        # Web crawling logic
│   ├── extractor/      # Content extraction
│   ├── parser/         # HTML parsing utilities
│   ├── types/          # Shared data types
│   └── utils/          # Utility functions
├── docs/               # Documentation
├── sites/              # Test HTML files
└── Makefile            # Build automation

Development

Prerequisites

  • Go 1.21 or later
  • Make (for build automation)

Getting Started

  1. Clone the repository:

    git clone https://github.com/advancedlogic/GoOse.git
    cd GoOse
    
  2. Install dependencies:

    make deps
    
  3. Build the project:

    make build
    
  4. Run tests:

    make test
    
  5. Run all quality checks:

    make qa
    

Available Make Commands

make help          # Show all available commands
make build         # Build the CLI binary
make install       # Install CLI to GOPATH/bin
make test          # Run all tests
make test-race     # Run tests with race detection
make coverage      # Generate coverage report
make format        # Format source code
make lint          # Run linters
make qa            # Run all quality checks
make clean         # Clean build artifacts
make tidy          # Clean up go.mod and go.sum

Development Workflow

  1. Make changes to the code
  2. Run make format to format your code
  3. Run make qa to ensure quality
  4. Run make test to verify functionality
  5. Commit your changes

API Reference

Main Types

  • goose.Goose - Main extractor instance
  • goose.Article - Extracted article data
  • goose.Configuration - Extractor configuration

Key Methods

  • goose.New() - Create new extractor with default config
  • goose.NewWithConfig(config) - Create extractor with custom config
  • ExtractFromURL(url) - Extract article from URL
  • ExtractFromRawHTML(html, url) - Extract from HTML string

For complete API documentation, run:

go doc github.com/advancedlogic/GoOse/pkg/goose
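
For processing many URLs, a small worker pool around ExtractFromURL is one possible approach. The sketch below is generic and uses only the standard library: extractAll and result are hypothetical names, and the extract function stands in for a call such as g.ExtractFromURL. Whether a single Goose instance is safe for concurrent use is not stated here, so creating one instance per worker is the conservative assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// result pairs an input URL with the outcome of extracting it.
type result[T any] struct {
	URL   string
	Value T
	Err   error
}

// extractAll runs extract over urls using a fixed number of workers and
// returns the results in the same order as the input.
// (Hypothetical helper, not part of GoOse.)
func extractAll[T any](urls []string, workers int, extract func(string) (T, error)) []result[T] {
	if workers < 1 {
		workers = 1
	}
	results := make([]result[T], len(urls))
	jobs := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each index is delivered to exactly one worker, so writes to
			// results never target the same element twice.
			for i := range jobs {
				v, err := extract(urls[i])
				results[i] = result[T]{URL: urls[i], Value: v, Err: err}
			}
		}()
	}
	for i := range urls {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	// fakeExtract stands in for a real call such as g.ExtractFromURL(url).
	fakeExtract := func(url string) (string, error) {
		if url == "" {
			return "", fmt.Errorf("empty URL")
		}
		return "article from " + url, nil
	}

	urls := []string{"https://example.com/a", "", "https://example.com/b"}
	for _, r := range extractAll(urls, 2, fakeExtract) {
		if r.Err != nil {
			fmt.Printf("%q: error: %v\n", r.URL, r.Err)
			continue
		}
		fmt.Printf("%q: %s\n", r.URL, r.Value)
	}
}
```

Errors are kept per URL instead of aborting the batch, so one unreachable page does not discard the results for the others.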

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes following the coding standards
  4. Run the full test suite (make qa)
  5. Commit your changes (git commit -m 'Add amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

Please ensure your code:

  • ✅ Passes all tests (make test)
  • ✅ Follows Go formatting standards (make format)
  • ✅ Passes linting checks (make lint)
  • ✅ Has appropriate test coverage
  • ✅ Includes documentation for public APIs

Roadmap

Current Status

  • ✅ Modern Go modules support
  • ✅ CLI interface with Cobra
  • ✅ Comprehensive test coverage
  • ✅ Standard Go project layout
  • ✅ Build automation with Make

Planned Improvements

  • Enhanced error handling and logging
  • Plugin architecture for custom extractors
  • Performance optimizations
  • Additional output formats (XML, YAML)
  • Docker containerization
  • Advanced image processing
  • Batch processing capabilities

License

Licensed under the Apache License, Version 2.0. See LICENSE for details.

Acknowledgments

  • @Martin Angers for goquery
  • @Fatih Arslan for set
  • Go Team for the amazing language and net/html
  • Original Goose contributors at Gravity.com
  • Community contributors for ongoing improvements