Module: github.com/ByteSizedMarius/soup
Version: 1.2.6
Repository: https://github.com/bytesizedmarius/soup.git
Documentation: pkg.go.dev

# README

soup


Forked from anaskhan96/soup because the original is no longer actively maintained. The original README follows:

Web Scraper in Go, similar to BeautifulSoup

soup is a small web scraper package for Go, with its interface highly similar to that of BeautifulSoup.

Exported variables and functions implemented so far:

var Headers map[string]string // Set headers as a map of key-value pairs, an alternative to calling Header() individually
var Cookies map[string]string // Set cookies as a map of key-value pairs, an alternative to calling Cookie() individually
func Get(string) (string, error) {} // Takes the url as an argument, returns HTML string
func GetWithClient(string, *http.Client) (string, error) {} // Takes the url and a custom HTTP client as arguments, returns HTML string
func Post(string, string, interface{}) (string, error) {} // Takes the url, bodyType, and payload as arguments, returns HTML string
func PostForm(string, url.Values) (string, error) {} // Takes the url and body. bodyType is set to "application/x-www-form-urlencoded"
func Header(string, string) {} // Takes a key-value pair to set as a header for the HTTP request made in Get()
func Cookie(string, string) {} // Takes a key-value pair to set as a cookie to be sent with the HTTP request in Get()
func HTMLParse(string) Root {} // Takes the HTML string as an argument, returns a pointer to the DOM constructed
func Find([]string) Root {} // Element tag (and optional attribute key-value pair) as argument, pointer to the first occurrence returned
func FindAll([]string) []Root {} // Same as Find(), but pointers to all occurrences returned
func FindStrict([]string) Root {} // Element tag (and optional attribute key-value pair) as argument, pointer to the first occurrence with exactly matching values returned
func FindAllStrict([]string) []Root {} // Same as FindStrict(), but pointers to all occurrences returned
func FindNextSibling() Root {} // Pointer to the next sibling of the Element in the DOM returned
func FindNextElementSibling() Root {} // Pointer to the next element sibling of the Element in the DOM returned
func FindPrevSibling() Root {} // Pointer to the previous sibling of the Element in the DOM returned
func FindPrevElementSibling() Root {} // Pointer to the previous element sibling of the Element in the DOM returned
func Children() []Root {} // Find all direct children of this DOM element
func Attrs() map[string]string {} // Map returned with all the attributes of the Element as lookup to their respective values
func Text() string {} // Full text inside a non-nested tag returned; for a nested tag, only the text before the first nested element
func FullText() string {} // Full text inside a nested/non-nested tag returned
func SetDebug(bool) {} // Sets the debug mode to true or false; false by default
func HTML() string {} // Returns the HTML code for the specific element

Root is a struct containing three fields:

  • Pointer containing the pointer to the current html node
  • NodeValue containing the current html node's value, i.e. the tag name for an ElementNode, or the text in the case of a TextNode
  • Error containing an error in a struct if one occurs, otherwise nil. A detailed text explanation of the error can be accessed using the Error() function. A field Type in this struct, of type ErrorType, denotes the kind of error that took place, which will be one of the following:
    • ErrUnableToParse
    • ErrElementNotFound
    • ErrNoNextSibling
    • ErrNoPreviousSibling
    • ErrNoNextElementSibling
    • ErrNoPreviousElementSibling
    • ErrCreatingGetRequest
    • ErrInGetRequest
    • ErrReadingResponse

Installation

Install the package using the command

go get github.com/ByteSizedMarius/soup

Example

Example code is given below to scrape the "Comics I Enjoy" section (text and links) from xkcd.

More Examples

package main

import (
	"fmt"
	"os"

	"github.com/ByteSizedMarius/soup"
)

func main() {
	resp, err := soup.Get("https://xkcd.com")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	doc := soup.HTMLParse(resp)
	links := doc.Find("div", "id", "comicLinks").FindAll("a")
	for _, link := range links {
		fmt.Println(link.Text(), "| Link :", link.Attrs()["href"])
	}
}

Contributions

This package was developed in my free time. However, contributions from everybody in the community are welcome, to make it a better web scraper. If you think there should be a particular feature or function included in the package, feel free to open up a new issue or pull request.

# Functions

Cookie sets a cookie for http requests.
Get returns the HTML returned by the url as a string using the default HTTP client.
GetWithClient returns the HTML returned by the url using a provided HTTP client.
Header sets a new HTTP header.
HTMLParse parses the HTML returning a start pointer to the DOM.
Post returns the HTML returned by the url as a string using the default HTTP client.
PostForm is a convenience method for POST requests that sends data in the form of url.Values using the default HTTP client.
PostWithClient returns the HTML returned by the url using a provided HTTP client. The type of the body must conform to one of the types listed in func getBodyReader().
SetDebug sets the debug status. Setting this to true causes panics to be thrown and logged to the console.

# Constants

ErrCreatingGetRequest will be returned when the get request couldn't be created.
ErrCreatingPostRequest will be returned when the post request couldn't be created.
ErrElementNotFound will be returned when element was not found.
ErrInGetRequest will be returned when there was an error during the get request.
ErrMarshallingPostRequest will be returned when the body of a post request couldn't be serialized.
ErrNoNextElementSibling will be returned when no next element sibling can be found.
ErrNoNextSibling will be returned when no next sibling can be found.
ErrNoPreviousElementSibling will be returned when no previous element sibling can be found.
ErrNoPreviousSibling will be returned when no previous sibling can be found.
ErrReadingResponse will be returned if there was an error reading the response to the get request.
ErrUnableToParse will be returned when the HTML could not be parsed.

# Variables

Cookies contains all HTTP cookies to send.
Headers contains all HTTP headers to send.

# Structs

Error allows easier introspection on the type of error returned.
Root is a structure containing a pointer to an html node, the node value, and an error variable to return an error if one occurred.

# Type aliases

ErrorType defines types of errors that are possible from soup.