github.com/AlpineMarmot/pulse
0.0.0-20190310164430-3db1f35a3352
Repository: https://github.com/alpinemarmot/pulse.git
Documentation: pkg.go.dev

# README

Pulse

Pulse is a crawler built on top of gocolly/colly.

Features:

  • Exposes all gocolly/colly options through a YAML configuration file
  • Lets you define rules that export crawled data to MongoDB

Installation

Go modules must be enabled

$ go build

Usage

$ pulse [-q][--no-logging] [-c configFile] [url entrypoint]

$ pulse -c conf.yml https://www.example.com

Configuration example

See default.yml.

Grab HTML data

The rule below stores the value of the src attribute of every img tag in the MongoDB collection "images". The value of the attribute named by context-attr (here, alt) is stored alongside it as metadata.

collection: "images"
tag: "img"
attr: "src"
context-attr: "alt"
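
To make the mapping concrete, here is a minimal Go sketch of how such a rule could turn one matched element into a record. The Rule and Record types and the Apply method are hypothetical names chosen for illustration; they are not taken from pulse's source, and the document pulse actually writes to MongoDB may have a different shape.

```go
package main

import "fmt"

// Rule mirrors the keys of the YAML rule shown above.
// Hypothetical type: pulse's real configuration struct may differ.
type Rule struct {
	Collection  string // target MongoDB collection
	Tag         string // HTML tag name to match
	Attr        string // attribute whose value is exported
	ContextAttr string // attribute stored as metadata
}

// Record is an assumed shape for the exported data.
type Record struct {
	Collection string
	Value      string
	Context    string
}

// Apply checks one element (tag name plus attributes) against the rule.
// It returns false when the tag differs or the exported attribute is missing.
func (r Rule) Apply(tag string, attrs map[string]string) (Record, bool) {
	if r.Tag != tag {
		return Record{}, false
	}
	value, ok := attrs[r.Attr]
	if !ok || value == "" {
		return Record{}, false
	}
	return Record{
		Collection: r.Collection,
		Value:      value,
		Context:    attrs[r.ContextAttr], // empty string if the attribute is absent
	}, true
}

func main() {
	rule := Rule{Collection: "images", Tag: "img", Attr: "src", ContextAttr: "alt"}
	rec, ok := rule.Apply("img", map[string]string{"src": "/logo.png", "alt": "Company logo"})
	fmt.Printf("%+v matched=%v\n", rec, ok)
}
```

In this sketch an element without the exported attribute is skipped, while a missing context attribute only leaves the metadata empty; that split is an assumption, not documented pulse behavior.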

You can also grab HTML attributes with a selector instead of a tag.

collection: "images-test"
selector: "img[data-src]"
attr: "data-src"
context-attr: "alt"

More information about selectors: PuerkitoBio/goquery