Version: 0.0.0-20250120152349-9b24770cfc3b
Repository: https://github.com/crawlerclub/extractor.git
Documentation: pkg.go.dev
# README
rabbitcrawler
A powerful web scraping tool designed for extracting structured data from websites with configurable rules and multiple execution modes.
Features
- Configurable JSON-based scraping rules
- Multiple extraction modes:
  - Static: Fast HTML parsing without JavaScript execution
  - Browser: Full browser emulation with JavaScript support
- Concurrent scraping with adjustable worker count
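Concurrent scraping with an adjustable worker count is commonly built as a fixed-size worker pool. The sketch below shows that general pattern only; it is not taken from the rabbitcrawler source, and `runWorkers` and `scrapeFunc` are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// scrapeFunc is a hypothetical stand-in for whatever fetches and
// extracts a single URL; it is not part of the real package.
type scrapeFunc func(url string) string

// runWorkers processes urls with a fixed number of goroutines and
// returns the results in the same order as the input.
func runWorkers(urls []string, workers int, scrape scrapeFunc) []string {
	if workers < 1 {
		workers = 1
	}
	results := make([]string, len(urls))
	jobs := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each index is handed to exactly one worker, so writes
			// to results never touch the same element twice.
			for i := range jobs {
				results[i] = scrape(urls[i])
			}
		}()
	}

	for i := range urls {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	urls := []string{
		"https://example.com/a",
		"https://example.com/b",
		"https://example.com/c",
	}
	out := runWorkers(urls, 2, func(u string) string { return "scraped " + u })
	for _, r := range out {
		fmt.Println(r)
	}
}
```

Raising `workers` increases parallelism, at the cost of more simultaneous requests against the target site.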
Installation
go install github.com/crawlerclub/extractor/cmd/rabbitextract@latest
go install github.com/crawlerclub/extractor/cmd/rabbitcrawler@latest
Using rabbitextract
rabbitextract is a command-line tool for extracting data from a single webpage using JSON configuration rules.
Command Line Options
- -config: Path to the config JSON file (required)
- -url: URL to extract data from (optional if provided in config)
- -mode: Extraction mode (optional, defaults to "auto")
  - auto: Automatically choose between static and browser mode
  - static: Fast HTML parsing without JavaScript
  - browser: Full browser emulation with JavaScript support
- -output: Output file path (optional, defaults to stdout)
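For readers wrapping the tool in their own Go program, the options above map naturally onto the standard `flag` package. This is a minimal sketch of how such flags might be declared and validated; it is not rabbitextract's actual source, and the `options` type and `parseOptions` function are hypothetical.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// options mirrors the four documented flags. The struct itself is an
// assumption made for this example.
type options struct {
	Config string
	URL    string
	Mode   string
	Output string
}

// parseOptions parses args and applies the documented rules:
// -config is required and -mode must be auto, static, or browser.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("rabbitextract", flag.ContinueOnError)
	fs.StringVar(&o.Config, "config", "", "path to the config JSON file (required)")
	fs.StringVar(&o.URL, "url", "", "URL to extract data from")
	fs.StringVar(&o.Mode, "mode", "auto", "extraction mode: auto, static, or browser")
	fs.StringVar(&o.Output, "output", "", "output file path (empty means stdout)")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	if o.Config == "" {
		return o, errors.New("-config is required")
	}
	switch o.Mode {
	case "auto", "static", "browser":
		// Known mode, nothing to do.
	default:
		return o, fmt.Errorf("unknown mode %q", o.Mode)
	}
	return o, nil
}

func main() {
	// Fixed arguments keep the example runnable without a real command line.
	opts, err := parseOptions([]string{"-config", "config.json", "-mode", "static"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", opts)
}
```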
Example Usage
- Create a configuration file config.json:
{
  "name": "example-scraper",
  "example_url": "https://example.com/page",
  "schemas": [
    {
      "name": "articles",
      "entity_type": "article",
      "selector": "//div[@class='article']",
      "fields": [
        {
          "name": "title",
          "type": "text",
          "selector": ".//h1"
        },
        {
          "name": "content",
          "type": "text",
          "selector": ".//div[@class='content']"
        }
      ]
    }
  ]
}
- Run the extractor:
rabbitextract -config config.json -url "https://example.com/page" -output result.json
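To check a configuration file before running the extractor, it can be decoded with `encoding/json`. The sketch below mirrors only the keys shown in the example config. The `Config`, `Schema`, and `Field` types are assumptions written for this illustration and may not match the types the package actually exports.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Field, Schema, and Config are hypothetical types that mirror the
// keys of the example config.json; the real package may differ.
type Field struct {
	Name     string `json:"name"`
	Type     string `json:"type"`
	Selector string `json:"selector"`
}

type Schema struct {
	Name       string  `json:"name"`
	EntityType string  `json:"entity_type"`
	Selector   string  `json:"selector"`
	Fields     []Field `json:"fields"`
}

type Config struct {
	Name       string   `json:"name"`
	ExampleURL string   `json:"example_url"`
	Schemas    []Schema `json:"schemas"`
}

// sample is the example configuration from this README.
const sample = `{
  "name": "example-scraper",
  "example_url": "https://example.com/page",
  "schemas": [
    {
      "name": "articles",
      "entity_type": "article",
      "selector": "//div[@class='article']",
      "fields": [
        {"name": "title", "type": "text", "selector": ".//h1"},
        {"name": "content", "type": "text", "selector": ".//div[@class='content']"}
      ]
    }
  ]
}`

// parseConfig decodes a JSON document into a Config.
func parseConfig(data []byte) (Config, error) {
	var c Config
	err := json.Unmarshal(data, &c)
	return c, err
}

func main() {
	cfg, err := parseConfig([]byte(sample))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, s := range cfg.Schemas {
		fmt.Printf("schema %s, selector %s, %d fields\n", s.Name, s.Selector, len(s.Fields))
	}
}
```

Note that the schema selector is an absolute XPath expression, while each field selector starts with `.//` and is therefore relative to the element matched by its schema.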
Supported Field Types
- text: Extract the text content of an element
- attribute: Extract a specific attribute value from an element
- nested: Extract a nested object with multiple fields
- list: Extract an array of items
Special Fields
- _id: Used to generate a unique external_id for items
- _time: Used to set external_time for items
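One plausible reading of these special fields is that they are lifted out of the extracted data and stored as item metadata. The sketch below illustrates that idea under stated assumptions: the `Item` type and `buildItem` function are hypothetical, and the `_id` value is copied as-is, whereas the real package may hash or otherwise transform it when generating external_id.

```go
package main

import "fmt"

// Item is a hypothetical output record, not the package's real type.
type Item struct {
	ExternalID   string
	ExternalTime string
	Data         map[string]string
}

// buildItem moves the special fields _id and _time into metadata and
// keeps every other extracted field in Data.
func buildItem(fields map[string]string) Item {
	item := Item{Data: make(map[string]string)}
	for name, value := range fields {
		switch name {
		case "_id":
			// Assumption: the value is used directly as the external ID.
			item.ExternalID = value
		case "_time":
			item.ExternalTime = value
		default:
			item.Data[name] = value
		}
	}
	return item
}

func main() {
	item := buildItem(map[string]string{
		"_id":   "https://example.com/page#1",
		"_time": "2025-01-20T15:23:49Z",
		"title": "Hello",
	})
	fmt.Println(item.ExternalID, item.ExternalTime, item.Data["title"])
}
```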