Module: github.com/we-are-discussing-rest/web-crawler
Version: v0.0.0-20231129002256-49c6e27903e4
Repository: https://github.com/we-are-discussing-rest/web-crawler.git
Documentation: pkg.go.dev

# README

Web Crawler System Design

Design Plan

The system will involve 7 parts overall:

  • Seed URL (starting URL for crawling)
    • For this project I will only be allowing a single URL input
    • The client will be a long-running web server
  • URL Frontier
    • Data structure that stores URLs for future downloads
    • This will ensure priority/politeness so the crawler does not DDoS a website
    • A queue router puts data into queues; a queue selector selects data from the given queues
    • Managed Redis queues hold the data in FIFO order
      • Each Redis key, keyed on the primary host, will have a queue associated with it
    • Workers will spin up to ingest data from the FIFO queue for each key
  • HTML Downloader (including DNS resolution)
    • Gets IP addresses from the DNS resolver and starts downloading the HTML content (see the sketch after this list)
  • Content Parser
    • Parses the HTML to ensure the raw text is not malformed
  • Content Seen?
    • Data store of MD5 hashes of HTML content. If this data store already has the MD5 hash from the parser, it throws away the data and continues work; if it doesn't have the hash, it stores it.
  • Link Extractor
    • Extracts links from the HTML page
  • URL Filter
    • Gets passed the links and stores the URLs
    • The URLs will then be stored in the URL Frontier and the whole process will continue
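
The downloader and link-extractor steps can be pictured with a small Go sketch. This is only an illustration of the approach, assuming net/http for the download (which handles DNS resolution under the hood) and golang.org/x/net/html for walking the parsed DOM; the helper fetchAndExtract is a hypothetical name, not part of this repository.

package main

import (
    "fmt"
    "net/http"

    "golang.org/x/net/html"
)

// fetchAndExtract downloads a page and returns every href it finds.
// Hypothetical helper; the repository's downloader and parser may differ.
func fetchAndExtract(rawURL string) ([]string, error) {
    resp, err := http.Get(rawURL) // DNS resolution + download
    if err != nil {
        return nil, err // on resolver/connection failure, log the error and restart
    }
    defer resp.Body.Close()

    doc, err := html.Parse(resp.Body) // tolerant parser for possibly malformed HTML
    if err != nil {
        return nil, err
    }

    var links []string
    var walk func(*html.Node)
    walk = func(n *html.Node) {
        if n.Type == html.ElementNode && n.Data == "a" {
            for _, attr := range n.Attr {
                if attr.Key == "href" {
                    links = append(links, attr.Val)
                }
            }
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            walk(c)
        }
    }
    walk(doc)
    return links, nil
}

func main() {
    links, err := fetchAndExtract("https://wikipedia.com")
    if err != nil {
        fmt.Println("download/parse failed:", err)
        return
    }
    for _, l := range links {
        fmt.Println(l)
    }
}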

Diagram

                If either the DNS resolver or the
                parser fails, log the error and restart
                  ┌─────────────────────────┐
                  │                         │
                  │         ┌─────────┐     │
                  │         │DNS      │     │
                  │   ┌─────┤Resolver │     │
                  │   │     └───▲─────┘     │
                  │   │         │           │
                  │   │         │           │
┌─────────┐   ┌───▼───▼─┐   ┌───┴─────┐   ┌─┴───────┐
│         │   │         │   │         │   │         │
│Client   ├───►Frontier ├───►Html     ├───►Html     │
│         │   │         │   │Download │   │Parser   │
└─────────┘   └──▲───▲──┘   └─────────┘   └────┬────┘
                 │   │                         │
                 │   │       ┌───────┐    ┌────▼────┐  ┌─────────┐
                 │   │       ├───────┘    │         │  │         │
                 │   │       │ Data  ◄────┤Content  ├──►Link     │
                 │   │       │ Store │    │seen?    │  │extract  │
                 │   │       └───────┘    └┬────────┘  └────┬────┘
                 │   │                     │                │
                 │   └─────────────────────┘           ┌────▼────┐
                 │     If MD5 hash exists              │         │
                 │     restart to beginning            │URL      │
                 │                                     │Filter   │
                 │           ┌───────┐                 └───┬─────┘
                 │           ├───────┘                     │
                 └───────────┤Redis  ◄─────────────────────┘
                              │MQ     │   URLs are pushed to the
                              └───────┘  Redis MQ for processing

Models

The data models for this will be incredibly simple. The queue data model will take the form of one Redis queue per host.

{ "wikipedia": ["https://wikipedia.com", "https://wikipedia.com/test"] }
{ "go": ["https://pkg.go.com/net/http", "go.com"] }

Using this model will enable grouping crawler workers so that each worker only crawls within its desired host.
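
As a rough sketch of that per-host queue model, the queue router can derive the queue key from the URL's host and push/pop with a Redis client. This assumes the github.com/redis/go-redis/v9 client and a "frontier:" key prefix, both illustrative choices rather than the repository's actual implementation.

package main

import (
    "context"
    "fmt"
    "net/url"

    "github.com/redis/go-redis/v9"
)

// hostKey derives the per-host queue key for a URL, e.g. "frontier:wikipedia.com".
// The "frontier:" prefix is an illustrative naming choice.
func hostKey(raw string) (string, error) {
    u, err := url.Parse(raw)
    if err != nil {
        return "", err
    }
    return "frontier:" + u.Hostname(), nil
}

func main() {
    ctx := context.Background()
    rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

    // Queue router: push each URL onto the queue for its host (FIFO via RPUSH/LPOP).
    for _, link := range []string{"https://wikipedia.com", "https://wikipedia.com/test"} {
        key, err := hostKey(link)
        if err != nil {
            continue // skip unparsable URLs
        }
        if err := rdb.RPush(ctx, key, link).Err(); err != nil {
            panic(err)
        }
    }

    // Queue selector / worker: pop the oldest URL for one host.
    next, err := rdb.LPop(ctx, "frontier:wikipedia.com").Result()
    if err != nil {
        panic(err)
    }
    fmt.Println("next URL for wikipedia.com:", next)
}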

The seen-content data store will simply be a SQLite DB containing the MD5 hashes of all seen sites.

interface SeenContentModel {
  id   int     // primary key, unique
  hash string  // MD5 hash of the page content
}
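
A minimal sketch of that store, assuming database/sql with the mattn/go-sqlite3 driver and a table named seen_content (names are illustrative): compute the MD5 of the parsed content, skip the page if the hash already exists, otherwise record it.

package main

import (
    "crypto/md5"
    "database/sql"
    "encoding/hex"
    "fmt"

    _ "github.com/mattn/go-sqlite3" // SQLite driver; an illustrative choice
)

// seen reports whether this content has been crawled before and records its
// hash if it has not. Table and column names here are illustrative.
func seen(db *sql.DB, content []byte) (bool, error) {
    sum := md5.Sum(content)
    hash := hex.EncodeToString(sum[:])

    var count int
    if err := db.QueryRow("SELECT COUNT(1) FROM seen_content WHERE hash = ?", hash).Scan(&count); err != nil {
        return false, err
    }
    if count > 0 {
        return true, nil // duplicate content: throw it away and continue work
    }
    _, err := db.Exec("INSERT INTO seen_content (hash) VALUES (?)", hash)
    return false, err
}

func main() {
    db, err := sql.Open("sqlite3", "seen.db")
    if err != nil {
        panic(err)
    }
    defer db.Close()

    if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS seen_content (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,
        hash TEXT UNIQUE
    )`); err != nil {
        panic(err)
    }

    dup, err := seen(db, []byte("<html>example</html>"))
    if err != nil {
        panic(err)
    }
    fmt.Println("already seen:", dup)
}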

Frontier

The queue router and queue selector will be contained within one module. The queue router will receive links from crawlers and enqueue those links into the Redis queue for their host. The queue selector will work on a pub/sub mechanism.

Each queue has its own crawler. The crawler depth will be set by an environment variable when the program runs. When a new queue is added, a subscriber in the queue selector module will spin up a new worker/crawler.
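
A sketch of that selector behaviour, assuming go-redis pub/sub, a channel named frontier:new-hosts, and a CRAWLER_DEPTH environment variable (all illustrative names): each message announcing a new host queue spawns a goroutine that pops URLs from that host's FIFO queue up to the configured depth.

package main

import (
    "context"
    "fmt"
    "os"
    "strconv"

    "github.com/redis/go-redis/v9"
)

// crawl is a stand-in for the worker loop that pops URLs from one host's queue.
func crawl(ctx context.Context, rdb *redis.Client, host string, depth int) {
    for i := 0; i < depth; i++ {
        u, err := rdb.LPop(ctx, "frontier:"+host).Result()
        if err != nil {
            return // queue empty or Redis error: stop this worker
        }
        fmt.Printf("[%s] crawling %s\n", host, u)
        // download, parse, dedupe, extract, filter, re-enqueue ...
    }
}

func main() {
    ctx := context.Background()
    rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

    // Crawler depth is read from an env var when the program starts.
    depth, err := strconv.Atoi(os.Getenv("CRAWLER_DEPTH"))
    if err != nil {
        depth = 3 // illustrative default
    }

    // Queue selector: subscribe to announcements of new per-host queues and
    // spin up one worker per host. The channel name is an assumption.
    sub := rdb.Subscribe(ctx, "frontier:new-hosts")
    for msg := range sub.Channel() {
        go crawl(ctx, rdb, msg.Payload, depth)
    }
}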

# Packages

No descriptions provided by the author.