# Structs
ContentDeduplicationJoint checks the hash of the page body; if the same hash already exists, it breaks the pipeline.
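The hash check described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the `bodyDeduper` type, its in-memory map, and the choice of SHA-256 are all assumptions made for the example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// bodyDeduper is a hypothetical helper that remembers body hashes in memory.
// A real pipeline would more likely keep them in a persistent store.
type bodyDeduper struct {
	seen map[string]struct{}
}

func newBodyDeduper() *bodyDeduper {
	return &bodyDeduper{seen: make(map[string]struct{})}
}

// isDuplicate hashes the page body (SHA-256 is an assumption) and reports
// whether that hash was recorded before. A new hash is recorded as a side effect.
func (d *bodyDeduper) isDuplicate(body []byte) bool {
	sum := sha256.Sum256(body)
	key := hex.EncodeToString(sum[:])
	if _, ok := d.seen[key]; ok {
		return true
	}
	d.seen[key] = struct{}{}
	return false
}

func main() {
	d := newBodyDeduper()
	pages := []string{"<html>a</html>", "<html>b</html>", "<html>a</html>"}
	for i, p := range pages {
		if d.isDuplicate([]byte(p)) {
			// This is the point where a joint would break the pipeline.
			fmt.Printf("page %d: duplicate body, stop\n", i)
			continue
		}
		fmt.Printf("page %d: new body, continue\n", i)
	}
}
```

Hashing the body rather than comparing it byte for byte keeps the stored state small, at the cost of treating pages that differ only trivially (for example by a timestamp) as distinct.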
EmptyJoint is a placeholder.
FilterCheckJoint checks whether the task URL is already in the filter; if it is not, it adds the URL to the task filter so that the URL is not added again.
IndexJoint sends the snapshot and task info to the index.
LanguageDetectJoint detects the language of the web page.
TaskDeduplicationJoint checks whether the task already exists in the database.
UrlFilterJoint validates URLs, including the host, path, file, and file extension.
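As a rough sketch of this kind of validation, the example below checks a URL's host, path prefix, and file extension with the standard `net/url` and `path` packages. The `urlRules` type and its fields are hypothetical and are not taken from the package; the actual joint may apply different or configurable rules.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// urlRules is a hypothetical rule set used only for this illustration.
type urlRules struct {
	allowedHosts        map[string]bool
	blockedPathPrefixes []string
	blockedExtensions   map[string]bool
}

// allow reports whether raw passes the host, path, and extension checks.
func (r urlRules) allow(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil || u.Host == "" {
		return false // unparsable, or not an absolute URL
	}
	if !r.allowedHosts[strings.ToLower(u.Hostname())] {
		return false
	}
	for _, p := range r.blockedPathPrefixes {
		if strings.HasPrefix(u.Path, p) {
			return false
		}
	}
	// path.Ext returns the extension of the final path element, e.g. ".png".
	ext := strings.ToLower(path.Ext(u.Path))
	return !r.blockedExtensions[ext]
}

func main() {
	rules := urlRules{
		allowedHosts:        map[string]bool{"example.com": true},
		blockedPathPrefixes: []string{"/admin/"},
		blockedExtensions:   map[string]bool{".png": true, ".zip": true},
	}
	for _, raw := range []string{
		"https://example.com/docs/index.html",
		"https://example.com/admin/users",
		"https://example.com/img/logo.PNG",
		"https://other.org/",
	} {
		fmt.Printf("%-40s allowed=%v\n", raw, rules.allow(raw))
	}
}
```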
UrlNormalizationJoint cleans up and normalizes URLs.
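The source does not say which normalization steps the joint applies. The sketch below shows a common set, chosen here as an assumption: lowercase the scheme and host, drop default ports and fragments, clean the path, and sort query parameters. The `normalize` function is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"path"
	"strings"
)

// normalize returns a canonical form of an absolute URL.
// The steps chosen here are assumptions for illustration.
func normalize(raw string) (string, error) {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil {
		return "", err
	}
	if u.Scheme == "" || u.Host == "" {
		return "", errors.New("normalize: absolute URL required")
	}
	u.Scheme = strings.ToLower(u.Scheme)
	u.Host = strings.ToLower(u.Host)
	// Drop the default port for the scheme.
	switch u.Scheme {
	case "http":
		u.Host = strings.TrimSuffix(u.Host, ":80")
	case "https":
		u.Host = strings.TrimSuffix(u.Host, ":443")
	}
	// Resolve "." and ".." segments, keeping a trailing slash if present.
	if u.Path == "" {
		u.Path = "/"
	} else {
		cleaned := path.Clean(u.Path)
		if strings.HasSuffix(u.Path, "/") && cleaned != "/" {
			cleaned += "/"
		}
		u.Path = cleaned
	}
	u.RawPath = "" // let String() re-encode the cleaned path
	// Fragments are never sent to the server, so they are dropped.
	u.Fragment = ""
	u.RawFragment = ""
	// Values.Encode sorts parameters by key.
	u.RawQuery = u.Query().Encode()
	return u.String(), nil
}

func main() {
	out, err := normalize("HTTP://Example.COM:80/a/./b/../c/?z=1&a=2#frag")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out) // http://example.com/a/c/?a=2&z=1
}
```

Normalizing before deduplication means that URLs differing only in case, default port, fragment, or parameter order map to the same key.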