Categorygithub.com/rotationalio/baleen
repositorypackage
0.2.3
Repository: https://github.com/rotationalio/baleen.git
Documentation: pkg.go.dev

# Packages

No description provided by the author
No description provided by the author
No description provided by the author
No description provided by the author
No description provided by the author
No description provided by the author
No description provided by the author
No description provided by the author

# README

Baleen

An automated ingestion service of RSS feeds to construct a corpus for NLP research.

GoDoc Go Report Card

Current overview:

  • Golang ingestion system that fetches RSS feeds and stores raw data into S3 for archive and analytics.
  • Web-based RSS feed management system that will allow us to easily manage sources
  • Focus on fetching full text by following links in the RSS feed
  • Feed data quality measurements with language statistics, e.g. words, vocab, etc. rate of corpus growth, number of entities, etc. (we should look at prose for this)
  • JSON based logging with limited retention so we don’t fill up our server with logs - tracking of aggregate metrics over time so we know what’s going on and if it's working.
  • Produce model based translations for sentences and paragraphs from the source language to target languages; crowdsource feedback by creating an app that allows bilingual users to say if a translation is good or not to establish annotations.
  • Annotation quality assessment tools and gamification.
  • Estimated cost with 3 yr reserved instance - $64.04 per month (mostly EBS).

Notes