package
3.0.1
Repository: https://github.com/loxiouve/unipdf.git
Documentation: pkg.go.dev

# README

TEXT EXTRACTION CODE

There are two directions:

  • reading
  • depth

In English text,

  • the reading direction is left to right, increasing X in the PDF coordinate system.
  • the depth direction is top to bottom, decreasing Y in the PDF coordinate system.
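
As a minimal sketch of the two directions for English text, the comparisons below order positions by reading direction and by depth. The point type and both function names are hypothetical and are not taken from the package.

```go
package main

import "fmt"

// point is a hypothetical position in PDF coordinates, where the
// origin is at the bottom left of the page and Y grows upwards.
type point struct{ X, Y float64 }

// readingBefore reports whether a comes before b in the reading
// direction for English text: smaller X comes first.
func readingBefore(a, b point) bool { return a.X < b.X }

// depthBefore reports whether a comes before b in the depth
// direction for English text: larger Y (higher on the page) comes first.
func depthBefore(a, b point) bool { return a.Y > b.Y }

func main() {
	heading := point{X: 72, Y: 720}
	footnote := point{X: 300, Y: 60}
	fmt.Println(readingBefore(heading, footnote), depthBefore(heading, footnote))
}
```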

HOW TEXT IS EXTRACTED

makeTextPage() in text_page.go is the top-level text extraction function. It returns an ordered list of textParas, which are described below.

  • A page's textMarks are obtained from its content stream. They are in the order they occur in the content stream.
  • The textMarks are grouped into word fragments called textWords by scanning through the textMarks and splitting on space characters and the gaps between marks.
  • The textWords are grouped into rectangular regions based on their bounding boxes' proximities to other textWords. These rectangular regions are called textParas. (In the current implementation there is an intermediate step where the textWords are divided into containers called wordBags.)
  • The textWords in each textPara are arranged into textLines (textWords of similar depth).
  • Within each textLine, textWords are sorted in reading order and each one that starts a whole word is marked by setting its newWord flag to true. (See textLine.text().)
  • All the textParas on a page are checked to see if they are arranged as cells within a table and, if they are, they are combined into textTables and a textPara containing the textTable replaces the textParas containing the cells.
  • The textParas, some of which may be tables, are sorted into reading order (the order in which they are read, not in the reading direction).

The complete ordering of the text extracted from a page is expressed in paraList.writeText().

  • This function iterates through the textParas, which are sorted in reading order.
  • For each textPara with a table, it iterates through the table cell textParas. (See textPara.writeCellText().)
  • For each (top level or table cell) textPara, it iterates through the textLines.
  • For each textLine, it iterates through the textWords inserting a space before each one that has the newWord flag set.
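
The iteration above can be sketched as follows. This is a simplified illustration, not the package's code: word, line and para are hypothetical stand-ins for textWord, textLine and textPara, a table is modelled as a flat list of cell paras, and each line is assumed to end with a newline.

```go
package main

import (
	"fmt"
	"strings"
)

// word, line and para are simplified, hypothetical stand-ins for
// textWord, textLine and textPara.
type word struct {
	text    string
	newWord bool // true if this fragment starts a whole word
}

type line []word

type para struct {
	lines []line
	cells []para // non-empty when the para holds a table (assumption of this sketch)
}

// writeLines writes each line, inserting a space before every word
// that has newWord set. The first word of a line is assumed to be
// left unflagged, so no leading space is written.
func writeLines(sb *strings.Builder, lines []line) {
	for _, ln := range lines {
		for _, w := range ln {
			if w.newWord {
				sb.WriteByte(' ')
			}
			sb.WriteString(w.text)
		}
		sb.WriteByte('\n')
	}
}

// writeText walks the paras in their given (reading) order and
// descends into table cells where a para contains a table.
func writeText(paras []para) string {
	var sb strings.Builder
	for _, p := range paras {
		if len(p.cells) > 0 {
			for _, c := range p.cells {
				writeLines(&sb, c.lines)
			}
			continue
		}
		writeLines(&sb, p.lines)
	}
	return sb.String()
}

func main() {
	paras := []para{
		{lines: []line{{{text: "Hel"}, {text: "lo"}, {text: "world", newWord: true}}}},
		{cells: []para{
			{lines: []line{{{text: "A"}}}},
			{lines: []line{{{text: "B"}}}},
		}},
	}
	fmt.Print(writeText(paras))
}
```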

textWord creation

  • makeTextWords() combines textMarks into textWords (word fragments).
  • textWords are the atoms of the text extraction code.
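
A rough sketch of this grouping, assuming each mark carries its text and its left and right X coordinates: marks are joined into a fragment until a space mark or a horizontal gap wider than a threshold is seen. The mark type, the makeWords function and the maxGap parameter are hypothetical; the real makeTextWords() may use different criteria.

```go
package main

import (
	"fmt"
	"strings"
)

// mark is a hypothetical stand-in for textMark: a run of text with
// the X coordinates of its left and right edges.
type mark struct {
	text   string
	x0, x1 float64
}

// makeWords joins consecutive marks into word fragments, starting a
// new fragment at each space mark and wherever the gap between a
// mark and the previous one exceeds maxGap.
func makeWords(marks []mark, maxGap float64) []string {
	var words []string
	var cur strings.Builder
	prevEnd := 0.0
	flush := func() {
		if cur.Len() > 0 {
			words = append(words, cur.String())
			cur.Reset()
		}
	}
	for _, m := range marks {
		if strings.TrimSpace(m.text) == "" {
			flush() // a space mark ends the current fragment
			continue
		}
		if cur.Len() > 0 && m.x0-prevEnd > maxGap {
			flush() // a wide gap also ends the current fragment
		}
		cur.WriteString(m.text)
		prevEnd = m.x1
	}
	flush()
	return words
}

func main() {
	marks := []mark{
		{"H", 0, 5}, {"i", 5, 7}, {" ", 7, 9},
		{"t", 9, 12}, {"o", 12, 16}, {"x", 30, 34},
	}
	fmt.Println(makeWords(marks, 3))
}
```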

textPara creation

  • dividePage() combines textWords that are close to each other into groups in rectangular regions called wordBags.
  • wordBag.arrangeText() arranges the textWords in each rectangular region into textLines, which are groups of textWords of about the same depth, sorted left to right.
  • textLine.markWordBoundaries() marks the textWords in each textLine that start whole words.
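
The last two steps can be sketched as below, under stated assumptions: words are placed on the same line when their baselines differ by at most a tolerance, and a word starts a whole word when the gap to its predecessor exceeds a threshold. The word type, arrangeLines, markBoundaries, tol and gap are all hypothetical names; the package's own criteria may differ.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// word is a hypothetical stand-in for textWord.
type word struct {
	text    string
	x0, x1  float64 // left and right edges
	y       float64 // baseline in PDF coordinates
	newWord bool
}

// arrangeLines groups words whose baselines are within tol of the
// first word of a line, orders lines top to bottom (decreasing Y)
// and orders the words in each line left to right (increasing X).
func arrangeLines(words []word, tol float64) [][]word {
	sorted := append([]word(nil), words...)
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].y > sorted[j].y })
	var lines [][]word
	for _, w := range sorted {
		n := len(lines)
		if n > 0 && math.Abs(lines[n-1][0].y-w.y) <= tol {
			lines[n-1] = append(lines[n-1], w)
		} else {
			lines = append(lines, []word{w})
		}
	}
	for _, ln := range lines {
		sort.Slice(ln, func(i, j int) bool { return ln[i].x0 < ln[j].x0 })
	}
	return lines
}

// markBoundaries flags every word after the first that is separated
// from its predecessor by more than gap.
func markBoundaries(ln []word, gap float64) {
	for i := 1; i < len(ln); i++ {
		ln[i].newWord = ln[i].x0-ln[i-1].x1 > gap
	}
}

func main() {
	words := []word{
		{text: "world", x0: 40, x1: 70, y: 700},
		{text: "Hel", x0: 10, x1: 25, y: 700.5},
		{text: "lo", x0: 25, x1: 35, y: 700},
		{text: "Next", x0: 10, x1: 40, y: 680},
	}
	for _, ln := range arrangeLines(words, 2) {
		markBoundaries(ln, 3)
		for _, w := range ln {
			fmt.Printf("%q newWord=%v\n", w.text, w.newWord)
		}
	}
}
```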

TODO

  • Handle diagonal text.
  • Get right-to-left text extraction working.
  • Get top-to-bottom text extraction working.