Extract human-language text, as plain UTF-8, from computer code and markup
The output is (or should be) line-preserving: no lines are added or removed.
<p> foo </p>
becomes
foo
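
As a rough illustration of what line-preserving means, here is a minimal Python sketch. It is not this tool's implementation: the function name `strip_tags_preserving_lines` and the regex are hypothetical, it only handles simple angle-bracket tags, and it assumes surrounding whitespace is trimmed as in the example above. A tag that spans several lines is replaced by the same number of newlines, so line N of the output still corresponds to line N of the input.

```python
import re

# Hypothetical, deliberately naive tag pattern: anything between < and >.
# Real markup (comments, CDATA, scripts) needs a proper parser.
TAG = re.compile(r"<[^>]*>")


def strip_tags_preserving_lines(text: str) -> str:
    """Remove tags but keep every newline, so the line count is unchanged."""

    def keep_newlines(match: re.Match) -> str:
        # A tag may span several lines; keep its newlines, drop the rest.
        return "\n" * match.group(0).count("\n")

    without_tags = TAG.sub(keep_newlines, text)
    # Assumption: trim whitespace on each line, as in "<p> foo </p>" -> "foo".
    # Splitting on "\n" and re-joining never changes the number of lines.
    return "\n".join(line.strip() for line in without_tags.split("\n"))


if __name__ == "__main__":
    sample = "<p> foo </p>\n<div\nclass='x'>bar</div>\n"
    result = strip_tags_preserving_lines(sample)
    # The output has exactly as many newlines as the input.
    assert result.count("\n") == sample.count("\n")
    print(result)
```

Replacing a removed span with its own newlines, rather than deleting it outright, is what keeps line numbers in the output aligned with the source.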