html2text(1)                                         html2text(1)



NAME
       html2text - an advanced HTML-to-text converter

SYNOPSIS
       html2text -help
       html2text -version
       html2text  [  -unparse  |  -check  ]  [ -debug-scanner ] [
       -debug-parser ] [ -rcfile path ]  [  -style  (  compact  |
       pretty ) ] [ -width width ] [ -o output-file ] [ -nobs ] [
       input-uri ...  ]

DESCRIPTION
       html2text reads HTML 3.2 documents  from  the  input-uris,
       formats  each  into a stream of plain text characters (ISO
       8859-1) and writes the result to standard output (or  into
       output-file, if the -o command line option is used).

       Documents that are specified by a URI that begins with
       "http:" (RFC 1738) are retrieved with the Hypertext
       Transfer Protocol (RFC 1945). URIs that begin with "file:"
       and URIs that do not contain a colon specify local files.
       All other URIs are invalid.

       If  no  input-uris  are  specified  on  the  command line,
       html2text reads from standard input. A dash as the  input-
       uri is an alternate way to specify standard input.
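
       The input-uri rules above can be summarized in a short Python
       sketch. It is only an illustration of the rules as stated on
       this page, not html2text's actual code; the function name
       classify_input_uri and the returned labels are hypothetical.

```python
def classify_input_uri(uri: str) -> str:
    """Classify an input-uri according to the rules described above.

    Hypothetical helper for illustration only; the labels 'stdin',
    'http' and 'file' are not part of html2text itself.
    """
    if uri == "-":
        # A dash is an alternate way to specify standard input.
        return "stdin"
    if uri.startswith("http:"):
        # Retrieved with the Hypertext Transfer Protocol.
        return "http"
    if uri.startswith("file:") or ":" not in uri:
        # "file:" URIs and URIs without a colon specify local files.
        return "file"
    # All other URIs are invalid.
    raise ValueError(f"invalid input URI: {uri!r}")
```

       Under these assumptions, "index.html" and "file:/tmp/a.html"
       are both treated as local files, while an "ftp:" URI is
       rejected.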

       html2text  understands  all  HTML  3.2 constructs, but can
       render only part of them due to  the  limitations  of  the
       text  output format. However, the program attempts to pro-
       vide good substitutes for the elements it  cannot  render.
       It also accepts syntactically incorrect input and attempts
       to interpret it "reasonably".

       The way in which html2text formats HTML documents is
       controlled by formatting properties read from an RC file.
       html2text attempts to read $HOME/.html2textrc (or the file
       specified by the -rcfile command line option); if that
       file cannot be read, html2text attempts to read
       /etc/html2textrc. If no RC file can be read (or if the RC
       file does not override all formatting properties), then
       "reasonable" defaults are assumed. The RC file format is
       described in the html2textrc(5) manual page.
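
       The RC file lookup order described above might be sketched
       as follows. This is a minimal illustration of the stated
       fallback order, assuming "can be read" means a readable
       regular file; the function pick_rc_file and its parameters
       are hypothetical and do not mirror html2text's source.

```python
import os
from typing import Optional


def pick_rc_file(rcfile_option: Optional[str] = None,
                 home: Optional[str] = None,
                 system_rc: str = "/etc/html2textrc") -> Optional[str]:
    """Return the first readable RC file, or None to use defaults.

    Hypothetical helper: rcfile_option stands for the -rcfile path,
    home for $HOME, and system_rc for the system-wide file.
    """
    if rcfile_option is not None:
        # -rcfile replaces the personal $HOME/.html2textrc candidate.
        personal = rcfile_option
    else:
        if home is None:
            home = os.environ.get("HOME", "")
        personal = os.path.join(home, ".html2textrc")

    # Try the personal file first, then fall back to the system file.
    for candidate in (personal, system_rc):
        if os.path.isfile(candidate) and os.access(candidate, os.R_OK):
            return candidate

    # No RC file could be read: built-in defaults apply.
    return None
```

       A return value of None corresponds to the case where the
       "reasonable" defaults are assumed for every property.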

OPTIONS
       -help  Print command line summary and exit.

       -version
              Print program version and exit.

       -unparse
              This option is for diagnostic purposes: instead of
              formatting the parsed document, generate HTML code
              that is guaranteed to be syntactically correct. If
              html2text has problems parsing a syntactically
              incorrect HTML document, this option may help you
              understand what html2text thinks the original HTML
              code means.

       -check This option is for diagnostic purposes: the HTML
              document is only parsed and not processed further.
              In this mode of operation, html2text reports parse
              errors and scan errors, which it does not do in
              other modes. Note that parse and scan errors are
              not fatal for html2text, but may cause
              misinterpretation of the HTML code and/or portions
              of the document being swallowed.

       -debug-scanner
              While scanning the HTML document, html2text reports
              on each lexical token scanned. This option  is  for
              diagnostic purposes.

       -debug-parser
              While parsing the HTML document, html2text reports
              on the tokens being shifted, rules being applied,
              etc. This option is for diagnostic purposes.

       -rcfile path
              Attempt to read the file specified in path as the
              RC file.

       -style ( compact | pretty )
              Style pretty changes some of the default values of
              the formatting parameters documented in
              html2textrc(5). To find out which defaults are
              changed and how, check the file "pretty.style". If
              this option is omitted, style compact is assumed
              as the default.

       -width width
              By  default,  html2text  formats the HTML documents
              for a screen width of 79 characters. If redirecting
              the  output  into a file, or if your terminal has a
              width other than 80 characters, or if you just want
              to  get  an  idea  how  html2text  deals with large
              tables and different terminal widths, you may  want
              to specify a different width.

       -o output-file
              Write the output to output-file instead of standard
              output. A dash as the output-file is  an  alternate
              way to specify the standard output.

       -nobs  By default, html2text renders underlined letters
              with sequences like "underscore-backspace-character"
              and boldface letters like
              "character-backspace-character", which works fine
              when the output is piped into more(1), less(1), or
              similar. For other applications, or when
              redirecting the output into a file, it may be
              desirable not to render character attributes with
              such backspace sequences; this command line option
              disables them.
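
              The backspace sequences described above can be
              illustrated with a small Python sketch. It shows how
              such overstrike sequences are built and how a
              consumer could strip them afterwards; the function
              names are hypothetical and the code is not taken
              from html2text.

```python
import re

BACKSPACE = "\x08"

# Any single character followed by a backspace is an overstrike prefix.
_OVERSTRIKE = re.compile(r".\x08")


def underline(text: str) -> str:
    """Encode text as underscore-backspace-character sequences."""
    return "".join("_" + BACKSPACE + ch for ch in text)


def bold(text: str) -> str:
    """Encode text as character-backspace-character sequences."""
    return "".join(ch + BACKSPACE + ch for ch in text)


def strip_overstrike(text: str) -> str:
    """Remove overstrike prefixes, leaving only the visible characters."""
    return _OVERSTRIKE.sub("", text)


if __name__ == "__main__":
    sample = bold("NAME") + " " + underline("html2text")
    print(repr(sample))
    print(strip_overstrike(sample))
```

              Stripping the sequences after the fact yields plain
              text comparable to what -nobs is described as
              producing, which may be useful when the output has
              already been captured without that option.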

FILES
       /etc/html2textrc
              System-wide parser configuration file.

       $HOME/.html2textrc
              Personal parser configuration file; overrides the
              system-wide values.

CONFORMING TO
       HTML    3.2    (HTML   3.2   Reference   Specification   -
       http://www.w3.org/TR/REC-html32),
       RFC 1945 (Hypertext Transfer Protocol - HTTP).

NOTES
       html2text makes a considerable effort to parse
       syntactically incorrect input, but is not always as
       successful as other HTML processors. If you are able to
       correct the HTML source code, you may want to use the
       -unparse or -check options to find out exactly what
       html2text's problem is.

RESTRICTIONS
       html2text provides only a basic implementation of the
       Hypertext Transfer Protocol (HTTP). It requires the
       complete and exactly matching URI to be given as argument
       and will not follow redirections (HTTP 301/307).

AUTHOR
       html2text was written up to version 1.2.2 by Arno Unkrig
       <arno@unkrig.de> for GMRS Software GmbH, Unterschleißheim.

       Current maintainer and primary download location:
       Martin Bayer <mbayer@zedat.fu-berlin.de>
       http://userpage.fu-berlin.de/~mbayer/tools/html2text.html

SEE ALSO
       html2textrc(5), less(1), more(1)



                            2001-10-05               html2text(1)