Version 2, showing authors affecting page license. Rev 1, author: perry.

html2text
!!!html2text
NAME
SYNOPSIS
DESCRIPTION
OPTIONS
FILES
CONFORMING TO
NOTES
RESTRICTIONS
AUTHOR
SEE ALSO
----
!!NAME

html2text - an advanced HTML-to-text converter
!!SYNOPSIS

__html2text -help
html2text -version
html2text__ [[ __-unparse__ | __-check__ ] [[
__-debug-scanner__ ] [[ __-debug-parser__ ] [[
__-rcfile__ ''path'' ] [[ __-style__ (
__compact__ | __pretty__ ) ] [[ __-width__
''width'' ] [[ __-o__ ''output-file'' ] [[
__-nobs__ ] [[ ''input-uri'' ... ]
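
When html2text is driven from a script, the synopsis above can be turned into an argument list programmatically. The following is a minimal Python sketch, not part of html2text itself: the function name `build_html2text_argv` and its keyword names are hypothetical, and only flags listed in the synopsis are emitted.

```python
from typing import List, Optional, Sequence


def build_html2text_argv(
    inputs: Sequence[str] = (),
    *,
    style: Optional[str] = None,
    width: Optional[int] = None,
    output: Optional[str] = None,
    rcfile: Optional[str] = None,
    nobs: bool = False,
    program: str = "html2text",
) -> List[str]:
    """Assemble an argument vector from the flags named in the synopsis.

    Hypothetical helper: html2text does not ship this function.
    """
    if style is not None and style not in ("compact", "pretty"):
        raise ValueError("style must be 'compact' or 'pretty'")
    argv = [program]
    if rcfile is not None:
        argv += ["-rcfile", rcfile]
    if style is not None:
        argv += ["-style", style]
    if width is not None:
        argv += ["-width", str(width)]
    if output is not None:
        argv += ["-o", output]
    if nobs:
        argv.append("-nobs")
    # With no input URIs, html2text reads standard input.
    argv.extend(inputs)
    return argv


argv = build_html2text_argv(["page.html"], style="pretty", width=100, nobs=True)
print(" ".join(argv))
```

The resulting list can be passed to `subprocess.run` on a system where html2text is installed; the file name `page.html` is only a placeholder.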
!!DESCRIPTION

__html2text__ reads HTML 3.2 documents from the
''input-uri''s, formats each into a stream of plain text
characters (__ISO 8859-1__) and writes the result to
standard output (or into ''output-file'', if the
__-o__ command line option is used).

Documents that are specified by a URI that begins with
... (__RFC 1738__) are retrieved with the
Hypertext Transfer Protocol (__RFC 1945__). URIs that
begin with ...

If no ''input-uri''s are specified on the command line,
__html2text__ reads from standard input. A dash as the
''input-uri'' is an alternate way to specify standard
input.

__html2text__ understands all HTML 3.2 constructs, but
can render only part of them due to the limitations of the
text output format. However, the program attempts to provide
good substitutes for the elements it cannot render. It also
accepts syntactically incorrect input and attempts to
interpret it ...

The way in which __html2text__ formats the HTML documents
is controlled by formatting properties read from an RC file.
__html2text__ attempts to read ''$HOME/.html2textrc''
(or the file specified by the __-rcfile__ command line
option); if that file cannot be read, __html2text__
attempts to read ''/etc/html2textrc''. If no RC file can
be read (or if the RC file does not override all formatting
properties), then ...
html2textrc(5) manual page.
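
The RC file lookup order described above can be sketched in Python as follows. This is an illustration of the documented fallback only, not html2text's own code: the function name `pick_rcfile` and the injectable `readable` check are hypothetical, and the sketch assumes Unix-style paths.

```python
import os
from typing import Callable, Optional


def _is_readable_file(path: str) -> bool:
    """Default check: an existing regular file the current user may read."""
    return os.path.isfile(path) and os.access(path, os.R_OK)


def pick_rcfile(
    rcfile_option: Optional[str] = None,
    home: Optional[str] = None,
    readable: Callable[[str], bool] = _is_readable_file,
) -> Optional[str]:
    """Return the RC file that would be read, or None if neither is readable.

    First candidate: the -rcfile path if given, else $HOME/.html2textrc.
    Fallback: /etc/html2textrc.
    """
    if rcfile_option is not None:
        first = rcfile_option
    else:
        if home is None:
            home = os.environ.get("HOME", "")
        first = home.rstrip("/") + "/.html2textrc"
    for candidate in (first, "/etc/html2textrc"):
        if readable(candidate):
            return candidate
    return None


print(pick_rcfile())
```

The `readable` parameter exists only so the lookup can be exercised without touching the real file system.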
!!OPTIONS

__-help__

Print command line summary and exit.

__-version__

Print program version and exit.

__-unparse__

This option is for diagnostic purposes: instead of
formatting the parsed document, generate HTML code that is
guaranteed to be syntactically correct. If __html2text__
has problems parsing a syntactically incorrect HTML
document, this option may help you understand what
__html2text__ thinks the original HTML code
means.

__-check__

This option is for diagnostic purposes: the HTML document is
only parsed and not processed otherwise. In this mode of
operation, __html2text__ reports parse errors and
scan errors, which it does not do in other modes of operation.
Note that parse and scan errors are not fatal for
__html2text__, but may cause misinterpretation of the
HTML code and/or portions of the document being
swallowed.

__-debug-scanner__

While scanning the HTML document, __html2text__ reports
on each lexical token scanned. This option is for diagnostic
purposes.

__-debug-parser__

While parsing the HTML document, __html2text__ reports
on the tokens being shifted, rules being applied, etc. This
option is for diagnostic purposes.

__-rcfile__ ''path''

Attempt to read the file specified in ''path'' as the RC
file.

__-style__ ( __compact__ | __pretty__ )

Style __pretty__ changes some of the default values of
the formatting parameters documented in
html2textrc(5). To find out which formatting parameter
defaults are changed, and how, check the file
... __compact__ is assumed as default.

__-width__ ''width''

By default, __html2text__ formats the HTML documents for
a screen width of 79 characters. If redirecting the output
into a file, or if your terminal has a width other than 80
characters, or if you just want to get an idea of how
__html2text__ deals with large tables and different
terminal widths, you may want to specify a different
''width''.

__-o__ ''output-file''

Write the output to ''output-file'' instead of standard
output. A dash as the ''output-file'' is an alternate way
to specify standard output.

__-nobs__

By default, __html2text__ renders underlined letters with
sequences like ...
more(1),
less(1), or similar. For other applications, or when
redirecting the output into a file, it may be desirable not
to render character attributes with such backspace
sequences; this command line option turns them off.
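
For output that was already produced without __-nobs__, the backspace sequences can be removed afterwards. The sketch below assumes the common overstrike convention understood by pagers such as less(1): underscore, backspace, letter for underlining and letter, backspace, same letter for boldface. That convention is an assumption here, since the exact sequences html2text emits are not spelled out above, and the helper name `strip_overstrike` is hypothetical.

```python
import re

# Any character followed by a backspace (0x08) is overwritten by what comes
# next, so dropping the pair keeps only the visible character.
# Assumed convention: "_\x08c" underlines c, "c\x08c" renders c in bold.
_OVERSTRIKE = re.compile(r".\x08")


def strip_overstrike(text: str) -> str:
    """Remove backspace overstrike pairs, leaving plain text."""
    return _OVERSTRIKE.sub("", text)


sample = "_\x08H_\x08i and B\x08Bo\x08ol\x08ld\x08d"
print(strip_overstrike(sample))
```

This is roughly what `col -b` does on many systems; running html2text with __-nobs__ avoids the need for it.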
!!FILES

''/etc/html2textrc''

System wide parser configuration file.

''$HOME/.html2textrc''

Personal parser configuration file, overrides the system
wide values.
!!CONFORMING TO

__HTML 3.2__ (HTML 3.2 Reference Specification -
http://www.w3.org/TR/REC-html32),
__RFC 1945__ (Hypertext Transfer Protocol -
HTTP).
!!NOTES

__html2text__ makes a considerable effort to parse
syntactically incorrect input, but is not always as
successful as other HTML processors. If you are able
to correct the HTML source code, you may want to
use the __-unparse__ or __-check__ options to find out
what exactly __html2text__'s problem is.
!!RESTRICTIONS

__html2text__ provides only a basic implementation of the
Hypertext Transfer Protocol (HTTP). It requires the complete
and exactly matching URI to be given as argument and will
not follow redirections (HTTP 301/307).
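
Because redirections are not followed, a calling script may want to resolve the final URI itself before passing it on. The following Python sketch shows one way to do that under stated assumptions: `resolve_redirects` and the `probe` callable are hypothetical names, and `probe` stands for any function that returns the HTTP status code and `Location` header for a URI (for example one built on `http.client`).

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urljoin

# probe(url) -> (status code, Location header or None); supplied by the caller.
Probe = Callable[[str], Tuple[int, Optional[str]]]

# 301 and 307 are the codes named in the text; the other 3xx redirect codes
# are included here as an assumption.
REDIRECT_CODES = (301, 302, 303, 307, 308)


def resolve_redirects(url: str, probe: Probe, max_hops: int = 5) -> str:
    """Follow redirects reported by probe and return the final URI."""
    for _ in range(max_hops + 1):
        status, location = probe(url)
        if status not in REDIRECT_CODES or not location:
            return url
        # Location may be relative, so resolve it against the current URI.
        url = urljoin(url, location)
    raise RuntimeError("too many redirects")


# Stand-in for real network access, using made-up example hosts.
fake_responses = {
    "http://a.example/old": (301, "/new"),
    "http://a.example/new": (307, "http://b.example/final"),
    "http://b.example/final": (200, None),
}
print(resolve_redirects("http://a.example/old", fake_responses.__getitem__))
```

The returned URI can then be given to html2text as the ''input-uri''.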
!!AUTHOR

__html2text__ was written up to version 1.2.2 by Arno
Unkrig.

Current maintainer and primary download location:
Martin Bayer
http://userpage.fu-berlin.de/~mbayer/tools/html2text.html
!!SEE ALSO

html2textrc(5), less(1),
more(1)
----