Blame: LatexWordcount - Waikato Linux Users Group

Annotated edit history of LatexWordcount version 7, including all changes.

Rev	Author	#	Line
5	JohnMcPherson	1	`!!Option 1 - detex`
		2
1	DanielLawson	3	`If you count the words in a LaTeX source file with the shell command 'wc', the total includes any macros you have used and any comments.`
		4
4	ArthurVanBunningen	5	`A better way is to use a tool called [detex\|http://www.cs.purdue.edu/homes/trinkle/detex/]. It does a (fairly) good job of stripping out LaTeX macros and comments, although it's probably not usable as a .tex -> readable ASCII filter (it gets the ordering of included files wrong, for starters). It also automatically includes referenced LaTeX files (at least the ones pulled in with \input), so you don't have to add up per-file counts yourself.`
1	DanielLawson	6
		7	`I added the following rules to my Makefile:`
		8
		9	`<verbatim>`
		10	`.PHONY: wordcount`
		11
		12	`wordcount: thesis.tex`
		13	`	detex thesis.tex \| wc`
		14	`</verbatim>`
		15
		16	`Running 'make wordcount' will now run detex over my .tex file and pipe the output through the 'wc' program.`
		17
		18	`If I do a 'normal' wordcount:`
		19
		20	`<verbatim>`
		21	`$ cat thesis.tex \| wc`
		22	`1419 7738 57862`
		23	`</verbatim>`
		24
		25	`Compare this to our new make target:`
		26
		27	`<verbatim>`
		28	`$ make wordcount`
		29	`/home/dlawson/bin/detex thesis.tex \| wc`
		30	`1009 5528 35563`
		31	`</verbatim>`
		32
		33	`A noticeable difference!`
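To see how much of a raw count is markup, the two counts can be printed side by side. The following is a minimal sketch, not part of the original Makefile: the function name compare_counts is made up for illustration, and it assumes detex is on the PATH (it reports n/a otherwise).

```shell
# Hypothetical helper: print the raw and detex word counts for one file.
compare_counts() {
    file=$1
    # Raw count: macros and comments are counted as words.
    raw=$(wc -w < "$file" | tr -d ' ')
    if command -v detex >/dev/null 2>&1; then
        # Stripped count: detex removes macros and comments first.
        stripped=$(detex "$file" | wc -w | tr -d ' ')
    else
        stripped="n/a"   # assumption: detex may not be installed
    fi
    printf 'raw=%s detex=%s\n' "$raw" "$stripped"
}
```

For example, running compare_counts thesis.tex prints a single line containing both counts.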
2	JamesSpencer	34
5	JohnMcPherson	35	`<tt>detex</tt> isn't 100% accurate, though. It doesn't seem to understand conditionals (\ifx1 do this \else do that \fi) or arguments that aren't part of the content (like the package name in \usepackage{hyperref}).`
2	JamesSpencer	36
5	JohnMcPherson	37	`----`
		38
		39	`!!Option 2 - pdftotext/ps2ascii`
2	JamesSpencer	40	`Alternatively, compile the .tex file to PDF, convert that to ASCII text, and run the result through wc:`
		41
		42	`<verbatim>`
		43	`$ pdflatex report.tex`
		44	`$ ps2ascii report.pdf \| wc -w`
3	JamesSpencer	45	`2003`
2	JamesSpencer	46	`</verbatim>`
5	JohnMcPherson	47
		48	`Note that because PostScript and PDF are page description languages, this conversion is not completely accurate: the converter has to guess heuristically where the word breaks are. I noticed that it broke some words in two, especially in larger fonts such as section headings. Some letter combinations seem to be worse than others: "project" and "object" were consistently broken into "pro ject" and "ob ject". Don't treat the number of words as gospel :)`
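The heading for this option mentions pdftotext, but the example above only uses ps2ascii. A minimal sketch of the pdftotext variant follows, assuming the pdftotext tool (from xpdf or poppler-utils) is installed; the function name pdf_wordcount is hypothetical. pdftotext also has to rebuild words from positioned glyphs, so the same caveat about broken words probably applies.

```shell
# Hypothetical helper: count the words in a PDF via pdftotext.
pdf_wordcount() {
    pdf=$1
    # Fail early if the PDF is missing or unreadable.
    [ -r "$pdf" ] || { echo "cannot read $pdf" >&2; return 1; }
    # Assumption: pdftotext may not be installed on every system.
    command -v pdftotext >/dev/null 2>&1 || { echo "pdftotext not found" >&2; return 2; }
    # "-" as the output file tells pdftotext to write to stdout.
    pdftotext "$pdf" - | wc -w
}
```

After running pdflatex report.tex, calling pdf_wordcount report.pdf prints the count.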
		49
		50	`----`
		51
		52	`!!Option 3 - dvi2tty`
		53	`As you can guess from the name, this is designed for rendering a .dvi file to a terminal (or to a file).`
6	JohnMcPherson	54	`The disadvantage is that you have to use plain latex instead of pdflatex, so that you can generate a .dvi file. [IMHO] this is a better approach than the previous two: LaTeX has already parsed the source files, so dvi2tty doesn't need to worry about TeX/LaTeX commands, but the format still has enough information to keep words intact (except where TeX has hyphenated words at the margin).`
5	JohnMcPherson	55	`<verbatim>`
		56	`# -w100 means format for 100 chars wide`
		57	`# perl to remove punctuation so wc doesn't count them`
6	JohnMcPherson	58	`# and remove hyphenation at the end of lines`
5	JohnMcPherson	59	`dvi2tty -w100 /path/to/file.dvi \`
		60	`\| perl -pe 's/-$// \|\| s/$/ /; chomp; s/[-\._\|]//g' \| wc -w`
		61	`</verbatim>`
		62
		63
		64	`You want the <tt>dvi2tty</tt> package in Debian Sarge,`
		65	`dev-tex/dvi2tty in Gentoo, or get it from`
		66	`<tt>ftp://ftp.mesa.nl/pub/dvi2tty/dvi2tty-5.3.1.tar.gz</tt>`
7	MetinSezgin	67
		68	`----`
		69
		70	`!!Option 4 - copy paste`
		71
		72	`Copy your document from a PDF, DVI or PostScript viewer and paste it into Word. Do the usual word count.`
		73
		74	`This will overestimate the word count in most cases.`

Last edited on Sunday, August 20, 2006 5:21:43 am by "MetinSezgin"
