Alvis-Convert
bin/html2plain
__END__
=head1 NAME
html2plain.pl - HTML to plain text converter
=head1 SYNOPSIS
html2plain.pl [options] [source directory ...]
Options:
    --html-ext                 HTML file identifying filename extension
    --out-ext                  output filename extension
    --out-dir                  output directory
    --N-per-out-dir            # of records per output directory
    --source-encoding          the encoding of the HTML files
    --[no]assert-html          assert that the document is HTML
    --[no]symbolic-char-entities-to-chars
                               convert symbolic character entities to UTF-8
                               characters
    --[no]numerical-char-entities-to-chars
                               convert numerical character entities to UTF-8
                               characters
    --[no]clean-whitespace     remove redundant whitespace
    --[no]assert-assumptions   assert that the document is in UTF-8 and
                               contains no '\0's before actually converting
                               to plain text
    --help                     brief help message
    --man                      full documentation
    --[no]warnings             warnings output flag
=head1 OPTIONS
=over 8
=item B<--html-ext>
Sets the filename extension that identifies HTML files.
Default value: 'html'.
=item B<--out-ext>
Sets the output filename extension.
Default value: 'plain'.
=item B<--out-dir>
Sets the output directory. Default value: '.'.
=item B<--N-per-out-dir>
Sets the number of records per output directory. Default value: 1000.
=item B<--source-encoding>
Specifies the encoding of the HTML files. Default value: undef,
which means that the encoding is guessed for each document.
=item B<--[no]assert-html>
Specifies whether it is asserted that the document actually looks like
HTML before trying to convert. Default: yes.
=item B<--[no]symbolic-char-entities-to-chars>
Specifies whether symbolic character entities are converted to
UTF-8 characters. Default: yes.
=item B<--[no]numerical-char-entities-to-chars>
Specifies whether numerical character entities are converted to
UTF-8 characters. Default: yes.
=item B<--[no]clean-whitespace>
Specifies whether redundant whitespace is removed from the output.
Default: yes.
=item B<--[no]assert-assumptions>
Specifies whether assumptions about the source are validated before
trying to convert: that it is in UTF-8 (the encoding it is converted to
internally) and contains no '\0's. Default: yes.
=item B<--help>
Prints a brief help message and exits.
=item B<--man>
Prints the manual page and exits.
=item B<--[no]warnings>
Specifies whether warnings are output. Default: yes.
=back
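The options above can be combined on one command line. The following is a minimal sketch, not a tested recipe: the directory names (./crawl, ./plain) and the option values (htm, txt, 500) are made up for illustration, it assumes the script is installed on the PATH as html2plain.pl, and the negated spelling --noclean-whitespace is assumed from the --[no] notation used in the option list.

```shell
# Hypothetical locations; adjust to your own tree.
SRC_DIR=./crawl   # assumed directory holding the *.htm files
OUT_DIR=./plain   # assumed directory for the converted files

# Only flags documented in OPTIONS are used; the values are examples.
set -- --html-ext htm --out-ext txt --out-dir "$OUT_DIR" \
       --N-per-out-dir 500 --noclean-whitespace "$SRC_DIR"

if command -v html2plain.pl >/dev/null 2>&1; then
    html2plain.pl "$@"
else
    # Script not installed: print the command instead of running it.
    echo "html2plain.pl $*"
fi
```

Options that are left out keep their documented defaults, so this run would still assert that each document looks like HTML and would still convert character entities.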
=head1 DESCRIPTION
Recursively scans each source directory for HTML files (those with the
B<--html-ext> filename extension) and writes the textual content of each
one to a plain text file. The output is in UTF-8.
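Before a large run it can help to preview which files a recursive scan is likely to pick up. The sketch below uses find(1) rather than html2plain.pl itself, so it only approximates the selection: html_files is a hypothetical helper, and matching on a literal ".ext" suffix is an assumption about how the extension is applied.

```shell
# List regular files under a directory whose names end in the given extension.
# $1 = source directory, $2 = extension without the dot (defaults to html)
html_files() {
    find "$1" -type f -name "*.${2:-html}" | sort
}
```

For example, `html_files ./crawl htm | wc -l` would count the candidate files for a run with --html-ext htm, which gives a rough idea of how many output directories to expect for a given --N-per-out-dir value.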
=cut