Alvis-Convert

bin/html2plain

__END__

=head1 NAME
    
    html2plain.pl - HTML to plain text converter
    
=head1 SYNOPSIS
    
    html2plain.pl [options] [source directory ...]

  Options:

    --html-ext                filename extension identifying HTML files
    --out-ext                 output filename extension
    --out-dir                 output directory
    --N-per-out-dir           number of records per output directory
    --source-encoding         the encoding of the HTML files
    --[no]assert-html         assert that the document is HTML
    --[no]symbolic-char-entities-to-chars
                              convert symbolic character entities to UTF-8
                              characters
    --[no]numerical-char-entities-to-chars
                              convert numerical character entities to UTF-8
                              characters
    --[no]clean-whitespace    remove redundant whitespace
    --[no]assert-assumptions  assert that the document is in UTF-8 and contains
                              no '\0's before actually converting to plain text
    --help                    brief help message
    --man                     full documentation
    --[no]warnings            warnings output flag
    
=head1 OPTIONS
    
=over 8

=item B<--html-ext>

    Sets the filename extension that identifies HTML files.
    Default value: 'html'.

=item B<--out-ext>

    Sets the output filename extension. 
    Default value: 'plain'.

=item B<--out-dir>

    Sets the output directory. Default value: '.'.

=item B<--N-per-out-dir>

    Sets the number of records per output directory. Default value: 1000.

=item B<--source-encoding>

    Specifies the encoding of the HTML files. Default value: undef,
    which means that the encoding is guessed for each document.

=item B<--[no]assert-html>

    Specifies whether it is asserted that the document actually looks like
    HTML before trying to convert. Default: yes.

=item B<--[no]symbolic-char-entities-to-chars>

    Specifies whether symbolic character entities are converted to 
    UTF-8 characters. Default: yes.

=item B<--[no]numerical-char-entities-to-chars>

    Specifies whether numerical character entities are converted to 
    UTF-8 characters. Default: yes.

=item B<--[no]clean-whitespace>

    Specifies whether redundant whitespace is removed from the output.
    Default: yes.

=item B<--[no]assert-assumptions>

    Specifies whether assumptions about the source are validated before
    trying to convert: that it is in UTF-8 (it is converted to UTF-8
    internally) and contains no '\0's. Default: yes.

=item B<--help>

    Prints a brief help message and exits.

=item B<--man>

    Prints the manual page and exits.

=item B<--[no]warnings>

    Output (or suppress) warnings. Default value: yes.

=back
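
The conversion switches above can be illustrated with a minimal sketch. It is not the code used by html2plain.pl; the helper names numeric_entities_to_chars, clean_whitespace and assumptions_hold are hypothetical, and the exact whitespace rules are an assumption made for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: replace decimal (&#233;) and hexadecimal (&#xE9;)
# numerical character entities with the characters they name.
sub numeric_entities_to_chars {
    my ($text) = @_;
    $text =~ s/&#(\d+);/chr($1)/ge;
    $text =~ s/&#[xX]([0-9a-fA-F]+);/chr(hex($1))/ge;
    return $text;
}

# Hypothetical helper: one possible reading of "redundant whitespace".
sub clean_whitespace {
    my ($text) = @_;
    $text =~ s/[ \t]+/ /g;       # runs of spaces and tabs become one space
    $text =~ s/ ?\n ?/\n/g;      # no spaces around line breaks
    $text =~ s/\n{3,}/\n\n/g;    # at most one empty line in a row
    $text =~ s/^\s+|\s+$//g;     # trim both ends of the text
    return $text;
}

# Hypothetical helper: true if the octets are valid UTF-8 and contain
# no NUL ('\0') bytes.
sub assumptions_hold {
    my ($octets) = @_;
    return 0 if index($octets, "\0") >= 0;
    my $copy = $octets;          # decode() with FB_CROAK may modify its input
    return eval { decode('UTF-8', $copy, FB_CROAK); 1 } ? 1 : 0;
}
```

Symbolic entities such as &amp;eacute; need a lookup table from names to characters, which is omitted here to keep the sketch short.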

=head1 DESCRIPTION

    Recursively traverses the given source directories and converts the
    textual content of each HTML file found into a plain text file.
    The output is in UTF-8.
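
The recursive traversal could be sketched with the core File::Find module as below. This is an illustration only, not the script's actual implementation; find_html_files is a hypothetical name, and matching on a ".$html_ext" suffix is an assumption about how --html-ext is applied.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: collect every regular file under $source_dir whose
# name ends in ".$html_ext", descending into subdirectories.
sub find_html_files {
    my ($source_dir, $html_ext) = @_;
    my @found;
    find(
        {
            no_chdir => 1,    # keep $_ as the full path of each entry
            wanted   => sub {
                push @found, $File::Find::name
                    if -f $_ && /\.\Q$html_ext\E$/;
            },
        },
        $source_dir
    );
    my @sorted = sort @found;
    return @sorted;
}
```

Called as find_html_files('corpus', 'html'), where 'corpus' is a placeholder directory name, it returns the matching paths in sorted order.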

=cut



