App-optex-textconv

lib/App/optex/textconv.pm

This module replaces several sorts of filenames with a node representing
their text content.  The file itself is not altered.

For example, you can check the text difference between MS Word files
like this:

    $ optex diff -Mtextconv OLD.docx NEW.docx

If you have a symbolic link named B<diff> pointing to B<optex>, and the
following settings in your F<~/.optex.d/diff.rc>:

    option default --textconv
    option --textconv -Mtextconv $<move>

then the following command produces the same result:

    $ diff OLD.docx NEW.docx
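
The setup described above can be sketched as a small shell function.
The name C<setup_optex_diff> and its arguments (a directory on your
C<PATH>, the path to B<optex>, and a home directory) are hypothetical
and only serve the illustration; the rc-file contents are taken from
the settings shown above.

```shell
# Sketch: make `diff` an alias of optex with textconv enabled by default.
# setup_optex_diff is a hypothetical helper, not part of optex.
#   $1: directory on PATH where the symlink is created (assumption)
#   $2: path to the optex executable (assumption)
#   $3: home directory holding .optex.d (defaults to $HOME)
setup_optex_diff() {
    bindir=$1
    optex_path=$2
    home=${3:-$HOME}

    mkdir -p "$bindir" "$home/.optex.d" || return 1

    # Symbolic link named diff pointing to optex.
    ln -sf "$optex_path" "$bindir/diff" || return 1

    # Settings from the text above; the quoted heredoc keeps $<move> literal.
    cat > "$home/.optex.d/diff.rc" <<'EOF'
option default --textconv
option --textconv -Mtextconv $<move>
EOF
}
```

It might be called as C<setup_optex_diff "$HOME/bin" "$(command -v optex)">,
assuming F<~/bin> comes before the system B<diff> in your C<PATH>.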

=head2 FILE FORMATS

=over 7

=item git

A L<git(1)> file object, such as C<HEAD^:README.md>.

=item msdoc

Microsoft Office files in XML format (.docx, .pptx, .xlsx, .docm,
.pptm, .xlsm).
Uses
L<App::optex::textconv::msdoc>,
L<App::optex::textconv::ooxml>,
L<App::optex::textconv::ooxml::regex>, and
L<App::optex::textconv::ooxml::xslt>.

=item doc

Microsoft Word file.
Uses the L<Text::Extract::Word> module.

=item xls

Microsoft Excel file.
Uses the L<Spreadsheet::ParseExcel> module.

=item pdf

Uses the L<pdftotext(1)> command to convert PDF files.
See L<App::optex::textconv::pdf>.

=item jpeg

A JPEG file (.jpeg, .jpg) is converted to its Exif information.

=item http

A name starting with C<http://> or C<https://> is converted to text data
produced by the L<w3m(1)> command.

=item gpg

Invokes the L<gpg(1)> command to decrypt encrypted files with the C<.gpg>
extension.

=item pandoc

Uses the L<pandoc|https://pandoc.org/> command to convert Microsoft
Office documents in XML format.
See L<App::optex::textconv::pandoc>.

=item tika

Uses the L<Apache Tika|https://tika.apache.org/> command to convert
Microsoft Office documents in XML and non-XML formats.
See L<App::optex::textconv::tika>.

=back
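
The idea of choosing a converter from the shape of a name can be
illustrated with a short shell function.  This is only a sketch for
the formats listed above, not the module's actual matching code: the
function name C<textconv_kind>, the order of the checks, and the
C<*:*> pattern used to guess a git object name are all assumptions.
Matching here is also case-sensitive.

```shell
# Illustrative only: guess which format label from the list above
# applies to a name.  This is NOT the module's actual matching code.
textconv_kind() {
    case $1 in
        http://*|https://*)  echo http  ;;
        *:*)                 echo git   ;;  # assumed shape of a git object name
        *.gpg)               echo gpg   ;;
        *.docx|*.pptx|*.xlsx|*.docm|*.pptm|*.xlsm)
                             echo msdoc ;;
        *.doc)               echo doc   ;;
        *.xls)               echo xls   ;;
        *.pdf)               echo pdf   ;;
        *.jpeg|*.jpg)        echo jpeg  ;;
        *)                   echo none  ;;  # left untouched
    esac
}
```

For example, C<textconv_kind OLD.docx> prints C<msdoc>, and a name
that matches nothing is reported as C<none>, mirroring the fact that
other files are passed through unchanged.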

=head1 MICROSOFT DOCUMENTS

Microsoft Office documents in XML format (.docx, .pptx, .xlsx) are
converted to plain text by original code implemented in the
L<App::optex::textconv::ooxml::regex> module.  The algorithm used in
this module is extremely simple, and consequently runs fast.

=begin COMMENT

If the related modules are available, L<App::optex::textconv::ooxml::xslt>
is used to convert XML using the XSLT mechanism.

=end COMMENT

Two modules are included in this distribution to use the external
converter programs B<pandoc> and B<tika>, which implement much more
sophisticated algorithms.  They can be invoked by calling the B<load>
function in the module declaration, like this:

    optex -Mtextconv::load=pandoc

    optex -Mtextconv::load=tika

=head1 INSTALL

=head2 CPANM

    cpanm App::optex::textconv

=head2 GIT

These are sample configurations for using L<App::optex::textconv> in a
git environment.

	~/.gitconfig
		[diff "msdoc"]
			textconv = optex -Mtextconv cat
		[diff "pdf"]
			textconv = optex -Mtextconv cat
		[diff "jpg"]
			textconv = optex -Mtextconv cat

	~/.config/git/attributes
		*.docx   diff=msdoc
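
As an alternative to editing the files by hand, the same settings can
be written with L<git(1)> commands.  The following is a sketch: the
helper name C<register_textconv> and its two file arguments are
hypothetical, and the C<*.pdf> and C<*.jpg> attribute lines are
assumptions that follow the scheme of the C<*.docx> line above.

```shell
# Sketch: register the textconv drivers with git commands.
# register_textconv is a hypothetical helper.
#   $1: git config file to write (e.g. ~/.gitconfig)
#   $2: attributes file to append to (e.g. ~/.config/git/attributes)
register_textconv() {
    config=$1
    attributes=$2

    # One [diff "<driver>"] section per driver named in the sample.
    for driver in msdoc pdf jpg; do
        git config --file "$config" "diff.$driver.textconv" \
            "optex -Mtextconv cat" || return 1
    done

    mkdir -p "$(dirname "$attributes")" || return 1

    # Only the *.docx line appears in the sample above; the other
    # two patterns are assumptions following the same scheme.
    cat >> "$attributes" <<'EOF'
*.docx   diff=msdoc
*.pdf    diff=pdf
*.jpg    diff=jpg
EOF
}
```

It might be called as
C<register_textconv ~/.gitconfig ~/.config/git/attributes>.  Because
the attribute lines are appended, running it twice, or after adding
the lines by hand, leaves duplicate entries.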


