App-optex-textconv
lib/App/optex/textconv.pm
This module replaces several sorts of filenames with a node representing
their text content. The file itself is not altered.
For example, you can check the text difference between Microsoft Word
files like this:
$ optex diff -Mtextconv OLD.docx NEW.docx
If you have a symbolic link named B<diff> pointing to B<optex>, and the
following settings in your F<~/.optex.d/diff.rc>:
option default --textconv
option --textconv -Mtextconv $<move>
The next command then produces the same result:
$ diff OLD.docx NEW.docx
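The setup described above can be tried out in a scratch directory before touching the real home directory. This is only a sketch: the C<sandbox> variable and its layout are hypothetical names for this example, and it assumes B<optex> is installed (if it is not on PATH, the link simply points at the bare name C<optex>).

```shell
# Scratch layout so nothing in the real home directory is touched.
# "sandbox" is a hypothetical name used only in this example.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/bin" "$sandbox/.optex.d"

# A symbolic link named diff that points at optex.
# Falls back to the bare name if optex is not on PATH yet.
ln -s "$(command -v optex || echo optex)" "$sandbox/bin/diff"

# Enable --textconv by default for the diff command.
# The quoted 'EOF' keeps $<move> literal instead of expanding it.
cat > "$sandbox/.optex.d/diff.rc" <<'EOF'
option default --textconv
option --textconv -Mtextconv $<move>
EOF

# Show the resulting rc file.
cat "$sandbox/.optex.d/diff.rc"
```

For real use, the link would go in a directory that comes early in PATH and the rc file in F<~/.optex.d>, as described above.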
=head2 FILE FORMATS
=over 7
=item git
L<git(1)> file object, such as C<HEAD^:README.md>.
=item msdoc
Microsoft Office files in XML format (.docx, .pptx, .xlsx, .docm,
.pptm, .xlsm).
Use
L<App::optex::textconv::msdoc>,
L<App::optex::textconv::ooxml>,
L<App::optex::textconv::ooxml::regex>,
L<App::optex::textconv::ooxml::xslt>.
=item doc
Microsoft Word file.
Use L<Text::Extract::Word> module.
=item xls
Microsoft Excel file.
Use L<Spreadsheet::ParseExcel> module.
=item pdf
Use the L<pdftotext(1)> command to convert PDF files.
See L<App::optex::textconv::pdf>.
=item jpeg
A JPEG file is converted to its Exif information (.jpeg, .jpg).
=item http
A name starting with C<http://> or C<https://> is converted to text
data by the L<w3m(1)> command.
=item gpg
Invoke the L<gpg(1)> command to decrypt encrypted files with the
C<.gpg> extension.
=item pandoc
Use the L<pandoc|https://pandoc.org/> command to translate Microsoft
Office documents in XML format.
See L<App::optex::textconv::pandoc>.
=item tika
Use the L<Apache Tika|https://tika.apache.org/> command to translate
Microsoft Office documents in XML and non-XML formats.
See L<App::optex::textconv::tika>.
=back
=head1 MICROSOFT DOCUMENTS
Microsoft Office documents in XML format (.docx, .pptx, .xlsx) are
converted to plain text by original code implemented in the
L<App::optex::textconv::ooxml::regex> module. The algorithm used in
this module is extremely simple, and consequently runs fast.
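To give a feel for how simple a regex-based conversion can be, here is a rough sketch for the body of a Word document. It is not the code used by L<App::optex::textconv::ooxml::regex>. The function name C<ooxml_text> is hypothetical, and the approach (end each C<w:p> paragraph with a newline, then drop every remaining tag) is an assumption made for illustration. Entities such as C<&amp;> are left undecoded.

```shell
# Hypothetical helper: read WordprocessingML on stdin, print plain text.
ooxml_text() {
    # 1st expression: turn each paragraph end tag into a line break.
    # 2nd expression: delete every remaining XML tag.
    sed -e 's|</w:p>|\
|g' -e 's/<[^>]*>//g'
}

# Tiny hand-written sample standing in for word/document.xml.
sample='<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p><w:p><w:r><w:t>World</w:t></w:r></w:p></w:body>'
printf '%s\n' "$sample" | ooxml_text
```

Since a .docx file is a zip archive whose main text lives in C<word/document.xml>, something like C<unzip -p NEW.docx word/document.xml | ooxml_text> would apply the same filter to a real file, assuming L<unzip(1)> is available.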
=begin COMMENT
If related modules are available, L<App::optex::textconv::ooxml::xslt>
is used to convert XML using the XSLT mechanism.
=end COMMENT
Two modules are included in this distribution to use other external
converter programs, B<pandoc> and B<tika>, which implement much more
sophisticated algorithms. They can be invoked by calling the B<load>
function in the module declaration, like:
optex -Mtextconv::load=pandoc
optex -Mtextconv::load=tika
=head1 INSTALL
=head2 CPANM
cpanm App::optex::textconv
=head2 GIT
These are sample configurations for using L<App::optex::textconv> in a
git environment.
~/.gitconfig
    [diff "msdoc"]
        textconv = optex -Mtextconv cat
    [diff "pdf"]
        textconv = optex -Mtextconv cat
    [diff "jpg"]
        textconv = optex -Mtextconv cat
~/.config/git/attributes
    *.docx diff=msdoc
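The same settings can also be applied to a single repository instead of globally. The sketch below does this in a throwaway repository. The C<repo> variable is a hypothetical name, and only the C<msdoc> driver and the C<*.docx> mapping from the sample above are reproduced.

```shell
# Hypothetical scratch repository for trying the settings out.
repo=$(mktemp -d)
git init -q "$repo"

# Same driver as the [diff "msdoc"] section, stored in .git/config.
git -C "$repo" config diff.msdoc.textconv "optex -Mtextconv cat"

# Same mapping as the attributes file, scoped to this repository.
printf '%s\n' '*.docx diff=msdoc' > "$repo/.gitattributes"

# Show what git recorded for the driver and for a .docx path.
git -C "$repo" config --get diff.msdoc.textconv
git -C "$repo" check-attr diff -- report.docx
```

With this in place, C<git diff> on a tracked .docx file in that repository runs the configured textconv command on each version before comparing them.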