App-optex-textconv

# DESCRIPTION

This module replaces several sorts of filenames with a node representing
their text content.  The file itself is not altered.

For example, you can check the text difference between MS Word files
like this:

    $ optex diff -Mtextconv OLD.docx NEW.docx

If you have a symbolic link named **diff** pointing to **optex**, and the
following settings in your `~/.optex.d/diff.rc`:

    option default --textconv
    option --textconv -Mtextconv $<move>

The following command then produces the same result:

    $ diff OLD.docx NEW.docx
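
The setup described above can be sketched as a small shell function. This is only an illustration: the function name `setup_optex_diff` and its three arguments are hypothetical, and it assumes the directory holding the new `diff` link comes before the system `diff` in your `PATH`.

```shell
# Hypothetical helper: make `diff` run through optex with -Mtextconv by default.
# Arguments (all assumptions, adjust to your setup):
#   $1  path to the optex executable
#   $2  directory on PATH that should hold the `diff` symlink
#   $3  optex configuration directory (normally ~/.optex.d)
setup_optex_diff() {
    optex_path=$1
    bin_dir=$2
    rc_dir=$3
    mkdir -p "$bin_dir" "$rc_dir" || return 1
    # A symbolic link named "diff" that points to optex.
    ln -sf "$optex_path" "$bin_dir/diff" || return 1
    # The quoted heredoc delimiter keeps $<move> literal.
    cat > "$rc_dir/diff.rc" <<'EOF'
option default --textconv
option --textconv -Mtextconv $<move>
EOF
}

# Example call (commented out so that sourcing this snippet changes nothing):
# setup_optex_diff "$(command -v optex)" "$HOME/bin" "$HOME/.optex.d"
```

Passing the directories as arguments makes it easy to try the function in a scratch directory before touching your real configuration.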

## FILE FORMATS

- git

    [git(1)](http://man.he.net/man1/git) file object, such as `HEAD^:README.md`.

- msdoc

    Microsoft Office files in XML format (.docx, .pptx, .xlsx, .docm,
    .pptm, .xlsm).
    Use
    [App::optex::textconv::msdoc](https://metacpan.org/pod/App%3A%3Aoptex%3A%3Atextconv%3A%3Amsdoc),
    [App::optex::textconv::ooxml](https://metacpan.org/pod/App%3A%3Aoptex%3A%3Atextconv%3A%3Aooxml),
    [App::optex::textconv::ooxml::regex](https://metacpan.org/pod/App%3A%3Aoptex%3A%3Atextconv%3A%3Aooxml%3A%3Aregex),
    [App::optex::textconv::ooxml::xslt](https://metacpan.org/pod/App%3A%3Aoptex%3A%3Atextconv%3A%3Aooxml%3A%3Axslt).

- doc

    Microsoft Word file.
    Use [Text::Extract::Word](https://metacpan.org/pod/Text%3A%3AExtract%3A%3AWord) module.

- xls

    Microsoft Excel file.
    Use [Spreadsheet::ParseExcel](https://metacpan.org/pod/Spreadsheet%3A%3AParseExcel) module.

- pdf

    Use the [pdftotext(1)](http://man.he.net/man1/pdftotext) command to convert PDF files.
    See [App::optex::textconv::pdf](https://metacpan.org/pod/App%3A%3Aoptex%3A%3Atextconv%3A%3Apdf).

- jpeg

    A JPEG file (.jpeg, .jpg) is converted to its Exif information.

- http

    A name starting with `http://` or `https://` is converted to text data
    translated by the [w3m(1)](http://man.he.net/man1/w3m) command.

- gpg

    Invoke the [gpg(1)](http://man.he.net/man1/gpg) command to decrypt encrypted files with the `.gpg`
    extension.

- pandoc

    Use the [pandoc](https://pandoc.org/) command to translate Microsoft
    Office documents in XML format.
    See [App::optex::textconv::pandoc](https://metacpan.org/pod/App%3A%3Aoptex%3A%3Atextconv%3A%3Apandoc).

- tika

    Use the [Apache Tika](https://tika.apache.org/) command to translate
    Microsoft Office documents in XML and non-XML formats.
    See [App::optex::textconv::tika](https://metacpan.org/pod/App%3A%3Aoptex%3A%3Atextconv%3A%3Atika).
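
As a rough illustration of the list above, the following shell function maps a name to one of the format labels by pattern. It is not the module's implementation: the function name `textconv_kind`, the glob patterns, and the order in which they are tried are all assumptions made for this sketch (in particular, treating any remaining name containing `:` as a git object is a simplification).

```shell
# Hypothetical classifier: print the format label that a name would fall under.
textconv_kind() {
    case $1 in
        http://*|https://*) echo http ;;   # URLs are checked first
        *:*)                echo git ;;    # e.g. HEAD^:README.md (simplified)
        *.docx|*.pptx|*.xlsx|*.docm|*.pptm|*.xlsm) echo msdoc ;;
        *.doc)              echo doc ;;
        *.xls)              echo xls ;;
        *.pdf)              echo pdf ;;
        *.jpeg|*.jpg)       echo jpeg ;;
        *.gpg)              echo gpg ;;
        *)                  echo none ;;   # left untouched
    esac
}

# Example:
# textconv_kind 'HEAD^:README.md'
# textconv_kind NEW.docx
```

The pandoc and tika entries are omitted because they handle the same extensions as msdoc and are selected explicitly rather than by name.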

# MICROSOFT DOCUMENTS

Microsoft Office documents in XML format (.docx, .pptx, .xlsx) are
converted to plain text by original code implemented in the
[App::optex::textconv::ooxml::regex](https://metacpan.org/pod/App%3A%3Aoptex%3A%3Atextconv%3A%3Aooxml%3A%3Aregex) module.  The algorithm used in this
module is extremely simple, and consequently runs fast.

Two modules are included in this distribution to use external
converter programs, **pandoc** and **tika**, which implement much more
sophisticated algorithms.  They can be invoked by calling the **load**
function with a module declaration like:

    optex -Mtextconv::load=pandoc

    optex -Mtextconv::load=tika

# INSTALL

## CPANM

    cpanm App::optex::textconv

## GIT

These are sample configurations for using [App::optex::textconv](https://metacpan.org/pod/App%3A%3Aoptex%3A%3Atextconv) in a git
environment.

        ~/.gitconfig
                [diff "msdoc"]
                        textconv = optex -Mtextconv cat
                [diff "pdf"]
                        textconv = optex -Mtextconv cat
                [diff "jpg"]
                        textconv = optex -Mtextconv cat

        ~/.config/git/attributes
                *.docx   diff=msdoc
                *.pptx   diff=msdoc
                *.xlsx   diff=msdoc
                *.pdf    diff=pdf
                *.jpg    diff=jpg

For other git-related settings, see
[https://github.com/kaz-utashiro/sdif-tools](https://github.com/kaz-utashiro/sdif-tools).
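
The sample configuration can also be written with `git config` rather than by editing the files by hand. The sketch below assumes a hypothetical helper named `setup_git_textconv` that takes the config file and the attributes file as arguments, so it can be tried on scratch files first.

```shell
# Hypothetical helper: register the textconv drivers shown in the sample.
#   $1  git config file to write (e.g. ~/.gitconfig)
#   $2  attributes file to append to (e.g. ~/.config/git/attributes)
setup_git_textconv() {
    config_file=$1
    attr_file=$2
    mkdir -p "$(dirname "$attr_file")" || return 1
    # One diff driver per format, all using the same converter command.
    for driver in msdoc pdf jpg; do
        git config --file "$config_file" \
            "diff.$driver.textconv" 'optex -Mtextconv cat' || return 1
    done
    # Map file patterns to the drivers defined above.
    cat >> "$attr_file" <<'EOF'
*.docx   diff=msdoc
*.pptx   diff=msdoc
*.xlsx   diff=msdoc
*.pdf    diff=pdf
*.jpg    diff=jpg
EOF
}

# Example call (commented out so that sourcing this snippet changes nothing):
# setup_git_textconv "$HOME/.gitconfig" "$HOME/.config/git/attributes"
```

Note that the function appends to the attributes file, so running it twice duplicates the pattern lines.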

# SEE ALSO


