    'text'             => $text,
    'mode'             => 'dummy',
  }

=back

=head2 langof_file

I<langof_file> works just like I<langof>, with the exception that it
receives filenames instead of text. It reads these files (if they exist
and are readable, of course) and parses their content.

Currently, I<langof_file> assumes the files are regular text. This may
change in the future and the files might be scanned to check their
filetype and then parsed to extract only their textual content (which
should be pretty useful so that you can perform language
identification, say, in HTML files, or PDFs).

To identify the language a file is written in:

  $language = langof_file($path);

To get the most probable language together with its probability (as a
percentage), do:

  ($language, $probability) = langof_file($path);

If you want a hash where each active language is mapped to its percentage,
use this:

  %languages = langof_file($path);

If you pass more than one file to I<langof_file>, they will all be
read and their content merged and then parsed for language
identification.

=head3 OPTIONS

I<langof_file> accepts all the options I<langof> does, so refer to
those first (earlier in this document).

  $language = langof_file(\%config, $path);

I<langof_file> currently only reads the first 10,000 bytes of each
file.
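
The behaviour described above (several files read, each truncated to a
fixed number of bytes, and the pieces merged) can be pictured with the
following minimal sketch. It is not the module's actual code:
C<read_heads> is a hypothetical helper, and silently skipping missing
or unreadable files is an assumption based on the description of
I<langof_file>.

  use strict;
  use warnings;

  # Hypothetical helper, not part of Lingua::Identify: read at most
  # $limit bytes from each readable file and concatenate the pieces.
  sub read_heads {
      my ($limit, @paths) = @_;
      my $text = '';
      for my $path (@paths) {
          next unless -f $path && -r _;      # skip missing/unreadable files
          open my $fh, '<', $path or next;
          read($fh, my $chunk, $limit);      # read up to $limit bytes
          close $fh;
          $text .= $chunk if defined $chunk;
      }
      return $text;
  }

  my $text = read_heads(10_000, @ARGV);

The merged C<$text> could then be handed to I<langof> in the usual way.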

=head2 confidence

After getting the results into an array, its first element is the most probable
language. That alone doesn't tell you whether it is very probable or not.

You can find out more about how likely the results are to be accurate by
computing their confidence level.

  use Lingua::Identify qw/:language_identification/;
  my @results = langof($text);
  my $confidence_level = confidence(@results);
  # $confidence_level now holds a value between 0.5 and 1; the higher that
  # value, the more accurate the results seem to be

The formula used is pretty simple: p1 / (p1 + p2), where p1 is the
probability of the most likely language and p2 is the probability of
the language which came in second. A couple of examples to illustrate
this:

English 50% Portuguese 10% ...

confidence level: 50 / (50 + 10) = 0.83

Another example:

Spanish 30% Portuguese 25% ...

confidence level: 30 / (30 + 25) = 0.55

A third example:

French 10% German 5% ...

confidence level: 10 / (10 + 5) = 0.67

As you can see, the first example is probably the most accurate one,
and there is little room for doubt: English has five times the
probability of the second language.

The second example is a bit trickier: 55% confidence. The confidence
level is never below 50%, because p1 is always at least as large as
p2. A value of 55% shouldn't make anyone confident in results such as
these.

Notice the third example. The confidence level goes up to 67%, but the
probability of French is a mere 10%. So what? It's twice as much as
the second language. The low probability may well be caused by a great
number of languages in play.
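
The arithmetic of the three examples can be reproduced with a minimal
sketch of the formula. C<confidence_sketch> is a hypothetical helper
written for illustration, not the module's I<confidence> function; it
assumes a flat list of (language, probability) pairs sorted by
descending probability, and returning 0 when there are no results is
likewise an assumption.

  use strict;
  use warnings;

  # Hypothetical helper: p1 / (p1 + p2) over a flat, sorted
  # (language, probability, language, probability, ...) list.
  sub confidence_sketch {
      my @results = @_;
      my $p1 = $results[1];
      return 0 unless defined $p1 && $p1 > 0;          # assumption: no result
      my $p2 = defined $results[3] ? $results[3] : 0;  # no runner-up: p2 = 0
      return $p1 / ($p1 + $p2);
  }

  printf "%.2f\n", confidence_sketch(English => 50, Portuguese => 10); # 0.83
  printf "%.2f\n", confidence_sketch(Spanish => 30, Portuguese => 25); # 0.55
  printf "%.2f\n", confidence_sketch(French  => 10, German     => 5);  # 0.67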

=head2 get_all_methods

Returns a list of all the available methods for language
identification.

=head1 LANGUAGE IDENTIFICATION IN GENERAL

Language identification is based on patterns.

In order to identify the language a given text is written in, we repeat a given
process for each active language (see section LANGUAGES MANIPULATION); in that
process, we look for common patterns of that language. Those patterns can be
prefixes, suffixes, common words, ngrams or even sequences of words.

After repeating the process for each language, the total score for each of them
is then used to compute the probability (in percentage) for each language to be
the one of that text.
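
As a rough illustration of that last step, the sketch below turns
per-language scores into percentages. The helper name
C<scores_to_percentages>, the sample scores, and the plain
divide-by-total normalisation are all assumptions made for the example;
the module's own computation may differ.

  use strict;
  use warnings;

  # Hypothetical helper: map each language to its share of the total
  # score, expressed as a percentage.
  sub scores_to_percentages {
      my %score = @_;
      my $total = 0;
      $total += $_ for values %score;
      return () unless $total > 0;       # nothing matched at all
      return map { $_ => 100 * $score{$_} / $total } keys %score;
  }

  my %pct = scores_to_percentages(en => 6, pt => 3, fr => 1);
  printf "%s: %.0f%%\n", $_, $pct{$_}
      for sort { $pct{$b} <=> $pct{$a} } keys %pct;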

=head1 METHODS OF LANGUAGE IDENTIFICATION

C<Lingua::Identify> currently comprises four different approaches to language
identification, with a total of thirteen variations of those.

The available methods are the following: B<smallwords>, B<prefixes1>,
B<prefixes2>, B<prefixes3>, B<prefixes4>, B<suffixes1>, B<suffixes2>,
B<suffixes3>, B<suffixes4>, B<ngrams1>, B<ngrams2>, B<ngrams3> and B<ngrams4>.

Here's a more detailed explanation of each of those ways and those methods