Lugh

=head2 Regular Tokens

Normal subword units:

    "hello"  → Single token
    "▁world" → Word with space prefix
    "ing"    → Common suffix

=head2 Special Tokens

Control tokens with special meaning:

    <s>     → BOS (beginning of sequence)
    </s>    → EOS (end of sequence)
    <unk>   → Unknown token
    <pad>   → Padding token

=head2 Byte Fallback Tokens

For characters that are not in the vocabulary (LLaMA models):

    <0x00> through <0xFF>  → Raw byte tokens

Because any character can be broken down into its UTF-8 bytes, this allows
encoding any UTF-8 text, even when it contains characters missing from the
vocabulary.
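
As a rough illustration of the byte fallback idea, the sketch below spells out
which byte tokens a single character would map to. The helper name
C<byte_fallback_tokens> is hypothetical and not part of Lugh; it only mirrors
the C<E<lt>0xNNE<gt>> naming shown above and assumes one token per UTF-8 byte.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not a Lugh API): list the byte fallback token
# names for one character, one <0xNN> piece per UTF-8 byte.
sub byte_fallback_tokens {
    my ($char) = @_;
    my $bytes = encode('UTF-8', $char);          # character -> raw bytes
    return map { sprintf '<0x%02X>', $_ } unpack 'C*', $bytes;
}

# U+20AC (euro sign) is three bytes in UTF-8: E2 82 AC
my @pieces = byte_fallback_tokens("\x{20AC}");
print "@pieces\n";    # <0xE2> <0x82> <0xAC>
```

A character that has its own vocabulary entry would normally be encoded as
that single token; the byte tokens only come into play when no entry matches.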

=head1 COMMON PATTERNS

=head2 Basic Tokenization

    my $model = Lugh::Model->new(model => $path);
    my $tokenizer = Lugh::Tokenizer->new(model => $model);
    
    my @tokens = $tokenizer->encode("Hello, world!");
    print "Tokens: @tokens\n";
    
    my $decoded = $tokenizer->decode(@tokens);
    print "Decoded: $decoded\n";

=head2 Token Inspection

    # See what each token represents
    my @tokens = $tokenizer->encode("The quick brown fox");
    for my $id (@tokens) {
        my $text = $tokenizer->decode([$id]);
        printf "Token %5d: '%s'\n", $id, $text;
    }

=head2 Chat Template

    # Build a chat prompt (LLaMA 2 format)
    my $prompt = "<s>[INST] <<SYS>>
    You are a helpful assistant.
    <</SYS>>
    
    What is the capital of France? [/INST]";
    
    # The prompt already begins with <s>, so disable automatic BOS insertion
    my @tokens = $tokenizer->encode($prompt, add_bos => 0);

=head2 Streaming Decode

    use IO::Handle;   # provides STDOUT->flush on Perl older than 5.14
    
    # Decode one token at a time (for streaming output)
    for my $token (@generated_tokens) {
        my $text = $tokenizer->decode([$token]);
        print $text;
        STDOUT->flush();
    }
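
When a character is represented by several byte fallback tokens, decoding one
token at a time can yield a fragment of a multi-byte UTF-8 sequence. The
sketch below shows one way to buffer such fragments until a full character is
available. It assumes the per-token output arrives as raw bytes, which is not
confirmed for Lugh; the C<@chunks> array is a made-up stand-in for that
output.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for per-token decode output: raw byte strings,
# with the euro sign (E2 82 AC) split across three single-byte chunks.
my @chunks = ("Price: ", "\xE2", "\x82", "\xAC", "5");

my $pending = '';   # bytes that do not yet form a complete character
my $output  = '';   # characters decoded so far

for my $chunk (@chunks) {
    $pending .= $chunk;
    # FB_QUIET decodes as far as it can and leaves the unprocessed
    # remainder (here: an incomplete sequence) in $pending.
    $output .= Encode::decode('UTF-8', $pending, Encode::FB_QUIET);
}

binmode STDOUT, ':encoding(UTF-8)';
print "$output\n";
```

In a real streaming loop the decoded part would be printed inside the loop
instead of being accumulated. Note that with C<FB_QUIET> a genuinely malformed
byte also stays in the buffer, so production code would need a policy for
discarding or replacing it.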

=head1 LIMITATIONS

=over 4

=item * B<Greedy Algorithm> - May not produce optimal BPE tokenization

=item * B<No Merge Rules> - Does not use BPE merge rules, just vocabulary lookup

=item * B<UTF-8 Only> - Input text must be valid UTF-8

=item * B<No Normalization> - Does not perform Unicode normalization

=back

For most LLM inference use cases, these limitations do not significantly
impact results.
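
To make the first two limitations concrete, here is a minimal sketch of greedy
vocabulary lookup, assuming a longest-match-first strategy. It is not Lugh's
actual implementation; the vocabulary, the ids, and the function name
C<greedy_encode> are invented for illustration.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Toy vocabulary with made-up ids (not taken from any real model).
my %vocab = (
    hello => 10, hell => 11, o => 12, ing => 13,
    h => 1, e => 2, l => 3, i => 4, n => 5, g => 6,
);

# Greedy longest-match: at each position take the longest piece that
# exists in the vocabulary, without consulting any BPE merge rules.
sub greedy_encode {
    my ($text, $vocab) = @_;
    my $max_len = max map { length $_ } keys %$vocab;
    my @ids;
    my $pos = 0;
    while ($pos < length($text)) {
        my $remaining = length($text) - $pos;
        my $len = $remaining < $max_len ? $remaining : $max_len;
        while ($len > 0) {
            last if exists $vocab->{ substr($text, $pos, $len) };
            $len--;
        }
        die "no vocabulary entry covers offset $pos\n" if $len == 0;
        push @ids, $vocab->{ substr($text, $pos, $len) };
        $pos += $len;
    }
    return @ids;
}

my @ids = greedy_encode('helloing', \%vocab);
print "@ids\n";    # 10 13
```

A merge-based BPE encoder applies its learned merges in priority order and may
segment the same string differently, which is why a greedy lookup is not
guaranteed to match the tokenization a model was trained with.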

=head1 THREAD SAFETY

Lugh::Tokenizer objects are NOT thread-safe. Each Perl thread must
create its own Tokenizer object (though they can share the same Model
if created separately in each thread).

=head1 SEE ALSO

L<Lugh>, L<Lugh::Model>, L<Lugh::Inference>

L<https://github.com/google/sentencepiece> - SentencePiece tokenizer

L<https://arxiv.org/abs/1508.07909> - BPE paper

=head1 AUTHOR

lnation E<lt>email@lnation.orgE<gt>

=head1 LICENSE

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.

=cut

1;


