=head2 Regular Tokens
Normal subword units:
    "hello"   - Single token
    "▁world"  - Word with space prefix
    "ing"     - Common suffix
=head2 Special Tokens
Control tokens with special meaning:
    <s>    - BOS (beginning of sequence)
    </s>   - EOS (end of sequence)
    <unk>  - Unknown token
    <pad>  - Padding token
=head2 Byte Fallback Tokens
For characters not in vocabulary (LLaMA models):
    <0x00> through <0xFF> - Raw byte tokens
This allows encoding any UTF-8 text, even text containing characters
that are absent from the vocabulary.
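As a minimal sketch of the idea, the snippet below maps a character to the byte-fallback token names described above. The helper C<byte_fallback_tokens> is hypothetical and not part of Lugh; it only illustrates how one character becomes one token per UTF-8 byte, and does not show how Lugh performs the lookup internally.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of Lugh): return the byte-fallback
# token names for a single character, one token per UTF-8 byte.
sub byte_fallback_tokens {
    my ($char) = @_;
    my $bytes = encode('UTF-8', $char);    # character -> UTF-8 octets
    return map { sprintf '<0x%02X>', $_ } unpack 'C*', $bytes;
}

# EURO SIGN (U+20AC) is three bytes in UTF-8: E2 82 AC
my @tokens = byte_fallback_tokens("\x{20AC}");
print join(' ', @tokens), "\n";    # prints: <0xE2> <0x82> <0xAC>
```

A character outside the vocabulary therefore costs one token per byte, so text in scripts with little vocabulary coverage tends to produce longer token sequences.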
=head1 COMMON PATTERNS
=head2 Basic Tokenization
    my $model = Lugh::Model->new(model => $path);
    my $tokenizer = Lugh::Tokenizer->new(model => $model);
    my @tokens = $tokenizer->encode("Hello, world!");
    print "Tokens: @tokens\n";
    my $decoded = $tokenizer->decode(@tokens);
    print "Decoded: $decoded\n";
=head2 Token Inspection
    # See what each token represents
    my @tokens = $tokenizer->encode("The quick brown fox");
    for my $id (@tokens) {
        my $text = $tokenizer->decode([$id]);
        printf "Token %5d: '%s'\n", $id, $text;
    }
=head2 Chat Template
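    # Build a chat prompt (LLaMA 2 format). Newlines are written as
    # explicit escapes so that indentation cannot leak into the string.
    my $prompt = "<s>[INST] <<SYS>>\n"
               . "You are a helpful assistant.\n"
               . "<</SYS>>\n\n"
               . "What is the capital of France? [/INST]";
    # The prompt already starts with <s>, so do not add another BOS
    my @tokens = $tokenizer->encode($prompt, add_bos => 0);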
=head2 Streaming Decode
    # Decode one token at a time (for streaming output)
    use IO::Handle;    # provides STDOUT->flush on older Perls
    for my $token (@generated_tokens) {
        my $text = $tokenizer->decode([$token]);
        print $text;
        STDOUT->flush();
    }
=head1 LIMITATIONS
=over 4
=item * B<Greedy Algorithm> - May not produce optimal BPE tokenization
=item * B<No Merge Rules> - Does not use BPE merge rules, just vocabulary lookup
=item * B<UTF-8 Only> - Input text must be valid UTF-8
=item * B<No Normalization> - Does not perform Unicode normalization
=back
For most LLM inference use cases, these limitations do not significantly
impact results.
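To make the first two limitations concrete, here is a minimal sketch of greedy longest-match tokenization by vocabulary lookup. The function C<greedy_tokenize> and the toy vocabulary are invented for illustration; this is an assumption about the general technique, not a copy of Lugh's implementation.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Toy vocabulary, invented for this example only.
my %vocab = map { $_ => 1 } qw(a b c d e ab abc cde);
my $max_len = max map { length } keys %vocab;

# Hypothetical helper: at each position take the longest vocabulary
# entry that matches, falling back to a single character.
sub greedy_tokenize {
    my ($text, $vocab, $max_len) = @_;
    my @pieces;
    my $pos = 0;
    while ($pos < length $text) {
        my $remaining = length($text) - $pos;
        my $len = $remaining < $max_len ? $remaining : $max_len;
        # Try the longest candidate first, then shrink it.
        $len-- while $len > 1 && !exists $vocab->{ substr($text, $pos, $len) };
        # A length-1 piece is kept even if it is not in the vocabulary;
        # a real tokenizer would use <unk> or byte fallback here.
        push @pieces, substr($text, $pos, $len);
        $pos += $len;
    }
    return @pieces;
}

print join(' | ', greedy_tokenize('abcde', \%vocab, $max_len)), "\n";
# prints: abc | d | e
```

With this vocabulary the greedy pass yields three pieces, although C<ab> followed by C<cde> would cover the same text in two. Differences of this kind are what the Greedy Algorithm item refers to.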
=head1 THREAD SAFETY
Lugh::Tokenizer objects are NOT thread-safe. Each Perl thread must
create its own Tokenizer object. Threads may use the same model, as
long as each thread creates its Model object itself.
=head1 SEE ALSO
L<Lugh>, L<Lugh::Model>, L<Lugh::Inference>
L<https://github.com/google/sentencepiece> - SentencePiece tokenizer
L<https://arxiv.org/abs/1508.07909> - BPE paper
=head1 AUTHOR
lnation E<lt>email@lnation.orgE<gt>
=head1 LICENSE
This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.
=cut
1;