Lugh
lib/Lugh/Inference.pm view on Meta::CPAN
=item * C<max_tokens> - Maximum tokens to generate (default: 128)
=item * C<temperature> - Sampling temperature (default: 0.8)
=item * C<top_p> - Top-p (nucleus) sampling threshold (default: 0.95)
=item * C<top_k> - Top-k sampling limit (default: 40). Values below 1000 select top-k sampling; otherwise top-p sampling is used
=item * C<greedy> - If true, use greedy decoding (argmax) (default: 0)
=item * C<eos_token> - Token ID to stop generation (default: from model, typically 2)
=item * C<callback> - Optional subroutine called for each generated token
=back
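The sampling options above can be illustrated with a small pure-Perl sketch. This is not Lugh's internal sampler: C<softmax_with_temperature> and C<top_p_filter> are hypothetical helpers written only for illustration, and the logits are made-up values.

```perl
use strict;
use warnings;
use List::Util qw(max sum);

# Hypothetical helper: turn raw logits into probabilities.
# A lower temperature sharpens the distribution; a higher one flattens it.
sub softmax_with_temperature {
    my ($logits, $temperature) = @_;
    my $max = max(@$logits);    # subtract the max for numerical stability
    my @exp = map { exp(($_ - $max) / $temperature) } @$logits;
    my $sum = sum(@exp);
    return map { $_ / $sum } @exp;
}

# Hypothetical helper: keep the smallest set of tokens whose cumulative
# probability reaches $top_p, then renormalise. Returns token_id => prob.
sub top_p_filter {
    my ($probs, $top_p) = @_;
    my @order = sort { $probs->[$b] <=> $probs->[$a] } 0 .. $#$probs;
    my %kept;
    my $cum = 0;
    for my $id (@order) {
        $kept{$id} = $probs->[$id];
        $cum += $probs->[$id];
        last if $cum >= $top_p;
    }
    $_ /= $cum for values %kept;    # renormalise the surviving tokens
    return %kept;
}

my @probs   = softmax_with_temperature([2.0, 1.0, 0.1, -1.0], 0.8);
my %nucleus = top_p_filter(\@probs, 0.95);
printf "token %d: %.3f\n", $_, $nucleus{$_} for sort { $a <=> $b } keys %nucleus;
```

Greedy decoding corresponds to skipping both helpers and taking the index of the largest logit.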
B<Returns:> A list of generated token IDs (not including the prompt).
B<Callback:>
The callback receives (token_id, count) and should return true to stop generation:
    callback => sub {
        my ($token, $count) = @_;
        print $tokenizer->decode([$token]);
        return 0;  # Continue (return 1 to stop)
    }
B<Stopping Conditions:>
Generation stops when:
=over 4
=item * max_tokens is reached
=item * EOS token is generated
=item * Callback returns true
=back
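The three stopping conditions can be sketched as a plain loop. C<run_loop> and its C<next_token> argument are hypothetical stand-ins for "run the model and sample one token"; they are not part of Lugh. The sketch assumes the EOS token is left out of the returned list, which this section does not state explicitly.

```perl
use strict;
use warnings;

# Hypothetical driver showing where each stopping condition applies.
sub run_loop {
    my (%opt) = @_;
    my @generated;
    while (@generated < $opt{max_tokens}) {          # 1. max_tokens reached
        my $token = $opt{next_token}->();
        last if $token == $opt{eos_token};           # 2. EOS token generated
        push @generated, $token;
        if ($opt{callback}) {                        # 3. callback returned true
            last if $opt{callback}->($token, scalar @generated);
        }
    }
    return @generated;
}

my @stream = (5, 9, 12, 2, 7);    # made-up token IDs; 2 plays the EOS role
my @out = run_loop(
    max_tokens => 10,
    eos_token  => 2,
    next_token => sub { shift @stream },
);
print "@out\n";    # prints "5 9 12"
```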
B<Example:>
    use Lugh;

    my $model     = Lugh::Model->new(model => 'model.gguf');
    my $tokenizer = Lugh::Tokenizer->new(model => $model);
    my $inference = Lugh::Inference->new(model => $model);

    my @prompt = $tokenizer->encode("Once upon a time");

    # Greedy generation
    my @tokens = $inference->generate(\@prompt,
        max_tokens => 50,
        greedy     => 1,
    );
    print $tokenizer->decode(\@tokens);

    # Creative generation with streaming
    @tokens = $inference->generate(\@prompt,
        max_tokens  => 100,
        temperature => 1.0,
        top_p       => 0.95,
        callback    => sub {
            my ($tok, $n) = @_;
            print $tokenizer->decode([$tok]);
            STDOUT->flush();
            return 0;
        },
    );
=head1 ATTENTION MECHANISM
=head2 Scaled Dot-Product Attention
    Attention(Q, K, V) = softmax(QK^T / √d_k) × V
Where:
=over 4
=item * Q - Query vectors [head_dim, n_tokens, n_heads]
=item * K - Key vectors [head_dim, n_tokens, n_kv_heads]
=item * V - Value vectors [head_dim, n_tokens, n_kv_heads]
=item * d_k - Head dimension (typically 64-128)
=back
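For intuition, the formula can be written out for a single head in pure Perl. This is only a sketch: it stores one row per token rather than the ggml tensor layout listed above, applies no causal mask, and C<dot>, C<softmax> and C<attention> are hypothetical helpers, not Lugh functions.

```perl
use strict;
use warnings;
use List::Util qw(max sum);

sub dot {
    my ($u, $v) = @_;
    return sum(map { $u->[$_] * $v->[$_] } 0 .. $#$u);
}

sub softmax {
    my @x   = @_;
    my $max = max(@x);                       # stabilise the exponentials
    my @e   = map { exp($_ - $max) } @x;
    my $s   = sum(@e);
    return map { $_ / $s } @e;
}

# Single-head attention; $q, $k, $v are array refs shaped [n_tokens][d_k].
sub attention {
    my ($q, $k, $v) = @_;
    my $d_k = scalar(@{ $q->[0] });
    my @out;
    for my $qi (@$q) {
        my @scores  = map { dot($qi, $_) / sqrt($d_k) } @$k;   # QK^T / sqrt(d_k)
        my @weights = softmax(@scores);
        my @row     = (0) x scalar(@{ $v->[0] });
        for my $j (0 .. $#$v) {                                # weighted sum of V rows
            $row[$_] += $weights[$j] * $v->[$j][$_] for 0 .. $#row;
        }
        push @out, \@row;
    }
    return \@out;
}

my $out = attention([[0, 0]], [[1, 0], [0, 1]], [[2, 4], [6, 8]]);
print "@{ $out->[0] }\n";    # a zero query weights both keys equally: prints "4 6"
```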
=head2 Grouped Query Attention (GQA)
GQA uses fewer KV heads than query heads to reduce memory:
    Model          n_head   n_kv_head   Ratio
    LLaMA 7B       32       32          1:1 (MHA)
    LLaMA 2 70B    64       8           8:1 (GQA)
    TinyLlama      32       4           8:1 (GQA)
    Mistral 7B     32       8           4:1 (GQA)
The implementation broadcasts KV heads to match query heads using
ggml's native broadcasting.
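The head sharing can be pictured as a simple index mapping. The sketch below assumes that consecutive query heads share one KV head; the exact order ggml uses when broadcasting is not described in this section, and C<kv_head_for> is a hypothetical helper rather than a Lugh function.

```perl
use strict;
use warnings;

# Hypothetical helper: which KV head serves a given query head,
# assuming consecutive query heads form one group.
sub kv_head_for {
    my ($q_head, $n_head, $n_kv_head) = @_;
    die "n_head must be a multiple of n_kv_head" if $n_head % $n_kv_head;
    my $group = $n_head / $n_kv_head;    # query heads per KV head
    return int($q_head / $group);
}

# Mistral 7B row from the table: 32 query heads, 8 KV heads.
for my $h (0, 3, 4, 31) {
    printf "query head %2d -> kv head %d\n", $h, kv_head_for($h, 32, 8);
}
```

With a 4:1 ratio, the K and V tensors hold a quarter of the heads that full multi-head attention would need, which is where the memory saving comes from.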
=head2 Causal Masking
The attention uses causal (autoregressive) masking so each position
can only attend to itself and previous positions:
    Position:  0  1  2  3
        0      ✓  ✗  ✗  ✗
        1      ✓  ✓  ✗  ✗
        2      ✓  ✓  ✓  ✗
        3      ✓  ✓  ✓  ✓
This is implemented using C<ggml_diag_mask_inf> which sets the upper
triangle to -infinity before softmax.
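The effect of the mask can be reproduced in a few lines of pure Perl. This sketch does not call C<ggml_diag_mask_inf>; C<causal_mask> is a hypothetical helper that builds the equivalent additive mask, which would be added to the attention scores before softmax.

```perl
use strict;
use warnings;

# Build an n x n additive mask: 0 where attention is allowed,
# -inf above the diagonal (future positions).
sub causal_mask {
    my ($n) = @_;
    my $neg_inf = -9**9**9;    # overflows to -Inf
    return [
        map {
            my $i = $_;
            [ map { $_ > $i ? $neg_inf : 0 } 0 .. $n - 1 ]
        } 0 .. $n - 1
    ];
}

my $mask = causal_mask(4);
for my $row (@$mask) {
    print join(' ', map { $_ == 0 ? 'ok' : '--' } @$row), "\n";
}
```

Because exp(-inf) is 0, masked positions receive zero weight after softmax.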
=head2 RoPE (Rotary Position Embeddings)
    RoPE(x, pos) = x × cos(pos × θ) + rotate(x) × sin(pos × θ)
Where θ depends on the dimension and base frequency (typically 10000).
Parameters are read from model metadata:
=over 4
=item * C<llama.rope.dimension_count> - Dimensions to rotate
=item * C<llama.rope.freq_base> - Base frequency
=item * C<llama.context_length> - Original context length
=back
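A minimal pure-Perl sketch of the rotation, under two assumptions that this section does not confirm for Lugh: adjacent dimensions are paired (some implementations instead pair dimension i with i + n_dims/2), and every dimension of the vector is rotated. C<rope> is a hypothetical helper.

```perl
use strict;
use warnings;

# Rotate each pair (x[i], x[i+1]), i = 0, 2, 4, ..., by the angle pos * theta,
# where theta = freq_base ** (-i / n_dims).
sub rope {
    my ($x, $pos, $freq_base) = @_;
    my $n_dims = scalar @$x;
    my @out    = @$x;
    for (my $i = 0; $i + 1 < $n_dims; $i += 2) {
        my $theta = $freq_base ** (-$i / $n_dims);
        my ($c, $s) = (cos($pos * $theta), sin($pos * $theta));
        $out[$i]     = $x->[$i] * $c - $x->[$i + 1] * $s;
        $out[$i + 1] = $x->[$i] * $s + $x->[$i + 1] * $c;
    }
    return @out;
}

my @rotated = rope([1, 0, 1, 0], 3, 10000);
printf "%.4f ", $_ for @rotated;
print "\n";
```

Each pair is rotated without changing its length, so the vector norm is preserved and position 0 leaves the input unchanged.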
=head1 FEED-FORWARD NETWORK
The FFN uses SwiGLU activation:
    FFN(x) = down(SiLU(gate(x)) × up(x))
Where:
=over 4
=item * gate, up - Linear projections to intermediate dimension
=item * SiLU - Sigmoid Linear Unit: x × sigmoid(x)
=item * down - Linear projection back to model dimension
=back
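The gated feed-forward step can be sketched with plain arrays. The sketch assumes the common LLaMA-style convention in which SiLU is applied to the gate projection and the result multiplies the up projection element-wise. C<silu>, C<matvec> and C<swiglu_ffn> are hypothetical helpers, and the weight matrices are toy values.

```perl
use strict;
use warnings;

# SiLU: x * sigmoid(x)
sub silu {
    my ($x) = @_;
    return $x / (1 + exp(-$x));
}

# Matrix-vector product: $w is [rows][cols], $x is [cols].
sub matvec {
    my ($w, $x) = @_;
    my @y;
    for my $row (@$w) {
        my $acc = 0;
        $acc += $row->[$_] * $x->[$_] for 0 .. $#$x;
        push @y, $acc;
    }
    return \@y;
}

sub swiglu_ffn {
    my ($x, $w_gate, $w_up, $w_down) = @_;
    my $g = matvec($w_gate, $x);                               # to intermediate dim
    my $u = matvec($w_up,   $x);                               # to intermediate dim
    my @h = map { silu($g->[$_]) * $u->[$_] } 0 .. $#$g;       # element-wise gating
    return matvec($w_down, \@h);                               # back to model dim
}

# Toy shapes: model dim 2, intermediate dim 1.
my $y = swiglu_ffn([1, 2], [[1, 0]], [[0, 1]], [[1], [1]]);
printf "%.4f %.4f\n", @$y;
```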
Typical dimensions:
    Model        n_embd   FFN_dim   Ratio
    TinyLlama    2048     5632      2.75×
    LLaMA 7B     4096     11008     2.69×
    LLaMA 13B    5120     13824     2.70×
=head1 GENERATION LOOP
The C<generate()> method handles the complete generation loop internally.
For simple use cases:
    use Lugh;

    my $model     = Lugh::Model->new(model => 'model.gguf');
    my $tokenizer = Lugh::Tokenizer->new(model => $model);
    my $inference = Lugh::Inference->new(model => $model);

    my @prompt = $tokenizer->encode("Once upon a time");
    my @generated = $inference->generate(\@prompt,
        max_tokens  => 100,
        temperature => 0.8,
        top_p       => 0.95,
    );
    print $tokenizer->decode(\@generated);
For streaming output:
    my @generated = $inference->generate(\@prompt,
        max_tokens  => 100,
        temperature => 0.8,
        callback    => sub {
            my ($token, $count) = @_;
            print $tokenizer->decode([$token]);
            STDOUT->flush();
            return 0;  # Continue
        },
    );
For manual control (building your own loop):
    my $prompt     = "Once upon a time";
    my $max_tokens = 100;

    my @tokens = $tokenizer->encode($prompt);
    my @generated;
    for (1..$max_tokens) {
        # Run the model over everything generated so far
        my @logits = $inference->forward(tokens => \@tokens);
        my $next = $inference->sample_top_p(\@logits,
            temperature => 0.8,
            top_p       => 0.9,
        );
        last if $next == $tokenizer->eos_id;  # Stop at end-of-sequence
        push @tokens, $next;
        push @generated, $next;
        print $tokenizer->decode([$next]);
        STDOUT->flush();
    }
=head1 PERFORMANCE
=head2 Computation
A single forward pass performs approximately:
    FLOPs ≈ 2 × n_params × n_tokens
For TinyLlama (1.1B params) with 6 tokens:
    2 × 1.1e9 × 6 ≈ 13 GFLOPs
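The estimate is easy to reproduce. The factor of 2 counts one multiply and one add per parameter per token. C<forward_flops> is a hypothetical helper used only for this arithmetic.

```perl
use strict;
use warnings;

# Rough forward-pass cost: one multiply-add (2 FLOPs) per parameter per token.
sub forward_flops {
    my ($n_params, $n_tokens) = @_;
    return 2 * $n_params * $n_tokens;
}

printf "%.1f GFLOPs\n", forward_flops(1.1e9, 6) / 1e9;    # prints "13.2 GFLOPs"
```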
=head2 Memory
During inference, memory is needed for:
=over 4
=item * Model weights (quantized) - Depends on model size and quantization
=item * Activations - O(n_tokens × n_embd × n_layers)
=item * Attention scores - O(n_tokens² × n_heads × n_layers)
=back
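To see how the two token-dependent terms grow, the sketch below counts elements (not bytes) for a TinyLlama-like shape. The helpers are hypothetical, C<n_layers = 22> is an assumption not taken from this document, and the counts are upper bounds because a real compute graph may reuse buffers between layers.

```perl
use strict;
use warnings;

# Element counts (not bytes) for the two token-dependent terms.
sub activation_elems {
    my (%m) = @_;
    return $m{n_tokens} * $m{n_embd} * $m{n_layers};
}

sub attention_elems {
    my (%m) = @_;
    return $m{n_tokens} ** 2 * $m{n_heads} * $m{n_layers};
}

# TinyLlama-like shape; n_layers = 22 is an assumption.
my %shape = (n_embd => 2048, n_heads => 32, n_layers => 22);
for my $n (128, 1024) {
    printf "n_tokens=%4d  activations=%d  attention=%d\n", $n,
        activation_elems(%shape, n_tokens => $n),
        attention_elems(%shape, n_tokens => $n);
}
```

Doubling the number of tokens doubles the activation term but quadruples the attention-score term, which is why long contexts are dominated by attention.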
=head2 Optimizations