lib/Lugh/Inference.pm view on Meta::CPAN


=item * C<max_tokens> - Maximum tokens to generate (default: 128)

=item * C<temperature> - Sampling temperature (default: 0.8)

=item * C<top_p> - Top-p (nucleus) sampling threshold (default: 0.95)

=item * C<top_k> - Top-k sampling limit (default: 40). If C<top_k> is below 1000, top-k sampling is used; otherwise top-p sampling is used

=item * C<greedy> - If true, use greedy decoding (argmax) (default: 0)

=item * C<eos_token> - Token ID to stop generation (default: from model, typically 2)

=item * C<callback> - Optional subroutine called for each generated token

=back
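
As a rough illustration of how C<temperature>, C<top_k> and C<greedy> act on the model's logits, here is a pure-Perl sketch that works on a plain array. It is not Lugh's implementation, and the helper names C<softmax_t>, C<argmax> and C<top_k_ids> are hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(max sum);

# Hypothetical helpers for illustration only; not part of Lugh.

# Softmax over logits divided by a temperature (lower = sharper).
sub softmax_t {
    my ($logits, $temperature) = @_;
    my @scaled = map { $_ / $temperature } @$logits;
    my $m      = max(@scaled);                 # subtract max for stability
    my @exp    = map { exp($_ - $m) } @scaled;
    my $z      = sum(@exp);
    return map { $_ / $z } @exp;
}

# Greedy decoding: index of the largest logit.
sub argmax {
    my ($logits) = @_;
    my $best = 0;
    for my $i (1 .. $#$logits) {
        $best = $i if $logits->[$i] > $logits->[$best];
    }
    return $best;
}

# Token IDs of the k largest logits, best first.
sub top_k_ids {
    my ($logits, $k) = @_;
    my @order = sort { $logits->[$b] <=> $logits->[$a] } 0 .. $#$logits;
    $k = @order if $k > @order;
    return @order[0 .. $k - 1];
}

my @logits = (1.0, 3.0, 0.5, 2.0);
my @probs  = softmax_t(\@logits, 0.8);
printf "greedy pick: %d\n", argmax(\@logits);
printf "top-2 ids:   %s\n", join(',', top_k_ids(\@logits, 2));
```

Lowering the temperature concentrates probability on the largest logit, so a very small temperature behaves much like greedy decoding.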

B<Returns:> A list of generated token IDs (not including the prompt).

B<Callback:>

The callback receives C<(token_id, count)>. Return a true value to stop generation, or a false value to continue:

    callback => sub {
        my ($token, $count) = @_;
        print $tokenizer->decode([$token]);
        return 0;  # Continue (return 1 to stop)
    }

B<Stopping Conditions:>

Generation stops when:

=over 4

=item * max_tokens is reached

=item * EOS token is generated

=item * Callback returns true

=back
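
The three stopping conditions can be sketched with a toy loop. C<toy_generate> and its C<$next_token> coderef are hypothetical stand-ins, not Lugh code. The sketch assumes that the EOS token is left out of the returned list and that C<count> starts at 1.

```perl
use strict;
use warnings;

# Toy stand-in for generate(); not Lugh's code. $next_token is a
# hypothetical coderef that returns the next token ID.
sub toy_generate {
    my ($next_token, %opt) = @_;
    my $max_tokens = $opt{max_tokens} // 128;
    my $eos_token  = $opt{eos_token}  // 2;
    my $callback   = $opt{callback};
    my @out;

    while (@out < $max_tokens) {                  # 1. max_tokens reached
        my $tok = $next_token->();
        last if $tok == $eos_token;               # 2. EOS token generated
        push @out, $tok;
        last if $callback                         # 3. callback returned true
            && $callback->($tok, scalar @out);
    }
    return @out;
}

my @stream = (10, 11, 12, 2, 13);
my @got = toy_generate(sub { shift @stream }, max_tokens => 50);
print "@got\n";
```

Here generation ends at the EOS token (ID 2), so token 13 is never requested.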

B<Example:>

    use Lugh;
    
    my $model = Lugh::Model->new(model => 'model.gguf');
    my $tokenizer = Lugh::Tokenizer->new(model => $model);
    my $inference = Lugh::Inference->new(model => $model);
    
    my @prompt = $tokenizer->encode("Once upon a time");
    
    # Greedy generation
    my @tokens = $inference->generate(\@prompt,
        max_tokens => 50,
        greedy     => 1,
    );
    print $tokenizer->decode(\@tokens);
    
    # Creative generation with streaming
    @tokens = $inference->generate(\@prompt,
        max_tokens  => 100,
        temperature => 1.0,
        top_p       => 0.95,
        callback    => sub {
            my ($tok, $n) = @_;
            print $tokenizer->decode([$tok]);
            STDOUT->flush();
            return 0;
        },
    );

=head1 ATTENTION MECHANISM

=head2 Scaled Dot-Product Attention

    Attention(Q, K, V) = softmax(QK^T / √d_k) × V

Where:

=over 4

=item * Q - Query vectors [head_dim, n_tokens, n_heads]

=item * K - Key vectors [head_dim, n_tokens, n_kv_heads]

=item * V - Value vectors [head_dim, n_tokens, n_kv_heads]

=item * d_k - Head dimension (typically 64-128)

=back
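
For intuition, the formula can be written out for a single head on plain Perl arrays. This is only a sketch: Lugh builds the computation as a ggml graph, the helper names are hypothetical, and no causal mask is applied here.

```perl
use strict;
use warnings;
use List::Util qw(max sum);

# Dot product of two equal-length vectors.
sub dot {
    my ($x, $y) = @_;
    return sum(map { $x->[$_] * $y->[$_] } 0 .. $#$x);
}

# Numerically stable softmax over a list.
sub softmax {
    my @v = @_;
    my $m = max(@v);
    my @e = map { exp($_ - $m) } @v;
    my $z = sum(@e);
    return map { $_ / $z } @e;
}

# Single-head attention; $Q, $K, $V are arrays of rows (one per token).
sub attention {
    my ($Q, $K, $V) = @_;
    my $d_k = scalar @{ $Q->[0] };
    my @out;
    for my $q (@$Q) {
        # Scores of this query against every key, scaled by sqrt(d_k)
        my @w = softmax(map { dot($q, $_) / sqrt($d_k) } @$K);
        # Weighted sum of the value rows
        my @row = (0) x @{ $V->[0] };
        for my $j (0 .. $#$V) {
            $row[$_] += $w[$j] * $V->[$j][$_] for 0 .. $#row;
        }
        push @out, \@row;
    }
    return \@out;
}

my $Q = [[1, 0], [0, 1]];
my $K = [[1, 0], [0, 1]];
my $V = [[1, 2], [3, 4]];
my $out = attention($Q, $K, $V);
printf "%.3f %.3f\n", @$_ for @$out;
```

Each output row is a convex combination of the value rows, weighted towards the key most similar to the query.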

=head2 Grouped Query Attention (GQA)

GQA uses fewer KV heads than query heads to reduce memory:

    Model       n_head  n_kv_head  Ratio
    LLaMA 7B    32      32         1:1 (MHA)
    LLaMA 2 70B 64      8          8:1 (GQA)
    TinyLlama   32      4          8:1 (GQA)
    Mistral 7B  32      8          4:1 (GQA)

The implementation broadcasts KV heads to match query heads using
ggml's native broadcasting.
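
The head sharing can be pictured as a mapping from query head to KV head. The sketch below assumes consecutive query heads share one KV head; C<kv_head_for> is a hypothetical helper, not a Lugh function.

```perl
use strict;
use warnings;

# Which KV head serves a given query head, assuming consecutive query
# heads are grouped onto one KV head (hypothetical helper).
sub kv_head_for {
    my ($q_head, $n_head, $n_kv_head) = @_;
    die "n_head must be a multiple of n_kv_head\n" if $n_head % $n_kv_head;
    my $group = $n_head / $n_kv_head;    # query heads per KV head
    return int($q_head / $group);
}

# Mistral 7B figures from the table above: 32 query heads, 8 KV heads
my @map = map { kv_head_for($_, 32, 8) } 0 .. 31;
print "@map\n";
```

With a 4:1 ratio, query heads 0 to 3 read KV head 0, heads 4 to 7 read KV head 1, and so on.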

=head2 Causal Masking

The attention uses causal (autoregressive) masking so each position
can only attend to itself and previous positions:

    Position:  0  1  2  3
    0          ✓  ✗  ✗  ✗
    1          ✓  ✓  ✗  ✗
    2          ✓  ✓  ✓  ✗
    3          ✓  ✓  ✓  ✓

This is implemented using C<ggml_diag_mask_inf> which sets the upper
triangle to -infinity before softmax.
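
The effect of the mask can be reproduced on a small score matrix in pure Perl. This sketch covers only the case with no previously cached tokens, and C<causal_softmax> is a hypothetical name.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Mask the upper triangle with -inf, then apply a row-wise softmax.
sub causal_softmax {
    my ($scores) = @_;
    my $neg_inf = -9**9**9;
    my @probs;
    for my $i (0 .. $#$scores) {
        my @row = @{ $scores->[$i] };
        $row[$_] = $neg_inf for $i + 1 .. $#row;   # hide future positions
        my @e = map { exp($_) } @row;               # exp(-inf) is 0
        my $z = sum(@e);
        push @probs, [ map { $_ / $z } @e ];
    }
    return \@probs;
}

my $scores = [ map { [ (0) x 4 ] } 1 .. 4 ];   # uniform scores, 4 tokens
my $p = causal_softmax($scores);
printf "%s\n", join ' ', map { sprintf '%.2f', $_ } @$_ for @$p;
```

Masked entries receive exactly zero weight, so each row spreads its probability only over itself and earlier positions.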

=head2 RoPE (Rotary Position Embeddings)


    RoPE(x, pos) = x × cos(pos × θ) + rotate(x) × sin(pos × θ)

Where θ depends on the dimension and base frequency (typically 10000).

Parameters are read from model metadata:

=over 4

=item * C<llama.rope.dimension_count> - Dimensions to rotate

=item * C<llama.rope.freq_base> - Base frequency

=item * C<llama.context_length> - Original context length

=back
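
A minimal pure-Perl sketch of the rotation follows. It assumes adjacent elements are paired and that the angle for pair C<i> is C<base ** (-2i / d)>; some implementations pair element C<i> with element C<i + d/2> instead. The function name C<rope> is hypothetical.

```perl
use strict;
use warnings;

# Rotate each pair (x[2i], x[2i+1]) by the angle pos * theta_i, where
# theta_i = base ** (-2i / d). The pairing layout is an assumption.
sub rope {
    my ($x, $pos, $base) = @_;
    $base //= 10000;
    my $d = @$x;
    my @out;
    for my $i (0 .. $d / 2 - 1) {
        my $theta = $base ** (-2 * $i / $d);
        my ($c, $s)   = (cos($pos * $theta), sin($pos * $theta));
        my ($x0, $x1) = @$x[2 * $i, 2 * $i + 1];
        # x * cos + rotate(x) * sin, with rotate((x0, x1)) = (-x1, x0)
        push @out, $x0 * $c - $x1 * $s, $x1 * $c + $x0 * $s;
    }
    return @out;
}

my @v  = (1, 2, 3, 4);
my @r5 = rope(\@v, 5);
printf "%.4f %.4f %.4f %.4f\n", @r5;
```

Because each pair is only rotated, the vector's length is unchanged, and position 0 returns the input as is.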

=head1 FEED-FORWARD NETWORK

The FFN uses SwiGLU activation:

    FFN(x) = down(SiLU(gate(x)) × up(x))

Where:

=over 4

=item * gate, up - Linear projections to intermediate dimension

=item * SiLU - Sigmoid Linear Unit: x × sigmoid(x)

=item * down - Linear projection back to model dimension

=back
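
The block can be sketched with tiny matrices in pure Perl. The sketch assumes the usual LLaMA ordering, in which SiLU is applied to the gate projection and the result scales the up projection. The helpers C<silu>, C<matvec> and C<ffn> are hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# SiLU: v * sigmoid(v)
sub silu {
    my ($v) = @_;
    return $v / (1 + exp(-$v));
}

# Multiply a matrix (array of rows) by a vector.
sub matvec {
    my ($W, $x) = @_;
    return map { my $row = $_; sum(map { $row->[$_] * $x->[$_] } 0 .. $#$x) } @$W;
}

# SwiGLU feed-forward, LLaMA ordering (assumed): SiLU on the gate
# projection, elementwise product with the up projection, then down.
sub ffn {
    my ($x, $W_gate, $W_up, $W_down) = @_;
    my @gate = matvec($W_gate, $x);
    my @up   = matvec($W_up,   $x);
    my @h    = map { silu($gate[$_]) * $up[$_] } 0 .. $#gate;
    return matvec($W_down, \@h);
}

# Toy sizes: n_embd = 2, intermediate dimension = 3
my @y = ffn([1, 2],
    [[1, 0], [0, 1], [1, 1]],      # gate: 3 x 2
    [[1, 1], [1, -1], [0, 1]],     # up:   3 x 2
    [[1, 0, 0], [0, 1, 1]],        # down: 2 x 3
);
printf "%.4f %.4f\n", @y;
```

The output has the same length as the input, which lets the result be added back to the residual stream.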

Typical dimensions:

    Model       n_embd  FFN_dim   Ratio
    TinyLlama   2048    5632      2.75×
    LLaMA 7B    4096    11008     2.69×
    LLaMA 13B   5120    13824     2.70×

=head1 GENERATION LOOP

The C<generate()> method handles the complete generation loop internally.
For simple use cases:

    use Lugh;
    
    my $model = Lugh::Model->new(model => 'model.gguf');
    my $tokenizer = Lugh::Tokenizer->new(model => $model);
    my $inference = Lugh::Inference->new(model => $model);
    
    my @prompt = $tokenizer->encode("Once upon a time");
    my @generated = $inference->generate(\@prompt,
        max_tokens  => 100,
        temperature => 0.8,
        top_p       => 0.95,
    );
    print $tokenizer->decode(\@generated);

For streaming output:

    my @generated = $inference->generate(\@prompt,
        max_tokens  => 100,
        temperature => 0.8,
        callback    => sub {
            my ($token, $count) = @_;
            print $tokenizer->decode([$token]);
            STDOUT->flush();
            return 0;  # Continue
        },
    );

For manual control (building your own loop):

    my $prompt     = "Once upon a time";
    my $max_tokens = 100;
    my @tokens     = $tokenizer->encode($prompt);
    my @generated;
    
    for (1..$max_tokens) {
        my @logits = $inference->forward(tokens => \@tokens);
        my $next = $inference->sample_top_p(\@logits,
            temperature => 0.8,
            top_p => 0.9
        );
        
        last if $next == $tokenizer->eos_id;
        
        push @tokens, $next;
        push @generated, $next;
        
        print $tokenizer->decode([$next]);
        STDOUT->flush();
    }
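
To make the sampling step in the manual loop less opaque, here is a pure-Perl sketch of nucleus (top-p) sampling over raw logits. It is a stand-in for illustration, not the code behind C<sample_top_p>. The name C<toy_sample_top_p> and its C<rand> option (an injectable random source) are hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(max sum);

# Nucleus (top-p) sampling over raw logits; illustrative only.
# $opt{rand} is an optional coderef returning a number in [0, 1).
sub toy_sample_top_p {
    my ($logits, %opt) = @_;
    my $t    = $opt{temperature} // 0.8;
    my $p    = $opt{top_p}       // 0.95;
    my $rand = $opt{rand}        // sub { rand() };

    # Temperature-scaled softmax
    my $m    = max(@$logits);
    my @w    = map { exp(($_ - $m) / $t) } @$logits;
    my $z    = sum(@w);
    my @prob = map { $_ / $z } @w;

    # Keep the most probable tokens until their mass reaches top_p
    my @order = sort { $prob[$b] <=> $prob[$a] } 0 .. $#prob;
    my @keep;
    my $mass = 0;
    for my $id (@order) {
        push @keep, $id;
        $mass += $prob[$id];
        last if $mass >= $p;
    }

    # Draw from the kept tokens, renormalised by $mass
    my $r = $rand->() * $mass;
    for my $id (@keep) {
        $r -= $prob[$id];
        return $id if $r < 0;
    }
    return $keep[-1];    # guard against rounding
}

my @logits = (2.0, 1.0, 0.1, -1.0);
print toy_sample_top_p(\@logits, temperature => 0.8, top_p => 0.9), "\n";
```

Passing a fixed C<rand> coderef makes the draw deterministic, which is convenient when testing a custom loop.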

=head1 PERFORMANCE

=head2 Computation

A single forward pass performs approximately:

    FLOPs ≈ 2 × n_params × n_tokens

For TinyLlama (1.1B params) with 6 tokens:

    2 × 1.1e9 × 6 ≈ 13 GFLOPs
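
The same rule of thumb as a small helper, for plugging in other model sizes. C<forward_flops> is a hypothetical name, and the result is only an estimate.

```perl
use strict;
use warnings;

# Rule-of-thumb cost of one forward pass: 2 FLOPs per parameter per token.
sub forward_flops {
    my ($n_params, $n_tokens) = @_;
    return 2 * $n_params * $n_tokens;
}

printf "%.1f GFLOPs\n", forward_flops(1.1e9, 6) / 1e9;   # prints "13.2 GFLOPs"
```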

=head2 Memory

During inference, memory is needed for:

=over 4

=item * Model weights (quantized) - Depends on model size and quantization

=item * Activations - O(n_tokens × n_embd × n_layers)

=item * Attention scores - O(n_tokens² × n_heads × n_layers)

=back
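
The two growth terms can be turned into rough element counts. The helpers below are hypothetical. The C<n_embd> and C<n_head> values for TinyLlama come from the tables in this document, while C<n_layers = 22> and the 512-token prompt are assumptions chosen for illustration. Actual usage depends on how buffers are allocated and reused.

```perl
use strict;
use warnings;

# Element counts implied by the O(...) terms; multiply by the bytes per
# element (for example 4 for F32) to get a rough size.
sub activation_elems {
    my ($n_tokens, $n_embd, $n_layers) = @_;
    return $n_tokens * $n_embd * $n_layers;
}

sub score_elems {
    my ($n_tokens, $n_heads, $n_layers) = @_;
    return $n_tokens ** 2 * $n_heads * $n_layers;
}

# n_layers = 22 and n_tokens = 512 are assumptions for this example
my ($n_tokens, $n_embd, $n_heads, $n_layers) = (512, 2048, 32, 22);
printf "activations: %d elements\n", activation_elems($n_tokens, $n_embd, $n_layers);
printf "scores:      %d elements\n", score_elems($n_tokens, $n_heads, $n_layers);
```

The attention-score term grows with the square of the token count, so it dominates for long prompts.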

=head2 Optimizations
