LLM Parameters

This is an introduction to LLM parameters.

Inference Parameters

  1. Temperature (Default: 1, Range: 0-2)

    A higher temperature enhances the LLM’s creativity, while a lower temperature keeps its output more consistent.

    For example, when the temperature is set to 1.0, the LLM’s responses tend to be more adventurous; at 0.1, they are more conservative. (See the sketches after this list for how these sampling knobs are applied in code.)
  2. Max Generation Length (Default: model-dependent, Range: 1 to the model’s maximum)

    Setting the maximum generation length caps how many tokens the LLM can generate in its output.

  3. Top-K (Default: 0, Range: 1 to vocabulary size; common values 40-50)

    Sampling is restricted to the top k highest-probability words. For instance, with k set to 50, the selection is made from among the 50 most likely words.

  4. Top-p (Default: 1, Range: 0-1; common values 0.9, 0.95)

    Sampling is conducted from the pool of words until their cumulative probability reaches p. For example, with p set to 0.9, the selection is made from among the words whose combined probabilities account for 90% of the total.

  5. Repetition Penalty (Default: 1, Range: 1.0-2.0)

    To reduce the likelihood of generating repetitive words, a higher repetition penalty can be applied, which cuts down on redundant content in the text.

  6. Beam Search (Default: 1, Range: 1-10)

    Beam Search is a heuristic search algorithm that maintains multiple candidate sequences during generation, ultimately selecting the most optimal text output. Key parameters for this algorithm include the Beam Width, which determines the number of candidates retained at each step.

  7. Length Penalty (Default: 1.0, Range: usually 0.0-2.0)

    Length Penalty is a setting used in text generation to help the model create text that’s not too short or too long. It works by adjusting the score of a generated text based on its length. Example: Imagine you’re asking a computer to write a summary of a book. Without Length Penalty, the summary might be too short and miss important points, or it might be too long and include unnecessary details. By using Length Penalty, you can tell the computer to aim for a summary that’s just the right length, so it includes all the key information without extra fluff.

  8. frequency_penalty (Default: 0, Range: -2.0 to 2.0)

    Description: Penalizes words based on their frequency in the generated text, reducing repetition. Example: When set to 1.0, common words are used less often, resulting in more diverse text.

  9. presence_penalty (Default: 0, Range: -2.0 to 2.0)

    Description: Penalizes words that have already appeared in the text, encouraging the use of new words. Example: When set to 1.5, the model tends to use words that haven’t appeared yet, increasing text diversity.

  10. stop (Default: none, Range: string or list of strings)

    Description: Specifies one or more tokens at which to stop generation. Example: When set to [".", "!", "?"], the model stops generating after producing one of these punctuation marks.

  11. n (Default: 1, Range: positive integer)

    Description: Specifies how many completions to generate. Example: When set to 3, the model will generate 3 different responses.

  12. best_of (Default: 1, Range: positive integer, ≥ n)

    Description: Generates multiple candidate results and returns the best n. Example: With n=2 and best_of=5, the model generates 5 candidate results and returns the 2 best ones.

  13. logprobs (Default: null, Range: non-negative integer, usually 0-5)

    Description: Returns the most likely tokens and their log probabilities. Example: When set to 3, each generated token will be accompanied by the 3 most likely alternative options and their probabilities.

  14. no_repeat_ngram_size (Default: 0, Range: positive integer)

    Description: Prevents repetition of specified length word groups. Example: When set to 3, the model avoids generating any consecutive repetition of three-word combinations.
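
To make these knobs concrete, here is a minimal sketch in plain NumPy of how temperature, top-k, top-p, and the repetition penalty are typically applied to a model's raw logits before the next token is drawn. The helper name and the toy logits are made up for illustration, and real libraries differ in details such as the order in which the filters are applied.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0,
                      repetition_penalty=1.0, generated_ids=()):
    """Illustrative next-token sampling; not any particular library's exact code."""
    logits = np.asarray(logits, dtype=np.float64).copy()

    # Repetition penalty (item 5): dampen tokens that already appeared in the output.
    for tok in set(generated_ids):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty

    # Temperature (item 1): <1 sharpens the distribution, >1 flattens it.
    scaled = logits / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Top-k (item 3): keep only the k most probable tokens.
    if top_k > 0:
        kth_largest = np.sort(probs)[-min(top_k, probs.size)]
        probs = np.where(probs >= kth_largest, probs, 0.0)

    # Top-p / nucleus (item 4): keep the smallest set of tokens whose
    # cumulative probability reaches p.
    if top_p < 1.0:
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        cut = int(np.searchsorted(cumulative, top_p)) + 1  # include the token crossing p
        mask = np.zeros_like(probs)
        mask[order[:cut]] = 1.0
        probs *= mask

    probs /= probs.sum()
    return int(np.random.choice(probs.size, p=probs))

# Toy vocabulary of five tokens: low temperature plus top_k=2 almost always picks token 3.
print(sample_next_token([1.0, 2.0, 0.5, 4.0, 3.0], temperature=0.1, top_k=2))
```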
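
In practice these parameters are usually passed to a library rather than implemented by hand. As a rough sketch (assuming the Hugging Face transformers package, with gpt2 chosen only because it is a small, freely available example model), most of the parameters above map directly onto `generate()`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The weather today is", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    do_sample=True,            # sample instead of greedy decoding
    temperature=0.7,           # item 1
    top_k=50,                  # item 3
    top_p=0.9,                 # item 4
    repetition_penalty=1.2,    # item 5
    no_repeat_ngram_size=3,    # item 14
    max_new_tokens=64,         # item 2: cap on generated tokens
    # For beam search (items 6 and 7) one would instead set, e.g.:
    # do_sample=False, num_beams=4, length_penalty=1.0,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```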
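
Hosted APIs expose a similar, but not identical, set of knobs. The sketch below assumes the official openai Python client and an illustrative model name; frequency_penalty, presence_penalty, stop, and n appear here, while best_of and logprobs are handled differently depending on the endpoint and are omitted.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
resp = client.chat.completions.create(
    model="gpt-4o-mini",       # illustrative; any chat model works
    messages=[{"role": "user", "content": "Name three LLM sampling parameters."}],
    temperature=0.7,
    top_p=0.9,
    frequency_penalty=0.5,     # item 8
    presence_penalty=0.5,      # item 9
    stop=["\n\n"],             # item 10
    n=2,                       # item 11: two alternative completions
    max_tokens=100,
)
for choice in resp.choices:
    print(choice.message.content)
```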

[!CAUTION]

These parameters can be used together, but with some limitations:

  1. Some parameters work against each other, like high temperature with strict top-k/top-p.
  2. Beam search often overrides other sampling methods.
  3. Using many complex parameters at once can slow down generation.
  4. Not all models support every parameter.
  5. Some parameters have specific valid ranges (e.g., temperature is usually 0-2).
  6. Different tasks may need different parameter combinations.
  7. Stop conditions usually take priority over other parameters.

Best practice:

  • Start with defaults
  • Adjust one parameter at a time
  • Test thoroughly
  • Keep notes on what works best
| Parameter Name | Default Value | Range | Description |
|---|---|---|---|
| Temperature | 1 | 0-2 | Controls randomness and creativity of output. Higher values increase creativity, lower values increase consistency. |
| Max Generation Length | Model-dependent | 1 to model’s maximum | Caps the number of tokens the LLM can generate in its output. |
| Top-K | 0 | 1 to vocabulary size | Restricts sampling to the K highest-probability words. Common range 40-50. |
| Top-p | 1 | 0-1 | Samples from words until their cumulative probability reaches p. Common values 0.9, 0.95. |
| Repetition Penalty | 1 | 1.0-2.0 | Reduces the likelihood of generating repetitive words, decreasing redundant content. |
| Beam Search | 1 | 1-10 | Maintains multiple candidate sequences and selects the optimal output. The key parameter is the beam width. |
| Length Penalty | 1.0 | Usually 0.0-2.0 | Adjusts the score of generated text based on its length, controlling output length. |
| frequency_penalty | 0 | -2.0 to 2.0 | Penalizes tokens based on their frequency in the generated text. |
| presence_penalty | 0 | -2.0 to 2.0 | Penalizes tokens that have already appeared in the text. |
| stop | None | String or list of strings | Specifies tokens at which to stop generation. |
| n | 1 | Positive integer | Number of completions to generate. |
| best_of | 1 | Positive integer, ≥ n | Generates multiple candidates and returns the best n. |
| logprobs | null | Non-negative integer (usually 0-5) | Returns the most likely tokens and their log probabilities. |
| no_repeat_ngram_size | 0 | Positive integer | Prevents repetition of n-grams of the specified length. |

Training Parameters

  1. Learning Rate: Explanation: Controls the step size at each iteration while moving toward a minimum of the loss function. Example: 0.0001 (1e-4). (A sketch showing how several of these settings fit into a training loop follows this list.)
  2. Batch Size: Explanation: The number of training examples used in one iteration. Example: 32, 64, 128
  3. Optimizer: Explanation: Algorithm used to update the model’s weights. Example: Adam, SGD, RMSprop
  4. Epochs: Explanation: The number of complete passes through the entire training dataset. Example: 10, 50, 100
  5. Weight Initialization: Explanation: Method used to set the initial random weights of the neural network. Example: Xavier initialization, He initialization
  6. Regularization: Explanation: Techniques to prevent overfitting. Example: L2 regularization (weight decay = 0.01), Dropout (rate = 0.1)
  7. Learning Rate Scheduler: Explanation: Strategy to adjust the learning rate during training. Example: StepLR (step_size=30, gamma=0.1), CosineAnnealingLR
  8. Model Architecture: Explanation: The structure and size of the neural network. Example: Transformer with 12 layers, 768 hidden size, 12 attention heads
  9. Sequence Length: Explanation: The maximum length of input sequences. Example: 512, 1024, 2048 tokens
  10. Warmup Steps: Explanation: Number of steps to gradually increase the learning rate at the start of training. Example: 1000 steps
  11. Gradient Clipping: Explanation: Technique to prevent exploding gradients by limiting their magnitude. Example: max_norm=1.0
  12. Mixed Precision Training: Explanation: Using lower precision (e.g., float16) to speed up training and reduce memory usage. Example: Enabled with float16
  13. Distributed Training Strategy: Explanation: Method for training across multiple GPUs or nodes. Example: Data Parallel, Model Parallel
  14. Attention Dropout: Explanation: Dropout rate specifically for attention layers. Example: 0.1
  15. Activation Function: Explanation: Non-linear function applied to neuron outputs. Example: ReLU, GELU
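
Most of these settings show up directly in a training loop. The sketch below uses PyTorch, with a toy linear model and random tensors standing in for a real architecture and dataset, to wire together the learning rate, batch size, optimizer with weight decay, warmup scheduler, gradient clipping, and mixed precision; the values are placeholders, not a recommended recipe.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from torch.optim.lr_scheduler import LambdaLR

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

model = nn.Linear(128, 10).to(device)                     # stand-in architecture
data = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
loader = DataLoader(data, batch_size=32, shuffle=True)    # batch size

optimizer = torch.optim.AdamW(model.parameters(),
                              lr=1e-4,                    # learning rate
                              weight_decay=0.01)          # L2-style regularization
warmup_steps = 100
scheduler = LambdaLR(optimizer,                           # linear warmup, then flat
                     lambda step: min(1.0, (step + 1) / warmup_steps))
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)      # mixed precision (float16)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):                                    # epochs
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        with torch.autocast(device_type=device, enabled=use_cuda):
            loss = loss_fn(model(x), y)
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()
```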
| Parameter | Explanation | Example |
|---|---|---|
| Learning Rate | Controls the step size at each iteration while moving toward a minimum of the loss function. | 0.0001 (1e-4) |
| Batch Size | The number of training examples used in one iteration. | 32, 64, 128 |
| Optimizer | Algorithm used to update the model’s weights. | Adam, SGD, RMSprop |
| Epochs | The number of complete passes through the entire training dataset. | 10, 50, 100 |
| Weight Initialization | Method used to set the initial random weights of the neural network. | Xavier initialization, He initialization |
| Regularization | Techniques to prevent overfitting. | L2 regularization (weight decay = 0.01), Dropout (rate = 0.1) |
| Learning Rate Scheduler | Strategy to adjust the learning rate during training. | StepLR (step_size=30, gamma=0.1), CosineAnnealingLR |
| Model Architecture | The structure and size of the neural network. | Transformer with 12 layers, 768 hidden size, 12 attention heads |
| Sequence Length | The maximum length of input sequences. | 512, 1024, 2048 tokens |
| Warmup Steps | Number of steps to gradually increase the learning rate at the start of training. | 1000 steps |
| Gradient Clipping | Technique to prevent exploding gradients by limiting their magnitude. | max_norm=1.0 |
| Mixed Precision Training | Using lower precision (e.g., float16) to speed up training and reduce memory usage. | Enabled with float16 |
| Distributed Training Strategy | Method for training across multiple GPUs or nodes. | Data Parallel, Model Parallel |
| Attention Dropout | Dropout rate specifically for attention layers. | 0.1 |
| Activation Function | Non-linear function applied to neuron outputs. | ReLU, GELU |