This is an introduction to LLM parameters. #
Inference Parameters #
- Temperature (Default: 1, Range: 0-2)
A higher temperature setting can enhance the LLM's creativity, while a lower temperature will maintain its consistency. For example, when the temperature is set to 1.0, the LLM's responses tend to be more adventurous; at 0.1, they are more conservative.
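To make the effect concrete, here is a minimal NumPy sketch of how temperature rescales the next-token distribution before sampling (the `logits` array is a hypothetical placeholder, not output from any specific model):

```python
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float = 1.0) -> int:
    """Sample a token id from raw logits after temperature scaling."""
    # Lower temperature sharpens the distribution (more conservative),
    # higher temperature flattens it (more adventurous).
    scaled = logits / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# Hypothetical logits over a 5-token vocabulary
logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0])
print(sample_with_temperature(logits, temperature=0.1))  # almost always token 0
print(sample_with_temperature(logits, temperature=1.0))  # more varied choices
```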
- Max Generation Length (Default: model-dependent, Range: 1 to the model's maximum)
Setting the maximum generation length caps the number of tokens the LLM can produce in a single response.
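A minimal sketch of how a decoding loop enforces this cap; `generate_next_token` and `eos_id` are hypothetical placeholders rather than a specific library's API:

```python
def generate(prompt_ids, generate_next_token, eos_id, max_new_tokens=128):
    """Append tokens until the model emits EOS or the length cap is hit."""
    output_ids = list(prompt_ids)
    for _ in range(max_new_tokens):           # hard cap on generated tokens
        next_id = generate_next_token(output_ids)
        output_ids.append(next_id)
        if next_id == eos_id:                  # natural stop before the cap
            break
    return output_ids
```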
- Top-K (Default: 0, Range: 1 to vocabulary size; common values 40-50)
Sampling is restricted to the top K highest-probability words. For instance, with K set to 50, the selection is made from among the 50 most likely words.
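A minimal NumPy sketch of top-K filtering applied to the logits before sampling (an illustration, not any particular library's implementation):

```python
import numpy as np

def top_k_filter(logits: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k highest logits; mask the rest so they cannot be sampled."""
    if k <= 0 or k >= len(logits):
        return logits                        # k=0 conventionally disables the filter
    kth_value = np.sort(logits)[-k]          # smallest logit still allowed
    return np.where(logits >= kth_value, logits, -np.inf)
```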
- Top-p (Default: 1, Range: 0-1; common values 0.9, 0.95)
Sampling is conducted from the pool of words until their cumulative probability reaches p. For example, with p set to 0.9, the selection is made from among the words whose combined probabilities account for 90% of the total.
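A matching NumPy sketch of nucleus (top-p) filtering, again as an illustration rather than a specific library's code:

```python
import numpy as np

def top_p_filter(logits: np.ndarray, p: float = 0.9) -> np.ndarray:
    """Mask tokens outside the smallest set whose cumulative probability >= p."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                  # most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1      # tokens kept in the nucleus
    keep = order[:cutoff]
    filtered = np.full_like(logits, -np.inf)
    filtered[keep] = logits[keep]
    return filtered
```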
- Repetition Penalty (Default: 1, Range: 1.0-2.0)
To mitigate the likelihood of generating repetitive words, a higher repetition penalty value can be applied, which effectively reduces the occurrence of redundant content in the text.
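One common formulation is the CTRL-style penalty sketched below, which down-weights the logits of already-generated tokens; treat it as an illustration rather than any library's exact implementation:

```python
import numpy as np

def apply_repetition_penalty(logits: np.ndarray, generated_ids, penalty: float = 1.2) -> np.ndarray:
    """Down-weight tokens that have already been generated."""
    logits = logits.copy()
    for token_id in set(generated_ids):
        # Dividing positive logits (and multiplying negative ones) both
        # make the repeated token less likely to be sampled again.
        if logits[token_id] > 0:
            logits[token_id] /= penalty
        else:
            logits[token_id] *= penalty
    return logits
```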
- Beam Search (Default: 1, Range: 1 to 10)
Beam search is a heuristic search algorithm that maintains multiple candidate sequences during generation and ultimately selects the best-scoring output. Its key parameter is the beam width, which determines the number of candidates retained at each step.
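A compact sketch of the core loop; `next_token_logprobs` is a hypothetical callable standing in for one model forward pass that returns (token_id, log_prob) pairs:

```python
def beam_search(start_ids, next_token_logprobs, beam_width=3, steps=10):
    """Keep the beam_width best partial sequences at every step."""
    beams = [(0.0, list(start_ids))]                 # (cumulative log-prob, tokens)
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for token_id, logp in next_token_logprobs(seq):
                candidates.append((score + logp, seq + [token_id]))
        # Prune back down to the beam_width highest-scoring sequences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return max(beams, key=lambda c: c[0])
```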
- Length Penalty (Default: 1.0, Range: usually 0.0 to 2.0)
Length penalty is a setting used in text generation to help the model create text that is neither too short nor too long. It works by adjusting the score of a generated text based on its length. Example: Imagine you are asking a model to write a summary of a book. Without a length penalty, the summary might be too short and miss important points, or too long and include unnecessary details. By applying a length penalty, you steer the model toward a summary of just the right length, so it covers the key information without extra fluff.
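One widely used formulation is GNMT-style length normalization, sketched below; the exact formula varies between libraries:

```python
def length_normalized_score(total_logprob: float, length: int, alpha: float = 1.0) -> float:
    """Divide the cumulative log-probability by a length-dependent factor.

    alpha > 1.0 favors longer sequences, alpha < 1.0 favors shorter ones.
    """
    penalty = ((5 + length) / 6) ** alpha   # GNMT-style length penalty
    return total_logprob / penalty
```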
- frequency_penalty (Default: 0, Range: -2.0 to +2.0)
Description: Penalizes words based on how frequently they have already appeared in the generated text, reducing repetition. Example: When set to 1.0, common words are used less often, resulting in more diverse text. (See the combined sketch after presence_penalty below.)
- presence_penalty (Default: 0, Range: -2.0 to +2.0)
Description: Penalizes words that have already appeared in the text, encouraging the use of new words. Example: When set to 1.5, the model tends to use words that haven't appeared yet, increasing text diversity.
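Both penalties act on the logits of tokens that have already been generated. The sketch below mirrors the way these penalties are commonly described (frequency scales with the count, presence is a one-time offset); it is an illustration, not any API's exact code:

```python
from collections import Counter
import numpy as np

def apply_frequency_presence_penalties(logits: np.ndarray, generated_ids,
                                       frequency_penalty: float = 0.0,
                                       presence_penalty: float = 0.0) -> np.ndarray:
    """Adjust logits for tokens that already appeared in the output."""
    logits = logits.copy()
    counts = Counter(generated_ids)
    for token_id, count in counts.items():
        # frequency_penalty scales with how many times the token appeared;
        # presence_penalty applies once for any token that appeared at all.
        logits[token_id] -= frequency_penalty * count + presence_penalty
    return logits
```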
- stop (Type: string or list of strings)
Description: Specifies one or more sequences at which to stop generation. Example: When set to [".", "!", "?"], the model will stop generating after producing any of these punctuation marks.
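A minimal sketch of how a decoding loop might check stop strings (a hypothetical helper, not a specific API):

```python
def hit_stop_sequence(text: str, stop) -> bool:
    """Return True if the generated text ends with any of the stop strings."""
    stops = [stop] if isinstance(stop, str) else list(stop)
    return any(text.endswith(s) for s in stops)
```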
- n (Default: 1)
Description: Specifies how many completions to generate. Example: When set to 3, the model will generate 3 different responses.
- best_of (Default: 1, Range: positive integer ≥ n)
Description: Generates multiple candidate results and returns the best n. Example: With n=2 and best_of=5, the model generates 5 candidate results and returns the 2 best ones.
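A sketch of the idea behind n and best_of together; `generate_candidate` is a hypothetical helper returning a (text, average log-probability) pair:

```python
def best_of_sampling(generate_candidate, n: int = 2, best_of: int = 5):
    """Generate best_of candidates, rank them by score, and return the top n."""
    candidates = [generate_candidate() for _ in range(best_of)]  # (text, avg_logprob)
    candidates.sort(key=lambda c: c[1], reverse=True)
    return [text for text, _ in candidates[:n]]
```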
- logprobs (Default: null, Range: non-negative integer, usually 0-5)
Description: Returns the most likely tokens and their log probabilities. Example: When set to 3, each generated token is accompanied by its 3 most likely alternatives and their log probabilities.
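A NumPy sketch of extracting the top alternatives and their log probabilities from one step's logits:

```python
import numpy as np

def top_logprobs(logits: np.ndarray, k: int = 3):
    """Return the k most likely token ids with their log probabilities."""
    # log-softmax: logits - logsumexp(logits), computed stably
    logprobs = logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())
    top_ids = np.argsort(logprobs)[::-1][:k]
    return [(int(i), float(logprobs[i])) for i in top_ids]
```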
- no_repeat_ngram_size (Default: 0, Range: positive integer)
Description: Prevents repetition of word groups of the specified length. Example: When set to 3, the model avoids repeating any three-word sequence that has already appeared.
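A sketch of how the ban can be implemented: record every (n-1)-token prefix seen so far and mask any token that would complete an already-seen n-gram.

```python
def banned_tokens(generated_ids, no_repeat_ngram_size: int = 3):
    """Token ids that would repeat an n-gram already present in generated_ids."""
    n = no_repeat_ngram_size
    if n <= 0 or len(generated_ids) < n:
        return set()
    seen = {}
    for i in range(len(generated_ids) - n + 1):
        prefix = tuple(generated_ids[i:i + n - 1])
        seen.setdefault(prefix, set()).add(generated_ids[i + n - 1])
    current_prefix = tuple(generated_ids[-(n - 1):])
    return seen.get(current_prefix, set())
```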
[!CAUTION]
These parameters can be used together, but with some limitations:
- Some parameters work against each other, like high temperature with strict top-k/top-p.
- Beam search often overrides other sampling methods.
- Using many complex parameters at once can slow down generation.
- Not all models support every parameter.
- Some parameters have specific valid ranges (e.g., temperature is usually 0-2).
- Different tasks may need different parameter combinations.
- Stop conditions usually take priority over other parameters.
Best practice:
- Start with defaults
- Adjust one parameter at a time
- Test thoroughly
- Keep notes on what works best
| Parameter Name | Default Value | Range | Description |
|---|---|---|---|
| Temperature | 1 | 0-2 | Controls the randomness and creativity of the output. Higher values increase creativity, lower values increase consistency. |
| Max Generation Length | Model-dependent | 1 to model's maximum | Controls the maximum number of tokens the LLM can generate. |
| Top-K | 0 | 1 to vocabulary size | Restricts sampling to the top K highest-probability words. Common range 40-50. |
| Top-p | 1 | 0-1 | Samples from words until their cumulative probability reaches p. Common values 0.9, 0.95. |
| Repetition Penalty | 1 | 1.0-2.0 | Reduces the likelihood of generating repetitive words, decreasing redundant content. |
| Beam Search | 1 | 1-10 | Maintains multiple candidate sequences and selects the optimal output. Key parameter is the beam width. |
| Length Penalty | 1.0 | Usually 0.0 to 2.0 | Adjusts the score of generated text based on its length, controlling output length. |
| frequency_penalty | 0 | -2.0 to 2.0 | Penalizes tokens based on their frequency in the generated text. |
| presence_penalty | 0 | -2.0 to 2.0 | Penalizes tokens based on whether they have already appeared. |
| stop | None | String or list of strings | Specifies sequences at which to stop generation. |
| n | 1 | Positive integer | Number of completions to generate. |
| best_of | 1 | Positive integer, ≥ n | Generates multiple candidates and returns the best n. |
| logprobs | null | Non-negative integer (usually 0-5) | Returns log probabilities of the most likely tokens. |
| no_repeat_ngram_size | 0 | Positive integer | Prevents repetition of n-grams of the specified length. |
Training Parameters #
- Learning Rate: Explanation: Controls the step size at each iteration while moving toward a minimum of the loss function. Example: 0.0001 (1e-4)
- Batch Size: Explanation: The number of training examples used in one iteration. Example: 32, 64, 128
- Optimizer: Explanation: Algorithm used to update the model’s weights. Example: Adam, SGD, RMSprop
- Epochs: Explanation: The number of complete passes through the entire training dataset. Example: 10, 50, 100
- Weight Initialization: Explanation: Method used to set the initial random weights of the neural network. Example: Xavier initialization, He initialization
- Regularization: Explanation: Techniques to prevent overfitting. Example: L2 regularization (weight decay = 0.01), Dropout (rate = 0.1)
- Learning Rate Scheduler: Explanation: Strategy to adjust the learning rate during training. Example: StepLR (step_size=30, gamma=0.1), CosineAnnealingLR
- Model Architecture: Explanation: The structure and size of the neural network. Example: Transformer with 12 layers, 768 hidden size, 12 attention heads
- Sequence Length: Explanation: The maximum length of input sequences. Example: 512, 1024, 2048 tokens
- Warmup Steps: Explanation: Number of steps to gradually increase the learning rate at the start of training. Example: 1000 steps
- Gradient Clipping: Explanation: Technique to prevent exploding gradients by limiting their magnitude. Example: max_norm=1.0
- Mixed Precision Training: Explanation: Using lower precision (e.g., float16) to speed up training and reduce memory usage. Example: Enabled with float16
- Distributed Training Strategy: Explanation: Method for training across multiple GPUs or nodes. Example: Data Parallel, Model Parallel
- Attention Dropout: Explanation: Dropout rate specifically for attention layers. Example: 0.1
- Activation Function: Explanation: Non-linear function applied to neuron outputs. Example: ReLU, GELU
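As a rough illustration of how several of these hyperparameters fit together, here is a minimal PyTorch sketch; the model, data, and specific values are placeholders rather than recommendations, and warmup and mixed precision are omitted for brevity:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy model and data stand in for a real LLM and training corpus.
model = nn.Linear(128, 2)
data = TensorDataset(torch.randn(1024, 128), torch.randint(0, 2, (1024,)))
loader = DataLoader(data, batch_size=32, shuffle=True)            # Batch Size

optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-4,                             # Learning Rate
                             weight_decay=0.01)                   # L2 Regularization
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)  # LR Scheduler
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):                                           # Epochs
    for inputs, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # Gradient Clipping
        optimizer.step()
    scheduler.step()
```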
| Parameter | Explanation | Example |
|---|---|---|
| Learning Rate | Controls the step size at each iteration while moving toward a minimum of the loss function. | 0.0001 (1e-4) |
| Batch Size | The number of training examples used in one iteration. | 32, 64, 128 |
| Optimizer | Algorithm used to update the model's weights. | Adam, SGD, RMSprop |
| Epochs | The number of complete passes through the entire training dataset. | 10, 50, 100 |
| Weight Initialization | Method used to set the initial random weights of the neural network. | Xavier initialization, He initialization |
| Regularization | Techniques to prevent overfitting. | L2 regularization (weight decay = 0.01), Dropout (rate = 0.1) |
| Learning Rate Scheduler | Strategy to adjust the learning rate during training. | StepLR (step_size=30, gamma=0.1), CosineAnnealingLR |
| Model Architecture | The structure and size of the neural network. | Transformer with 12 layers, 768 hidden size, 12 attention heads |
| Sequence Length | The maximum length of input sequences. | 512, 1024, 2048 tokens |
| Warmup Steps | Number of steps to gradually increase the learning rate at the start of training. | 1000 steps |
| Gradient Clipping | Technique to prevent exploding gradients by limiting their magnitude. | max_norm=1.0 |
| Mixed Precision Training | Using lower precision (e.g., float16) to speed up training and reduce memory usage. | Enabled with float16 |
| Distributed Training Strategy | Method for training across multiple GPUs or nodes. | Data Parallel, Model Parallel |
| Attention Dropout | Dropout rate specifically for attention layers. | 0.1 |
| Activation Function | Non-linear function applied to neuron outputs. | ReLU, GELU |