This is an introduction to LLM parameters. #
Inference Parameters #
- Temperature (Default: 1, Range: 0-2)
A higher temperature setting can enhance the LLM's creativity, while a lower temperature will maintain its consistency. For example, when the temperature is set to 1.0, the LLM's responses tend to be more adventurous; at 0.1, they are more conservative.
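To make the effect concrete, here is a minimal NumPy sketch of how temperature rescales the next-token distribution before sampling (the `logits` array is a hypothetical placeholder, not output from any specific model):

```python
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float = 1.0) -> int:
    """Sample a token id from raw logits after temperature scaling."""
    # Lower temperature sharpens the distribution (more conservative),
    # higher temperature flattens it (more adventurous).
    scaled = logits / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# Hypothetical logits over a 5-token vocabulary
logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0])
print(sample_with_temperature(logits, temperature=0.1))  # almost always token 0
print(sample_with_temperature(logits, temperature=1.0))  # more varied choices
```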
- Max Generation Length (Default: model-dependent, Range: 1 to the model's maximum)
Setting the maximum generation length caps the number of tokens the LLM can produce in a single response.
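A minimal sketch of how a decoding loop enforces this cap; `generate_next_token` and `eos_id` are hypothetical placeholders rather than a specific library's API:

```python
def generate(prompt_ids, generate_next_token, eos_id, max_new_tokens=128):
    """Append tokens until the model emits EOS or the length cap is hit."""
    output_ids = list(prompt_ids)
    for _ in range(max_new_tokens):           # hard cap on generated tokens
        next_id = generate_next_token(output_ids)
        output_ids.append(next_id)
        if next_id == eos_id:                  # natural stop before the cap
            break
    return output_ids
```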
- Top-K (Default: 0, Range: 1 to vocabulary size; common values 40-50)
Sampling is restricted to the top K highest-probability words. For instance, with K set to 50, the selection is made from among the 50 most likely words.
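A minimal NumPy sketch of top-K filtering applied to the logits before sampling (an illustration, not any particular library's implementation):

```python
import numpy as np

def top_k_filter(logits: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k highest logits; mask the rest so they cannot be sampled."""
    if k <= 0 or k >= len(logits):
        return logits                        # k=0 conventionally disables the filter
    kth_value = np.sort(logits)[-k]          # smallest logit still allowed
    return np.where(logits >= kth_value, logits, -np.inf)
```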
- Top-p (Default: 1, Range: 0-1; common values 0.9, 0.95)
Sampling is conducted from the pool of words until their cumulative probability reaches p. For example, with p set to 0.9, the selection is made from among the words whose combined probabilities account for 90% of the total.
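A matching NumPy sketch of nucleus (top-p) filtering, again as an illustration rather than a specific library's code:

```python
import numpy as np

def top_p_filter(logits: np.ndarray, p: float = 0.9) -> np.ndarray:
    """Mask tokens outside the smallest set whose cumulative probability >= p."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                  # most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1      # tokens kept in the nucleus
    keep = order[:cutoff]
    filtered = np.full_like(logits, -np.inf)
    filtered[keep] = logits[keep]
    return filtered
```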
- Repetition Penalty (Default: 1, Range: 1.0-2.0)
To mitigate the likelihood of generating repetitive words, a higher repetition penalty value can be applied, which effectively reduces the occurrence of redundant content in the text.
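One common formulation is the CTRL-style penalty sketched below, which down-weights the logits of already-generated tokens; treat it as an illustration rather than any library's exact implementation:

```python
import numpy as np

def apply_repetition_penalty(logits: np.ndarray, generated_ids, penalty: float = 1.2) -> np.ndarray:
    """Down-weight tokens that have already been generated."""
    logits = logits.copy()
    for token_id in set(generated_ids):
        # Dividing positive logits (and multiplying negative ones) both
        # make the repeated token less likely to be sampled again.
        if logits[token_id] > 0:
            logits[token_id] /= penalty
        else:
            logits[token_id] *= penalty
    return logits
```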
- Beam Search (Default: 1, Range: 1 to 10)
Beam search is a heuristic search algorithm that maintains multiple candidate sequences during generation and ultimately selects the best-scoring output. Its key parameter is the beam width, which determines the number of candidates retained at each step.
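A compact sketch of the core loop; `next_token_logprobs` is a hypothetical callable standing in for one model forward pass that returns (token_id, log_prob) pairs:

```python
def beam_search(start_ids, next_token_logprobs, beam_width=3, steps=10):
    """Keep the beam_width best partial sequences at every step."""
    beams = [(0.0, list(start_ids))]                 # (cumulative log-prob, tokens)
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for token_id, logp in next_token_logprobs(seq):
                candidates.append((score + logp, seq + [token_id]))
        # Prune back down to the beam_width highest-scoring sequences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return max(beams, key=lambda c: c[0])
```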
- Length Penalty (Default: 1.0, Range: usually 0.0 to 2.0)
Length penalty is a setting used in text generation to help the model create text that is neither too short nor too long. It works by adjusting the score of a generated text based on its length. Example: Imagine you are asking a model to write a summary of a book. Without a length penalty, the summary might be too short and miss important points, or too long and include unnecessary details. By applying a length penalty, you steer the model toward a summary of just the right length, so it covers the key information without extra fluff.
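One widely used formulation is GNMT-style length normalization, sketched below; the exact formula varies between libraries:

```python
def length_normalized_score(total_logprob: float, length: int, alpha: float = 1.0) -> float:
    """Divide the cumulative log-probability by a length-dependent factor.

    alpha > 1.0 favors longer sequences, alpha < 1.0 favors shorter ones.
    """
    penalty = ((5 + length) / 6) ** alpha   # GNMT-style length penalty
    return total_logprob / penalty
```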
- frequency_penalty (Default: 0, Range: -2.0 to +2.0)
Description: Penalizes words based on how frequently they have already appeared in the generated text, reducing repetition. Example: When set to 1.0, common words are used less often, resulting in more diverse text. (See the combined sketch after presence_penalty below.)
- presence_penalty (Default: 0, Range: -2.0 to +2.0)
Description: Penalizes words that have already appeared in the text, encouraging the use of new words. Example: When set to 1.5, the model tends to use words that haven't appeared yet, increasing text diversity.
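Both penalties act on the logits of tokens that have already been generated. The sketch below mirrors the way these penalties are commonly described (frequency scales with the count, presence is a one-time offset); it is an illustration, not any API's exact code:

```python
from collections import Counter
import numpy as np

def apply_frequency_presence_penalties(logits: np.ndarray, generated_ids,
                                       frequency_penalty: float = 0.0,
                                       presence_penalty: float = 0.0) -> np.ndarray:
    """Adjust logits for tokens that already appeared in the output."""
    logits = logits.copy()
    counts = Counter(generated_ids)
    for token_id, count in counts.items():
        # frequency_penalty scales with how many times the token appeared;
        # presence_penalty applies once for any token that appeared at all.
        logits[token_id] -= frequency_penalty * count + presence_penalty
    return logits
```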
- stop (Type: string or list of strings)
Description: Specifies one or more sequences at which to stop generation. Example: When set to [".", "!", "?"], the model will stop generating after producing any of these punctuation marks.
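A minimal sketch of how a decoding loop might check stop strings (a hypothetical helper, not a specific API):

```python
def hit_stop_sequence(text: str, stop) -> bool:
    """Return True if the generated text ends with any of the stop strings."""
    stops = [stop] if isinstance(stop, str) else list(stop)
    return any(text.endswith(s) for s in stops)
```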
- n (Default: 1)
Description: Specifies how many completions to generate. Example: When set to 3, the model will generate 3 different responses.
- best_of (Default: 1, Range: positive integer ≥ n)
Description: Generates multiple candidate results and returns the best n. Example: With n=2 and best_of=5, the model generates 5 candidate results and returns the 2 best ones.
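A sketch of the idea behind n and best_of together; `generate_candidate` is a hypothetical helper returning a (text, average log-probability) pair:

```python
def best_of_sampling(generate_candidate, n: int = 2, best_of: int = 5):
    """Generate best_of candidates, rank them by score, and return the top n."""
    candidates = [generate_candidate() for _ in range(best_of)]  # (text, avg_logprob)
    candidates.sort(key=lambda c: c[1], reverse=True)
    return [text for text, _ in candidates[:n]]
```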
- logprobs (Default: null, Range: non-negative integer, usually 0-5)
Description: Returns the most likely tokens and their log probabilities. Example: When set to 3, each generated token is accompanied by its 3 most likely alternatives and their log probabilities.
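A NumPy sketch of extracting the top alternatives and their log probabilities from one step's logits:

```python
import numpy as np

def top_logprobs(logits: np.ndarray, k: int = 3):
    """Return the k most likely token ids with their log probabilities."""
    # log-softmax: logits - logsumexp(logits), computed stably
    logprobs = logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())
    top_ids = np.argsort(logprobs)[::-1][:k]
    return [(int(i), float(logprobs[i])) for i in top_ids]
```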
- no_repeat_ngram_size (Default: 0, Range: positive integer)
Description: Prevents repetition of word groups of the specified length. Example: When set to 3, the model avoids repeating any three-word sequence that has already appeared.
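A sketch of how the ban can be implemented: record every (n-1)-token prefix seen so far and mask any token that would complete an already-seen n-gram.

```python
def banned_tokens(generated_ids, no_repeat_ngram_size: int = 3):
    """Token ids that would repeat an n-gram already present in generated_ids."""
    n = no_repeat_ngram_size
    if n <= 0 or len(generated_ids) < n:
        return set()
    seen = {}
    for i in range(len(generated_ids) - n + 1):
        prefix = tuple(generated_ids[i:i + n - 1])
        seen.setdefault(prefix, set()).add(generated_ids[i + n - 1])
    current_prefix = tuple(generated_ids[-(n - 1):])
    return seen.get(current_prefix, set())
```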
[!CAUTION]
These parameters can be used together, but with some limitations:
- Some parameters work against each other, like high temperature with strict top-k/top-p.
- Beam search often overrides other sampling methods.
- Using many complex parameters at once can slow down generation.
- Not all models support every parameter.
- Some parameters have specific valid ranges (e.g., temperature is usually 0-2).
- Different tasks may need different parameter combinations.
- Stop conditions usually take priority over other parameters.
Best practice:
- Start with defaults
- Adjust one parameter at a time
- Test thoroughly
- Keep notes on what works best
| Parameter Name | Default Value | Range | Description |
|---|---|---|---|
| Temperature | 1 | 0-2 | Controls the randomness and creativity of the output. Higher values increase creativity, lower values increase consistency. |
| Max Generation Length | Model-dependent | 1 to model's maximum | Controls the maximum number of tokens the LLM can generate. |
| Top-K | 0 | 1 to vocabulary size | Restricts sampling to the top K highest-probability words. Common range 40-50. |
| Top-p | 1 | 0-1 | Samples from words until their cumulative probability reaches p. Common values 0.9, 0.95. |
| Repetition Penalty | 1 | 1.0-2.0 | Reduces the likelihood of generating repetitive words, decreasing redundant content. |
| Beam Search | 1 | 1-10 | Maintains multiple candidate sequences and selects the optimal output. Key parameter is the beam width. |
| Length Penalty | 1.0 | Usually 0.0 to 2.0 | Adjusts the score of generated text based on its length, controlling output length. |
| frequency_penalty | 0 | -2.0 to 2.0 | Penalizes tokens based on their frequency in the generated text. |
| presence_penalty | 0 | -2.0 to 2.0 | Penalizes tokens based on whether they have already appeared. |
| stop | None | String or list of strings | Specifies sequences at which to stop generation. |
| n | 1 | Positive integer | Number of completions to generate. |
| best_of | 1 | Positive integer, ≥ n | Generates multiple candidates and returns the best n. |
| logprobs | null | Non-negative integer (usually 0-5) | Returns log probabilities of the most likely tokens. |
| no_repeat_ngram_size | 0 | Positive integer | Prevents repetition of n-grams of the specified length. |
Training Parameters #
- Learning Rate: Explanation: Controls the step size at each iteration while moving toward a minimum of the loss function. Example: 0.0001 (1e-4)
- Batch Size: Explanation: The number of training examples used in one iteration. Example: 32, 64, 128
- Optimizer: Explanation: Algorithm used to update the model’s weights. Example: Adam, SGD, RMSprop
- Epochs: Explanation: The number of complete passes through the entire training dataset. Example: 10, 50, 100
- Weight Initialization: Explanation: Method used to set the initial random weights of the neural network. Example: Xavier initialization, He initialization
- Regularization: Explanation: Techniques to prevent overfitting. Example: L2 regularization (weight decay = 0.01), Dropout (rate = 0.1)
- Learning Rate Scheduler: Explanation: Strategy to adjust the learning rate during training. Example: StepLR (step_size=30, gamma=0.1), CosineAnnealingLR
- Model Architecture: Explanation: The structure and size of the neural network. Example: Transformer with 12 layers, 768 hidden size, 12 attention heads
- Sequence Length: Explanation: The maximum length of input sequences. Example: 512, 1024, 2048 tokens
- Warmup Steps: Explanation: Number of steps to gradually increase the learning rate at the start of training. Example: 1000 steps
- Gradient Clipping: Explanation: Technique to prevent exploding gradients by limiting their magnitude. Example: max_norm=1.0
- Mixed Precision Training: Explanation: Using lower precision (e.g., float16) to speed up training and reduce memory usage. Example: Enabled with float16
- Distributed Training Strategy: Explanation: Method for training across multiple GPUs or nodes. Example: Data Parallel, Model Parallel
- Attention Dropout: Explanation: Dropout rate specifically for attention layers. Example: 0.1
- Activation Function: Explanation: Non-linear function applied to neuron outputs. Example: ReLU, GELU
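As a rough illustration of how several of these hyperparameters fit together, here is a minimal PyTorch sketch; the model, data, and specific values are placeholders rather than recommendations, and warmup and mixed precision are omitted for brevity:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy model and data stand in for a real LLM and training corpus.
model = nn.Linear(128, 2)
data = TensorDataset(torch.randn(1024, 128), torch.randint(0, 2, (1024,)))
loader = DataLoader(data, batch_size=32, shuffle=True)            # Batch Size

optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-4,                             # Learning Rate
                             weight_decay=0.01)                   # L2 Regularization
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)  # LR Scheduler
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):                                           # Epochs
    for inputs, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # Gradient Clipping
        optimizer.step()
    scheduler.step()
```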
| Parameter | Explanation | Example |
|---|---|---|
| Learning Rate | Controls the step size at each iteration while moving toward a minimum of the loss function. | 0.0001 (1e-4) |
| Batch Size | The number of training examples used in one iteration. | 32, 64, 128 |
| Optimizer | Algorithm used to update the model's weights. | Adam, SGD, RMSprop |
| Epochs | The number of complete passes through the entire training dataset. | 10, 50, 100 |
| Weight Initialization | Method used to set the initial random weights of the neural network. | Xavier initialization, He initialization |
| Regularization | Techniques to prevent overfitting. | L2 regularization (weight decay = 0.01), Dropout (rate = 0.1) |
| Learning Rate Scheduler | Strategy to adjust the learning rate during training. | StepLR (step_size=30, gamma=0.1), CosineAnnealingLR |
| Model Architecture | The structure and size of the neural network. | Transformer with 12 layers, 768 hidden size, 12 attention heads |
| Sequence Length | The maximum length of input sequences. | 512, 1024, 2048 tokens |
| Warmup Steps | Number of steps to gradually increase the learning rate at the start of training. | 1000 steps |
| Gradient Clipping | Technique to prevent exploding gradients by limiting their magnitude. | max_norm=1.0 |
| Mixed Precision Training | Using lower precision (e.g., float16) to speed up training and reduce memory usage. | Enabled with float16 |
| Distributed Training Strategy | Method for training across multiple GPUs or nodes. | Data Parallel, Model Parallel |
| Attention Dropout | Dropout rate specifically for attention layers. | 0.1 |
| Activation Function | Non-linear function applied to neuron outputs. | ReLU, GELU |