GenAI Image Model Training
Introduction
This post is a compilation of my personal notes on generative image models, covering training hyperparameters, model semantics, and practical insights to help you train a model yourself. Whether you're using presets or experimenting with advanced configurations, I hope these notes make your journey with these tools a little smoother.
Preface and Experience
This collection of notes on image model training with KohyaSS originated in a pocket-sized notebook and is continuously evolving. Expect ongoing revisions as I condense key information to serve as a fundamental resource.
In my spare time, I've trained over 2,000 generative image models, primarily using the Stable Diffusion architecture. It's important to emphasize that I am self-taught in the realm of machine learning. Much of my understanding has been formed through a long process of iterative experimentation.
Dive into Deep Learning
For a more comprehensive exploration of deep learning, I recommend you check out:
- DeepLearning.AI with Andrew Ng
- Course.Fast.AI with Jeremy Howard
These resources will provide you with a much better understanding of the foundations of deep learning, as well as the hyperparameters involved in training models yourself.
Model Semantics: LLMs vs. Computer Vision
Before diving head-first into generative image models, there are a few key points I'd like to make. When working with machine learning models, it's crucial to understand how data modalities differ, in this case between large language models (LLMs) and computer vision models. LLMs process and generate text by learning from massive datasets of written language, enabling them to predict the next token in a sequence, handle long-range context, and even translate text.
In contrast, traditional image classification models use visual data, typically RGB values, to recognize patterns and categorize image features based on assigned labels. Advanced vision models like Vision Transformers (ViT) and CLIP push beyond traditional classification tasks by bridging visual and textual understanding, although their focus remains distinct from that of LLMs.
Prompting
To preface: as datasets are re-captioned with more advanced techniques and larger models, generative performance continues to improve, so the notes below may become less relevant over time.
A prompt is typically text-based conditioning data provided to a generative model's input. Image models are trained on large datasets of image-caption pairs, not large amounts of conversational text. When prompting image models, it's generally best to use language that resembles image captions rather than conversational text.
For example, instead of a verbose prompt:
"Please create a photo of a beautiful landscape, picturing a large grassy field, with a bright blue sky and a beautiful sun shining brightly overhead. Add an artistic flair to the image, giving it an illustrative touch with hints of aesthetically pleasing qualities. Make sure the image portrays a beautiful scene filled with rich details and include elements of sci-fi and poetic themes throughout the image."
Try a more concise, caption-like prompt:
"Large grassy field, beautiful day, detailed, illustrative, aesthetic, sci-fi, poetic musings"
This approach avoids spending excessive tokens and keeps you from pulling the model in conflicting directions through the embedding space. See 3Blue1Brown's video on visualizing attention in transformers!
Note that this may not apply to apps/services like Midjourney, as many platforms rewrite user prompts and adjust the models performing the generation in some way. Without a local, open-source environment, it's uncertain what prompts or conditioning data the generative models actually receive.
LoRA Learning
Understanding LoRAs
LoRA (Low-Rank Adaptation) is a method used to fine-tune models with fewer trainable parameters. All this really means is we take a large base model, and we train over the top of small pieces of that model, rather than the whole thing. Instead of adjusting all the model's weights, LoRAs introduce a small set of trainable matrices for each layer, significantly decreasing the number of parameters and making it more computationally efficient for fine-tuning large models on consumer hardware.
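To make the mechanics concrete, here is a minimal PyTorch sketch of the low-rank idea. This illustrates the technique in general rather than KohyaSS's actual implementation; the class name and the default rank/alpha values are arbitrary.

```python
# Minimal sketch of the LoRA idea: a frozen weight is augmented with two small
# trainable matrices, so the effective weight becomes W + (alpha / rank) * (B @ A).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                 # original weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)   # "A": project down to the rank
        self.up = nn.Linear(rank, base.out_features, bias=False)    # "B": project back up
        nn.init.zeros_(self.up.weight)                               # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.up(self.down(x)) * self.scale
```

Only the two small matrices are trained, which is why a LoRA file is a few megabytes while the base model is several gigabytes.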
Learning Targets
There are two main learning targets:
- U-Net: Consists of downsampling, bottleneck, and upsampling paths, responsible for processing image data and generating output images. If you wish to learn more about this, see the links provided below.
- Text Encoder: Converts text prompts into vector representations (numbers in a matrix) shared by all attention blocks in the U-Net, playing a crucial role in the "Cross Attention" mechanism.
Most often, the U-Net uses a single text encoder. However, with architectures such as that of the FLUX generative image model family, it's not uncommon to see multiple text encoders utilized to improve model performance.
Although the text encoder in Stable Diffusion models is typically considered "finished," it can be further tuned using LoRAs.
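To see the two targets side by side, here's a rough sketch using the diffusers library; the checkpoint id is only an example, and any Stable Diffusion 1.x model laid out the same way would work.

```python
# Rough illustration of the two LoRA learning targets in a Stable Diffusion
# pipeline (assumes the diffusers library; the model id is an example).
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

count = lambda m: sum(p.numel() for p in m.parameters())
print("U-Net parameters:       ", count(pipe.unet))          # the denoising network
print("Text encoder parameters:", count(pipe.text_encoder))  # the CLIP text model
```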
DyLoRA, Lo-con, LoHA? - LoRA Types
KohyaSS supports a range of LoRA types, each with specific use cases. Choosing the optimal type depends on the data you have prepared, your desired effect on the outputs, your hardware, and other factors. Experimentation is vital in machine learning due to the lack of a one-size-fits-all configuration.
There's nothing wrong with using the standard LoRA type. Explore other types if you encounter specific issues or wish to experiment.
DyLoRA (Dynamic LoRA)
DyLoRA training enables simultaneous learning across various network ranks within a defined maximum rank, helping find the optimal rank for your specific data.
LoHA (Low-Rank Hadamard Adaptation)
LoHA is designed for smaller hardware setups, producing slightly lower fidelity compared to standard LoRA and learning feature representations differently.
Lo-con (Low-Rank Convolution)
Lo-con extends learning to the ResNet blocks within the U-Net, enhancing the model's ability to capture subtle patterns. I have limited experience with this configuration.
Training Parameters
Batch Size
Batch size defines how many images are processed concurrently during training. Larger batch sizes can speed up training, but they may reduce precision on individual features; put simply, the model may not capture some of the finer details in your dataset. Adjust the learning rate accordingly: as a rule of thumb, if you double the batch size, double the learning rate to maintain effective training. Larger batches push the model toward broader patterns (think overall style or tone of the images), which averages across the dataset and helps avoid overfitting (recreating 1:1 features from your images).
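As a rough illustration of that scaling rule (a rule of thumb, not a guarantee; the base values below are arbitrary):

```python
# Linear scaling rule of thumb: scale the learning rate in proportion to the
# batch size. The base values here are illustrative, not universal defaults.
def scaled_lr(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    return base_lr * (new_batch_size / base_batch_size)

print(scaled_lr(2e-4, 1, 2))  # doubling the batch size -> 4e-4
```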
- How to maximize GPU utilization by finding the right batch size - David Clinton, Shaoni Mukherjee - PaperSpace
- Effect of batch size on training dynamics - Kevin Shen - Medium
- Optimizing LoRA Training with Various Batch Sizes - Toolify
Repeats vs Epochs
In KohyaSS, repeats and epochs serve different roles:
- Repeats: Define how many times each image is processed within a single epoch. For example, setting repeats to 25 means each image is trained on 25 times during one epoch.
- Epochs: Represent the number of full passes through the dataset. After each epoch, the model has seen the entire dataset once (or more if repeats are used).
The balance between these two settings really matters. Typically, you'd only increase the number of repeats for sparse datasets; that said, LoRAs can learn quite effectively even from small sample sizes with the right configuration. Epochs, by contrast, have a more direct impact on how the model's weights evolve, and thus on the way it learns.
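Here's a quick back-of-the-envelope calculation for how these settings combine into total training steps; the numbers are arbitrary, and regularization images, if used, increase the count further.

```python
# Rough step count for a KohyaSS-style run: images * repeats per epoch,
# divided by the batch size, then multiplied by the number of epochs.
images, repeats, epochs, batch_size = 40, 25, 10, 2

steps_per_epoch = images * repeats // batch_size
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 500 steps per epoch, 5000 total
```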
- Understanding LoRA Training - Dreams Are Real - Medium
- Epoch vs. Batch vs. Iteration in Neural Networks - Sujatha Mudadla - Medium
Number of Regularization Images
Regularization images can be beneficial but aren't always necessary. A common guideline is to set their number equal to the training images multiplied by the number of epochs. I've achieved good results both with and without them, but will typically use them if I'm trying to train over a large dataset (1,000+ images).
- My LoRA Experiments - Regularization Images - zoomyizumi
- My Findings on the Impact of Regularization Images...
Learning Rate
The learning rate (LR) determines how quickly the model's weights are updated. It can vary depending on factors like batch size, dataset complexity, and model architecture. Adjust the learning rate iteratively based on model performance. The default of 2e-4 is usually where I start, but I adjust it frequently based on the number of epochs, network alpha, and batch size.
- Understanding LoRAs - LR Schedulers, NDim, Alpha - Dreams Are Real - Medium
- Effect of Rank, Epochs, and Learning Rate on Training Textual LoRAs - RunPod
Network Alpha
Network alpha is a scaling factor that fine-tunes the magnitude of weight updates. It varies depending on the LoRA method and optimization algorithm (commonly AdamW).
Mixed Precision Training
KohyaSS supports mixed precision training, which reduces memory usage and improves performance by using lower-precision formats like FP16 and BF16. FP32/FP16 simply describe how numbers (such as the model's weights) are stored: 32 bits or 16 bits per value.
- FP16: Cuts memory requirements in half, but can cause numerical instability.
- BF16: Retains FP32's numerical range at reduced precision, offering a good compromise.
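Below is a minimal sketch of what mixed precision looks like in PyTorch. It shows the general mechanism rather than KohyaSS's exact training loop, and it assumes a CUDA-capable GPU.

```python
# Minimal mixed precision sketch. BF16 keeps FP32's exponent range, so it
# usually works without loss scaling; FP16 is typically paired with a
# GradScaler to avoid underflow/overflow.
import torch

layer = torch.nn.Linear(64, 64).cuda()
x = torch.randn(8, 64, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = layer(x)       # the matmul runs in bf16
print(y.dtype)         # torch.bfloat16
```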
Random Seeds
Seeds determine aspects like image processing order and noise patterns. I typically use an arbitrary value for consistency across training runs, like 42.
Cache Latents
Caching latents pre-compresses images into their latent space, speeding up subsequent training and reducing VRAM usage. It limits augmentations to flip augmentation only but is useful for repeated training on the same dataset. This is something I tend to keep on, because I find myself frequently adjusting training parameters and re-training until I find a good valley my data is sitting in.
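For the curious, this is roughly what caching latents amounts to, sketched with the diffusers library; the checkpoint id is only an example, and the random tensor stands in for a preprocessed training image.

```python
# Each training image is encoded by the VAE once, and the resulting latents are
# reused every epoch instead of re-encoding the pixels (assumes diffusers).
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")
vae.requires_grad_(False)

pixels = torch.randn(1, 3, 512, 512)  # stand-in for a normalized training image
with torch.no_grad():
    latents = vae.encode(pixels).latent_dist.sample() * vae.config.scaling_factor
print(latents.shape)  # (1, 4, 64, 64) for SD 1.x at 512x512
```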
LoRA Network Rank
The network rank defines the number of neurons in the hidden layer of the additional neural network used in LoRA learning. Experimentation is key to finding the optimal rank. A larger rank means your network will "store" more information, but that isn't necessarily a good thing: the model becomes more likely to learn undesirable features from your images and to overfit. Set the rank too small, on the other hand, and it may fail to capture the details you care about. 64 is a good place to start.
LoRA Network Alpha
LoRA network alpha prevents weights from becoming too small, ensuring effective learning. It should not exceed the network rank. Changes to this value will have to be reflected in the learning rate.
Some notes:
Changes to your alpha influence your optimal learning rate. LoRA updates are scaled by alpha divided by rank, so if the network alpha is 16 and the network rank is 32, the update strength is effectively halved (16/32 = 0.5). Adjust parameters as needed.
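A tiny worked example of that relationship, as a rule of thumb rather than an exact prescription:

```python
# The LoRA update is scaled by alpha / rank, so a smaller alpha weakens each
# step; compensating with a proportionally higher learning rate is a common
# rule of thumb.
rank, alpha, base_lr = 32, 16, 2e-4

scale = alpha / rank              # 0.5
compensated_lr = base_lr / scale  # 4e-4 to keep roughly the same effective step size
print(scale, compensated_lr)
```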
Preventing Text Encoder Training
Training the text encoder for too long can lead to overfitting. This setting lets you specify a stopping point for text encoder training as a percentage of total steps.
Advanced Configurations for LoRA Training
Weights, Blocks, and CONV Settings
Weights
The weights setting determines learning rate weights for each U-Net block. The threshold value prevents neural networks from being added to blocks with weights below it.
Blocks: Bdim and Balpha
The blocks settings specify dimension (Bdim) and alpha (Balpha) values for each U-Net block. Some blocks are ignored during learning.
CONV: CONVdim and CONValpha
LoRA can also update the convolutional neural networks within blocks; CONVdim and CONValpha control those CONV layers.
LoRA Block Weights
LoRA block weights control block influence during training. The U-Net Middle Block typically has higher weight.
Token Padding
Token padding ensures input sequences have the same length.
Weighted Captions
Weighted captions use syntax like token:1.3 to assign token weights, useful for complex captions.
Prior Loss
Prior loss balances the influence of training data and regularization images. Adjust based on your needs.
Learning Rate Settings
LR Num Cycles
Specifies cycles for cosine with restarts scheduler.
LR Power
Used with polynomial scheduler; controls learning rate decrease.
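For context, here's a hedged sketch of where these two values end up, using the scheduler helper from the diffusers library; KohyaSS's internals may differ, and the optimizer and step counts are placeholders.

```python
# num_cycles feeds the cosine-with-restarts schedule; power feeds the
# polynomial schedule (values here are placeholders).
import torch
from diffusers.optimization import get_scheduler

params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=2e-4)

scheduler = get_scheduler(
    "cosine_with_restarts",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=1000,
    num_cycles=3,   # LR Num Cycles
)
# For the polynomial scheduler, the power argument controls how fast the LR decays:
# get_scheduler("polynomial", optimizer=optimizer, num_warmup_steps=0,
#               num_training_steps=1000, power=1.5)
```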
Saving and Tokenization
Save Last N Steps
Allows model saving during training at specified intervals.
Keep N Tokens
Determines how many tokens at the start of a caption are kept in place when shuffle captions is on.
Clip Skip
Set clip skip to 2 if the base model is a NovelAI model.
Max Token Length
Specifies total tokens allowed in captions.
Data Loading and Attention
Persistent Data Loading
Keeps image data in memory for faster training.
Memory Efficient Attention
Suppresses VRAM usage; slower than xformers.
Xformers
Uses the xformers library's block-wise, memory-efficient attention to reduce VRAM usage, typically faster than the Memory Efficient Attention option.
Data Augmentation
Color Aug
Adds variety by applying random color (hue) augmentations to the training images; note this cannot be combined with caching latents.
Flip Augmentation
Randomly flips images horizontally; useful for learning symmetrical concepts.
Noise Settings
Min SNR Gamma
Weights the loss by each timestep's signal-to-noise ratio to stabilize training; a commonly recommended value is 5.
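For intuition, here's a rough sketch of the min-SNR-gamma idea: the per-timestep loss is weighted by min(SNR, gamma) / SNR, so low-noise (high-SNR) timesteps contribute less. This follows the published formulation rather than KohyaSS's exact code, and the SNR values are placeholders.

```python
# Min-SNR-gamma loss weighting: weight = min(SNR(t), gamma) / SNR(t).
import torch

def min_snr_weights(snr: torch.Tensor, gamma: float = 5.0) -> torch.Tensor:
    return torch.minimum(snr, torch.full_like(snr, gamma)) / snr

snr = torch.tensor([0.5, 1.0, 5.0, 50.0])  # hypothetical SNRs for sampled timesteps
print(min_snr_weights(snr))                 # high-SNR (low-noise) steps are down-weighted
```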
Noise Offset Type
Determines how noise is added; the multires option adds noise at multiple resolutions, which increases its complexity.
Noise Offset
Applicable when LoRA type is Standard; adjusts noise levels.
Adaptive Noise Scale
Adjusts noise based on image noisiness.
Multires Iterations
Determines the resolutions used for multires noise.
Multires Discounts
Weakens the noise at each resolution for multires noise.
Caption Dropout
Dropout Caption Every N Epoch
Specifies epochs after which training occurs without captions.
Rate Caption Drop
Determines percentage of images trained without captions.
By configuring these settings, you can fine-tune LoRA training to suit your requirements and optimize model performance.
Final Remarks
I hope this guide and the additional resources help you train great models. FLUX has captured my attention, and I'm exploring creating training applications with custom configurations. Updates to this guide may come as I delve further into deep learning and generative image models.
- Zac