This article was completed with the help of www.DeepL.com/Translator.
A while ago I saw the LoRA model of Holo from “Spice and Wolf” posted by Eric Fu. It sparked an impulse in me: who doesn’t want to fine-tune a model for their waifu? So… I tried to make it happen, and recorded the process for your reference.
To pull this off you need:
- A certain familiarity with Python and with setting up Python environments; ML projects (and older projects in general) inevitably have some dependency conflicts. To avoid dependency confusion, I prepared a separate venv for each tool.
- If your computer has an Nvidia GPU with CUDA support, remember to install pytorch-cuda before installing each project’s dependencies; that package uses the GPU, while plain pytorch will make your CPU cry. For installation, see this discussion.
- A decent computer. Even if your hardware isn’t enough for the final fine-tuning and you have to run it in the cloud on Colab or something similar, it’s still more convenient to run the dataset preprocessing locally.
- BTW, I ended up fine-tuning the model locally. Even with pytorch-cuda installed, torch still chose to run on the CPU; I’m not sure why, but there is an upside: it’s slow, but there’s no need to worry about running out of memory.
- For reference: I have an AMD 3700X CPU (3600 MHz), and the fine-tuning takes about 40% of its load. I’ve forgotten the exact memory consumption, but it was definitely less than 10 GB.
Tools/projects used:
- The famous stable-diffusion-webui for resizing and auto-tagging the dataset.
- Model: anime-segmentation - SkyTNT for removing the background behind the character.
- Model: anime-face-detector - qhgz2013 for recognizing faces and cropping them out, to create some face-focused training pictures.
- lora-scripts - Akegarasu to drive sd-scripts - kohya-ss for training the LyCORIS (an improvement on LoCon?) model (the author of that model is a really fun person).
Also note that models have their own specialties: the tools I use here are for anime characters. If you are fine-tuning on real people (for example with Chilloutmix as the base model), you should look for other tools that serve the same purpose. And if you want to train an embedding (textual inversion) instead of a LoRA model, the required dataset also differs from what this article describes.
#Some pre-knowledge for better understanding
- Stable Diffusion is a text-to-image model that learns to denoise noisy images in reverse by observing noise being gradually added to images.
- LoRA is a model fine-tuning technique.
- Article: Hugging Face: Using LoRA for Efficient Stable Diffusion Fine-Tuning
- Comparison of several fine-tuning techniques: video: LoRA vs Dreambooth vs Textual Inversion vs Hypernetworks. Don’t want to watch the video? Here is a visual comparison chart.
In addition to this “basic knowledge”, when you run into a term you don’t understand, just spend about five minutes searching for the concept, and these terms will gradually form a system of knowledge. Most of the parameters that control AI generation are jargon, so in order to figure out what values are appropriate, we have to learn what they mean and operate on the basis of understanding. I found that searching for “<noun you don’t know>” plus one of [ML|stable diffusion|fine-tune|lora] works well (searching the noun alone easily gets drowned out by its more general meanings; adding a domain word helps a lot).
#Prepare the dataset
#Collecting pictures
First I found the article Stable Diffusion Training for Personal Embedding, which mentions this video; training a personal portrait model requires:
- three full-body pictures from different angles
- five half-body pictures from different angles
- and twelve face pictures from different angles
Preparing that dataset is as simple as finding a white wall and taking selfies, since the model is of yourself anyway. Virtual characters can’t pose for me… so where do the pictures come from? It depends on the character: whether the original work has multimedia derivatives, whether the character is popular… all of it has an impact. For example, Eric Fu’s Holo used illustrations from the light novels and the manga.
The character I want to train is a less popular one, and there are not many “official pictures” I can crop from. I learned from Eric Fu that “the character design needs to be consistent, while the drawing style can vary”, so… I opened my collection directory of fan art of her, and together with the “official pictures” I finally picked out about a hundred images (what, you don’t have a collection directory for your waifu? Your love is not enough, hahaha). Judging by the results, using images with different drawing styles in the dataset works fine.
Although I think it’s not a big deal if you only play with it yourself, using fan art to train AI is a cardinal sin in some people’s view. To be clear: if you train with fan art, don’t make a statement about it and don’t publish the model. I only did it because I had no other choice.
I don’t know whether this is the right approach, but my personal criterion for selecting pictures is “obvious features”; fine-tuning is, after all, about letting the model learn your target’s features. For example:
- ✔️ The character wears the same outfit as the official design (most of the pictures qualify).
- ✔️ The character wears a non-official outfit, but the hair, facial details, etc. are very similar to the official design.
- ✔️ Multi-character pictures; if you don’t want to waste them, crop your character out.
- ❌ The environment (light and shadow) strongly affects the character’s colors.
- ❌ The artist’s drawing style is too distinctive, or simply too good to be used.
- ❌ Non-color images.
#Remove picture backgrounds
First I looked at the portraits of the two people who trained personal-portrait embeddings, and it seems they purposely chose a white background for their selfies. Then I watched the video Glitch Tokens - Computerphile, which mentions that researchers found that the feature a certain model had learned for the text “dumbbell” was a muscular upper arm. This led me to suspect that image backgrounds are noise for character feature learning, so I decided to remove the backgrounds. (Later I learned that captions can be used to “subtract” certain features from an image so the model doesn’t learn them, but I don’t think that works as well as simply removing the background.)
It’s hard to find an automatic background removal tool with an ordinary internet user’s keywords like photo remove background; the vast majority of sites just want your money. I found a Photoshop script, but the result was not satisfactory. A friend recommended the project anime-segmentation - SkyTNT, which introduced me to image segmentation as an ML field, and the results my friend got with this project were very good. Thank you, my friend!
I built it successfully with Python 3.9; note that it only seems to accept png and jpg input. This converts the original images into png images containing just the character on a transparent background. Some of the outputs were not removed cleanly, and I didn’t bother to clean them up manually.
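Since only png and jpg seem to be accepted, I first batch-converted everything else in my collection. A minimal Pillow sketch of that step, assuming example directory names (`raw`, `dataset_in`) that are mine, not the project’s:

```python
# Convert every image in raw/ to PNG so anime-segmentation will accept it.
# Assumes Pillow is installed (pip install Pillow); directory names are examples.
from pathlib import Path
from PIL import Image

SRC = Path("raw")         # original collection (webp, bmp, ...)
DST = Path("dataset_in")  # what we feed to anime-segmentation
DST.mkdir(exist_ok=True)

for path in SRC.iterdir():
    if path.suffix.lower() in {".png", ".jpg", ".jpeg"}:
        # Already a supported format: just copy it over unchanged.
        (DST / path.name).write_bytes(path.read_bytes())
        continue
    try:
        img = Image.open(path).convert("RGB")  # drop any exotic modes
    except OSError:
        print(f"skip (not an image?): {path.name}")
        continue
    img.save(DST / (path.stem + ".png"))
    print(f"converted: {path.name}")
```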
#Anti-pattern: Pad the images into squares
First I experimented with the resize operation and found that it truncates the edges of the picture. I didn’t want to lose any features (especially the rare official-design shoes), so I found a Photoshop script online and padded every image into a square whose side is the larger of its width and height, so that the resize would keep the whole image. This step turned out to be useless: it makes the features smaller and blurrier, which is not good for training.
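For reference (and again, not recommended), the padding amounted to roughly the following sketch: paste each image onto a square transparent canvas whose side is the larger of width and height. Paths are hypothetical.

```python
# Pad each image to a square canvas (side = max(width, height)), keeping it centered.
# This is the anti-pattern described above: it shrinks the character after the 512x512 resize.
from pathlib import Path
from PIL import Image

SRC = Path("nobg")        # transparent-background PNGs
DST = Path("nobg_square")
DST.mkdir(exist_ok=True)

for path in SRC.glob("*.png"):
    img = Image.open(path).convert("RGBA")
    side = max(img.size)
    canvas = Image.new("RGBA", (side, side), (0, 0, 0, 0))  # transparent square
    offset = ((side - img.width) // 2, (side - img.height) // 2)
    canvas.paste(img, offset)
    canvas.save(DST / path.name)
```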
#Crop the face out
Judging from the portrait dataset, face pictures (up to the shoulders) account for quite a lot of it. But fan artists certainly don’t just draw ID photos, so it seems we need to get face pictures out of the (now transparent-background) fan arts. Having already tasted the sweetness of handling things with ML, I searched directly for anime face detect ML, and GitHub gave me several projects.
When picking a project, especially a Python ML project (I needed to write some code on top of it to crop out the images and save them), I care more about when it was last developed than about the star count, because old project dependencies tend to cause problems. anime-face-detector - hysts is the newest one, developed two years ago, but I spent a long time fighting version problems with its mmcv dependency and in the end failed to get it running. So I moved on to the second newest project (developed three years ago), anime-face-detector - qhgz2013.
Happily, it works on Python 3.9, so I forked the project and changed a couple of lines to make the cropped picture larger than the face-detection box, so that the hair, shoulders and animal ears are included (the ears extend quite a way up). Thanks to qhgz2013 for already providing a crop function, which saved me some trouble. In this way a face picture is cut out of each image, and the training set size is doubled.
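The change I made was essentially to expand the detector’s bounding box before cropping. The sketch below shows the idea; the boxes are whatever the detector returns as pixel coordinates, and the margin factors are my own choices, not values from the project:

```python
# Crop a region larger than the detected face box so hair, ears and shoulders are included.
# `boxes` are (x1, y1, x2, y2) pixel boxes from the face detector.
from pathlib import Path
from PIL import Image

def expand_box(box, img_w, img_h, factor=1.8, up_extra=0.4):
    """Grow the box by `factor`, with extra headroom upward for ears/hair."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2, y1 + h / 2
    new_w, new_h = w * factor, h * factor
    nx1 = max(0, cx - new_w / 2)
    nx2 = min(img_w, cx + new_w / 2)
    ny1 = max(0, cy - new_h / 2 - h * up_extra)  # extra space above the face
    ny2 = min(img_h, cy + new_h / 2)
    return int(nx1), int(ny1), int(nx2), int(ny2)

def crop_faces(img_path: Path, boxes, out_dir: Path):
    img = Image.open(img_path).convert("RGBA")
    for i, box in enumerate(boxes):
        region = img.crop(expand_box(box, img.width, img.height))
        region.save(out_dir / f"{img_path.stem}_face{i}.png")
```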
#Resize and Caption
Now that the dataset is prepared, open the web-ui -> Train -> Preprocess images, and follow the steps in this tutorial to:
- Resize the images to 512*512 (although LoRA supports non-512 sizes, I didn’t fully understand LoRA’s requirements for training-set images, so to be safe I stuck with the classic 512, the input image size of the stable diffusion v1.* models).
- Check Auto focal point crop.
- Check Use Deepbooru for caption to use DeepDanbooru - KichangKim, an image tag extraction model trained on anime characters. It automatically labels the training set; the resulting set of tags is called a caption, and as mentioned above, its purpose is to keep the model from learning (= reinforcing the weight of) the tagged features so that it focuses on learning the “new” (character) features.
Now we finally have the training set ready: a bunch of 512*512 images with corresponding caption files. Judging from the results, the images produced by both the face-detect model and web-ui replace the transparent background with a pure black background, and a training set with black backgrounds does affect the resulting model (it tends to generate pure black backgrounds). But if you give a background prompt, generation goes back to normal, so there is no need to worry about it.
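Before moving on, a quick sanity check of the preprocessed folder is useful: every image should be 512*512 and have a caption .txt next to it. A small sketch (the folder name is just an example of where web-ui’s Preprocess output ended up on my machine):

```python
# Verify the training set: each .png should be 512x512 and have a same-named .txt caption.
from pathlib import Path
from PIL import Image

DATASET = Path("preprocessed")  # web-ui Preprocess output directory (example path)

for img_path in sorted(DATASET.glob("*.png")):
    with Image.open(img_path) as img:
        if img.size != (512, 512):
            print(f"wrong size {img.size}: {img_path.name}")
    caption = img_path.with_suffix(".txt")
    if not caption.exists():
        print(f"missing caption: {img_path.name}")
```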
#Check and modify the captions
FYI: Do image captions have any effect during training?
After further searching and experimenting, I found captions to be of great use both for training and for using the model. I wrote a few simple scripts to do the related automation work. The character I trained is 姫宮真歩 Himemiya Maho from the mobile game “Princess Connect! Re: Dive”; I’ll use her as the example.
[pictures of Himemiya Maho]
#For training
By removing the character’s feature tags from the caption, the model is pushed to learn those features more strongly. These should be features the character has in every scene you might want to generate later (e.g. Shinpachi Shimura’s glasses). Costume tags, for example, should not be removed, because costumes change from scene to scene. For Maho, I removed the following tags:
- blunt_bangs
- eyebrows_visible_through_hair: eyebrows drawn “through” the bangs; not obvious on Maho, but more obvious on characters such as Inuyama Aoi.
- low_twintails
- *_breasts: why remove the breast tags? The initial version of the model, trained without removing them, generated huge breasts 100% of the time when no size was specified. So these tags were deleted to let the model learn Maho’s real size.
I didn’t remove very basic features such as fox_ear, fox_tail, (brown|gold) hair, long_hair, green_eyes; judging by my current understanding, it would also have been fine to remove them.
Conversely, if there are features you want to keep the model from learning, add the corresponding tag to the caption of each image. For Maho, I added bare_shoulders. Why? I used two outfits for training: the usual clothes (the kimono shown in the pictures above) and the swimsuit. In the initial version of the model, I observed that the vast majority of pictures generated with the usual clothes showed bare shoulders, oiran-style, which doesn’t match my impression of Maho. At first I thought the model had mixed the outfit up with the swimsuit (which does bare the shoulders), but when I checked the training set I found that many of the usual-clothes pictures I used were also off-shoulder. I couldn’t collect enough other qualified pictures for training, so how to solve it? In the end I decided to use the caption to steer what the model learns, and the experiment shows it worked well: Maho no longer shows her shoulders.
#For usage
The caption helps the model understand the images in the training set as it learns. It was also eye-opening for me, with my poor English vocabulary, to pick up a lot of new words that help me communicate with the model more precisely.
prompt for training pictures = caption (what the model doesn’t learn) + what the model learns
When it’s time to use the model, we can reverse this formula: tags from the caption + the model = something that looks like the training images! If you want to generate results similar to a particular image in the training set, look at its caption.
#Training
First of all, for the fine-tuning technique I followed Eric Fu and chose LoRA, because I heard that LoRA is not only newer on the timeline but also seems to have low memory requirements. LoRA itself is divided into several algorithms; I chose the LyCORIS model, the same as Eric Fu, taking the road others have already verified, since I don’t have enough ML knowledge to make the choice myself. LyCORIS’s README and the README of the lora-scripts - Akegarasu project (found in a forum tutorial) both point to the same project: sd-scripts - kohya-ss. The relationship between the three is:
- sd-scripts integrates fine-tuning scripts that are easy to use directly and can train with all four fine-tuning methods.
- lora-scripts further simplifies the use of sd-scripts for LoRA training, and each parameter has a recommended value, which is a big help.
- Run sd-scripts with parameters specifying LyCORIS as the training algorithm.
With Python 3.10, first install lora-scripts (which also pulls in the sd-scripts dependency), then install the LyCORIS library; the venv is shared by all three projects. The base model I used is [anything-v4.5](https://huggingface.co/andite/anything-v4.0). I modified lora-scripts to add the new parameters for LyCORIS. Note that sd-scripts lives inside the lora-scripts directory, but the dataset directory in the run configuration seems to need a leading .. (parent directory) (is the working directory of sd-scripts at runtime its own home directory?).
#Conclusion
It would be nice if I could use techniques like Feature Visualization (for image models) or token clustering (for text models) to visually study the inside of my LoRA model; I’m curious which features it learned from Maho. There don’t seem to be any ready-made tools for this built on the stable diffusion model, so I gave up. And I’m still not quite sure how captions work, and I can’t find any more explanation online. If you know of relevant material, please don’t hesitate to share.
I hope everyone can fine-tune their waifus successfully; set your hopes (or delusions?) free.