I want to thoughtfully reverse engineer why Midjourney works so much better than DALL-E. On their Discord server the bot shows intermediate results, and they are blurry pictures, which for me confirms that they use a diffusion model. So, text-to-picture is done with diffusion. What could the dataset for that diffusion be?
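Here is a minimal sketch of why a DDPM-style sampling loop shows blurry intermediates: every step before the last is still partly noise. The `toy_denoiser` below is an untrained stand-in I made up just to show the shape of the loop; a real model would be a text-conditioned UNet, and I have no idea what Midjourney actually runs.

```python
import torch

# Untrained stand-in for a noise-prediction UNet (illustration only, my assumption).
toy_denoiser = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(16, 3, 3, padding=1),
)

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # standard DDPM noise schedule
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)

x = torch.randn(1, 3, 64, 64)                # start from pure noise at 64x64
snapshots = []

for t in reversed(range(T)):
    eps_pred = toy_denoiser(x)               # a real model is also conditioned on t and the prompt
    coef = betas[t] / torch.sqrt(1.0 - alphas_bar[t])
    mean = (x - coef * eps_pred) / torch.sqrt(alphas[t])
    noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
    x = mean + torch.sqrt(betas[t]) * noise
    if t % 200 == 0:
        snapshots.append(x.clone())          # these intermediates are the blurry previews

print(len(snapshots), snapshots[0].shape)
```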
The diffusion model produces small pictures, I assume 64x64 or 128x128. Then GAN models handle the upscaling, adding details and polishing. Here's another assumption of mine: it applies a gentle StyleGAN that doesn't change everything to Van Gogh style but adds an artistic touch; this works in the range from 64x64 to 256x256. And then a super-resolution GAN makes it 1024x1024. Are DeviantArt pictures the dataset for the StyleGAN here?
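To make the guess concrete, here is the cascade as code. Every stage below is a placeholder (plain upsampling instead of real models), because the whole pipeline is my assumption, not anything confirmed about Midjourney; the function names are mine.

```python
import torch
import torch.nn.functional as F

# Hypothetical cascade: all three stage functions are placeholders standing in
# for the models I am guessing at.

def base_diffusion_64(prompt: str) -> torch.Tensor:
    """Stage 1: text-conditioned diffusion at low resolution (guess: 64x64)."""
    return torch.rand(1, 3, 64, 64)              # placeholder output

def artistic_refiner_64_to_256(img: torch.Tensor) -> torch.Tensor:
    """Stage 2: a 'gentle' StyleGAN-like pass adding detail and artistic touch (guess)."""
    return F.interpolate(img, size=256, mode="bilinear", align_corners=False)

def superres_256_to_1024(img: torch.Tensor) -> torch.Tensor:
    """Stage 3: super-resolution GAN to the final 1024x1024 (guess)."""
    return F.interpolate(img, size=1024, mode="bilinear", align_corners=False)

out = superres_256_to_1024(artistic_refiner_64_to_256(base_diffusion_64("a castle at sunset")))
print(out.shape)                                  # torch.Size([1, 3, 1024, 1024])
```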
From the text about the IF model I understand that models working in pixel space can generate normal text, but models operating in latent space produce gibberish text and have problems with hands.
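Rough arithmetic on why latent-space models lose small text, assuming a Stable-Diffusion-style autoencoder with an 8x spatial downsampling factor (that factor is typical for such autoencoders, not something I know about any specific product):

```python
# Why fine text strokes get lost in latent space (assumed 8x downsampling autoencoder).
image_size = 512
downsample = 8                      # typical VAE downsampling factor
latent_size = image_size // downsample
print(latent_size)                  # 64 -> each latent cell covers an 8x8 pixel patch

glyph_height_px = 12                # a small letter in the image
glyph_height_latent = glyph_height_px / downsample
print(glyph_height_latent)          # 1.5 latent cells: fine strokes barely register
```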
I saw diffusion tutorials with butterflies, irises, faces: when all pictures are very similar, it's quick to find the distribution for such a set and start generating something similar. But in research papers they use CIFAR-10 (only ten classes of real-life photos) and it immediately falls apart, producing garbage pictures. This is what happens with DALL-E. DALL-E certainly doesn't have a StyleGAN inside; it outputs directly from the diffusion model any garbage that was in the training set. I assume they just trained it long enough to make it accurate with details.
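For reference, this is roughly what those narrow-dataset tutorials boil down to: a condensed sketch assuming the HuggingFace `diffusers` and `datasets` packages, using the small butterflies dataset those tutorials usually use. None of this is from Midjourney or DALL-E; hyperparameters are illustrative.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import transforms
from datasets import load_dataset
from diffusers import UNet2DModel, DDPMScheduler

preprocess = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),
])

# Narrow, coherent dataset: a few thousand butterfly photos.
dataset = load_dataset("huggan/smithsonian_butterflies_subset", split="train")
dataset.set_transform(lambda batch: {"images": [preprocess(im.convert("RGB")) for im in batch["image"]]})
loader = DataLoader(dataset, batch_size=16, shuffle=True,
                    collate_fn=lambda b: torch.stack([x["images"] for x in b]))

model = UNet2DModel(
    sample_size=64, in_channels=3, out_channels=3,
    block_out_channels=(64, 128, 256),
    down_block_types=("DownBlock2D", "DownBlock2D", "AttnDownBlock2D"),
    up_block_types=("AttnUpBlock2D", "UpBlock2D", "UpBlock2D"),
)
scheduler = DDPMScheduler(num_train_timesteps=1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for clean_images in loader:                   # one batch is enough to show the loop
    noise = torch.randn_like(clean_images)
    timesteps = torch.randint(0, 1000, (clean_images.shape[0],))
    noisy_images = scheduler.add_noise(clean_images, noise, timesteps)
    noise_pred = model(noisy_images, timesteps).sample
    loss = torch.nn.functional.mse_loss(noise_pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    break
```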
So, this is the dilemma: a limited set of objects and a refined dataset, and thus a quick training process, or a broad, messy dataset with unknown training time and unpredictable results?
One person said about the current state of image generation: there is no single model that will generate a hi-res picture on a random topic; it's a combination of different methods and tools, which is in some sense close to professional photo manipulation, and that's where any discussion about AI loses its meaning. In Midjourney, look closely at pictures with water: how do they create the reflections?
So to reverse engineer Midjourney results, we need to ask: what if the image only got a few passes of StyleGAN? What if it's just a transformation of some existing picture from the Internet? I also think that they mix elements from different pictures together.
Transformation in latent space is a big deal from the standpoint of artists and photo editors. It takes only the core concepts from a picture, like light, composition, perspective, and objects, and can alter many of the details.
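Here is a toy illustration of that kind of mixing, in the spirit of StyleGAN's style mixing: coarse layers take composition and perspective from one source, fine layers take texture and detail from another. `ToyGenerator` is my own stand-in, not a real StyleGAN, and the latents below are random rather than inverted from real pictures.

```python
import torch

# Toy generator: each layer is "styled" by its own latent vector, so swapping
# per-layer latents mixes two sources (coarse = composition, fine = detail).
class ToyGenerator(torch.nn.Module):
    def __init__(self, num_layers=8, latent_dim=512):
        super().__init__()
        self.layers = torch.nn.ModuleList(
            torch.nn.Linear(latent_dim, latent_dim) for _ in range(num_layers)
        )
        self.to_rgb = torch.nn.Linear(latent_dim, 3 * 64 * 64)

    def forward(self, per_layer_latents):        # one latent vector per layer
        h = torch.zeros_like(per_layer_latents[0])
        for layer, w in zip(self.layers, per_layer_latents):
            h = torch.relu(layer(h + w))
        return self.to_rgb(h).view(-1, 3, 64, 64)

G = ToyGenerator()
w_a = [torch.randn(1, 512)] * 8                  # latents "inverted" from picture A (composition source)
w_b = [torch.randn(1, 512)] * 8                  # latents "inverted" from picture B (detail source)

mixed = w_a[:4] + w_b[4:]                        # coarse layers from A, fine layers from B
img = G(mixed)
print(img.shape)                                 # torch.Size([1, 3, 64, 64])
```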
I lean toward small but coherent datasets. This will make every diffusion model precise for its little purpose. Which means I need to work on my dataset. It already has some categories, and I will start by training boobs and tits recognition models using a GAN, but to generate something interesting I need to recognize poses and generate new ones (I saw that GAN models can do it). Pose recognition is very helpful for robots anyway.
That's how they could learn new movements, even though their bone and "muscle" structure is different. But maybe they can imitate to some degree.
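For the pose-recognition half, an off-the-shelf model already gives 2D keypoints; here is a sketch with torchvision's Keypoint R-CNN (the COCO 17-keypoint model, torchvision >= 0.13). Generating new poses would need a separate generative model on top of these keypoints; this only covers recognition.

```python
import torch
import torchvision

# Pretrained person keypoint detector: returns 17 COCO joints per detected person.
model = torchvision.models.detection.keypointrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)                  # stand-in for a real photo, values in [0, 1]
with torch.no_grad():
    output = model([image])[0]

for person_keypoints, score in zip(output["keypoints"], output["scores"]):
    if score > 0.8:                              # keep confident detections only
        print(person_keypoints.shape)            # (17, 3): x, y, visibility per joint
```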
The images in STL-10 have a lot of variation, meaning more "features" need to be encoded in the latent space to achieve a good reconstruction. Using a dataset with less variation (and the same latent vector size) should result in a higher-quality reconstructed image.
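To make that concrete, here is a minimal convolutional autoencoder with a fixed latent vector, sized for 96x96 STL-10 images. With the bottleneck held constant, reconstructions improve only if the data gets less varied, which is exactly the trade-off above. The architecture and `LATENT_DIM` are arbitrary choices of mine.

```python
import torch
import torch.nn as nn

LATENT_DIM = 128  # fixed bottleneck: everything must squeeze through these 128 numbers

class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),    # 96 -> 48
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),   # 48 -> 24
            nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),  # 24 -> 12
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(128 * 12 * 12, LATENT_DIM),
        )
        self.decoder = nn.Sequential(
            nn.Linear(LATENT_DIM, 128 * 12 * 12),
            nn.ReLU(),
            nn.Unflatten(1, (128, 12, 12)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),  # 12 -> 24
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),   # 24 -> 48
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),    # 48 -> 96
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.rand(4, 3, 96, 96)                     # stand-in for an STL-10 batch
recon = model(x)
print(recon.shape, f"compression: {3 * 96 * 96 / LATENT_DIM:.0f}x")
```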