Everybody and their dog has been talking about image generation for a while now, and it’s never been easier to jump on one of the gazillions of apps that let you do it by prompting. But entertaining as it is, there are two good reasons to DIY:
Cost control: you avoid per-image API fees and know exactly what costs you are incurring.
Model fine-tuning: you can adapt a model to specific styles, brands, or domains. This is often impossible with closed platforms like Gemini due to their alignment (read: censorship) policies, where even the most innocent prompts trigger a block.
I hope I’ve convinced you that it's worth learning to generate images with open-weights models. In the remainder of this piece, I will show you how.
New image generation models are popping up regularly; I decided to go for Flux, which strikes a good balance between quality and compute requirements. Flux is a family of text-to-image generation models developed by Black Forest Labs - an outfit founded by the former Stability AI researchers who created Stable Diffusion (the previous king of the hill). Flux is available in several versions: a high-performance "Pro" model, an open-weight "dev" model for non-commercial use, and a fast "Schnell" version optimized for speed. Project website: https://flux-ai.io/
If you want to understand how the diffusion models do what they do, check out the “intro” post at the bottom of this one. Full notebook for those interested (including detailed explanations of the code):
https://github.com/tng-konrad/united_states_of_banan/blob/main/diffusion_image_generation.ipynb
The entire notebook should work elsewhere, but Colab is where I tested it - no matter your preferences, chances are that in 2025 you have a Google account and can access Colab. The first thing is to install the necessary libraries:
!pip install -qU gradio transformers diffusers accelerate safetensors
We're grabbing everything we need from the wonderful world of Hugging Face: diffusers to handle the heavy lifting of image generation, transformers to understand our text prompts, and a few other helpers to make everything run smoothly. Some of the libraries come preinstalled on Colab, but the field is changing rapidly and you frequently need the most recent version of a library - and there's no guarantee the underlying image is fresh enough.
Import what we need:
import torch
from diffusers import FluxPipeline
from random import sample
import os
import itertools
from IPython.display import Image
The main thing here is FluxPipeline from diffusers: courtesy of the creators of this wonderful package, most of the details are abstracted away and all we need to care about are parameters.
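One housekeeping note: the snippets below read their settings from a small CFG object that is defined in the full notebook but not shown here. A minimal sketch of what it could look like - the model id, dtype, and counts are my assumptions based on how CFG is used later, not the notebook's actual values:

import torch

class CFG:
    model = "black-forest-labs/FLUX.1-schnell"  # assumed open-weights checkpoint; the "dev" variant also works
    dtype = torch.bfloat16                      # half precision keeps VRAM usage manageable
    infsteps = 8                                # denoising steps per generated image
    howmany = 2                                 # images per prompt in the batch loop further down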
pipe = FluxPipeline.from_pretrained(CFG.model, torch_dtype = CFG.dtype)
pipe.enable_model_cpu_offload()
What’s going on here? The from_pretrained function is doing the heavy lifting of downloading the model for us. enable_model_cpu_offload() is a neat trick to save GPU memory by shuffling parts of the model to the CPU when they're not being used.
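If the model still doesn't fit in your GPU memory, diffusers offers a more aggressive (and slower) alternative - an option the library provides, not something used in the notebook:

# offloads the model layer by layer instead of component by component:
# much lower VRAM usage, at the cost of noticeably slower generation
pipe.enable_sequential_cpu_offload()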
So that takes care of the model side of things - but this is a text-to-image model, which is where the prompt comes in: here we describe the image we want to create:
prompt = """
A statuesque beautiful woman sitting on a dark yellow platform, wearing long blue dress and barefoot in bright room, side view, full body shot, black hair, white walls, sunlight from window, soft shadows, watercolour and alcohol ink paint art abstract
"""
You can go elaborate or simplistic - the only real limitation is the token limit, which for Flux is 512 tokens. We can now call the pipe, with two non-obvious arguments:
guidance_scale: the best way to think about it is as a creativity knob. A higher value forces the model to stick very closely to your prompt, while a lower value gives it more creative freedom to interpret the text.
num_inference_steps: the number of steps the model takes to denoise the image from random static into your final picture. More steps lead to a more detailed, higher-quality image, but take longer to generate; fewer steps are faster but might result in lower quality. A value between 8 and 20 is often a good sweet spot for Flux.
out = pipe(
    prompt=prompt,
    guidance_scale=3.5,
    height=768,
    width=1360,
    num_inference_steps=CFG.infsteps,
).images[0]
out.save("image.png")
Open the image and voila:
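Since we imported Image from IPython.display earlier, we can view the result right inside the notebook:

Image("image.png")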
That’s fun, but can we do more? We might want to create multiple images in one go, testing different options for the characters shown therein. The fastest way to do it is to generate multiple prompts programmatically: first we set up lists of characteristics / dimensions along which we will vary - gender, origin, age, profession, and image style.
gender_list = ['woman', 'man']
origin_list = [ 'North European', 'Middle Eastern', 'South East Asian' ]
age_list = ['young', 'middle aged', 'elderly']
profession_list = [ 'doctor', 'athlete', 'singer']
style_list = ['realistic photograph', 'Rembrandt painting', 'minimalist graphic']
Then we combine the lists into a Cartesian product and map it into a list:
totality = [origin_list, age_list, gender_list, profession_list, style_list]
combo_list = list(itertools.product(*totality))
Add a few extra words and we’re good to go:
prompt_list = []
for (origin, age, gender, profession, style) in combo_list:
    prom = f"Cinematic, full-body image of {origin} {age} {gender} {profession}, in the style of {style}"
    prompt_list.append(prom)
Which gives us prompts like these:
'Cinematic, full-body image of South East Asian middle aged man doctor, in the style of Rembrandt painting',
'Cinematic, full-body image of Middle Eastern middle aged woman singer, in the style of minimalist graphic',
'Cinematic, full-body image of North European young woman athlete, in the style of realistic photograph',
We can re-use the code from earlier and just loop over the prompts:
# a seeded generator for reproducibility - 'g' is used but not defined in the excerpt, so we create one here (seed is arbitrary)
g = torch.Generator("cpu").manual_seed(42)

for (ii, prompt) in enumerate(prompt_list):
    for jj in range(CFG.howmany):
        image = pipe(prompt=prompt, num_inference_steps=CFG.infsteps, generator=g).images[0]
        imgname = f"img_{ii}x{jj}.jpg"
        image.save(imgname)  # save() returns None, so don't reassign it to image
    print(prompt)
A few selected examples of the results:
As you can see, the diffusers library makes it very straightforward to get started with a powerful model like Flux. In just a few lines of Python, we've gone from installing libraries to generating a high-quality image, and even creating a whole batch of diverse portraits by programmatically combining prompts.
As usual, I encourage you to take this code and make it your own. Try different prompts, mess with the guidance_scale and num_inference_steps to see how they affect the output, or even swap in a different open-weights model from Hugging Face.
If you're interested in learning more, here is a theoretical intro (absolutely minimal jargon, pinky swear):
Diffusion models: the intro
In this post I will explain the core ideas powering image generation models - in a rigorous manner, but with the least amount of jargon. In the good tradition of academic books (once a math guy, always a math guy), we start with the theory.
And a reading list: