I’ve realized something after writing this blog and talking with friends and co-workers about generative AI. It seems that for most people, even in technology, it still flies under the radar. There is a lot of competition for our attention in everyday life, particularly in IT, where it can be a challenge to keep up with all the change. There are headlines about new technology every day, and the signal easily gets lost in the noise.
Because of this, I want this post to put generative AI in context and provide some direction for anyone who is interested but has no idea where to start with the technology.
Feel free to skip to the end of this post if all you care about is where to start.
Generative AI in historical context:
Artificial Intelligence concepts have been around for a long time, but research took off in the 1950s with approaches that encoded information and processes as procedural algorithms and structured data. Out of this approach eventually came expert systems and software that could perform “intelligent” tasks like playing checkers and chess, parsing language and conversing, pattern matching, and so on. All of it artificial narrow intelligence, or ANI.
With the development of neural networks, inspired by nature, it became possible to create systems that learned from examples, from data, without knowledge having to be explicitly programmed. They can be thought of as sophisticated statistical models, or n-dimensional functions, that encode numerical abstractions of information. They are black boxes (for practical purposes), in contrast to the earlier procedural systems, where it wasn’t difficult to explain how any given output decision was made. Until recently, these systems were still only capable of performing a specific task: still narrow intelligence.
There has been an explosion of neural network model development in the last decade. The amount of training data available has grown exponentially, as has the processing power available to these systems. More sophisticated neural network designs and techniques have been developed that allow these systems to exhibit more general learning capabilities. Google’s DeepMind, with systems like AlphaGo and AlphaZero, is an example.
Generative AI is not a new topic, but early generative models were mostly limited to simple models like Gaussian mixture models (GMMs) and hidden Markov models (HMMs). With the rise of powerful GPUs and large training datasets, deep learning has made it possible to train much more complex models that are capable of generating realistic data. This has led to a resurgence of interest in generative AI with the ability to generate realistic images, videos, and even speech.
A key inflection point came in 2017, when a team at Google Brain introduced transformer models. They have since become the de facto model of choice for NLP problems, as they exhibit significantly better performance and their design allows for parallel processing of input.
Early large language models built on transformer networks used millions of parameters, but researchers continue to scale them up. GPT-2 has 1.5 billion parameters, and GPT-3, created by OpenAI in 2020, has 175 billion. GPT-3 has since been succeeded by a number of other models with over half a trillion parameters, as well as by smaller models that outperform it.
Just as other neural networks are trained for a specific task, these large language models are trained on one task: to predict the next word when provided with some text input (it’s actually tokens, not words, but you can think of them as words).
For example, one would expect it to complete the text “why did the chicken cross the” with “road”. GPT-3 was trained on a large amount of the content on the internet, including Wikipedia, books, Reddit, and so on, and it does work as expected with that type of input.
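If you want to see this next-word prediction in action, here is a minimal sketch (my own, not from any official tutorial) that uses the freely available GPT-2 model through Hugging Face’s transformers library; the prompt and generation settings are just illustrative:

```python
# Minimal sketch of next-token prediction with GPT-2
# (pip install transformers torch). Model choice and settings are illustrative.
from transformers import pipeline

# Load a small, publicly available language model.
generator = pipeline("text-generation", model="gpt2")

prompt = "why did the chicken cross the"
# Ask the model to continue the prompt; it predicts one token at a time.
result = generator(prompt, max_new_tokens=5, do_sample=False)
print(result[0]["generated_text"])
```

GPT-2 is far smaller than GPT-3, so its continuations are less impressive, but the underlying task is the same.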
GPT-3 also exhibits emergent abilities that you wouldn’t necessarily expect from the ability to predict text. Some examples are:
- Write original content that can often pass as human generated
- Summarize articles
- Write poems and songs in a specific style
- Write plots for TV shows
- Respond in the voice or perspective of a specific person
- Perform basic reasoning and basic math
- Exhibit theory of mind
- Many more tasks
In the imaging space, we have diffusion models that are trained on images (along with their text descriptions), and are capable of producing new images based on a text description (a prompt). Dall-E (also from OpenAI) and Stable Diffusion (from StabilityAI) are two of the more popular models at this point in time.
There is a bit of a gold rush at the moment, with companies wrapping these models in friendly user interfaces. Some of them will disrupt other industries, some will themselves be disrupted as the landscape rapidly evolves, and some, possibly many, will not survive.
What these models are capable of right now is less interesting to me than what they will be capable of as they improve. It seems reasonable to think models could be capable of discovering better models, setting up a rapid cycle of improvement.
You can do interesting things with these models today, of course. Here is some info to get started:
There are various LLMs (large language models) you can try out, but GPT-3 is one of the better ones, and one of the more accessible. You can sign up for an account at OpenAI and access the interactive playground on their web site to query GPT-3 in various ways. You have control over many of the parameters, if you choose to tweak them, and you can create API keys to interact with it programmatically. They helpfully provide code samples for the interactions you perform in the UI.
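To give a rough feel for the programmatic side, here is a hedged sketch of what a completion request looks like with OpenAI’s Python library; the model name and parameters are examples of mine, so check their documentation for current values:

```python
# Rough sketch of querying GPT-3 with the openai Python library
# (pip install openai). Model name and parameters are illustrative.
import openai

openai.api_key = "YOUR_API_KEY"  # created on the OpenAI account page

response = openai.Completion.create(
    model="text-davinci-003",     # a GPT-3 completion model
    prompt="why did the chicken cross the",
    max_tokens=10,                # how much text to generate
    temperature=0.7,              # higher values give more varied output
)
print(response.choices[0].text)
```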
This is what the playground UI looks like:

In the image generation space there are a number of options to choose from, but the most popular publicly available models are Dall-E, Stable Diffusion and MidJourney.
They each have their strengths and weaknesses.
MidJourney is known for producing images with a more artistic aesthetic. It tends to use complementary colors and soft edges. It is very popular for fantasy art and does a nice job with scenery. It seems to have more trouble getting the shapes of man-made objects correct. I sometimes use it as a starting point, and then use Dall-E to refine it with inpainting (more on that below).
Here are a few MidJourney examples:



You can try MidJourney for free. You use their service via chat, on Discord: you make requests to their bot and it responds with images. You can also view what others are creating in various channels.
Stable Diffusion is quite popular and excels at producing fine detail. I would definitely choose it over MidJourney for producing realistic photographic images. It seems to have more trouble with complex prompts containing many visual elements. It is permissive and does not filter or disallow types of imagery (political, NSFW, and so on).
Here are some Stable Diffusion examples:



You can try Stable Diffusion using DreamStudio on StabilityAI’s web site. The UI allows you to perform inpainting, where you erase part of the picture and regenerate it with content according to your prompt; basically a way of iterating on an image to refine it (or drastically change it). You can try it out for free with the credits they provide when you create an account. It is also possible to run Stable Diffusion on your own hardware, a topic way too big for this blog post.
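That said, if you are curious what running it locally roughly looks like, here is a bare-bones sketch using Hugging Face’s diffusers library; the model id, prompt, and settings are my own illustrative choices, not a definitive recipe:

```python
# Bare-bones sketch of running Stable Diffusion locally with diffusers
# (pip install diffusers transformers accelerate torch).
# A GPU with several GB of VRAM helps; model id and prompt are illustrative.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # move the model to the GPU

# Generate an image from a text prompt and save it.
image = pipe("a lighthouse on a cliff at sunset, photorealistic").images[0]
image.save("lighthouse.png")
```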
Lastly, Dall-E offers functionality similar to Stable Diffusion’s, but it is better at parsing complicated prompts and complex scenes.
Here are some Dall-E examples:



You can try out Dall-E for free at OpenAI. Their UI also allows inpainting, generating variations, and uploading your own image to manipulate.
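As with GPT-3, you can also generate images programmatically. Here is a rough sketch using OpenAI’s Python library; treat the prompt and parameters as illustrative, since the API details change over time:

```python
# Quick sketch of generating an image with Dall-E through the openai
# Python library. Prompt, size, and count are illustrative examples.
import openai

openai.api_key = "YOUR_API_KEY"

response = openai.Image.create(
    prompt="a watercolor painting of a fox reading a book",
    n=1,                # number of images to generate
    size="1024x1024",   # supported sizes are limited; check the docs
)
print(response["data"][0]["url"])  # temporary URL to the generated image
```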
Hopefully this provides some context for my earlier posts and is enough for you to get started if you didn’t know where to begin.
Corrections and suggestions are welcome if I’ve made any errors in this post.