I’ve written extensively on the alien technology that is large language models (LLMs) and transformers: magical files full of numbers that can take thousands of tokens as input and generate thousands of tokens as output.
If you’re like me, you may have wondered:
How does a machine’s creativity compare to that of a human? Can AI truly be original, or is it merely a reflection of its training data?
In this article, I’ll give you a glimpse into three recent research papers that delve into the heart of these questions, benchmarking generative AI against human performance in standardized creativity tests.
The results are both enlightening and complex: AI shows remarkable fluency in content generation, yet only average creative quality. The findings also underscore the need to reassess traditional creativity metrics when applied to AI, and highlight that human “prompters” still play an integral role in getting the most creativity out of the models.
🌍 Recommended: Annual Income of Prompt Engineers in the US (ChatGPT)
Creativity has been seen as a uniquely human trait
But is it really?
💡 Let’s define creativity as the ability to produce ideas or works that are both original and adapted to a specific situation.
This definition, however, has been challenged by the emergence of Generative Artificial Intelligences like ChatGPT, Bard, LLaMA, and Claude, which many developers claim possess creative capabilities.
AI systems are designed to perform tasks that normally require human intelligence, and have been refined for various professional domains such as music, manufacturing, therapy, human resources, and health.
| Pros | Cons |
|---|---|
| High Fluency: Can generate a large number of ideas quickly. | Lack of Originality: Often relies on existing data, limiting true originality. |
| Speed: Can produce content much faster than humans. | Potential for Plagiarism: May inadvertently recreate existing works. |
| Consistency: Can maintain a consistent style and tone. | Repetition and Predictability: May fall into repetitive patterns. |
| Collaboration with Humans: Can work with human creators to enhance creativity. | Lack of Emotional Insight: Cannot infuse work with personal emotion or insight. |
| Adaptability: Can be trained to work in various creative domains. | Inconsistency in Judging Creativity: Struggles to accurately judge creativity. |
| Accessibility: Makes creative tools more accessible to people. | Ceiling Effect in Scoring: May reach maximum scores, indicating lack of nuance. |
| Data-Driven Insights: Can utilize large datasets for creative insights. | Lack of Intentionality: Lacks conscious intention behind creative acts. |
| Cost-Effective: Can be more cost-effective than human labor in some cases. | Potential Loss of Human Creativity: May lead to standardized, less innovative content. |
From an academic perspective, studies have begun to focus on the creative capabilities of AIs, with some impressive results.
GPT-4 scores top 1% in creativity
For instance, ChatGPT’s GPT-4 version has shown the ability to reach the top 1% of the general human population in fluency and the top 3% in flexibility on creativity tests.
📄 Research Study AI Tests Into Top 1% for Original Creative Thinking The study conducted by the University of Montana, led by Dr. Erik Guzik, found that the AI application ChatGPT, powered by GPT-4, can match the top 1% of human thinkers in creativity. Using the Torrance Tests of Creative Thinking (TTCT), the researchers found that ChatGPT excelled in fluency and originality, scoring in the top percentile, and was in the 97th percentile for flexibility. The results were compared with a control group of students and a national sample, with ChatGPT outperforming the majority. The findings suggest that AI is developing creative abilities on par with or even exceeding human ability, and may become a significant tool in business and innovation.
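To make the percentile framing concrete, here is a toy sketch of how a single score can be ranked against a human norm group using SciPy's `percentileofscore`. The norm values and the AI score below are invented for illustration, not actual TTCT data:

```python
# A toy sketch of ranking a score against a human norm group, assuming
# hypothetical norm data (NOT actual TTCT norms) and SciPy installed.
from scipy.stats import percentileofscore

human_fluency_norms = [18, 22, 25, 27, 30, 31, 33, 35, 38, 41]  # made up
ai_fluency = 45  # hypothetical AI score

pct = percentileofscore(human_fluency_norms, ai_fluency)
print(f"AI fluency is at the {pct:.0f}th percentile of the norm sample")
```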
Let’s delve into another recent scientific study, conducted by French researchers, which shows that ChatGPT surpasses the average creativity of human participants: 👇
Creativity is more than just Fluency
📄 Research Study The Creative AI-Land: Exploring new forms of creativity To explore the creativity of Generative Artificial Intelligences (GAIs), the study authors tested 100 "individuals," comprising 50 GPT-3.5 and 50 GPT-4 models. To assess originality and creativity, the authors used the EPoC verbal test in French, consisting of two sessions with various tasks (AUT, DV1, IV1, DV2, IV2). Since AIs have no notion of time, "relaunch" instructions were used to simulate time constraints. Scoring was based on fluency, elaboration, and creativity, with both human researchers and ChatGPT itself evaluating the stories. Inter-rater reliability was satisfactory, and the results indicated a superior performance by GPT-4 over GPT-3.5 in generating more ideas, as confirmed by statistical tests like ANOVA and the Games-Howell post-hoc test. However, some texts were noticeably plagiarized from well-known stories. The study also found a recurrence of certain character names in the stories, and ChatGPT's own scoring of creativity was inconsistent and uncorrelated with human judges. Hierarchical clustering analyses revealed three general story types with variations, and the multifactorial approach using EPoC's norms showed that GPT-4 performed better on the divergent verbal quotient (DVQ) but not exceptionally high on the verbal integrative quotient (IVQ).
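For readers curious about the statistics mentioned, here is a minimal sketch of that kind of model comparison: a one-way ANOVA followed by a Games-Howell post-hoc test. The fluency scores are hypothetical, and the pingouin package is just one of several libraries implementing Games-Howell:

```python
# A minimal sketch of the study's model comparison, assuming hypothetical
# fluency scores (ideas per task) and the pandas/scipy/pingouin packages.
import pandas as pd
from scipy.stats import f_oneway
import pingouin as pg

# Hypothetical fluency counts for five runs per model (not the study's data)
df = pd.DataFrame({
    "model": ["gpt-3.5"] * 5 + ["gpt-4"] * 5,
    "fluency": [12, 14, 11, 13, 12, 18, 21, 19, 20, 22],
})

# One-way ANOVA: does mean fluency differ between models?
groups = [g["fluency"].values for _, g in df.groupby("model")]
f_stat, p_value = f_oneway(*groups)
print(f"ANOVA: F={f_stat:.2f}, p={p_value:.4f}")

# Games-Howell post-hoc test (robust to unequal group variances)
print(pg.pairwise_gameshowell(data=df, dv="fluency", between="model"))
```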
The researchers provide an interesting interpretation:
💡 “In a creative test that would only consider scores such as fluency or elaboration, LLMs appear to be far better than humans. When put side by side, the ideas provided by ChatGPT are not particularly creative. The IVQ scores show scores slightly above average, but within the first standard deviation of the EPoC scoring system. These results qualify the “creative performance” promised by the developers. The [AIs] are indeed capable of generating a great deal of content, but what people find “creative” is rather put aside when the AI is faced with a standardized and newly defined protocol. The problem here surely lies in the format of LLMs, which successively predict which word should come after the other, depending on the command given.”
Let’s dig a bit deeper into the study results:
GPT-4 Outperforms GPT-3.5
The results showed slightly better performance for GPT-4 compared to GPT-3.5.
Significant differences were found in the AUT and divergence tasks (DV1 and DV2) for fluency, confirming GPT-4’s superiority in generating a large number of ideas. However, no significant differences were found in elaboration between the two models.
The absence of significant differences in elaboration was attributed to the standardized methodology and character limitations of ChatGPT.
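As a toy illustration of the two metrics, fluency can be operationalized as the number of distinct ideas and elaboration as the average amount of detail per idea. The responses below are invented for this sketch, loosely modeled on the Alternate Uses Task:

```python
# A toy operationalization of the two scores, assuming each response is a
# list of idea strings (the ideas below are invented for illustration).
responses = [
    "use a brick as a doorstop",
    "grind a brick into pigment for cave-style paintings",
    "stack bricks into a makeshift bookshelf",
]

fluency = len(responses)  # fluency: number of distinct ideas
elaboration = sum(len(r.split()) for r in responses) / fluency  # avg words/idea

print(f"fluency={fluency}, elaboration={elaboration:.1f} words per idea")
```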
Plagiarism Concerns
A detailed qualitative analysis revealed that some texts were noticeably plagiarized from well-known stories, such as “Alice in Wonderland.”
This finding highlighted the statistical nature of LLMs, which may yield similar stories in content and form.
Additionally, the recurrence of certain character names across stories was noted, further underscoring the statistical nature of the models’ responses.
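One rough way to flag such near-duplicates programmatically is plain string similarity. The sketch below uses Python’s standard-library difflib; the texts and the threshold are hypothetical, and the study’s actual analysis was a qualitative one:

```python
# A rough sketch of flagging near-duplicates with plain string similarity.
# Texts and the 0.6 threshold are hypothetical; the study's analysis was
# qualitative, not based on this exact method.
from difflib import SequenceMatcher

known_story = "alice was beginning to get very tired of sitting by her sister"
generated = "alice grew very tired of sitting by her sister on the bank"

ratio = SequenceMatcher(None, known_story, generated).ratio()
if ratio > 0.6:  # arbitrary cut-off for this sketch
    print(f"Possible plagiarism detected: similarity = {ratio:.2f}")
```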
Inconsistency of ChatGPT as a Judge
When ChatGPT was used to assess the creativity of its own stories, it showed inconsistency, with low inter-judge reliability and correlations not significantly different from zero. This result indicated that ChatGPT was not reliable in assessing creativity according to the EPoC system.
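To illustrate what “correlations not significantly different from zero” means in practice, here is a minimal sketch correlating two sets of ratings. The scores are invented; a Pearson r near zero would indicate no reliable agreement between the judges:

```python
# A minimal sketch of checking judge agreement, assuming hypothetical
# creativity ratings from a human judge and from ChatGPT on the same stories.
from scipy.stats import pearsonr

human_scores   = [4, 7, 5, 8, 6, 3, 7, 5]
chatgpt_scores = [6, 4, 7, 5, 5, 6, 4, 7]

r, p = pearsonr(human_scores, chatgpt_scores)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")  # r near 0 => no reliable agreement
```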
Correlation Matrix and Clustering Analysis
The correlation matrix revealed moderate to strong positive correlations between various tasks, suggesting associated generative capacity and elaboration.
Hierarchical clustering analyses, using indices like the Silhouette Index and DBI, identified three general story types present in most conditions, with variations reflecting the word-by-word prediction probabilities of the LLMs.
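Here is a sketch of that analysis pipeline on synthetic data: a task-level correlation matrix, followed by agglomerative (hierarchical) clustering validated with the Silhouette Index and the Davies-Bouldin Index. The feature values are random placeholders, not the study’s data:

```python
# A sketch of the pipeline on synthetic data: task-level correlation matrix,
# then agglomerative clustering validated with the Silhouette Index and the
# Davies-Bouldin Index (DBI). Feature values are random placeholders.
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(50, 5)),
                 columns=["AUT", "DV1", "IV1", "DV2", "IV2"])

print(X.corr().round(2))  # correlations between the five tasks

# Compare candidate cluster counts using both validity indices
for k in (2, 3, 4):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.2f}, "
          f"DBI={davies_bouldin_score(X, labels):.2f}")
```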
Comparison with Human Creativity
The multifactorial approach allowed for an objective comparison with human creativity, specifically with teenagers in “ninth grade.”
GPT-4 was found to be better overall on the Divergent Verbal Quotient (DVQ) than GPT-3.5.
However, the Verbal Integrative Quotient (IVQ) scores were not particularly high for either model, though GPT-4 performed statistically significantly better.
Co-Creativity
An earlier GPT-3-based study introduced the concept of “Human-AI Co-creativity,” a three-stage collaboration process involving
- Ideation,
- Illumination, and
- Implementation.
In the Ideation stage, Large Language Models (LLMs) act as a source of inspiration, generating fresh ideas.
During Illumination, they help translate vague concepts into concrete outcomes. For example, we may polish and modify our Midjourney prompt and enrich it with new ideas; that is, the AI may be able to express our thoughts better than we can ourselves.
In Implementation, they assist in experimenting with and polishing ideas to create the final “product”.
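Here is a minimal sketch of the three-stage loop as a Python script using the OpenAI SDK. The model name and prompts are placeholders to adapt to your own setup, and an image model like Midjourney could replace the final step:

```python
# A minimal sketch of the three-stage co-creativity loop using the OpenAI
# Python SDK. The model name and prompts are placeholders; adapt to taste.
# Requires `pip install openai` and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# 1. Ideation: harvest raw ideas
ideas = ask("Give me 10 unusual concepts for a sci-fi love poem.")

# 2. Illumination: turn a vague concept into a concrete brief
brief = ask(f"Pick the most original idea below and expand it into a "
            f"one-paragraph creative brief:\n{ideas}")

# 3. Implementation: produce the draft, then iterate as needed
draft = ask(f"Write a short poem following this brief:\n{brief}")
print(draft)
```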
With the outputs in hand, we could now iterate and go back to ideation or any of the previous phases.
The study also highlights the value of uncertainty in creativity, where randomness and unexpected results from LLMs can serve as inspiration.
To answer the question: Is ChatGPT creative?
Researchers are shy about giving an answer, but I am not. Yes, ChatGPT is 100% creative. It is even more creative than many humans. And it will become more creative over time as model and training data sizes keep growing exponentially. You can best leverage its creativity in an iterative prompting process for ideation, inspiration, illumination, and implementation.
💡 Action Step: Use LLMs every day to produce new and creative outputs, and share them on social media. This will build your prompting skills and generate true value for yourself and the world.
Also make sure to check out our recent blog post: 👇
💡 Recommended: No, GPT-4 Doesn’t Get Worse Over Time (FUD Debunked)
Here’s the full poem about the love story between a whale and a galaxy in a nutshell:
🐳 Galactic Serenade: A Whale's Love

In the ocean's vast embrace, a tale unfolds,
A female white whale, her heart pure as snow,
Falls in love with a wonder, a mystery to behold,
A galaxy in a nutshell, a distant cosmic glow.

Her eyes catch the shimmer, a dance in the deep,
A universe confined, yet boundless in its sweep,
A love so strange, so distant, yet so near,
A galaxy in a nutshell, a longing so sincere.

She swims through the waters, her heart in a race,
Chasing the glimmer, that ethereal space,
A love unattainable, yet she yearns to explore,
A galaxy in a nutshell, a myth, a folklore.

Her songs fill the ocean, a serenade to the stars,
A melody for a love that's both close and afar,
She dreams of the cosmos, a dance in the night,
A galaxy in a nutshell, her unreachable light.

But the ocean's a boundary she cannot transcend,
Her love for the galaxy, a means with no end,
A longing unquenched, a desire so pure,
A galaxy in a nutshell, a love with no cure.

So she sings to the night, her heartache in verse,
A love for a galaxy, both blessing and curse,
A female white whale, in love with a dream,
A galaxy in a nutshell, an unreachable gleam.

In the waves and the tides, her love story's etched,
A romance with the cosmos, a longing unquenched,
A tale of a whale, and a love out of reach,
A galaxy in a nutshell, a lesson it may teach.

For love knows no bounds, not in space or in sea,
A love so profound, as wild as can be,
A female white whale, a galaxy confined,
A love story written in the stars, forever intertwined.