AI, Creativity, and a Changing World

Talk Session 2: Wednesday, May 21 3:30-4:30 PM, ICM Auditorium

Training AI to Assess Human Creativity across Tasks, Modalities, and Languages

Roger E. Beaty, Pennsylvania State University
Simone A. Luchini, Pennsylvania State University
Benjamin Goecke, University of Tübingen
Mete Ismayilzada, École Polytechnique Fédérale de Lausanne (EPFL)
Antonio Laverghetta Jr., Pennsylvania State University
Peter Organisciak, University of Denver
John D. Patterson, Pennsylvania State University
Claire Stevenson, University of Amsterdam
Roni Reiter-Palmon, University of Nebraska Omaha

Large language models (LLMs) are increasingly used to automate creativity assessments, reducing reliance on onerous human scoring. However, current AI-based approaches to creativity scoring remain narrowly focused—limited to specific tasks (e.g., the Alternative Uses Test; AUT), single modalities (e.g., text), or English-language contexts. We introduce ORACL—Originality Assessment Across Languages—a multimodal LLM capable of handling both images and text across many languages. Fine-tuned on a novel dataset of 280,000 human-rated creative responses collected from the global creativity research community, ORACL spans 30 tasks (text and visual) in 10+ languages, from laboratory tasks like the AUT to naturalistic tasks like problem solving and story writing. Computational experiments demonstrate ORACL's ability to reliably predict human creativity ratings on unseen responses, indicating it captures consistent cross-cultural patterns in creativity evaluation. Importantly, ORACL shows evidence of generalization, predicting human ratings for languages and creativity tasks it was not trained on. Our results establish the first multilingual, multimodal AI system for creativity evaluation, with potential to assess creativity on other tasks and languages—pending further validation to understand the model's limits and potential biases. We will release both the training dataset of 280,000 creative responses with human ratings and the ORACL model, to enable automated creativity assessment globally and advance understanding of how humans and AI models evaluate creativity.
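
As a rough illustration of the held-out evaluation described in this abstract (the ORACL model and dataset are not reproduced here), the sketch below compares a placeholder scorer's predictions with human ratings; the model_score function and the example responses are hypothetical stand-ins, not the authors' code.

```python
# Minimal sketch, not the authors' code: evaluating an automated creativity
# scorer against human ratings on held-out responses.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def model_score(response: str) -> float:
    """Placeholder for a fine-tuned scorer such as ORACL (hypothetical stand-in)."""
    return float(len(set(response.lower().split())))  # toy heuristic, not a real model

held_out = [                       # (response, human originality rating), hypothetical
    ("use a brick as a bookend", 1.5),
    ("use a brick as a doorstop", 1.0),
    ("grind a brick into pigment for murals", 4.0),
    ("carve a brick into a chess piece", 3.5),
]

preds = np.array([model_score(text) for text, _ in held_out])
human = np.array([rating for _, rating in held_out])

print(f"Pearson r = {pearsonr(preds, human)[0]:.2f}, "
      f"Spearman rho = {spearmanr(preds, human)[0]:.2f}")
```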

How do Humans and Language Models Reason About Creativity? A Comparative Analysis

Antonio Laverghetta Jr., Pennsylvania State University
Jimmy Pronchick, Pennsylvania State University
Krupa Bhawsar, Pennsylvania State University
Roger E. Beaty, Pennsylvania State University

Creativity assessment in science and engineering is increasingly based on both human and AI judgment, but the cognitive processes and biases behind these evaluations remain poorly understood. We conducted two experiments examining how contextual information (example solutions with expert ratings) impacts creativity evaluation. In Study 1, we analyzed creativity ratings from 72 experts with formal science or engineering training, comparing those who received example solutions with ratings (oracle) to those who did not (no oracle). Computational text analysis revealed that no-oracle experts used more comparative language (e.g., "better/worse") and emphasized solution uncommonness, suggesting they may have relied more on memory retrieval for comparisons. In contrast, oracle experts used less comparative language and focused more on direct assessments of cleverness. In Study 2, parallel analyses with state-of-the-art large language models (LLMs) revealed that AI instead prioritized uncommonness and remoteness of ideas when rating originality, suggesting an evaluative process rooted in the semantic similarity of ideas. In the oracle condition, while LLM accuracy in predicting the true originality scores improved, the correlations of remoteness, uncommonness, and cleverness with originality also increased substantially, to upwards of 0.99, suggesting a homogenization of the LLMs' evaluation of the individual facets. These findings have important implications for how humans and AI reason about creativity and suggest that the two populations prioritize different cues when rating it.
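
To make the facet-correlation analysis mentioned above concrete, the following sketch computes how strongly remoteness, uncommonness, and cleverness ratings track originality within each condition; the file ratings.csv and its column names are assumptions, not the authors' actual data layout.

```python
# Minimal sketch (assumed data layout, not the authors' analysis): correlation
# of each rated facet with originality, split by experimental condition.
import pandas as pd

df = pd.read_csv("ratings.csv")  # assumed columns: condition, remoteness, uncommonness, cleverness, originality

for condition, group in df.groupby("condition"):  # e.g., "oracle" vs. "no_oracle"
    corrs = group[["remoteness", "uncommonness", "cleverness"]].corrwith(group["originality"])
    print(condition, corrs.round(2).to_dict())    # near-1.0 values indicate homogenized facets
```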

All Eyes On The Smartphone Canvas: Expert and Non-Expert Viewing Patterns, Preferences and Memory of Online AI and Human Art

Bernard Vaernes, University of Oslo

This study assesses how visual art expertise affects viewing patterns, decision making, and memory retention of artworks viewed online; in other words, it explores differences in the consumption of art in a modern, real-life setting. To achieve this, 116 images of works of art from the collection of the Norwegian National Gallery and 116 equivalent AI-generated works were presented online, on their own smartphone screens, to 24 graphic art, fine art, and new media art students from fine art academies and university art programs and to 23 non-art university students in Poland, while the integrated selfie camera recorded their eye movements. Participants rated the aesthetic appeal of the paintings and, in a second experiment, completed a memory test of previously viewed and new human- and AI-generated works. Eye-tracking data were analyzed across groups and stimuli to objectively compare experts' and non-experts' visual processing, aesthetic preferences, and memory for different types of works in an ecologically valid setting. Art students used more global (ambient) saccades and had smaller fixation spreads than non-art students, especially for abstract paintings. No convincing evidence was found for differences in aesthetic evaluations between art students and non-art students, whereas art students showed evidence of higher memory scores overall, and for AI-generated and abstract paintings in particular. Future studies should use other methods, as well as AI stimuli, to study differences related to visual art expertise and test the validity of existing expertise theories.
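
For readers unfamiliar with the eye-tracking measures referenced above, here is a minimal sketch of two of them, fixation spread (operationalized here as the bivariate contour ellipse area) and saccade amplitude, computed from hypothetical fixation coordinates; this is an illustration, not the study's actual pipeline.

```python
# Rough sketch with hypothetical data, not the study's analysis code.
import numpy as np

fixations = np.array([  # (x, y) fixation centers in pixels for one image, hypothetical
    [320, 240], [355, 260], [500, 410], [180, 150],
])

# Fixation spread: bivariate contour ellipse area (BCEA), one common operationalization.
k = 1.0  # covers roughly 68% of fixations
sx, sy = fixations[:, 0].std(ddof=1), fixations[:, 1].std(ddof=1)
rho = np.corrcoef(fixations[:, 0], fixations[:, 1])[0, 1]
bcea = 2 * k * np.pi * sx * sy * np.sqrt(1 - rho**2)

# Saccade amplitude: Euclidean distance between consecutive fixations; larger
# average amplitudes indicate more global ("ambient") scanning.
amplitudes = np.linalg.norm(np.diff(fixations, axis=0), axis=1)

print(f"BCEA = {bcea:.0f} px^2, mean saccade amplitude = {amplitudes.mean():.1f} px")
```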

Automated Utility Scoring for the AUT

Rebeka Privoznikova*, University of Amsterdam
Surabhi Nath*, Max Planck Institute, Tübingen
Raoul Grasman, University of Amsterdam
Luke Korthals, University of Amsterdam
Claire Stevenson, University of Amsterdam
* shared first authorship

Of the two components of creativity, originality and utility, the automated scoring of the former has received much more attention in the literature than that of the latter. However, the utility of creative responses is nearly as important when assessing creativity, especially when evaluating responses generated by AI (and adolescents). For example, using a pen to "build a house" is humorous, but not very effective. In this project, we create a series of machine learning (ML) models to automatically predict expert ratings of response utility on the Alternative Uses Test (AUT). Evaluating utility differs from originality scoring and requires a greater understanding of object characteristics and common-sense knowledge of how objects function in the real world. Therefore, besides traditional predictors related to response content and its uniqueness, we also include predictors relating responses to real-world knowledge (e.g., semantic distance to uses generated from a knowledge graph's 'UsedFor' relation). After identifying the best predictors of AUT response utility, we examine the effect of training data on ML model performance by comparing training regimes built from different subsets of human, LLM, and curated out-of-distribution responses. We compare our best-performing ML model's predictions to out-of-the-box and prompt-engineered LLM performance on both standard and novel challenge benchmarks. Preliminary results show that our ML model outperforms LLMs not trained for this task. We discuss the importance of curated training data and of evaluating models' generalization capabilities on challenging datasets.
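
As an illustration of the real-world-knowledge predictors described above, the sketch below computes a semantic distance between an AUT response and typical uses of the prompted object; the embedding model, the hard-coded 'UsedFor' uses, and the choice of sentence-transformers are assumptions, not the project's actual feature pipeline.

```python
# Illustrative sketch (not the project's code): semantic distance between an
# AUT response and typical uses drawn from a knowledge graph's 'UsedFor' edges.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# Hypothetical typical uses for "pen", e.g., as retrieved from a 'UsedFor' relation.
typical_uses = ["writing a letter", "signing a document", "drawing a sketch"]
response = "build a house"

resp_emb = model.encode(response, convert_to_tensor=True)
use_embs = model.encode(typical_uses, convert_to_tensor=True)

# Distance = 1 - max cosine similarity to any typical use; higher values suggest
# the response strays further from the object's real-world functions.
distance = 1 - util.cos_sim(resp_emb, use_embs).max().item()
print(f"semantic distance to 'UsedFor' uses: {distance:.2f}")
```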

Investigating the Validity Evidence of Automated Scoring Methods for Different Response Aggregation Approaches

Janika Saretzki, University of Graz
Mathias Benedek, University of Graz

Divergent thinking (DT) ability is widely regarded as a central cognitive capacity underlying creativity, but its assessment is typically challenged by its reliance on effortful human ratings and by persistent uncertainty about how to aggregate scores across a variable number of responses. Recent work demonstrated that automated scoring based on large language models (LLMs) substantially predicts human creativity ratings. Other work evaluated the psychometric quality of different response aggregation methods for human ratings (including summative and average scoring, as well as top- and max-scoring) by comparing their concurrent criterion validity with respect to external criteria such as real-life creative behavior, creative self-beliefs, and openness. The present study integrates these two lines of work and investigates the criterion validity evidence of automated creativity scores derived from three LLMs (CLAUS, OCSAI 1.6, and GPT-4) under different response aggregation methods. Importantly, instead of merely relating LLM-based ratings to human ratings, this study compares the validity evidence of rater-based and LLM-based scores, which opens up the possibility that automated scoring could even prove more valid (under certain aggregation conditions). Analyses, which are still ongoing, are based on existing data from 300 participants who completed five AUT tasks. Findings will offer new insights into the potential of LLMs to complement or enhance traditional DT assessment and contribute to the broader integration of automated scoring in creativity research.
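
To illustrate the aggregation methods compared in this study, the sketch below applies summative, average, top-scoring, and max-scoring to one participant's per-response scores on a single AUT item; the scores and the top-2 cutoff are hypothetical.

```python
# Minimal sketch (assumed scores, not the study's data) of common response
# aggregation methods for divergent thinking scores.
import numpy as np

scores = np.array([2.1, 3.4, 1.8, 4.0, 2.9])  # LLM- or rater-based scores per response

aggregates = {
    "summative": scores.sum(),                 # rewards fluency (more responses, higher total)
    "average":   scores.mean(),                # controls for fluency
    "top2":      np.sort(scores)[-2:].mean(),  # top-scoring responses only (cutoff assumed)
    "max":       scores.max(),                 # single best response
}
print({name: round(float(value), 2) for name, value in aggregates.items()})
```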

Pencils to Pixels: Studying Drawing Creativity in Children, Adults and AI

Surabhi S Nath, Max Planck Institute for Biological Cybernetics; Max Planck School of Cognition; University of Tübingen
Guiomar Del Cuvillo Y Schröder, University of Amsterdam
Claire Stevenson, University of Amsterdam

Visual creativity has received far less attention than verbal creativity, with only a handful of empirical investigations, largely because of the greater complexity involved in producing and evaluating visual outputs. To tackle this, we turn to drawings, a medium that offers sufficient control without compromising creative potential. Using a popular creative drawing task, we curate a novel dataset comprising ~1,500 drawings by children (n=148, age groups 4-6 and 7-9), adults (n=148), and AI (DALL-E, generated using three different prompts) and devise methods to systematically investigate visual creativity. We use computational measures to characterize two aspects of the drawings, (1) style and (2) content, at both the product and process levels. For style, we define measures based on ink density, ink distribution, and the number of visual elements. For content, we use manually annotated categories to study conceptual diversity and use embeddings of images and captions to compute distance measures. We compare the style, content, and creativity of children's, adults', and AI drawings and build simple models to predict expert and automated creativity scores. We find significant differences in the style and content of the different groups: children's drawings had more components, AI drawings had greater ink density and more lines, and adults' drawings showed the greatest conceptual diversity. Notably, we highlight a misalignment between creativity judgments obtained through expert and automated ratings and discuss its implications. Through these efforts, our work provides a framework for systematically studying human and artificial creativity beyond the textual modality.
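
As a concrete illustration of the computational measures described above, the sketch below computes one style measure (ink density from a binarized drawing) and one content measure (conceptual diversity as the mean pairwise cosine distance between embeddings); the file name, binarization threshold, and embedding matrix are hypothetical, not the authors' pipeline.

```python
# Rough sketch under stated assumptions, not the authors' code.
import numpy as np
from PIL import Image

# Style: ink density, the fraction of drawn (dark) pixels in a binarized drawing.
img = np.asarray(Image.open("drawing.png").convert("L"))  # hypothetical file; grayscale, 0 = black ink
ink_density = (img < 128).mean()                          # threshold of 128 is an assumption

# Content: conceptual diversity as the mean pairwise cosine distance between
# image or caption embeddings of one group's drawings.
def mean_pairwise_cosine_distance(embs: np.ndarray) -> float:
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = normed @ normed.T                              # pairwise cosine similarities
    iu = np.triu_indices(len(embs), k=1)                  # unique pairs only
    return float((1.0 - sims[iu]).mean())

embs = np.random.rand(10, 384)                            # stand-in for real embeddings
print(f"ink density = {ink_density:.3f}, "
      f"conceptual diversity = {mean_pairwise_cosine_distance(embs):.3f}")
```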

Generative AI vs. Creative Brains: AI could beat us in Art… but also in Science?

Vera Eymann, Center for Cognitive Science, University of Kaiserslautern-Landau (RPTU), Kaiserslautern, Germany
Thomas Lachmann, Center for Cognitive Science, University of Kaiserslautern-Landau (RPTU), Kaiserslautern, Germany; Centro de Investigación Nebrija en Cognición (CINC), Universidad Nebrija, Madrid, Spain; Brain and Cognition Research Unit, Faculty of Psychology and Educational Sciences, KU Leuven, Leuven, Belgium
Daniela Czernochowski, Center for Cognitive Science, University of Kaiserslautern-Landau (RPTU), Kaiserslautern, Germany

Scientific creativity encompasses the ability to conduct creative science experiments and to develop creative approaches to solving scientific problems. Today, our world is in desperate need of creative minds to master the many challenges of our time, such as pollution, socioeconomic inequality, and disinformation. At the same time, we are witnessing an upsurge of generative artificial intelligence (AI), which has been proclaimed to permanently end human creativity (e.g., Sternberg, 2024). In fact, AI already seems to be challenging the fields of visual art and music. But what about science? To assess scientific creativity, we developed a task that requires generating scientific hypotheses, designing experiments, and justifying them in terms of their usefulness and originality. Using a fictitious scientific scenario, we asked students (enrolled in a Cognitive Science study program) as well as ChatGPT to create an abbreviated version of a research proposal. Using a structured, blinded rating procedure, an expert from the respective field evaluated the students' research proposals and the proposals generated by ChatGPT in terms of scientific quality and originality. Our results indicate that ChatGPT reached significantly higher overall scores on the task, associated with overall longer and more detailed responses. However, the subscale for scientific originality revealed that students' ideas were rated as more original and creative. We will discuss further implications of our findings, along with future directions for research on scientific creativity, and whether the writing of grant proposals should be placed in artificial hands in the future.