AI, Creativity, and a Changing World
Talk Session 2: Wednesday, May 21, 3:30-4:30 PM, ICM Auditorium
Training AI to Assess Human Creativity across Tasks, Modalities, and Languages
Roger E. Beaty, Pennsylvania State University
Simone A. Luchini, Pennsylvania State University
Benjamin Goecke, University of Tübingen
Mete Ismayilzada, École Polytechnique Fédérale de Lausanne (EPFL)
Antonio Laverghetta Jr., Pennsylvania State University
Peter Organisciak, University of Denver
John D. Patterson, Pennsylvania State University
Claire Stevenson, University of Amsterdam
Roni Reiter-Palmon, University of Nebraska Omaha
Large language models (LLMs) are increasingly used to automate creativity assessments, reducing reliance on onerous human scoring. However, current AI-based approaches to creativity scoring remain narrowly focused, limited to specific tasks (e.g., the Alternate Uses Task; AUT), single modalities (e.g., text), or English-language contexts. We introduce ORACL (Originality Assessment Across Languages), a multimodal LLM capable of handling both images and text across many languages. Fine-tuned on a novel dataset of 280,000 human-rated creative responses collected from the global creativity research community, ORACL spans 30 tasks (text and visual) in 10+ languages, from laboratory tasks like the AUT to naturalistic tasks like problem solving and story writing. Computational experiments demonstrate ORACL's ability to reliably predict human creativity ratings on unseen responses, indicating that it captures consistent cross-cultural patterns in creativity evaluation. Importantly, ORACL shows evidence of generalization, predicting human ratings for languages and creativity tasks it was not trained on. Our results establish the first multilingual, multimodal AI system for creativity evaluation, with the potential to assess creativity on other tasks and languages, pending further validation to understand the model's limits and potential biases. We will release both the training dataset of 280,000 creative responses with human ratings and the ORACL model, to enable automated creativity assessment globally and to advance understanding of how humans and AI models evaluate creativity.
How do Humans and Language Models Reason About Creativity? A Comparative Analysis
Antonio Laverghetta Jr., Pennsylvania State University
Jimmy Pronchick, Pennsylvania State University
Krupa Bhawsar, Pennsylvania State University
Roger E. Beaty, Pennsylvania State University
Creativity assessment in science and engineering increasingly relies on both human and AI judgment, but the cognitive processes and biases behind these evaluations remain poorly understood. We conducted two experiments examining how contextual information (example solutions with expert ratings) affects creativity evaluation. In Study 1, we analyzed creativity ratings from 72 experts with formal science or engineering training, comparing those who received example solutions with ratings (oracle) to those who did not (no oracle). Computational text analysis revealed that no-oracle experts used more comparative language (e.g., "better/worse") and emphasized solution uncommonness, suggesting they may have relied more on memory retrieval for comparisons. In contrast, oracle experts used less comparative language and focused more on direct assessments of cleverness. In Study 2, parallel analyses with state-of-the-art large language models (LLMs) revealed that AI instead prioritized the uncommonness and remoteness of ideas when rating originality, suggesting an evaluative process rooted in the semantic similarity of ideas. In the oracle condition, LLM accuracy in predicting the true originality scores improved, but the correlations of remoteness, uncommonness, and cleverness with originality also increased substantially, to upwards of 0.99, suggesting a homogenization of the LLMs' evaluation of the individual facets. These findings highlight important implications for how humans and AI reason about creativity and suggest that the two populations diverge in what they prioritize when rating.
All Eyes On The Smartphone Canvas: Expert and Non-Expert Viewing Patterns, Preferences and Memory of Online AI and Human Art
Bernard Vaernes, University of Oslo
This study aims to assess how visual art expertise impacts viewing patterns, decision-making, and memory retention of different artworks when viewed online; in other words, it explores differences in the consumption of art in a modern, real-life setting. To achieve this, 116 images of works of art from the collection of the Norwegian National Gallery and 116 equivalent AI-generated works were presented online, on participants' own smartphone screens, to 24 graphic art, fine art, and new media arts students from Fine Art Academies and University Art programs and to 23 non-art university students in Poland, while the integrated selfie camera recorded their eye movements. Participants rated the aesthetic appeal of the paintings and, in a second experiment, completed a memory test of previously viewed and new human and AI-generated works. Eye-tracking data were analyzed for the different groups and stimuli in order to objectively compare experts' and non-experts' visual processing, aesthetic preferences, and memory for images of different types of works in an ecologically valid setting. Art students used more global (ambient) saccades and had smaller fixation spreads than non-art students, especially for abstract paintings. No convincing evidence was found for differences in aesthetic evaluations between art students and non-art students. There was evidence that art students had higher memory scores overall, and for AI-generated and abstract paintings in particular. Future studies should use other methods as well as AI stimuli to study expertise differences in visual art and to test the validity of existing expertise theories.