keywords:
computer-based experiment
problem solving
artificial intelligence
natural language processing
reasoning
Studying large language models (LLMs) can provide valuable insights into their strengths and limitations. This study explores the problem-solving capabilities of GPT-4 by comparing the model's performance in solving Black Stories riddles to human performance. The study used a set of 12 adjusted Black Stories, each tested twice in both the human and GPT-4 groups. The experiment was conducted through text messaging to ensure a comparable set-up. The primary measure of performance was the number of questions and hints needed to solve each riddle. Results indicated no significant difference between the groups. Qualitative analysis showed that GPT-4 excelled in precise questioning and creativity but often fixated on details, whereas humans covered broader topics and shifted focus quickly but struggled with uncommon details. This research suggests that, despite their different approaches, GPT-4's performance was comparable to that of humans, demonstrating its potential as a capable participant in this type of problem-solving game.