Scholarly Works (6 results)

Article
Peer Reviewed

The Role of Physical Inference in Pronoun Resolution

Proceedings of the Annual Meeting of the Cognitive Science Society, Volume 43 (2021)

When do people use knowledge about the world in order to comprehend language? We asked whether pronoun resolution decisions are influenced by knowledge about physical plausibility. Results showed that referents which are more physically plausible in described events were more likely to be selected as antecedents of ambiguous pronouns, implying that resolution decisions were driven by physical inference. An alternative explanation is that these decisions were driven instead by distributional word knowledge. We tested this by including predictions of a statistical language model (BERT) and found that physical plausibility explained variance on top of the statistical language model predictions. This indicates that at least part of people's pronoun resolution judgments comes from knowledge about the world and not the word. This result constrains psycholinguistic models of comprehension—world knowledge must influence propositional interpretation—and raises the broader question of how non-linguistic physical inference processes are incorporated during comprehension.

Creative Commons 'BY' version 4.0 license

Article
Peer Reviewed

Does word knowledge account for the effect of world knowledge on pronoun interpretation?

UC San Diego Previously Published Works (2024)

Abstract: To what extent can statistical language knowledge account for the effects of world knowledge in language comprehension? We address this question by focusing on a core aspect of language understanding: pronoun resolution. While existing studies suggest that comprehenders use world knowledge to resolve pronouns, the distributional hypothesis and its operationalization in large language models (LLMs) provide an alternative account of how purely linguistic information could drive apparent world knowledge effects. We addressed these confounds in two experiments. In Experiment 1, we found a strong effect of world knowledge plausibility (measured using a norming study) on responses to comprehension questions that probed pronoun interpretation. In Experiment 2, participants were slower to read continuations that contradicted world knowledge-consistent interpretations of a pronoun, implying that comprehenders deploy world knowledge spontaneously. Both effects persisted when controlling for the predictions of GPT-3, an LLM, suggesting that pronoun interpretation is at least partly driven by knowledge about the world and not the word. We propose two potential mechanisms by which knowledge-driven pronoun resolution occurs, based on validation- and expectation-driven discourse processes. The results suggest that while distributional information may capture some aspects of world knowledge, human comprehenders likely draw on other sources unavailable to LLMs.

Article
Peer Reviewed

Comparing Humans and Large Language Models on an Experimental Protocol Inventory for Theory of Mind Evaluation (EPITOME)

UC San Diego Previously Published Works (2024)

Abstract: We address a growing debate about the extent to which large language models (LLMs) produce behavior consistent with Theory of Mind (ToM) in humans. We present EPITOME: a battery of six experiments that tap diverse ToM capacities, including belief attribution, emotional inference, and pragmatic reasoning. We elicit a performance baseline from human participants for each task. We use the dataset to ask whether distributional linguistic information learned by LLMs is sufficient to explain ToM in humans. We compare performance of five LLMs to a baseline of responses from human comprehenders. Results are mixed. LLMs display considerable sensitivity to mental states and match human performance in several tasks. Yet, they commit systematic errors in others, especially those requiring pragmatic reasoning on the basis of mental state information. Such uneven performance indicates that human-level ToM may require resources beyond distributional information.

Article
Peer Reviewed

Does reading words help you to read minds? A comparison of humans and LLMs at a recursive mindreading task

Proceedings of the Annual Meeting of the Cognitive Science Society, Volume 46 (2024)

There is considerable debate about the origin, mechanism, and extent of humans' capacity for recursive mindreading: the ability to represent beliefs about beliefs about beliefs (and so on). Here we quantify the extent to which language exposure could support this ability, using a Large Language Model (LLM) as an operationalization of distributional language knowledge. We replicate and extend the finding of O'Grady et al. (2015) that humans can mindread up to 7 levels of embedding, using both their original method and a stricter measure. In Experiment 2, we find that GPT-3, an LLM, performs comparably to humans up to 4 levels of embedding, but falters on higher levels, despite being near ceiling on 7th-order non-mental control questions. The results suggest that distributional information (and the transformer architecture in particular) can be used to track complex recursive concepts (including mental states), but that human mentalizing likely draws on resources beyond distributional likelihood.

Article
Peer Reviewed

Cognitive cost and information gain trade off in a large-scale number guessing game

Proceedings of the Annual Meeting of the Cognitive Science Society, Volume 43 (2021)

How do people ask questions to zero in on a correct answer? Although we can formally define an optimal query to maximize information gain, algorithms for finding this optimal guess may impose large resource costs in space (memory) and time (computation). To understand how people trade off information gain against the computational difficulty of choosing the ideal query, we turned to a large dataset of 380,000 guesses made during a number-guessing game with Amazon Alexa. We analyzed whether the arithmetic difficulty of following the optimal strategy predicts how far a guess deviates from the theoretically optimal query. We find that when memory load is higher, and when more arithmetic operations need to be performed, human guesses deviate more from the most informative query. These results suggest that human computational resource constraints limit how people seek out informative questions.
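
As a rough illustration of what "information gain" of a guess can mean here, the sketch below scores each candidate guess in a number-guessing game. It is not the authors' analysis: it assumes the target is uniformly distributed over the remaining range and that feedback is one of "lower", "correct", or "higher". The function names `expected_information_gain` and `best_guess` are hypothetical.

```python
import math


def expected_information_gain(low: int, high: int, guess: int) -> float:
    """Expected reduction in uncertainty (bits) from guessing `guess`.

    Assumes the target is uniform on [low, high] and the feedback is
    'lower', 'correct', or 'higher' (an assumption for this sketch).
    """
    if not low <= guess <= high:
        raise ValueError("guess must lie within the remaining range")
    n = high - low + 1
    prior_entropy = math.log2(n)
    # Number of candidates left after each possible feedback.
    remaining = [guess - low, 1, high - guess]
    # Each outcome occurs with probability k / n and leaves log2(k) bits.
    expected_posterior = sum((k / n) * math.log2(k) for k in remaining if k > 0)
    return prior_entropy - expected_posterior


def best_guess(low: int, high: int) -> int:
    """Guess with the highest expected information gain (first one on ties)."""
    return max(
        range(low, high + 1),
        key=lambda g: expected_information_gain(low, high, g),
    )
```

Under these assumptions the most informative guess is the midpoint of the remaining range, so following the optimal strategy means tracking the current bounds (memory) and computing their midpoint (arithmetic), which are the two kinds of cost the abstract describes.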

Article
Peer Reviewed

Distributional Semantics Still Can't Account for Affordances

Proceedings of the Annual Meeting of the Cognitive Science Society, Volume 44 (2022)

Can we know a word by the company it keeps? Aspects of meaning that concern physical interactions might be particularly difficult to learn from language alone. Glenberg & Robertson (2000) found that although human comprehenders were sensitive to the distinction between afforded and nonafforded actions, distributional semantic models were not. We tested whether technological advances have made distributional models more sensitive to affordances by replicating their experiment with modern Neural Language Models (NLMs). We found that only one NLM (GPT-3) was sensitive to the affordedness of actions. Moreover, GPT-3 accounted for only one third of the effect of affordedness on human sensibility judgments. These results imply that people use processes that go beyond distributional statistics to understand linguistic expressions, and that NLP systems may need to be augmented with such capabilities.