Sil Hamilton

Ithaca, New York, United States
899 followers 500+ connections

About

I’m a PhD student working with David Mimno and Matthew Wilkens at Cornell University in…

Experience

  • Epiq

    New York, NY

  • -

    New York, New York, United States

  • -

    Montreal, Quebec, Canada

  • -

    Montreal, Quebec, Canada

  • -

    Hamilton, Ontario, Canada

  • -

    Hamilton, Ontario, Canada

  • -

    Hamilton, Ontario, Canada

Education

  • Cornell University

    I'm studying machine learning and cultural analytics under David Mimno and Matthew Wilkens. My work investigates computational narrative understanding with large language models.

Volunteer Experience

  • Technology Advisor

    Health Tech Without Borders

    1 year 2 months

    Health

    Advising the board on matters of artificial intelligence.

  • Technical Advisor

    The Associated Press

    3 months

    Science and Technology

    I advised on AI when the Associated Press was formulating their AI policies.

  • Research Affiliate

    Brown Institute for Media Innovation

    Present (1 year 11 months)

    Education

    I gave workshops and presentations at the Brown Institute.

Publications

  • NarraBench: A Comprehensive Framework for Narrative Benchmarking

    EACL 2026

    We present NarraBench, a theory-informed taxonomy of narrative-understanding tasks, as well as an associated survey of 78 existing benchmarks in the area. We find significant need for new evaluations covering aspects of narrative understanding that are either overlooked in current work or are poorly aligned with existing metrics. Specifically, we estimate that only 27% of narrative tasks are well captured by existing benchmarks, and we note that some areas -- including narrative events, style, perspective, and revelation -- are nearly absent from current evaluations. We also note the need for increased development of benchmarks capable of assessing constitutively subjective and perspectival aspects of narrative, that is, aspects for which there is generally no single correct answer. Our taxonomy, survey, and methodology are of value to NLP researchers seeking to test LLM narrative understanding.

  • Too Long, Didn't Model: Decomposing LLM Long-Context Understanding With Novels

    SIGHUM 2026

    Although the context length of large language models (LLMs) has increased to millions of tokens, evaluating their effectiveness beyond needle-in-a-haystack approaches has proven difficult. We argue that novels provide a case study of subtle, complicated structure and long-range semantic dependencies often over 128k tokens in length. Inspired by work on computational novel analysis, we release the Too Long, Didn't Model (TLDM) benchmark, which tests a model's ability to report plot summary, storyworld configuration, and elapsed narrative time. We find that none of seven tested frontier LLMs retain stable understanding beyond 64k tokens. Our results suggest language model developers must look beyond "lost in the middle" benchmarks when evaluating model performance in complex long-context scenarios. To aid in further development we release the TLDM benchmark together with reference code and data.

  • The Zero Body Problem: Probing LLM Use of Sensory Language

    COLM 2025

    Sensory language expresses embodied experiences ranging from taste and sound to excitement and stomachache. This language is of interest to scholars from a wide range of domains including robotics, narratology, linguistics, and cognitive science. In this work, we explore whether language models, which are not embodied, can approximate human use of embodied language. We extend an existing corpus of parallel human and model responses to short story prompts with an additional 18,000 stories generated by 18 popular models. We find that all models generate stories that differ significantly from human usage of sensory language, but the direction of these differences varies considerably between model families. Namely, Gemini models use significantly more sensory language than humans along most axes whereas most models from the remaining five families use significantly less. Linear probes run on five models suggest that they are capable of identifying sensory language. However, we find preliminary evidence suggesting that instruction tuning may discourage usage of sensory language. Finally, to support further work, we release our expanded story dataset.

  • Million Eyes on the “Robot Umps”: The Case for Studying Sports in HRI Through Baseball

    2025 20th ACM/IEEE International Conference on Human-Robot Interaction (HRI)

    In this position paper, we argue that baseball, and sports more broadly, provide a unique and under-explored opportunity for researchers to study human-robot interaction (HRI) in real-world settings. Using the rise of robot umpires in baseball as a primary example, we examine emerging themes such as power dynamics among players and umpires, labor implications, and technical challenges. We emphasize the affordances and benefits of studying sports within HRI, including the integration of interdisciplinary perspectives, the large-scale deployment of robots, and the examination of their role in deeply rooted cultural practices.

  • A City of Millions: Mapping Literary Social Networks At Scale

    Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities

    We release 70,509 high-quality social networks extracted from multilingual fiction and nonfiction narratives. We additionally provide metadata for ~30,000 of these texts (73% nonfiction and 27% fiction) written between 1800 and 1999 in 58 languages. This dataset provides information on historical social worlds at an unprecedented scale, including data for 2,510,021 individuals in 2,805,482 pair-wise relationships annotated for affinity and relationship type. We achieve this scale by automating previously manual methods of extracting social networks; specifically, we adapt an existing annotation task as a language model prompt, ensuring consistency at scale with the use of structured output. This dataset serves as a unique resource for humanities and social science research by providing data on cognitive models of social realities.

  • Detecting Mode Collapse in Language Models via Narration

    Workshop on the Scaling Behavior of Large Language Models

    No two authors write alike. Personal flourishes invoked in written narratives, from lexicon to rhetorical devices, imply a particular author: what literary theorists label the implied or virtual author, distinct from the real author or narrator of a text. Early large language models trained on unfiltered training sets drawn from a variety of discordant sources yielded incoherent personalities, problematic for conversational tasks but useful for sampling literature from multiple perspectives. Successes in alignment research in recent years have allowed researchers to impose subjectively consistent personae on language models via instruction tuning and reinforcement learning from human feedback (RLHF), but whether aligned models retain the ability to model an arbitrary virtual author has received little scrutiny. By studying 4,374 stories sampled from three OpenAI language models, we show successive versions of GPT-3 suffer from increasing degrees of "mode collapse" whereby overfitting the model during alignment constrains it from generalizing over authorship: models suffering from mode collapse become unable to assume a multiplicity of perspectives. Our method and results are significant for researchers seeking to employ language models in sociological simulations.

  • Mrs. Dalloway Said She Would Segment the Chapters Herself

    Workshop for Narrative Understanding

    This paper proposes a sentiment-centric pipeline to perform unsupervised plot extraction on non-linear novels like Virginia Woolf’s Mrs. Dalloway, a novel widely considered to be “plotless.” Combining transformer-based sentiment analysis models with statistical testing, we model sentiment’s rate-of-change and correspondingly segment the novel into emotionally self-contained units qualitatively evaluated to be meaningful surrogate pseudo-chapters. We validate our findings by evaluating our pipeline as a fully unsupervised text segmentation model, achieving an F1 score of 0.643 (regional) and 0.214 (exact) in chapter break prediction on a validation set of linear novels with existing chapter structures. In addition, we observe notable differences between the distributions of predicted chapter lengths in linear and non-linear fictional narratives, with the latter exhibiting significantly greater variability. Our results hold significance for narrative researchers appraising methods for extracting plots from non-linear novels.

  • Blind Judgement: Agent-Based Supreme Court Modelling with GPT

    Creative AI Across Modalities, AAAI 2023

    We present a novel Transformer-based multi-agent system for simulating the judicial rulings of the 2010-2016 Supreme Court of the United States. We train nine separate models with the respective authored opinions of each Supreme Court justice active ca. 2015 and test the resulting system on 96 real-world cases. We find our system predicts the decisions of the real-world Supreme Court with better-than-random accuracy. We further find a correlation between model accuracy with respect to individual justices and their alignment between legal conservatism and liberalism. Our methods and results hold significance for researchers interested in using language models to simulate politically charged discourse between multiple agents.

  • MultiHATHI: A Complete Collection of Multilingual Prose Fiction in the HathiTrust Digital Library

    Journal of Open Humanities Data

    This dataset provides detailed metadata on ca. 10.2 million works of fiction and non-fiction written after 1799 in 521 different languages available in the HathiTrust Digital Library. The dataset bolsters the May 2022 Hathifile by supplying missing predicted fiction tags with a bespoke BERT-based multilingual classifier. Our classifier completes the catalogue with an additional 400,000 non-English volumes predicted to be works of fiction, capturing 95% of all works presently provided by HathiTrust. We provide each work with metadata including the work’s genre at the level of fiction or non-fiction, length in pages, original language, and the year the work was published. With a total page count of ca. 1.4 billion pages, our dataset provides researchers with a substantial source of non-English modern literature. We also present insight into how multilingual classifiers can be trained with monolingual data, itself a discovery with implications for the study of lower resource languages. We hope our provisions will accelerate empirical research into non-English prose and literature.

  • The COVID That Wasn’t: Counterfactual Journalism using GPT

    SIGHUM, COLING 2022

    In this paper, we explore the use of large language models to assess human interpretations of real-world events. To do so, we use a language model trained prior to 2020 to artificially generate news articles concerning COVID-19 given the headlines of actual articles written during the pandemic. We then compare stylistic qualities of our artificially generated corpus with a news corpus, in this case 5,082 articles produced by CBC News between January 23 and May 5, 2020. We find our artificially generated articles exhibit a considerably more negative attitude towards COVID and a significantly lower reliance on geopolitical framing. Our methods and results hold importance for researchers seeking to simulate large-scale cultural processes via recent breakthroughs in text generation.


Honors & Awards

  • NSERC Postgraduate Scholarships-Doctoral (PGS-D)

    Natural Sciences and Engineering Research Council of Canada

    Three-year fellowship to pursue research on extracting cultural concepts from neural networks.

  • Steamship Fellowship for Language AI at Writing Atlas

    Steamship, Inc.

  • Joseph-Armand Bombardier Canada Graduate Scholarship

    Social Sciences and Humanities Research Council of Canada

Languages

  • English

    Native or bilingual proficiency

  • French

    Professional working proficiency

  • Dutch

    Limited working proficiency
