This talk concerns the challenge of evaluating intelligence in artificial systems such as GPT. While contemporary methods provide fine-grained assessments of task performance, they often fail to distinguish genuine intelligence from sophisticated mimicry, reviving familiar, long-standing debates about what to say about “Block Heads” and the Chinese Room. I think we can make headway on this stubborn issue by thinking more carefully about how various activities come to be performed, rather than focusing only on outward performance. In the context of LLMs, this suggests a path towards “deep benchmarking,” which requires attending to the mechanisms AI systems use to complete tasks, together with careful thinking about the conditions under which those mechanisms underwrite intelligent activity. Once we do that, how do things look with respect to chatbot intelligence? In my view, there are plausible mechanisms present in contemporary LLMs that underwrite intelligent activities, but the activities in question are a good way off from anything like semantic understanding, let alone AGI.