How Do Language Models Like ChatGPT Process Complex Words?

You can also join remotely: see the Teams link on the seminar webpage (given under "Booking url" below).
Speaker:

Valentin Hofmann is a final-year DPhil student at the University of Oxford and a research assistant at LMU Munich. His work broadly focuses on the intersection of natural language processing, linguistics, and computational social science, with specific interests in tokenization, socially and temporally aware language models, and graph-based methods. He has previously spent time as a research intern at DeepMind and as a visiting scholar at Stanford University.

Abstract:

Language models (LMs) like ChatGPT have achieved unprecedented levels of performance in natural language processing. One common characteristic of these models is that they segment text into a sequence of tokens from a fixed-size vocabulary, a step commonly referred to as tokenization.
In this talk, I will take a closer look at how the linguistic properties of tokenization affect the way LMs process complex words (e.g., “superbizarre”). I will first give an overview of different forms of complex word processing in humans and AI systems. I will then present recent computational studies showing that the tokenization used by LMs can produce linguistically invalid segmentations (e.g., “superb-iza-rre”) that severely affect how LMs interpret complex words. Finally, I will discuss potential solutions to this problem.
Date: 21 February 2023, 14:30
Venue: Wolfson College, Linton Road OX2 6UD
Venue Details: Seminar Room 3 - The Academic Wing
Speaker: Valentin Hofmann (University of Oxford)
Organising department: Wolfson College
Organisers: Prof. Antoniya Georgieva (University of Oxford), Dr. Yi Yin (Wolfson College, University of Oxford)
Organiser contact email address: yi.yin@wrh.ox.ac.uk
Part of: Oxford Cross-Disciplinary Machine Learning (OxfordXML) Research Cluster Seminar Series
Booking required?: Not required
Booking url: https://users.ox.ac.uk/~ndog0178/XML/xml_index.html
Cost: Free (cake, tea and coffee provided)
Audience: Public
Editor: Yi Yin