Object recognition relies on invariant representations. A longstanding view holds that invariances are learned by explicitly coding how visual features are related in space. Here, we asked how invariances are learned for objects that are defined by relations among features in time (temporal objects). We trained people to classify auditory, visual and spatial temporal objects, each composed of four successive features, into categories defined by sequential transitions across a two-dimensional feature manifold, and measured their tendency to transfer this knowledge when categorising novel objects with rotated transition vectors. Rotation-invariant temporal objects could be learned only if their features were explicitly spatial or had been associated with a physical spatial location in a prior task. Thus, space acts as a scaffold for generalising information in time.