
Speech sounds are highly variable, yet listeners readily extract information from them and transform continuous acoustic signals into meaningful categories during language comprehension. A central question is whether perceptual encoding captures continuous acoustic detail in a one-to-one fashion or whether it is affected by categories. We addressed this in an event-related potential (ERP) experiment in which listeners categorized spoken words that varied along a continuous acoustic dimension (voice onset time; VOT) in an auditory oddball task. We found that VOT effects were present through a late stage of perceptual processing (N1 component, ca. 100 ms poststimulus) and were independent of categories. In addition, effects of within-category differences in VOT were present at a post-perceptual categorization stage (P3 component). Thus, at perceptual levels, acoustic information is encoded continuously, independent of phonological information. Further, at phonological levels, fine-grained acoustic differences are preserved along with category information.

The acoustics of speech are characterized by immense variability. Individual speakers differ in how they produce words, and even the same speaker will produce different acoustic patterns across repetitions of a word. Despite this variability, listeners can accurately recognize speech. Thus, a central question in spoken language comprehension is how listeners transform variable acoustic signals into less variable, linguistically meaningful categories. This process is fundamental for basic language processing, but is also relevant to other areas, such as language and reading impairment ( Thibodeau & Sussman, 1979; Werker & Tees, 1987) and automatic speech recognition.

Speech perception has been framed in terms of two levels of processing ( Pisoni, 1973): the perceptual encoding¹ of continuous acoustic cues and the subsequent mapping of this information onto categories like phonemes or words. Theories of speech perception differ in the nature of representations at both levels and the transformations that mediate them ( Oden & Massaro, 1978; Liberman & Mattingly, 1985; McClelland & Elman, 1986; Goldinger, 1998). Historically, a dominant question was whether perception is graded or categorical (discrete or nonlinear) with respect to the continuous input ( Liberman, Harris, Hoffman, & Griffith, 1957; Schouten, Gerrits, & Van Hessen, 2003). Such discreteness could arise from several sources: an inherent, nonlinear encoding of speech into articulatory gestures ( Liberman & Mattingly, 1985), the learned influence of phonological categories ( Anderson, Silverstein, Ritz, & Jones, 1977), or discontinuities in low-level auditory processing ( Sinex, MacDonald, & Mott, 1991; Kuhl & Miller, 1978). If perception is nonlinear in one of these ways, listeners will be less sensitive (or completely insensitive) to differences within a category than to differences between categories.

Consider voice onset time (VOT), the time difference between the release of constriction and the onset of voicing. VOT leaves an acoustic trace that serves as a continuous cue distinguishing voiced (/b, d, g/) from voiceless (/p, t, k/) stops. Early behavioral work suggested that perception is categorical: listeners are poor at discriminating acoustic differences within the same category and good at discriminating equivalent distances spanning a boundary ( Liberman et al., 1957; Repp, 1984). This supported a view that early perceptual processes encode speech in terms of categories and abstract away from fine-grained detail in the signal. If perception of VOT is categorical, then VOTs between 0 and 20 ms (/b/) may be encoded as more similar to each other than to VOTs greater than 20 ms (/p/), even if the acoustic distance between them is the same.
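The categorical-versus-continuous contrast discussed in this section can be sketched numerically. The following is a hypothetical illustration, not the study's model: the 20 ms boundary is the nominal /b/–/p/ boundary mentioned in the text, and the distance functions and `scale` parameter are arbitrary choices for the sketch. A strictly categorical code collapses within-category differences, while a continuous code preserves acoustic distance regardless of the boundary.

```python
# Illustrative sketch (not the experiment's model): compare how a strictly
# categorical encoder and a continuous encoder represent VOT differences.

BOUNDARY_MS = 20  # nominal /b/-/p/ boundary from the text; exact value varies

def categorical_code(vot_ms):
    """Strictly categorical encoding: only the category label survives."""
    return "p" if vot_ms > BOUNDARY_MS else "b"

def distance_categorical(v1, v2):
    # Within-category differences are lost entirely; between-category
    # differences are maximally distinct.
    return 0.0 if categorical_code(v1) == categorical_code(v2) else 1.0

def distance_continuous(v1, v2, scale=40.0):
    # Continuous (linear) encoding: perceptual distance tracks acoustic
    # distance; `scale` is an arbitrary normalizing constant.
    return abs(v1 - v2) / scale

# Two 10-ms steps with equal acoustic distance:
within = (0, 10)   # both /b/
across = (15, 25)  # spans the boundary

print(distance_categorical(*within))  # 0.0 (indiscriminable)
print(distance_categorical(*across))  # 1.0 (maximally distinct)
print(distance_continuous(*within))   # 0.25
print(distance_continuous(*across))   # 0.25 (same as within-category)
```

This captures the classic discrimination prediction: equal 10-ms steps are indiscriminable within a category but maximally distinct across the boundary under categorical encoding, whereas continuous encoding treats them identically.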

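The graded alternative, in which category information coexists with preserved within-category detail, is commonly illustrated with a logistic categorization function: response probabilities change smoothly with VOT rather than stepping at the boundary, so within-category VOT differences still shift responses. A minimal sketch, assuming a hypothetical 20 ms boundary and an arbitrary slope (not fitted values from this study):

```python
import math

def p_voiceless(vot_ms, boundary_ms=20.0, slope=0.5):
    """Graded categorization: probability of a voiceless (/p/) response
    rises smoothly with VOT instead of jumping from 0 to 1 at the boundary."""
    return 1.0 / (1.0 + math.exp(-slope * (vot_ms - boundary_ms)))

# Within-category VOT differences (e.g., 25 vs. 30 ms, both /p/) still
# change the response probability, so fine-grained acoustic detail is
# preserved alongside the category decision.
for vot in (0, 10, 20, 30, 40):
    print(f"VOT {vot:2d} ms -> P(voiceless) = {p_voiceless(vot):.2f}")
```

Under this kind of mapping, discrimination is best near the boundary (where the function is steepest) yet never falls to zero within a category, which is one way to reconcile boundary effects with sensitivity to within-category detail.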