Introduction (The Gossip on Voice Chips)
This essay develops a frequently asked question (FAQ) list for Voice Chips. Like the questions in most FAQs, these questions are not actually frequently asked, but they might be, and like every FAQ, the attempt is to structure the accumulation of experiences in a sociotechnical project.
Voice Chips and their newer partners, speech recognition chips, are small low power silicon chips that synthesize voice, play prerecorded voice messages, or recognize voice commands. Although this functionality is not new, what makes voice chips unique is that they are small and cheap enough to be deployed in many, in fact almost any, product. Sprinkled throughout the technosocial landscape, their presence in products is a (not quite arbitrary) sampling mechanism, and enables us to compare very different products. So their secondary function, the concern of this essay, is as a simple instrument to slice through the history of our attempts to swap attributes with machines and be able to understand the nuances of complex sociotechnical systems -- precisely because the systems are rendered in the form in which we can best recognize nuance: English, be it our own or the machines'.
These chips represent in-the-wild models of the interactions between humans and machines -- as reductive as they are comic, but at least a manageable examination. They are caricatures of more complex human-machine interactive systems. We will ask: What is the structure of participation scripted with these new products and with increasingly ubiquitous information technologies? By examining the "structure of participation" (who addresses what, what addresses whom, who listens, what hears and who or what acts... and other forms of participation elaborated later) rather than focusing on the interaction between the device and the "user," we pay attention to peripheral participation, the participation between users and around things; between users and things within systems. It is an approach to human (singular)-computer (singular) interaction that reconsiders interaction as a form of participation and escapes the simple dichotomy between social and technological.
The question we begin with is simply, when things can talk, what do they say? Our intent is to actually listen, and try to figure out whose voice it is and what it means. We then ask the complementary question: when we can talk to things (i.e., when there is speech recognition capacity embedded), what do we say? Who are we addressing? And what do we sound like? Are we polite, at least? And what is the appropriate way to talk to things (social norms)? Does it change language to be talking across the human/nonhuman divide, or how does it change us? Can we get some new insights into the old question, Is language uniquely human? We ask the voice chips these questions because they literally talk back, insisting on the scripts of participation that they were built with, reflecting the expectations and failures of our interactive technologies.
How have the things that voice chips say changed over the years since voice functionality was implemented? Does the "Chatty Cathy" of the sixties (a tape mechanism precursor to the voice chip) have anything different to say vis-à-vis the Barbie of the eighties or contemporary interactive toys? Or what is the relationship between novelty and familiarity, stability and instability, managed in these devices? Which things are given voices and which things are not, and why not? Why are they different from other talking hardware? What did the patents say they would say? And what did they actually say? What are the differences between these innovations as intellectual property and the novel devices as viable products? Exploring these questions tells us about the process of commodification of an ephemeral device, and explores the pattern of propagation of an innovation.
Where are the voices coming from? Who is hearing them, who isn't? Whose accent do they have? Does the failure of voice chips in automobiles predict anything for the future of speech recognition chips inserted in other modes of transportation, and other places? Do they work in public or private places? What does any of this tell us about ubiquitous computing? Do these voices actually work? Does a voice chip reminder to not leave your things behind, watch your step, stand back, actually make you take your thing, watch your step, or stand back? How does the function of the product change the meaning of the voice? How does the voice change the product? How do nonhuman speech devices change language? And similarly: Now that we can talk to things, what do we say? What would we prefer to say? What would be the correct thing to say? What could we say? What does this tell us about the contingency of meaning?
Can You Capture Voice?
Voice is the icon of person. "To be given a voice" is shorthand for the fundamental units of democracy: voting, "being represented," or participating. A device of sociality and therefore interaction, it is used to interpellate a subject (presumably a person) into society (Althusser 1971), or as a performative device to instantiate social agreements and identities (Butler 1993). We will trace how the responsive and ephemeral social device of voice interaction is commodified and sold back to us.
What's So Special about Voice Chips?
Talking hardware has existed since before the time of Thomas Edison (who is generally credited with having invented the phonograph around 1877), when Alexander Graham Bell's telephone learnt to talk. The proliferation of talking hardware since has brought about the recording industry, the broadcast industry and the multimedia industry. Our exposure to voices (and other communicative sounds) that emanate from inanimate objects has become a significant part of our daily interactions: from radios to the more recent talking elevators, answering machine messages and prerecorded music, television, automated phone menus, automatic teller machines, alarms and alerts, each of which, as we will show, speaks in a language or dialect that makes little distinction between music, sound effects and articulated words, and privileges the situational function of language over the semantic and interactive.
There are, however, distinctions to make between the voice
chips, the concern of this essay, and noisy hardware more generally.
Voice chips refers colloquially to: Texas Instruments TSP50C04/06 and
TSP50C13/14/19 synthesizers; Motorola MC34018 or any other "speech
synthesis chip implemented in C-MOS to reproduce various kinds of
voices, and includes a digital/analog (D/A) converter, an ADPCM
synthesizer, an ADPCM ROM that can be configured by the manufacturer to
produce sound patterns simulating certain words, music or other
effects."
1
The speech recognition chip is exemplified by the ISD-SR3000
Embedded Speech Recognition Engine.
The voice chip differs from other technologies of automated
sound production in that it offers autonomous voices, as opposed to
broadcast voices. That is, voices which are not necessarily associated
with a performer, a brand, or any other preestablished identity. These
chips present what we will call "local talk" in products that refer to
themselves and don't often make claims to another's identity, or to the
faithful reproduction of someone else's voice. In fact, their sound
quality has effectively limited this. The "I" in "I'm sorry, I could not
process your request," or the "I will transfer you now" voice of the
automated operator
2
claims agency by using the first person pronoun. Presumably, the
machine is referring to itself when saying "I,"
3
because it is not identifiably anyone else.
Attributing agency to technologies is a strategy that has been
used by theorists to better understand the social role of technologies
(Latour 1988; Callon 1995). It is a strategy that dislodges the
immediate polarization of techniques and society, a strategy that
refuses reduction to a situation that is merely social or only
technological. Bruno Latour bases his Actor Network Theory -- a theory
that regards things as well as people as actors in any
sociotechnological assemblage -- on the ability of humans and nonhumans
to swap properties. He claims that "every activity implies a generalized
principle of symmetry or, at least, offers an ambiguous mythology that
disputes the unique position of humans." Michel Callon and John Law
(1982) have also explored nonhumans as agents, but their strategy starts
with an indisputable agent (a white male scientist) and strips away his
enabling network of humans and nonhumans to demonstrate that his agency,
his ability to act as a white male scientist, is distributed throughout
his network of people, places and instruments. The more traditional
(default) theory of technological determinism rests on the assumption
that technology has an agency apart from the people who design,
implement or operate it, and hence can determine social outcomes. Voice
chip products take these ideas literally and actually attribute, with
little debate or contest, the human capacity of speech to technological
devices. Voice chips humbly preempted the theory.
4
The voices of chips also differ from those of loudspeakers, TV/radio, and other broadcasting technologies in the social spaces they inhabit. Although radio and TV have become so portable that their voices can emanate from any vehicle, serving counter, or room -- voice chip voices, by virtue of their peripheral relationship to the product, inhabit even more diverse social spaces. The identity of the voice that emanates from TV and radio reminds us that it is coming from elsewhere: "...for CBS News," "It is 8 o'clock GMT; this is London." And although Channel 9 is not a physical place, its resources and speech are organized around creating its identity, as an identifiable place on the dial. The voice chip that tells you "your keys are in the ignition" is not creating a Channel 9 identity, however. Its identity is "up for grabs," not quite settled; it speaks from a position of a product in the social space of daily use.
Similarly, recording media and hardware refer to what they
record. We know we are listening to someone when we listen to an Abba
CD. And although it is the tape player in the car that produces the
sound, we claim to be listening to the violin concerto itself. The tape
player as a product does not itself have a voice; it never pretends to
sing, speak or synthesize violin sounds itself. The recording industry
and associated technologies, born at a very different historical moment
from voice chips, came out of the performance tradition.
5
Its claim to represent someone, from the earliest promotions
using opera singers, to contemporary megastars, has focused the
technologies around "fidelity" issues. Additionally, telephones,
telephonic systems and the telecommunications industry, motivated by the
communication imperative, prioritize real-time voices passing to
real-time ears over fidelity. Simply stated, it is an industry that puts
technologies between people, things to communicate through, "overcoming
the tyranny of distance" (Minneman 1991). Invisible distance and
seamless technology reflect the recording industry's ambition to
"overcome the tyranny of time," enabling people to duplicate the
performance regardless of when or where it was originally performed.
Voice chips and their inferior sound quality do not refer beyond
themselves. Their position in a product becomes their position as a
product.
How Are Voice Chips Distributed?
Voice chips provide the opportunity to add "voice functionality"
to the whole consumer-based electronics industry. They are the
integrated circuits that can record, play and store sounds, and more
importantly, voice. They are the patented chips that play "Jingle Bells"
in Hallmark greeting cards.
6
They are the voice in the car that reminds you, "Your lights are
on."
7
They are the technology that makes dolls say "Meet me at the
mall,"
8
and gives voices to products ranging from picture frames to
pens.
9
The well-sung virtues of integrated circuits (chips) are that
they are cheap, tiny and require little power. Smaller than a baby's
fingernail, they have the force of a global industry behind them and an
entire economic sector invested in expanding their application.
Technically, they can be incorporated into any product without
significant changes in their housing, their circuit design, power
supply, or price. Wherever there is a flashing light, there could
instead, or as well, be a voice chip.
Although most personal computers can record and play voice, the
voice chip is different in that it is dedicated solely to that function.
The same integrated circuit technology found in calculators and
computers allows this tiny package to be placed ad hoc in consumer
devices. Their development exploited the silicon chip manufacturing
processes and its dedication to miniaturization. With sound storage
capacities ranging from seconds of on-board memory to minutes and hours
of recording time when configured with memory chips, they were conceived
to enable voices in existing hardware, to be incorporated into products.
They are the saccharin additive of consumer electronics.
10
They were first mass marketed in 1978 by Texas Instruments,
though they had existed in several forms before that, particularly in
the vending industry. It was not until seven years later, in 1985, that
the Special Interest Group in Computer-Human Interface (SIGCHI) of the
Association for Computing Machinery (ACM) professional society broke off
into their own conference from other more general computing conferences.
This institutionalization formalized the discussion in design
communities on the Human-Computer Interface as a site of scientific
investigation that differs from earlier formulations of this interface,
such as Engelbart's human augmentation thesis or Turing's
standing-in-for ideal (Bardini 1997), but whose concerns for evaluating
an interface tend toward task decomposition, with metrics of efficiency
still dominating (Dourish 2001). This liminal zone where people and
machine purportedly interact is where the voice chips were intended to
reside. The voice chips arrived to mediate, even to negotiate, this
boundary. Voice chips promised to make hardware "user-friendly," a
phrase that defines the technical imagination of the time, by turning
the person into an interchangeable standardized "user" and attributing a
personality (i.e., friendliness) to the device. In this context the
problem for designing user-friendly devices begins with the assumption
that the hardware has agency in the interaction.
Writes Turkle: "Marginal objects, objects with no clear place, play important roles. On the lines between categories, they draw attention to how we have drawn the lines. Sometimes in doing so they incite us to reaffirm the lines, sometimes to call them into question, stimulating different distinctions" (Turkle 1984, 31).
Do Marginal Voices Have Any Say in the Market?
Finally, before listening to the voices themselves, I want to emphasize the peripheral relationship of the voice chip to the product. It is the position of the voice chip as marginal, not particularly intended to be the primary function of the product, that increases the present curiosity in it. The motor vehicle, for example, is not purchased primarily for its talking capacity, and pens that speak are still useful for writing. This marginality gives voice chips a mobility to become distributed throughout the product landscape and mark, like fluorescent dye, a social geography of product voices.
The chips are usually deployed -- to borrow the economic sense of the term -- for their marginal effects, to distinguish one product (e.g., an alarm system) from another, and give it some marginal advantage over a competing product. However, the chips are not evenly distributed throughout competitive markets (e.g., consumer electronics) in the manner one would expect for the propagation of a low-cost technical innovation driven by market structure alone.
Although consumer preferences are often claimed to have a causal determination on the appearance or disappearance of marginal benefits, it is difficult to see how the well-developed paths of product distribution have the capacity to communicate those "preferences" developed after the point of purchase. Lending the market ultimate causality (or agency) ignores the specific experience of conversing with products, the micro-interactions that enact the market phenomenon, and occludes the attribution of agency to the voice chip products, insomuch as these products speak for themselves. The voice chip products themselves have something to say, although their voices are usually ignored. In this essay we will not be examining voice chip products in the interactions of daily use, as contrapuntal to market descriptions; however, by recognizing the social assumptions that determine their physical design, we frame the imagined interactions and social worlds in which these products make sense.
Hearing Voices?
The marginality of the product makes it difficult to
systematically study. Neither of the two largest manufacturers of voice
chips of various types (Motorola and Texas Instruments) keeps information
on what products incorporate this technology, partly because they can be
configured in many different ways -- not necessarily as voice chips --
and partly because products that talk are not a marketing category of
general interest. This essay traces voice chips in two ways: first via
the patent literature, and second through a more ad hoc method of
searching catalogues, electronics, and toy and department stores, to
compile a survey of products that have been available in the last six
years (my voice chip collection was begun in June 1996).
11
What is initially observable from the list of products and patents that contain voice chips is that there is no obvious systematic relationship between the products that include voice chips and the uses or purposes of those products. Except for children's toys, no particular electronics market sector is more saturated with talkative products. These chips are distributed throughout diverse products. However, we can view the voices as representatives, as in a democratic republic where voices are counted. Just as in a republic, each citizen has a vote, but most choose not to exercise it; likewise, most products could incorporate voice chips but most do not. We will count what we can.
19.1. The talking watch. (These images are drawn from the ephemeral propaganda form known as the product catalog. They have been deliberately pixelated.)
What Do Voice Chips Say?
A review of the patent literature yields a loose category scheme
or typology, not by where the voice chips appeared (a technology sector
approach that we will visit later), but by what they said. The patents
themselves hold a tentative relationship to the products. For only two
of the products on the market did I find the corresponding patents, the
CPR device
12
and the recordable pen.
13
Though patents do not directly reflect the marketed products,
they do represent a rather strange world of product generation, a
humidicrib for viable and unfeasible proto-products. Patents track how
products have been imagined and protected; while they do not by any
means demonstrate market success, they do reflect a conviction of their
worth, being invested in and protected. Patents are a step in the
process of becoming owned, are therefore worth money, and thereby
demonstrate how voice, a social technology, becomes property.
There were as of October 2001 only 163 North American patents that included a voice chip. (More recent years show a proportional increase.) In the context of the patent literature, the first thing to note is that this is a very small number -- compared, that is, to the integrated circuit patent literature more generally. The question "Why not more?" we will return to later. The federal trademark office offers a suggestive list of speech-invoking names, including: who's voice; provoice; primovox; ume voice; first voice; topvoice; voice power; truvoice; voiceplus; voicejoy; activevoice; vocalizer; speechpad; audiosignature. These monikers introduce how the voice is conceptualized in the realms of intellectual property, in a different form, claiming that these voices are premium (should be listened to?) in various ways. However, the voice chips themselves seem to fall into the following loose categories:
1. Translators, which range from reporting and alerting to alarming and threatening and include "interactive" instructional voices;
2. Transformers, which transform the voice;
3. Voice as Music, that make speech indistinguishable from music or that present voice as sound effect;
4. Locating Voices, speaking from here to there about being here;
5. Expressive Voices, expressing love, regret, anger and affection;
6. Didactic Voices and Imitative Voices, mainly as in educational and whimsical children's toys;
7. Dialogue Products, which explicitly intend to be in dialogue with the user, as opposed to delivering instructions to a passive listener.
Products and patents often exist in more than one of these categories; for instance, the Automatic Teller Machine will not only apologize (expressive) for being out of order but will also simply function to translate the words on the screen into speech. This said, the categories remain, for the most part, distinguishable and useful.
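The point that one product can sit in several categories at once can be made concrete with a small sketch. It is only an illustration of the typology as a set of overlapping labels: the names `VoiceCategory`, `TAGS`, `products_in` and `multi_category` are hypothetical, and the three tagged examples are simply those the essay itself places in a category.

```python
from enum import Enum


class VoiceCategory(Enum):
    """The seven loose categories listed above."""
    TRANSLATOR = "Translators"
    TRANSFORMER = "Transformers"
    VOICE_AS_MUSIC = "Voice as Music"
    LOCATING = "Locating Voices"
    EXPRESSIVE = "Expressive Voices"
    DIDACTIC_IMITATIVE = "Didactic Voices and Imitative Voices"
    DIALOGUE = "Dialogue Products"


# Illustrative tagging only (an assumption, not a survey): each product
# maps to a SET of categories, because membership is not exclusive.
TAGS = {
    "automatic teller machine": {VoiceCategory.EXPRESSIVE,
                                 VoiceCategory.TRANSLATOR},
    "train defect detecting and enunciating system": {VoiceCategory.TRANSLATOR},
    "YakBak": {VoiceCategory.TRANSFORMER},
}


def products_in(category):
    """Return the names of all products tagged with the given category."""
    return sorted(name for name, cats in TAGS.items() if category in cats)


def multi_category():
    """Return the names of products that fall into more than one category."""
    return sorted(name for name, cats in TAGS.items() if len(cats) > 1)
```

Calling `products_in(VoiceCategory.TRANSLATOR)` lists both the teller machine and the train defect system, while `multi_category()` picks out the teller machine alone, which is the sense in which the categories overlap yet remain distinguishable.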
Translators
A large category, this is the voice that translates the language
of buzzes and beeps into sentences -- whether English, French, or
Chinese. A translator is a chip that translates the universal flashing
LED, the lingua franca of the piezoelectric squeal, the date code, the
bar code, the telephone ringer adapter that translates that familiar
ring, the tingling insistent trill of an incoming call, into "a
well-known phrase of music"
14
(an approach that has since become popular in cell phones, where
this function is useful in differentiating whose phone is ringing), or
the unrelated patent that translates the caller identification signal
into a vocal announcement.
Within the translators there are distinct attitudes; for instance, the impassive reporting, almost a "voice of nature." This is exemplified by the patent for the menstrual cycle meter. The voice reports the date and time of ovulation, in addition to stating the gender more likely to be conceived at a particular date or time during a woman's fertility cycle. Another example is the patent for the "train defect detecting and enunciating system," which "reports detected faults in English." These chips speak with a "voice of reality," reporting "fact" by the authority of the instrument that triggers them.
Another type of translator claims more urgency than those that
simply report fact. These raise an alarm and expect a response. They are
less factual, more contestable perhaps. Take the "Writing device with
alarm,"
15
an "invention which relates to a writing device which can emit a
warning sound -- or appropriate verbal encouragement -- in order to
awaken a person who has fallen asleep while working or studying"; or the
baby rail device which exclaims, "The infant is on the rail, please
raise the rail"... and then if there is no subsequent response from an
attendant caregiver, raises it automatically.
16
A product on the market that will politely tell you if there is
water on the ground is pictured in figure 19.2. These voice chips ask
for and direct the involvement of their human counterparts -- they
assume "interactive humans."
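The script of participation in the baby rail patent (speak, wait for a human, then act alone) can be sketched as a few lines of control flow. This is a hedged reading of the quoted description, not the patented mechanism: the function name `rail_monitor`, the `caregiver_responded` callback and the `max_warnings` limit are all assumptions introduced for illustration.

```python
def rail_monitor(infant_on_rail, caregiver_responded, max_warnings=3):
    """Return the sequence of actions the device takes.

    infant_on_rail      -- bool, whether the sensor is triggered
    caregiver_responded -- zero-argument callable returning True once a
                           caregiver has raised the rail (hypothetical)
    max_warnings        -- how many spoken warnings precede automatic
                           action (an assumed value; the patent text
                           quoted above does not give one)
    """
    actions = []
    if not infant_on_rail:
        return actions  # nothing sensed, nothing said

    for _ in range(max_warnings):
        # The device addresses the caregiver, not the infant.
        actions.append('say: "The infant is on the rail, please raise the rail"')
        if caregiver_responded():
            return actions  # the human acted; the device stands down

    # No response from the "interactive human": the device acts itself.
    actions.append("raise rail automatically")
    return actions
```

With `caregiver_responded=lambda: False` the result is three warnings followed by the automatic raise; with `lambda: True` it is a single warning. The voice is only the first half of the script, and the fallback shows that the human's participation is requested but not required.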
19.2. Flood warning.
These chips articulate not only simple commands, but series of
instructions as well. The CPR device
17
in figure 19.3 guides the listener through the resuscitation
process. And finally, these chips translate menus of choices into
questions. The car temperature monitor that asks the driver, "Would you
like to change the temperature?" translates from the visual menu of
choices, but in the process also takes over the initiating role. What is
lost or gained in the translation generates many questions: Does
translating from squeals to a more articulate alarm make it any more
alarming? How do spoken instructions transform written instructions? We
will try to address these questions later.
19.3. CPR prompt rescue aid.
There is a notable set of aberrant but related patents that
exist in this "alarming" category: "Alarm system for sensing and for
vocally warning a person approaching a protected object,"
18
"Alarm system for sensing and for vocally warning of an
unauthorized approach towards a protected object or zone,"
19
and "Alarm system for sensing and for vocally warning a person
to step back from a protected object."
20
What seems almost like hair-splitting turns of phrase to get three separate patents has little technical consequence: the second patent has the extra functionality to detect authorized persons (or their official badge), and the third can, but need not, imply a different sensor -- but each implies a different attitude. Although all patents are contestable, patent attorneys typically advise that you would not be able to successfully claim as separate patents an alarm system that warned at 15 feet and one that alerted at two feet. The "novel use" being patented here depends on the wording: the phrasing of the instruction that determines the arrangement of the sensor and alarm/voice chip. On the strength of a differently worded warning, the importance of the technically defined product description seems to have diminished. Perhaps ElectroAcoustic Novelties, the owner of the patents, has a linguist generating an alarm system for other phrases. These patents seem to be articulating the semantics of the technology. The intentionality of the system is its voice.
19.4. Voice changer.
Transformers
Transformers are distinct from patents that translate the voice.
They translate in the other direction -- not from the buzzes and squeals
to spoken phrases, but from the human voice to a less particular voice.
For instance: to assist the hearing impaired, a chip that transforms
voices into a frequency range the listener can still hear (usually a
higher frequency); or the "Electronic Music Device" effecting a
"favorable musical tone." "The voice tone color can be imparted with a
musical effect, such as vibrato, or tone transformed."
21
Into this category fall children's products like the "YakBak," popular in the 1997-1999 seasons, which plays back a child's voice with a variety of distortions; and the silicon-based megaphones that allow children to imitate technological effects, or sound like machines. These are voice masks, for putting on the accent of techno-dialect. The socializing voices broadcast on radio and TV, the voices of authority heard over public address systems, and the techno-personalities of androids and robots are practiced and performed by playing with these devices. This is also the category of voice chips that is concentrated in products for the hearing impaired or the otherwise disabled, and for children. These transforming devices act as if to integrate these marginalized social roles into a sociotechnical mainstream.
Speech as Music
Many of the patents that are granted specifically collapse any
difference between music and speech. This contrasts with the careful
attention given to the meaning of the words used in the alarm system
family of the translators. An explicit example is the business card
receptacle, which solves the problem of having business cards stapled
onto letters -- making them more difficult to read -- and provides an
"improved receptacle that actively draws attention to the receptacle and
creates an interest in the recipient by use of audio signals, such as
sounds, voice messages, speech, sound effects, musical melodies, tones
or the like, to read and retain the enclosed object."
22
Another example is the Einstein quiz game that alternately
states, "Correct, you're a genius!" or sounds bells and whistles when
the player answers the question correctly. This interchangeability of
speech and music is common in the patent literature presumably because
there is no particular difference technically. In this way patents are
designed to stake claims -- the wider the claim the better. The lack of
specificity and deliberate vagueness in the genre of intellectual
property law contradicts the carefulness of copyright law, the dominant
institution for "owning" words.
Local Talk from a Distance
One would expect chips that afford miniaturization and inclusion
in many low-power products to be designed to address their local
audience, in contrast to booming public address systems or broadcast
technologies. However, several of these voice chip voices recirculate on
the already-established (human) voice highways, imagined to transmit
information as you or I would. The oil spill detector
23
that transmits via radio the GPS position of the accident, or
"the cell phone-based automatic emergency vehicle location system" that
reports the latitude and longitude into an automatically dialed cell
phone
24
-- these are examples of a voice chip standing in for and
exploiting the networks established for humans, transmitting as pretend
humans. This class of products, local agents speaking to remote sites,
is curious because the information can easily be transmitted efficiently
as signals of other types. Why not just transmit the digital signal
instead of translating it first into speech? The voice networks are more
"public access," more inclusive, if we count these products as part of
our public, too. The counterexample, of voice chips acting as the local
agent to perform centrally generated commands, is also common, as in the
credit card-actuated telecommunication access network that includes a
voice chip to interact locally with the customer while the actual
processing is done at the main switchboard. Although the voice is
generated locally, the decisions on what it will say (i.e., the
interactions) are not.
Expressives
The realm of expressiveness, often used to demarcate the boundaries between humanity and technology, is transgressed by voice chips. There are, of course, expressive voice chips ranging from a key ring that offers a choice of expletives, swear words and curses to the "portable parent" that plays stereotypical advice and parental orders to the array of Hallmark cards that wish you a very happy birthday, or say, "I love you." These expressive applications also remind us of the complexities of interpreting talking cards. The meaning of these products is of course dependent on the details of the situation, rather than on the actual words being uttered: who sent the card, and when; or what traffic situation preceded the triggering of the key ring expletive.
19.5. Recordable pen product.
These novelty devices lead into the most populous voice chip category: those intended for children. The toy department store Toys "R" Us currently has seven aisles of talking and sound-making products -- approximately 45 different talking books alone, in addition to various educational toys, dolls and figures that speak in character. The voices are intended for the entire age range, from the earliest squeaking rattles for babies, to strategy games for children 14 years of age and up -- for example, the "Talking Battle Ship," in which you can "hear the Navy Commander announce the action" as well as "exciting battle sounds." The categorization of the multitude of toys extends far beyond "expressive" types, from the encouraging voices inserted in educational toys ("Awesome!," "No, try again" or "You're rolling now") such as the Phonics learning system, the Prestige Space Scholar, and Einstein's trivia game, to the same recordable voice chips used for executive voice memo pads. Chips for children are placed in pens, balls, and YakBaks; then there is the multitude of imitative toys that emulate cute animals, nonfunctional power tools and many trademarked personae, from Tigger and Pooh to Disney's recent animation characters Sampson and Delilah, Ariel the mermaid, and others.
This listing demonstrates a cultural phenomenon that
enthusiastically embraces children interacting with machine voices, and
articulates the specific didactic attitudes projected onto products.
These technological socialization devices have already been subject to
analysis, as in Sherry Turkle's study of children's attitudes towards
"interactive" products.
25
Barbie, for instance, was taken very seriously for what she had
to say about the polarized notions of gender she embodies. Since
Barbie's introduction in 1959 she has been given a voice three times
(each with slightly different technology); her most controversial voice
during the 1980s was censored for saying, "Math is hard." This
controversy rests on the assumption that voice chips are social actors
and do have determining power to affect attitudes -- in this case a
young Barbie player's attitude to math.
Although Barbie is currently silent, a myriad of talking dolls
remain, from Tamagotchi virtual pets, with their simple tweets, to
crying dolls that ask to be fed, and an ever-increasing taxonomy of
robotic dolls and creatures. The utility patent literature continues to
award "new and novel" applications in this area. One of the "new" voice
chip patents is for a doll that squeals when you pull her hair (dolls
that cry when they are wet or turned upside-down are technically
differentiated by their simple response triggers).
26
There is also a new doll patent that covers an "electronic
speech control apparatus and methods and more particularly for...
talking in a conversational manner on different subjects, deriving
simulated emotions... methods for operating the same and applications in
talking toys and the like."
27
The functional categories at work here are not linguistic, nor
do they resemble other ways in which a voice has been transformed into a
document -- for example, as in the copyright of a radio show. It would,
in other realms, be very difficult to get copyright on "talking in a
conversational way." In the material world the ownership of voice has
been redefined.
Recording Chips
This category encompasses many of the most recent voice chip products. It is the existence of these products that tests the nature of the communication we have with these technologies: do we, can we, converse with these products? This category draws from the other typologies but is distinguishable, for the most part, by the recording functionality that is the raison d'être of the product. The category includes those products that perform a more specific speech function that could not be alternatively represented by lights, beeps, or visual display, i.e., perhaps they are more communicative. This category includes the products that seem to hold dialogue.
The category's range of products includes the shower radio (see figure 19.6) that reinterprets bathing as a time for productive work, an opportunity to capture notes and ideas on a voice chip, consistent with the theory that there is an ongoing expansion of the work environment into "private" life. It also includes both the recordable pen and its business-card-size counterpart, the memo pad. Both the pen and the pad have many versions on the market currently, and they seem to be becoming more and more populous. The YakBak is the parallel product for children, deploying the same technology with different graphics, and to radically different ends.
The growing popularity of this category compared to the others
arouses a number of questions. Firstly, how do we understand why this
category is popular? Is the popularity driven by consumers because these
products are successful at what they do? And is what they do dialogue?
Or is it that the cost and portability of the technology make it an
affordable newtech symbol beyond what is attributable to its function
alone? Is this category popular because it alone can be marketed as a
work product?
28
And then conversely, why are these devices not more popular? Why
is it that only a few types of products become the voice sites? Pens,
photo frames and memo pads are all documents of a sort, in contrast to
switches or menu choices.
According to the patent literature, "the failure of the market
place to find a need for voice capability on home appliances has
discouraged the use of voice chips in other products,"
29
but lending the market agency for design assumptions is circular
logic. This does express, however, the sentiment that many more products
could have speech functionality than do.
Although miniaturization has made these products possible, the concept of embedding recording capability in products has been possible with other technologies. There has been no technical barrier to providing recording capability in cars, or in any of the larger products -- a refrigerator, for instance -- certainly since the existence of cheap magnetic recording technologies. Why is it that now we want consumer products that talk to us?
It is striking that the majority of talking products on the market currently are for conversing with oneself. Although deeply narcissistic, this demonstrates a commodification of self-talk that transforms the conceptualization of the self into subjectivity in relationship with our products. It suggests, without subtlety, that the relationship with these products is a relationship with the self. The constitution of personal and social identity by means of the acquisition of goods in the marketplace (Shields 1992) -- the process of identifying products that provide the social roles we recognize and desire -- cannot be excluded from the consideration of the social role of products.
Where Are the Voices Coming From?
The preceding typologies focus on what the voice chips say rather than where they say it. However, because voice chips are distributed throughout the product landscape, where they appear (and disappear) is also interesting to examine. Although a very detailed analysis could yield an interesting geography, it is beyond the scope of an essay intended to generate preliminary questions about why they say what they do where they do.
The automobile industry, a highly competitive, heavily patented industry that quickly incorporates cheap technical innovations (where they do not substantially alter the manufacturing process) is a place to expect the appearance of voice chips. Indeed, there was early incorporation of voice chips in automobiles. A 1985 luxury car, the Nissan Maxima, came with a voice chip as a standard feature in every vehicle. The voice chip said, "Your lights are on," "Your keys are in the ignition" and "The door is ajar." There were also visual displays that marked these circumstances, yet the unfastened seatbelt warning only beeped. By 1987, you could not get a Nissan Maxima with a voice chip, even on special request. In this case, the voice was silenced, but only for a time, reemerging with a very different role to play in the automobile.
By 1996, the voice chips reappeared in the alarm system of cars. Cadillac's standard alarm system uses proximity detection to warn, "You are too close. Please move away." In this 10-year period the voice shifted from notification to alarm, a trajectory from user-friendly to a distinctly unfriendly position. It is also interesting to note another extension of the action/reaction voice chip logic, if not the voice itself. The current Nissan model no longer notifies that the lights have been left on, it simply turns the lights off if the keys are taken out of the ignition. The courtesy of notification has been dispensed with, as well as the need for a response from the user. The outcome of leaving the lights on is already known, so the circuit will instead address that outcome. This indicates that when the results are exhaustively knowable, the need for interaction diminishes.
Of the seven patents specifically for vehicles,
30
all bar one are intended for private and not public
transportation. However, in late 1996, voice chips began to appear in
the quasi-private/public vehicles of New York's Yellow Cabs. After
debate about what ethnic accent
31
should be ascribed to the voice that reminded you to "please
fasten your seatbelt" and "please check for belongings that you may have
left behind," the prerecorded (68k-quality) voices of Placido Domingo
and other celebrities won the identity contest, and have since
proliferated into many well-known New York characters, from sports stars
to Sesame Street's Elmo. The voice chip in this quasi-public sphere
adopted a broadcast voice, albeit one of poor quality, or a
microbroadcast voice. Whether they are effective in increasing seatbelt
wearing or reducing the number of items left in the cabs in any accent
is less certain than the manner in which they articulate the social
relations of the cab. The voice chips address only the passengers and
assume that the drivers don't hear them, although it is the drivers who
bear the brunt of their monotony. Their usefulness delegates the human
interaction of service and rests on the assumption that the chips are
more reliable and consistent in repeating the same thing over and over,
no matter the circumstance, and that the customer responds to Placido
Domingo's impassive, recorded reminder more than they would to a driver
who may be able to bring some judgment to bear upon the situation. In
the transformation of the passenger into a public audience (not unlike
that of a radio station) the product or service itself is not attributed
with the voice. Instead the voice becomes identified with a celebrity.
In the transportation sector alone we can see the voice chip
develop from an anonymous to an identifiable voice, and from a polite
notification to an alarm for deterring approach. Cars have struggled
with the problem of talking to humans and seem to have exploited the
nonhuman qualities of their speech
32
-- the things that the technology is better at doing, like
faithful repetition or careful reproduction of the identity of another
-- rather than any particularly human attribute of their speech. It is
also notable that talking cars have not endured.
In the health industry, another social sector highly saturated with electronic product, the distribution of voice chips is almost exclusively on one side of the home/professional, expert/non-expert divide. Although in number there are more products made for hospitals and clinics than the home market, the placement of voice chips is inversely represented. In home products, from the menstrual cycle meter to the CPR device, electronic voices seem to play the role of the health professional or "expert." In addition, the large number of products for the visually impaired are intended for patients and not professionals (a demographic with more spending power); see, for example, the addition of a sound indicator to the syringe-filling device "for home use," which testifies that the user of this device is imagined at home, without the help of the professional for whom the product can stand in. Ironically, the most vocal equipment in this industry are the relaxation and stress reduction products, e.g., those by which you talk to yourself or are reassured and relaxed by the sounds of the ocean (see figure 19.7). The reassuring factuality of these technovoices focuses its attention on the lay audience. These are preliminary observations of the voices introduced into transportation and in the health and medical areas, and are cursory at best. But they demonstrate that for the voice to make sense, the technological relationship itself needs to make sense. The speech from devices is as culturally contingent as language.
There are many other areas in which the introduction of voice
chips provides insight into what technological relationships make sense.
Their incorporation into work products articulates the transformation
and reorganization of work structure, particularly into "mobile" work
(Zuboff 1984).
33
They speak to a culture's popular notions of where work gets
done, a culture in which providing a product to take voice notes while
in the shower makes sense. The voice chip population of areas of novelty
products, children's toys, and educational products, and of the safety,
security and rescue products also maps the social relationships we
engage in with our products. Conversely, where we don't find voice
chips, for example in biomedical equipment for health professionals,
also maps the social relationships that the technologies play out --
they stand in for experts with an authoritative voice one wouldn't use
on a colleague. However, to understand the dialogue we are having with
these voices requires us to also examine how we listen.
Discussion: What Do the Voice Chips Actually Mean When They Speak? Do They Actually Work?
Voice Chips as Music
The preceding categories survey what voice chips say, where it is they say it, and to whom they say it. To understand what the voice chips are saying, however, means engaging strategies for listening that may not be automatic. Products, with or without voices, are well-camouflaged by what Clifford Geertz described as "the dulling sense of familiarity with which... our own ability to relate perceptively to one another is concealed from us." Modes and strategies for listening that can help us hear these voice chips can be borrowed from music. Music, unlike machinery, is commonly understood as "culture," or a cultural phenomenon, and its analysis looks very different from the analysis of technology. The structure of participation enables multiple listeners (vs. a "user"); the "use" of music is widely divergent; and interaction with it is more specifically understood as interpretation (we don't speak of the task's decomposition, efficiency or effectiveness). Perhaps the most glaring difference is the concept of improvisation, which is prevalent in theories of music, yet is unusual in the analysis of human-machine interaction. (The striking exception is Lucy Suchman's work, which we will discuss in depth later.) Is it that improvisation is absent from our interaction with machines, or our models for designing interaction?
Our strategy here is to avoid the contested terms "reality," "progress," and "rational choice" that usually inform the analysis of technology; thus we can provide more emphasis on the interpretative experience. Additionally, some of the voice chip products themselves demonstrate an indifference to the distinction between speech and music, by blurring the distinction between words and beeps (see the "speech as music" category of products).
Music, like product, is also easily recognized as involved in
the production of identity. That is, subcultures identify through and
with music (Fabbri 1981). Where technological product is presented to
the consumer, at what Cowan calls the "consumption junction," we are at
such an identity-producing site.
34
For this reason it is difficult to ascribe any one particular
meaning or mode of listening to the voice chips. In the wide spectrum of
musical styles available, each piece of music can and does exist in
widely different listening situations. This means that each listener has
a variety of listening experiences and an extensive repertoire of modes
of listening. The hearing person who listens to radio, TV, the cinema,
goes shopping to piped music, eats in restaurants, or attends parties,
has built up competence in translating and using music impressions. This
ability results not from formalized schooling but from the
everyday listening process in the soundscape of the modern city. Stockfelt
asserts that mass media music can be understood as something of a
nonverbal lingua franca,
35
without of course denying the other more specialized musical
subcultures to which we may simultaneously belong.
Listening modes are not, of course, limited to music, and nor
for that matter is a musical experience limited to music. Even so,
teasing out the musical modes of listening from listening modes that
focus on the sound's quality, its information-carrying aspect, or other
nonverbal aesthetic modes is difficult. The "cultural work" of using
unmusical sounds as music is not uncommon; for example, Chicago's Speech
Choir, John Cage's 4'33", The Symphony of Sirens,
36
and the sounds created with samplers, particularly for
percussive effects. At the same time, the sirens, speech choirs, etc.,
do not lose their extramusical meaning as they become music. Conversely,
using musical sounds for nonmusical ends is the conceit of many voice
chip applications.
The two products in figures 19.8 and 19.9 demonstrate the
confusion of musical listening vs. other modes of musical sound
consumption. The Soother uses unmusical sounds for musical effect while
the Funny Animal Piano uses musical sounds to respond to toddlers' feet.
The alignment of voice chips with music has interesting implications for
their linguistic claims; if they produce meaningful speech, why don't
they differentiate between music and speech?
37
Is it that the social position of the product determines the
meaning of the sounds and utterances? Indeed, if the speech they produce
is linguistic, then when the voice of the alarm system warns us, are we
altering the meaning of the sound, whether it resembles speech or siren?
Or can we expand linguistic theories to accommodate all meaningful
sounds that humans or machines make? These questions about how we
understand the sounds that voice chips produce complicate the
attribution of agency to these "things with voices." Voice chips seem to
frame sound as a prepackaged cultural product, the identity of which is
located in the manufactured materiality. At the consumption junction
these voices are heard in the buzz and squeal of products, but can we
call it language?
Voice Chips as Speech
What do voice chips tell us about our understanding of language?
The voice chips' languages provide a picture of our on-the-ground,
in-the-market operationalization of our explicit understanding of
language. Even though some voice chips use music and speech
indistinguishably, the words that they say cannot be overlooked. Voice
chips talk and say actual words, but how do we understand these voices
as communicative resources? Are they "speech acts" as defined by
linguistic theorists?
38
Speech acts
39
are used to categorize audible utterances that can be viewed as
intending to communicate something, to make something happen, or to get
someone to do something. To construe a noise or a mark as a linguistic
communication involves construing its production as a speech act (as
opposed to a sound that we decide is not communicative). Categories of
speech acts are given next (examples quoted from voice chips).
1. Commissives: The speaker places him/herself under obligation to do something or carry something out; for example, in a telephone system, "I will transfer you to the operator";
2. Declaratives: The speaker makes a declaration that brings about a new set of circumstances; for example, when your boss declares that you are fired, or when the car states, "The lights are on";
3. Directives: The speaker tells the listener to do something for the speaker; "Please close the door," "Move away from the car";
4. Expressives are without specific function except to keep social interactions going smoothly, like "please" and "thank you," or the more expressive "I love you."
Each of these categories is performed by the voice chips
examined in this essay, as are other verbs and verb phrases associated
with the wider category of illocutionary acts: "to... state, assert,
describe, warn, remark, comment, command, promise, order, request,
criticize, apologize, censure, approve, welcome, express approval, and
express regret."
40
The category in which voice chips are least convincing is the
declarative that requires the reliability or trustworthiness of the
agent (human and nonhuman) to understand whether or not this thing is
going to come about. We note that the declarative notification that your
car will turn off the lights has been removed, and the car simply enacts
the turning off of the lights. The voice chips also tend to inhabit the
present tense, or the very recent past tense. Future tense is less
common, perhaps because the autonomy of a system is held in check by the
interactive scripts. And they also prefer the first person, which
supports the idea that they are not referring beyond themselves.
Searle defines the "speech act" as an utterance (action)
intended to have an effect on the hearer, with preconditions and
effects. This has been criticized by other theorists who have pointed
out that meaning is imparted by the work of an "interpretative
community."
41
The limitation of speech act theory in explaining voice chips is
that it ascribes the most intention to the least animate thing in the
interaction. In its failure to elaborate on interpretation, it provides
no place for information about the significance of any particular
assertion, warning, or, more generally, any speech act. Voice chips
amplify this problem because they can inhabit so many different
situations yet repeat the same thing. Because the voice doesn't change,
all flexibility in understanding to accommodate the changing
circumstances needs to be accounted for by the listener's
interpretation. The case of the Cadillac's alarm voice illustrates this.
During a demonstration of the Cadillac's alarm system, the salesman
instructed me to move away from the car and then approach it again.
Despite coming as close as I could to the car, the voice did not sound.
On hearing no voice, the demonstrator toggled the key fob switch. I
approached again and the voice sounded. In the first approach, the voice
chip's silence was interpreted as "the alarm is not working or is not
on." In the second approach, the voice communicated, "Now the alarm is
on and functioning."
By staying in the proximity range of the alarm system, the voice
answered several questions, despite simply repeating the same words:
"move away..." What is the area range in which we are detected? Will the
alarm keep repeating, or will it escalate its command? Although moving
away from the car stopped the voice, we also came to understand the
types of motions that it detected, the speed of approach, what happened
when we physically shook the car, etc. The simple interaction with the
car and its voice demonstrates the interpretative flexibility that
transcends the directive of the words stated, and how, as hearers, we
respond to the voice's imperatives. So in asking how we understand the
significance of speech performed by the voice chip, we are asking
whether speech is abstractable.
42
In other words, is there a difference between talking with a
voice chip and talking with something (human) with which we share
capacities other than speech?
Is Speech Abstractable?
Speech in action, rather than in theory, is conversation. If we
are to claim that we interact with voice chip speech, we need to examine
the fundamental structure of conversation as the primary model for
interaction.
43
One of the voice chip patents claims the rights for "electronic
apparatus(es) for talking in a conversational manner on different
subjects, deriving simulated emotions which are reflected in utterances
of the apparatus." Although the other voice chip products make no
explicit claim to be conversing, they do claim to be "interactive."
44
Lucy Suchman's (1987) work, however, proves more appropriate to describing the interactive "speech" of voice chips. Her work focuses on the inherent uncertainty of intentional attributions in the everyday business of making sense via conversational interaction with another machine, the photocopier. Like voice chips, she characterizes these machines by the severe constraints on their access to the evidential resources on which human communication relies. She elaborates the resources for constructing shared understanding, collaboratively and in situ, rather than using an a priori system of rules for meaningful behavior.
Suchman shows that the listening process of situated language
depends on the listener to achieve the shared understanding of
successful communication. The listener attends to the speaker's words
and actions in order to understand. Although institutional settings can
prescribe the type, distribution and content of talk (e.g.,
cross-examinations, lectures, formal debates, etc.), they can all still
be analyzed as modifications to conversation's basic structure. Suchman
characterizes one form of interactional organization (or structure of
participation) -- in this case the interview -- as a) the pre-allocation
of turns: who speaks when and what form their participation takes; b)
the prescription of the substantive content and direction of the
interaction, or the agenda.
45
More generally she describes a system for situated
communication, or conversation as "an organization designed to support
local endogenous control over the development of topics or activities
and to maximize accommodation of unforeseeable circumstances that arise,
and resources for locating and remedying the communication troubles as
part of its fundamental organization."
Conversation with a Voice Chip?
Prerecorded voices on voice chips are ill-equipped to detect communication troubles, and although they are usually triggered by local inputs, the content of what is said does not change. They will repeat the same thing, or a set of prerecorded phrases, over the indefinite range of unpredictable circumstances. Although they localize control, they for the most part do not localize the direction of speech.
The applications that seem closest to Suchman's characterization
of conversations are the products that include "dialogue chips." These
chips hand over control of the content of talk to the listener,
fulfilling Suchman's characterization of conversational interaction in
this respect. The listener literally controls the speaker and sets up a
relationship with the device. Further, the dialogue chip products use
the turn-taking of conversation collaboration, not as the alternation of
contained segments of talk in which the speaker determines the unit's
boundaries, but in the manner illustrated by the joint production of
a single sentence (Suchman 1987, 81, 125).
46
The "turn-taking system for conversation demonstrates how a
system for communication that accommodates any participants, under any
circumstances, may be systematic and orderly, while it must be
essentially ad hoc" (Suchman 1987, 78). The alarm clocks that
incorporate voice recording functions are a new example of how that
control is extended over time, but remains very local.
The response to voice chips, like the applause at the end of a play, is not a response to the final line uttered, or the fact that it just stopped. "The relevance of an action... is conditional on any identifiable prior action or event, insofar as the previous action can be tied to the current action's immediate local environment." The conditional relevance does not allow us to predict a response from an action, but only to project that what comes next will be a response, and retrospectively, to take that status as a cue to how what comes next should be heard. The interpretability therefore relies on "liberal application of post hoc ergo propter hoc" (Suchman 1987, 83). The response that a listener can have to the voice of the train defect enunciation system is not only a response to the words uttered by the product. It also involves a complex series of judgments that include assessments of the information available and how to integrate this into what else the listener knows of the event at hand.
The understanding of talking products does not come so much from the words found at what is popularly conceived as the human-machine interface, but beyond this. The voice is a voice embedded in a network of local control, sequential ordering, interactional organization and intentional attribution. The recordable chips with which we can have a dialogue with ourselves, in which the control remains local, best demonstrate this. These products frame the understanding that we are talking with ourselves through our products.
Whereas dialogue is conversation with another agent, one who is somehow there, monologue is characterized as written speech, inner speech or rehearsed speech. Dialogue implies immediate unpremeditated utterances, whereas monologues are written speech lacking situational and expressive support that therefore require more explicit language. Questioning the abstraction of speech in voice chips does not demonstrate that speech is uniquely human. On the contrary, the stabilized voices of hardware-based speech are subject to reinterpretation, and rediscover the listener's capacity, not the speaker's incapacity. It may simply be viewed as a distinction between dialogue and monologue, neither of which is more or less human. Because we inhabit both sides of a dialogue, we can understand the voice chip's position and compensate, so as to perform dialogue with ourselves.
Can We Summarize What They Said, and What Sort of Response This Suggests?
This essay has so far examined the unique position of voice chip products, differentiating them from the background noise of contemporary culture and other technological configurations that deliver speech. These hardware-bound voices are not broadcast and have no stable identity. The survey of what the voice chips say produces typologies that suggest further modes of investigation into how we understand and use these voices, where they appear, and what their voices mean. The short product life cycle of the consumer electronic devices they inhabit positions these products as the E. coli of sociotechnical relations and can demonstrate the formation of product identities and product voices in our shifting understanding of machine interaction. The appearance of voice chips in some types of products and not others, some social sectors and not others, is open to further investigation. Detailing these would reveal the voice chip's oral history of the process by which the very ephemeral social device of speech becomes stabilized and entered into systems of exchange.
Now I will introduce a complementary examination of speech
recognition chip sets, around which there is much more recent product
development activity.
47
Although the voice chip's applications may have peaked, the
equivalent low power, distributed speech recognition function may be
just beginning. Watching their development and deployment carefully,
asking, "Now that we can talk to our products, what do we say?," may
allow us to hear the social scripts they presume. Can these provide
evidence that symmetry between the ambitions for human and nonhuman
attributes holds?
However, because we are more self-conscious about speaking than
listening, this may be an instrument through which to observe our own
roles in sociotechnical interaction. In order to prime this
investigation, and because speech recognition chip sets are not yet (and
may never be) widely available, the author hosted a competition to
survey a range of applications. The competition was advertised on a
large mailing list (12,000 members): the Viridian list owned and
carefully managed by science fiction writer Bruce Sterling. The list is
a forum for discussing technological futures with an emphasis on
addressing environmental problems. Entrants were asked to propose speech
recognition interfaces to an existing product (the prize was a voice
note taker and the prototyping of the proposed device). Just under three
hundred designs were submitted and will soon be available on the web
site www.cat.nyu.edu/neologue. While these entries cannot be claimed to
represent the
conceptions of human-computer interaction distilled by the social forces
of the market, manufacturing and advertising, they can be treated as
evidence of technological desires and cultural expectations.
Now That We Can Talk to Our Devices, What Do We Say?
The most striking feature the competition entries demonstrated
was the explicit intention to effect social change with technological
change. This may or may not be peculiar to this list (which might be
tested by hosting a similar competition in other contexts); however,
this is consistent with a popular techno-determinism that attributes
social change to technological change and under-represents the dominant
forces of product innovation that can be attributed to sustaining and
continuing a corporate entity.
48
This also contradicts other popular understandings and lay
rationalizations that new products arise to address preexisting social
"needs" or profit opportunities, to follow fashion, or to optimize existing
applications.
The trends illustrated by the proposed products and product
interfaces can be summarized as predominantly a desire for social
and individual envisioning and regulation. This is apart from the
ultimate (and theatrical) control fantasies that this particular type of
interface engages (e.g., on saying "Showtime," the lights dim and the
television and VCR turn on
49
), or the suggestions that dispensed with buttons (e.g., the TV
remote
50
) without explicating what words to use. Entries that did not
explore what happens in the translation from finger-button to
voice-button and the social (and observable) spectacle this makes did
not render the sociotechnical relationship this investigation was trying
to identify. There were also the applications that were similar to voice
chips - with a similar interchangeable use of speech/buzz (e.g., the
cookie container that recognized children's footsteps to trigger
singing, or the TV remote that called out "Polo" when it heard "Marco"
51
).
In addition to self-observation, regulation and control, the applications took on moral, physical, emotional, and consumption monitoring and regulation, in such forms as:
A wallet that recognized words and dispensed consumption
regulation advice
52
;
A pocket device that recognized the phrase "now what am I
supposed to do?" and responded "with a gentle reminder to adhere to the
user's selected ethical set"
53
(regulation of consumption);
A coffee maker that recognized "good morning" ("when you respond, the chip analyzes your tone of voice [for sluggishness]" and "adjusts the strength of the coffee...," thus automating the physical regulation on which Starbucks has so successfully capitalized);
A more extreme circumvention of one's own self-judgment: a
device that monitors bloodflow and when detecting stress whispers
"`relax,' dims the lights a bit, and releases soothing aromatherapy"
54
;
And the very opposite of an alarm clock, a device that on hearing, "Why am I still up?" "...should cause every light and entertainment system in my house to shut off for 4 hours."
An example of self-observation was a voice-triggered "nocturnologue," which would record any sleeptalking.
These devices to regulate the self, presumably with the goal of
social synchronization, do not necessarily imagine the devices as
"companions" and attribute to them a more social performance, although
there is a small subset that does. This subset of entries realizes the
"technology-should-be-more-humanlike" expectation, which reflects a
similar school of Human Computer Interface (HCI) designers working
towards adaptive interfaces that can recognize and respond to different
emotive states as an explicit strategy to be "user-friendly." The best
example is a comedic sidekick (Jerry Lewis) built into a watch and ready
with smart rejoinders to recognized phrases (when it hears "nice hair,"
the device says "cha cha cha"). This functionality would have to be
described as reinforcing social performance.
55
This seems both similar to other identification relationships
(cars, furniture, home), and different, insomuch as it is directly
inserted into the conversation.
The promise of emotive interfaces that recognize and respond to
how you are feeling,
56
if these imagined interfaces are any evidence, was demonstrated
and expressed in words that describe an ambivalence about, even resentment of,
technological relationships: for example, being able to say "shut up" to
your television set
57
or to your telephone
58
(not "turn off," not "close/finish" or other ending command).
Clearly, this complicates the sort of understanding we can develop about
a person's relationship to a purchased product -- and purchasing is of
course the predominant form of "feedback" that companies and designers
get about products. These voices make audible a strongly polarized
ambivalence. There was no suggestion of saying, "I love my TV," to turn
it on.
Another device was proposed for automated prayers: triggered by
saying "pray for me," it was customizable for different religious
"preferences."
59
Prayers suggested ranged from excerpts from Psalm 23, to those
for "cynical hipster types [who] might want their in-dash prayer boxes
to recite William S. Burroughs' Thanksgiving Prayer (`Thanks for
Indians, to provide a modicum of challenge and danger... thanks for a
nation of finks...,' etc.), and some guilty white liberals (some
Viridians, even) might want theirs to apologize for driving around in a
vehicle spewing noxious fumes into the atmosphere."
60
This is more than an interface that recognizes and responds
appropriately to user emotional states; actually the entertainment is in
delegating the emotionality or at least religiosity itself to the
device.
This impulse is replicated in the delegation of care, social
niceties and other arational and noncalculative tasks to the
computational devices; for instance, a speech recognition chip that
recognizes the sound of flatulence and politely apologizes
61
to the room, relieving any one person of the responsibility to
bear the embarrassment. Another entry, as an extension of
Tamagotchi-like automated care, suggested using a voice recognition
chip to train a parrot to speak.
62
There were actually several other entries exploring information
technology for animals, which seems to be evidence against a voice
interface imagined as "humanizing" the computer, and more a
demonstration that the ready treatment of animal noises as recognizable
sounds imagines these as functionally equivalent in every way to English
words. Speech recognition, reinterpreted as sound recognition.
Finally, perhaps the most interesting or novel constellation
of projects comprises the designs that use the opportunity to script
interactions as a form of propaganda -- propaganda that is distributed
(enacted) beyond traditional and corporate monopolized media channels.
The portable ideologue could play the role of (and even potentially look
like) the soapbox.
63
Another device, the BackTalk, is a portable billboard for one's car. It is triggered by simple trigger words and, as proposed, uses deep-set LEDs to display a message specifically to the driver behind one's own car: "Thanks for letting me in," "Baby on board," or presumably any other bumper sticker expression. This is intended to influence others, and thus belongs in this category of the regulation (or at least influencing) of others.
These propaganda projects take very direct and explicit forms,
including cell phones which, for example, cut out if they hear you say,
"Yeah, I am on the cell phone," "Yeah, I am in the village," or "Dude,"
64
or monitor for swear words, or take other efforts to silence
loud or otherwise "inappropriate" private voices in public spaces.
This impulse for social observation is illustrated by a museum display designed to collect responses (what the entry calls clichés) so that it "will grow as an open-ended accretion or demonstration of the clichés uttered by thousands, tens of thousands, millions of art consumers." This collection is itself the spectacle; the museum exhibit is rethought as an instrument for the collection of comments.
Another suggestion was the "crowd morality barnacle," which is a
device intended to influence mass behavior -- in the given example, a
riot. The CMD is intended for distribution throughout a crowd and will
respond to key riot phrases; for example, it might respond to "smash"
with "be careful"; "burn" with "it might explode"; or "get them" with
"where are the children?"
65
This is a different conception of regulation than the examples
that illustrated the control of self.
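The phrase-and-response pairs in the riot example can be read as a simple lookup table. The following is a minimal sketch in Python, not the entrant's design: the names `RIOT_RESPONSES` and `respond` are hypothetical, the recognizer's output is assumed to arrive as plain text, and only the three pairs quoted in the entry are included.

```python
# Hypothetical phrase-to-response table; the three pairs are the ones
# quoted from the competition entry, everything else is an assumption.
RIOT_RESPONSES = {
    "smash": "be careful",
    "burn": "it might explode",
    "get them": "where are the children?",
}


def respond(recognized_text: str) -> list[str]:
    """Return the reply for every key phrase found in the recognized text."""
    lowered = recognized_text.lower()
    # Simple substring matching stands in for the speech recognition step.
    return [reply for phrase, reply in RIOT_RESPONSES.items() if phrase in lowered]


if __name__ == "__main__":
    for reply in respond("Burn it and get them"):
        print(reply)
```

The sketch makes the script visible: each trigger is paired with exactly one counter-utterance, and anything outside the table is met with silence.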
To effect self-control, the designs went beyond turning electronic devices off or regulating the self with insistent and unrelenting reminders (e.g., correcting a habit of speech or cutting the "ums" out of the story) to quite novel punishment. These punitives enacted on the self included squirting water in one's ear, triggering electric shocks, and dribbling water down one's leg. There were few viable designs that offered a simple reward rather than punishment.
For affecting the social body, there were no physical punitives; the reward seems to have been the social behavior itself, or at least the evidence of it (as in the spectacle of clichés). This desire to see a social spectacle is repeated often, and I would like to argue that it is a recurrent theme in the networked context of information technology.
The final category of devices relies on double entendres and the
multiple meanings of words, and demonstrates that speech interface
cannot be understood as making the machine more human. Rather, it is
clearly exploiting the different parsing, context sensitivity and
repeatability of human-vs.-machine models of cognition. For example, to
trigger the discreet recording of conversations, one entry describes a
recorder that is triggered by "What's up, amigo?" This deployment of an
unusual (relative to the user and context of use - i.e., no one else is
likely to say it) filler is used to initiate conversation and direct
attention to the people being addressed, but is simultaneously being
used for an instrumental purpose: as the "on" button. Likewise, the
"Don't hurt me, just don't hurt me" cell phone/GPS position locator/911
dialer proposal
66
uses the self-defense phrase to dial for help without
alerting the presumed attacker, who is presumed to interpret the plea at
face value - second-guessing a reasonable or "usual" response in a
threatening situation. In these interactions the user is able to
simultaneously employ multiple meanings of his or her words. Clearly the
speech chip is here being used so that the words used to interact with
the machine are understood to be different from the speech used to
interact with humans.
It is also notable that there were categories of speech not explored by these interfaces. Consider the linguistic communication defined as a performative. A performative, such as "I do," is a highly codified and stabilized utterance that communicates a future commitment or social contract (Butler 1993). Because it is a stabilized social technique, it would be technically pragmatic -- the problem of unlimited variation of phrasing is solved. The absence of designs to address this sort of statement is curious, and worth further investigation.
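The claim that a performative sidesteps the problem of phrasing variation can be illustrated with a closed-vocabulary matcher. This is only a sketch under stated assumptions: the names `PERFORMATIVES`, `normalize` and `is_performative` are hypothetical, "I do" comes from the text while the other two phrases are illustrative additions, and the recognizer is assumed to deliver a text transcript.

```python
import string

# Closed set of codified utterances. "i do" is the example given in the
# text; the other two entries are illustrative assumptions.
PERFORMATIVES = {"i do", "i promise", "i apologize"}


def normalize(utterance: str) -> str:
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    stripped = utterance.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())


def is_performative(utterance: str) -> bool:
    """Exact match against the fixed set; no paraphrases are accepted."""
    return normalize(utterance) in PERFORMATIVES


if __name__ == "__main__":
    print(is_performative("I do."))
    print(is_performative("I suppose I might"))
```

Because the utterance is already stabilized by social convention, the matcher needs no model of paraphrase: membership in a small fixed set is enough.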
The categories of interaction demonstrated by this brief survey of voice chips are not discontinuous or radically different from other contemporary consumer technologies. The observation of self (or one's own property) is embodied in the consumer video camera market and surveillance systems; self-regulation has extended from alarm clocks once a day to alarming cell phones carried with you and ready for all alarming occasions; handhelds directly regulate sleep and activity; VCRs and TiVo capture, regulate (in order to extend) and meter out media program consumption.
Social observation is also embodied by surveillance systems, but although surveillance looms large in the popular imagination, it has not been used to see or envision the social mass, or one another. The problem of seeing the social body has remained an architectural problem, solved by spectacles of plazas and malls: public and quasi-public places. What the voice chip most clearly demonstrates is that it is this area in which there seems to be the most interest: being able to view mass behavior. The traditional broadcast (e.g., television) media had very little interest in rendering the public to itself, and as such the rise of phone-in and "reality television" genres suggests that even in the context of high production value broadcast media there is a cultural appetite to "see" each other, no matter how contrived. Collaborative filtering models, such as that popularized by the Amazon people-who-bought-this-book-also-bought-x button, show us each other's behavior, making it a shared experience - to see where others have been. In the micro-casting of a speech recognition-triggered rear window car display, we likewise see this desire expressed through the car, and the car's peculiar access to the public space of freeways. This is a public space where the rules of communication between and among people are highly constrained (cf. the plaza). This is not the interactive experience of the self with the self, or the self with the machine, but the machine as a proxy for interacting with the social. This is a peculiar and interesting way to think about human-machine interaction.
Conclusion
The interactions we hear with voice chips do not disambiguate the buzzes and beeps used by speechless machines, but speech recognition products do reinforce the idea that we use speech for machines and speech for humans differently, and simultaneously. The other applications also re-imagine how we understand their functions. The products discussed do not exploit the mechanistic, logical and fully controllable functions of machines, but treat them as complicated multifarious social actors. There is a clearly stated desire to enlist these new technologies and product interfaces to promote explicit desired social transformations. We also see here the ambivalent relationship we have with and for our current technological devices.
This essay has explored why listening to voice chips and speech recognition chips might give us a way to examine human-machine interactions in situ. Much real complexity of social and technical interactions is lost in the tradition of examining them within controlled laboratory contexts, and ethnographic analysis can be too rich (though the theoretical perspective that has developed from ethnographic insights, that privileges the improvisational nature of real-world applications, enables us to focus on how speech and turn-taking are used to coordinate the interaction between machines and humans).
This initial analysis is presented in order to set up some preliminary ideas and interpretations, so that as (or if) speech recognition chips become more widely distributed, we can "tune in" to this particular historical moment and hear what it is we expect, want and bring to our human-machine interactions. There are few instruments that give us this viewpoint. Listening to our daily interactions with products can work to contest and complicate the dominant methods used to describe technological trends and patterns of product innovation: demographically driven mass market research and the capture of consumption behaviors at the point of purchase. The examination of speech recognition applications gives unique access to the assumptions, expectations and the imaginative work of products and the interactions they script.
Further examinations of voice chip and speech recognition products and patents can extend what has only just begun. In understanding how voice chips abstract speech, we can examine what we understand interaction to be, and hence how we design and frame interactions in products of daily use, reproducing our understanding of human technical relations. The products make obvious the design assumptions with which they are built, but further investigation of the details of their use will help to elaborate how these micro-interactions perform and realize actual social roles and social structures. A detailed use-analysis of any one of the products could provide further insight into this sort of investigation.
Voice chips also raise other questions. Because they slice through many social and economic sectors but are still in a manageable population of products, they can be used to illustrate the iterative and continuous process of technical change that is intimately involved in a technology's sociality, in contrast to the radical discontinuities of technological change through discovery and paradigm shifts (Dosi 1982, 147-162; Clark 1985, 235-251). They realize a recombinant model of technological change. Furthermore, for the same reasons, they can be used to examine the changing social position of these products in relation to the configuration of power and work relations (Zuboff 1984), and the transformations of the market groups and users that these products presume.
Finally, in the tradition of Turkle's examination of children's understanding of their interactive machines, children's products with voice chips can illustrate what childcare roles we delegate to machines, and articulate clearly the hardwired (per hardware, not neurons) expression of consumption identity of children.
For these reasons, this essay marks the beginning of a project
to collect an ongoing database of products with voices or speech
recognition that appear on the market, or receive patents.
67
As a longer archive of product voices, this may prove a valuable
resource for the examination of changing sociotechnical relations, even
in the event of the products falling silent and voice chips and speech
recognition being abandoned altogether.
The voices of the products reflect back the voices and interactions we have projected and programmed into them, returning them for our reinterpretation. One mode of interaction we have with the consumer products that exist and are imagined at the time of this essay is a dialogue with a monologue. Command and control scripts are more common than improvisational scripts, but other forms of interaction are being scripted. By literally listening to what hardware has to say, and what we say to it, we may better ground our assumptions of interaction in reflexive reinterpretation. Furthermore, we can see from this examination that these technologies can be seen as structures of participation, organizing often indistinguishable human-machine interactions and using them to extend the predictability of individuals and coordinate their interactions. We have an ongoing opportunity, even method, to hear and understand our technologies in terms of these structures of participation, in our own language, and to see these technologies as a distributed system of voices and ears.
References
Althusser, Louis (1971). "Ideology and Ideological State Apparatuses (Notes Towards an Investigation)." In Lenin and Philosophy and Other Essays. New York: Monthly Review Press.
Austin, J.L. (1962). How to Do Things with Words. Oxford: Oxford University Press.
Bardini, Thierry (1997). "Bridging the Gulfs: From Hypertext to Cyberspace." Journal of Computer-Mediated Communication 3, no. 2 (September 1997). http://www.ascusc.org/jcmc/vol3/issue2/bardini.html.
Benveniste, Émile (translated by Mary Elizabeth Meek) (1971). "The Nature of Pronouns." In Problems in General Linguistics. Coral Gables: University of Miami Press.
Butler, Judith (1993). Bodies that Matter. London: Routledge.
Callon, Michel (1995). "Four Models for the Dynamics of Science." In Handbook of Science and Technology Studies, edited by Sheila Jasanoff, Gerald E. Markle, James C. Petersen and Trevor Pinch. Thousand Oaks, CA: Sage Publications.
---., and John Law (1982). "On Interests and their Transformations: Enrollment and Counter-Enrollment." Social Studies of Science 12 (1982): 615-625.
Clark, Kim (1985). "The Interaction of Design Hierarchies and Market Concepts in Technological Evolution." Research Policy 14 (1985): 235-251.
Cowan, Ruth Schwartz (1987). "The Consumption Junction: A Proposal for Research Strategies in the Sociology of Technology." In The Social Construction of Technological Systems, edited by Wiebe E. Bijker, Thomas P. Hughes and Trevor Pinch. Cambridge, MA: The MIT Press.
Dosi, Giovanni (1982). "Technological Paradigms and Technological Trajectories: A Suggested Interpretation of the Determinants and Directions of Technical Change." Research Policy 11, no. 3 (1982): 147-162.
Dourish, Paul (2001). Where the Action Is: The Foundations of Embodied Interaction. Cambridge, MA: The MIT Press.
Fabbri, Franco (1981). "A Theory of Musical Genres: Two Applications." In Popular Music Perspectives, edited by David Horn and Philip Tagg. Gothenburg and Exeter: International Association for the Study of Popular Music.
Fish, Stanley (1980). "How to Do Things with Austin and Searle." In Is There a Text in this Class? The Authority of Interpretative Communities. Cambridge, MA: Harvard University Press.
Geertz, Clifford (1973). The Interpretation of Cultures. New York: Basic Books.
Jeremijenko, Natalie (1992). "TITLE." Palo Alto, CA: Xerox PARC internal publication.
Latour, Bruno (1987). Science in Action. Cambridge, MA: Harvard University Press.
---. (writing as Jim Johnson) (1988). "Mixing Humans and Nonhumans Together: The Sociology of a Door-Closer." Social Problems 35, no.3 (1988): 298-310.
Minneman, Scott (1991). The Social Construction of Engineering Reality. Ph.D. Thesis, Stanford Department of Mechanical Engineering Dissertation, Stanford, CA.
Oswald, Laura (1996). "The Place and Space of Consumption in a Material World." Design Issues 12, no. 1 (Spring 1996).
Schegloff, E. (1982). "Discourse as an Interactional Achievement: Some Uses of `uh huh' and Other Things that come Between Sentences." In Georgetown University Round Table on Language and Linguistics: Analyzing Discourse Text and Talk, edited by Deborah Tannen. Washington, DC: Georgetown University Press.
Searle, J. (1972). "What is a Speech Act?" In Language and Social Context, edited by P.P. Giglioli. Baltimore: Penguin Books.
Shields, Rob (editor) (1992). Lifestyle Shopping: The Subject of Consumption. New York: Routledge.
Suchman, Lucy (1987). Plans and Situated Actions: The Problem of Human-Machine Communication. Cambridge: Cambridge University Press.
Tagg, Philip (1979). Kojak -- 50 Seconds of Television Music. Towards the Analysis of Affect in Popular Music. Göteborg, Sweden: Studies from the Department of Musicology, University of Gothenburg.
Turkle, Sherry (1984). The Second Self. New York: Simon and Schuster.
Willis, Susan (1991). Primer for Daily Life. New York: Routledge.
Zuboff, Shoshana (1984). In the Age of the Smart Machine: The Future of Work and Power. New York: Basic Books.