the “ensoniment” of machines: voice interfaces and conversations with computers

In August, a young girl led her family in prayer over a meal at Chick-fil-A. While this is not an uncommon sight in Texas, she began the ritual with “Dear Alexa.” Earlier this year, Alexa, Amazon’s cloud-based voice service, began randomly laughing, unsettling many users. Eight years ago my friend John Benini began working on the world’s first AI “CogniToy” Dino with the support of IBM Watson. While working on the project, John informed me that yes, indeed, kids ask the green dinosaur anything.

The time of talking machines, and of talking to machines, is here. As software engineers, our paradigm of the user interface is irrevocably changed. Rendering our code for users has, with few exceptions, been a visual process: render data and add action/event listeners. This stands firmly upon centuries of belief that our world is ultimately one of sight; all other senses have been treated as primitive.

There has always been a heady audacity to the claim that vision is the social chart of modernity […] There is always more than one map for a territory, and sound provides a particular path through history.

Jonathan Sterne

In his groundbreaking work, The Audible Past: Cultural Origins of Sound Reproduction, Sterne constructs an alternative path through history. He calls this the “Ensoniment”. Like the Enlightenment, it is a process, running from roughly 1750 to 1930, in which the world came to be understood not through Rousseau’s vision-based metaphor of knowledge, where the mind moves from darkness into enlightenment, but as one where the ordered spheres of sound emerge victorious over noise.

A Medieval depiction of the “Celestial Spheres”, each ordered with music.

With the invention of recorded sound, music and human voices began a new era of disembodiment, furthering Sterne’s vision of an evolving sonic world with its own agency.

Jumping ahead a century from where Sterne leaves off, our era of silent discos, vibrating phone haptics and decrescendo-ing industrial noise continues to refine our sense of what is desirable sound. Furthermore, unlike the radio of a century ago, we expect to carry our devices around and be able to talk to them.

These inventions and expectations alter what it means to be a consumer, family-member, audience or employee. Do we talk or do we text? Do we sing to ourselves or do we put in headphones? Should we record or should we mute?


As product developers, we are seeing some interesting trends in market behavior. By 2020, 50% of all searches are projected to be done through voice. Digital music consumption is on the rise as well, with companies building on burgeoning and successful subscription models (Apple is doing especially well). People are spending as much on music streaming per annum as they did at the height of the CD boom in the 90s. Customers are not just listening to music, but consuming information (podcasting) as well, or instead.

The research gets more interesting when we consider that audio streams are beating out video by 1.5x.

We would rather check our email while listening to the latest episode of Serial or take notes while listening to music. With video you can’t do both. It takes all of your attention to consume that content the second it starts. The rise of audio is the freedom to listen and talk while moving or using your hands to do something else.

Gary Vee

Like the Golden Age of Radio a century ago, we have re-discovered the power of listening to each other sing, tell stories, share the news, read books (Orson Welles?) and sell products and services. The market potential is significant as 65% of listeners are likely to buy a product after hearing an ad in a podcast.

The Engineering

As engineers, the demand to develop sound searches and speaking machines will continue to grow, changing the very nature of front-end design.

Our own internal research shows that voice searches tend to be more action-based than text searches: voice search users show intent to act, rather than simply browsing with no intention of acting.


Given this, it is wise to design websites and applications around action-based questions. How do we iterate over data not just to visually display an exciting, comprehensive list for users and clients, but to pool information into the single best recommendation for a specific stated need? When someone asks for the best Cantonese food in the area, how do we decide which search result to present?
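One way to think about this shift is that a voice interface must collapse a ranked list into a single spoken answer. The sketch below is a minimal, hypothetical example: the result fields (`rating`, `reviews`, `miles`) and scoring weights are invented stand-ins, not a real ranking algorithm.

```python
# Hypothetical sketch: collapse search results into the one recommendation
# a voice assistant should speak aloud. Fields and weights are illustrative.

def best_recommendation(results):
    """Pick the single result to present for a spoken query."""
    def score(r):
        # Blend rating, review volume (capped), and distance penalty.
        return r["rating"] * 2.0 + min(r["reviews"], 500) / 500.0 - r["miles"] * 0.5
    return max(results, key=score)

restaurants = [
    {"name": "Golden Lotus", "rating": 4.6, "reviews": 812, "miles": 2.1},
    {"name": "Canton House", "rating": 4.8, "reviews": 95, "miles": 0.8},
    {"name": "Dim Sum Garden", "rating": 4.2, "reviews": 1400, "miles": 3.5},
]

top = best_recommendation(restaurants)
print(f"The best Cantonese food near you is {top['name']}.")
```

The key design difference from a visual list: ties and trade-offs must be resolved in code, because the user hears only one answer.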

Actually working with human speech presents many challenges, and the field that addresses them is Natural Language Processing.

Natural Language Processing is the technology used to aid computers to understand the human’s natural language.

Dr. Michael J. Garbade

Computers are programmed to do so through syntax analysis, entity recognition, sentiment analysis and content classification.

When designing voice and chatbot conversations with AI, it is wise to clearly present the device’s limitations to users while thinking carefully through edge cases. Resist meeting with the knowledge experts (marketing, product design, etc.) and writing copy right away; instead, work forward through the practical goals of the AI conversation, the data that can actually be pulled from the database, and what makes sense for users. It is not uncommon to hit an engineering hurdle that throws out months’ worth of work on language scripting.
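The advice above, state limitations up front and plan for edge cases, can be sketched as a tiny conversation flow. The supported intents and reply strings here are hypothetical; the point is the structure, not the copy.

```python
# Sketch of a conversation flow that states the assistant's limits up front
# and handles out-of-scope requests with an honest fallback. The intents
# and replies are hypothetical placeholders.

SUPPORTED = {
    "hours": "We're open 9am to 9pm, seven days a week.",
    "menu": "Today's specials are listed on the menu page.",
}

def greet():
    # State the device's limitations before the user starts guessing.
    return "Hi! I can answer questions about hours and the menu."

def respond(intent):
    # Edge case: anything outside the supported set restates what the bot
    # can do, rather than leaving the user at a dead end.
    return SUPPORTED.get(
        intent,
        "Sorry, I can't help with that yet. Try asking about hours or the menu.",
    )

print(greet())
print(respond("hours"))
print(respond("parking"))  # out-of-scope request hits the fallback
```

Enumerating the supported intents in one data structure also makes the greeting and the fallback message easy to keep in sync as scope changes.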


When learning to play a musical instrument, there is a Zen practice where musicians must make room inside themselves (literally and figuratively) for the instrument and its sound to reverberate. It is a dance between playing the instrument and it playing you. As we design our machines, we may consider doing the same. How have we gone inside our machines, and how have we made room for them inside ourselves?

While I do not believe the rise of Voice UI and sound in software engineering spells the end of visual rendering, we have yet to maximize the potential of haptics. How can information be conveyed through touch? As we design encompassing 3D experiences, how can the touch of a machine guide a dance lesson, keep us on track while we are following directions, or enable more intimate FaceTime conversations?

I predict that we will see a convergence of sight, sound and haptics in application design, and the winning companies will be those who can seamlessly yet fantastically rise to this challenge. How can we use our senses to take people further into our world, while at other times spiriting them very far away from it?
