Design Study
VOICE ASSISTANT
Exploring the effects of added sounds and images on voice assistants, using Amazon Alexa. The research study uses an online survey to investigate the effects of additional sounds and images on the voice interaction experience.
Service
Design Concept
Collaboration
University of Siegen
Master Thesis
Year
2021

About the Study
The study aimed to explore the effectiveness of combining different types of media, and to investigate the impact of various media conditions (such as background sound, images, and voice) on the user experience of Voice Assistants. The goal was to examine how media can support and enrich the user's experience.
For the study, 60 survey responses were collected through four online surveys that were identical except for the media condition presented. The surveys focused on activities commonly performed by Voice Assistant users, such as setting reminders or playing music.
Methodology
Phase 1
Tools and Materials
The first phase consisted of constructing two user scenarios based on activities frequently performed with a Voice Assistant: a weather report and a recipe step.
Phase 2
Data Collection
Primary data was collected through online surveys that tested four main conditions: voice, images, background sounds, and a combination of all three.
Phase 3
Participants and Recruitment
Fifteen responses per condition were individually and collectively interpreted to determine the effect that images and sounds had on the sound and user experience of the voice assistants. An equal number of responses were considered for each condition.
Phase 4
Evaluation and Results
The data was analyzed from both a qualitative and quantitative point of view, taking into account how images, voice, and background sounds influence users' interpretations.
Tools and Materials
Phase 1
The first phase involved constructing two user scenarios based on frequently performed activities with a Voice Assistant: receiving a weather report and following a recipe step. The scenarios were presented in the form of stories, describing hypothetical situations that involved everyday activities with the device.
A total of eight videos were created using either images or the original speech from an Amazon Echo Show, introducing one or several media conditions: voice, background sound, images, or a combination of these. New audio backgrounds were individually produced for the final survey application, which illustrated actions and surrounding sounds such as natural sounds, distant birds, and rain, in order to shape the atmosphere and provide additional information about the event. By imitating sounds from specific daytime activities and actions, realism was added to the scenarios, which in turn provided more information about the performed actions.
"Today you are preparing a sweet potato curry for your guest. The last time you prepared this dish they loved it, so you want to follow the exact recipe. To make sure everything goes according to plan, you decide to ask your Voice Assistant for the recipe steps as follows:"
Recipe Step Scenario
Media Condition: Voice + Images + Background Sounds
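The stimulus set described above (two scenarios, each rendered under four media conditions) can be enumerated with a short sketch. The identifiers below are illustrative labels taken from the study description, not the file names or tooling used in the thesis.

```python
from itertools import product

# Labels follow the study text; the exact identifiers are assumptions.
SCENARIOS = ["weather_report", "recipe_step"]
CONDITIONS = ["voice", "images", "background_sounds", "combined"]


def build_video_plan(scenarios, conditions):
    """Return one entry per stimulus video: every scenario under every condition."""
    return [
        {"scenario": scenario, "condition": condition}
        for scenario, condition in product(scenarios, conditions)
    ]


plan = build_video_plan(SCENARIOS, CONDITIONS)
print(len(plan))  # 2 scenarios x 4 conditions = 8 videos
```

Listing the pairs explicitly makes it easy to check that no scenario and condition combination is missing before the videos are produced.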
Data Collection
Phase 2
For this study, primary data was collected from online surveys that tested four conditions. The survey evaluated participants' information recall while simultaneously gathering their perceptions. Information recall focused on retrieving details from memory, while perception inquiry concentrated on participants' thoughts, emotions, and motivation about the overall events.


Survey Media Conditions
Primary data were collected from online surveys testing four conditions: voice, images, background sounds, and a combination of all three. After reading each scenario, participants were given a video that presented the scenario under one of the media conditions.

Survey Question Distribution
The distribution of question types included selection, polarity profile, and text input. The selection questions allowed participants to choose only one possible answer, while input questions gave users the opportunity to provide details and express their thoughts and feelings about the survey experience.
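As a rough illustration of the three question types, the sketch below models each one and checks whether an answer is acceptable. The class name, field names, and the 7-point scale length are assumptions made for the example; the study does not state how the survey tool represented its questions.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    prompt: str
    kind: str  # "selection", "polarity_profile", or "text_input"
    options: list = field(default_factory=list)  # only used by selection questions
    scale_points: int = 7  # assumed scale length for polarity profiles


def is_valid_answer(question, answer):
    """Check an answer against the rules of its question type."""
    if question.kind == "selection":
        # Selection questions accept exactly one of the listed options.
        return answer in question.options
    if question.kind == "polarity_profile":
        # Polarity profiles accept a whole-number position on the scale.
        return isinstance(answer, int) and 1 <= answer <= question.scale_points
    if question.kind == "text_input":
        # Text input accepts any non-empty free text.
        return isinstance(answer, str) and answer.strip() != ""
    raise ValueError(f"unknown question kind: {question.kind}")


# Hypothetical questions, written only for this example.
recall = Question("Which ingredient was mentioned?", "selection",
                  options=["sweet potato", "carrot", "pumpkin"])
rating = Question("unpleasant ... pleasant", "polarity_profile")
print(is_valid_answer(recall, "carrot"), is_valid_answer(rating, 9))
```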

Phase 3
Participants and Recruitment
Participants were randomly selected and recruited by sharing the survey link through digital platforms such as social media, forums, instant messages, and digital applications. Online surveys provided a flexible method to access a larger sample independent of time and location.
Participants completed the survey anonymously and did not receive incentives for their participation in the study. No prior knowledge, educational background, or demographic information was required.
Evaluation and Results
Phase 4
The evaluation aimed to use quantitative descriptive statistics to summarize data that described the relationship between variables. Qualitative evidence was also sought to provide qualifications and descriptions that support the interpretation of users' insights.
The selected methods involved using an instrument to measure the subjects' reception of complex sounds. This approach provided a systematic and objective way to evaluate the sounds and the experience. By contrasting the results of the four survey conditions, we were able to identify how adding sounds or images affected the accuracy of responses during activity completion.
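One simple way to contrast response accuracy across conditions is to compute the share of correct recall answers per condition. The sketch below assumes each response has already been coded as correct or incorrect; the function name and the toy records are placeholders for illustration and are not data from the study.

```python
from statistics import mean


def recall_accuracy_by_condition(responses):
    """Return the proportion of correct recall answers for each condition.

    `responses` is an iterable of (condition, is_correct) pairs.
    """
    grouped = {}
    for condition, is_correct in responses:
        grouped.setdefault(condition, []).append(1 if is_correct else 0)
    return {condition: mean(flags) for condition, flags in grouped.items()}


# Placeholder records, NOT study results.
toy_responses = [
    ("voice", True),
    ("voice", False),
    ("background_sounds", True),
    ("background_sounds", True),
]
accuracy = recall_accuracy_by_condition(toy_responses)
print(accuracy)
```

With 15 responses per condition, proportions like these are descriptive only; they summarize the sample rather than support strong statistical claims.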
Recipe Scenario
Media Condition Comparison Results
“I can envision myself in a rainy environment” (P-434)
Background sounds added dynamism to the weather scenario, contributing to the creation of a clearer mental image of the weather conditions mentioned in the video.
“It was too loud and the sounds of the voice and the background sounds collided which made it hard to understand” (P-264)
The added sounds sometimes interrupted the Voice Assistant's instructions, making them hard for participants to follow.
“I would add an animation to reinforce what is mentioned vs the visual part so that there is synchrony” (P-652)
Participants shared ideas and suggestions for new designs and potential uses.
What participants said
Phase 4
Results
The responses were interpreted considering participants' evaluations, opinions, and feelings about each scenario. The survey's quantitative data analysis focused on identifying characteristics, frequencies, and trends. Unlike the memory questions, the semantic differential questions aimed to understand how participants rated the sound and the experience related to the media played in the scenarios.
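Semantic differential (polarity profile) answers are commonly summarized as a mean rating per adjective pair and condition. The following is a minimal sketch of that summary, assuming numeric scale positions; the adjective pairs and scores are invented for the example and do not come from the survey.

```python
from collections import defaultdict
from statistics import mean


def polarity_profile(ratings):
    """Average the scale position for each (condition, adjective pair).

    `ratings` is an iterable of (condition, adjective_pair, score) tuples.
    """
    buckets = defaultdict(list)
    for condition, adjective_pair, score in ratings:
        buckets[(condition, adjective_pair)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


# Invented ratings on an assumed 1-7 scale, NOT study results.
toy_ratings = [
    ("voice", "unpleasant-pleasant", 3),
    ("voice", "unpleasant-pleasant", 5),
    ("combined", "unpleasant-pleasant", 6),
    ("combined", "calm-hectic", 2),
]
profile = polarity_profile(toy_ratings)
print(profile[("voice", "unpleasant-pleasant")])
```

Plotting these means per adjective pair gives the familiar profile line for each media condition, which makes differences between conditions easy to see.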
Divided Opinions
Participants reported experiencing difficulty with processing and focusing their attention while completing the activity.
Simple task
Participants considered the less demanding tasks optimal for implementing sounds and images to enrich the experience.
Event Recognition
Easy identification of the sound source facilitates sound and event recognition.
Background Sounds
Combining background sounds with voice improves information recall compared to the original media conditions used in voice assistants.
Voice + Sounds
The voice recording condition, in combination with background sounds, achieved a higher response rate across the four surveys.
Vivid Experience
Even though the background sounds were confusing and distracting, the environmental surroundings created an immersive and vivid experience.
Design Implications
This section provides an overview of some design concepts that will be investigated further, taking into account the inputs and ideas of participants collected from the survey responses.

Weather scenario example screen.
Alternative designs should aim for a modern minimalist approach. For example, they could integrate color blocking to describe weather characteristics. By using color backgrounds and simple graphical animations, designers can provide flexibility, increase dynamism, and create an aesthetically pleasing composition that supports the message.
Activity suggestions
Voice assistants could suggest activities based on the weather forecast. These suggestions might include outdoor activities, recommended clothing to wear, or simple reminders to bring an umbrella if it is raining. These activities could be customized to the user's individual preferences, adding value to their experience.


Recipe step scenario example screen.
Complex tasks performed with Voice Assistants demand an alternative approach to solutions. By introducing a cooking show metaphor, recipes could be simplified into several steps. This would address the users with short video instructions displaying basic information such as ingredient quantity, while avoiding distracting elements.
Limitations
Background sound volume must be reduced and strategically placed during interactions. Newly added sounds should not interfere with the voice or instructions. Prospective work should focus on enhancing speech quality by considering pauses, tones, and rhythm to avoid monotony in the voice assistants' responses.
Future research could continue to explore and confirm these initial discoveries by clearly defining when and where sounds need to be included to support the experience. Adaptations to the initial experiment are encouraged to better understand the implications. Additional testing in more realistic settings and contexts could strengthen the findings and prompt further discoveries.