There are several ways to implement immersive audio, each with different computational and hardware needs. Common to most of them is calculation of the head-related transfer function (HRTF).
This FAQ begins by detailing what an HRTF is, looks at an open-source program for calculating HRTF, and then considers how cameras and inertial measurement units (IMUs) can be used to develop personalized HRTFs for augmented and virtual reality (AR/VR) environments.
An HRTF characterizes how an ear receives sound from a point in space. It’s a complex function that includes effects from the shape of the head (the so-called head shadow), the outer ear (pinna), the ear canal, the size and shape of the nasal and oral cavities, and other factors. HRTFs arise from a scattering process, so they can be numerically calculated by simulating how the sound field is scattered by the listener’s anatomy. An HRTF varies significantly from person to person, and it is what enables a listener to determine the point in space from which a sound originated. Calculating the HRTF accurately is therefore central to developing immersive audio systems.
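To make this concrete, here is a minimal sketch of how an HRTF is applied in practice: a mono source is convolved with the left- and right-ear head-related impulse responses (HRIRs, the time-domain form of the HRTF) for one direction. The stand-in HRIRs below are placeholders that crudely mimic interaural time and level differences; real HRIRs come from measurement or simulation.

```python
import numpy as np
from scipy.signal import fftconvolve

fs = 48_000
mono = np.random.randn(fs)  # 1 s of noise as a stand-in source signal

# Placeholder HRIRs: a pure delay plus attenuation mimics a source off to
# the left (sound arrives later and quieter at the far ear).
hrir_left = np.zeros(256)
hrir_left[5] = 1.0
hrir_right = np.zeros(256)
hrir_right[25] = 0.5

# Filtering the same source with the two ear-specific responses reproduces
# the interaural cues that let a listener localize the sound.
binaural = np.stack([
    fftconvolve(mono, hrir_left),
    fftconvolve(mono, hrir_right),
], axis=1)
```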
MATLAB, Python, and C++
Developers of immersive audio systems can turn to Mesh2HRTF 1.x, an open-source, fully scriptable, end-to-end pipeline for calculating HRTFs. The calculations start from 3D meshes of the listener’s head, pinna, and torso and are performed with a boundary-element method accelerated by multi-level fast multipole calculations. Input meshes can be supplied in common file formats such as PLY, OBJ, and STL.
The program can simulate the sound field and calculate HRTFs across the entire audible frequency range using head and pinna meshes with over 100,000 triangular elements (Figure 1). The numerical core of Mesh2HRTF is written in C++, while the program is designed to be driven from MATLAB and Python environments. The simulation computer (Linux, Mac, or Windows) should have at least 16 GB of RAM, although 32 GB or more is highly recommended. Mesh2HRTF 1.x code and releases are available on GitHub, and tutorials are available at Mesh2HRTF.org.
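Mesh2HRTF stores its results in the standardized SOFA (AES69) format. The sketch below shows one way those results might be inspected from Python; the output path is a hypothetical placeholder, and sofar is just one of several packages that read SOFA files.

```python
import sofar

# Hypothetical output file name; Mesh2HRTF writes its HRTFs/HRIRs as SOFA files.
sofa = sofar.read_sofa("Output2HRTF/HRIR.sofa")

hrirs = sofa.Data_IR             # shape: (measurements, ears, samples)
fs = sofa.Data_SamplingRate      # sampling rate in Hz
positions = sofa.SourcePosition  # source directions (azimuth, elevation, radius)
print(hrirs.shape, fs)
```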
Using computer vision to optimize sound
HRTF personalization is highly desirable for maximum performance. One technology available for personalization is the video camera. In one implementation, a camera feeding a computer vision algorithm captures the 3D details of the environment, such as the shape of the room and the geometry and sizes of various surfaces and objects. That information enables filters to be designed that cancel the environment’s undesired impact on the sound field.
The same vision system can measure the listener’s body structure and position, enabling the calculation of a personalized HRTF. Because the video capture is continuous, the listener can move around the room while the HRTF and environmental filters are updated in real time, as sketched below.
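A simplified sketch of that real-time update idea: each time the vision system reports a new listener pose, the source direction is recomputed in head coordinates and the nearest measured HRIR pair is swapped in. The function and variable names here are illustrative assumptions, not part of any specific product.

```python
import numpy as np

def nearest_hrir_index(direction, hrir_directions):
    """Pick the measured HRIR whose direction is closest to the target.

    direction: unit vector from head to source, in head coordinates.
    hrir_directions: (M, 3) array of unit vectors for the measured set.
    """
    return int(np.argmax(hrir_directions @ direction))

def update_filters(head_pos, head_rot, source_pos, hrir_directions, hrirs):
    # head_rot maps head coordinates to world coordinates, so its transpose
    # rotates the listener-to-source vector into head coordinates.
    v = head_rot.T @ (source_pos - head_pos)
    v /= np.linalg.norm(v)
    idx = nearest_hrir_index(v, hrir_directions)
    return hrirs[idx]  # (2, N) left/right impulse responses for the convolver
```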
One development platform pairs a multi-core digital signal processor (DSP) with a time-of-flight (ToF) imager that can capture depth maps with millimeter resolution and is designed to work effectively across a range of lighting conditions. The DSP provides sufficient computing resources to run audio decoding on one core, with post-processing and personalization on the second core, and it also includes an Arm Cortex-A55 processor for general control and processing needs. The DSP decodes the encoded object-based and ambisonics audio, personalizes the room filtering and HRTF processing, and performs crosstalk cancellation (Figure 2).
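Crosstalk cancellation for speaker playback can be illustrated with a toy example: for each frequency bin, the 2x2 matrix of speaker-to-ear transfer functions is inverted so the left binaural channel reaches only the left ear and the right channel only the right ear. The transfer functions below are placeholders; a real system would derive them from the tracked head position.

```python
import numpy as np

n_bins = 512
# H[k] maps speaker signals to ear signals at frequency bin k:
# [ear_L, ear_R]^T = H[k] @ [spk_L, spk_R]^T
H = np.empty((n_bins, 2, 2), dtype=complex)
H[:, 0, 0] = H[:, 1, 1] = 1.0                # direct paths (placeholder)
H[:, 0, 1] = H[:, 1, 0] = 0.3 * np.exp(      # crosstalk: attenuated, delayed
    -1j * 2 * np.pi * np.arange(n_bins) * 8 / 1024)

# Regularized inversion keeps the filters bounded where H is ill-conditioned.
beta = 1e-2
Hh = np.conj(np.transpose(H, (0, 2, 1)))
C = np.linalg.inv(Hh @ H + beta * np.eye(2)) @ Hh

binaural = np.random.randn(n_bins, 2) + 0j   # spectra of the 2-channel signal
speaker_feeds = np.einsum("kij,kj->ki", C, binaural)
```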
Adding IMUs
In another development effort, IMUs were installed inside earbuds for head tracking. The head-tracking data was used to develop personalized filtering for a location- and position-dependent HRTF; the same technique could be applied to VR headsets or smart glasses. A face-pose detector based on the Google MediaPipe library was added, using a forward-facing web camera to detect the 3DoF head orientation. The head tracking runs in under 10 ms on a MacBook Pro and is used to customize the HRTF.
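A rough sketch of the webcam face-pose idea: MediaPipe Face Mesh supplies 2D facial landmarks, and OpenCV’s solvePnP fits them to a generic 3D head model to recover a head orientation that can drive HRTF selection. The landmark indices and model coordinates below are approximate assumptions for illustration, not the published implementation.

```python
import cv2
import numpy as np
import mediapipe as mp

# Generic 3D head-model points (mm) and roughly corresponding face-mesh
# landmark indices; both are approximate assumptions.
MODEL = np.array([(0, 0, 0), (0, -63, -12), (-43, 33, -26),
                  (43, 33, -26), (-29, -29, -24), (29, -29, -24)], float)
IDS = [1, 152, 263, 33, 291, 61]  # nose, chin, eye corners, mouth corners

cap = cv2.VideoCapture(0)
ok, frame = cap.read()
cap.release()

if ok:
    h, w = frame.shape[:2]
    with mp.solutions.face_mesh.FaceMesh(max_num_faces=1) as mesh:
        res = mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if res.multi_face_landmarks:
        lm = res.multi_face_landmarks[0].landmark
        pts = np.array([(lm[i].x * w, lm[i].y * h) for i in IDS])
        K = np.array([[w, 0, w / 2], [0, w, h / 2], [0, 0, 1]], float)
        found, rvec, _ = cv2.solvePnP(MODEL, pts, K, None)
        if found:
            R, _ = cv2.Rodrigues(rvec)  # 3x3 head rotation matrix
            # Euler angles extracted from R (convention-dependent) would
            # feed the HRTF filter-selection stage.
            print("head rotation matrix:\n", R)
```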
Summary
The computational requirements of immersive audio are relatively modest. Open-source software for HRTF calculation is available that runs on a Linux, Mac, or Windows simulation computer, with adequate memory being the primary requirement. HRTF personalization systems have also been demonstrated using computer vision and IMUs.
References
Creating an Immersive Automotive Audio Experience with Higher Output Power and Class-H Control, Texas Instruments
Future of Immersive Audio: Sound Reproduction Assisted by Computer Vision, Analog Devices
Low-Cost Numerical Approximation of HRTFs: A Non-Linear Frequency Sampling Approach, Proceedings of the 26th International Conference on Digital Audio Effects
Recent Advances in an Open Software for Numerical HRTF Calculation, Journal of the Audio Engineering Society