
Understanding HRTF Audio Technology: How It Creates Immersive 3D Sound

The HRTF, or Head-Related Transfer Function, describes how our anatomy filters incoming sound; applied in audio processing, it makes sounds seem more natural and spatial, as if we were hearing them in a real acoustic space.

Humans have just two ears but can locate sounds in three dimensions - in range (distance), in direction above and below (elevation), in front and to the rear, as well as to either side (azimuth). This is possible because the brain, inner ear, and the external ears (pinna) work together to make inferences about location.

Humans estimate the location of a source by taking cues derived from one ear (monaural cues), and by comparing cues received at both ears (difference cues or binaural cues). Among the difference cues are time differences of arrival and intensity differences. The monaural cues come from the interaction between the sound source and the human anatomy, in which the original source sound is modified before it enters the ear canal for processing by the auditory system.

These modifications encode the source location and may be captured via an impulse response which relates the source location and the ear location. This impulse response is termed the head-related impulse response (HRIR). Convolution of an arbitrary source sound with the HRIR converts the sound to that which would have been heard by the listener if it had been played at the source location, with the listener's ear at the receiver location.
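As a rough illustration, this is what that convolution looks like in code. Here `source`, `hrir_left`, and `hrir_right` are placeholder NumPy arrays (a mono signal and a measured HRIR pair at the same sample rate), and the function name is ours, not part of any particular SDK:

```python
import numpy as np
from scipy.signal import fftconvolve

def binauralize(source, hrir_left, hrir_right):
    """Convolve a mono signal with the left/right HRIRs for one source position."""
    left = fftconvolve(source, hrir_left, mode="full")
    right = fftconvolve(source, hrir_right, mode="full")
    return np.stack([left, right], axis=-1)  # stereo output, shape (samples, 2)
```

Played back over headphones, the resulting stereo signal approximates what the listener would have heard from the original source location.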

When you hear a sound, various factors such as size and shape of head, ears, and ear canals transform the sound and affect how it is perceived, boosting some frequencies and attenuating others. Generally speaking, the HRTF boosts frequencies from 2-5 kHz with a primary resonance of +17 dB at 2,700 Hz.

A head-related transfer function (HRTF) is a response that characterizes how an ear receives a sound from a point in space. As sound strikes the listener, not only the size and shape of the head, ears, and ear canals, but also the density of the head and the size and shape of the nasal and oral cavities contribute to this filtering.

A pair of HRTFs for two ears can be used to synthesize a binaural sound that seems to come from a particular point in space. It is a transfer function, describing how a sound from a specific point will arrive at the ear (generally at the outer end of the auditory canal).

The HRTF can also be described as the modifications to a sound on its way from a direction in free air to the eardrum. These modifications depend on the shape of the listener's outer ear, the shape of the listener's head and body, the acoustic characteristics of the space in which the sound is played, and so on.

In the AES69-2015 standard, the Audio Engineering Society (AES) has defined the SOFA file format for storing spatially oriented acoustic data like head-related transfer functions (HRTFs).
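SOFA files are stored as netCDF-4/HDF5, so data in the common SimpleFreeFieldHRIR convention can be inspected with a generic HDF5 reader. A minimal sketch, assuming that convention (the file name is a placeholder):

```python
import h5py
import numpy as np

with h5py.File("subject.sofa", "r") as f:
    hrirs = np.array(f["Data.IR"])             # (measurements, receivers = 2 ears, samples)
    positions = np.array(f["SourcePosition"])  # (measurements, 3): azimuth, elevation, distance
    fs = float(np.ravel(f["Data.SamplingRate"])[0])

print(hrirs.shape, positions.shape, fs)
```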

The HRTF describes how a given sound wave input (parameterized as frequency and source location) is filtered by the diffraction and reflection properties of the head, pinna, and torso before the sound reaches the transduction machinery of the eardrum and inner ear (see auditory system). Linear systems analysis defines the transfer function as the complex ratio between the output signal spectrum and the input signal spectrum as a function of frequency. One method used to obtain the HRTF for a given source location is therefore to measure the head-related impulse response (HRIR), h(t), at the eardrum for an impulse δ(t) placed at the source.
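In code terms, that ratio is just a spectral division. A minimal sketch, where `x` (the signal fed to the source) and `y` (the signal recorded at the eardrum) are placeholders and `eps` is our own crude regularization:

```python
import numpy as np

def transfer_function(x, y, eps=1e-12):
    """Complex ratio of output to input spectra, H(f) = Y(f) / X(f)."""
    n = len(x) + len(y)  # zero-pad to avoid circular artifacts
    return np.fft.rfft(y, n) / (np.fft.rfft(x, n) + eps)

# For an ideal impulse input, x has a flat spectrum and H(f) reduces to the FFT of the HRIR h(t).
```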

Even when measured for a "dummy head" of idealized geometry, HRTFs are complicated functions of frequency and the three spatial variables. For distances greater than 1 m from the head, however, the HRTF can be said to attenuate inversely with range. It is this far-field HRTF, H(f, θ, φ), that has most often been measured.

HRTFs are typically measured in an anechoic chamber to minimize the influence of early reflections and reverberation on the measured response. HRTFs are measured at small increments of θ such as 15° or 30° in the horizontal plane, with interpolation used to synthesize HRTFs for arbitrary positions of θ.
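A naive sketch of that interpolation, blending the two measured HRIRs on either side of the requested azimuth; `grid_az` is assumed to be sorted in ascending order starting at 0°, and real systems interpolate more carefully (e.g. treating delay and magnitude separately):

```python
import numpy as np

def interpolate_hrir(az, grid_az, hrirs):
    """Linearly blend the two HRIRs measured on either side of azimuth `az` (degrees)."""
    az = az % 360.0
    grid = np.concatenate([grid_az, [grid_az[0] + 360.0]])  # close the circle
    bank = np.concatenate([hrirs, hrirs[:1]], axis=0)
    hi = np.searchsorted(grid, az, side="right")
    lo = hi - 1
    w = (az - grid[lo]) / (grid[hi] - grid[lo])
    return (1.0 - w) * bank[lo] + w * bank[hi]
```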

In order to maximize the signal-to-noise ratio (SNR) in a measured HRTF, it is important that the impulse being generated be of high volume. In practice, however, it can be difficult to generate impulses at high volumes and, if generated, they can be damaging to human ears, so it is more common for HRTFs to be directly calculated in the frequency domain using a frequency-swept sine wave or by using maximum length sequences.
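A sketch of the exponential (logarithmic) sine sweep commonly used for this purpose; the parameter values here are only illustrative:

```python
import numpy as np

def exp_sweep(f0, f1, duration, fs):
    """Exponential sine sweep from f0 to f1 Hz over `duration` seconds."""
    t = np.arange(int(duration * fs)) / fs
    k = np.log(f1 / f0)
    return np.sin(2.0 * np.pi * f0 * duration / k * (np.exp(t * k / duration) - 1.0))

sweep = exp_sweep(20.0, 20_000.0, 4.0, 48_000)
# The recorded response is then deconvolved with the sweep to recover the HRIR.
```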

The head-related transfer function is involved in resolving the cone of confusion: a set of source locations, lying on a cone centered on the interaural axis, for which the interaural time difference (ITD) and interaural level difference (ILD) are identical.

When a sound is received by the ear, it can either travel directly into the ear canal or be reflected off the pinna into the ear canal a fraction of a second later. The sound contains many frequencies, so many copies of the signal enter the ear at different times, depending on frequency (according to reflection, diffraction, and the interaction of high and low frequencies with the sizes of the structures of the ear).

These copies overlap each other, and during this, certain signals are enhanced (where the phases of the signals match) while other copies are canceled out (where the phases of the signal do not match). If another person's ears were substituted, the individual would not immediately be able to localize sound, as the patterns of enhancement and cancellation would be different from those patterns the person's auditory system is used to.

To assess this variation between one person's ears and another's, we can limit our perspective to the degrees of freedom of the head and its relation to the spatial domain. This eliminates tilt and other coordinate parameters that add complexity. For the purpose of calibration we are concerned only with the source direction level with the ears, i.e. a single degree of freedom.

Typically, sounds generated from headphones are perceived as originating from within the head. In the virtual auditory space, the headphones should be able to "externalize" the sound.

Let x1(t) represent an electrical signal driving a loudspeaker and y1(t) represent the signal received by a microphone at the listener's eardrum. Similarly, let x2(t) represent the electrical signal driving a headphone and y2(t) represent the microphone's response to that signal. The goal of the virtual auditory space is to choose x2(t) such that y2(t) = y1(t).

Expressed in the frequency domain, these signals are Y1 = X1·L·F·M and Y2 = X2·H·M, where L is the transfer function of the loudspeaker in the free field, F is the HRTF, M is the microphone transfer function, and H is the headphone-to-eardrum transfer function. Setting Y1 = Y2 and solving for X2 yields the desired filter: X2 = X1·(L·F)/H.

Therefore, theoretically, if x1(t) is passed through this filter and the resulting x2(t) is played on the headphones, it should produce the same signal at the eardrum. Since the filter applies only to a single ear, another one must be derived for the other ear.
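A minimal frequency-domain sketch of that filter, assuming `lf_ir` is a measured loudspeaker-to-eardrum impulse response (standing in for L·F) and `h_ir` the measured headphone-to-eardrum response (H); the regularization constant is our own crude guard against division by near-zero bins, not part of the theory:

```python
import numpy as np

def externalize(x1, lf_ir, h_ir, eps=1e-6):
    """Compute x2 so that headphone playback approximates the loudspeaker signal at the eardrum."""
    n = len(x1) + len(lf_ir) + len(h_ir)          # FFT length long enough to avoid wrap-around
    X1 = np.fft.rfft(x1, n)
    LF = np.fft.rfft(lf_ir, n)
    H = np.fft.rfft(h_ir, n)
    return np.fft.irfft(X1 * LF / (H + eps), n)   # X2 = X1 * (L*F) / H
```

The same computation is repeated with the other ear's responses to obtain the second filter.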

There is less reliable phase estimation in the very low part of the frequency band, and in the upper frequencies the phase response is affected by the features of the pinna. Earlier studies also show that the HRTF phase response is mostly linear and that listeners are insensitive to the details of the interaural phase spectrum as long as the interaural time delay (ITD) of the combined low-frequency part of the waveform is maintained.

A scaling factor can be expressed as a function of the anthropometric features. For example, a training set of N subjects would consider each HRTF phase and describe a single ITD scaling factor as the average delay of the group. This computed scaling factor can estimate the time delay as a function of direction and elevation for any given individual.

The HRTF phase can be described by the ITD scaling factor, which in turn is quantified by the anthropometric data of a given individual taken as the source of reference. The task reduces to estimating a sparse vector β that represents the subject's anthropometric features as a linear superposition of the anthropometric features from the training data (y' = β^T X), and then applying the same sparse vector directly to the scaling vector H. The minimization problem is solved using the least absolute shrinkage and selection operator (LASSO).

The HRTFs for each subject are described by a tensor of size D × K, where D is the number of HRTF directions and K is the number of frequency bins. All the HRTFs of the training set are stacked in a tensor H ∈ R^(N×D×K), so the value H_n,d,k corresponds to the k-th frequency bin of the d-th HRTF direction of the n-th person.
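A sketch of how the sparse weights might be obtained with an off-the-shelf LASSO solver; `X_train` (N subjects × F anthropometric features), `y_new` (the new subject's features), and `scale_train` (the per-subject ITD scaling factors) are placeholders, and the regularization strength is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.linear_model import Lasso

def estimate_itd_scaling(X_train, y_new, scale_train, alpha=0.01):
    """Solve y' ≈ X^T beta with a sparsity penalty, then reuse beta on the scaling factors."""
    lasso = Lasso(alpha=alpha, fit_intercept=False)
    lasso.fit(X_train.T, y_new)          # columns are training subjects
    beta = lasso.coef_                   # sparse weights over subjects
    return float(beta @ scale_train)     # estimated ITD scaling factor for the new subject
```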

Accumulation of HRTF data has made it possible for a computer program to infer an approximate HRTF from head geometry.

Recordings processed via an HRTF, such as in a computer gaming environment (see A3D, EAX, and OpenAL), which approximates the HRTF of the listener, can be heard through stereo headphones or speakers and interpreted as if they comprise sounds coming from all directions, rather than just two points on either side of the head.

Windows 10 and above come with Microsoft Spatial Sound included, the same spatial audio framework used on Xbox One and HoloLens 2. On a Windows PC or an Xbox One, the framework can use several different downstream audio processors, including Windows Sonic for Headphones, Dolby Atmos, and DTS Headphone:X, to apply an HRTF.

Apple similarly has Spatial Audio for its devices when used with headphones produced by Apple or Beats. Linux is currently unable to directly process any of the proprietary spatial audio (surround plus dynamic objects) formats. SoundScape Renderer offers directional synthesis. PulseAudio and PipeWire can each provide virtual surround (fixed-location channels) using an HRTF. Recent PipeWire versions are also able to provide dynamic spatial rendering using HRTFs, though integration with applications is still in progress.

Some consumer home entertainment products designed to reproduce surround sound from stereo (two-speaker) headphones use HRTFs.

How HRTF Creates 3D Audio

Near-Field Rendering

This is the first article in a series reviewing new functionality in the Audio SDK. It is a high-level overview of our near-field rendering tech.

Binaural 3D audio works by applying to a sound a unique filter for each ear based on the 3D position of the sound source. The term “filter” can be used to describe very different things from simple EQ all the way to complex reverberation.

Just as a reverberation filter captures in its binaural impulse response (IR) all the ways a sound can interact with the surrounding environment on its way to the listener’s ears, a binaural spatialization filter captures all the ways a sound can interact with the listener's body on its way to the ears.

In the reverberation case, the IRs are much longer and more chaotic due to the size and complexity of the environment. We've been taking advantage of this for years to approximate an environment with a single binaural reverb IR, because beyond the first few bounces, spatialization is buried in a fading chaos that we perceive unconsciously as a diffuse connection to our surrounding environment.

In the binaural 3D spatialization case, the IRs are tiny, but extremely directional. Beyond a few feet, the IRs don’t change much with distance. We’ve been taking advantage of this to make another approximation of 3D audio that is independent of distance and that we call “far-field”.

Our HRTF database is captured/sampled around the head as a grid on a sphere rather than a volume. We’re spatializing along azimuth and elevation angles, but not distance.

Distance is addressed by separate, dedicated modeling (sketched in code after the list):

  • rolloff attenuation curves
  • medium absorption filtering
  • wet/dry balance
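A toy sketch of those three stages applied on top of the spatialized signal; the constants and curve shapes here are illustrative assumptions, not the SDK's actual tuning:

```python
import numpy as np
from scipy.signal import butter, lfilter

def apply_distance(dry, wet, distance_m, fs):
    """dry: spatialized direct signal; wet: reverb return; distance_m: source distance."""
    gain = 1.0 / max(distance_m, 1.0)                          # rolloff attenuation curve
    cutoff = float(np.clip(20_000.0 / max(distance_m, 1.0), 2_000.0, 20_000.0))
    cutoff = min(cutoff, 0.45 * fs)                            # keep below Nyquist
    b, a = butter(1, cutoff / (fs / 2.0), btype="low")         # crude medium absorption
    direct = lfilter(b, a, dry) * gain
    wet_amount = float(np.clip(distance_m / 10.0, 0.0, 1.0))   # wet/dry balance vs. distance
    return (1.0 - wet_amount) * direct + wet_amount * wet
```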

Near-field rendering begins with the acknowledgement that this model doesn't work as well when the sound's distance from the listener shrinks to the point of being comparable to the size of the human head. In that case, spatialization and distance modeling become closely intertwined and are better synthesized from an ear-centric, rather than a head-centric, spatial reference.

In far-field, the center of the world is the center of our head. In near-field, the center of the world is the ear canal entrance, and we have two of them, which makes near-field even more “binaural” in some way than far-field. The Near-field distance (radius of the Near-field sphere around the listener’s head) is commonly defined as ~0.5 - 1.0 m (~3 feet, “within arm's reach”).

A logical evolution from our current far-field HRTF tech would be extending it to near-field by adding more filter samples to the database (the red dots in the figure below) to fill up the entire near-field sphere volume all the way to the head boundary:

(Figure: near-field sphere volume)

This will likely come down the line from R&D, but will take more resources. In the meantime, just like for the reverberation and far-field spatialization cases, we're looking for a perceptual approximation that runs fast on hardware with limited resources.

So, what's special about near-field audio?

For our approximation to work, we first have to identify the main perceptual cues of near-field rendering:

  • getting closer means louder with the inverse-square law in free field
  • but the loudness increase is mainly expressed as ILD (Interaural Level Difference): the head interferes with propagation, and a sound can be much closer to one ear than the other, generating much higher ILDs than in the far field.

Also worth noting is the absence of near-field-specific ITD (Interaural Time Difference) cues: proximity itself does not affect the timing differences between the ears in a perceivable way. Sound sources moving in close proximity do, however, sweep through wider ITD and ILD variations than sources moving similarly but farther away (remember that pesky mosquito!).
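A quick back-of-the-envelope check of how much ILD the distance difference alone produces (an ~18 cm ear-to-ear path length is an illustrative assumption; head shadowing adds more on top):

```python
import math

def distance_only_ild_db(dist_to_near_ear_m, ear_spacing_m=0.18):
    far = dist_to_near_ear_m + ear_spacing_m
    return 20.0 * math.log10(far / dist_to_near_ear_m)

print(distance_only_ild_db(0.15))  # ~6.9 dB for a source 15 cm from the near ear
print(distance_only_ild_db(2.00))  # ~0.8 dB for the same offset two metres away
```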

Near-field rendering model:

(Figure: near-field rendering model)

az: azimuth angle
el: elevation angle
d: sound distance to the listener
a: head diameter

The first step takes our far-field HRTF database (as usual) but re-interprets it geometrically from each ear rather than from the head center.
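A sketch of that geometric re-interpretation: compute a separate azimuth, elevation, and distance from each ear instead of from the head center. The ear positions on the interaural axis and the axis convention are simplifying assumptions:

```python
import numpy as np

def per_ear_lookup(source_pos, head_radius_m=0.0875):
    """source_pos: (x, y, z) in head-centered metres; +x toward the right ear, +y forward, +z up."""
    ears = {"left": np.array([-head_radius_m, 0.0, 0.0]),
            "right": np.array([+head_radius_m, 0.0, 0.0])}
    lookups = {}
    for side, ear in ears.items():
        v = source_pos - ear
        d = np.linalg.norm(v)
        az = np.degrees(np.arctan2(v[0], v[1]))   # azimuth measured from straight ahead
        el = np.degrees(np.arcsin(v[2] / d))
        lookups[side] = (az, el, d)               # use these to index the far-field HRTF set
    return lookups
```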

The next step is convolving our source signal as usual, but now with the near-field HRTF we just built. At this point, we've compensated for the directional error in our HRTF lookup, so the spatialization is more accurate, but the result still sounds "far" because we're using a far-field HRTF.

Finally, we apply in real-time the physical modeling of the head shadowing effect. The key physical phenomenon at play here is acoustic diffraction: the bending of waves around rigid obstacles like the head.

This phenomenon is frequency dependent:

  • low frequencies can bend around an obstacle
  • high frequencies cannot
  • the cutoff frequency depends on the size of the obstacle

It can be thought of as a binaural (each ear will get a different filtering effect) directional lowpass filter with a cutoff frequency directly related to the head size, the azimuth and elevation angles.
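A rough sketch of such a filter, in the spirit of a first-order spherical-head (Brown-Duda style) shadowing model; the angle-to-gain mapping and constants are our own illustrative choices:

```python
import numpy as np
from scipy.signal import lfilter

SPEED_OF_SOUND = 343.0  # m/s

def head_shadow(signal, theta_deg, fs, head_radius_m=0.0875):
    """theta_deg is the angle between the source direction and this ear (0 = facing the ear)."""
    w0 = 2.0 * SPEED_OF_SOUND / head_radius_m              # corner frequency set by head size (rad/s)
    alpha = 1.05 + 0.95 * np.cos(np.radians(theta_deg))    # high-frequency gain: boost near, cut shadowed
    b = np.array([w0 + 2.0 * alpha * fs, w0 - 2.0 * alpha * fs])
    a = np.array([w0 + 2.0 * fs, w0 - 2.0 * fs])
    return lfilter(b / a[0], a / a[0], signal)             # first-order shelf, unity gain at DC
```

Each ear gets its own angle, and thus its own filter, which is what makes the effect binaural.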

Some systems try to apply HRTF processing using a generic HRTF, for example one measured with a dummy head. This gives varying results, depending (among other things) on how close your own HRTF is to that average and how sensitive your brain is to errors in the HRTF.

So when simulating a multichannel loudspeaker system over headphones, you can start with a multichannel source (instead of a stereo source) that is binauralized, which is effectively a sophisticated downmix to stereo. When done correctly, this can compete with a real multichannel loudspeaker system. It may not be 100% equal, but it can come very close, and there can be some advantages.

Now, if a specific game has its own "HRTF for stereo headphones" and also supports multichannel speakers, chances are that the former cannot be personalized to your own HRTF and doesn't give optimal results. In that case, using multichannel speakers is probably better.