Head-Related Transfer Functions (HRTFs) Explained
Head-related transfer functions (HRTFs) describe the spatial filtering of acoustic signals by a listener’s anatomy. They describe how an incoming sound wave (parameterized by frequency and source location) is filtered by the diffraction and reflection properties of the head, pinna, and torso before the sound reaches the transduction machinery of the eardrum and inner ear. With increasing computational power, HRTFs are now widely used for spatialised headphone playback of 3D sounds, enabling personalised binaural audio. If properly measured and implemented, HRTFs can generate a “virtual acoustic environment”.
The study of HRTFs is a rapidly growing area with potential uses in virtual environments, auditory displays, the entertainment industry, human-computer interfaces for the visually impaired, aircraft warning systems and many others.
When a sound is made, it travels outward from the source in every direction as a sound wave, like a rapidly expanding sphere. As the sound strikes the listener, the size and shape of the head, ears, and ear canal, the density of the head, and the size and shape of the nasal and oral cavities all transform the sound and affect how it is perceived, boosting some frequencies and attenuating others. The head and ears diffract and reflect sound in a unique and consistent way on its path to the eardrums. The brain uses this filtering profile to localize sounds.
The head, torso, shoulders and the outer ears modify the sound arriving at a person’s ears. This modification can be described by a complex response function, the head-related transfer function (HRTF).
Linear systems analysis defines the transfer function as the complex ratio between the output signal spectrum and the input signal spectrum as a function of frequency. Blauert (1974; cited in Blauert, 1981) initially defined the transfer function as the free-field transfer function (FFTF). Other terms include free-field to eardrum transfer function and the pressure transformation from the free-field to the eardrum.
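The definition of a transfer function as the complex ratio of output spectrum to input spectrum can be illustrated with a short sketch. It assumes NumPy and a made-up three-tap filter standing in for the acoustic path; the function name estimate_transfer_function and the test signal are hypothetical and not taken from any HRTF toolbox.

```python
import numpy as np

def estimate_transfer_function(x, y):
    """Return H(f) = Y(f) / X(f) for an input x and its filtered output y."""
    n = len(y)                      # zero-pad x to the output length
    X = np.fft.rfft(x, n=n)         # input spectrum
    Y = np.fft.rfft(y, n=n)         # output spectrum
    return Y / X                    # complex ratio, one value per frequency bin

# Hypothetical test case: white noise passed through a toy 3-tap filter.
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
h = np.array([1.0, 0.5, 0.25])      # stand-in for the acoustic path
y = np.convolve(x, h)               # linear convolution (length 1026)

H_est = estimate_transfer_function(x, y)
H_true = np.fft.rfft(h, n=len(y))   # spectrum of the known filter
```

In this idealised, noise-free case the estimated ratio reproduces the spectrum of the filter; real measurements would need averaging and regularisation wherever the input spectrum is weak.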
The HRTF can also be described as the modifications to a sound from a direction in free air to the sound as it arrives at the eardrum. These modifications depend on the shape of the listener's outer ear, the shape of the listener's head and body, the acoustic characteristics of the space in which the sound is played, and so on.
The HRTF varies with the azimuth and elevation of the incoming sound. It is unique to each listener because everyone’s ears have a different shape. The brain determines the position of the sound source based on the differences between the signals at the left and right ears.
A pair of HRTFs for two ears can be used to synthesize a binaural sound that seems to come from a particular point in space. Each HRTF is a transfer function describing how a sound from a specific point will arrive at the ear (generally at the outer end of the auditory canal). Some consumer home entertainment products designed to reproduce surround sound from stereo (two-speaker) headphones use HRTFs.
Humans have just two ears, but can locate sounds in three dimensions: in range (distance), in direction above and below (elevation), in front and to the rear, as well as to either side (azimuth). This is possible because the brain, inner ear, and the external ears (pinnae) work together to make inferences about location.
Humans estimate the location of a source by taking cues derived from one ear (monaural cues), and by comparing cues received at both ears (difference cues or binaural cues). Among the difference cues are time differences of arrival and intensity differences. The monaural cues come from the interaction between the sound source and the human anatomy, in which the original source sound is modified before it enters the ear canal for processing by the auditory system.
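The two difference cues named above, the time difference of arrival and the intensity difference, can be estimated from a pair of ear signals. The following is a minimal sketch assuming NumPy, equal-length signals, and a 48 kHz sampling rate; the function name and the synthetic test signals are hypothetical, and cross-correlation is only one of several possible estimators.

```python
import numpy as np

def estimate_itd_ild(left, right, fs):
    """Return (ITD in seconds, ILD in dB) for two equal-length ear signals.

    A positive ITD means the right-ear signal lags the left-ear signal;
    a positive ILD means the left-ear signal is louder.
    """
    # Peak of the cross-correlation gives the lag of right relative to left.
    corr = np.correlate(right, left, mode="full")
    lag = int(np.argmax(corr)) - (len(left) - 1)
    itd = lag / fs

    # Level difference from the ratio of RMS values.
    rms_left = np.sqrt(np.mean(np.square(left)))
    rms_right = np.sqrt(np.mean(np.square(right)))
    ild = 20.0 * np.log10(rms_left / rms_right)
    return itd, ild

# Synthetic example (assumed values): the right ear receives a copy of the
# left-ear signal delayed by 20 samples and attenuated by a factor of two.
fs = 48_000
rng = np.random.default_rng(0)
left = rng.standard_normal(2048)
delay = 20
right = 0.5 * np.concatenate([np.zeros(delay), left[:-delay]])

itd, ild = estimate_itd_ild(left, right, fs)
```

With these synthetic signals the estimator recovers the imposed 20-sample delay and a level difference of roughly 6 dB, which corresponds to a source on the listener's left.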
These modifications encode the source location and may be captured via an impulse response which relates the source location and the ear location. This impulse response is termed the head-related impulse response (HRIR). Convolution of an arbitrary source sound with the HRIR converts the sound to that which would have been heard by the listener if it had been played at the source location, with the listener's ear at the receiver location.
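The convolution step described above can be sketched in a few lines. This assumes NumPy and uses toy HRIRs (a single impulse per ear) in place of measured data; the function name binaural_render and all signal values are hypothetical.

```python
import numpy as np

def binaural_render(mono, hrir_left, hrir_right):
    """Convolve a mono signal with a left/right HRIR pair.

    Both HRIRs must have the same length. Returns an array of shape
    (len(mono) + len(hrir_left) - 1, 2) holding the left and right channels.
    """
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right], axis=-1)

# Toy HRIRs (assumed, not measured): the left ear receives the sound
# unchanged, the right ear receives it 20 samples later at half amplitude.
hrir_len = 64
hrir_left = np.zeros(hrir_len)
hrir_left[0] = 1.0
hrir_right = np.zeros(hrir_len)
hrir_right[20] = 0.5

rng = np.random.default_rng(0)
mono = rng.standard_normal(1000)
stereo = binaural_render(mono, hrir_left, hrir_right)
```

With measured HRIRs for a given direction, the same call would produce the two-channel signal intended for headphone playback; for long signals, FFT-based convolution is usually cheaper than the direct form used here.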
Coordinate Systems for HRTF Acquisition and Representation
There are several options for setting up a coordinate system that systematically describes directions for HRTFs. From the physical perspective, the spherical coordinate system is a natural choice; in that case, the origin of the system is placed inside the listener’s head at the midpoint between the left and right ears, and the direction is described by azimuth and elevation angles, see Figure 1a. In this system, one can intuitively define the two main planes: the eye-level horizontal plane, i.e., all directions with the elevation angle of zero, and the median plane, i.e., all directions with the azimuth angle of zero. The eye-level horizontal plane is also called the Frankfurt plane and can be anatomically defined as the plane connecting the lowest part of the listener’s orbital cavity and the highest part of the bony ear canal (meatus acusticus externus osseus). This spherical coordinate system resembles a geodesic representation widely used in physics, with the poles located at the top and bottom.
An alternative system that is more relevant from the auditory perspective is given by the interaural-polar coordinate system. This system is shown in Figure 1b and can be constructed by rotating the poles of the spherical system to the interaural axis, i.e., the axis connecting the two ears. A sound direction is then described by the lateral angles (along the horizontal plane) and polar angles (along the median plane). The poles are then located on the left and right sides of the listener. This simple interaural-polar coordinate system was used in various psychoacoustic studies, e.g., [12, 13], and has the disadvantage that the lateral angle does not correspond to the azimuth angle.
Figure 1c shows the modified version of the interaural-polar coordinate system, which does not have this disadvantage. Here, the sign of the lateral angle is flipped, i.e., in the coordinate system, the positive lateral angles are used for sounds located on the left side of the listener. This transformation to a left-handed coordinate system has the advantage of having the lateral angle corresponding to the azimuth angle for all sources placed in the horizontal plane, and the polar angle corresponding to the elevation angle for all sources placed in the median plane. Thus, the modified interaural-polar coordinate system offers a better link between the psychoacoustic research and audio engineering. In that system, the lateral angle ranges from −90° (right ear) over 0° (front) to 90° (left ear), and the polar angle ranges from −90° (bottom) over 0° (front) and 90° (up) to 180° (back) and 270° (bottom again).
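The relation between the spherical system and the modified interaural-polar system can be sketched as a conversion function. This is an illustration under stated assumptions: azimuth is counted positive towards the listener's left, angles are in degrees, and the function name is hypothetical rather than taken from the SOFA toolbox or the AMT.

```python
import math

def spherical_to_interaural_polar(azimuth_deg, elevation_deg):
    """Convert (azimuth, elevation) to (lateral, polar) angles in degrees.

    Lateral angle: -90 (right ear) .. 0 (front) .. 90 (left ear).
    Polar angle:   -90 (bottom) .. 0 (front) .. 90 (up) .. 180 (back) .. 270.
    """
    azi = math.radians(azimuth_deg)
    ele = math.radians(elevation_deg)

    # Unit direction vector: x points to the front, y to the left, z up.
    x = math.cos(ele) * math.cos(azi)
    y = math.cos(ele) * math.sin(azi)
    z = math.sin(ele)

    # Lateral angle is measured from the median plane towards the left ear.
    lateral = math.degrees(math.asin(max(-1.0, min(1.0, y))))

    # Polar angle is measured around the interaural axis, starting at front.
    polar = math.degrees(math.atan2(z, x))
    if polar < -90.0:
        polar += 360.0          # map (-180, -90) onto (180, 270)
    return lateral, polar

front_left = spherical_to_interaural_polar(30.0, 0.0)    # horizontal plane
above_front = spherical_to_interaural_polar(0.0, 45.0)   # median plane
behind = spherical_to_interaural_polar(180.0, 0.0)       # directly behind
```

For a source in the frontal horizontal plane the lateral angle equals the azimuth, and for a frontal source in the median plane the polar angle equals the elevation, which is the correspondence described in the text.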

The understanding of these coordinate systems is important because state-of-the-art acquisitions and representations of HRTFs utilise those systems. For example, Figure 2 shows HRTFs along the Frankfurt and the median plane. These various coordinate systems are used in HRTF visualisation, in various HRTF-related software packages such as the SOFA toolbox [15], and in auditory modelling, e.g., the Auditory Modelling Toolbox (AMT) [16, 17].

HRTF Acquisition Methods
HRTF acquisition can be classified into three categories: acoustic measurement, numerical calculation, and personalisation [18].
Acoustic Measurement
The acoustic measurement is traditionally designed as the measurement of the impulse response between source and receiver in an anechoic or semianechoic chamber, describing the transmission path from a sound source to the ear [11, 19]. A comprehensive review of the established state-of-the-art acoustic techniques to measure HRTFs can be found in [20]. Thus, in this chapter, Section 3, we only briefly provide an overview of the traditional acoustic HRTF measurement approaches, highlight some of their differences and new trends and focus on the requirements for the acoustic measurement.
Numerical HRTF Calculation
Numerical HRTF calculation simulates the acoustic measurement by considering a 3D representation of the listener’s geometry and the positions of multiple external sound sources, for which the generated sound pressure at the entrance of the ear canal is calculated. This technique has become more popular and is the main focus of this chapter. To this end, in Section 4, we provide an overview of the principles of various numerical calculation approaches including a comparison of the mentioned methods.
Personalisation of HRTFs
Personalisation of HRTFs describes the process of adapting an existing set of generic data guided by listener-specific information, with the help of either an objective or a subjective personalisation method. The objective personalisation has been approached from two different domains: the geometric domain, in which listener-specific anthropometric data are measured and used to personalise a generic geometric model from which HRTFs are then simulated; or the spectral domain, in which a generic HRTF set is directly personalised based on listener-specific information.
Examples for personalisation approaches include utilising frequency scaling [21], parametric modelling of peaks and notches [22], active shape modelling (ASM) [23], principal component analysis (PCA) in both geometric [24] and spectral domains [25, 26, 27, 28, 29], multiple regression analysis [30], independent component analysis (ICA) [31], large deformation diffeomorphic metric mapping (LDDMM) [25, 32], local neighbourhood mapping [33], neural networks [34, 35, 36, 37, 38, 39, 40, 41] and linear combination of HRTFs [42]. Despite many efforts worldwide [43, 44, 45, 46], the link between the morphology and HRTFs is not fully understood yet, mostly because of the high dimensionality of the problem. Most recent tools for studying that link are rooted in aligning high-resolution pinna representations to target representations facilitated with parametric pinna models [47, 48].
In the subjective personalisation, listeners are confronted with several sets of HRTFs and an algorithm (usually based on the evaluation of localisation errors, i.e., the difference between perceived and actual sound-source location) adapts the HRTF sets aiming at converging at listener-specific HRTFs [9, 49]. For an educated guess for the initial sets, anthropometric data can be used to pre-scale the HRTF sets, or the HRTF sets can be pre-selected via psychoacoustic models [50]. Clustering of the HRTF sets can further improve the relevance and reduce the duration of the personalisation procedure [49, 51].
All these methods aim at providing a specific quality in terms of acoustic and psychoacoustic properties. In the following section, we describe the acoustic properties and psychoacoustic requirements for human HRTFs, both of which lay the base for HRTF acquisition.
Acoustic Properties and Psychoacoustic Requirements
In this section, we describe the acoustic properties of HRTFs and relate them to psychophysical properties of human hearing with the goal to derive the minimum requirements for sufficiently accurate HRTF acquisition by means of perception. We analyse spectral, temporal and spatial aspects of HRTFs and consider contributions of distinct parts of the human body to these aspects.
Humans can hear frequencies roughly between 20 Hz and 20 kHz, with frequencies at the lower end being perceived as vibrations or creaks, and with the upper end decreasing with age and duration of noise exposure [52]. From the psychoacoustic perspective, frequencies down to 90 Hz contribute to sound lateralisation, i.e., localisation on the interaural axis within the head [53], and up to 16 kHz to sound localisation, i.e., localisation outside the head [54], defining the smallest frequency range for the HRTF acquisition. Figure 2 shows the amplitude spectra of a binaural HRTF pair of two listeners.
For each listener, the left and right columns show HRTFs of the left and right ear, respectively. The top row shows the HRTFs along the median plane, i.e., for the lateral angle of zero, from the front, via up, to the back. The bottom row shows the HRTFs along the Frankfurt plane, i.e., the horizontal plane located at the eye level. Figure 2 demonstrates that HRTFs vary across ears, frequency, sound-source positions and listeners. The bottom panels emphasise the difference between the ipsilateral and contralateral ears, showing the dynamic range, especially for frequencies higher than 6 kHz.
Assuming the propagation medium is air and a sonic speed of 340 m/s, the human hearing frequency range translates to wavelengths approximately between 1.7 cm and 17 m, resulting in different body parts affecting HRTFs in different frequency regions. The reflections of the torso create spatial-frequency modulations in the range of up to 3 kHz [1]. This effect can be observed in the top row of Figure 2, in the form of elevation-dependent spectral modulations along the median plane [55, 56]. Another contribution comes from the head, which shadows frequencies above 1 kHz. This effect can be observed in both rows of Figure 2, with large changes in the spectra beginning at around 1 kHz [57]. A large contribution is that of the pinna: The resonances and reflections within the pinna geometry create spectral peaks and notches, respectively, in frequencies above 4 kHz [54]. This effect can be observed in the bottom row of Figure 2.
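The wavelength figures above follow from dividing the speed of sound by the frequency. A small sketch using the 340 m/s value stated in the text makes the arithmetic explicit; the function name is hypothetical, and the frequencies are the ones mentioned in this section.

```python
SPEED_OF_SOUND_M_S = 340.0  # value assumed in the text for air

def wavelength_m(frequency_hz):
    """Return the wavelength in metres for a frequency in hertz."""
    return SPEED_OF_SOUND_M_S / frequency_hz

# Limits of hearing plus the frequencies associated with head, torso and pinna.
frequencies_hz = [20, 1_000, 3_000, 4_000, 20_000]
wavelengths_m = {f: wavelength_m(f) for f in frequencies_hz}

for f, lam in wavelengths_m.items():
    print(f"{f:>6} Hz -> {lam * 100:8.1f} cm")
```

At 1 kHz the wavelength is 34 cm, which is larger than a typical head, and at 4 kHz it is 8.5 cm, closer to the scale of the pinna. This is consistent with the frequency regions attributed to each body part above.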
From the perceptual perspective, the quality of these HRTF spectral profiles is important in many processes involved in spatial hearing. For example, sound-localisation performance deteriorates when these spectral profiles are disturbed by means of introducing spectral ripples [58], reducing the number of frequency channels [59] or spectral smoothing [60]. From the acoustic perspective, these spectral profiles show modulation depths of up to 50 dB [11], defining the required dynamic range in the process of HRTF acquisition.
The temporal aspects of HRTF acquisition are shown in Figure 3 as the head-related impulse responses (HRIRs), i.e., HRTFs in the time domain, of the same listeners as in Figure 2. There are a few things to consider. First, the minimum length of the measurement is bounded by the length of the HRIRs. Their amplitude decays within the first 5 ms, setting the requirement for the room impulse response during the measurements [61]. After the 5 ms, the HRIRs decay below 50 dB, setting the requirement on the broadband signal-to-noise ratio (SNR) of the measurements. Further, because of the human sensitivity to interaural disparities, HRTF acquisition also requires an interaural temporal synchronisation.
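Because an HRIR is the time-domain counterpart of an HRTF, the amplitude spectrum can be obtained with a Fourier transform. The sketch below assumes NumPy, a 48 kHz sampling rate, and a synthetic decaying-noise signal in place of a measured HRIR; the names and values are hypothetical apart from the 5 ms duration taken from the text.

```python
import numpy as np

FS = 48_000               # assumed sampling rate in Hz
HRIR_DURATION_S = 0.005   # 5 ms decay time mentioned in the text

def hrir_to_hrtf_db(hrir, fs):
    """Return (frequencies in Hz, magnitude in dB) of an HRIR's spectrum."""
    spectrum = np.fft.rfft(hrir)
    freqs = np.fft.rfftfreq(len(hrir), d=1.0 / fs)
    # Floor the magnitude to avoid log10(0) in silent bins.
    magnitude_db = 20.0 * np.log10(np.maximum(np.abs(spectrum), 1e-12))
    return freqs, magnitude_db

# Number of samples needed to cover 5 ms at the assumed sampling rate.
n_samples = int(round(HRIR_DURATION_S * FS))

# Synthetic stand-in for an HRIR: exponentially decaying noise.
t = np.arange(n_samples) / FS
rng = np.random.default_rng(1)
toy_hrir = rng.standard_normal(n_samples) * np.exp(-t / 0.0005)

freqs, magnitude_db = hrir_to_hrtf_db(toy_hrir, FS)
```

At 48 kHz a 5 ms response spans 240 samples, which gives a frequency resolution of 200 Hz without zero-padding.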
HRTF in Modern Systems
Accumulation of HRTF data has made it possible for a computer program to infer an approximate HRTF from head geometry. Recordings processed with an HRTF that approximates the listener's own, as in computer gaming environments (see A3D, EAX, and OpenAL), can be heard through stereo headphones or speakers and are interpreted as if they contain sounds coming from all directions, rather than from just two points on either side of the head.
Windows 10 and later come with Microsoft Spatial Sound included, the same spatial audio framework used on Xbox One and HoloLens 2. On a Windows PC or an Xbox One, the framework can use several different downstream audio processors, including Windows Sonic for Headphones, Dolby Atmos, and DTS Headphone:X, to apply an HRTF.
Apple similarly offers Spatial Audio for its devices when used with headphones produced by Apple or Beats.
Linux is currently unable to directly process any of the proprietary spatial audio (surround plus dynamic objects) formats. SoundScape Renderer offers directional synthesis.[21] PulseAudio and PipeWire can each provide virtual surround (fixed-location channels) using an HRTF. Recent PipeWire versions are also able to provide dynamic spatial rendering using HRTFs,[22] although integration with applications is still in progress.
Problems with HRTF
Measuring HRTFs can be expensive. A typical setup requires an anechoic chamber and high-quality audio equipment such as speakers and headphones. To take this technology to the masses, generic HRTFs have been used, but they do not work as well as individualized HRTFs. Once measured, HRTFs are convolved with the sound to give it a direction. Depending on the size of these functions, the cost of computing equipment can rise significantly.
There is much to be learned about HRTFs. Even the most carefully taken measurements suffer from the “cones of confusion” and “inside the head” effects. Range cues are poorly understood. It is possible to add a room transfer function to give the effect of distance, but such filters are not flexible, i.e. one cannot obtain a “whisper” effect using the room transfer function.