
What Is… Ambisonics?

This post is part of a What Is… series that explains spatial audio techniques and terminology.

Ambisonics is a spatial audio system that has been around since the 1970s, with lots of the pioneering work done by Michael Gerzon. Interest in Ambisonics has waxed and waned over the decades but it is finding use in virtual, mixed and augmented reality because it has a number of useful mathematical properties. In this post you’ll find a brief summary of first-order Ambisonics, without going too deeply into the maths that underpins it.

Unlike channel-based systems (stereo, VBAP, etc.) Ambisonics works in two stages: encoding and decoding. The encoding stage converts the signals into B-format (spherical harmonics), which are agnostic of the loudspeaker arrangement. The decoding stage takes these signals and converts them to the loudspeaker signals needed to recreate the scene for the listeners.

Ambisonic Encoding

First-order Ambisonic encoding to SN3D for a sound source rotating in the horizontal plane. The Z channel is always zero for these source directions.

A mono signal can be encoded to Ambisonics B-format using the following equations:

\(\begin{eqnarray}
W &=& S \\
Y &=& S\sin\theta\cos\phi \\
Z &=& S\sin\phi\\
X &=& S\cos\theta\cos\phi
\end{eqnarray}
\)

where \(S\) is the signal being encoded, \(\theta\) is the azimuthal direction of the source and \(\phi\) is the elevation angle. (These equations use the semi-normalised 3D (SN3D) scheme, as in the AmbiX format used by Google for YouTube. The channel ordering also follows the AmbiX standard.) First-order Ambisonics can also be captured using a tetrahedral microphone array.

B-format is a representation of the sound field at a particular point. Each sound source is encoded and the W, X, Y and Z channels for each source are summed to give the complete sound field. Therefore, no matter how many sound sources are in the scene, only 4 channels are required for transmission.
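
As a concrete sketch, the encoding equations above can be written as a small NumPy function. The channel order follows AmbiX (W, Y, Z, X) with SN3D normalisation, as described above; the signals and source angles here are arbitrary examples:

```python
import numpy as np

def encode_sn3d(signal, azimuth, elevation):
    """Encode a mono signal to first-order B-format.

    Channel order follows AmbiX (W, Y, Z, X) with SN3D normalisation,
    matching the equations above. Angles are in radians.
    """
    return np.array([
        signal,                                        # W
        signal * np.sin(azimuth) * np.cos(elevation),  # Y
        signal * np.sin(elevation),                    # Z
        signal * np.cos(azimuth) * np.cos(elevation),  # X
    ])

# Two example sources in one scene: encode each, then sum the channels.
fs = 48000
t = np.arange(fs) / fs
front_left = encode_sn3d(np.sin(2 * np.pi * 440 * t), np.radians(30), 0.0)
right = encode_sn3d(np.sin(2 * np.pi * 220 * t), np.radians(-90), 0.0)
scene = front_left + right  # still only 4 channels, however many sources
```

However many sources are added, the transmitted scene stays at 4 channels.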

This encoding can be thought of as capturing a sound source using one omnidirectional microphone (W) and three figure-of-eight microphones pointing along the Cartesian x-, y- and z-axes. As shown in the animation to the side, the amplitude of the W channel stays constant for all source positions while the X and Y channels change in relative gain and sign (positive/negative) with source position. Comparing the polarity of X and Y with W allows the direction of the sound source to be derived.
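
This direction recovery can be sketched in a few lines of NumPy. Correlating Y and X against W preserves the sign information, so the full 360° azimuth can be recovered for a single horizontal source (the angles here are arbitrary examples):

```python
import numpy as np

def estimate_azimuth(w, y, x):
    """Estimate the azimuth of a single horizontal source from B-format.

    Correlating Y and X against W preserves their polarity, so
    atan2 recovers the full 360-degree direction.
    """
    return np.arctan2(np.dot(y, w), np.dot(x, w))

fs = 48000
t = np.arange(fs) / fs
s = np.sin(2 * np.pi * 440 * t)
theta = np.radians(120)            # encode a source at 120 degrees
w, y, x = s, s * np.sin(theta), s * np.cos(theta)
print(np.degrees(estimate_azimuth(w, y, x)))  # ≈ 120
```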

Ambisonic Decoding

Decoding is the process of taking the B-format signals and converting them to loudspeaker signals. Depending on the loudspeaker layout this can be relatively straightforward or really quite complex. In the simplest cases, with a perfectly regular layout, the B-format signals are sampled at the loudspeaker positions. Other methods (for example, mode-matching or energy-preserving decoders) can be used but tend to give the same results for a regular array.
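
As a rough sketch, a sampling decoder for a horizontal square array projects the B-format signals onto each loudspeaker direction. The weight of 2 on the first-order channels is one common convention for SN3D; a production decoder would refine these gains:

```python
import numpy as np

def sampling_decoder(speaker_azimuths):
    """Decode matrix for a regular horizontal array (sampling decoder).

    Rows are loudspeakers, columns are the B-format channels W, Y, X
    (Z is ignored for a horizontal-only layout). The first-order weight
    of 2 is one common SN3D convention -- a sketch, not a tuned decoder.
    """
    n = len(speaker_azimuths)
    return np.array(
        [[1.0, 2.0 * np.sin(az), 2.0 * np.cos(az)] for az in speaker_azimuths]
    ) / n

# Square layout and a single frontal source (azimuth 0): W=1, Y=0, X=1.
speakers = np.radians([45, 135, -135, -45])
D = sampling_decoder(speakers)
gains = D @ np.array([1.0, 0.0, 1.0])
# The front pair gets positive gain, the rear pair a small negative
# ("pulling") gain.
```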

Assuming a regular array and good decoder, an Ambisonic decoder will recreate the exact sound field up to approximately 400 to 700 Hz. Below this limit frequency the reproduction error is low and the ITD cues are well recreated, meaning the system can provide good localisation. Above this frequency the recreated sound field deviates from the intended physical sound field so some psychoacoustic optimisation is applied. This is realised by using a different decoder in higher frequency ranges that focusses the energy in as small a region as possible in the loudspeaker array. This helps produce better ILD cues and a more precise image.

Ambisonics differs from VBAP because, in most cases, all loudspeakers will be active for any particular source direction. Not only will the amplitude vary, the polarity of the loudspeaker signals will also matter. Ambisonics uses all of the loudspeakers to “push” and “pull” so that the correct sound field is recreated at the centre of the loudspeaker array.

What is a Good Decoder?

A “good” Ambisonic decoder requires an appropriate loudspeaker arrangement. Ambisonics ideally uses a regularly positioned loudspeaker arrangement. For example, a horizontal-only system will place the loudspeakers at regular intervals around the centre of an array.

Any number of loudspeakers can be used to decode the sound scene but using more than required can lead to colouration problems. The more loudspeakers are added, the stronger the low-pass filtering effect for listeners in the centre of the array. So what is the best number of loudspeakers to use for first-order Ambisonics? It is generally agreed that 4 loudspeakers placed in a square should be used for a horizontal system and 8 in a cuboid for 3D playback. This avoids too much colouration and satisfies several conditions for good (well… consistent) localisation.

There are metrics defined in the Ambisonics literature that predict the quality of the system in terms of localisation. These are the velocity and energy vectors and they are deserving of their own article. For now, it’s worth noting that the velocity vector links to low-frequency ITD localisation cues. Decoders are designed to optimise it at low frequencies, while at higher frequencies they are optimised using the energy vector. The high-frequency decoder is known as a ‘max rE’ decoder, so-called because it aims to maximise the magnitude of the energy vector metric. This is just another way of saying that the energy is focussed in as small an area as possible.
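
The two metrics themselves are simple weighted sums of the loudspeaker directions: rV weights each loudspeaker's unit vector by its (signed) gain, rE by its squared gain. A minimal sketch for a horizontal layout, with made-up gain values rather than those of any particular decoder:

```python
import numpy as np

def velocity_and_energy_vectors(gains, azimuths):
    """Gerzon's velocity (rV) and energy (rE) vectors for a horizontal array.

    rV weights each loudspeaker's unit vector by its gain (signed),
    rE by its squared gain (energy).
    """
    u = np.stack([np.cos(azimuths), np.sin(azimuths)], axis=1)
    rv = (gains @ u) / np.sum(gains)
    re = ((gains ** 2) @ u) / np.sum(gains ** 2)
    return rv, re

# Made-up gains for a frontal source on a square layout.
azimuths = np.radians([45, 135, -135, -45])
gains = np.array([0.6, -0.1, -0.1, 0.6])
rv, re = velocity_and_energy_vectors(gains, azimuths)
# |rV| close to 1 suggests good low-frequency (ITD) localisation;
# |rE| comes out lower, reflecting the energy spread of first order.
```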

Ambisonic Rotations

When it comes to virtual and augmented reality, efficient rotation of the full sound field to follow head movements is a big plus. Thankfully, Ambisonics has got us covered here. The full sound field can be rotated before decoding by blending the X, Y and Z channels correctly.

The advantage of rotating in the Ambisonic domain is that any number of sound sources are encoded in just 4 channels, meaning rotating a sound field with one sound source takes as much effort as rotating one with 100 sound sources.
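
For a rotation about the vertical (yaw) axis, the blending is just a 2×2 rotation of the X and Y channels; W and Z are untouched. A minimal sketch, using the AmbiX channel order from above:

```python
import numpy as np

def rotate_z(bformat, alpha):
    """Rotate a first-order AmbiX scene (W, Y, Z, X) by alpha radians
    about the vertical axis, e.g. to counter the listener's head yaw."""
    w, y, z, x = bformat
    y_rot = np.cos(alpha) * y + np.sin(alpha) * x
    x_rot = np.cos(alpha) * x - np.sin(alpha) * y
    return np.array([w, y_rot, z, x_rot])  # W and Z are unchanged

# A source encoded straight ahead (azimuth 0) rotated round to 90 degrees:
s = np.ones(64)                         # stand-in for any mono signal
scene = np.array([s, 0 * s, 0 * s, s])  # W=S, Y=0, Z=0, X=S
rotated = rotate_z(scene, np.radians(90))
# rotated now matches a source encoded directly at 90 degrees (the y-axis).
```

The same four multiplies and two adds are needed whether the scene holds one source or a hundred.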


That’s the basics of Ambisonics covered. At some point we’ll look more at measures of quality of Ambisonics decoders and how well ITD and ILD are recreated. This blog has also only covered first-order Ambisonics, but Higher Order Ambisonics (HOA) is likely to make its way to VR platforms in a significant way in the near future so I’ll cover that soon.

Do you have any spatial audio questions you’d like to have answered? Just leave a comment and let me know!


What Is… Spatial Audio?

This post is the first in a What Is… series. The idea is to explain different techniques, terminology and concepts related to spatial audio. This will range from the most common terms right through to some more obscure topics. And where better to start than with what “spatial audio” even means!

Spatial audio (with some exceptions) has generally been confined to academia but is rapidly finding applications in virtual reality (VR). There are even moves to bring it to broadcasting so it can be enjoyed by people in the comfort of their living rooms. As spatial audio moves from labs to living rooms it is worth exploring all of the different techniques that have been developed up to this point.

However, defining spatial audio can quickly become rather philosophical. For example, is a mono recording spatial audio? If I take a single microphone to a concert hall and record a performance then I have captured the sense of space, through echoes and reverberation, not just the performances themselves. This means that the space is encoded into the signal – we can tell if a recording is made in a dry studio or a cathedral. For the purposes of this series I will not be considering this to be spatial audio. Instead, I will be defining spatial audio as any audio encoding or rendering technique that allows for direction to be added to the source. How well this is reproduced to the listener will depend on the encoding and playback system but, in general, a spatial audio system will allow different sounds placed in different positions to be directionally differentiated.

There are a large number of different spatial audio techniques available and which one you want to use will depend on the final use. These techniques include (but are in no way limited to):

  • Stereophony
  • Vector Base Amplitude Panning (VBAP)
  • Ambisonics and Higher Order Ambisonics (HOA)
  • Binaural rendering (using HRTFs over headphones)
  • Wave Field Synthesis (WFS)
  • Loudspeaker diffusion
  • Discrete loudspeaker techniques

Each of these will be explained in more detail in future posts but you can see from this non-exhaustive list that there are already quite a few techniques to choose between. To further complicate things, some of these techniques can be combined in order to take advantage of different properties of both. For example, Ambisonics and binaural can be combined in VR and augmented reality (AR) to give a headphone-based rendering that can be easily rotated (a nice property of Ambisonics).

Spatial audio techniques can also be divided between those that aim to produce a physically accurate sound field in (at least some of) the listening area and those that are not concerned with matching a “real” sound field. HOA and WFS can both be used to recreate a holophonic sound scene using an array of loudspeakers. Meanwhile, stereo and VBAP do not recreate any target sound field but are still able to produce sounds in different directions.

Whether or not the spatial audio technique is physically-based or not, we also have to consider the potentially most important element in the whole chain: the listener! All of these techniques rely on how we perceive the sound and there are any number of confounding factors that can take our nicely defined (in a mathematical sense) system and throw many of our assumptions out the window. Therefore, this What Is… series will also include elements of spatial hearing and psychoacoustics that are essential to consider when working with spatial audio.

So, spatial audio can take a number of forms, each with its own advantages, disadvantages, limitations and creative possibilities. It is these, along with the technical and psychoacoustic underpinnings, that I will expand upon in upcoming blog posts.

If there are any aspects of spatial audio that you’d like to have explained then leave a comment below.


Last House Falls Into the Sea

I’ve added a new page here so you can listen to one of the Ambisonic mixes I have done. The segment on the page has been converted to binaural so you can listen with your headphones.

This one was for a rather long, complex piece that has several movements, but I’ve also experimented with more conventional “pop” songs and it can work equally well. Mixing in spatial audio is a very different approach compared to traditional stereo. It can be very freeing – you have much more space to play with – but that has to be balanced with retaining some sort of coherence – musical and spatial – between all of the different elements.

If you need your own Ambisonic or binaural mix created then please get in touch.