A Unified Geometry-aware Source Localization and Separation Framework for Ad-hoc Microphone Array

Jingjie Fan1, Rongzhi Gu2, Yi Luo2, Cong Pang1
1School of Information Science and Engineering, Southeast University
2Tencent AI Lab, Shenzhen, China

Abstract

Many existing systems for multi-channel sound source localization and separation are built on or designed for specific microphone array geometries, so for a new scenario where the array geometry or the number of microphones differs, the system must be re-designed or re-trained. Recent work has built neural network models for ad-hoc microphone arrays, but these models cannot fully leverage the array-dependent information used by array-specific models. In this paper, we propose Galatea, a unified geometry-aware source localization and separation framework that can build models for both ad-hoc arrays and geometry-specific arrays. We introduce random array geometry sampling with a fast on-the-fly data simulation process for general ad-hoc arrays, geometry-dependent feature extraction for array-specific finetuning, and a novel model architecture to boost both the localization and separation performance. Experimental results on different array geometries demonstrate the effectiveness of Galatea.

Validation of multi-look DFs across arrays

To verify that the essential content of multi-look directional features (DFs) is similar across microphone arrays, and that their variation is relatively small, the figure below shows multi-look DFs computed from a 4-mic linear array (LA), a 6-mic circular array (CA), and a 5-mic ad-hoc array under the same simulation configuration. The two speakers are located at 0° and 180° with respect to the microphone array center. This property of DFs also holds under other conditions; to illustrate the DFs more clearly, the SIR between the two speech sources is set to 0 dB and the SNR is set to 15 dB.



the image of DFs

A simulated mixture sample with target speech sources located at 0° and 180°, respectively. (a) The geometry of the three microphone arrays: a 4-mic linear array (LA), a 6-mic circular array (CA), and a 5-mic ad-hoc array. (b) The log power spectrum (LPS) of the mixture and the target speech signals. (c) Multi-look DFs extracted at four directions with the different microphone arrays.


| Array geometry | Mixture (multi-channel) | Source 1 | Source 2 |
|---|---|---|---|
| 4-mic LA | | | |
| 6-mic CA | | | |
| 5-mic ad-hoc | | | |
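
The page does not spell out how the multi-look DFs discussed above are computed. One formulation often used in the literature averages, over microphone pairs, the cosine of the difference between the observed inter-channel phase difference (IPD) and the phase difference that a far-field source at a given look direction would produce. The sketch below assumes that formulation, a far-field plane-wave model, 2-D microphone coordinates, and a sound speed of 343 m/s. The function name `multi_look_df` and its arguments are hypothetical and may differ from the authors' implementation.

```python
import itertools

import numpy as np

SPEED_OF_SOUND = 343.0  # m/s (assumed value)


def multi_look_df(spec, mic_xy, look_deg, freqs_hz):
    """Directional features for several look directions (illustrative sketch).

    spec:     complex STFT of the mixture, shape (M, F, T)
    mic_xy:   microphone coordinates in metres, shape (M, 2)
    look_deg: iterable of look directions in degrees
    freqs_hz: centre frequency of each STFT bin, shape (F,)
    Returns an array of shape (L, F, T) with values in [-1, 1].
    """
    mic_xy = np.asarray(mic_xy, dtype=float)
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    phase = np.angle(spec)
    pairs = list(itertools.combinations(range(len(mic_xy)), 2))

    feats = []
    for theta in np.deg2rad(np.asarray(look_deg, dtype=float)):
        unit = np.array([np.cos(theta), np.sin(theta)])  # direction towards the source
        df = np.zeros(spec.shape[1:])
        for i, j in pairs:
            ipd = phase[i] - phase[j]  # observed phase difference, shape (F, T)
            # Phase difference expected for a plane wave from direction theta, shape (F,)
            tpd = 2.0 * np.pi * freqs_hz * ((mic_xy[i] - mic_xy[j]) @ unit) / SPEED_OF_SOUND
            df += np.cos(ipd - tpd[:, None])
        feats.append(df / len(pairs))
    return np.stack(feats)
```

Because the feature depends only on the microphone coordinates passed in, the same function applies to a linear, circular, or ad-hoc layout. For a single plane wave arriving from 0°, the feature at the 0° look direction equals 1 in every time-frequency bin, while other look directions give smaller values on average.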

Results on simulation data

To verify the universality of Galatea for both ad-hoc and specific microphone arrays, we set up several acoustic scenarios and, in each scenario, simulate the mixture signals received by microphone arrays with different geometries. We present the sound source localization (SSL) results of Loc-BSRNN and the speech separation results of SS-BSRNN.
(Due to the front-back symmetry of the linear microphone array, we map its SSL results to the range [-90°, 90°].)
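
The page does not state the formula used to map linear-array SSL results into [-90°, 90°]. As a minimal sketch, the helper below assumes a wrap with a 180° period, chosen only because it reproduces the entry in simulation configuration 4, where a source at 170° is reported as -10° for the dual-mic. and linear arrays. The name `fold_doa_deg` is hypothetical, and a mirror convention (which would give 10° instead) is equally plausible depending on how the array axis is oriented.

```python
def fold_doa_deg(theta_deg):
    """Wrap an angle in degrees into [-90, 90) using a 180-degree period.

    Assumed convention, not taken from the paper: angles that differ by
    180 degrees are treated as indistinguishable for a linear array.
    """
    return (theta_deg + 90.0) % 180.0 - 90.0


for doa in (70.0, 170.0):
    print(doa, "->", fold_doa_deg(doa))
```

Note that with this wrap an input of exactly 90° maps to -90°, so the two endpoints of the range are treated as the same direction.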

the image of pipeline

Results on simulation configuration 1

| # of sources | DoAs of sources | Overlap (%) | SNR (dB) |
|---|---|---|---|
| N = 1 | [30°] | None | 5.0 |

| Geometry | Mixture | Loc-BSRNN | SS-BSRNN | FasNet-TAC |
|---|---|---|---|---|
| Dual-mic. with 10.5 cm spacing | | N = 1, [30°] | | |
| 4-mic. linear array with a uniform spacing of 3.5 cm | | N = 1, [30°] | | |
| 4-mic. RA with a side length of 7.5 cm | | N = 1, [30°] | | |
| 6-mic. CA with a diameter of 10.5 cm | | N = 1, [30°] | | |
| Ad-hoc mic. array with 5 microphones | | N = 1, [30°] | | |

Results on simulation configuration 2

| # of sources | DoAs of sources | Overlap (%) | SNR (dB) |
|---|---|---|---|
| N = 2 | [-60°, 40°] | 60 | 8.0 |

| Geometry | Mixture | Loc-BSRNN | SS-BSRNN | FasNet-TAC |
|---|---|---|---|---|
| Dual-mic. with 10.5 cm spacing | | N = 2, [-60°, 40°] | | |
| 4-mic. linear array with a uniform spacing of 3.5 cm | | N = 2, [-60°, 40°] | | |
| 4-mic. RA with a side length of 7.5 cm | | N = 2, [-60°, 40°] | | |
| 6-mic. CA with a diameter of 10.5 cm | | N = 2, [-60°, 40°] | | |
| Ad-hoc mic. array with 4 microphones | | N = 2, [-60°, 40°] | | |

Results on simulation configuration 3

| # of sources | DoAs of sources | Overlap (%) | SNR (dB) |
|---|---|---|---|
| N = 2 | [10°, 30°] | 10 | 6.0 |

| Geometry | Mixture | Loc-BSRNN | SS-BSRNN | FasNet-TAC |
|---|---|---|---|---|
| Dual-mic. with 8.0 cm spacing | | N = 2, [10°, 30°] | | |
| 4-mic. linear array with a uniform spacing of 4.0 cm | | N = 2, [10°, 30°] | | |
| 4-mic. RA with a side length of 7.0 cm | | N = 2, [10°, 30°] | | |
| 6-mic. CA with a diameter of 12.0 cm | | N = 2, [10°, 30°] | | |
| Ad-hoc mic. array with 3 microphones | | N = 2, [10°, 30°] | | |

Results on simulation configuration 4

| # of sources | DoAs of sources | Overlap (%) | SNR (dB) |
|---|---|---|---|
| N = 2 | [70°, 170°] | 90 | 5.0 |

| Geometry | Mixture | Loc-BSRNN | SS-BSRNN | FasNet-TAC |
|---|---|---|---|---|
| Dual-mic. with 12.0 cm spacing | | N = 2, [70°, -10°] | | |
| 4-mic. linear array with a uniform spacing of 5.0 cm | | N = 2, [70°, -10°] | | |
| 4-mic. RA with a side length of 10.0 cm | | N = 2, [70°, 170°] | | |
| 6-mic. CA with a diameter of 8.0 cm | | N = 2, [70°, 170°] | | |
| Ad-hoc mic. array with 7 microphones | | N = 2, [70°, 170°] | | |

Results on real-recorded data

The figure below shows an application scenario: an offline (in-person) conference. The conversation takes place in a conference room (about 10 × 6 × 4 m). A conference table is placed at the center of the room, with the microphone array on the table. Two male speakers sit at the table at 90° and 0° relative to the center of the microphone array, respectively. The speaker-to-array distance is about 1.5 meters. An 8-element linear array with a uniform spacing of 3.5 cm is used to record the session, and we select different subsets of its microphones to form arrays with different geometries and numbers of microphones.

the image of conference room

Session: Two male speakers, loud game character voice, music, B-Box



| Array ID | # of mic. | Index of mic. | Array spacing |
|---|---|---|---|
| 1 | M = 2 | [3, 6] | 10.5 cm |
| 2 | M = 3 | [2, 5, 7] | 17.5 cm |
| 3 | M = 5 | [1, 3, 4, 5, 7] | 21.0 cm |
| 4 | M = 8 | [1, 2, 3, 4, 5, 6, 7, 8] | 24.5 cm |
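
The spacing values listed for the four sub-arrays match the distance between the first and last selected microphones of the 8-element, 3.5 cm linear array. The short sketch below checks this, assuming the microphone indices are 1-based and increase along the array; the helper names are hypothetical.

```python
SPACING_CM = 3.5  # uniform spacing of the 8-element linear array


def subarray_positions_cm(mic_indices, spacing_cm=SPACING_CM):
    """Position of each selected microphone along the array axis, in cm.

    Assumes 1-based indices, with microphone 1 at position 0.
    """
    return [(i - 1) * spacing_cm for i in mic_indices]


def aperture_cm(mic_indices, spacing_cm=SPACING_CM):
    """Distance between the outermost selected microphones, in cm."""
    positions = subarray_positions_cm(mic_indices, spacing_cm)
    return max(positions) - min(positions)


subsets = {
    1: [3, 6],
    2: [2, 5, 7],
    3: [1, 3, 4, 5, 7],
    4: [1, 2, 3, 4, 5, 6, 7, 8],
}

for array_id, indices in subsets.items():
    print(array_id, len(indices), aperture_cm(indices))
```

Arrays 2 and 3 are non-uniform: for example, the subset [2, 5, 7] has microphones at 3.5, 14.0, and 21.0 cm, so neighbouring spacings are 10.5 cm and 7.0 cm.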
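| Time stamp | ID | Recording | Loc-BSRNN | SS-BSRNN | FasNet-TAC |
|---|---|---|---|---|---|
| [10.0 ~ 36.0] | 1 | | N = 2, [0°, 90°] | | |
| [10.0 ~ 36.0] | 2 | | N = 2, [0°, 90°] | | |
| [10.0 ~ 36.0] | 3 | | N = 2, [0°, 90°] | | |
| [10.0 ~ 36.0] | 4 | | N = 2, [0°, 90°] | | |
| [39.0 ~ 61.0] | 1 | | N = 2, [0°, 90°] | | |
| [39.0 ~ 61.0] | 2 | | N = 2, [0°, 90°] | | |
| [39.0 ~ 61.0] | 3 | | N = 2, [0°, 90°] | | |
| [39.0 ~ 61.0] | 4 | | N = 2, [0°, 90°] | | |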