Galatea

Abstract

Many existing systems for multi-channel sound source localization and separation are built on or designed for specific microphone array geometries, which means that for a new scenario where the array geometry or the number of microphones is different, the system needs to be re-designed or re-trained. Recent work has attempted to build neural network models for ad-hoc microphone arrays, but these models cannot fully leverage the array-dependent information used by array-specific models. In this paper, we propose Galatea, a unified geometry-aware source localization and separation framework that can build models for both ad-hoc arrays and geometry-specific arrays. We introduce random array geometry sampling with a fast on-the-fly data simulation process for general ad-hoc arrays, geometry-dependent feature extraction for array-specific finetuning, and a novel model architecture to boost both the localization and separation performance. Experimental results on different array geometries demonstrate the effectiveness of Galatea.

Validation of multi-look DFs across arrays

To verify that the essential content of multi-look directional features (DFs) is similar across microphone arrays, with relatively small variations, the figure below shows multi-look DFs computed from a 4-mic linear array (LA), a 6-mic circular array (CA), and a 5-mic ad-hoc array under the same simulation configuration. The two speakers are located at 0° and 180° with respect to the microphone array center, respectively. Although this property of DFs holds under other conditions as well, to illustrate the DFs more clearly the SIR between the two speech sources is set to 0 dB and the SNR is set to 15 dB.

A simulated mixture sample with target speech sources located at 0° and 180°, respectively. (a) The geometries of the three microphone arrays: a 4-mic linear array (LA), a 6-mic circular array (CA), and a 5-mic ad-hoc array. (b) The logarithm power spectrum (LPS) of the mixture and target speech signals. (c) Multi-look DFs extracted at four directions with the different microphone arrays.
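
As a rough illustration of why such features can look similar across geometries, the sketch below computes one common form of directional feature: the cosine similarity between the observed inter-channel phase difference (IPD) and the far-field target phase difference (TPD) for a look direction, averaged over all microphone pairs. This is an assumed formulation, not necessarily the one used in Galatea; the function names, the 343 m/s sound speed, the 3.5 cm spacing, and the anechoic plane-wave test signal are all hypothetical choices made for the example.

```python
import itertools
import numpy as np

SOUND_SPEED = 343.0  # m/s (assumed value)

def directional_feature(stft, mic_xy, look_deg, freqs, c=SOUND_SPEED):
    """Mean over mic pairs of cos(IPD - TPD) for one look direction.

    stft:   (M, F, T) complex multi-channel STFT
    mic_xy: (M, 2) microphone positions in meters
    freqs:  (F,) center frequency of each STFT bin in Hz
    """
    look = np.deg2rad(look_deg)
    u = np.array([np.cos(look), np.sin(look)])  # far-field unit vector
    pairs = list(itertools.combinations(range(len(mic_xy)), 2))
    df = np.zeros(stft.shape[1:])
    for i, j in pairs:
        ipd = np.angle(stft[i] * np.conj(stft[j]))                    # (F, T)
        tpd = 2 * np.pi * freqs * ((mic_xy[i] - mic_xy[j]) @ u) / c   # (F,)
        df += np.cos(ipd - tpd[:, None])
    return df / len(pairs)

def plane_wave_stft(src_stft, mic_xy, doa_deg, freqs, c=SOUND_SPEED):
    """Hypothetical helper: anechoic far-field source observed by the array."""
    doa = np.deg2rad(doa_deg)
    u = np.array([np.cos(doa), np.sin(doa)])
    phase = 2 * np.pi * freqs[None, :] * (mic_xy @ u)[:, None] / c    # (M, F)
    return src_stft[None] * np.exp(1j * phase)[:, :, None]

# Toy check on a 4-mic linear array (3.5 cm spacing assumed) with a source at 0°.
rng = np.random.default_rng(0)
freqs = np.linspace(0.0, 8000.0, 257)
mic_xy = np.stack([0.035 * np.arange(4), np.zeros(4)], axis=1)
src = rng.standard_normal((257, 50)) + 1j * rng.standard_normal((257, 50))
mix = plane_wave_stft(src, mic_xy, 0.0, freqs)
df_on = directional_feature(mix, mic_xy, 0.0, freqs)    # look at the source
df_off = directional_feature(mix, mic_xy, 90.0, freqs)  # look away from it
```

Because the feature depends on the geometry only through the TPD term, the same function applies unchanged to a circular or ad-hoc layout by passing different `mic_xy` coordinates. In this sketch the feature is close to 1 in bins dominated by a source in the look direction and lower elsewhere.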

Array geometry	Mixture (multi-channel)	Source 1	Source 2
4 mic LA
6 mic CA
5 mic ad-hoc

Results on simulation data

To verify the universality of Galatea for both ad-hoc and specific microphone arrays, we set up several acoustic scenarios and simulate the mixture signals received by microphone arrays with different geometries in each scenario. We present the sound source localization (SSL) results of Loc-BSRNN and the speech separation results of SS-BSRNN.
(Due to the front-back symmetry of the linear microphone array, we map its SSL results to the range [-90°, 90°].)

Results on simulation configuration 1

# of sources	DoAs of sources	overlap(%)	SNR(dB)
N = 1	[30°]	None	5.0dB

Geometry	Mixture	Loc-BSRNN	SS-BSRNN	FasNet-TAC
dual-mic. with 10.5 cm spacing		N = 1, [30°]
4-mic. linear array with a uniform spacing of 3.5 cm		N = 1, [30°]
4-mic. RA with a side length of 7.5 cm		N = 1, [30°]
6-mic. CA with a diameter of 10.5 cm		N = 1, [30°]
ad-hoc mic. array with 5 microphones		N = 1, [30°]

Results on simulation configuration 2

# of sources	DoAs of sources	overlap(%)	SNR(dB)
N = 2	[-60°, 40°]	60%	8.0dB

Geometry	Mixture	Loc-BSRNN	SS-BSRNN	FasNet-TAC
dual-mic. with 10.5 cm spacing		N = 2, [-60°, 40°]
4-mic. linear array with a uniform spacing of 3.5 cm		N = 2, [-60°, 40°]
4-mic. RA with a side length of 7.5 cm		N = 2, [-60°, 40°]
6-mic. CA with a diameter of 10.5 cm		N = 2, [-60°, 40°]
ad-hoc mic. array with 4 microphones		N = 2, [-60°, 40°]

Results on simulation configuration 3

# of sources	DoAs of sources	overlap(%)	SNR(dB)
N = 2	[10°, 30°]	10%	6.0dB

Geometry	Mixture	Loc-BSRNN	SS-BSRNN	FasNet-TAC
dual-mic. with 8.0 cm spacing		N = 2, [10°, 30°]
4-mic. linear array with a uniform spacing of 4.0 cm		N = 2, [10°, 30°]
4-mic. RA with a side length of 7.0 cm		N = 2, [10°, 30°]
6-mic. CA with a diameter of 12.0 cm		N = 2, [10°, 30°]
ad-hoc mic. array with 3 microphones		N = 2, [10°, 30°]

Results on simulation configuration 4

# of sources	DoAs of sources	overlap(%)	SNR(dB)
N = 2	[70°, 170°]	90%	5.0dB

Geometry	Mixture	Loc-BSRNN	SS-BSRNN	FasNet-TAC
dual-mic. with 12.0 cm spacing		N = 2, [70°, -10°]
4-mic. linear array with a uniform spacing of 5.0 cm		N = 2, [70°, -10°]
4-mic. RA with a side length of 10.0 cm		N = 2, [70°, 170°]
6-mic. CA with a diameter of 8.0 cm		N = 2, [70°, 170°]
ad-hoc mic. array with 7 microphones		N = 2, [70°, 170°]

Results on real-recorded data

The figure below shows an application scenario: an offline conference. The conversation takes place in a conference room (about 10 × 6 × 4 m). A conference table is placed at the center of the room, with the microphone array on the table. Two male speakers sit at the table at 90° and 0° relative to the center of the microphone array, respectively, each about 1.5 meters from the array. An 8-element linear array with a uniform spacing of 3.5 cm is used to record the session, and we select different subsets of its microphones to form arrays with different geometries and numbers of microphones.

Session: Two male speakers, loud game character voice, music, B-Box

Array ID	# of mic.	Index of mic.	Array aperture
1	M = 2	[3, 6]	10.5 cm
2	M = 3	[2, 5, 7]	17.5 cm
3	M = 5	[1, 3, 4, 5, 7]	21.0 cm
4	M = 8	[1, 2, 3, 4, 5, 6, 7, 8]	24.5 cm
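
The values in the last column of the table above are consistent with the end-to-end span of each selected sub-array, given the 3.5 cm spacing of the full 8-element array. The short sketch below reproduces that arithmetic; it assumes the microphone indices are 1-based and ordered along the array, and the names `SUBARRAYS`, `mic_positions_cm`, and `aperture_cm` are hypothetical.

```python
SPACING_CM = 3.5  # uniform spacing of the 8-element linear array

# Array ID -> selected microphone indices (assumed 1-based, ordered along the array)
SUBARRAYS = {
    1: [3, 6],
    2: [2, 5, 7],
    3: [1, 3, 4, 5, 7],
    4: [1, 2, 3, 4, 5, 6, 7, 8],
}

def mic_positions_cm(indices, spacing_cm=SPACING_CM):
    """Position of each selected mic along the array axis, in cm."""
    return [(i - 1) * spacing_cm for i in indices]

def aperture_cm(indices, spacing_cm=SPACING_CM):
    """Distance between the two outermost selected mics, in cm."""
    pos = mic_positions_cm(indices, spacing_cm)
    return max(pos) - min(pos)

apertures = {array_id: aperture_cm(idx) for array_id, idx in SUBARRAYS.items()}
```

Note that arrays 2 and 3 are non-uniform: for example, array 2 has gaps of 10.5 cm and 7.0 cm between adjacent microphones.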

Time stamp	ID	Loc-BSRNN
[10.0 ~ 36.0]	1	N = 2, [0°,90°]
	2	N = 2, [0°,90°]
	3	N = 2, [0°,90°]
	4	N = 2, [0°,90°]
[39.0 ~ 61.0]	1	N = 2, [0°,90°]
	2	N = 2, [0°,90°]
	3	N = 2, [0°,90°]
	4	N = 2, [0°,90°]