Scaper tutorial¶
Introduction¶
Welcome to the scaper tutorial! In this tutorial, we’ll explain how scaper works and show how to use scaper to synthesize soundscapes and generate corresponding annotation files.
Organize your audio files (source material)¶
Scaper creates new soundscapes by combining and transforming a set of existing audio files, which we’ll refer to as the source material. By combining and transforming the source material in different ways, scaper can create an infinite number of different soundscapes from the same source material. The source material consists of two groups of files: background files and foreground files:
- Background files: are used to create the background of the soundscape, and should contain audio material that is perceived as a single holistic sound which is more distant, ambiguous, and texture-like (e.g. the “hum” or “drone” of an urban environment, or “wind and rain” sounds in a natural environment). Importantly, background files should not contain salient sound events.
- Foreground files: are used to create sound events. Each foreground audio file should contain a single sound event (short or long) such as a car honk, an animal vocalization, continuous speech, a siren or an idling engine. Foreground files should be as clean as possible with no background noise and no silence before/after the sound event.
The source material must be organized as follows: at the top level, you need a background folder and a foreground folder. Within each of these, scaper expects a folder for each category (label) of sounds. Example categories of background sounds include “street”, “forest” or “park”. Example categories of foreground sounds include “speech”, “car_honk”, “siren”, “dog_bark”, “bird_song”, “idling_engine”, etc. Within each category folder, scaper expects WAV files of that category: background files should contain a single ambient recording, and foreground files should contain a single sound event. The filename of each audio file is not important as long as it ends with .wav.
Here’s an example of a valid folder structure for the source material:
- foreground
    - siren
        - siren1.wav
        - siren2.wav
        - some_siren_sound.wav
    - car_honk
        - honk.wav
        - beep.wav
    - human_voice
        - hello_world.wav
        - shouting.wav
- background
    - park
        - central_park.wav
    - street
        - quiet_street.wav
        - loud_street.wav
Example source material can be obtained by downloading the scaper repository (approx. 50 MB). The audio can be found under scaper-master/tests/data/audio (the audio folder contains two subfolders, background and foreground).
For the remainder of this tutorial, we’ll assume you’ve downloaded this material and copied the audio folder to your home directory. If you copy it somewhere else (or use different source material), be sure to change the paths to the foreground_folder and background_folder in the example code below.
Create a Scaper object¶
The first step is to create a Scaper object:
import scaper
import os
path_to_audio = os.path.expanduser('~/audio')
soundscape_duration = 10.0
seed = 123
foreground_folder = os.path.join(path_to_audio, 'foreground')
background_folder = os.path.join(path_to_audio, 'background')
sc = scaper.Scaper(soundscape_duration, foreground_folder, background_folder)
sc.ref_db = -20
We need to supply three arguments to create a Scaper object:
- The desired duration: all soundscapes generated by this Scaper object will have this duration.
- The path to the foreground folder.
- The path to the background folder.
If you’re not sure what the foreground and background folders are, please see Organize your audio files (source material).
Finally, we set the reference level sc.ref_db, i.e. the loudness of the background, measured in LUFS. Later, when we add foreground events, we’ll have to specify an snr (signal-to-noise ratio) value, i.e. by how many decibels (dB) the foreground event should be louder (or softer) with respect to the background level specified by sc.ref_db.
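To make the relationship between ref_db and snr concrete, here is a small back-of-the-envelope sketch. The variable names below are ours for illustration, not part of the scaper API:

```python
# Illustrative arithmetic only; these variable names are not scaper API.
# The background is normalized to ref_db LUFS, and each foreground event
# is placed snr dB above (or below) that level.
ref_db = -20           # background level in LUFS
snr = 10               # event SNR in dB relative to the background
event_level = ref_db + snr
print(event_level)     # -10: the event sits ~10 dB above the background
```

So an event added with snr=10 against a background at ref_db=-20 should end up at roughly -10 LUFS.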
Seeding the Scaper object for reproducibility¶
A further argument can be specified when creating the Scaper object:
- The random state: this can be either a numpy.random.RandomState object or an integer. In the latter case, a random state will be constructed from the integer seed. The random state is used for drawing from all distribution tuples. If the audio in the source folders is exactly the same and the random state is fixed between runs, the same soundscape will be generated each time. If you don’t specify a random state, or set it to None, runs will be random and not reproducible. To reproduce an unseeded run after the fact, you can record the state returned by np.random.get_state() before the run.
This can be specified like so (e.g. for a random seed of 123):
seed = 123
sc = scaper.Scaper(soundscape_duration, foreground_folder, background_folder,
                   random_state=seed)
sc.ref_db = -20
If the random state is not specified, it defaults to the old behavior, which simply uses the global RandomState of np.random. You can also set the random state after creation via Scaper.set_random_state. Alternatively, you can pass a RandomState object directly:
import numpy as np
seed = np.random.RandomState(123)
sc = scaper.Scaper(soundscape_duration, foreground_folder, background_folder,
                   random_state=seed)
sc.ref_db = -20
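The reproducibility guarantee comes from NumPy’s random state. A minimal, scaper-independent sketch of why identical seeds yield identical draws:

```python
import numpy as np

# Two generators seeded identically produce identical draws, which is
# why fixing random_state makes scaper soundscapes reproducible
# (given identical source material).
rs1 = np.random.RandomState(123)
rs2 = np.random.RandomState(123)

draws1 = rs1.uniform(0, 9, size=5)
draws2 = rs2.uniform(0, 9, size=5)
print(np.array_equal(draws1, draws2))  # True
```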
Adding background and foreground sound events¶
Adding a background¶
Next, we can optionally add a background track to our soundscape:
sc.add_background(label=('const', 'park'),
                  source_file=('choose', []),
                  source_time=('const', 0))
To add a background we have to specify:
- label: the label (category) of the background, which has to match the name of one of the subfolders in our background folder (in our example “park” or “street”).
- source_file: the path to the specific audio file to be used.
- source_time: the time in the source file from which to start the background.
Note how in the example above we do not specify these values directly via strings or floats; rather, we provide each argument with a tuple. These tuples are called distribution tuples and are used in scaper for specifying all sound event parameters. Let’s explain:
Distribution tuples¶
One of the powerful things about scaper is that it allows you to define a soundscape in a probabilistic way. That is, rather than specifying constant (hard-coded) values for each sound event, you can specify a distribution of values to sample from. Later on, when we call sc.generate(), a soundscape will be “instantiated” by sampling a value for each distribution tuple in each sound event (foreground and background). Every time we call sc.generate(), a new value will be sampled for each distribution tuple, resulting in a different soundscape.
The distribution tuples currently supported by scaper are:
- ('const', value): a constant, given by value.
- ('choose', list): uniformly sample from a finite set of values given by list.
- ('uniform', min, max): sample from a uniform distribution between min and max.
- ('normal', mean, std): sample from a normal distribution with mean mean and standard deviation std.
- ('truncnorm', mean, std, min, max): sample from a truncated normal distribution with mean mean and standard deviation std, limited to values between min and max.
Special cases: the label and source_file parameters in sc.add_background() (and, as we’ll see later, sc.add_event() as well) must be specified using either the const or choose distribution tuples. When using choose, these two parameters (and only these) can also accept a special version of the choose tuple in the form ('choose', []), i.e. with an empty list. In this case, scaper will use the file structure in the foreground and background folders to automatically populate the list with all valid labels (in the case of the label parameter) and all valid filenames (in the case of the source_file parameter).
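A rough sketch of how the empty choose list could be populated from the folder structure. The helper functions below mimic the behavior described above; they are our own illustration, not scaper’s actual implementation:

```python
import os
import tempfile

def list_labels(folder):
    # Valid labels are the names of the subfolders (one per category).
    return sorted(d for d in os.listdir(folder)
                  if os.path.isdir(os.path.join(folder, d)))

def list_source_files(folder, label):
    # Valid source files are the .wav files inside the label's subfolder.
    label_dir = os.path.join(folder, label)
    return sorted(os.path.join(label_dir, f) for f in os.listdir(label_dir)
                  if f.endswith('.wav'))

# Build a tiny throwaway foreground folder to demonstrate.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'siren'))
open(os.path.join(root, 'siren', 'siren1.wav'), 'w').close()

print(list_labels(root))                 # ['siren']
print(list_source_files(root, 'siren'))  # one path ending in siren1.wav
```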
Adding a foreground sound event¶
Next, we can add foreground sound events. Let’s add one to start with:
sc.add_event(label=('const', 'siren'),
             source_file=('choose', []),
             source_time=('const', 0),
             event_time=('uniform', 0, 9),
             event_duration=('truncnorm', 3, 1, 0.5, 5),
             snr=('normal', 10, 3),
             pitch_shift=('uniform', -2, 2),
             time_stretch=('uniform', 0.8, 1.2))
A foreground sound event requires several additional parameters compared to a background event. The full set of parameters is:
- label: the label (category) of the foreground event, which has to match the name of one of the subfolders in our foreground folder (in our example “siren”, “car_honk” or “human_voice”).
- source_file: the path to the specific audio file to be used.
- source_time: the time in the source file from which to start the event.
- event_time: the start time of the event in the synthesized soundscape.
- event_duration: the duration of the event in the synthesized soundscape.
- snr: the signal-to-noise ratio compared to the background, i.e. how many dB above or below the background this sound event should be perceived.
Scaper also supports on-the-fly augmentation of sound events, that is, applying audio transformations to the sound events in order to increase the variability of the resulting soundscape. Currently, the supported transformations include pitch shifting and time stretching:
- pitch_shift: the number of semitones (can be fractional) by which to shift the sound up or down.
- time_stretch: the factor by which to stretch the sound event. Factors <1 will make the event shorter, and factors >1 will make it longer.
If you do not wish to apply any transformations, these latter two parameters (and only these) also accept None instead of a distribution tuple.
So, going back to the example code above: we’re adding a siren sound event; the specific audio file to use will be chosen randomly from all available siren audio files in the foreground/siren subfolder; the event will start at time 0 of the source file, and be “pasted” into the synthesized soundscape at a start time chosen uniformly between 0 and 9 seconds. The event duration will be randomly chosen from a truncated normal distribution with a mean of 3 seconds, standard deviation of 1 second, and min/max values of 0.5 and 5 seconds respectively. The loudness with respect to the background will be chosen from a normal distribution with mean 10 dB and standard deviation 3 dB. Finally, the pitch of the sound event will be shifted by a value between -2 and 2 semitones chosen uniformly within that range, and the event will be stretched (or compressed) by a factor chosen uniformly between 0.8 and 1.2.
Let’s add a couple more events:
for _ in range(2):
    sc.add_event(label=('choose', []),
                 source_file=('choose', []),
                 source_time=('const', 0),
                 event_time=('uniform', 0, 9),
                 event_duration=('truncnorm', 3, 1, 0.5, 5),
                 snr=('normal', 10, 3),
                 pitch_shift=None,
                 time_stretch=None)
Here we use a for loop to quickly add two sound events. The specific label and source file for each event will be determined when we call sc.generate() (coming up), and will change with each call to this function.
Synthesizing soundscapes¶
Up to this point, we have created a Scaper object and added a background and three foreground sound events, whose parameters are specified using distribution tuples. Internally, this creates an event specification, i.e. a probabilistically-defined list of sound events. To synthesize a soundscape, we call the generate() function:
audiofile = 'soundscape.wav'
jamsfile = 'soundscape.jams'
txtfile = 'soundscape.txt'
sc.generate(audiofile, jamsfile,
            allow_repeated_label=True,
            allow_repeated_source=True,
            reverb=0.1,
            disable_sox_warnings=True,
            no_audio=False,
            txt_path=txtfile)
This will instantiate the event specification by sampling specific parameter values for every sound event from the distribution tuples stored in the specification. Once all parameter values have been sampled, they are used by scaper’s audio processing engine to compose the soundscape and save the resulting audio to audiofile.
But that’s not where it ends! Scaper will also generate an annotation file in JAMS format, which serves as the reference annotation (also referred to as “ground truth”) for the generated soundscape. Due to the flexibility of the JAMS format, scaper will store in the JAMS file, in addition to the actual sound events, the probabilistic event specification (one for background events and one for foreground events). The value field of each observation in the JAMS file will contain a dictionary with all instantiated parameter values. This allows us to fully reconstruct the audio of a scaper soundscape from its JAMS annotation using the scaper.generate_from_jams() function (not discussed in this tutorial).
We can optionally provide generate() a path to a text file with the txt_path parameter. If provided, scaper will also save a simplified annotation of the soundscape in a tab-separated text file with three columns for the start time, end time, and label of every foreground sound event (note that the background is not stored in the simplified annotation!). The default separator is a tab, for compatibility with the Audacity label file format. The separator can be changed via generate()’s txt_sep parameter.
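Because the simplified annotation is just tab-separated text, it is easy to consume downstream. A sketch of parsing it with the standard library (the file contents below are made up for illustration):

```python
import csv
import io

# Made-up example contents of a simplified annotation file:
# start time <TAB> end time <TAB> label, one row per foreground event.
txt_contents = "0.71\t4.83\tsiren\n2.10\t5.42\tcar_honk\n"

events = []
for start, end, label in csv.reader(io.StringIO(txt_contents), delimiter='\t'):
    events.append((float(start), float(end), label))

print(events)  # [(0.71, 4.83, 'siren'), (2.1, 5.42, 'car_honk')]
```

To read an actual file produced by generate(), replace the io.StringIO wrapper with open(txtfile).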
Synthesizing isolated events alongside the soundscape¶
We can also output the isolated foreground events and backgrounds alongside the soundscape.
This is especially useful for generating datasets that can be used to train and evaluate
source separation algorithms or models. To enable this, two additional arguments can be
given to generate() and generate_from_jams():
- save_isolated_events: whether or not to save the audio corresponding to the isolated foreground events and backgrounds within the synthesized soundscape. In our example, there are three components: the background and the two foreground events.
- isolated_events_path: the path where the audio corresponding to the isolated foreground events and backgrounds will be saved. If None (default) and save_isolated_events=True, the events are saved to <parentdir>/<audiofilename>_events/, where <parentdir> is the parent folder of the soundscape audio file provided in the audiofile parameter in the example below:
audiofile = '~/scaper_output/mysoundscape.wav'
jamsfile = '~/scaper_output/mysoundscape.jams'
txtfile = '~/scaper_output/mysoundscape.txt'
sc.generate(audiofile, jamsfile,
            allow_repeated_label=True,
            allow_repeated_source=True,
            reverb=None,
            disable_sox_warnings=True,
            no_audio=False,
            txt_path=txtfile,
            save_isolated_events=True)
The code above will produce the following directory structure:
~/scaper_output/mysoundscape.wav
~/scaper_output/mysoundscape.jams
~/scaper_output/mysoundscape.txt
~/scaper_output/mysoundscape_events/
background0_<label>.wav
foreground0_<label>.wav
foreground1_<label>.wav
The labels for each isolated event are determined after generate is called.
If isolated_events_path were specified, then it would produce:
~/scaper_output/mysoundscape.wav
~/scaper_output/mysoundscape.jams
~/scaper_output/mysoundscape.txt
<isolated_events_path>/
background0_<label>.wav
foreground0_<label>.wav
foreground1_<label>.wav
The audio of the isolated events is guaranteed to sum up to the soundscape audio if and only if reverb is None! The audio of the isolated events, as well as the audio of the soundscape, can be accessed directly via the JAMS file as follows:
import jams
import numpy as np
import soundfile as sf

jam = jams.load(jamsfile)
ann = jam.annotations.search(namespace='scaper')[0]

soundscape_audio, sr = sf.read(ann.sandbox.scaper.soundscape_audio_path)

isolated_event_audio_paths = ann.sandbox.scaper.isolated_events_audio_path
isolated_audio = []
for event_spec, event_audio_path in zip(ann, isolated_event_audio_paths):
    # event_spec contains the event description, label, etc.
    # event_audio_path contains the path to the actual audio.
    # Make sure the path matches the event description.
    assert event_spec.value['role'] in event_audio_path
    assert event_spec.value['label'] in event_audio_path
    isolated_audio.append(sf.read(event_audio_path)[0])
# The isolated audio should sum to the soundscape, up to numerical precision.
assert np.allclose(sum(isolated_audio), soundscape_audio, atol=1e-4)
That’s it! For a more detailed example of automatically synthesizing 1000 soundscapes using a single Scaper object, please see Example: synthesizing 1000 soundscapes in one go.