Content Description of Very-long-duration Recordings of the Environment
Long-duration sound recordings are an established technique for monitoring terrestrial ecosystems. Acoustic sensing has several advantages over in-person field surveys, but one disadvantage is that technological advances now enable the collection of far more audio than can ever be listened to. Machine-learning methods can identify individual species, but such recognisers are time-consuming to build, and when the species of interest is absent they reveal nothing about the content of a recording. Visual methods have also been developed to interrogate long-duration recordings, but ultimately the interpretation of acoustic recordings must be ground-truthed by listening to the actual sound. However, the ear is constrained to listen in real time. Even listening to 10 hours of one-minute segments, selected randomly from one year of recording, samples only 0.11% of the data.

For this study, we recorded 13 months of continuous audio in natural Australian woodland. We divided the audio into one-minute segments, which yields a content description at one-minute resolution. The feature set representing each segment consists of summary and/or spectral acoustic indices. Our objective in this investigation is two-fold: (1) to maximise the content description of a very-long-duration recording while keeping listening effort to manageable levels; and (2) to determine how the content description is influenced by the choice of acoustic features and other variables. We begin by clustering the acoustic feature vectors using the k-means algorithm. Given sufficient clusters (k = 60), each cluster can be interpreted as a discrete acoustic state within the year-long soundscape. We report four findings:

1. Listening to the medoid minute of each cluster (the minute whose feature vector is closest to the cluster centroid) yields a content description similar to that obtained by listening to a random sample of ten minutes from each cluster. This represents a ten-fold reduction in listening effort.
2. Although k-means is known to produce different clustering outcomes depending on cluster initialisation, we find that the content description is little affected by different runs of k-means.
3. Different feature vectors yield slightly different content descriptions, depending on which acoustic events are 'targeted' by the selected features.
4. Training a Hidden Markov Model on the year-long cluster sequence helps to identify the underlying acoustic communities and can be used to obtain a more fine-grained labelling of sound sources of interest.
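As a rough illustration of the medoid-selection step in finding 1, the sketch below clusters per-minute feature vectors with scikit-learn's KMeans and then picks, for each cluster, the minute whose feature vector lies closest to the cluster centroid. The value k = 60 echoes the text, but the feature values here are random stand-ins, not real acoustic indices, and the array shapes are illustrative assumptions.

```python
# Sketch: k-means clustering of per-minute acoustic feature vectors,
# followed by medoid selection (the minute nearest each cluster centroid).
# Feature values are synthetic stand-ins for real acoustic indices.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_minutes, n_features, k = 5000, 12, 60   # a full year would be ~525,600 minutes

X = rng.normal(size=(n_minutes, n_features))  # one feature vector per minute
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

medoids = {}
for c in range(k):
    members = np.flatnonzero(km.labels_ == c)
    # Euclidean distance of each member minute to the cluster centroid
    d = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
    medoids[c] = members[np.argmin(d)]    # index of the minute to listen to
```

Under this scheme an analyst listens to k = 60 representative minutes rather than ten random minutes per cluster, which is the ten-fold reduction the abstract describes.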
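Finding 4 trains a Hidden Markov Model on the year-long cluster sequence; the training itself (e.g. Baum-Welch) is beyond a short sketch, but the decoding step can be shown. Below is a minimal Viterbi decoder in NumPy that maps an observed sequence of cluster labels to the most likely sequence of hidden acoustic "communities". The two-state, two-symbol setup and all probability matrices are illustrative assumptions, not the paper's trained model.

```python
# Sketch: Viterbi decoding of a cluster-label sequence into hidden
# acoustic "communities". Matrices below are assumed for illustration.
import numpy as np

def viterbi(obs, log_start, log_trans, log_emit):
    """Most likely hidden-state path for a discrete observation sequence."""
    n_states = log_trans.shape[0]
    T = len(obs)
    delta = np.empty((T, n_states))           # best log-score ending in each state
    back = np.zeros((T, n_states), dtype=int)  # backpointers
    delta[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[:, obs[t]]
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

# Two hypothetical communities with "sticky" transitions, each strongly
# associated with one of two observed cluster labels.
log_start = np.log([0.5, 0.5])
log_trans = np.log([[0.9, 0.1],
                    [0.1, 0.9]])
log_emit = np.log([[0.9, 0.1],
                   [0.1, 0.9]])
obs = np.array([0, 0, 0, 1, 1, 1])
path = viterbi(obs, log_start, log_trans, log_emit)
```

The sticky transition matrix smooths the raw cluster labels into longer runs, which is how an HMM can recover coherent acoustic communities from a noisy minute-by-minute cluster sequence.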