This page is an on-line demo of our recent research results on audio captioning.
A full presentation of the results and the method is given in our paper, "WaveTransformer: An architecture for audio captioning", available from here and submitted for review to the 29th European Signal Processing Conference (EUSIPCO), 2021.
Below you can find three columns. Each column contains an audio player with two categories of textual descriptions (captions) beneath it. The captions in both categories correspond to the sound played by the audio player and are:
The columns correspond to a categorization of the predicted captions according to the employed metrics.
All sounds and original descriptions are drawn from the Clotho dataset, available here.
Our method was tested on the Clotho evaluation split, which consists of 1045 audio files and their associated captions. The resulting metrics for our method are:
BLEU1 | BLEU2 | BLEU3 | BLEU4 | METEOR | ROUGEL | CIDEr | SPICE | SPIDEr |
---|---|---|---|---|---|---|---|---|
0.498 | 0.303 | 0.197 | 0.120 | 0.143 | 0.332 | 0.268 | 0.095 | 0.182 |
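
As a quick sanity check on the table, recall that SPIDEr is defined as the arithmetic mean of CIDEr and SPICE. The short Python sketch below recomputes it from the reported CIDEr and SPICE values and compares the result with the reported SPIDEr. The names `reported` and `spider` are only illustrative and are not part of our code base; the tolerance assumes that the table values are rounded to three decimals.

```python
# Reported scores, copied from the CIDEr, SPICE, and SPIDEr columns of the table.
reported = {"CIDEr": 0.268, "SPICE": 0.095, "SPIDEr": 0.182}


def spider(cider: float, spice: float) -> float:
    """Return SPIDEr, the arithmetic mean of CIDEr and SPICE."""
    return (cider + spice) / 2.0


computed = spider(reported["CIDEr"], reported["SPICE"])

# The table is rounded to three decimals, so allow a matching tolerance.
consistent = abs(computed - reported["SPIDEr"]) < 1e-3

print(f"computed SPIDEr = {computed:.4f}, consistent with table: {consistent}")
```

The same check can be applied to any other set of CIDEr and SPICE scores, for example when comparing against other captioning methods evaluated on Clotho.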