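The transformation network mainly serves to build a mapping from parameters to audio, so that the loss computed at the end is differentiable with respect to those parameters.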
The encoder is VGGish, Google's VGG variant built specifically for audio encoding; here it maps the audio to a 128-dimensional hidden embedding.
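To make the differentiability point concrete, here is a minimal sketch in plain Python. It is not the network from these notes: `render` is a hypothetical one-parameter stand-in (a simple gain) for the transformation network, the loss is a plain waveform MSE, and the VGGish encoder is left out entirely. The only point is that once the parameter-to-audio mapping is differentiable, the gradient of the loss can flow back to the parameter and gradient descent can recover it.

```python
import math


def render(gain, dry):
    """Hypothetical stand-in for the transformation network: parameter -> audio."""
    return [gain * x for x in dry]


def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def loss_and_grad(gain, dry, target):
    """Return the loss and its analytic derivative with respect to `gain`."""
    out = render(gain, dry)
    loss = mse(out, target)
    # d/dgain mean((gain * x - t)^2) = mean(2 * (gain * x - t) * x)
    grad = sum(2.0 * (o - t) * x for o, t, x in zip(out, target, dry)) / len(dry)
    return loss, grad


def fit_gain(dry, target, gain=0.0, lr=0.5, steps=200):
    """Plain gradient descent on the single parameter."""
    for _ in range(steps):
        _, grad = loss_and_grad(gain, dry, target)
        gain -= lr * grad
    return gain


# 0.1 s of a 440 Hz sine at 16 kHz as the "dry" input (assumed values).
dry = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
target = render(0.7, dry)          # audio produced by the "true" parameter
estimated = fit_gain(dry, target)  # parameter recovered through the gradient
```

With a non-differentiable renderer (for example a black-box audio plugin) the `grad` line would have no analytic form, which is the gap a learned, differentiable mapping is meant to close.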
MTG-Jamendo: over 55,000 songs with 195 tags
Previous work has used this dataset to train SimCLR as a music encoder.
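For context, SimCLR trains an encoder with the contrastive NT-Xent loss: two augmented views of the same clip are pulled together and all other clips in the batch are pushed apart. The sketch below is a plain-Python version of that loss only. The notes do not say which temperature, augmentations, or encoder the prior work used, so `temperature=0.5` and the toy 2-D embeddings are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(view_a, view_b, temperature=0.5):
    """NT-Xent loss over n positive pairs (view_a[i], view_b[i])."""
    n = len(view_a)
    z = view_a + view_b  # 2n embeddings; the positive of i is (i + n) % (2n)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)
        # Similarities to every other embedding (the anchor itself is excluded).
        logits = [cosine(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        pos_logit = cosine(z[i], z[pos]) / temperature
        # Numerically stable log-sum-exp for the denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - pos_logit
    return total / (2 * n)


# Toy embeddings: matching views point in nearly the same direction.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[1.0, 0.1], [0.1, 1.0]]
aligned = nt_xent(view_a, view_b)
mismatched = nt_xent(view_a, list(reversed(view_b)))
```

Here `aligned` comes out lower than `mismatched`, since the loss rewards pairs whose two views are embedded close together.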
<aside> 💡 We present the MTG-Jamendo Dataset, a new open dataset for music auto-tagging. It is built using music available at Jamendo under Creative Commons licenses and tags provided by content uploaders. The dataset contains over 55,000 full audio tracks with 195 tags from genre, instrument, and mood/theme categories. We provide elaborated data splits for researchers and report the performance of a simple baseline approach on five different sets of tags: genre, instrument, mood/theme, top-50, and overall.