APSIPA ASC 2025 Grand Challenge

City and Time-Aware Semi-supervised

Acoustic Scene Classification

1Xi'an University of Posts & Telecommunications, China
3Institute of Acoustics, Chinese Academy of Sciences, China
4University of Surrey, UK
5Northwestern Polytechnical University, China
6Singapore Institute of Technology, Singapore
7Nanyang Technological University, Singapore
*corresponding coordinator: baijs@xupt.edu.cn

Fig. 1 Overview of the APSIPA ASC 2025 GC: City and Time-Aware Semi-supervised Acoustic Scene Classification

Abstract

The APSIPA ASC 2025 GC (City and Time-Aware Semi-supervised Acoustic Scene Classification) extends the work of the ICME 2024 GC (Semi-supervised Acoustic Scene Classification under Domain Shift), which addressed the challenge of generalizing across different cities. This year's challenge explicitly incorporates city-level location and timestamp metadata for each audio sample, encouraging participants to design models that leverage both geographic and temporal context. It maintains the semi-supervised learning setting, reflecting real-world scenarios where large amounts of unlabeled data coexist with limited labeled examples. Participants are invited to develop innovative methods that combine audio content with contextual information to enhance classification performance and robustness.

News

2025-June-14 Baseline system is available.

Timeline

Date (AoE time)     Event
June 15, 2025       Challenge launch; baseline system & development metadata release
July 22, 2025       Evaluation metadata release
August 1, 2025      Final results & code submission
August 8, 2025      Results announcement
August 15, 2025     GC paper submission deadline (special session)
August 22, 2025     GC paper acceptance notification

Dataset description

For the APSIPA ASC 2025 Grand Challenge "City and Time-Aware Semi-supervised Acoustic Scene Classification", we provide a development dataset comprising approximately 24 hours of audio recordings from the Chinese Acoustic Scene (CAS) 2023 dataset. This challenge introduces previously unused contextual metadata that accompanies each recording:

City information: Identification of the recording location among 22 diverse Chinese cities (e.g., Xi'an, Beijing, Shanghai)

Timestamp information: Precise recording time accurate to year, month, day, hour, minute, and second (one possible encoding of this metadata is sketched below)
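
To make the metadata concrete, here is a minimal sketch of one plausible encoding: a one-hot city vector plus cyclical (sine/cosine) encodings of month and hour, so that adjacent hours and months stay close in feature space. The column names (`filename`, `city`, `timestamp`) and the example values are assumptions for illustration, not the official metadata schema.

```python
# Hypothetical sketch: encode city and timestamp metadata as model features.
# Column names ("filename", "city", "timestamp") are assumptions, not the
# official metadata schema released by the organizers.
import numpy as np
import pandas as pd

CITIES = ["Xi'an", "Beijing", "Shanghai"]  # ... 22 cities in total
CITY_TO_ID = {c: i for i, c in enumerate(CITIES)}

def encode_context(row: pd.Series) -> np.ndarray:
    """Map one metadata row to a context feature vector."""
    t = pd.to_datetime(row["timestamp"])
    # Cyclical encoding so 23:00 and 00:00 end up close together.
    hour = 2 * np.pi * t.hour / 24.0
    month = 2 * np.pi * (t.month - 1) / 12.0
    city_onehot = np.zeros(len(CITIES), dtype=np.float32)
    city_onehot[CITY_TO_ID[row["city"]]] = 1.0
    time_feats = np.array([np.sin(hour), np.cos(hour),
                           np.sin(month), np.cos(month)], dtype=np.float32)
    return np.concatenate([city_onehot, time_feats])

meta = pd.DataFrame({
    "filename": ["clip_0001.wav"],
    "city": ["Xi'an"],
    "timestamp": ["2023-05-12 08:30:15"],
})
print(encode_context(meta.iloc[0]))  # shape: (len(CITIES) + 4,)
```

In a neural model, the one-hot vector would typically be replaced by a learnable city embedding concatenated with the audio embedding before the classifier head.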

The CAS 2023 dataset is a large-scale corpus for research on environmental acoustic scenes. It covers 10 common acoustic scenes with a total duration of over 130 hours; each audio clip is 10 seconds long and carries metadata on recording location and timestamp. Data collection spanned April 2023 to September 2023 across 22 cities in China.

Acoustic scenes (10): Bus, Airport, Metro, Restaurant, Shopping mall, Public square, Urban park, Traffic street, Construction site, Bar

More details can be found at https://arxiv.org/abs/2402.02694.

Dataset link

The audio recordings of the development dataset can be found at https://zenodo.org/records/10616533.

The audio recordings of the evaluation dataset can be found at https://zenodo.org/records/10820626.

Metadata for the development and evaluation datasets will be released at https://github.com/JishengBai/APSIPA2025GC-ASC/tree/main/metadata.

Baseline and Evaluation

The baseline system for the APSIPA ASC 2025 GC "City and Time-Aware Semi-supervised Acoustic Scene Classification" is based on a multimodal semi-supervised framework with a pre-trained SE-Trans model. The baseline code is released on GitHub. Systems will be ranked by macro-average accuracy, i.e., the average of the class-wise accuracies; a minimal computation sketch follows below. If two teams obtain the same score on the evaluation dataset, the team with the smaller model size will be ranked higher.
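
As a reference for the ranking metric, the sketch below computes macro-average accuracy, the unweighted mean of class-wise accuracies, so every scene contributes equally regardless of how many clips it has. The labels and predictions are toy values for illustration only.

```python
# Minimal sketch of macro-average accuracy (mean of class-wise accuracies).
# Labels and predictions here are toy values, not challenge data.
import numpy as np

def macro_average_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    classes = np.unique(y_true)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(per_class))

y_true = np.array([0, 0, 0, 1, 1, 2])   # e.g., 0=Bus, 1=Airport, 2=Metro
y_pred = np.array([0, 0, 1, 1, 1, 2])
print(macro_average_accuracy(y_true, y_pred))  # (2/3 + 1 + 1) / 3 ≈ 0.889
```

This quantity is equivalent to scikit-learn's `balanced_accuracy_score`.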

Rules

  • Pre-training dataset restriction: Only the TAU Urban Acoustic Scenes 2020 Mobile development dataset and the CochlScene dataset are allowed for model pre-training. This standardizes external data sources for fair comparison between approaches and prevents the use of proprietary or unreleased datasets.
  • No model ensembles: Model ensembles are NOT allowed in this competition. The challenge focuses on developing individual models that effectively incorporate city and time information, rather than boosting performance through ensembling.
  • No large audio or audio-language models: Large pre-trained models such as Qwen-Audio, Whisper, and LTU are NOT allowed. This ensures that improvements come from the effective use of city and time information rather than from massive pre-trained models.