Sound

Authors and titles for recent submissions

[ total of 48 entries: 1-48 ]

Fri, 5 Dec 2025

[1]  arXiv:2512.04847 [pdf, ps, other]
Title: Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
[2]  arXiv:2512.04827 [pdf, ps, other]
Title: Contract-Driven QoE Auditing for Speech and Singing Services: From MOS Regression to Service Graphs
Authors: Wenzhang Du
Comments: 11 pages, 3 figures
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
[3]  arXiv:2512.04814 [pdf, ps, other]
Title: Shared Multi-modal Embedding Space for Face-Voice Association
Comments: Ranked 1st in Fame 2026 Challenge, ICASSP
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)
[4]  arXiv:2512.04793 [pdf, ps, other]
Title: YingMusic-SVC: Real-World Robust Zero-Shot Singing Voice Conversion with Flow-GRPO and Singing-Specific Inductive Biases
Comments: 17 pages, 5 figures
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
[5]  arXiv:2512.04779 [pdf, ps, other]
Title: YingMusic-Singer: Zero-shot Singing Voice Synthesis and Editing with Annotation-free Melody Guidance
Comments: 13 pages, 3 figures
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
[6]  arXiv:2512.04720 [pdf, ps, other]
Title: M3-TTS: Multi-modal DiT Alignment & Mel-latent for Zero-shot High-fidelity Speech Synthesis
Comments: Submitted to ICASSP 2026
Subjects: Sound (cs.SD)
[7]  arXiv:2512.04711 [pdf, ps, other]
Title: Large Speech Model Enabled Semantic Communication
Comments: 15 pages, 9 figures
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
[8]  arXiv:2512.04616 [pdf, ps, other]
Title: Standard audiogram classification from loudness scaling data using unsupervised, supervised, and explainable machine learning techniques
Subjects: Sound (cs.SD); Medical Physics (physics.med-ph)
[9]  arXiv:2512.04552 [pdf, ps, other]
Title: RRPO: Robust Reward Policy Optimization for LLM-based Emotional TTS
Comments: Submitted to ICASSP 2026. Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
[10]  arXiv:2512.04551 [pdf, ps, other]
Title: Multi-Loss Learning for Speech Emotion Recognition with Energy-Adaptive Mixup and Frame-Level Attention
Comments: Submitted to ICASSP 2026. Copyright 2026 IEEE.
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Thu, 4 Dec 2025

[11]  arXiv:2512.03637 [pdf, ps, other]
Title: AaPE: Aliasing-aware Patch Embedding for Self-Supervised Audio Representation Learning
Comments: 11 pages, 4 figures
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Machine Learning (stat.ML)
[12]  arXiv:2512.03563 [pdf, ps, other]
Title: State Space Models for Bioacoustics: A comparative Evaluation with Transformers
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
[13]  arXiv:2512.03783 (cross-list from cs.AI) [pdf, ps, other]
Title: Omni-AutoThink: Adaptive Multimodal Reasoning via Reinforcement Learning
Subjects: Artificial Intelligence (cs.AI); Sound (cs.SD)
[14]  arXiv:2512.03636 (cross-list from cs.HC) [pdf, ps, other]
Title: Head, posture, and full-body gestures in dyadic conversations
Comments: 7 figures, 10 tables, 29 pages
Subjects: Human-Computer Interaction (cs.HC); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[15]  arXiv:2512.03458 (cross-list from eess.SP) [pdf, ps, other]
Title: A Convolutional Framework for Mapping Imagined Auditory MEG into Listened Brain Responses
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Wed, 3 Dec 2025

[16]  arXiv:2512.02783 [pdf, ps, other]
Title: Exploring Definitions of Quality and Diversity in Sonic Measurement Spaces
Subjects: Sound (cs.SD); Neural and Evolutionary Computing (cs.NE)
[17]  arXiv:2512.02669 [pdf, ps, other]
Title: SAND Challenge: Four Approaches for Dysartria Severity Classification
Comments: 7 pages, 5 figures
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[18]  arXiv:2512.02652 [pdf, ps, other]
Title: Pianist Transformer: Towards Expressive Piano Performance Rendering via Scalable Self-Supervised Pre-Training
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
[19]  arXiv:2512.02523 [pdf, ps, other]
Title: Generative Multi-modal Feedback for Singing Voice Synthesis Evaluation
Comments: 16 pages, 5 figures
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
[20]  arXiv:2512.02515 [pdf, ps, other]
Title: VibOmni: Towards Scalable Bone-conduction Speech Enhancement on Earables
Comments: Submitted to TMC
Subjects: Sound (cs.SD)
[21]  arXiv:2512.02432 [pdf, ps, other]
Title: Continual Learning for Singing Voice Separation with Human in the Loop Adaptation
Comments: Proceedings of the 26th International Symposium on Frontiers of Research in Speech and Music, 2021
Subjects: Sound (cs.SD)
[22]  arXiv:2512.02192 [pdf, ps, other]
Title: Story2MIDI: Emotionally Aligned Music Generation from Text
Comments: 8 pages (6 pages of main text + 2 pages of references and appendices), 4 figures, 1 table. Presented at IEEE Big Data 2025 3rd Workshop on AI Music Generation (AIMG 2025)
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[23]  arXiv:2512.02759 (cross-list from eess.AS) [pdf, ps, other]
Title: Towards Language-Independent Face-Voice Association with Multimodal Foundation Models
Comments: This paper presents the system description of the UZH-CL team for the FAME2026 Challenge at ICASSP 2026. Our model achieved second place in the final ranking
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Image and Video Processing (eess.IV)
[24]  arXiv:2512.02650 (cross-list from cs.CV) [pdf, ps, other]
Title: Hear What Matters! Text-conditioned Selective Video-to-Audio Generation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[25]  arXiv:2512.02593 (cross-list from cs.CL) [pdf, ps, other]
Title: Spoken Conversational Agents with Large Language Models
Comments: Accepted to EMNLP 2025 Tutorial
Subjects: Computation and Language (cs.CL); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[26]  arXiv:2512.02206 (cross-list from cs.LG) [pdf, ps, other]
Title: WhAM: Towards A Translative Model of Sperm Whale Vocalization
Comments: NeurIPS 2025
Subjects: Machine Learning (cs.LG); Sound (cs.SD)
[27]  arXiv:2512.02074 (cross-list from cs.CL) [pdf, ps, other]
Title: Dialect Identification Using Resource-Efficient Fine-Tuning Approaches
Comments: Published in APSIPA ASC 2025
Subjects: Computation and Language (cs.CL); Sound (cs.SD)

Tue, 2 Dec 2025

[28]  arXiv:2512.01626 [pdf, ps, other]
Title: Parallel Delayed Memory Units for Enhanced Temporal Modeling in Biomedical and Bioacoustic Signal Analysis
Comments: Accepted for publication in IEEE Transactions on Audio, Speech and Language Processing, 2025
Journal-ref: IEEE Transactions on Audio, Speech and Language Processing, 2025
Subjects: Sound (cs.SD); Neural and Evolutionary Computing (cs.NE)
[29]  arXiv:2512.01559 [pdf, ps, other]
Title: LLM2Fx-Tools: Tool Calling For Music Post-Production
Subjects: Sound (cs.SD)
[30]  arXiv:2512.01537 [pdf, ps, other]
Title: Q2D2: A Geometry-Aware Audio Codec Leveraging Two-Dimensional Quantization
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
[31]  arXiv:2512.00621 [pdf, ps, other]
Title: Melody or Machine: Detecting Synthetic Music with Dual-Stream Contrastive Learning
Comments: Accepted at Transactions on Machine Learning Research (TMLR)
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
[32]  arXiv:2512.00563 [pdf, ps, other]
Title: Explainable Multi-Modal Deep Learning for Automatic Detection of Lung Diseases from Respiratory Audio Signals
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
[33]  arXiv:2512.00451 [pdf, ps, other]
Title: STCTS: Generative Semantic Compression for Ultra-Low Bitrate Speech via Explicit Text-Prosody-Timbre Decomposition
Comments: The complete source code and online speech reconstruction demo is publicly available at this https URL
Subjects: Sound (cs.SD); Multimedia (cs.MM)
[34]  arXiv:2512.00120 [pdf, ps, other]
Title: Art2Music: Generating Music for Art Images with Multi-modal Feeling Alignment
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
[35]  arXiv:2512.00115 [pdf, ps, other]
Title: MoLT: Mixture of Layer-Wise Tokens for Efficient Audio-Visual Learning
Comments: 10 pages, 5 figures
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[36]  arXiv:2512.01443 (cross-list from cs.CL) [pdf, ps, other]
Title: MEGConformer: Conformer-Based MEG Decoder for Robust Speech and Phoneme Classification
Comments: 10 pages, 5 figures, 4 tables, LibriBrain Workshop, NeurIPS 2025
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Sound (cs.SD)
[37]  arXiv:2512.01428 (cross-list from eess.SP) [pdf, ps, other]
Title: Masked Symbol Modeling for Demodulation of Oversampled Baseband Communication Signals in Impulsive Noise-Dominated Channels
Authors: Oguz Bedir (1), Nurullah Sevim (1), Mostafa Ibrahim (2), Sabit Ekin (2 and 1) ((1) Electrical & Computer Engineering, Texas A&M University, USA, (2) Engineering Technology & Industrial Distribution, Texas A&M University, USA)
Comments: Accepted to the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop on AI and ML for Next-Generation Wireless Communications and Networking (AI4NextG), non-archival
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Sound (cs.SD)
[38]  arXiv:2512.01267 (cross-list from cs.MM) [pdf, ps, other]
Title: ZO-ASR: Zeroth-Order Fine-Tuning of Speech Foundation Models without Back-Propagation
Comments: 2025 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
Subjects: Multimedia (cs.MM); Sound (cs.SD)
[39]  arXiv:2512.00883 (cross-list from cs.MM) [pdf, ps, other]
Title: Audio-Visual World Models: Towards Multisensory Imagination in Sight and Sound
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)

Mon, 1 Dec 2025

[40]  arXiv:2511.23178 [pdf, ps, other]
Title: HPSU: A Benchmark for Human-Level Perception in Real-World Spoken Speech Understanding
Comments: Accepted by AAAI 2026
Subjects: Sound (cs.SD)
[41]  arXiv:2511.22696 [pdf, ps, other]
Title: Probabilistic Fusion and Calibration of Neural Speaker Diarization Models
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
[42]  arXiv:2511.22687 [pdf, ps, other]
Title: PURE Codec: Progressive Unfolding of Residual Entropy for Speech Codec Learning
Comments: Accepted by ASRU2025
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
[43]  arXiv:2511.22293 [pdf, ps, other]
Title: GLA-Grad++: An Improved Griffin-Lim Guided Diffusion Model for Speech Synthesis
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
[44]  arXiv:2511.21872 [pdf, ps, other]
Title: Advancing Marine Bioacoustics with Deep Generative Models: A Hybrid Augmentation Strategy for Southern Resident Killer Whale Detection
Comments: 16 pages, 6 Figures, 2 Tables, submitted to Marine Mammal Science as part of a special issue on Machine Learning and Artificial Intelligence in Marine Mammal Research
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
[45]  arXiv:2511.23142 (cross-list from cs.LG) [pdf, ps, other]
Title: Adapting Neural Audio Codecs to EEG
Comments: Foundation Models for the Brain and Body (BrainBodyFM@NeurIPS)
Subjects: Machine Learning (cs.LG); Sound (cs.SD)
[46]  arXiv:2511.22503 (cross-list from cs.CL) [pdf, ps, other]
Title: Joint Speech and Text Training for LLM-Based End-to-End Spoken Dialogue State Tracking
Comments: Submitted to ICASSP 2026
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[47]  arXiv:2511.21780 (cross-list from cs.MM) [pdf, ps, other]
Title: 3MDiT: Unified Tri-Modal Diffusion Transformer for Text-Driven Synchronized Audio-Video Generation
Subjects: Multimedia (cs.MM); Sound (cs.SD)
[48]  arXiv:2511.21704 (cross-list from cs.CL) [pdf, ps, other]
Title: On the Cross-lingual Transferability of Pre-trained wav2vec2-based Models
Subjects: Computation and Language (cs.CL); Sound (cs.SD)