
Multimedia

Authors and titles for recent submissions

[ total of 34 entries: 1-25 | 26-34 ]

Fri, 6 Feb 2026

[1]  arXiv:2602.05496 [pdf, ps, other]
Title: XEmoGPT: An Explainable Multimodal Emotion Recognition Framework with Cue-Level Perception and Reasoning
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
[2]  arXiv:2602.05078 (cross-list from cs.CV) [pdf, ps, other]
Title: Food Portion Estimation: From Pixels to Calories
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Image and Video Processing (eess.IV)
[3]  arXiv:2602.04904 (cross-list from cs.LG) [pdf, ps, other]
Title: DCER: Dual-Stage Compression and Energy-Based Reconstruction
Comments: 13 pages, 2 figures, 8 tables. Submitted to ICML 2026. Code will be available on GitHub
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Image and Video Processing (eess.IV)

Thu, 5 Feb 2026

[4]  arXiv:2602.04680 (cross-list from cs.SD) [pdf, ps, other]
Title: Audio ControlNet for Fine-Grained Audio Generation and Editing
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
[5]  arXiv:2602.04413 (cross-list from cs.CL) [pdf, ps, other]
Title: History-Guided Iterative Visual Reasoning with Self-Correction
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
[6]  arXiv:2602.04405 (cross-list from cs.CV) [pdf, ps, other]
Title: Interactive Spatial-Frequency Fusion Mamba for Multi-Modal Image Fusion
Comments: This work is accepted by IEEE Transactions on Image Processing. More modifications may be performed
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[7]  arXiv:2602.04145 (cross-list from cs.LG) [pdf, ps, other]
Title: Training Data Efficiency in Multimodal Process Reward Models
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Multimedia (cs.MM)
[8]  arXiv:2602.04032 (cross-list from eess.IV) [pdf, ps, other]
Title: MS-SCANet: A Multiscale Transformer-Based Architecture with Dual Attention for No-Reference Image Quality Assessment
Comments: Published in ICASSP 2025, 5 pages, 3 figures
Journal-ref: Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2025
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[9]  arXiv:2602.03892 (cross-list from cs.CV) [pdf, ps, other]
Title: Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[10]  arXiv:2602.03891 (cross-list from eess.AS) [pdf, ps, other]
Title: Sounding Highlights: Dual-Pathway Audio Encoders for Audio-Visual Video Highlight Detection
Authors: Seohyun Joo, Yoori Oh
Comments: 5 pages, 2 figures, to appear in ICASSP 2026
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)

Wed, 4 Feb 2026

[11]  arXiv:2602.02630 [pdf, ps, other]
Title: Trailer Reimagined: An Innovative, Llm-DRiven, Expressive Automated Movie Summary framework (TRAILDREAMS)
Journal-ref: OJCMT, 15(3), e202524 (2025)
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
[12]  arXiv:2602.03558 (cross-list from cs.CV) [pdf, ps, other]
Title: ELIQ: A Label-Free Framework for Quality Assessment of Evolving AI-Generated Images
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
[13]  arXiv:2602.03529 (cross-list from cs.NI) [pdf, ps, other]
Title: Morphe: High-Fidelity Generative Video Streaming with Vision Foundation Model
Comments: Accepted by NSDI 2026 Fall
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
[14]  arXiv:2602.03523 (cross-list from cs.SD) [pdf, ps, other]
Title: D3PIA: A Discrete Denoising Diffusion Model for Piano Accompaniment Generation From Lead sheet
Comments: Accepted at 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
[15]  arXiv:2602.03268 (cross-list from cs.LG) [pdf, ps, other]
Title: Unveiling Covert Toxicity in Multimodal Data via Toxicity Association Graphs: A Graph-Based Metric and Interpretable Detection Framework
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
[16]  arXiv:2602.02033 (cross-list from cs.CV) [pdf, ps, other]
Title: One Size, Many Fits: Aligning Diverse Group-Wise Click Preferences in Large-Scale Advertising Image Generation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

Tue, 3 Feb 2026 (showing first 9 of 15 entries)

[17]  arXiv:2602.01833 [pdf, ps, other]
Title: Mixture of Disentangled Experts with Missing Modalities for Robust Multimodal Sentiment Analysis
Subjects: Multimedia (cs.MM)
[18]  arXiv:2602.01284 [pdf, ps, other]
Title: Seeing, Hearing, and Knowing Together: Multimodal Strategies in Deepfake Videos Detection
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
[19]  arXiv:2602.00701 [pdf, ps, other]
Title: Cross-Modal Binary Attention: An Energy-Efficient Fusion Framework for Audio-Visual Learning
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)
[20]  arXiv:2602.00607 [pdf, ps, other]
Title: MTAVG-Bench: A Comprehensive Benchmark for Evaluating Multi-Talker Dialogue-Centric Audio-Video Generation
Subjects: Multimedia (cs.MM); Sound (cs.SD)
[21]  arXiv:2602.00209 [pdf, ps, other]
Title: Divide and Conquer: Multimodal Video Deepfake Detection via Cross-Modal Fusion and Localization
Comments: The 3rd Place, IJCAI 2025 Workshop on Deepfake Detection, Localization, and Interpretability
Subjects: Multimedia (cs.MM)
[22]  arXiv:2602.01681 (cross-list from eess.IV) [pdf, ps, other]
Title: Hyperspectral Image Fusion with Spectral-Band and Fusion-Scale Agnosticism
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[23]  arXiv:2602.01325 (cross-list from eess.IV) [pdf, ps, other]
Title: Unified ROI-based Image Compression Paradigm with Generalized Gaussian Model
Comments: 14 pages, 18 figures
Subjects: Image and Video Processing (eess.IV); Multimedia (cs.MM)
[24]  arXiv:2602.01059 (cross-list from cs.CV) [pdf, ps, other]
Title: DRFormer: A Dual-Regularized Bidirectional Transformer for Person Re-identification
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[25]  arXiv:2602.00484 (cross-list from cs.CV) [pdf, ps, other]
Title: GTATrack: Winner Solution to SoccerTrack 2025 with Deep-EIoU and Global Tracklet Association
Comments: Winner Solution of SoccerTrack in ACM Multimedia 2025 Workshop MMSports
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
