Required Reading for Voice Agent Developers: A Roundup of 2024's Cutting-Edge Speech Models

Author: 声网 (Agora)
  • 2024-12-04

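Today's recommendation is a GitHub repository created by our community member BoJack. If you follow Voice Agent development and want to know which speech models are at the cutting edge, the list in this repository is well worth a look.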


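BoJack is a PhD student at Shanghai Jiao Tong University. His research covers speech and multimodal learning, spoken interaction systems, and self-supervised pre-training. He is also one of the authors of two recently released models: the full-duplex speech model LSLM and the TTS (speech synthesis) model F5-TTS.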


Repository: https://github.com/ddlBoJack/Awesome-Speech-Language-Model



Awesome-Speech-Language-Model

Papers, code, and resources for speech language models and end-to-end speech dialogue systems.

Universal Speech, Audio and Music Understanding


Model


  • LTU: Listen, Think, and Understand - ICLR 2024


https://arxiv.org/abs/2305.10790


  • SALMONN: Towards Generic Hearing Abilities for Large Language Models - ICLR 2024


https://arxiv.org/abs/2310.13289


  • LTU-AS: Joint Audio and Speech Understanding - ASRU 2023


https://arxiv.org/abs/2309.14405


  • Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models - arXiv 2023


https://arxiv.org/abs/2311.07919


  • Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities - ICML 2024


https://arxiv.org/abs/2402.01831


  • Qwen2-Audio Technical Report - arXiv 2024


https://arxiv.org/abs/2407.10759


  • WavLLM: Towards Robust and Adaptive Speech Large Language Model - EMNLP 2024


https://arxiv.org/abs/2404.00656


  • DiVA: Distilling an End-to-End Voice Assistant Without Instruction Training Data - arXiv 2024


https://arxiv.org/abs/2410.02678


Benchmark


  • Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech - ICASSP 2024


https://arxiv.org/abs/2309.09510


  • AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension - ACL 2024


https://arxiv.org/abs/2402.07729


  • SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words - arXiv 2024


https://arxiv.org/abs/2406.13340


  • AudioBench: A Universal Benchmark for Audio Large Language Models - arXiv 2024


https://arxiv.org/abs/2406.16020


  • SALMon: A Suite for Acoustic Language Model Evaluation - arXiv 2024


https://arxiv.org/abs/2409.07437


  • MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark - arXiv 2024


https://www.arxiv.org/abs/2410.19168


  • Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks - ICLR 2025 open review


https://openreview.net/forum?id=s7lzZpAW7T

End-to-End Speech Dialogue System


Model


  • SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities - EMNLP 2023


https://arxiv.org/abs/2305.11000


  • GPT-4o Voice Mode - API 2024


https://openai.com/index/hello-gpt-4o/


  • PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems - EMNLP 2024

  • VITA: Towards Open-Source Interactive Omni Multimodal LLM - arXiv 2024


https://www.arxiv.org/abs/2408.05211


  • Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming - arXiv 2024


https://arxiv.org/abs/2408.16725


  • LLaMA-Omni: Seamless Speech Interaction with Large Language Models - arXiv 2024


https://arxiv.org/abs/2409.06666


  • Moshi: a speech-text foundation model for real-time dialogue - arXiv 2024


https://arxiv.org/abs/2410.00037


  • Westlake-Omni - GitHub 2024


https://github.com/xinchen-ai/Westlake-Omni


  • EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions - arXiv 2024


https://arxiv.org/abs/2409.18042


  • IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities - arXiv 2024


https://arxiv.org/abs/2410.08035


  • MooER-omni - GitHub 2024


https://github.com/MooreThreads/MooER


  • GLM-4-Voice - GitHub 2024


https://github.com/THUDM/GLM-4-Voice


  • Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM - arXiv 2024


https://arxiv.org/abs/2411.00774


  • Hertz-dev - GitHub 2024


https://github.com/Standard-Intelligence/hertz-dev


  • Fish Agent - GitHub 2024


https://github.com/fishaudio/fish-speech


  • Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities - arXiv 2024


https://arxiv.org/abs/2410.11190


Benchmark


  • VoiceBench: Benchmarking LLM-Based Voice Assistants - arXiv 2024


https://arxiv.org/abs/2410.17196

Full Duplex Modeling


  • A Full-duplex Speech Dialogue Scheme Based On Large Language Models - NeurIPS 2024


https://arxiv.org/abs/2405.19487


  • MiniCPM-duplex: Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models - EMNLP 2024


https://arxiv.org/abs/2406.15718


  • LSLM: Language Model Can Listen While Speaking - arXiv 2024


https://arxiv.org/abs/2408.02622


  • SyncLLM: Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents - arXiv 2024


https://arxiv.org/abs/2409.15594


  • Enabling Real-Time Conversations with Minimal Training Costs - arXiv 2024


https://arxiv.org/abs/2409.11727


Survey


  • Towards audio language modeling -- an overview - arXiv 2024


https://arxiv.org/abs/2402.13236


  • Recent Advances in Speech Language Models: A Survey - arXiv 2024


https://arxiv.org/abs/2410.03751


  • Speech Trident - GitHub


https://github.com/ga642381/speech-trident


  • A Survey on Speech Large Language Models - arXiv 2024


https://arxiv.org/abs/2410.18908


Editors: Lin Ruili (林瑞丽), Fu Fengyuan (傅丰元)



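More Voice Agent study notes:


From developer tools to AI call center: this Voice Agent company already serves 100+ customers


The creator of WebRTC just joined OpenAI: how does he think about the future of voice AI?


A roadmap to human-level voice AI | Voice Agent study notes


The voice AI revolution: in the future, consumers may prefer talking to AI over human customer service


Voice AI is booming, yet underestimated opportunities remain | RTE2024 Audio Technology and Voice AI track


Next-generation AI companionship | Equal relationships, long-term memory, and shared context | Podcast 《编码人声》


Voice-first: reflections from heads-down work on a voice product | Community submission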


