UniVerse-1: Unified Audio-Video Generation via Stitching of Experts



Architecture of UniVerse-1. (a) Overall architecture. The architectural foundation of UniVerse-1 is realized through a stitching of expertise methodology. This approach deeply integrates the pre-trained Wan2.1 video model and the Ace-step audio model. (b) Fused block. The fusion is implemented at a granular, block-by-block level, where each block in the Wan architecture is deeply fused with its corresponding block in the Ace-step architecture.



Abstract


We introduce UniVerse-1, a unified, Veo-3-like model capable of simultaneously generating coordinated audio and video. To improve training efficiency, we avoid training from scratch and instead employ a stitching of expertise technique. This approach deeply fuses the corresponding blocks of pre-trained video and music generation expert models, thereby fully leveraging their foundational capabilities. To ensure accurate annotations and temporal alignment of both ambient sounds and speech with the video content, we developed an online annotation pipeline that processes the required training data and generates labels during the training process. This strategy circumvents the performance degradation often caused by misaligned text-based annotations. Through the synergy of these techniques, our model, after being fine-tuned on approximately 7,600 hours of audio-video data, produces well-coordinated audio and visuals for ambient sound generation and strong alignment for speech generation. To systematically evaluate our proposed method, we introduce Verse-Bench, a new benchmark dataset. To advance research in audio-video generation and to close the performance gap with state-of-the-art models such as Veo-3, we make our model and code publicly available. We hope this contribution will benefit the broader research community.



Stitching of Expertise



We introduce a novel framework, termed 'Stitching of Expertise', for integrating specialized, pre-existing models for video and audio synthesis. This approach is designed to preserve the generative capabilities of each unimodal expert, while simultaneously enabling fine-grained, bidirectional interaction between them at the level of individual layer blocks.
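To make the idea of block-level, bidirectional interaction concrete, here is a minimal toy sketch in plain Python. It is an illustration under strong assumptions, not the UniVerse-1 fused block: the page does not specify the fusion operator, so cross-attention without learned projections is used here purely as a stand-in, both streams are assumed to share one feature dimension, and the experts' own self-attention and feed-forward layers are omitted. The names `cross_attend` and `fused_block` are hypothetical.

```python
import math
from typing import List, Tuple

Vec = List[float]


def softmax(xs: Vec) -> Vec:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def cross_attend(queries: List[Vec], context: List[Vec]) -> List[Vec]:
    """Each query token reads a softmax-weighted average of the context tokens.

    Assumption: no learned projections; scaled dot-product scores only.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        weights = softmax(scores)
        out.append([sum(w * c[j] for w, c in zip(weights, context)) for j in range(dim)])
    return out


def fused_block(video: List[Vec], audio: List[Vec]) -> Tuple[List[Vec], List[Vec]]:
    """Hypothetical fused block: each stream adds a residual read from the other."""
    video_reads = cross_attend(video, audio)  # video tokens attend to audio tokens
    audio_reads = cross_attend(audio, video)  # audio tokens attend to video tokens
    new_video = [[x + r for x, r in zip(v, rv)] for v, rv in zip(video, video_reads)]
    new_audio = [[x + r for x, r in zip(a, ra)] for a, ra in zip(audio, audio_reads)]
    return new_video, new_audio


video_tokens = [[1.0, 0.0], [0.0, 1.0]]
audio_tokens = [[2.0, 2.0]]
new_video, new_audio = fused_block(video_tokens, audio_tokens)
```

Because the updates are residual, each stream keeps its own representation and only adds information from the other modality, which is one simple way to preserve the behaviour of the unimodal experts while letting them exchange information in both directions.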




To address the architectural disparity arising from the different depths of the two models, we employ a layer interpolation technique. This method aligns the models by systematically inserting new transformer blocks into the shallower network at uniform intervals. Critically, we initialize the parameters of each new block by linearly interpolating the weights of its immediately preceding and succeeding layers, which is essential for a smooth integration.
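The layer interpolation step can be sketched as follows. This is a minimal illustration, assuming each block is represented as a dictionary of flattened weight lists with identical names and shapes, that at most one new block is inserted per gap, and that the interpolation coefficient is 0.5. The helper names `lerp_block` and `deepen` and the exact spacing rule are hypothetical and not taken from the released code.

```python
from typing import Dict, List

Block = Dict[str, List[float]]  # parameter name -> flattened weights


def lerp_block(prev: Block, nxt: Block, t: float = 0.5) -> Block:
    """Linearly interpolate two blocks that share parameter names and shapes."""
    return {
        name: [(1.0 - t) * a + t * b for a, b in zip(prev[name], nxt[name])]
        for name in prev
    }


def deepen(blocks: List[Block], target_depth: int) -> List[Block]:
    """Insert interpolated blocks at roughly uniform intervals.

    Assumption: at most one new block per gap between neighbouring blocks.
    """
    n = len(blocks)
    extra = target_depth - n
    if not 0 <= extra <= max(n - 1, 0):
        raise ValueError("this sketch can add between 0 and len(blocks) - 1 blocks")
    # Pick `extra` distinct gaps (gap i lies between block i and block i + 1),
    # spread evenly over the n - 1 available gaps.
    gaps = {((2 * k + 1) * (n - 1)) // (2 * extra) for k in range(extra)}
    out: List[Block] = []
    for i, block in enumerate(blocks):
        out.append(block)
        if i in gaps:
            # New block initialised from its preceding and succeeding neighbours.
            out.append(lerp_block(block, blocks[i + 1]))
    return out


shallow = [{"w": [0.0]}, {"w": [2.0]}, {"w": [4.0]}, {"w": [6.0]}]
deep = deepen(shallow, target_depth=6)
print([b["w"][0] for b in deep])
```

Initialising each inserted block between its neighbours means the deepened network starts with weights that vary smoothly from one block to the next, rather than containing randomly initialised layers.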





Generated results


Human Speech



Video Prompt: The video features a man with long black hair, wearing a traditional light gray robe, speaking angrily about having been searching for someone for ten years. The setting is a dimly lit room with blurred wooden elements, and the man's serious and determined expression is emphasized through his red eyes and slight frown. The video focuses on the man's speech and expression, with no additional ambient sounds provided.
Speech Content: I really have been looking for you for ten years. I have been looking for you for ten years.

Case1

Video Prompt: The video features a woman in a leopard print coat speaking into a microphone with a green label that reads '金典 SATINE'. She is standing in front of a large screen displaying text, likely in a studio or event setting. The woman is addressing the camera or an audience, discussing her mother's comments about her age and her thoughts on Dilraba Dilmurat's marriage status. The scene is static with no significant movement or changes in the background.
Speech Content: My mom always tells me that I am so old, if I don’t find a partner, how will I get married in the future? I told my mom how could Dilraba Dilmurat not get married.

Case2

Video Prompt: The video features a man speaking in front of a bookshelf filled with books and a panda poster. He is wearing glasses and a black t-shirt. The setting appears to be an indoor space, possibly a home office or study room. The man is likely delivering a speech or presentation, discussing an article by Professor Feng Yujun about Russia's defeat. There is no significant movement or action taking place in the video.
Speech Content: Recently, Professor Feng Yujun, an expert on Russian issues at Peking University, published an article in the famous British magazine The Economist saying that Russia's defeat is inevitable.

Case3

Video Prompt: The video features a man in traditional Chinese attire standing in a field of tall grass. He is clasping his hands and speaking, with a calm and reflective expression. The background is a serene field with a warm, golden sky, creating a peaceful and contemplative atmosphere. There is no ambient sound provided.
Speech Content: If a person has no dreams, what is the difference between him and a salted fish.

Case4

Video Prompt: The video features a close-up of a menacing-looking devil in a dense forest, speaking about the advancements in AI technology. The setting is eerie and atmospheric, with soft, diffused lighting and a sense of mystery.
Speech Content: Breaking news. AI is getting so powerful that you could turn photos to videos and to.

Case5

Video Prompt: The video features a man with long black hair, wearing a traditional light gray robe, speaking angrily about having been searching for someone for ten years. The setting is a dimly lit room with blurred wooden elements, and the man's serious and determined expression is emphasized through his red eyes and slight frown. The video focuses on the man's speech and expression, with no additional ambient sounds provided.
Ambient Prompt: A woman giving a speech.
Speech Content: I really have been looking for you for ten years.

Case6

Video Prompt: The video depicts a formal graduation ceremony where a woman is delivering a speech about the world macroeconomic environment. The setting is simple, with a plain wall and a few other individuals in graduation attire standing in the background. The woman is standing at a podium, speaking into a microphone, while the other individuals observe attentively.
Speech Content: Debates on the world macroeconomic environment that today so govern and impact.

Case7

Video Prompt: The video features a bald man in a beige blazer speaking on a stage. He is addressing an unseen audience, discussing the issue of people not listening when others speak. The setting is simple, with a blue background and a circular shadow, indicating a formal presentation or lecture environment.
Ambient Prompt: A man is giving a speech.
Speech Content: And yet many people have the experience that when they speak, people don't listen to them. Why is that?

Case8

Video Prompt: A man in a dark suit and white shirt is seated on a yellow couch, speaking about the language of Americans. He uses hand gestures to emphasize his points, and the background shows a cityscape with illuminated buildings, suggesting an urban setting, possibly a studio with a city view.
Ambient Prompt: A man speaking.
Speech Content: The thing about Americans that I've thought about the language is that they speak, they say they speak English.

Case9

Video Prompt: A man in a tuxedo is giving a speech at a formal event, holding an award and speaking into a microphone. The background is a blurred, colorful backdrop with bokeh effects, suggesting a celebratory atmosphere. The man is the main focus of the video, and there are no other characters or objects in the scene.
Ambient Prompt: A man giving a speech.
Speech Content: The Revenant was a product of the tireless efforts of an unbelievable cast.

Case10

Video Prompt: An anime-style girl is sitting at a table in a restaurant, eating a bowl of noodles with chopsticks. She is wearing a green cardigan over a white dress with black lace trim. The girl has long black hair with star-shaped hairpins. She is talking about how spicy the noodles are. The background shows a clock on the wall and a man sitting at a table. The ambient sound is the sound of eating noodles.
Speech Content: It's so spicy. It's so spicy.

Case11

Video Prompt: The video depicts a man in a professional setting, engaged in a serious phone conversation. He is dressed in a gray suit and white shirt, holding a telephone receiver and a pen, suggesting he is in the middle of a business discussion. The background is a simple office environment, with a blurred figure in the background, adding to the professional atmosphere. The man's facial expression and body language convey determination and focus, emphasizing the importance of the conversation he is having.
Speech Content: This time we must get this order, no matter what the cost.

Case12

Video Prompt: The video features a woman in a living room, speaking about the desire for larger chests and acknowledging the challenges associated with it. She is dressed casually in a light blue shirt and is gesturing with her hands as she speaks. The background is a typical living room setting with a couch, plants, and a ceiling fan. The lighting is natural, indicating it is daytime.
Speech Content: I know the many people have small chests that wish to have bigger chest. I know that could be so hard for some reasons.

Case13

Video Prompt: The video captures a woman in a dramatic and emotional performance, singing or speaking into a vintage microphone. Her attire and makeup, along with her expressive gestures, suggest a powerful and heartfelt delivery. The dark background and the focus on her and the microphone create a sense of intimacy and intensity, highlighting the emotional depth of her performance.
Speech Content: Whenever sing my song, on the stage, on my own. Whenever say my words, wishing they would be heard.

Case14

Video Prompt: The video features Lisa of Blackpink speaking into a microphone, wearing headphones and a blue top with a cat design. She is in a setting with a brick wall background. The video captures her speaking and occasionally touching her face.
Ambient Prompt: A woman speaking.
Speech Content: Like people know me as Lisa of Blackpink.

Case15

Video Prompt: The video features a man in elaborate armor standing confidently in a dramatic stadium setting. He speaks with a confident and assertive tone, emphasizing his power and authority. The background is filled with a cloudy sky and lightning, adding to the intense and dramatic atmosphere. The overall style is cartoonish, with detailed and expressive character design.
Ambient Prompt: Noisy background sound
Speech Content: You think you've seen pain? You think you know suffering? Try beat it.

Case16

Video Prompt: The video features a woman speaking about her music, which she describes as very westernized despite being a Chinese musician. She is standing in front of a blue curtain, with a traditional Chinese string instrument visible to her left. Her hands are moving as she speaks, indicating an expressive delivery.
Ambient Prompt: A woman giving a speech.
Speech Content: However, as a Chinese musician, my music was very westernized.

Case17

Video Prompt: A man in a gray suit and hat is giving a speech on a stage. He is gesturing with his hands while speaking about the data showing that married people report higher life satisfaction. The background is a stage with a presentation screen displaying text.
Ambient Prompt: An adult male is speaking.
Speech Content: The get married advocates like to point to data that show that married people report higher life satisfaction.

Case18

Video Prompt: A cartoon bunny wearing a blue jacket and carrying a brown backpack is standing on a dirt path surrounded by colorful flowers. The bunny is looking around and possibly talking to the camera about picking carrots and sharing them with parents. The background is a picturesque outdoor setting with a dirt path, colorful flowers, and a blurred background of houses and trees, suggesting a rural or countryside environment.
Speech Content: I picked a lot of carrots today and will share them with my parents when I get home.

Case19

Video Prompt: The video features a woman dressed in traditional attire, speaking directly to the camera with a serious demeanor. The setting is an outdoor environment with large green leaves in the background, and the woman's speech content suggests she is providing advice or information, possibly related to health and mental well-being.
Speech Content: It's just a game, so don't take it to heart. Take it from me, I'm a doctor. Excessive rumination on negative thoughts is known to cause many adverse health effects.

Case20


Musical Instrument Playing



Video Prompt: A young man is sitting on a step, playing a guitar and singing. He is wearing a gray t-shirt and dark pants with red sneakers. The background features a yellow wall with peeling paint and a window. The camera remains stationary, focusing on the young man and his guitar. The ambient sound is the sound of the guitar and singing.
Ambient Prompt: Guitar music and singing.
Speech Content: When I was young, I'd listen to the radio Waitin' for my favorite songs.

Case1

Video Prompt: The video shows a person playing the saxophone in a dimly lit room with a warm ambiance. The person is wearing a black t-shirt with a graphic design and glasses. They are playing a golden saxophone, moving their fingers on the keys, and occasionally adjusting their posture. The background includes framed posters on the wall, a computer monitor displaying a music editing software, a microphone, a speaker, and a lava lamp on a desk. The camera remains stationary, focusing on the person and the saxophone. The sound of the saxophone, the music playing in the background, and the ambient noise of the room can be heard.
Ambient Prompt: A saxophone is playing a melody.

Case2

Video Prompt: A woman is playing a flute in a dark, indoor setting. She is wearing a pink blazer and a patterned blouse. Her hands are moving rhythmically over the keys of the flute as she plays. The background is blurred, suggesting an indoor environment with soft lighting. The only sound is the soft, melodic music of the flute.
Ambient Prompt: A flute is playing a note.

Case3

Video Prompt: A young man is playing an acoustic guitar in a room with blue walls, two large speakers, and string lights in the background. He is wearing glasses, a maroon shirt, and a white undershirt. The video focuses on his hands moving along the fretboard and strumming the strings, with background music and guitar strumming audible.
Ambient Prompt: A guitar is being played.

Case4

Video Prompt: A young man is playing the piano in a well-lit room. He is wearing a green T-shirt and glasses, sitting on a black piano bench. The room has large windows covered by blinds, a bookshelf filled with books on the right side, and a carpeted floor. The young man is focused on playing the piano, his hands moving rhythmically over the keys, and he occasionally glances at the sheet music. The background music is soft and adds to the serene atmosphere of the room.
Ambient Prompt: A piano is playing a melody.

Case5

Video Prompt: A young boy is playing the drums in a room decorated with posters of bands and musicians. He is wearing a yellow t-shirt with a geometric pattern and black shorts, and he is wearing headphones. The boy is actively playing the drums, hitting the drums and cymbals rhythmically. The room has dark gray walls, a carpeted floor, and posters of bands and musicians on the wall. The background music is playing throughout the video.
Ambient Prompt: A drum beat is playing.

Case6

Video Prompt: A pianist dressed in a black suit is playing a Kawai piano in a dimly lit room with a wooden floor. The pianist's hands move rhythmically across the keys, creating a soothing melody. The background is quiet, with only the soft sound of the piano music.
Ambient Prompt: A piano is playing a melody.

Case7

Video Prompt: A man is playing the violin in a room filled with violins and books. He is wearing a black t-shirt with a red 'Supreme' logo. The man is focused on playing the violin, moving the bow across the strings and his fingers along the strings. The room has shelves containing various violins and books, and a framed sheet music on the wall. The sound of the violin being played can be heard.
Ambient Prompt: A violin is playing a note.

Case8

Video Prompt: A man wearing a striped shirt is playing the violin with a bow. He is standing in front of a plain white background. The man is moving his hands and body while playing the violin. There is music playing in the background.
Ambient Prompt: A violin is playing a note.

Case9

Video Prompt: The video features a woman playing an acoustic guitar in a cozy, warmly lit room. She is dressed in a white blouse and a pearl necklace, and her long brown hair flows freely. The room is decorated with string lights, a red electric guitar, and a ukulele hanging on the wall. A floral-patterned cushion is visible on a chair in the background. The woman occasionally looks at the camera with a smile while playing the guitar, creating a melodic tune. The camera remains focused on her, capturing her expressions and movements. The ambient sound is the music of the guitar, adding to the cozy and intimate atmosphere of the scene.
Ambient Prompt: The sound of the guitar music.

Case10

Video Prompt: A young boy is sitting on a couch, playing an acoustic guitar and singing a song. The room is cozy with a beige couch, a window with blinds, and a wooden floor. The boy is wearing a dark blue t-shirt and patterned shorts with animal prints. He strums the guitar and sings, his facial expressions changing as he sings. The background music is a guitar and singing.
Ambient Prompt: A young boy sings and plays a guitar.
Speech Content: You make me happy when the clouds are gray.

Case11

Video Prompt: The video shows a woman playing a flute in a decorative setting. She is wearing an orange sweatshirt and black pants, sitting on a chair, and her hands and mouth are actively engaged in playing the flute. The background features pink flowers and a lamp, creating a serene and artistic atmosphere. The camera remains stationary, capturing the woman's focused performance.
Ambient Prompt: A flute is playing a melody.

Case12

Video Prompt: The video captures a serene and focused moment of a woman playing the violin. She is dressed in a strapless pink dress, and her dark hair is neatly tied in a bun. The background is a plain, light-colored wall, which keeps the focus on her and her instrument. The music playing in the background adds to the tranquil and artistic atmosphere of the scene.
Ambient Prompt: A violin is playing a melody.

Case13

Video Prompt: A man is playing an accordion in a room. He is wearing a red cap with 'MOTUL' written on it, a blue shirt, green pants, and black sandals. The room has a wooden floor, a white door, and a sink. The man is focused on playing the accordion, moving his fingers on the keys and buttons. The background music is from the accordion.
Ambient Prompt: Music from the accordion.

Case14

Video Prompt: A young child is playing a drum set in a living room, creating a cheerful and lively atmosphere. The child is smiling and appears to be enjoying the activity. The background includes a black leather sofa, white curtains, and a potted plant on a table, adding to the cozy and homey setting.
Ambient Prompt: A drum is being played.

Case15


Other Sound



Video Prompt: The video showcases a tranquil scene of a waterfall in a dense forest. The waterfall is the central focus, with water cascading down a rocky cliff surrounded by lush greenery. The ambient sound of flowing water adds to the peaceful atmosphere, making it a perfect representation of nature's beauty.
Ambient Prompt: Sound of flowing spring water.

Case1

Video Prompt: The video begins with a close-up of a woman with short brown hair, wearing a blue dress with a red belt, standing and looking at the fireworks in the sky. The camera then pulls away, revealing a wider view of the night sky filled with colorful fireworks. The sound of fireworks exploding can be heard throughout the video.
Ambient Prompt: Sound of fireworks exploding.

Case2

Video Prompt: The video depicts a group of people performing aerobics in a park setting. The characters are dressed in athletic wear and are stepping on step platforms in unison. The background features a park-like environment with trees, grass, and a staircase. The ambient sound is that of dance music, adding to the energetic atmosphere of the scene.
Ambient Prompt: Sound of dance music.

Case3

Video Prompt: The video captures the moment a civilian airliner prepares for and executes takeoff on a wide, open runway. The scene is set against a backdrop of a flat landscape and a partly cloudy sky, with the sound of the plane's engine providing an immersive audio experience. The airliner moves from a stationary position to full speed, and finally lifts off, showcasing the power and precision of modern aviation.
Ambient Prompt: The sound of the plane's engine roars.

Case4

Video Prompt: The video captures a first-person perspective of a motorcyclist speeding down a winding road through a dense forest. The motorcyclist, dressed in a black leather jacket and gloves, grips the handlebars tightly as they navigate the curves. The speedometer shows the increasing speed, and the motorcycle leans into the turns. The background is filled with tall trees and a clear blue sky, creating a serene and natural environment. The sound of the motor engine and the wind rushing past add to the immersive experience.
Ambient Prompt: Sound of motor engine, wind rushing past.

Case5

Video Prompt: The video captures a close-up view of a fire burning brightly. The flames are intense and vibrant, illuminating the surrounding area with a warm glow. The wood logs are slowly burning, and the crackling sound of the fire can be heard. The scene is realistic and captures the essence of a cozy fire burning.
Ambient Prompt: Fire crackling and crackling with wind blowing.

Case6

Video Prompt: The video captures a serene moment of a small bird perched on a branch, singing or chirping, with a blurred green background that suggests a natural outdoor setting. The bird's movements are subtle, primarily involving the opening and closing of its beak.
Ambient Prompt: Birds chirping and singing.

Case7

Video Prompt: The video captures a spectacular display of fireworks lighting up the night sky. The fireworks explode in various colors and patterns, creating a dazzling and dynamic visual effect. The background is a dark night sky, providing a stark contrast to the bright, colorful bursts of light. There are no characters or objects present in the video, and the focus is solely on the fireworks and their explosive display.
Ambient Prompt: Several loud bursts of fireworks.

Case8

Video Prompt: The video captures a stunning view of a powerful waterfall, with water cascading down over a rocky edge and creating a misty atmosphere. The scene is devoid of any human presence, focusing solely on the natural beauty of the waterfall and its surroundings.
Ambient Prompt: Water is rushing and splashing.

Case9

Video Prompt: The video depicts a festive scene at a night market in Japan on New Year's Eve. Three girls in traditional Japanese kimonos are enjoying the fireworks display. The girl in the front is speaking happily, while the other two girls are smiling and looking at the fireworks. The background features colorful fireworks in the sky, stalls with lanterns, and a festive atmosphere. The ambient sound includes the sound of fireworks and the ambient noise of a night market.
Ambient Prompt: The sound of fireworks, the ambient noise of a night market.
Speech Content: It is new year's eve, and the fireworks are so beautiful!

Case10

Video Prompt: The video captures a serene and somewhat melancholic rainy night scene. The camera remains stationary, focusing on the trees and the street as the rain falls heavily. The ambient sounds of rain and wind add to the peaceful yet somber atmosphere.
Ambient Prompt: Rain falling on a surface.

Case11

Video Prompt: The video captures a serene scene of a boat journeying through a cold, icy ocean towards a large iceberg. The camera follows the boat's movement, highlighting the vastness of the ocean and the majestic presence of the iceberg. The ambient sounds of waves and wind add to the immersive experience of the journey.
Ambient Prompt: Wind blowing hard and waves crashing.

Case12

Video Prompt: The video shows a yellow helicopter with 'SAMU' written on it, parked on a helipad in a city setting. The helicopter is stationary and its engine is starting with roar of the engine, and there are no visible characters or significant actions taking place. The background includes buildings and parked cars, suggesting an urban environment.
Ambient Prompt: An aircraft engine running idle.

Case13

Video Prompt: The video depicts a scene of a flooded area caused by heavy waves crashing against the shore. Trees and buildings are partially submerged in water, and the sky is cloudy, indicating a severe weather event. The waves continue to crash, and the water level rises, submerging more of the area.
Ambient Prompt: Waves crash against the shore as a man yells and the wind blows.

Case14

Video Prompt: The video captures a military helicopter taking off from a grassy field. The helicopter is large and green, with its rotors spinning rapidly as it lifts off the ground. The background features a cloudy sky and some distant structures, adding to the realistic setting of the scene.
Ambient Prompt: Helicopter blades spinning.

Case15




Verse-Bench


Verse-Bench is a benchmark we developed for evaluating joint audio-visual generation. We curated 600 image-text prompt pairs from a wide range of sources: frames extracted from YouTube videos, Bilibili videos, TikTok clips, movies, and anime; images generated by AI models; and images collected from public websites. Our dataset comprises three subsets:

A larger, high-resolution version is available in our report.


  • Set1-I contains image-text pairs (AI-generated images, web-crawled images, and media screenshots) whose video/audio captions and speech content were produced using LLMs and manual annotation; it comprises 205 samples. Statistics are shown in figure (b).
  • Set2-V consists of video clips from YouTube and Bilibili, annotated with LLM-generated captions and Whisper-based ASR transcripts and then verified by humans; it comprises 295 samples. Statistics are shown in figure (c).
  • Set3-Ted includes TED Talks from September 2025, processed with the same annotation pipeline as Set2-V; it comprises 100 samples.
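
As a quick consistency check on the subset sizes listed above, the snippet below confirms that they add up to the 600 curated prompt pairs and prints each subset's share. The dictionary name `SUBSET_SIZES` is only an illustrative choice; the numbers come from this page.

```python
# Subset sizes as listed on this page.
SUBSET_SIZES = {"Set1-I": 205, "Set2-V": 295, "Set3-Ted": 100}

total = sum(SUBSET_SIZES.values())
shares = {name: size / total for name, size in SUBSET_SIZES.items()}

for name, size in SUBSET_SIZES.items():
    print(f"{name}: {size} samples ({shares[name]:.1%})")
print(f"total: {total}")
```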




Citation


    @article{wang2025universe-1,
        title={UniVerse-1: A Unified Audio-Video Generation Framework via Stitching of Expertise},
        author={Wang, Duomin and Zuo, Wei and Li, Aojie and Chen, Ling-Hao and Liao, Xinyao and Zhou, Deyu and Yin, Zixin and Dai, Xili and Jiang, Daxin and Yu, Gang},
        journal={arXiv},
        year={2025}
    }


Acknowledgements


The website template was adapted from GRAM.