Improve Vevo2 Model Card: Add pipeline tag, links, and description

This PR improves the Vevo2 model card by:
- Adding `pipeline_tag: text-to-speech` to the metadata, so the model can be found under that task on the Hugging Face Hub.
- Adding links to the paper, project page, and GitHub repository at the top of the card.
- Adding a concise description of the model, derived from the paper's abstract, to give users a quick overview.
- Retaining the existing usage section and citations, and adding a comment at the top of the usage code that tells users to clone the repository first.
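
The metadata change can be checked locally with a minimal sketch like the one below. It is not part of the PR: `parse_front_matter` and `SAMPLE_README` are hypothetical names, the sample text copies only the front-matter lines visible in the diff (the language list is abridged there), and the hand-rolled parser handles just the flat `key: value` and `- item` forms used in this card. A real check would more likely use a full YAML parser.

```python
# Hypothetical sample: the front matter as it appears on the new side of the diff.
# The language list is abridged because the diff hunk hides its first entries.
SAMPLE_README = """\
---
datasets:
- amphion/Emilia-Dataset
language:
- ko
- de
- fr
license: cc-by-nc-nd-4.0
tags:
- tts
- vc
- svs
- svc
- music
pipeline_tag: text-to-speech
---
# Vevo2
"""


def parse_front_matter(readme_text: str) -> dict:
    """Parse a flat YAML-style front matter block (scalars and simple lists only)."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no front matter present

    metadata = {}
    current_key = None
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":
            break  # closing delimiter of the front matter
        if not stripped:
            continue
        if stripped.startswith("- "):
            # List item: attach it to the most recent key that opened a list.
            if isinstance(metadata.get(current_key), list):
                metadata[current_key].append(stripped[2:].strip())
            continue
        key, _, value = stripped.partition(":")
        current_key = key.strip()
        value = value.strip()
        # A key with no inline value starts a list; otherwise store the scalar.
        metadata[current_key] = value if value else []
    return metadata


metadata = parse_front_matter(SAMPLE_README)
print(metadata.get("pipeline_tag"))
```

If `pipeline_tag` or `license` were missing from the card, `metadata.get(...)` would return `None`, which makes the check easy to turn into an assertion.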

Please review these updates.

Files changed (1)

README.md +9 -3

@@ -1,5 +1,4 @@
 ---
-license: cc-by-nc-nd-4.0
 datasets:
 - amphion/Emilia-Dataset
 language:
@@ -9,20 +8,28 @@ language:
 - ko
 - de
 - fr
 tags:
 - tts
 - vc
 - svs
 - svc
 - music
 ---
 # Vevo2
-[![arXiv](https://img.shields.io/badge/Vevo-Paper-COLOR.svg)](https://arxiv.org/abs/2508.16332)
 ## Usage
 ```python
 import os
 import torch
 from huggingface_hub import snapshot_download
@@ -263,7 +270,6 @@ if __name__ == "__main__":
         style_ref_text=taiyizhenren_text,
         timbre_ref_wav_path=taiyizhenren_path,
     )
 ```
 ## Citations

 ---
 datasets:
 - amphion/Emilia-Dataset
 language:
 - ko
 - de
 - fr
+license: cc-by-nc-nd-4.0
 tags:
 - tts
 - vc
 - svs
 - svc
 - music
+pipeline_tag: text-to-speech
 ---
 # Vevo2
+[📚 Paper](https://huggingface.co/papers/2508.16332) • [🏠 Project Page](https://versasinger.github.io/) • [💻 Code](https://github.com/microsoft/Vevo2)
+Vevo2 is a unified framework for controllable speech and singing voice generation. It addresses challenges like the scarcity of annotated singing data and enables flexible controllability through two innovative audio tokenizers: a music-notation-free prosody tokenizer and a unified content-style tokenizer. The framework consists of an auto-regressive (AR) content-style modeling stage for control over text, prosody, and style, and a flow-matching acoustic modeling stage for timbre control. Joint speech-singing training and a multi-objective post-training task further enhance its ability to follow text and prosody. Experimental results demonstrate Vevo2's mutual benefits for both speech and singing voice generation, showcasing strong generalization and versatility across various synthesis, conversion, and editing tasks.
 ## Usage
 ```python
+# To run the code, first clone the Vevo2 GitHub repository:
+# git clone https://github.com/microsoft/Vevo2
+# Then install its dependencies to access models.svc.vevo2.vevo2_utils
 import os
 import torch
 from huggingface_hub import snapshot_download
         style_ref_text=taiyizhenren_text,
         timbre_ref_wav_path=taiyizhenren_path,
     )
 ```
 ## Citations