
other language support? #3

Open
zdj97 opened this issue Apr 15, 2024 · 14 comments

Comments


zdj97 commented Apr 15, 2024

No description provided.

@ylacombe
Collaborator

Hey @zdj97, at the moment, we don't support other languages.
However, most of the approaches here are language-agnostic; the only English-specific one I can think of is the speaking rate estimator. For now, the speaking rate is simply computed as the number of phonemes divided by the audio length, and the phonemes are computed with g2p, which only supports English.

What languages do you have in mind? Would you like to open a PR to add support for other languages?
Let me know!
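In other words, the estimator boils down to phonemes per second, and only the grapheme-to-phoneme step is language-specific. A minimal sketch of that idea (my `to_phonemes` stub stands in for g2p, which the real pipeline uses for English):

```python
# Speaking-rate estimator sketch: phoneme count divided by audio duration.
# to_phonemes is a crude stub; the real pipeline uses g2p here, which is
# English-only, so swapping this one function is what other-language
# support comes down to.

def to_phonemes(text: str) -> list[str]:
    # Stub: one pseudo-phoneme per letter (NOT a real g2p conversion).
    return [c for c in text if c.isalpha()]

def speaking_rate(text: str, num_samples: int, sample_rate: int) -> float:
    audio_length = num_samples / sample_rate      # clip duration in seconds
    return len(to_phonemes(text)) / audio_length  # phonemes per second

# e.g. a 2-second clip at 16 kHz:
rate = speaking_rate("hello world", num_samples=32000, sample_rate=16000)
```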


ittailup commented Apr 25, 2024

@ylacombe Why did you choose g2p specifically? I had to swap it with espeak-ng phonemizer for Spanish because g2p doesn't support Spanish. Happy to push my changes later in the week.

@ylacombe
Collaborator

@ittailup, this work started as a reproduction of the research paper Natural language guidance of high-fidelity text-to-speech with synthetic annotations, which uses g2p!
We also considered:

  1. License
  2. Dependencies and ease of installation

g2p fulfills our requirements on both counts!


taalua commented May 2, 2024

@ittailup I am interested in fine-tuning the current model for other languages, e.g., Spanish. Did you use the existing trained model and prompt tokenizer "parler-tts/parler_tts_mini_v0.1", or did you train from scratch with a custom tokenizer for espeak-ng? Thank you for your insights.


ittailup commented May 6, 2024

@taalua I took the mini_v0.1 checkpoint and fine-tuned it with my dataset. This was my "rate_apply" (written by Claude):

from phonemizer.backend import EspeakBackend

# Build the espeak-ng backend once; re-creating it on every call is slow.
backend = EspeakBackend('es-es', with_stress=True)

def rate_apply(batch, rank=None, audio_column_name="audio", text_column_name="text"):
    if isinstance(batch[audio_column_name], list):  # batched mapping
        speaking_rates = []
        phonemes_list = []
        for text, audio in zip(batch[text_column_name], batch[audio_column_name]):
            # phonemize expects a list of utterances and returns a list
            phonemes = backend.phonemize([text])[0]

            sample_rate = audio["sampling_rate"]
            audio_length = len(audio["array"].squeeze()) / sample_rate

            speaking_rate = len(phonemes) / audio_length

            speaking_rates.append(speaking_rate)
            phonemes_list.append(phonemes)

        batch["speaking_rate"] = speaking_rates
        batch["phonemes"] = phonemes_list
    else:  # single example
        phonemes = backend.phonemize([batch[text_column_name]])[0]

        sample_rate = batch[audio_column_name]["sampling_rate"]
        audio_length = len(batch[audio_column_name]["array"].squeeze()) / sample_rate

        speaking_rate = len(phonemes) / audio_length

        batch["speaking_rate"] = speaking_rate
        batch["phonemes"] = phonemes

    return batch
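Since the only espeak-specific piece above is the phonemize call, the batched/single dispatch can be factored out so that porting to yet another language means swapping a single function. A dependency-free sketch of that refactor (the `make_rate_apply` factory and `fake_phonemize` stub are mine, for illustration; audio arrays are plain lists here rather than numpy arrays):

```python
# Factor the language-specific phonemizer out of rate_apply: only
# phonemize_fn needs replacing per language. fake_phonemize is a stub.

def make_rate_apply(phonemize_fn, audio_column_name="audio", text_column_name="text"):
    def rate_apply(batch, rank=None):
        def one(text, audio):
            phonemes = phonemize_fn(text)
            audio_length = len(audio["array"]) / audio["sampling_rate"]
            return len(phonemes) / audio_length, phonemes

        if isinstance(batch[audio_column_name], list):  # batched mapping
            results = [one(t, a) for t, a in
                       zip(batch[text_column_name], batch[audio_column_name])]
            batch["speaking_rate"] = [r for r, _ in results]
            batch["phonemes"] = [p for _, p in results]
        else:  # single example
            batch["speaking_rate"], batch["phonemes"] = one(
                batch[text_column_name], batch[audio_column_name])
        return batch

    return rate_apply

# Stub "phonemizer": one pseudo-phoneme per non-space character.
fake_phonemize = lambda text: text.replace(" ", "")

rate_apply = make_rate_apply(fake_phonemize)
batch = {"text": "hola", "audio": {"array": [0.0] * 16000, "sampling_rate": 16000}}
out = rate_apply(batch)  # 4 pseudo-phonemes over a 1-second clip
```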


ittailup commented May 6, 2024

@taalua I did not have to change the prompt at:

NEW_PROMPT = """You will be given six descriptive keywords related to an audio sample of a person's speech. These keywords include:

I did add a nationality to the "text description", so "A man" would become "A {country_name} man", but this was a text replacement after building the initial dataspeech dataset.


taalua commented May 7, 2024

@ittailup Thank you, I appreciate your help. So the tokenizer remains the same, i.e., parler-tts/parler_tts_mini_v0.1. Does fine-tuning work well for Spanish using mini_v0.1?

How much data did you use for fine-tuning, and how many epochs did you need?


ittailup commented May 7, 2024

Parler gave me the best results of all the pipelines and models I had tested: better than Piper, and easier to train than pflow, VITS2, or StyleTTS2. The voice quality with ~15 h of speech and 39 epochs was very impressive. Even after 10k steps the quality was probably good enough to stop; we did 54k.

Collaborator

ylacombe commented May 9, 2024

Hey @ittailup, this is great to hear!
Would you mind sharing some samples out of curiosity? Also don't hesitate to share the model, if that's something you can do!


yoesak commented May 11, 2024

Thanks, I tried your "rate_apply" (not using g2p) and fine-tuned with an Indonesian speech dataset from Common Voice 13. It works, and the result is good even though I used only 1,706 samples.

here is the result:

output.mp4

@ylacombe
Collaborator

Hey @yoesak, thanks for sharing the sample, it sounds really great!
Would you be potentially interested in sharing the model publicly?
(also cc @ittailup in case you'd be interested as well!)


yoesak commented May 17, 2024


Yes, but the model is not stable yet. In my testing, when I use the espeak backend on a larger amount of data, I get a memory leak, so I decided to use a custom phoneme module, since I only need Indonesian. As soon as I finish the training, I will let you know.

@ylacombe
Collaborator

if I use espeak backend for larger amount data, I got memory leak

This is interesting, have you tried using a traditional LLM tokenizer?

How is the training going? Let me know if I can help!
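For what it's worth, the tokenizer idea can be sketched as: use token count per second as a language-agnostic stand-in for phonemes per second. A minimal sketch (my whitespace "tokenizer" is a stub; a real subword tokenizer would slot in the same way and carries no espeak-style native-library state):

```python
# Token-rate proxy for speaking rate: tokens per second instead of
# phonemes per second. tokenize() is a whitespace stub standing in for
# a real (e.g. subword) tokenizer.

def tokenize(text: str) -> list[str]:
    return text.split()  # stub; a trained tokenizer would go here

def token_rate(text: str, duration_s: float) -> float:
    return len(tokenize(text)) / duration_s  # tokens per second
```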


manigandanp commented Aug 19, 2024

this was my "rate_apply" (written by Claude)

This method had some issues when I used it for the Tamil language. Here is the updated version:

from phonemizer.backend import EspeakBackend

# Build the espeak-ng backend once for Tamil.
backend = EspeakBackend("ta")

def rate_apply(batch, rank=None, audio_column_name="audio", text_column_name="text"):
    if isinstance(batch[text_column_name], list):  # batched mapping
        speaking_rates = []
        phonemes_list = []
        if "speech_duration" in batch:
            # Use precomputed durations instead of decoding the audio.
            for text, audio_duration in zip(
                batch[text_column_name], batch["speech_duration"]
            ):
                # phonemize expects a list of utterances and returns a list
                phonemes = backend.phonemize([text], strip=True)[0]
                audio_duration = audio_duration if audio_duration != 0 else 0.01
                speaking_rate = len(phonemes) / audio_duration
                speaking_rates.append(speaking_rate)
                phonemes_list.append(phonemes)
        else:
            for text, audio in zip(batch[text_column_name], batch[audio_column_name]):
                phonemes = backend.phonemize([text], strip=True)[0]

                sample_rate = audio["sampling_rate"]
                audio_length = len(audio["array"].squeeze()) / sample_rate

                speaking_rate = len(phonemes) / audio_length

                speaking_rates.append(speaking_rate)
                phonemes_list.append(phonemes)

        batch["speaking_rate"] = speaking_rates
        batch["phonemes"] = phonemes_list
    else:  # single example
        phonemes = backend.phonemize([batch[text_column_name]], strip=True)[0]
        if "speech_duration" in batch:
            audio_length = (
                batch["speech_duration"] if batch["speech_duration"] != 0 else 0.01
            )
        else:
            sample_rate = batch[audio_column_name]["sampling_rate"]
            audio_length = (
                len(batch[audio_column_name]["array"].squeeze()) / sample_rate
            )

        speaking_rate = len(phonemes) / audio_length

        batch["speaking_rate"] = speaking_rate
        batch["phonemes"] = phonemes

    return batch
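One note on the `speech_duration` fallback above: substituting 0.01 s for a zero duration avoids a ZeroDivisionError but produces an extreme speaking rate; filtering such rows out may be preferable. The guard on its own (here written as a clamp, slightly stricter than the zero check above):

```python
MIN_DURATION = 0.01  # seconds; floor that prevents division by zero

def safe_speaking_rate(num_phonemes: int, duration_s: float) -> float:
    # Clamp tiny/zero durations so empty or mislabeled clips don't crash
    # the mapping; consider dropping these rows instead of keeping their
    # inflated rates.
    return num_phonemes / max(duration_s, MIN_DURATION)
```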
