
other language support? #3

Open
zdj97 opened this issue Apr 15, 2024 · 14 comments

Comments


zdj97 commented Apr 15, 2024

No description provided.

@ylacombe
Collaborator

Hey @zdj97, at the moment, we don't support other languages.
However, most of the approaches here are language-agnostic; the only English-specific one I can think of is the speaking rate estimator. For now, the speaking rate is simply computed as the number of phonemes divided by the audio length, and the phonemes are computed with g2p, which only supports English.

What languages do you have in mind? Would you like to open a PR to add support for other languages?
Let me know!
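In other words, the estimator boils down to phonemes per second, and only the grapheme-to-phoneme step is language-specific. A minimal sketch of that idea (my `to_phonemes` stub stands in for g2p, which the real pipeline uses for English):

```python
# Speaking-rate estimator sketch: phoneme count divided by audio duration.
# to_phonemes is a crude stub; the real pipeline uses g2p here, which is
# English-only, so swapping this one function is what other-language
# support comes down to.

def to_phonemes(text: str) -> list[str]:
    # Stub: one pseudo-phoneme per letter (NOT a real g2p conversion).
    return [c for c in text if c.isalpha()]

def speaking_rate(text: str, num_samples: int, sample_rate: int) -> float:
    audio_length = num_samples / sample_rate      # clip duration in seconds
    return len(to_phonemes(text)) / audio_length  # phonemes per second

# e.g. a 2-second clip at 16 kHz:
rate = speaking_rate("hello world", num_samples=32000, sample_rate=16000)
```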


ittailup commented Apr 25, 2024

@ylacombe Why did you choose g2p specifically? I had to swap it with espeak-ng phonemizer for Spanish because g2p doesn't support Spanish. Happy to push my changes later in the week.

@ylacombe
Collaborator

@ittailup, this work started as a reproduction of the research paper Natural language guidance of high-fidelity text-to-speech with synthetic annotations, which uses g2p!
We also considered:

  1. License
  2. Dependencies and ease of installation

g2p fulfills our requirements on both counts!


taalua commented May 2, 2024

@ittailup I am interested in fine-tuning the current model for other languages, e.g., Spanish. Did you use the existing trained model and prompt tokenizer "parler-tts/parler_tts_mini_v0.1", or did you train from scratch with a custom tokenizer for espeak-ng? Thank you for your insights.


ittailup commented May 6, 2024

@taalua I took the mini_v0.1 checkpoint and fine-tuned it with my dataset. This was my "rate_apply" (written by Claude):

from phonemizer.backend import EspeakBackend

# Build the espeak-ng backend once; re-creating it on every call is slow.
backend = EspeakBackend('es-es', with_stress=True)

def rate_apply(batch, rank=None, audio_column_name="audio", text_column_name="text"):
    if isinstance(batch[audio_column_name], list):  # batched mapping
        speaking_rates = []
        phonemes_list = []
        for text, audio in zip(batch[text_column_name], batch[audio_column_name]):
            # phonemize expects a list of utterances and returns a list
            phonemes = backend.phonemize([text])[0]

            sample_rate = audio["sampling_rate"]
            audio_length = len(audio["array"].squeeze()) / sample_rate

            speaking_rate = len(phonemes) / audio_length

            speaking_rates.append(speaking_rate)
            phonemes_list.append(phonemes)

        batch["speaking_rate"] = speaking_rates
        batch["phonemes"] = phonemes_list
    else:  # single example
        phonemes = backend.phonemize([batch[text_column_name]])[0]

        sample_rate = batch[audio_column_name]["sampling_rate"]
        audio_length = len(batch[audio_column_name]["array"].squeeze()) / sample_rate

        speaking_rate = len(phonemes) / audio_length

        batch["speaking_rate"] = speaking_rate
        batch["phonemes"] = phonemes

    return batch
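Since the only espeak-specific piece above is the phonemize call, the batched/single dispatch can be factored out so that porting to yet another language means swapping a single function. A dependency-free sketch of that refactor (the `make_rate_apply` factory and `fake_phonemize` stub are mine, for illustration; audio arrays are plain lists here rather than numpy arrays):

```python
# Factor the language-specific phonemizer out of rate_apply: only
# phonemize_fn needs replacing per language. fake_phonemize is a stub.

def make_rate_apply(phonemize_fn, audio_column_name="audio", text_column_name="text"):
    def rate_apply(batch, rank=None):
        def one(text, audio):
            phonemes = phonemize_fn(text)
            audio_length = len(audio["array"]) / audio["sampling_rate"]
            return len(phonemes) / audio_length, phonemes

        if isinstance(batch[audio_column_name], list):  # batched mapping
            results = [one(t, a) for t, a in
                       zip(batch[text_column_name], batch[audio_column_name])]
            batch["speaking_rate"] = [r for r, _ in results]
            batch["phonemes"] = [p for _, p in results]
        else:  # single example
            batch["speaking_rate"], batch["phonemes"] = one(
                batch[text_column_name], batch[audio_column_name])
        return batch

    return rate_apply

# Stub "phonemizer": one pseudo-phoneme per non-space character.
fake_phonemize = lambda text: text.replace(" ", "")

rate_apply = make_rate_apply(fake_phonemize)
batch = {"text": "hola", "audio": {"array": [0.0] * 16000, "sampling_rate": 16000}}
out = rate_apply(batch)  # 4 pseudo-phonemes over a 1-second clip
```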


ittailup commented May 6, 2024

@taalua I did not have to change the prompt at:

NEW_PROMPT = """You will be given six descriptive keywords related to an audio sample of a person's speech. These keywords include:

I did add a nationality to the "text description", so "A man" would become "A {country_name} man", but this was a text replacement after building the initial dataspeech dataset.


taalua commented May 7, 2024

@ittailup Thank you, I appreciate your help. So the tokenizer remains the same, i.e., parler-tts/parler_tts_mini_v0.1. Does fine-tuning work well for Spanish using mini_v0.1?

How much data did you use for fine-tuning, and how many epochs did you need?


ittailup commented May 7, 2024

Parler gave me the best results of all the pipelines and models I had tested: better than Piper, and easier to train than pflow, VITS2, or StyleTTS2. The voice quality with ~15 h of speech and 39 epochs was very impressive. Even after 10k steps the quality was probably good enough to stop; we did 54k.

Collaborator

ylacombe commented May 9, 2024

Hey @ittailup, this is great to hear!
Would you mind sharing some samples out of curiosity? Also don't hesitate to share the model, if that's something you can do!


yoesak commented May 11, 2024

Thanks, I tried your "rate_apply" (not using g2p) and fine-tuned with an Indonesian speech dataset from Common Voice 13. It works, and the result is good even though I used only 1,706 samples.

here is the result:

output.mp4

@ylacombe
Collaborator

Hey @yoesak, thanks for sharing the sample, it sounds really great!
Would you be potentially interested in sharing the model publicly?
(also cc @ittailup in case you'd be interested as well!)


yoesak commented May 17, 2024


Yes, but the model is not stable yet. In my testing, when I use the espeak backend on a larger amount of data, I get a memory leak, so I decided to use a custom phoneme module, since I only need Indonesian. As soon as I finish the training, I will let you know.

@ylacombe
Collaborator

if I use espeak backend for larger amount data, I got memory leak

This is interesting, have you tried using a traditional LLM tokenizer?

How is the training going? Let me know if I can help!
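For what it's worth, the tokenizer idea can be sketched as: use token count per second as a language-agnostic stand-in for phonemes per second. A minimal sketch (my whitespace "tokenizer" is a stub; a real subword tokenizer would slot in the same way and carries no espeak-style native-library state):

```python
# Token-rate proxy for speaking rate: tokens per second instead of
# phonemes per second. tokenize() is a whitespace stub standing in for
# a real (e.g. subword) tokenizer.

def tokenize(text: str) -> list[str]:
    return text.split()  # stub; a trained tokenizer would go here

def token_rate(text: str, duration_s: float) -> float:
    return len(tokenize(text)) / duration_s  # tokens per second
```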


manigandanp commented Aug 19, 2024

this was my "rate_apply" (written by Claude)

This method had some issues when I used it for the Tamil language. Here is the updated version:

from phonemizer.backend import EspeakBackend

# Build the espeak-ng backend once for Tamil.
backend = EspeakBackend("ta")

def rate_apply(batch, rank=None, audio_column_name="audio", text_column_name="text"):
    if isinstance(batch[text_column_name], list):  # batched mapping
        speaking_rates = []
        phonemes_list = []
        if "speech_duration" in batch:
            # Use precomputed durations instead of decoding the audio.
            for text, audio_duration in zip(
                batch[text_column_name], batch["speech_duration"]
            ):
                # phonemize expects a list of utterances and returns a list
                phonemes = backend.phonemize([text], strip=True)[0]
                audio_duration = audio_duration if audio_duration != 0 else 0.01
                speaking_rate = len(phonemes) / audio_duration
                speaking_rates.append(speaking_rate)
                phonemes_list.append(phonemes)
        else:
            for text, audio in zip(batch[text_column_name], batch[audio_column_name]):
                phonemes = backend.phonemize([text], strip=True)[0]

                sample_rate = audio["sampling_rate"]
                audio_length = len(audio["array"].squeeze()) / sample_rate

                speaking_rate = len(phonemes) / audio_length

                speaking_rates.append(speaking_rate)
                phonemes_list.append(phonemes)

        batch["speaking_rate"] = speaking_rates
        batch["phonemes"] = phonemes_list
    else:  # single example
        phonemes = backend.phonemize([batch[text_column_name]], strip=True)[0]
        if "speech_duration" in batch:
            audio_length = (
                batch["speech_duration"] if batch["speech_duration"] != 0 else 0.01
            )
        else:
            sample_rate = batch[audio_column_name]["sampling_rate"]
            audio_length = (
                len(batch[audio_column_name]["array"].squeeze()) / sample_rate
            )

        speaking_rate = len(phonemes) / audio_length

        batch["speaking_rate"] = speaking_rate
        batch["phonemes"] = phonemes

    return batch
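One note on the `speech_duration` fallback above: substituting 0.01 s for a zero duration avoids a ZeroDivisionError but produces an extreme speaking rate; filtering such rows out may be preferable. The guard on its own (here written as a clamp, slightly stricter than the zero check above):

```python
MIN_DURATION = 0.01  # seconds; floor that prevents division by zero

def safe_speaking_rate(num_phonemes: int, duration_s: float) -> float:
    # Clamp tiny/zero durations so empty or mislabeled clips don't crash
    # the mapping; consider dropping these rows instead of keeping their
    # inflated rates.
    return num_phonemes / max(duration_s, MIN_DURATION)
```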
