All progress updates #23

Closed
jpc opened this issue Jul 18, 2023 · 0 comments

jpc commented Jul 18, 2023

Progress updates (newest first):

2023-12-10

Another trio of models, this time with support for multiple languages (English and Polish). Here are two new samples as a sneak peek. You can check out our Colab to try it yourself!

English speech, female voice (transferred from a Polish language dataset):

whisperspeech-sample.mp4

A Polish sample, male voice:

whisperspeech-sample-pl.mp4

2023-07-14

We have trained a new pair of models, added support for multiple speakers, and integrated the Vocos vocoder to deliver a big overall quality boost. And this is not our last word: we are doing hyperparameter tuning so we can train bigger, higher-quality models.
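
For context, Vocos replaces EnCodec's own decoder when turning the predicted acoustic tokens back into audio. A rough sketch following the Vocos README (the token shape and bandwidth below are assumptions for illustration, not the exact values our pipeline uses):

```python
import torch
from vocos import Vocos

# Vocos vocoder trained to reconstruct 24 kHz audio from EnCodec tokens.
vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")

# Placeholder EnCodec token ids, shape (n_quantizers, n_frames);
# a real run would plug in the S->A model's output here.
codes = torch.randint(0, 1024, (4, 750))

features = vocos.codes_to_features(codes)
bandwidth_id = torch.tensor([1])          # 0/1/2/3 -> 1.5/3/6/12 kbps
audio = vocos.decode(features, bandwidth_id=bandwidth_id)  # (1, n_samples)
```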

An end-to-end generation example, inspired by one famous president's speech (don't forget to unmute the videos):

Female voice:

we-choose-tts.mp4

Male voice:

we-choose-tts-s467.mp4

We have streamlined the inference pipeline, and you can now test the model yourself on Google Colab.
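
If you prefer to run it locally instead of in the Colab, the package exposes a small Pipeline helper; a minimal sketch assuming the interface shown in the repository README (defaults and method names may have changed since this update):

```python
# Minimal local-inference sketch; see the Colab for the full walkthrough.
from whisperspeech.pipeline import Pipeline

pipe = Pipeline()  # downloads the default T->S and S->A checkpoints
pipe.generate_to_file("we-choose.wav", "We choose to go to the Moon in this decade.")
```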

2023-04-13

We have trained a preliminary T->S model and a new 3 kbps S->A model which improves the speech quality. Both models are still far from perfect, but we are clearly moving in the right direction (to the moon 🚀🌖!).

End-to-end TTS model with ≈ 6% WER (both T->S and S->A sampled with simple multinomial sampling at T = 0.7, no beam search); see #9 for more details:

(don't forget to unmute the video)

test-e2e-jfk-T0.7.mp4

Ground truth:

we-choose.mp4
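
The "simple multinomial sampling at T = 0.7" mentioned above just means dividing the logits by the temperature and drawing each token from the resulting distribution, with no beam search; a minimal PyTorch sketch (the function name is only illustrative):

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.7) -> torch.Tensor:
    """Draw one token id per batch element from temperature-scaled logits.

    logits: (batch, vocab_size) unnormalized next-token scores.
    """
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # shape (batch, 1)
```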

2023-04-03

We have trained a working S->A model. It does not sound amazing yet, but that is mostly due to EnCodec quality at 1.5 kbps.
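
For reference, 1.5 kbps is the lowest bandwidth setting of the 24 kHz EnCodec model (just two codebooks per frame), which is what limits the audio quality here. A quick sketch of encoding and decoding at that setting with the encodec package (the input file name is a placeholder):

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(1.5)   # 1.5 kbps -> 2 codebooks per 75 Hz frame

wav, sr = torchaudio.load("speech.wav")  # placeholder input file
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    frames = model.encode(wav)     # list of (codes, scale) tuples
    recon = model.decode(frames)   # lossy 1.5 kbps reconstruction
```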

Validation set ground truth (don't forget to unmute):

ground-truth.mov

The generated output from the S->A model (multinomial sampling, temperature 0.8):

saar-1300hr-2l-20e-T0.8.mov

jpc pinned this issue Jul 18, 2023
collabora locked and limited conversation to collaborators Jul 18, 2023
jpc converted this issue into discussion #37 Jan 9, 2024
jpc unpinned this issue Jan 14, 2024
