Scores
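The scores define metrics for evaluating the performance of control models in articulatory speech synthesis. Higher scores indicate better performance. Each subscore typically ranges between 0 and 100, where 0 corresponds to the baseline model's error and 100 to no error. A model that performs worse than the baseline receives a negative subscore.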
Groups of Scores
The benchmark distinguishes three main groups of scores:
Articulatory Scores
Measures quality of articulatory movements
Evaluates velocity and jerk distribution
Compares virtual tongue movement to EMA data
Compares virtual tongue height to ultrasound measurements
Semantic Scores
Evaluates how close the produced semantic embedding vector is to the target embedding
Assesses the rank of the target in word classification
Acoustic Scores
Compares synthesis and target audio recording
Evaluates loudness envelope and log-mel spectrograms
Future: f0 and formant transition scores
Score Calculation
Scores are calculated as follows:
Calculate error per token
Average errors across dataset
Normalize by baseline model’s average error
Subtract from 1 and multiply by 100
This ensures:
No error = score of 100
Baseline model error = score of 0
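The four steps above can be sketched as a single function. This is a minimal illustration rather than the benchmark's own code: the name `normalized_score` and its arguments are hypothetical, and the per-token errors are assumed to be computed already.

```python
import numpy as np


def normalized_score(errors, baseline_mean_error):
    """Map per-token errors to a score (hypothetical helper, not the benchmark API).

    No error gives 100, the baseline model's average error gives 0.
    """
    if baseline_mean_error <= 0:
        raise ValueError("baseline_mean_error must be positive")
    # Average the per-token errors across the dataset.
    mean_error = float(np.mean(errors))
    # Normalize by the baseline error, subtract from 1, multiply by 100.
    return 100.0 * (1.0 - mean_error / baseline_mean_error)
```

With a baseline mean error of 2.0, a model with a mean error of 1.0 would score 50 and a model with a mean error of 4.0 would score -100, so scores below 0 are possible when a model performs worse than the baseline.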
Total Score Formula
1. Articulatory Scoring
The score_articulatory function calculates an overall articulatory score by combining the results from three sub-scores:
Tongue Height
EMA (Electromagnetic Articulography)
Velocity and Jerk
1.1 score_tongue_height(data, task)
Calculates a score based on the mean RMSE difference between the predicted tongue height and the reference tongue height.
1.2 score_ema(data, task)
Calculates a score from the EMA data for the tongue tip (TT) and tongue body (TB), based on the mean RMSE between synthesized and reference EMA positions in the x, y, and z directions.
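As a rough illustration of the "mean RMSE" used by the tongue height and EMA sub-scores, the sketch below computes one RMSE per token and averages across tokens. The helper names `rmse` and `mean_rmse` are hypothetical, and it is an assumption that synthesized and reference trajectories are already time-aligned and have the same shape (for example `(n_frames, 3)` for x, y, z, or `(n_frames,)` for tongue height).

```python
import numpy as np


def rmse(predicted, reference):
    """Root-mean-square error between two equally shaped arrays."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if predicted.shape != reference.shape:
        raise ValueError("predicted and reference must have the same shape")
    return float(np.sqrt(np.mean((predicted - reference) ** 2)))


def mean_rmse(pairs):
    """Average the per-token RMSE over (predicted, reference) pairs."""
    return float(np.mean([rmse(pred, ref) for pred, ref in pairs]))
```

The resulting mean error would then be normalized against the baseline model's mean error as described under Score Calculation.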
1.3 score_vel_jerk(data, task)
Calculates a score based on the velocity and jerk of the cp-trajectories. The score is computed on a logarithmic scale and considers outliers by using the 99.9% quantile for the calculation.
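One way to read this description is sketched below. It is an assumption, not the benchmark's confirmed method, that velocity and jerk are taken as first and third finite differences along the time axis, that the 99.9% quantile of their absolute values serves as the summary statistic, and that the natural logarithm is applied afterwards. The function name `vel_jerk_summary` is hypothetical.

```python
import numpy as np


def vel_jerk_summary(cps, quantile=0.999, eps=1e-12):
    """Return (log velocity quantile, log jerk quantile) for one trajectory.

    cps: array of shape (n_frames,) or (n_frames, n_params); time is axis 0.
    """
    cps = np.asarray(cps, dtype=float)
    if cps.shape[0] < 4:
        raise ValueError("need at least 4 time steps to compute jerk")
    # First difference approximates velocity, third difference approximates jerk.
    velocity = np.diff(cps, n=1, axis=0)
    jerk = np.diff(cps, n=3, axis=0)
    # Summarize with a high quantile of the magnitudes instead of the maximum,
    # so a handful of extreme samples does not dominate the result.
    vel_q = np.quantile(np.abs(velocity), quantile)
    jerk_q = np.quantile(np.abs(jerk), quantile)
    # eps avoids log(0) for perfectly smooth trajectories.
    return float(np.log(vel_q + eps)), float(np.log(jerk_q + eps))
```

For a linear ramp, every first difference is identical and every third difference is zero, so the jerk term falls to the floor set by `eps`.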
2. Acoustic Scoring
The score_acoustic function evaluates the acoustic properties of the data by combining two sub-scores:
Loudness
Spectrogram
2.1 score_loudness(data, task)
Calculates a score based on the difference between the predicted loudness and the target loudness. Loudness is calculated every 220 samples over a 1024-sample window by summing all log-mel spectrogram entries in each time slice.
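The summation step can be illustrated as follows. This sketch assumes a log-mel spectrogram has already been computed with the stated hop and window and is stored with shape `(n_frames, n_mels)`. The function name `loudness_envelope` and the array layout are assumptions.

```python
import numpy as np


def loudness_envelope(log_mel):
    """Sum all mel-band entries of each time slice (hypothetical helper).

    log_mel: array of shape (n_frames, n_mels).
    Returns an array of shape (n_frames,).
    """
    log_mel = np.asarray(log_mel, dtype=float)
    if log_mel.ndim != 2:
        raise ValueError("log_mel must have shape (n_frames, n_mels)")
    return log_mel.sum(axis=1)
```

For a two-frame spectrogram `[[1, 2, 3], [4, 5, 6]]` this yields the envelope `[6, 15]`.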
2.2 score_spectrogram(data, task)
Calculates a score based on the difference between the predicted and the target log-mel spectrogram. The mel spectrogram uses 60 mel bands covering the frequency range from 10 to 12000 Hz, a time shift (hop) of 110 samples, and a Fourier transform window of 1024 samples.
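To make these parameters concrete, here is a NumPy-only sketch of a log-mel spectrogram with 60 bands, 10 to 12000 Hz, a hop of 110 samples, and a 1024-sample window. Several details are assumptions not stated in this documentation: the 44100 Hz sample rate, the Hann window, the HTK-style mel formula, triangular filters, no padding at the signal edges, and the decibel scaling. All function names are hypothetical.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft=1024, n_mels=60, fmin=10.0, fmax=12000.0):
    """Triangular filters, evenly spaced on the mel scale; shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    hz_points = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, center, hi = hz_points[i:i + 3]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def log_mel_spectrogram(signal, sr=44100, n_fft=1024, hop=110,
                        n_mels=60, fmin=10.0, fmax=12000.0):
    """Return a log-mel spectrogram in dB with shape (n_frames, n_mels)."""
    signal = np.asarray(signal, dtype=float)
    if signal.size < n_fft:
        raise ValueError("signal is shorter than one analysis window")
    # Without padding, frames start every `hop` samples while a full window fits.
    n_frames = 1 + (signal.size - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels, fmin, fmax).T
    # Small offset keeps the logarithm finite for silent or empty bands.
    return 10.0 * np.log10(mel + 1e-10)
```

Under the assumed 44100 Hz sample rate, a hop of 110 samples corresponds to roughly 2.5 ms between frames.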
3. Semantic Scoring
The score_semantic function evaluates the semantic properties of the data by combining two sub-scores:
Semantic Distance
Semantic Rank
3.1 score_sem_dist(data, task)
Calculates a score based on the semantic distance between the predicted semantic vector and the target semantic vector.
3.2 score_sem_rank(data, task)
Calculates a score based on the rank of the target vector within a set of 4311 reference vectors (which includes the target). The reference vectors are ranked from least to most distant by their Euclidean distance to the produced semantic vector.
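The ranking can be sketched as follows. The function name `semantic_rank`, the argument layout, and the tie handling (references at the same distance as the target do not push its rank down) are assumptions made for illustration.

```python
import numpy as np


def semantic_rank(produced, references, target_index):
    """Rank of the target among the references, where 1 means closest.

    produced: vector of shape (dim,).
    references: array of shape (n_references, dim), including the target.
    target_index: row of `references` that holds the target vector.
    """
    produced = np.asarray(produced, dtype=float)
    references = np.asarray(references, dtype=float)
    # Euclidean distance from the produced vector to every reference vector.
    distances = np.linalg.norm(references - produced, axis=1)
    # Rank is one plus the number of references strictly closer than the target.
    return int(1 + np.sum(distances < distances[target_index]))
```

A rank of 1 means the produced vector lies closer to the target than to any other reference vector. With the full reference set, the worst possible rank would be 4311.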