This article examines pitch accent, an issue that significantly impacts the quality of Japanese TTS, and proposes a multi-layered evaluation method combining MOS, automated G2P tools, and manually labeled minimal-pair datasets. It also discusses the balance between licensing, quality, and hardware when operating TTS on-premise.
Publication Date: May 2026
The landscape of the TTS field changes every six months. The technical judgments in this article are as of the publication date and will be updated regularly.
--
Introduction
In recent years, LLM-based text-to-speech (TTS) models have evolved rapidly. ElevenLabs, Google Cloud TTS, Azure Speech—all of them can now generate remarkably natural-sounding speech in English and multiple languages.
However, anyone who regularly uses synthesized Japanese speech has likely experienced this at some point: "Why does this speech sound somehow unnatural?" "The pronunciation is correct, but the meaning seems different..."
The culprit for this is often pitch accent.
This article will organize this pitch accent issue, which significantly impacts the quality of Japanese TTS, and share methods for quantitatively evaluating whether speech is sufficiently accurate from the perspective of SolanaLink's practical experience.
--
What is Pitch Accent and Why is it Important?
Same Sound, Different Meaning
Japanese (standard Japanese based on the Tokyo dialect) has a system called pitch accent. The pattern of "high" and "low" pitch distinguishes the meaning of words.
Here are some representative minimal pairs:
| Word | Accent Pattern | Meaning |
|---|---|---|
| 雨 | HL (atamadaka, initial-high) | rain |
| 飴 | LH (heiban, flat) | candy |
| 橋 | LH | bridge |
| 箸 | HL | chopsticks |
| 神 | HL | god |
| 紙 | LH | paper |
The phonemes (consonants and vowels) are identical; only the pitch movement differs, yet that movement alone determines which word is meant. This is a defining characteristic of Japanese.
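For Tokyo-type accent, the H/L pattern of a word in isolation follows mechanically from two numbers: the accent type (the mora after which pitch falls, with 0 meaning no fall) and the mora count. A minimal sketch of that rule (the function name and code are ours, not from any library):

```python
def pitch_pattern(accent_type: int, mora_count: int) -> str:
    """Return the H/L pattern of a Tokyo-type accent word in isolation.

    accent_type: mora index of the pitch fall (0 = heiban, no fall).
    mora_count:  number of morae in the word.
    """
    if accent_type == 0:
        # Heiban (flat): low on the first mora, high thereafter.
        return "L" + "H" * (mora_count - 1)
    if accent_type == 1:
        # Atamadaka (initial-high): high on the first mora, low thereafter.
        return "H" + "L" * (mora_count - 1)
    # Nakadaka/odaka: low start, high up to the accent nucleus, then low.
    return "L" + "H" * (accent_type - 1) + "L" * (mora_count - accent_type)

print(pitch_pattern(1, 2))  # 雨 (rain, type 1)  -> HL
print(pitch_pattern(0, 2))  # 飴 (candy, type 0) -> LH
```

Two morae and two accent types already yield the rain/candy contrast from the table above.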
The Gap for English Speakers
English has stress accent, but no system of distinguishing words by pitch alone. As a result, when engineers whose native language is English evaluate a TTS model, they frequently fail to notice pitch accent errors.
The audio is intelligible, and the meaning can usually be recovered from context, so the error passes unnoticed. To a native Japanese listener, however, it sounds as if a different word was said. This perception gap is the biggest problem.
Why are errors common in general TTS models?
Many modern large-scale TTS models are trained on large amounts of multilingual speech data (English, Chinese, Spanish, and so on). Japanese makes up a relatively small share of that data, so per-word pitch accent patterns are sometimes not learned accurately.
As a result, the following phenomena occur:
- Where it should be "it's raining" (雨), it sounds like "candy is falling" (飴).
- "Crossing the bridge" (橋) sounds like "crossing chopsticks" (箸).
- The accent becomes unstable on proper nouns, loanwords, and technical terms.
In general-purpose voice navigation or text readout, these errors may register as merely "slightly odd." Depending on the application, however, they can be critical.
- Language learning materials (learners may memorize incorrect pronunciations)
- Customer service voice response (damages brand credibility)
- Business voices in fields such as medicine, law, and finance where misunderstandings are unacceptable
The Difficulty of Evaluation: How to Measure "Correctness"
Once we accept the importance of pitch accent, the next challenge is evaluation: how do we measure accent correctness?
What Simple MOS (Subjective Evaluation) Doesn't Show
The most widely used method for evaluating TTS quality is the Mean Opinion Score (MOS), a 5-point subjective evaluation. However, MOS measures overall naturalness; on its own it cannot tell you whether the accent is correct.
In particular, if the evaluator is not a native Japanese speaker, accent errors may hardly be reflected in the score. Simply "sounding fluent" can easily result in a score above 4.0.
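For reference, MOS is simply the mean of 1-5 ratings; reporting it with a confidence interval makes it clearer whether a difference between two systems is meaningful. A sketch (the rating values below are made up):

```python
import statistics

def mos_with_ci(ratings: list[int]) -> tuple[float, float]:
    """Return (mean opinion score, approximate 95% confidence half-width)."""
    mean = statistics.mean(ratings)
    # Normal approximation: 1.96 * standard error of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / len(ratings) ** 0.5
    return mean, half_width

score, ci = mos_with_ci([4, 5, 4, 3, 5, 4, 4, 5])
print(f"MOS = {score:.2f} +/- {ci:.2f}")
```

A MOS of 4.25 +/- 0.5 from eight ratings tells you very little about a 0.1-point gap, which is one more reason not to rely on MOS for fine-grained accent decisions.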
The "Self-Reflective Trap" of Comparison with Automated G2P Tools
The next method to consider is comparison with automated phoneme conversion (G2P) tools. In Japanese, pyopenjtalk is a typical example, returning a phoneme sequence with accent information when text is input.
Input: 雨が降る
pyopenjtalk output (simplified): a:HL me:LL ga:L fu:LH ru:LH
Analyzing TTS-generated speech and automatically checking whether it matches the accent pattern predicted by pyopenjtalk seems reasonable at first glance.
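Such a check could be sketched as a per-mora comparison between the pattern extracted from the TTS audio and the pattern the G2P predicts. Note the H/L strings below are illustrative; real pyopenjtalk output is full-context labels that require their own parsing step:

```python
def accent_agreement(tts_pattern: str, g2p_pattern: str) -> float:
    """Per-mora agreement rate between two H/L pattern strings."""
    if len(tts_pattern) != len(g2p_pattern):
        raise ValueError("patterns must cover the same number of morae")
    matches = sum(a == b for a, b in zip(tts_pattern, g2p_pattern))
    return matches / len(tts_pattern)

# Pattern measured from TTS audio vs. pattern predicted by the G2P tool.
print(accent_agreement("HLLLH", "HLLHH"))  # -> 0.8
```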
However, there is a structural problem here.
Since pyopenjtalk itself is a model, it contains errors. For words not in the dictionary, neologisms, and context-dependent accent changes (compound words, interactions with particles), pyopenjtalk's predictions are not always correct.
What happens then:
- A perfect TTS might happen to disagree with pyopenjtalk and be rated low.
- A TTS that faithfully replicates pyopenjtalk's errors is rated high.
This is the classic self-referential trap of evaluating model A with model B, and the resulting scores cannot be trusted.
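The trap is easy to demonstrate numerically. Below, toy data of our own construction: a human-labeled gold standard, a G2P with one error, a perfect TTS, and a TTS that copies the G2P's mistake. Scoring against the G2P inverts the ranking:

```python
gold     = ["HL", "LH", "LH", "HL"]  # human-labeled reference
g2p_pred = ["HL", "HL", "LH", "HL"]  # G2P prediction with one error
tts_a    = ["HL", "LH", "LH", "HL"]  # perfect TTS (matches gold)
tts_b    = ["HL", "HL", "LH", "HL"]  # TTS replicating the G2P error

def agreement(preds, refs):
    """Fraction of items where the two pattern lists agree."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

print(agreement(tts_a, g2p_pred))  # 0.75: the perfect TTS is penalized
print(agreement(tts_b, g2p_pred))  # 1.0:  the flawed TTS is rewarded
print(agreement(tts_a, gold))      # 1.0:  only gold labels rank correctly
```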
Practical Solution: Minimal Pairs Datasets with Manual Labeling
SolanaLink is considering the following approach to this problem:
- Conduct the core evaluation with manually labeled, undisputable minimal pairs.
  - Prepare 50-100 word pairs whose accent type any native Japanese speaker can identify immediately: rain/candy, bridge/chopsticks, god/paper, sake/salmon, and so on.
  - These labels function as the absolute ground truth.
- Use pyopenjtalk as a supplementary metric.
  - Leverage its ability to process large amounts of text quickly for regression testing ("has a previously passing case regressed?").
  - Do not use the agreement rate with pyopenjtalk as the sole passing criterion.
- Expand the evaluation set to match the target use case.
  - For language learning: frequently used beginner-to-intermediate vocabulary.
  - For business voice: industry-specific jargon and proper nouns.
  - This expansion process is itself what makes the project unique.
- Run MOS only at major monthly milestones.
  - Use five or more native Japanese evaluators.
  - Running MOS on every training iteration is impractical in both time and cost.
  - Day-to-day automated evaluation relies on pitch accent accuracy.
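The minimal-pair core of this plan can be harnessed in a few lines. Here `predict_pattern` stands in for whatever pipeline extracts an H/L pattern from the TTS output; the dataset entries mirror the table earlier in the article (a sketch; the names and stub predictor are ours):

```python
# Manually labeled minimal pairs: (word, gold H/L pattern).
MINIMAL_PAIRS = [
    ("雨", "HL"), ("飴", "LH"),   # rain / candy
    ("橋", "LH"), ("箸", "HL"),   # bridge / chopsticks
    ("神", "HL"), ("紙", "LH"),   # god / paper
]

def accent_accuracy(predict_pattern, dataset=MINIMAL_PAIRS) -> float:
    """Fraction of words whose predicted H/L pattern matches the gold label."""
    correct = sum(predict_pattern(word) == gold for word, gold in dataset)
    return correct / len(dataset)

# Stub predictor for illustration: always guesses "HL".
print(accent_accuracy(lambda word: "HL"))  # -> 0.5
```

Because the gold labels are fixed by hand, this number can be tracked after every training run, unlike MOS.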
Why the Local/On-Premise Approach Is Emerging
Here is another concern we frequently hear from SolanaLink customers: "We don't want to send our text and voice data to a cloud-based TTS."
- We want to create synthesized speech based on customer interaction logs, but we don't want to entrust voice data to a third party.
- Sending text containing specialized medical and legal terminology to the cloud is difficult due to internal company policy.
- Per-character pricing makes long-term operating costs hard to predict.
To address these needs, a TTS that operates on-premise or in edge environments—configurations utilizing recent open-source LLM-TTS models—is becoming a realistic option.
However, on-premise TTS has the following unique challenges:
- License: Model code and weights (trained parameters) are often licensed separately, requiring careful verification of commercial use rights.
- Japanese Language Quality: Globally-oriented open-source models often have inconsistent quality in Japanese—especially pitch accent.
- Hardware: Optimal configurations vary depending on the operating environment, including Apple Silicon, NVIDIA GPUs, and CPU-only systems.
While this article focuses on pitch accent, these issues are interconnected. The result is a multi-dimensional optimization problem: operating a TTS that produces accurate Japanese (including pitch accent), under a commercially permissible license, at an acceptable hardware cost.
Summary
- A frequently overlooked element that significantly impacts the quality of Japanese TTS is pitch accent.
- Neither MOS alone nor agreement rates with automated G2P tools are sufficient to evaluate accent accuracy.
- In practice, a multi-layered evaluation is realistic: a manually labeled minimal-pair dataset at the core, pyopenjtalk for regression testing, and MOS as a monthly milestone.
- As demand for on-premise/local TTS increases, a design that balances licensing, Japanese language quality, and hardware becomes crucial.
Information from SolanaLink
SolanaLink provides support for the implementation of speech synthesis, primarily targeting Japanese, and for building custom models. In particular, we offer consultation in the following areas:
- Selection of OSS TTS models with commercially usable licenses
- Quantitative evaluation design for pitch accent quality
- TTS operation in on-premise/hybrid environments
- Development of industry-specific pronunciation dictionaries (medical, legal, educational, financial, etc.)
We welcome inquiries regarding internal PoC considerations, evaluation of switching from existing TTS systems, and the development of your own branded voices. Please feel free to contact us via the channels below.
- Inquiries: info@solanalink.jp
- Company profile: https://solanalink.jp
Note: Information Validity Period
The technical judgments (evaluation methods, status of OSS models, etc.) mentioned in this article are as of May 2026. The TTS field is rapidly changing, and the assumptions may change in six months. Before making any important decisions, please check the latest information or contact us.
References
- For the linguistic background of Japanese pitch accent, see Fumio Koizumi, The Sounds of Japan, and the studies of Tadanobu Tsunoda.
- pyopenjtalk (Japanese G2P and accent extraction library): https://github.com/r9y9/pyopenjtalk
- JSUT corpus (Japanese TTS research speech database): https://sites.google.com/site/shinnosuketakamichi/publication/jsut
- JVS corpus (multi-speaker Japanese speech database): https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus
This article was written by Tony, an engineer at SolanaLink. Comments and questions are welcome.
