Business 03
We partner with trusted Japanese production studios, providing access to 200+ professional performers — actors, voice actors, announcers, and narrators. Emotion designed as continuous structure, not categorical labels. Studio-grade quality with full rights clearance for commercial AI training.
Most speech data fails in AI applications for reasons that converge on the same root causes:
Recording conditions cannot be reproduced
Expression varies unpredictably across sessions
Emotion and prosody are incidental
Interpretation of instructions varies between speakers
Material cannot be re-recorded later under changed conditions
These problems are not a matter of raw speaking ability; they come down to differences in professional capability.
Professional announcers, actors, and voice artists can:
Instantly translate instructions into vocal expression
Reproduce the same conditions repeatedly
Intentionally produce fine-grained variations
This capability makes speech treatable as a controllable variable.
2.1 Speaker Attribute Design
For each project, we pre-specify:
Age Range: e.g., late 20s, early 50s
Gender & Voice Quality
Regional Background: Presence or absence of dialect influence
Speech Characteristics: Clarity, tempo, pitch range, prosodic range
This is not "casting" — it is data design.
2.2 Speaker Assignment Design
Cases requiring the same speaker to be kept fixed over the long term
Cases requiring speaker rotation under aligned conditions
Cases varying only age or gender
Composition is structured according to comparison and learning objectives.
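As a rough illustration, the attributes listed in 2.1 could be captured in a fixed record, which makes the "vary only age or gender" case in 2.2 checkable. This is a minimal sketch, not M9 STUDIO's actual schema: `SpeakerSpec`, `differing_fields`, and every field name and value are hypothetical.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class SpeakerSpec:
    """Hypothetical pre-recording speaker specification; field names are illustrative."""
    age_range: str           # e.g. "late 20s", "early 50s"
    gender: str
    voice_quality: str
    dialect_influence: bool  # presence or absence of dialect influence
    clarity: str
    tempo: str
    pitch_range: str
    prosodic_range: str


def differing_fields(a: SpeakerSpec, b: SpeakerSpec) -> list[str]:
    """Return the names of the attributes on which two specs differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


base = SpeakerSpec(
    age_range="late 20s", gender="female", voice_quality="bright",
    dialect_influence=False, clarity="high", tempo="medium",
    pitch_range="wide", prosodic_range="wide",
)
# A comparison speaker that varies only in age, as in the third case of 2.2.
older = replace(base, age_range="early 50s")
print(differing_fields(base, older))  # ['age_range']
```

Checking that two specs differ in exactly one field is one way to confirm that a comparison set isolates the intended variable.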
Principle
Speech is controlled as an engineering target.
3.1 Session Design
Each session explicitly designs:
Speaking Rate: Perceived tempo, not BPM
Pauses: Types and durations
Prosodic Range: Intonation variation
Emotional Arc: Start point, end point, and transition
Backchannels & Interruptions: Presence or absence
Critical: We do not specify emotions by "type."
Instead of "anger"
Calm → Discomfort → Frustration → Suppressed anger
Instead of "joy"
Surprise → Relief → Acceptance
Emotions are designed as continuous transitions, not categorical labels.
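The session parameters in 3.1 can be pictured as a small configuration object in which the emotional arc is an ordered sequence rather than a single label. The sketch below is an assumption for illustration only: `SessionDesign`, its field names, the example values, and the choice of seconds for pause durations are not taken from the source.

```python
from dataclasses import dataclass


@dataclass
class SessionDesign:
    """Hypothetical session design record; all names and values are illustrative."""
    speaking_rate: str          # perceived tempo, e.g. "unhurried"
    pauses: dict[str, float]    # pause type -> duration in seconds (assumed unit)
    prosodic_range: str         # intonation variation, e.g. "narrow"
    emotional_arc: list[str]    # ordered from start point to end point
    backchannels: bool          # presence or absence
    interruptions: bool         # presence or absence

    def transitions(self) -> list[tuple[str, str]]:
        """Pair each state in the arc with the state that follows it."""
        return list(zip(self.emotional_arc, self.emotional_arc[1:]))


session = SessionDesign(
    speaking_rate="unhurried",
    pauses={"breath": 0.4, "hesitation": 0.8},
    prosodic_range="narrow",
    emotional_arc=["calm", "discomfort", "frustration", "suppressed anger"],
    backchannels=False,
    interruptions=False,
)
for start, end in session.transitions():
    print(f"{start} -> {end}")
```

Representing the arc as a list makes the start point, end point, and each intermediate transition explicit, which a single "anger" label cannot do.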
3.2 Instruction Method — Professional Premise
Professional speakers can receive mixed instructions in:
Emotional terminology
Performance terminology
Acoustic engineering terminology
This makes it possible to record the same text with different non-verbal expressions, and to reproduce those differences on demand.
Non-verbal elements handled in professional speech data:
Breath & Breathing: Inhalation, breath pauses
Hesitation Sounds: Fillers, pauses, false starts
Backchannel Types & Timing
Sentence-Final Processing: Assertive vs. implied
Voice Onset & Decay
None of these are left to chance.
Each is produced under an explicit instruction, a defined condition, and a reproducibility requirement.
5.1 Why We Avoid Simple Labels
We do not use single-label emotion annotations such as "anger" or "sadness".
Reasons:
Human emotions change over time
In speech, they overlap
Used as training targets, single labels cause overfitting
5.2 Structure We Adopt
Temporal Intervals: start / end
Continuous Quantities: Intensity, tension, etc.
Correspondence with Utterance Units
This enables learning not of "emotional states" but of expressive processes.
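One way to picture the structure in 5.2 is an annotation segment that carries a time interval, a continuous quantity, and a link to its utterance. The following is a sketch under stated assumptions: `EmotionSegment`, the 0-to-1 intensity scale, and linear interpolation between the segment's endpoints are illustrative choices, not the annotation format the source describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmotionSegment:
    """Hypothetical annotation segment; names and the 0..1 scale are assumptions."""
    utterance_id: str        # correspondence with an utterance unit
    start: float             # interval start, in seconds
    end: float               # interval end, in seconds
    intensity_start: float   # intensity at the start of the interval
    intensity_end: float     # intensity at the end of the interval

    def intensity_at(self, t: float) -> float:
        """Linearly interpolate intensity at time t (an assumed model of change)."""
        if not self.start <= t <= self.end:
            raise ValueError(f"t={t} is outside [{self.start}, {self.end}]")
        if self.end == self.start:
            return self.intensity_start
        frac = (t - self.start) / (self.end - self.start)
        return self.intensity_start + frac * (self.intensity_end - self.intensity_start)


segment = EmotionSegment("utt_001", start=0.0, end=2.0,
                         intensity_start=0.25, intensity_end=0.75)
print(segment.intensity_at(1.0))  # 0.5
```

Because the segment stores how a quantity moves across an interval, a model trained on it sees a process rather than a static state.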
Professional speech data becomes entirely unusable in the future if rights design is weak.
M9 STUDIO requires:
Usage definition before recording
Explicit scope, region, and duration
Written documentation of reuse, derivatives, and retraining permissions
Linkage between consent content and audio samples
This creates a structure that holds up under EU research use, international commercial use, and future model updates.
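The linkage between consent content and audio samples can be sketched as a lookup from each sample to the consent record that governs it. Everything here is hypothetical: `ConsentRecord`, `may_retrain`, the identifiers, the region codes, and the dates are illustrative assumptions, and the check is a simplification rather than legal logic.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ConsentRecord:
    """Hypothetical consent record defined before recording; names are illustrative."""
    consent_id: str
    usage: str                  # usage definition
    regions: frozenset[str]     # explicit region scope
    valid_from: date            # explicit duration
    valid_until: date
    allows_reuse: bool
    allows_derivatives: bool
    allows_retraining: bool


def may_retrain(sample_to_consent: dict[str, str],
                consents: dict[str, ConsentRecord],
                sample_id: str, region: str, on: date) -> bool:
    """Check whether a sample's linked consent permits retraining in a region on a date."""
    consent = consents[sample_to_consent[sample_id]]
    return bool(
        consent.allows_retraining
        and region in consent.regions
        and consent.valid_from <= on <= consent.valid_until
    )


consents = {
    "c-001": ConsentRecord(
        consent_id="c-001", usage="commercial AI training",
        regions=frozenset({"JP", "EU"}),
        valid_from=date(2025, 1, 1), valid_until=date(2030, 12, 31),
        allows_reuse=True, allows_derivatives=True, allows_retraining=True,
    )
}
sample_to_consent = {"utt_001.wav": "c-001"}
print(may_retrain(sample_to_consent, consents, "utt_001.wav", "EU", date(2027, 6, 1)))  # True
```

Keeping the mapping at the sample level means a later model update can be checked against the original consent terms file by file.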
Deliverables vary by project, but typically include:
Audio Data: RAW / normalized
Non-Verbal Annotation
Metadata: Speaker attributes, conditions
Session Design Document
Annotation Guide
QC Report
Rights & Usage Summary
We never deliver audio files alone.
Clients choose this service for the following applications:
High-Quality TTS (especially emotion control)
Dialogue AI (turn-taking, backchannels)
Avatars & AI Characters
Robot Speech I/O
Emotion Understanding & Speech Behavior Analysis
The common thread: "We want to use the data we create for years."
Through partnerships with trusted Japanese production studios, we provide access to a diverse network of professional voice talent.
Voice Actors — Anime, game, and dubbing professionals
Actors — Stage and screen performers with vocal training
Announcers — Broadcast and corporate narration specialists
Narrators — Documentary, audiobook, and e-learning voices
Narration & Announcement: Professional broadcast quality
Customer Service: Polite, clear guidance voices
Character & Animation: Expressive performance styles
AI Assistant: Neutral, natural conversational tones
Emotional Expression: Joy, anger, sadness with intensity control
Regional Dialects: Kansai, Tohoku, and other regional varieties
Gender: Male, Female, Non-binary
Age Range: 20s through 70s
Languages: Japanese (native), with some bilingual performers
All performers are contracted through established production studios with proper rights management.
New sample library with 200+ performers launching soon.
Due to licensing and NDA requirements, voice samples are provided on request.
Can't wait? Contact us now for early access to voice samples matched to your requirements.
What We Need: Application type, voice style preferences, estimated data volume
What You Receive: Curated sample pack matched to your requirements
Response within 2 business days. NDA available upon request.
Controllable Speakers: Professional speakers treated as controllable variables
Re-Recording Capability: Can re-record under the same conditions any number of times
Continuous Emotion Design: Emotion designed as continuous structure, not categories
Non-Verbal Control: Non-verbal elements never left to chance
Long-Term Rights: Rights preserved for future use without degradation
M9 STUDIO's professional speech data business is not about collecting voices — it is about treating human expression as an engineering discipline.
This service is not suited to projects whose main requirement is low cost, fast turnaround, or high volume.
For teams that want to build speech AI that won't break in production, however, we can provide support from start to finish.