Business 01
We do not simply tag emotions or classify atmospheres after the fact. We treat non-verbal information as primary input — equivalent to language — designing and generating it as structured datasets with temporal and relational integrity before any recording begins.
The non-verbal elements we address include:
Pauses, hesitations, breath, backchannels, and interruption timing in speech
Emotional transitions — not categories, but changes over time
Gaze, posture, micro-gestures, and gesture-speech synchronization
Spatial distance, ambient sound, reverberation, occlusion, and other environment-derived perceptual conditions
This domain cannot be addressed by post-processing existing data. The ability to generate new data according to requirements is itself the value.
In designing speech and language systems, we have worked with over 200 professional performers — actors, voice actors, announcers, and narrators — each with distinct speech styles, performance techniques, and vocal characteristics. Through this collaboration, we have structured non-verbal elements including pauses, intonation patterns, and emotional transitions.
This is not a claim about data volume. It is part of a design process for verifying the conditions under which particular expressions cease to function.
1.1 Paralinguistics — Non-Verbal in Speech & Audio
Pause / Silence: Types include thinking, hesitation, emphasis, timing
Hesitation / Disfluency: Restarts, fillers, self-corrections
Breath & Voicing: Inhalation, breath pauses, laughing breath, sighs, voice onset
Prosodic Dynamics: Continuous variation in pitch, intensity, and tempo
Backchannels: Listener responses and their timing
Turn-taking Signals: Cues for turn exchange, interruption, and yielding
The critical point is representing these not as isolated labels, but as changes along a timeline.
1.2 Kinesics — Non-Verbal in Visual & Body Movement
Gaze: Direction, fixation, aversion, tracking
Micro-gestures: Subtle nods, head shakes, finger and shoulder movements
Posture Shifts: Changes in body position, approach/withdrawal
Gesture-Speech Alignment: Synchronization and desynchronization between gesture and speech
For robotics and interaction AI, this is where implementation success or failure is determined.
1.3 Contextual & Spatial Cues — Environment-Derived Non-Verbal
Distance & Orientation: Interpersonal distance, facing direction, occlusion
Ambient Conditions: Environmental noise, reverberation, reflection, crowding
Perception Constraints: Lighting, visibility, field of view limitations, audio masking
These are not background — they are preconditions for perception. Datasets must explicitly define and reproduce these conditions.
Principle
The value of non-verbal data is determined not by the richness of classification taxonomies, but by the format of representation.
2.1 Temporal Intervals
Interval representation with start_time / end_time. Examples: hesitation intervals, silence intervals, backchannel intervals.
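As a minimal sketch, such an interval can be represented as a typed record; the field names here (`label`, `start_time`, `end_time`) are illustrative, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    """A non-verbal event spanning a stretch of the session timeline."""
    label: str          # e.g. "hesitation", "silence", "backchannel"
    start_time: float   # seconds from session start
    end_time: float

    @property
    def duration(self) -> float:
        return self.end_time - self.start_time

    def overlaps(self, other: "Interval") -> bool:
        # Two intervals overlap if neither ends before the other begins.
        return self.start_time < other.end_time and other.start_time < self.end_time
```

An `overlaps` check like this is what later makes relational queries (overlap, interruption) possible on top of plain interval records.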
2.2 Continuous Signals
Continuous quantities such as pitch, energy, and tempo. Emotional states are treated as trajectories of state variables, not reduced to word labels.
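A trajectory of state variables can be kept as sampled points and read out at any time by interpolation. This is a sketch assuming a two-dimensional arousal/valence state; the sampling scheme and dimensions are illustrative:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StateSample:
    """One sample of a continuous emotional-state trajectory."""
    time: float     # seconds
    arousal: float  # e.g. in [-1, 1]
    valence: float  # e.g. in [-1, 1]

def interpolate(traj: List[StateSample], t: float) -> StateSample:
    """Linearly interpolate the trajectory at time t (t must lie within it)."""
    for a, b in zip(traj, traj[1:]):
        if a.time <= t <= b.time:
            w = (t - a.time) / (b.time - a.time)
            return StateSample(t,
                               a.arousal + w * (b.arousal - a.arousal),
                               a.valence + w * (b.valence - a.valence))
    raise ValueError("t outside trajectory")
```

The point of the representation is that the state at any moment is derivable, rather than snapping the whole segment to a single word label.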
2.3 Relational Structure
Relationships between speech turns (interruption, overlap, yielding)
Correspondence between gaze and speech target
Correspondence between gesture and semantic unit
The guiding question: "When, to what, and how did they respond?"
Only by structuring data with both time and relations can it become reusable for dialogue AI and robotics.
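One way to hold both time and relations is a small graph of events with typed, directed links between them. This is a sketch under assumed names (`responds_to` as one relation type among several):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Event:
    event_id: str
    kind: str        # "utterance", "gaze", "gesture", ...
    start: float     # seconds
    end: float

@dataclass
class RelationGraph:
    """Events plus typed relations between them: who responded, to what, and how."""
    events: Dict[str, Event] = field(default_factory=dict)
    relations: List[Tuple[str, str, str]] = field(default_factory=list)  # (src, relation, dst)

    def add(self, e: Event) -> None:
        self.events[e.event_id] = e

    def relate(self, src: str, relation: str, dst: str) -> None:
        self.relations.append((src, relation, dst))

    def responses_to(self, dst: str) -> List[str]:
        # All events annotated as responding to a given event.
        return [s for s, r, d in self.relations if d == dst and r == "responds_to"]
```

With this shape, "which backchannels answered which utterance" is a query over the data, not a re-annotation effort.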
Because post-hoc processing cannot substitute for intentional design, this phase is the core of our work.
3.1 Requirements Definition — What Must Be Decided First
Target System: Conversational agent, robot, multimodal model, evaluation system
Target Task: Turn-taking, backchannel generation, emotion transition estimation, interpersonal distance control
Recording Format: In-person / remote, 1-on-1 / multi-party, quiet / noisy
Variable vs. Fixed Elements: e.g., fix speaker attributes, vary environmental noise in stages
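Requirements of this kind can be pinned down as a machine-checkable record before recording starts. A minimal sketch with illustrative field names, including one consistency check (an element cannot be both fixed and varied):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RecordingRequirements:
    """Decisions fixed before any session is scheduled (illustrative fields)."""
    target_system: str        # e.g. "conversational agent"
    target_tasks: List[str]   # e.g. ["turn-taking", "backchannel generation"]
    recording_format: str     # e.g. "remote, 1-on-1, quiet"
    fixed: List[str] = field(default_factory=list)   # held constant across sessions
    varied: List[str] = field(default_factory=list)  # varied in controlled stages

    def validate(self) -> None:
        # An element declared both fixed and varied is a contradiction.
        clash = set(self.fixed) & set(self.varied)
        if clash:
            raise ValueError(f"fixed and varied overlap: {sorted(clash)}")
```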
3.2 Session Design — The Skeleton for Reproducibility
Scenario (condition definition, not script)
Turn count, silence insertion conditions, interruption induction conditions
Emotion transition design (e.g., calm → hesitation → acceptance)
Speech-gaze-posture synchronization conditions (synchronized vs. intentionally offset)
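A session skeleton along these lines can be written as a condition record rather than a script. The fields below are a hypothetical sketch (one emotion plan, one offset parameter), not our full scenario format:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SessionScenario:
    """Condition definition for one session -- a skeleton, not a script."""
    turn_count: int
    emotion_plan: List[str]      # e.g. ["calm", "hesitation", "acceptance"]
    min_silence_s: float         # silence to be induced between stages
    gaze_speech_offset_s: float  # 0.0 = synchronized; nonzero = intentional offset

    def transitions(self) -> List[Tuple[str, str]]:
        """The planned emotion transitions, e.g. [("calm", "hesitation"), ...]."""
        return list(zip(self.emotion_plan, self.emotion_plan[1:]))
```

Because the transitions are derived from the plan, the same scenario can be re-run and later audited against what was actually recorded.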
3.3 Measurement & Recording — Minimum Required Logs
Audio: Sampling rate, microphone conditions, noise profile
Video: Frame rate, field of view, fixed/moving, occlusion conditions
Environment: Distance, room conditions, reverberation index, crowding level
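The minimum log can likewise be a flat record per session. Field names and the 44.1 kHz floor below are illustrative assumptions, not fixed requirements:

```python
from dataclasses import dataclass

@dataclass
class CaptureLog:
    """Minimum recording conditions logged per session (illustrative)."""
    audio_sample_rate_hz: int      # e.g. 48000
    mic_setup: str                 # e.g. "lavalier, 30 cm"
    noise_floor_dba: float
    video_fps: float
    camera_fixed: bool
    interpersonal_distance_m: float
    rt60_s: float                  # reverberation time

    def audio_ok(self, min_rate: int = 44100) -> bool:
        # Example acceptance check against a project-defined floor.
        return self.audio_sample_rate_hz >= min_rate
```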
Converting human understanding into structure. Non-verbal data may appear ambiguous, but it becomes reproducible when proceduralized.
4.1 Label Schema Design
Minimum necessary categories + continuous quantities
Prioritize "events" and "transitions" over "emotion labels"
Event examples: pause, backchannel, overlap, hesitation
Dynamics: arousal/valence as continuous values (when needed)
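A schema of this shape can be enforced by a small validator: a closed event vocabulary plus optional continuous dynamics in a bounded range. The label set and the [-1, 1] range are assumptions taken from the examples above:

```python
EVENT_LABELS = {"pause", "backchannel", "overlap", "hesitation"}

def validate_record(rec: dict) -> bool:
    """Check one annotation record against the minimal schema:
    a known event label, plus optional bounded dynamics."""
    if rec.get("event") not in EVENT_LABELS:
        return False
    for key in ("arousal", "valence"):
        if key in rec and not -1.0 <= rec[key] <= 1.0:
            return False
    return True
```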
4.2 Annotation Guidelines
Fix judgment criteria in written form
Unify boundary marking (start/end)
Priority rules for multi-party cases (whose backchannel, whose gaze target)
4.3 Quality Control
Double annotation + agreement rate measurement (interval agreement, event agreement)
Boundary tolerance (e.g., ±200ms) set according to task
Redefinition and consolidation of low-agreement labels (label revision is part of quality)
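Event agreement under a boundary tolerance can be computed along these lines. This is a simplified greedy one-to-one matcher, not a full agreement statistic such as Cohen's kappa; intervals are assumed to be `(start_s, end_s)` pairs:

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_s, end_s)

def event_agreement(a: List[Span], b: List[Span], tol: float = 0.2) -> float:
    """Fraction of annotator A's events matched by some event of B whose
    start and end both lie within +/- tol seconds (greedy one-to-one)."""
    unmatched = list(b)
    hits = 0
    for s, e in a:
        for cand in unmatched:
            if abs(cand[0] - s) <= tol and abs(cand[1] - e) <= tol:
                unmatched.remove(cand)  # each B event matches at most once
                hits += 1
                break
    return hits / len(a) if a else 1.0
```

The default `tol=0.2` corresponds to the ±200 ms example above; in practice the tolerance is set per task.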
5. Metadata Design — "Usable" Conditions That Drive Contracts
For non-verbal data, knowing "what happened" is not enough — "why it happened" and "under what conditions" are critical.
Session Conditions: Environment, number of participants, distance, noise, target task
Speaker Attributes: Age range, region, speech characteristics (as needed)
Recording Conditions: Device, settings, synchronization information
Rights Conditions: Usage scope, reuse permission, derivative permission
This metadata enables research, retraining, auditing, and regeneration.
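Completeness of such metadata is easy to check mechanically. The required fields below are a sketch drawn from the categories above; real projects would extend them:

```python
REQUIRED_METADATA = {
    "session":   {"environment", "participants", "distance_m", "noise", "target_task"},
    "recording": {"device", "settings", "sync"},
    "rights":    {"usage_scope", "reuse", "derivatives"},
}

def missing_fields(meta: dict) -> list:
    """List 'block.field' entries absent from a metadata record."""
    missing = []
    for block, fields in REQUIRED_METADATA.items():
        present = set(meta.get(block, {}))
        missing += [f"{block}.{f}" for f in sorted(fields - present)]
    return missing
```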
6. Rights & Compliance — Especially Critical for Non-Verbal
Non-verbal data tends to be more personally identifiable than speech alone. We build compliance into the recording premise, not as an afterthought.
Participant Consent: Usage scope, duration, derivatives, and reuse explicitly documented
Data Separation: Data for different purposes not mixed
Traceability: Sample → session → consent condition linkage (for research and EU contexts, "explainable after the fact")
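The sample → session → consent chain can be made mechanically resolvable, so the terms for any individual sample are explainable after the fact. A minimal sketch with assumed identifiers:

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class Consent:
    consent_id: str
    usage_scope: str     # e.g. "research-only"
    reuse_allowed: bool

@dataclass
class Session:
    session_id: str
    consent_id: str

@dataclass
class Sample:
    sample_id: str
    session_id: str

def consent_for(sample: Sample,
                sessions: Dict[str, Session],
                consents: Dict[str, Consent]) -> Consent:
    """Resolve sample -> session -> consent condition."""
    return consents[sessions[sample.session_id].consent_id]
```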
Deliverables are customized per project but typically include the following combination:
Raw Data: Audio / video / sensor (required formats)
Annotation: Intervals, events, continuous quantities (agreed format)
Metadata: Conditions, attributes, recording information
Documentation: Specification (requirements, design intent, assumptions), annotation guide, QC report, rights and usage summary
"Data + documentation" as a set. Without this, data does not last — neither in research nor in commercial use. This is our premise.
Since design is the critical factor for non-verbal data, the following approach minimizes risk:
1. Requirements alignment (target task and evaluation criteria)
2. Small-scale PoC (validation of label schema and procedures)
3. QC criteria confirmation (tolerance thresholds, agreement rate standards)
4. Parallel execution for scale
5. Delivery specification fixed with future expansion and regeneration in mind
This approach avoids large-scale failures from the start while building a structure that can scale.
The differentiation of this business is not "we can also do non-verbal." It is that the following are simultaneously true:
Requirements-Stage Design
Non-verbal data can be designed from the requirements stage
Learnable Structure
Represented as temporal and relational structure that can be learned
Cross-Modal Synchronization
Created synchronized with speech, language, vision, and space
Long-Term Durability
Recording infrastructure and rights design withstand research, commercial use, and future reuse