0. Definition

The non-verbal elements we address include:

Pauses, hesitations, breath, backchannels, and interruption timing in speech

Emotional transitions — not categories, but changes over time

Gaze, posture, micro-gestures, and gesture-speech synchronization

Spatial distance, ambient sound, reverberation, occlusion, and other environment-derived perceptual conditions

This domain cannot be addressed by post-processing existing data. The value lies in the ability to generate new data to specification.

In designing speech and language systems, we have worked with over 200 professional performers — actors, voice actors, announcers, and narrators — each with distinct speech styles, performance techniques, and vocal characteristics. Through this collaboration, we have structured non-verbal elements including pauses, intonation patterns, and emotional transitions.

This is not a claim about data volume. It is part of a design process to verify under which conditions certain expressions cease to function.

1. Data Coverage by Modality

1.1 Paralinguistics — Non-Verbal in Speech & Audio

Pause / Silence: Types include thinking, hesitation, emphasis, timing

Hesitation / Disfluency: Restarts, fillers, self-corrections

Breath & Voicing: Inhalation, breath pauses, laughing breath, sighs, voice onset

Prosodic Dynamics: Continuous variation in pitch, intensity, and tempo

Backchannels: Listener responses and their timing

Turn-taking Signals: Cues for turn exchange, interruption, and yielding

The critical point is representing these not as isolated labels, but as changes along a timeline.

1.2 Kinesics — Non-Verbal in Visual & Body Movement

Gaze: Direction, fixation, aversion, tracking

Micro-gestures: Subtle nods, head shakes, finger and shoulder movements

Posture Shifts: Changes in body position, approach/withdrawal

Gesture-Speech Alignment: Synchronization and desynchronization between gesture and speech

For robotics and interactive AI, this is where an implementation succeeds or fails.

1.3 Contextual & Spatial Cues — Environment-Derived Non-Verbal

Distance & Orientation: Interpersonal distance, facing direction, occlusion

Ambient Conditions: Environmental noise, reverberation, reflection, crowding

Perception Constraints: Lighting, visibility, field of view limitations, audio masking

These are not background — they are preconditions for perception. Datasets must explicitly define and reproduce these conditions.

Principle

Structure, Not Tags.

2. Data Representation

The value of non-verbal data is determined not by the richness of classification taxonomies, but by the format of representation.

2.1 Temporal Intervals

Interval representation with start_time / end_time. Examples: hesitation intervals, silence intervals, backchannel intervals.
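As a minimal sketch, an interval record could look like the following; the field names and types are illustrative assumptions, not our delivery format.

```python
from dataclasses import dataclass

@dataclass
class IntervalEvent:
    """One non-verbal event on the session timeline (illustrative schema)."""
    event_type: str    # e.g. "hesitation", "silence", "backchannel"
    start_time: float  # seconds from session start
    end_time: float    # seconds from session start
    speaker_id: str    # who produced the event

# Example: a 420 ms hesitation by speaker A
event = IntervalEvent("hesitation", start_time=12.31, end_time=12.73, speaker_id="A")
```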

2.2 Continuous Signals

Continuous quantities such as pitch, energy, and tempo. Emotional states are treated as trajectories of state variables, not reduced to word labels.
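A hedged sketch of the idea, using invented placeholder values: continuous quantities are kept as sampled curves over time rather than collapsed into a single label.

```python
import numpy as np

# Illustrative only: a frame-level trajectory sampled at 100 Hz (assumed rate).
# Emotional state stays as continuous arousal/valence curves, not one category.
frame_rate = 100
t = np.arange(0, 5.0, 1 / frame_rate)  # 5-second excerpt
trajectory = {
    "time": t,
    "pitch_hz": 180 + 20 * np.sin(2 * np.pi * 0.3 * t),   # placeholder pitch curve
    "energy_db": -20 + 3 * np.cos(2 * np.pi * 0.2 * t),   # placeholder energy curve
    "arousal": np.clip(0.4 + 0.1 * t / 5.0, 0, 1),         # slowly rising arousal
    "valence": np.full_like(t, 0.1),                        # flat, slightly positive valence
}
```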

2.3 Relational Structure

Relationships between speech turns (interruption, overlap, yielding)

Correspondence between gaze and speech target

Correspondence between gesture and semantic unit

"When, to what, and how did they respond?"

Only by structuring data with both time and relations can it become reusable for dialogue AI and robotics.
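One possible encoding, with illustrative identifiers: relations are stored as typed links between already-annotated events, so the same record answers when, to what, and how a participant responded.

```python
from dataclasses import dataclass

@dataclass
class Relation:
    """A typed link between two annotated events (illustrative schema)."""
    relation_type: str  # e.g. "interrupts", "overlaps", "gazes_at", "aligns_with"
    source_id: str      # id of the event or turn that initiates the relation
    target_id: str      # id of the event, turn, speaker, or semantic unit it points to
    time: float         # when the relation holds, in seconds

relations = [
    Relation("interrupts",  source_id="turn_14",    target_id="turn_13",   time=42.8),
    Relation("gazes_at",    source_id="gaze_07",    target_id="speaker_B", time=43.1),
    Relation("aligns_with", source_id="gesture_03", target_id="phrase_21", time=43.2),
]
```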

3. Recording & Acquisition Design

Because post-hoc processing cannot substitute for intentional design, this phase is the core of the operation.

3.1 Requirements Definition — What Must Be Decided First

Target System: Conversational agent, robot, multimodal model, evaluation system

Target Task: Turn-taking, backchannel generation, emotion transition estimation, interpersonal distance control

Recording Format: In-person / remote, 1-on-1 / multi-party, quiet / noisy

Variable vs. Fixed Elements: e.g., fix speaker attributes, vary environmental noise in stages

3.2 Session Design — The Skeleton for Reproducibility

Scenario (condition definition, not script)

Turn count, silence insertion conditions, interruption induction conditions

Emotion transition design (e.g., calm → hesitation → acceptance)

Speech-gaze-posture synchronization conditions (synchronized vs. intentionally offset)
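As an illustration only (field names and values are assumptions), a session design can be captured as a small machine-readable specification so the same conditions can be rerun or varied deliberately.

```python
# Illustrative session specification, not a production schema.
# The point is that every condition is fixed or varied explicitly.
session_spec = {
    "scenario": "customer_inquiry",               # condition definition, not a script
    "participants": 2,
    "turn_count_target": 24,
    "silence_insertion": {"min_s": 0.8, "per_n_turns": 6},
    "interruption_induction": {"enabled": True, "per_n_turns": 8},
    "emotion_transition": ["calm", "hesitation", "acceptance"],
    "gaze_speech_sync": "intentionally_offset",   # vs. "synchronized"
}
```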

3.3 Measurement & Recording — Minimum Required Logs

Audio: Sampling rate, microphone conditions, noise profile

Video: Frame rate, field of view, fixed/moving, occlusion conditions

Environment: Distance, room conditions, reverberation index, crowding level
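A hedged example of such a log, with placeholder values; RT60 is used here as one possible reverberation index.

```python
# Minimal recording log (illustrative field names and values):
recording_log = {
    "audio": {"sample_rate_hz": 48000, "mic": "lavalier_x2", "noise_profile": "office_hvac"},
    "video": {"fps": 60, "fov_deg": 90, "camera": "fixed", "occlusion": "partial_table"},
    "environment": {"distance_m": 1.2, "room": "meeting_6x4m", "rt60_s": 0.45, "crowding": "low"},
}
```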

4. Annotation

Annotation converts human understanding into structure. Non-verbal data may appear ambiguous, but it becomes reproducible once the judgment process is proceduralized.

4.1 Label Schema Design

Minimum necessary categories + continuous quantities

Prioritize "events" and "transitions" over "emotion labels"

Event examples: pause, backchannel, overlap, hesitation

Dynamics: arousal/valence as continuous values (when needed)
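For illustration, such a schema can be stated compactly; the event set and value ranges below are examples, not a fixed taxonomy.

```python
# Illustrative label schema: a small closed set of event types plus
# optional continuous dynamics, rather than a large emotion taxonomy.
LABEL_SCHEMA = {
    "events": ["pause", "backchannel", "overlap", "hesitation"],
    "event_fields": ["start_time", "end_time", "speaker_id"],
    "dynamics": {  # continuous values, annotated only when the task needs them
        "arousal": {"range": [0.0, 1.0]},
        "valence": {"range": [-1.0, 1.0]},
    },
}
```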

4.2 Annotation Guidelines

Fix judgment criteria in written form

Unify boundary marking (start/end)

Priority rules for multi-party cases (whose backchannel, whose gaze target)

4.3 Quality Control

Double annotation + agreement rate measurement (interval agreement, event agreement)

Boundary tolerance (e.g., ±200ms) set according to task

Redefinition and consolidation of low-agreement labels (label revision is part of quality control)
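As a sketch of what event agreement with a boundary tolerance can mean in practice (the matching rule and the +/-200 ms value are illustrative, and the tolerance is set per task):

```python
def event_agreement(events_a, events_b, tolerance_s=0.2):
    """Fraction of annotator A's events matched by annotator B.

    Two events match when they share a label and their start times differ
    by no more than the boundary tolerance (illustrative rule; the tolerance
    is task-dependent, +/-200 ms here).
    """
    matched = 0
    used = set()
    for label_a, start_a in events_a:
        for i, (label_b, start_b) in enumerate(events_b):
            if i in used:
                continue
            if label_a == label_b and abs(start_a - start_b) <= tolerance_s:
                matched += 1
                used.add(i)
                break
    return matched / len(events_a) if events_a else 1.0

# Example: two annotators; one backchannel boundary disagrees by 150 ms
a = [("pause", 3.10), ("backchannel", 7.42)]
b = [("pause", 3.18), ("backchannel", 7.57)]
print(event_agreement(a, b))  # 1.0: both events fall within tolerance
```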

5–6. Metadata & Rights Design

5. Metadata Design — "Usable" Conditions That Drive Contracts

For non-verbal data, knowing "what happened" is not enough — "why it happened" and "under what conditions" are critical.

Session Conditions: Environment, number of participants, distance, noise, target task

Speaker Attributes: Age range, region, speech characteristics (as needed)

Recording Conditions: Device, settings, synchronization information

Rights Conditions: Usage scope, reuse permission, derivative permission

This metadata enables research, retraining, auditing, and regeneration.
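A hedged example of a sample-level metadata record; all field names and values are placeholders.

```python
# Illustrative sample-level metadata record (not a fixed delivery format):
sample_metadata = {
    "sample_id": "SAMPLE_0007",
    "session": {"environment": "quiet_room", "participants": 2, "distance_m": 1.0,
                "noise": "none", "target_task": "backchannel_generation"},
    "speaker": {"age_range": "30-39", "region": "region_01", "speech_style": "narration"},
    "recording": {"device": "condenser_mic", "sample_rate_hz": 48000, "sync_ref": "clapboard"},
    "rights": {"usage_scope": "model_training", "reuse": True, "derivatives": "with_approval"},
}
```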

6. Rights & Compliance — Especially Critical for Non-Verbal

Non-verbal data tends to be more personally identifiable than speech alone. We treat compliance as a precondition of recording, not as an afterthought.

Participant Consent: Usage scope, duration, derivatives, and reuse explicitly documented

Data Separation: Data for different purposes not mixed

Traceability: Sample → session → consent condition linkage (for research and EU contexts, "explainable after the fact")
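For illustration, traceability reduces to the requirement that every sample ID resolves to a session and every session resolves to a recorded consent condition; the structures below are placeholders.

```python
# Illustrative traceability check: sample -> session -> consent condition.
consents = {"SESSION_031": {"usage_scope": "model_training", "reuse": True}}
sessions = {"SAMPLE_0007": "SESSION_031"}

def consent_for(sample_id: str) -> dict:
    """Return the consent conditions a sample was recorded under."""
    return consents[sessions[sample_id]]

print(consent_for("SAMPLE_0007"))  # {'usage_scope': 'model_training', 'reuse': True}
```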

7. Deliverables

Customized per project, but typically includes the following combination:

Raw Data: Audio / video / sensor (required formats)

Annotation: Intervals, events, continuous quantities (agreed format)

Metadata: Conditions, attributes, recording information

Documentation: Specification (requirements, design intent, assumptions), annotation guide, QC report, rights and usage summary

"Data + documentation" as a set. Without this, data does not last — neither in research nor in commercial use. This is our premise.

8. Engagement Model

Since design is the critical factor for non-verbal data, the following approach minimizes risk:

1. Requirements alignment (target task and evaluation criteria)

2. Small-scale PoC (validation of label schema and procedures)

3. QC criteria confirmation (tolerance thresholds, agreement rate standards)

4. Parallel execution for scale

5. Delivery specification fixed with future expansion and regeneration in mind

This approach avoids large-scale failures from the start while building a structure that can scale.

9. Why This Cannot Be Replaced

What differentiates this business is not that "we can also do non-verbal." It is that all of the following are true at once:

Requirements-Stage Design

Non-verbal data can be designed from the requirements stage

Learnable Structure

Represented as temporal and relational structure that can be learned

Cross-Modal Synchronization

Created in synchronization with speech, language, vision, and space

Long-Term Durability

Recording infrastructure and rights design withstand research, commercial use, and future reuse

VIEW DETAILS