Dia (Text to Speech)

Dia by Nari Labs is an advanced open-weights TTS model that brings scripts to life with natural speech, emotions, and nonverbal cues. It gives you easy control over tone, voice, and delivery, and is a strong open alternative to ElevenLabs.


API

The API can be called from any programming language; the sample request below uses Python.

POST
import requests

api_key = "YOUR_API_KEY"
url = "https://api.segmind.com/v1/dia"

# Prepare data and files
data = {}
files = {}
data['seed'] = None
data['text'] = "[S1] Segmind lets you build powerful image and video workflows — no code needed. \n [S2] Over 200 open and closed models. Just drag, drop, and deploy. \n [S1] Wait, seriously? Even custom models? \n [S2] Yup. Even fine-tuned ones. (chuckles) \n [S1] That's wild. I’ve spent weeks writing code for this. \n [S2] Now you can do it in minutes. Go try Segmind on the cloud. \n [S1] I'm sold. Let’s go. (laughs)"
data['top_p'] = 0.95
data['cfg_scale'] = 4
data['temperature'] = 1.3

# For parameter "input_audio", you can send a raw file or a URI:
# files['input_audio'] = open('AUDIO_PATH', 'rb')  # To send a file
data['input_audio'] = 'null'  # To send a URI
data['speed_factor'] = 0.94
data['max_new_tokens'] = 3072
data['cfg_filter_top_k'] = 35

headers = {'x-api-key': api_key}

# Send as multipart when uploading a file, otherwise as JSON
if files:
    response = requests.post(url, data=data, files=files, headers=headers)
else:
    response = requests.post(url, json=data, headers=headers)

print(response.content)  # The response body is the generated audio
RESPONSE
audio (binary)
HTTP Response Codes
200 - OK: Audio generated
401 - Unauthorized: User authentication failed
404 - Not Found: The requested URL does not exist
405 - Method Not Allowed: The requested HTTP method is not allowed
406 - Not Acceptable: Not enough credits
500 - Server Error: Server had some issue with processing
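Since a successful call returns raw binary audio, the status codes above can be handled with a small dispatch like the following sketch. The `output_audio` filename is illustrative, not part of the API; the error strings mirror the table above.

```python
def handle_response(status_code: int, content: bytes) -> str:
    """Map the documented Dia API status codes to an outcome string."""
    errors = {
        401: "User authentication failed",
        404: "The requested URL does not exist",
        405: "The requested HTTP method is not allowed",
        406: "Not enough credits",
        500: "Server had some issue with processing",
    }
    if status_code == 200:
        # Success: the body is the generated audio; persist it to disk.
        with open("output_audio", "wb") as f:
            f.write(content)
        return "saved"
    return errors.get(status_code, "Unknown error")
```

In practice you would call this as `handle_response(response.status_code, response.content)` after the POST shown above.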

Attributes


seed int ( default: 1 )

Use a seed for reproducible results. Leave blank for random output.


text str *

Input text for speech generation. Use [S1], [S2] for speakers and parentheses for actions like (laughs) or (whispers). Nonverbal tags will be recognized, but might result in unexpected output.
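Scripts in this format can also be assembled programmatically. A small illustrative helper (not part of the API; the `build_script` name and the `(speaker, line)` tuple shape are assumptions) might look like this:

```python
def build_script(turns):
    """Join (speaker_number, line) pairs into Dia's [S1]/[S2] script format.

    `turns` is a list like [(1, "Hello"), (2, "(laughs) Hi!")]. Turns are
    separated with " \\n " to match the example request above.
    """
    return " \n ".join(f"[S{speaker}] {line}" for speaker, line in turns)

script = build_script([
    (1, "Welcome to the show."),
    (2, "(laughs) Glad to be here."),
])
# script == "[S1] Welcome to the show. \n [S2] (laughs) Glad to be here."
```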


top_p float ( default: 0.95 )

Controls word variety. Higher values allow rarer words. Most users can leave this as is.

min: 0.1, max: 1


cfg_scale float ( default: 4 )

Controls how strictly audio follows text. Higher = more accurate, lower = more natural. (1 to 5)

min: 1, max: 5


temperature float ( default: 1.3 )

Controls randomness. Higher (1.4–2.0) = more variety, lower (0.1–1.0) = more consistency. Values can be 0.1 to 2.

min: 0.1, max: 2


input_audio str

Audio file (.wav, .mp3, or .flac) used for voice cloning; the model will clone this voice style.


speed_factor float ( default: 0.94 )

Controls playback speed. 1.0 = normal, below 1.0 = slower. Values can be 0.5 to 1.5.

min: 0.5, max: 1.5


max_new_tokens int ( default: 3072 )

Controls audio length. Higher values = longer audio (≈86 tokens per second). Values can be 500 to 4096.

min: 500, max: 4096
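The ≈86 tokens per second rate above makes it easy to estimate a token budget for a target duration. A hedged helper (the function name is an assumption; the clamping bounds are the documented min and max):

```python
def tokens_for_seconds(seconds: float, tokens_per_second: int = 86) -> int:
    """Estimate max_new_tokens for a target audio duration.

    Uses the ~86 tokens/second rate from the parameter description and
    clamps the result to the documented [500, 4096] range.
    """
    estimate = round(seconds * tokens_per_second)
    return max(500, min(4096, estimate))

tokens_for_seconds(30)   # 30 s * 86 = 2580 tokens
tokens_for_seconds(120)  # 10320, clamped to the 4096 maximum (about 47 s)
```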


cfg_filter_top_k int ( default: 35 )

Filters audio tokens. Higher values = more diverse sounds, lower = more consistent. Values can be 10 to 100.

min: 10, max: 100

To keep track of your credit usage, inspect the response headers of each API call: the x-remaining-credits header indicates the number of credits remaining in your account. Monitor this value to avoid disruptions in your API usage.
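Reading that header can be sketched as below. The helper takes a plain headers mapping so it works with `response.headers` from requests; the low-balance threshold in the usage comment is purely illustrative.

```python
def remaining_credits(headers: dict):
    """Parse the x-remaining-credits response header; None if absent."""
    value = headers.get("x-remaining-credits")
    return float(value) if value is not None else None

# Usage with requests (sketch):
# response = requests.post(url, json=data, headers={'x-api-key': api_key})
# credits = remaining_credits(response.headers)
# if credits is not None and credits < 100:  # threshold is illustrative
#     print("Warning: low credit balance")
```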

Dia by Nari Labs: Next-gen Text-to-Speech with Emotion and Realism

Dia is a cutting-edge, 1.6 billion parameter text-to-speech model developed by Nari Labs — where "Nari" (나리) means lily in pure Korean. It is designed to produce ultra-realistic, podcast-style dialogue directly from text inputs. Unlike traditional TTS systems that often sound robotic or lack expressive nuance, Dia excels at generating lifelike, multi-speaker conversations complete with emotional tone adjustments and non-verbal cues such as pauses, laughter, and coughing. This level of expressiveness and control makes Dia a game changer in the field, enabling creators to craft engaging audio for podcasts, audiobooks, video game characters, and conversational interfaces without the need for high-end proprietary solutions. It is ideal for conversational AI, storytelling, dubbing, and interactive voice applications.

Technically, Dia is built as a 1.6 billion parameter model optimized specifically for natural dialogue synthesis, distinguishing it from general-purpose TTS models. The architecture supports advanced features such as audio conditioning, where users can guide the generated speech’s tone, emotion, or delivery style using short audio samples. It also allows script-level control with embedded commands for non-verbal sounds, enhancing the realism of the output. The model was trained using Google’s TPU Cloud, making it efficient enough to run on most modern computers, though the full version requires around 10GB of VRAM, with plans for a more lightweight, quantized release in the future. By releasing both the model weights and inference code openly, Nari Labs fosters community-driven innovation and transparency, positioning Dia as a versatile and accessible tool for next-generation speech synthesis.

Key Features

  • Multi-Speaker Dialogue Tags
Generate dynamic conversations using [S1], [S2] speaker tags.
  • Nonverbal Vocal Cues
Dia recognizes expressive cues: (laughs), (clears throat), (sighs), (gasps), (coughs), (singing), (sings), (mumbles), (beep), (groans), (sniffs), (claps), (screams), (inhales), (exhales), (applause), (burps), (humming), (sneezes), (chuckle), (whistles).
  • Zero-Shot Voice Variety
The model is not fine-tuned to a single voice, so it will produce a new synthetic voice with each run. This allows for variety but requires conditioning for consistency.
  • Voice Consistency Options
    • Audio Prompting: Upload a voice sample to guide tone and speaker identity.
    • Seed Fixing: Use the same seed for consistent voice generation across runs.
  • Voice Cloning
Clone any voice by uploading a sample and a matching transcript. The model will adapt and use that voice for the rest of your script.
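The consistency options above translate into request payloads roughly like this. The field names come from the Attributes section; the seed value and sample URL are illustrative only.

```python
# Sketch: two ways to keep the same voice across runs.

# 1) Seed fixing: reuse one integer seed on every call.
seeded_payload = {
    "text": "[S1] Same voice every run.",
    "seed": 1234,  # any fixed int; the same seed gives a consistent voice
}

# 2) Audio prompting: pass a sample URI so the model matches its style.
prompted_payload = {
    "text": "[S1] Match the uploaded sample.",
    "input_audio": "https://example.com/sample.wav",  # .wav, .mp3, or .flac
}
```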

Usage Tips

  • Speaker Identity Management: Use [S1], [S2] for clarity in conversations.
  • Conditioning for Emotional Delivery: Include nonverbal tags or an audio sample to control emotion and style.
