Dia (Text to Speech)

Dia by Nari Labs is an advanced open-weights TTS model that brings scripts to life with natural speech, emotions, and nonverbal cues. It gives you easy control over tone, voice, and delivery, and is a strong open alternative to ElevenLabs.


API

The API can be called from any programming language; the sample request below uses Python.

POST
import requests

api_key = "YOUR_API_KEY"
url = "https://api.segmind.com/v1/dia"

# Prepare data and files
data = {}
files = {}
data['seed'] = None
data['text'] = "[S1] Segmind lets you build powerful image and video workflows — no code needed. \n [S2] Over 200 open and closed models. Just drag, drop, and deploy. \n [S1] Wait, seriously? Even custom models? \n [S2] Yup. Even fine-tuned ones. (chuckles) \n [S1] That's wild. I’ve spent weeks writing code for this. \n [S2] Now you can do it in minutes. Go try Segmind on the cloud. \n [S1] I'm sold. Let’s go. (laughs)"
data['top_p'] = 0.95
data['cfg_scale'] = 4
data['temperature'] = 1.3

# For parameter "input_audio", you can send a raw file or a URI:
# files['input_audio'] = open('AUDIO_PATH', 'rb')  # To send a file
data['input_audio'] = 'null'  # To send a URI
data['speed_factor'] = 0.94
data['max_new_tokens'] = 3072
data['cfg_filter_top_k'] = 35

headers = {'x-api-key': api_key}

# Send as multipart when uploading a file, otherwise as JSON
if files:
    response = requests.post(url, data=data, files=files, headers=headers)
else:
    response = requests.post(url, json=data, headers=headers)

print(response.content)  # The response body is the generated audio
RESPONSE
audio (binary)
HTTP Response Codes
200 - OK: Audio generated
401 - Unauthorized: User authentication failed
404 - Not Found: The requested URL does not exist
405 - Method Not Allowed: The requested HTTP method is not allowed
406 - Not Acceptable: Not enough credits
500 - Server Error: Server had some issue with processing
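Since a successful call returns raw binary audio, the status codes above can be handled with a small dispatch like the following sketch. The `output_audio` filename is illustrative, not part of the API; the error strings mirror the table above.

```python
def handle_response(status_code: int, content: bytes) -> str:
    """Map the documented Dia API status codes to an outcome string."""
    errors = {
        401: "User authentication failed",
        404: "The requested URL does not exist",
        405: "The requested HTTP method is not allowed",
        406: "Not enough credits",
        500: "Server had some issue with processing",
    }
    if status_code == 200:
        # Success: the body is the generated audio; persist it to disk.
        with open("output_audio", "wb") as f:
            f.write(content)
        return "saved"
    return errors.get(status_code, "Unknown error")
```

In practice you would call this as `handle_response(response.status_code, response.content)` after the POST shown above.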

Attributes


seed int ( default: 1 )

Use a seed for reproducible results. Leave blank for random output.


text str *

Input text for speech generation. Use [S1], [S2] for speakers and parentheses for actions like (laughs) or (whispers). Nonverbal tags will be recognized, but might result in unexpected output.
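Scripts in this format can also be assembled programmatically. A small illustrative helper (not part of the API; the `build_script` name and the `(speaker, line)` tuple shape are assumptions) might look like this:

```python
def build_script(turns):
    """Join (speaker_number, line) pairs into Dia's [S1]/[S2] script format.

    `turns` is a list like [(1, "Hello"), (2, "(laughs) Hi!")]. Turns are
    separated with " \\n " to match the example request above.
    """
    return " \n ".join(f"[S{speaker}] {line}" for speaker, line in turns)

script = build_script([
    (1, "Welcome to the show."),
    (2, "(laughs) Glad to be here."),
])
# script == "[S1] Welcome to the show. \n [S2] (laughs) Glad to be here."
```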


top_p float ( default: 0.95 )

Controls word variety. Higher values allow rarer words. Most users can leave this as is.

min: 0.1, max: 1


cfg_scale float ( default: 4 )

Controls how strictly audio follows text. Higher = more accurate, lower = more natural. (1 to 5)

min: 1, max: 5


temperature float ( default: 1.3 )

Controls randomness. Higher (1.4–2.0) = more variety, lower (0.1–1.0) = more consistency. Values can be 0.1 to 2.

min: 0.1, max: 2


input_audio str

Audio file (.wav, .mp3, or .flac) used for voice cloning; the model will clone this voice style.


speed_factor float ( default: 0.94 )

Controls playback speed. 1.0 = normal, below 1.0 = slower. Values can be 0.5 to 1.5.

min: 0.5, max: 1.5


max_new_tokens int ( default: 3072 )

Controls audio length. Higher values = longer audio (≈86 tokens per second). Values can be 500 to 4096.

min: 500, max: 4096
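The ≈86 tokens per second rate above makes it easy to estimate a token budget for a target duration. A hedged helper (the function name is an assumption; the clamping bounds are the documented min and max):

```python
def tokens_for_seconds(seconds: float, tokens_per_second: int = 86) -> int:
    """Estimate max_new_tokens for a target audio duration.

    Uses the ~86 tokens/second rate from the parameter description and
    clamps the result to the documented [500, 4096] range.
    """
    estimate = round(seconds * tokens_per_second)
    return max(500, min(4096, estimate))

tokens_for_seconds(30)   # 30 s * 86 = 2580 tokens
tokens_for_seconds(120)  # 10320, clamped to the 4096 maximum (about 47 s)
```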


cfg_filter_top_k int ( default: 35 )

Filters audio tokens. Higher values = more diverse sounds, lower = more consistent. Values can be 10 to 100.

min: 10, max: 100

To keep track of your credit usage, inspect the response headers of each API call: the x-remaining-credits header indicates the number of credits remaining in your account. Monitor this value to avoid disruptions in your API usage.
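Reading that header can be sketched as below. The helper takes a plain headers mapping so it works with `response.headers` from requests; the low-balance threshold in the usage comment is purely illustrative.

```python
def remaining_credits(headers: dict):
    """Parse the x-remaining-credits response header; None if absent."""
    value = headers.get("x-remaining-credits")
    return float(value) if value is not None else None

# Usage with requests (sketch):
# response = requests.post(url, json=data, headers={'x-api-key': api_key})
# credits = remaining_credits(response.headers)
# if credits is not None and credits < 100:  # threshold is illustrative
#     print("Warning: low credit balance")
```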

Dia by Nari Labs: Next-gen Text-to-Speech with Emotion and Realism

Dia is a cutting-edge, 1.6 billion parameter text-to-speech model developed by Nari Labs — where "Nari" (나리) means lily in pure Korean. It is designed to produce ultra-realistic, podcast-style dialogue directly from text inputs. Unlike traditional TTS systems that often sound robotic or lack expressive nuance, Dia excels at generating lifelike, multi-speaker conversations complete with emotional tone adjustments and non-verbal cues such as pauses, laughter, and coughing. This level of expressiveness and control makes Dia a game changer in the field, enabling creators to craft engaging audio for podcasts, audiobooks, video game characters, and conversational interfaces without the need for high-end proprietary solutions. It is ideal for conversational AI, storytelling, dubbing, and interactive voice applications.

Technically, Dia is built as a 1.6 billion parameter model optimized specifically for natural dialogue synthesis, distinguishing it from general-purpose TTS models. The architecture supports advanced features such as audio conditioning, where users can guide the generated speech’s tone, emotion, or delivery style using short audio samples. It also allows script-level control with embedded commands for non-verbal sounds, enhancing the realism of the output. The model was trained using Google’s TPU Cloud, making it efficient enough to run on most modern computers, though the full version requires around 10GB of VRAM, with plans for a more lightweight, quantized release in the future. By releasing both the model weights and inference code openly, Nari Labs fosters community-driven innovation and transparency, positioning Dia as a versatile and accessible tool for next-generation speech synthesis.

Key Features

  • Multi-Speaker Dialogue Tags
Generate dynamic conversations using [S1], [S2] speaker tags.
  • Nonverbal Vocal Cues
Dia recognizes expressive cues: (laughs), (clears throat), (sighs), (gasps), (coughs), (singing), (sings), (mumbles), (beep), (groans), (sniffs), (claps), (screams), (inhales), (exhales), (applause), (burps), (humming), (sneezes), (chuckle), (whistles).
  • Zero-Shot Voice Variety
The model is not fine-tuned to a single voice, so it will produce a new synthetic voice with each run. This allows for variety but requires conditioning for consistency.
  • Voice Consistency Options
    • Audio Prompting: Upload a voice sample to guide tone and speaker identity.
    • Seed Fixing: Use the same seed for consistent voice generation across runs.
  • Voice Cloning
Clone any voice by uploading a sample and a matching transcript. The model will adapt and use that voice for the rest of your script.
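The consistency options above translate into request payloads roughly like this. The field names come from the Attributes section; the seed value and sample URL are illustrative only.

```python
# Sketch: two ways to keep the same voice across runs.

# 1) Seed fixing: reuse one integer seed on every call.
seeded_payload = {
    "text": "[S1] Same voice every run.",
    "seed": 1234,  # any fixed int; the same seed gives a consistent voice
}

# 2) Audio prompting: pass a sample URI so the model matches its style.
prompted_payload = {
    "text": "[S1] Match the uploaded sample.",
    "input_audio": "https://example.com/sample.wav",  # .wav, .mp3, or .flac
}
```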

Usage Tips

  • Speaker Identity Management: Use [S1], [S2] for clarity in conversations.
  • Conditioning for Emotional Delivery: Include nonverbal tags or an audio sample to control emotion and style.
