API
If you're looking to use the API, you can work in your preferred programming language; the example below uses Python with the requests library.
import requests
api_key = "YOUR_API_KEY"
url = "https://api.segmind.com/v1/dia"
# Prepare data and files
data = {}
files = {}
data['seed'] = None  # Leave as None for random output; set an integer for reproducible results
data['text'] = "[S1] Segmind lets you build powerful image and video workflows — no code needed. \n [S2] Over 200 open and closed models. Just drag, drop, and deploy. \n [S1] Wait, seriously? Even custom models? \n [S2] Yup. Even fine-tuned ones. (chuckles) \n [S1] That's wild. I’ve spent weeks writing code for this. \n [S2] Now you can do it in minutes. Go try Segmind on the cloud. \n [S1] I'm sold. Let’s go. (laughs)"
data['top_p'] = 0.95
data['cfg_scale'] = 4
data['temperature'] = 1.3
# For parameter "input_audio", you can send a raw file or a URI:
# files['input_audio'] = open('AUDIO_PATH', 'rb')  # To send a local audio file
data['input_audio'] = None  # Or set this to an audio file URL to send a URI
data['speed_factor'] = 0.94
data['max_new_tokens'] = 3072
data['cfg_filter_top_k'] = 35
headers = {'x-api-key': api_key}
# Send multipart form data if a file is attached; otherwise send JSON
if files:
    response = requests.post(url, data=data, files=files, headers=headers)
else:
    response = requests.post(url, json=data, headers=headers)
print(response.content)  # The response body is the generated audio
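Assuming the request succeeds, the response body is the generated audio, which you can write to disk. Below is a minimal sketch that assumes the body is raw audio bytes; the output filename and .wav extension are illustrative, so check the Content-Type response header to confirm the actual format.
# Save the generated audio to a file (filename and extension are assumptions)
if response.status_code == 200:
    with open('dia_output.wav', 'wb') as f:
        f.write(response.content)
else:
    print(response.status_code, response.text)  # Error details returned by the API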
Attributes
- seed: Use a seed for reproducible results. Leave blank for random output.
- text: Input text for speech generation. Use [S1], [S2] for speakers and ( ) for actions like (laughs) or (whispers). Verbal tags will be recognized, but might result in unexpected output.
- top_p: Controls word variety. Higher values allow rarer words. Most users can leave this as is. (min: 0.1, max: 1)
- cfg_scale: Controls how strictly audio follows text. Higher = more accurate, lower = more natural. (min: 1, max: 5)
- temperature: Controls randomness. Higher (1.4–2.0) = more variety, lower (0.1–1.0) = more consistency. (min: 0.1, max: 2)
- input_audio: Audio file (.wav, .mp3, or .flac) for voice cloning. The model will clone this voice style.
- speed_factor: Controls playback speed. 1.0 = normal, below 1.0 = slower. (min: 0.5, max: 1.5)
- max_new_tokens: Controls audio length. Higher values = longer audio (≈86 tokens per second). (min: 500, max: 4096)
- cfg_filter_top_k: Filters audio tokens. Higher values = more diverse sounds, lower = more consistent. (min: 10, max: 100)
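As a rough illustration of how these attributes interact, the hypothetical settings below fix a seed and lower the temperature to favor more consistent, repeatable output; the specific values are assumptions, not recommendations.
# Hypothetical tuning for more repeatable, consistent narration
data['seed'] = 42              # fixed seed -> reproducible output across runs
data['temperature'] = 0.7      # lower temperature (0.1-1.0) -> more consistency
data['cfg_scale'] = 4.5        # higher cfg_scale -> follows the text more strictly
data['cfg_filter_top_k'] = 20  # lower value -> more consistent sounds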
To keep track of your credit usage, you can inspect the response headers of each API call. The x-remaining-credits property will indicate the number of remaining credits in your account. Ensure you monitor this value to avoid any disruptions in your API usage.
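For example, you can read this header directly from the response object used in the snippet above:
remaining = response.headers.get('x-remaining-credits')
if remaining is not None:
    print(f"Remaining credits: {remaining}")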
Dia by Nari Labs: Next-gen Text-to-Speech with Emotion and Realism
Dia is a cutting-edge, 1.6 billion parameter text-to-speech model developed by Nari Labs — where "Nari" (나리) means lily in pure Korean. It is designed to produce ultra-realistic, podcast-style dialogue directly from text inputs. Unlike traditional TTS systems that often sound robotic or lack expressive nuance, Dia excels at generating lifelike, multi-speaker conversations complete with emotional tone adjustments and non-verbal cues such as pauses, laughter, and coughing. This level of expressiveness and control makes Dia a game changer in the field, enabling creators to craft engaging audio for podcasts, audiobooks, video game characters, and conversational interfaces without the need for high-end proprietary solutions. It is ideal for conversational AI, storytelling, dubbing, and interactive voice applications.
Technically, Dia is built as a 1.6 billion parameter model optimized specifically for natural dialogue synthesis, distinguishing it from general-purpose TTS models. The architecture supports advanced features such as audio conditioning, where users can guide the generated speech’s tone, emotion, or delivery style using short audio samples. It also allows script-level control with embedded commands for non-verbal sounds, enhancing the realism of the output. The model was trained using Google’s TPU Cloud, making it efficient enough to run on most modern computers, though the full version requires around 10GB of VRAM, with plans for a more lightweight, quantized release in the future. By releasing both the model weights and inference code openly, Nari Labs fosters community-driven innovation and transparency, positioning Dia as a versatile and accessible tool for next-generation speech synthesis.
Key Features
- Multi-Speaker Dialogue Tags
Generate dynamic conversations using [S1], [S2] speaker tags.
- Nonverbal Vocal Cues
Dia recognizes expressive cues: (laughs), (clears throat), (sighs), (gasps), (coughs), (singing), (sings), (mumbles), (beep), (groans), (sniffs), (claps), (screams), (inhales), (exhales), (applause), (burps), (humming), (sneezes), (chuckle), (whistles).
- Zero-Shot Voice Variety
The model is not fine-tuned to a single voice, so it will produce a new synthetic voice with each run. This allows for variety but requires conditioning for consistency.
- Voice Consistency Options
- Audio Prompting: Upload a voice sample to guide tone and speaker identity.
- Seed Fixing: Use the same seed for consistent voice generation across runs.
- Voice Cloning
Clone any voice by uploading a sample and a matching transcript. The model will adapt and use that voice for the rest of your script (see the sketch below this list).
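A minimal sketch of both consistency options, reusing the request pattern from the Python example above; the file path, transcript, and seed value are placeholders.
# Option 1: audio prompting / voice cloning - upload a reference sample
# (the file path and transcript below are placeholders)
files['input_audio'] = open('reference_voice.wav', 'rb')
data['text'] = "[S1] Transcript of the reference audio. [S1] New lines spoken in the cloned voice."

# Option 2: seed fixing - reuse the same seed so the synthetic voice stays stable
data['seed'] = 1234  # any fixed integer; reuse it across runs

response = requests.post(url, data=data, files=files, headers=headers)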
Usage Tips
- Speaker Identity Management: Use [S1], [S2] for clarity in conversations.
- Conditioning for Emotional Delivery: Include nonverbal tags or an audio sample to control emotion and style.
Other Popular Models
sdxl-controlnet
SDXL ControlNet gives unprecedented control over text-to-image generation. SDXL ControlNet models introduce the concept of conditioning inputs, which provide additional information to guide the image generation process.

idm-vton
Best-in-class clothing virtual try-on in the wild

codeformer
CodeFormer is a robust face restoration algorithm for old photos or AI-generated faces.

sd2.1-faceswapper
Take a picture/gif and replace the face in it with a face of your choice. You only need one image of the desired face. No dataset, no training required.
