ElevenLabs Transcript

Experience unmatched accuracy with ElevenLabs Transcript, the leading model for AI speech-to-text.

API

The API can be called from the programming language of your choice. The example below uses Python.

POST

import requests

api_key = "YOUR_API_KEY"
url = "https://api.segmind.com/v1/eleven-labs-transcript"

# Request payload
data = {
  "file": "https://segmind-sd-models.s3.amazonaws.com/display_images/sad_talker/sad_talker_audio_input.mp3",
  "timestamps_granularity": "word",
  "language_code": "en",
  "num_speakers": 1,
  "tag_audio_events": False,
  "diarize": True,
  "response_content_only": False
}

headers = {'x-api-key': api_key}

response = requests.post(url, json=data, headers=headers)
print(response.content)  # The response body contains the transcription result

RESPONSE

image/jpeg

HTTP Response Codes

200 - OK: Transcript generated

401 - Unauthorized: User authentication failed

404 - Not Found: The requested URL does not exist

405 - Method Not Allowed: The requested HTTP method is not allowed

406 - Not Acceptable: Not enough credits

500 - Server Error: The server had an issue processing the request
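The response codes above can be turned into readable errors on the client side. The following is a minimal sketch, not part of the official API: the names `STATUS_MESSAGES`, `describe_status`, and `check_response` are hypothetical, and raising `RuntimeError` on any non-200 response is an assumed policy.

```python
# Descriptions taken from the response code list above.
# The helper names and the raise-on-error policy are assumptions.
STATUS_MESSAGES = {
    200: "OK",
    401: "Unauthorized: user authentication failed",
    404: "Not Found: the requested URL does not exist",
    405: "Method Not Allowed: the requested HTTP method is not allowed",
    406: "Not Acceptable: not enough credits",
    500: "Server Error: the server had an issue processing the request",
}


def describe_status(status_code):
    """Return a readable description for a documented status code."""
    return STATUS_MESSAGES.get(status_code, f"Unexpected status code: {status_code}")


def check_response(response):
    """Return the response unchanged if it is a 200, otherwise raise RuntimeError."""
    if response.status_code != 200:
        raise RuntimeError(describe_status(response.status_code))
    return response
```

In the request example, `check_response(requests.post(url, json=data, headers=headers))` would stop early on a 401 or 406 instead of printing an error body.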

Attributes

file str ( default: https://segmind-sd-models.s3.amazonaws.com/display_images/sad_talker/sad_talker_audio_input.mp3 )

Input Audio or Video URL

timestamps_granularity enum:str ( default: word )

The granularity of the timestamps in the transcription. ‘word’ provides word-level timestamps and ‘character’ provides character-level timestamps per word.

Allowed values (as described above): word, character

language_code str ( default: en )

An ISO-639-1 or ISO-639-3 language code corresponding to the language of the audio file. Providing it can sometimes improve transcription performance if the language is known beforehand. Defaults to null, in which case the language is predicted automatically.

cloud_storage_url str ( default: 1 )

URL of the input video/audio to be transcribed. Limited to 1 GB.

num_speakers int ( default: 1 )

The maximum number of speakers talking in the uploaded file. Setting it can help with predicting who speaks when. At most 32 speakers can be predicted. Defaults to null, in which case the number of speakers is set to the maximum value the model supports.

min : 1,

max : 32

tag_audio_events boolean ( default: true )

Whether to tag audio events, such as laughter or footsteps, in the transcription.

diarize boolean ( default: true )

Whether to annotate which speaker is currently talking in the uploaded file.

response_content_only boolean ( default: 1 )

Whether to send only the text content.
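The attribute constraints above can be checked locally before a request is sent. This is a minimal sketch under stated assumptions: `build_payload` and `ALLOWED_GRANULARITY` are hypothetical names, the granularity set contains only the two values described in this document (the API may accept others), the language check only verifies the two- or three-letter shape of an ISO-639 code, and `response_content_only=False` follows the request example rather than the listed default.

```python
import re

# Only the two granularities described in this document; the API may accept others.
ALLOWED_GRANULARITY = {"word", "character"}


def build_payload(file_url, timestamps_granularity="word", language_code="en",
                  num_speakers=1, tag_audio_events=True, diarize=True,
                  response_content_only=False):
    """Validate arguments locally and return a request payload (hypothetical helper)."""
    if timestamps_granularity not in ALLOWED_GRANULARITY:
        raise ValueError(f"unsupported timestamps_granularity: {timestamps_granularity!r}")
    # Shape check only: ISO-639-1 codes have two letters, ISO-639-3 codes have three.
    if not re.fullmatch(r"[a-z]{2,3}", language_code):
        raise ValueError(f"language_code does not look like an ISO-639 code: {language_code!r}")
    # The documented range for num_speakers is 1 to 32.
    if not 1 <= num_speakers <= 32:
        raise ValueError("num_speakers must be between 1 and 32")
    return {
        "file": file_url,
        "timestamps_granularity": timestamps_granularity,
        "language_code": language_code,
        "num_speakers": num_speakers,
        "tag_audio_events": tag_audio_events,
        "diarize": diarize,
        "response_content_only": response_content_only,
    }
```

The returned dictionary can be passed as the `json=` argument of `requests.post`, in place of the hand-written `data` dictionary in the request example.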

To keep track of your credit usage, inspect the response headers of each API call. The x-remaining-credits header indicates the number of credits remaining in your account. Monitor this value to avoid disruptions in your API usage.
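Reading the credit header might look like the sketch below. The helper name `remaining_credits` is hypothetical, and the code assumes the header value is numeric; it returns None when the header is missing or cannot be parsed.

```python
def remaining_credits(headers):
    """Return the x-remaining-credits header as a float, or None if absent or unparseable.

    Assumption: the header value is a number. The value format is not
    specified in this document.
    """
    value = headers.get("x-remaining-credits")
    if value is None:
        return None
    try:
        return float(value)
    except ValueError:
        return None
```

With the request example, `remaining_credits(response.headers)` would be called after `requests.post(...)`. The `headers` mapping of a `requests` response matches header names case-insensitively.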

ElevenLabs Transcript

ElevenLabs Transcript is an AI transcription model for professionals who need accurate audio-to-text conversion. With industry-leading accuracy, ElevenLabs Transcript is suited to films, podcasts, meetings, and medical dictations. It combines precise transcription with straightforward integration, built on advanced ASR (automatic speech recognition) technology.

Key Features

Industry-Leading Accuracy - Achieve the lowest word error rate for perfectly accurate English transcription, outperforming Google Gemini and OpenAI Whisper in testing.
Smart Speaker Diarization - Intuitively distinguishes and labels every speaker in any conversation for clear, organized transcripts.
Precise Word-Level Timestamps - Capture the exact moment each word is spoken, enabling seamless subtitle syncing and interactive audio experiences.
Dynamic Audio Tagging - Enriches your English transcripts with the full context of your audio by tagging every sound event, from laughter to footsteps.
Global Language Support - Break language barriers with support for English and 98 other languages.
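As an illustration of the subtitle-syncing use of word-level timestamps, the sketch below groups timed words into SRT cues. The response schema is not shown in this document, so the input shape is an assumption: a list of dictionaries with hypothetical keys "text", "start", and "end" (times in seconds). The function names are also hypothetical.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words, max_words=7):
    """Group timed words into SRT cues of at most max_words words each.

    Assumption: each item looks like {"text": str, "start": float, "end": float}.
    """
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(w["text"] for w in chunk)
        start = format_srt_time(chunk[0]["start"])
        end = format_srt_time(chunk[-1]["end"])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)
```

Grouping by a fixed word count keeps the sketch short. A production subtitle tool would more likely split on pauses or punctuation.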

Use Cases

Media & Entertainment - Generate accurate subtitles and closed captions for films and videos with precise timestamps.
Business Meetings - Get clear, organized transcripts of meetings with speaker diarization, perfect for record-keeping and follow-up actions.
Medical Dictations - Transcribe medical dictations with industry-leading accuracy, ensuring precision in healthcare documentation.
Podcast Production - Transform audio content into text for show notes, scripts, and enhanced accessibility.
