OAS 3.0.1

AI APIs

The most powerful models at your fingertips.

These endpoints require an API key, which you can obtain from our website. Pass the key in the api-key header, for example api-key: YOUR_API_KEY.
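As a rough illustration of the header described above, the following Python sketch builds the headers for a JSON request. The function name auth_headers is hypothetical and chosen only for this example; the header name api-key comes from the text above.

```python
API_KEY_HEADER = "api-key"  # header name stated in the documentation above


def auth_headers(api_key: str) -> dict:
    """Return the HTTP headers for an authenticated JSON request.

    `auth_headers` is an illustrative helper, not part of any client library.
    """
    if not api_key:
        raise ValueError("an API key is required")
    return {API_KEY_HEADER: api_key, "Content-Type": "application/json"}
```

The returned dictionary can be passed to whichever HTTP client you use.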

Base URL

https://ytdlp-voilatech-apim.azure-api.net/v1

API Key (api_key)

Whisper Stable-ts align

Whisper (https://github.com/openai/whisper) is a general-purpose speech recognition model. This endpoint uses Stable-ts (https://github.com/jianfch/stable-ts) and Faster-Whisper (https://github.com/SYSTRAN/faster-whisper) to improve on the speed and timestamp accuracy of the original Whisper. It currently runs stable-ts version 2.17.3.

This endpoint asynchronously creates a job that aligns pre-transcribed text (without timestamps) to the provided audio file. The job is placed in a queue and processed in the order it was received. You can check the job status with the /stablets/status endpoint.

Body

application/json

audio_data

string

audio_data should be passed in as a dictionary in the format below:

{
  "audio_data": "audio file converted to base64",
  "file_ext": "mp3" or "wav"
}

If your audio file is in another format or is a video, consider converting it to one of the supported formats (mp3 or wav) using a tool such as ffmpeg.
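A minimal sketch of such a conversion in Python, assuming ffmpeg is installed and available on PATH. The helper names ffmpeg_command and convert are hypothetical; only the standard ffmpeg options -y (overwrite output), -i (input file), and -vn (drop the video stream) are used.

```python
import subprocess
from pathlib import Path

SUPPORTED_EXTS = ("mp3", "wav")  # formats this endpoint accepts, per the text above


def ffmpeg_command(src: str, target_ext: str = "mp3") -> list:
    """Build the ffmpeg argument list that converts `src` to a supported format."""
    if target_ext not in SUPPORTED_EXTS:
        raise ValueError(f"unsupported target format: {target_ext}")
    dst = str(Path(src).with_suffix("." + target_ext))
    if dst == src:
        raise ValueError("source file already has the target extension")
    # -y overwrites an existing output file; -vn discards any video stream.
    return ["ffmpeg", "-y", "-i", src, "-vn", dst]


def convert(src: str, target_ext: str = "mp3") -> str:
    """Run ffmpeg and return the path of the converted file."""
    cmd = ffmpeg_command(src, target_ext)
    subprocess.run(cmd, check=True)  # raises CalledProcessError if ffmpeg fails
    return cmd[-1]
```

For example, convert("talk.mp4") would write talk.mp3 next to the input file.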

Exactly one of the following two body parameters is required:

audio_data (this parameter)
audio_url

Example

{
  "audio_data": "SUQzAwAAAAAAU1RTU0UAAAAPAAADTGF2ZjU2LjIzLjEwMAAAAAAAAAAAAAAA//tAxAAAAAAA",
  "file_ext": "mp3"
}
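One way to produce these two fields from a local file is sketched below in Python. The helper names are hypothetical, and the sketch follows the example above literally by placing audio_data and file_ext side by side; adjust it if your integration expects a different nesting.

```python
import base64
from pathlib import Path

SUPPORTED_EXTS = ("mp3", "wav")  # formats listed in the documentation above


def audio_data_fields(audio_bytes: bytes, file_ext: str) -> dict:
    """Base64-encode raw audio bytes into the two fields shown in the example."""
    if file_ext not in SUPPORTED_EXTS:
        raise ValueError(f"unsupported file_ext: {file_ext}")
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {"audio_data": encoded, "file_ext": file_ext}


def audio_data_from_file(path: str) -> dict:
    """Read a local mp3 or wav file and build the fields from its contents."""
    p = Path(path)
    # Derive file_ext from the file name, e.g. "clip.MP3" -> "mp3".
    return audio_data_fields(p.read_bytes(), p.suffix.lstrip(".").lower())
```

The resulting dictionary can be merged into the request body alongside text and model.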

audio_url

string

A public URL to a hosted audio file. You can use services such as Azure Storage to obtain a public link to your audio file and pass it here. Only mp3 or wav files are supported. If your audio file is in another format or is a video, consider converting it to one of the supported formats (mp3 or wav) using a tool such as ffmpeg.

Exactly one of the following two body parameters is required:

audio_data
audio_url (this parameter)

Example: https://longstoragevoila.blob.core.windows.net/long/zaiye.mp3
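The "exactly one of audio_data or audio_url" rule can be checked on the client before a request is sent. The sketch below is one possible check; check_audio_source is a hypothetical helper name, and how the server itself reports a violation is not specified here.

```python
def check_audio_source(payload: dict) -> str:
    """Return which audio source the payload uses, or raise if the rule is broken.

    Exactly one of "audio_data" and "audio_url" must be present and non-empty.
    """
    has_data = bool(payload.get("audio_data"))
    has_url = bool(payload.get("audio_url"))
    if has_data == has_url:  # both set, or neither set
        raise ValueError("provide exactly one of audio_data or audio_url")
    return "audio_data" if has_data else "audio_url"
```

Running this check first avoids a round trip that would likely end in a 400 Bad Request.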

model

required

string enum

The Whisper model to use for the alignment.

Example: large-v2

Allowed values:

tiny
base
small
medium
large-v2
large-v3
distil-large-v2
distil-large-v3
large-v2-mix-jp
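Because model is a required enum, a client can reject unknown names early. The following sketch copies the allowed values from the list above; check_model is a hypothetical helper name.

```python
# Allowed values copied from the enum list above.
MODELS = (
    "tiny", "base", "small", "medium",
    "large-v2", "large-v3",
    "distil-large-v2", "distil-large-v3",
    "large-v2-mix-jp",
)


def check_model(name: str) -> str:
    """Return `name` unchanged if it is an allowed model, otherwise raise."""
    if name not in MODELS:
        raise ValueError(f"unknown model {name!r}; expected one of {', '.join(MODELS)}")
    return name
```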

language

string

The language of the audio file. Auto-detects if not provided.

Example: en

prepend_punctuations

string

Punctuation marks to prepend to the next word.

append_punctuations

string

Punctuation marks to append to the previous word.

regroup

boolean

Whether to regroup the transcription segments.

Example: true

suppress_silence

boolean

Whether to suppress silence in the transcription.

Example: true

suppress_word_ts

boolean

Whether to suppress word timestamps in the transcription.

Example: true

use_word_position

boolean

Whether to use word positions in the transcription.

Example: true

q_levels

integer

The number of quantization levels used to generate the timestamp suppression mask.

Example: 2

k_size

integer

The kernel size for average-pooling the waveform when generating the timestamp suppression mask.

Example: 3

denoiser

string

The denoiser to use for the transcription.

Example: default_denoiser

denoiser_options

object

Options for the denoiser.

vad

boolean

Whether to use voice activity detection.

Example: true

vad_threshold

number float

Threshold for voice activity detection.

Example: 0.5

min_word_dur

number float

Minimum duration of words in the transcription.

Example: 0.2

min_silence_dur

number float

Minimum duration of silence in the transcription.

Example: 0.5

nonspeech_error

boolean

Whether to treat nonspeech as an error.

Example: true

only_voice_freq

boolean

Whether to use only voice frequency.

Example: true

text

required

string

The text to align with the audio.

remove_instant_words

boolean

Whether to remove instant words from the alignment.

Example: true

token_step

integer

Step size for token alignment.

Example: 1

original_split

boolean

Whether to use original split for alignment.

Example: true

max_word_dur

number float

Maximum duration of words in the alignment.

Example: 1.5

nonspeech_skip

boolean

Whether to skip nonspeech segments during alignment.

Example: true

fast_mode

boolean

Whether to enable fast mode for alignment.

Example: true

stream

boolean

Whether to stream the alignment results.

Example: true

failure_threshold

number float

Threshold for considering alignment as a failure.

Example: 0.2

presplit

boolean

Whether to presplit the text for alignment.

Example: true

gap_padding

number float

Padding for gaps in the alignment.

Example: 0.1

Responses

200: Successful response

400: Bad Request

POST /stablets/align

Shell cURL

curl --request POST \
  --url https://ytdlp-voilatech-apim.azure-api.net/v1/stablets/align \
  --header 'Content-Type: application/json' \
  --header 'api-key: YOUR_API_KEY' \
  --data '{
  "text": "Hello, how are you?",
  "model": "small",
  "audio_url": "https://longstoragevoila.blob.core.windows.net/long/zaiye.mp3"
}'
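The same request can be assembled in Python with only the standard library. This is a sketch under stated assumptions: build_align_request is a hypothetical helper, the URL is taken from the cURL example above, and optional parameters are forwarded as plain keyword arguments using the names documented on this page.

```python
import json
import urllib.request

# URL taken from the cURL example above.
ALIGN_URL = "https://ytdlp-voilatech-apim.azure-api.net/v1/stablets/align"


def build_align_request(api_key, text, model, audio_url=None,
                        audio_data=None, file_ext=None, **options):
    """Build (without sending) a POST request for /stablets/align."""
    if (audio_url is None) == (audio_data is None):
        raise ValueError("provide exactly one of audio_url or audio_data")
    payload = {"text": text, "model": model}  # the two required fields
    if audio_url is not None:
        payload["audio_url"] = audio_url
    else:
        if file_ext is None:
            raise ValueError("file_ext is required with audio_data")
        payload["audio_data"] = audio_data
        payload["file_ext"] = file_ext
    # Skip options left as None so the service's own defaults apply.
    payload.update({k: v for k, v in options.items() if v is not None})
    return urllib.request.Request(
        ALIGN_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Usage (performs a real network call, so it is left commented out):
# req = build_align_request("YOUR_API_KEY", "Hello, how are you?", "small",
#                           audio_url="https://longstoragevoila.blob.core.windows.net/long/zaiye.mp3")
# with urllib.request.urlopen(req) as resp:
#     job = json.load(resp)
```

Keeping request construction separate from sending makes the payload easy to inspect before any network call is made.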

{
  "id": "66bf0b2a-28c4-43a9-895c-96ec04aa49d1-e1",
  "status": "IN_QUEUE"
}

Successful response
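Since the job is queued rather than completed immediately, a client typically stores the returned id and polls for status. The sketch below parses the response shown above and waits while the job is still queued. The request format of the /stablets/status endpoint is not described on this page, so the status lookup is passed in as a caller-supplied function (fetch_status, a hypothetical name). IN_QUEUE is the only status value documented here; any other in-progress states the service reports would need to be added to pending.

```python
import json
import time


def parse_job(body: str) -> tuple:
    """Extract (id, status) from the JSON body of a successful response."""
    job = json.loads(body)
    return job["id"], job["status"]


def wait_for_job(job_id, fetch_status, pending=frozenset({"IN_QUEUE"}),
                 interval=5.0, max_polls=60, sleep=time.sleep):
    """Poll until the job's status is no longer in `pending`.

    `fetch_status` is a caller-supplied function that takes a job id and
    returns its current status string, e.g. by calling /stablets/status.
    """
    for _ in range(max_polls):
        status = fetch_status(job_id)
        if status not in pending:
            return status
        sleep(interval)  # wait before asking again
    raise TimeoutError(f"job {job_id} still pending after {max_polls} polls")
```

Passing sleep as a parameter keeps the loop easy to test without real delays.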
