1.0
OAS 3.0.1

AI APIs

The most powerful models at your fingertips.

These endpoints require an API key, which you can obtain from our website. Pass the key in the api-key header, for example api-key: YOUR_API_KEY.

API Key (api_key)

Whisper Stable-ts align

Whisper (https://github.com/openai/whisper) is a general-purpose speech recognition model. This endpoint uses Stable-ts (https://github.com/jianfch/stable-ts) and Faster-Whisper (https://github.com/SYSTRAN/faster-whisper) to improve on the speed and timestamp accuracy of the original Whisper. It currently uses stable-ts version 2.17.3.

This endpoint creates an asynchronous job that aligns pre-transcribed text (without timestamps) to the provided audio file. Jobs are placed in a queue and processed in the order they are received. Check a job's status with the /stablets/status endpoint.
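Because alignment runs as a queued job, a client typically submits the request and then polls until the job leaves the queue. The Python sketch below shows one way to structure that loop. It is an illustration only: fetch_status is a hypothetical callable that you would write around the /stablets/status endpoint (its request format is not described in this section), and IN_QUEUE is the only status value shown in this reference, so any other pending values must be supplied by the caller.

```python
import time
from typing import Callable, Iterable


def wait_for_job(
    fetch_status: Callable[[str], dict],
    job_id: str,
    pending: Iterable[str] = ("IN_QUEUE",),
    interval_s: float = 2.0,
    max_polls: int = 150,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job's status is no longer one of the pending values.

    fetch_status is a hypothetical wrapper around /stablets/status that
    returns a dict containing at least a "status" key.
    """
    pending_set = set(pending)
    for _ in range(max_polls):
        job = fetch_status(job_id)
        if job.get("status") not in pending_set:
            return job  # the job has left every state we treat as pending
        sleep(interval_s)
    raise TimeoutError(f"job {job_id} still pending after {max_polls} polls")
```

Passing sleep as a parameter keeps the loop testable without real delays. Capping the number of polls avoids waiting forever on a job that never leaves the queue.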

Body
application/json
audio_data
string

Pass audio_data as a dictionary in the format below, where file_ext is either "mp3" or "wav":

{
  "audio_data": "audio file converted to base64",
  "file_ext": "mp3"
}

If your audio file is in another format or is a video, consider converting it to one of the supported formats (mp3 or wav) using a tool such as ffmpeg.

Exactly one of the following two body parameters is required:

  1. audio_data (this parameter)
  2. audio_url
Example: { "audio_data": "SUQzAwAAAAAAU1RTU0UAAAAPAAADTGF2ZjU2LjIzLjEwMAAAAAAAAAAAAAAA//tAxAAAAAAA", "file_ext": "mp3" }
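As an illustration of the base64 step, the Python sketch below reads a local mp3 or wav file and produces a dictionary with the two keys shown above. The helper name encode_audio_file is hypothetical. Whether this dictionary is sent nested under audio_data or merged into the top level of the request body is not spelled out here, so verify that against a live request.

```python
import base64
from pathlib import Path

# The two formats this reference lists as supported.
SUPPORTED_EXTENSIONS = {"mp3", "wav"}


def encode_audio_file(path: str) -> dict:
    """Return {"audio_data": <base64 text>, "file_ext": <"mp3" or "wav">}."""
    file_path = Path(path)
    file_ext = file_path.suffix.lstrip(".").lower()
    if file_ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(
            f"unsupported extension {file_ext!r}; convert to mp3 or wav first"
        )
    # b64encode returns bytes; decode to str so the value is JSON-serializable.
    encoded = base64.b64encode(file_path.read_bytes()).decode("ascii")
    return {"audio_data": encoded, "file_ext": file_ext}
```

The extension check happens before the file is read, so an unsupported format fails fast without loading the whole file into memory.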
audio_url
string

A public URL to a hosted audio file. You can use services such as Azure Storage to obtain a public link to your audio file and pass it here. Only mp3 or wav files are supported. If your audio file is in another format or is a video, consider converting it to one of the supported formats (mp3 or wav) using a tool such as ffmpeg.

Exactly one of the following two body parameters is required:

  1. audio_data
  2. audio_url (this parameter)
Example: https://longstoragevoila.blob.core.windows.net/long/zaiye.mp3
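The "exactly one of" rule for audio_data and audio_url can be checked on the client before a request is sent. The Python sketch below is one way to do that. The function name check_audio_source is hypothetical, and the check treats an empty or missing value as not provided, which is an assumption about how the server interprets such values.

```python
def check_audio_source(body: dict) -> str:
    """Return the audio field in use, or raise if neither or both are set."""
    # Empty strings and None count as "not provided" (an assumption).
    present = [key for key in ("audio_data", "audio_url") if body.get(key)]
    if len(present) != 1:
        raise ValueError(
            "provide exactly one of audio_data or audio_url, got: "
            + (", ".join(present) or "neither")
        )
    return present[0]
```

For example, check_audio_source({"audio_url": "https://example.com/clip.mp3"}) returns "audio_url", while a body containing both fields raises ValueError.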
model
required
string enum

The model to use for transcription.

Example: large-v2
Allowed values:
  • tiny
  • base
  • small
  • medium
  • large-v2
  • large-v3
  • distil-large-v2
  • distil-large-v3
  • large-v2-mix-jp
language
string

The language of the audio file. Auto-detects if not provided.

Example: en
prepend_punctuations
string

Punctuation characters to prepend to the next word.

append_punctuations
string

Punctuation characters to append to the previous word.

regroup
boolean

Whether to regroup the transcription segments.

Example: true
suppress_silence
boolean

Whether to suppress silence in the transcription.

Example: true
suppress_word_ts
boolean

Whether to suppress word timestamps in the transcription.

Example: true
use_word_position
boolean

Whether to use word positions in the transcription.

Example: true
q_levels
integer

The number of quantization levels used to generate the timestamp suppression mask.

Example: 2
k_size
integer

The kernel size used when average-pooling the waveform to generate the timestamp suppression mask.

Example: 3
denoiser
string

The denoiser to use for the transcription.

Example: default_denoiser
denoiser_options
object

Options for the denoiser.

vad
boolean

Whether to use voice activity detection.

Example: true
vad_threshold
number float

Threshold for voice activity detection.

Example: 0.5
min_word_dur
number float

Minimum duration of words in the transcription.

Example: 0.2
min_silence_dur
number float

Minimum duration of silence in the transcription.

Example: 0.5
nonspeech_error
boolean

Whether to treat nonspeech as an error.

Example: true
only_voice_freq
boolean

Whether to use only voice frequency.

Example: true
text
required
string

The text to align with the audio.

remove_instant_words
boolean

Whether to remove instant words from the alignment.

Example: true
token_step
integer

Step size for token alignment.

Example: 1
original_split
boolean

Whether to use original split for alignment.

Example: true
max_word_dur
number float

Maximum duration of words in the alignment.

Example: 1.5
nonspeech_skip
boolean

Whether to skip nonspeech segments during alignment.

Example: true
fast_mode
boolean

Whether to enable fast mode for alignment.

Example: true
stream
boolean

Whether to stream the alignment results.

Example: true
failure_threshold
number float

Threshold for considering alignment as a failure.

Example: 0.2
presplit
boolean

Whether to presplit the text for alignment.

Example: true
gap_padding
number float

Padding for gaps in the alignment.

Example: 0.1
Responses
  • 200

    Successful response

  • 400

    Bad Request

POST /stablets/align
Shell cURL
curl --request POST \
  --url https://ytdlp-voilatech-apim.azure-api.net/v1/stablets/align \
  --header 'Content-Type: application/json' \
  --header 'api-key: YOUR_API_KEY' \
  --data '{
  "text": "Hello, how are you?",
  "model": "small",
  "audio_url": "https://longstoragevoila.blob.core.windows.net/long/zaiye.mp3"
}'
{
  "id": "66bf0b2a-28c4-43a9-895c-96ec04aa49d1-e1",
  "status": "IN_QUEUE"
}
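For readers working in Python, the sketch below builds the same request with the standard library and parses a response shaped like the sample above. The base URL and body fields are taken from the cURL example, and the api-key header follows the authentication note at the top of this page. The function names build_align_request and parse_job are hypothetical, and the code builds the request without sending it.

```python
import json
import urllib.request
from typing import Tuple

# Base URL taken from the cURL example in this reference.
BASE_URL = "https://ytdlp-voilatech-apim.azure-api.net/v1"


def build_align_request(
    api_key: str, text: str, model: str, audio_url: str
) -> urllib.request.Request:
    """Build (but do not send) a POST /stablets/align request."""
    body = {"text": text, "model": model, "audio_url": audio_url}
    return urllib.request.Request(
        url=f"{BASE_URL}/stablets/align",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


def parse_job(raw: bytes) -> Tuple[str, str]:
    """Extract (id, status) from a response shaped like the sample above."""
    job = json.loads(raw)
    return job["id"], job["status"]
```

To send the request, pass the result of build_align_request to urllib.request.urlopen and hand the response bytes to parse_job. Keeping construction separate from sending makes the payload easy to inspect and test offline.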