Glossary

The AI video dictionary

Every term you'll meet making video with AI — from camera and codec basics to diffusion, avatars and agentic generation — defined in plain English.

74 terms · Video · AI · AI video

Agentic video · AI video: An AI agent that plans and runs the whole production — script, scenes, voice, avatars and edits — from a single brief, rather than one clip at a time.
AI avatar · AI video: An AI-generated or cloned on-screen presenter that can speak your script in a chosen voice and language.
AI dubbing · AI video: Replacing or adding spoken audio in another language, ideally matched to the speaker's voice and lip movements.
Aspect ratio · Video: The width-to-height proportion of the frame — 16:9 (widescreen), 9:16 (vertical for Reels and TikTok), or 1:1 (square). It decides how your video fits each platform and screen.
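
To make the Aspect ratio entry concrete, here is a minimal sketch that reduces pixel dimensions to the simplest width:height ratio. The function name `aspect_ratio` is hypothetical, chosen only for this illustration.

```python
from math import gcd


def aspect_ratio(width: int, height: int) -> str:
    """Reduce pixel dimensions to the simplest width:height ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    divisor = gcd(width, height)  # largest number dividing both sides
    return f"{width // divisor}:{height // divisor}"


print(aspect_ratio(1920, 1080))  # 16:9
print(aspect_ratio(1080, 1920))  # 9:16
print(aspect_ratio(1080, 1080))  # 1:1
```

The same resolution rotated by 90 degrees gives the inverted ratio, which is why a 1080×1920 export suits vertical platforms.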

B-roll · Video: Supplemental footage cut in over the main shot to add context, illustrate a point, or hide an edit.
Bitrate · Video: How much data is used per second of video, measured in kbps or Mbps. A higher bitrate keeps more detail but makes larger files.
Bokeh · Video: The soft, pleasing out-of-focus blur in the background of a shot, often rendered as glowing circles of light.
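
As a rough worked example of the Bitrate entry, the sketch below estimates file size from bitrate and duration. It assumes a constant bitrate, ignores audio and container overhead, and treats one megabyte as 8 megabits; `estimated_size_mb` is a hypothetical helper name.

```python
def estimated_size_mb(bitrate_mbps: float, duration_s: float) -> float:
    """Megabits per second times seconds gives megabits; divide by 8 for megabytes."""
    if bitrate_mbps < 0 or duration_s < 0:
        raise ValueError("bitrate and duration must be non-negative")
    return bitrate_mbps * duration_s / 8


# One minute of video at 8 Mbps
print(estimated_size_mb(8, 60))  # 60.0
```

Doubling the bitrate doubles the estimate, which is the trade-off the entry describes: more detail, larger files.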

Camera control · AI video: Directing virtual camera moves — pan, zoom, orbit, dolly — within an AI-generated shot.
Captions / subtitles · Video: On-screen text of the spoken audio. Captions also note sounds and speakers for accessibility; subtitles usually transcribe or translate the dialogue.
Checkpoint · AI: A saved snapshot of a model's weights. Checkpoints are often shared as the downloadable 'model file' people run.
Chroma key (green screen) · Video: Replacing a solid-coloured background — usually green — with another image or video by making that colour transparent.
Codec · Video: The algorithm that compresses and decompresses video — such as H.264, H.265/HEVC, AV1 or VP9. It balances visual quality against file size.
Colour grading · Video: The creative step of adjusting colour, contrast and mood of footage in post-production to give it a consistent, intentional look.
Compositing · Video: Layering multiple visual elements — footage, graphics, effects, text — into a single combined frame.
Container (file format) · Video: The file wrapper that holds the video, audio and metadata together — MP4, MOV, WebM or MKV. It is separate from the codec stored inside it.

Dataset · AI: The collection of examples — videos, images, text — that a model is trained on. Its quality and variety shape what the model can do.
Deep learning · AI: Machine learning that uses many-layered neural networks. It powers today’s image, video, voice and language models.
Deepfake · AI video: Synthetic media that realistically swaps or fabricates someone's face or voice. Powerful, but it raises real consent, authenticity and legal concerns.
Depth of field · Video: How much of the image is in sharp focus. A shallow depth of field blurs the background to make the subject pop.
Diffusion model · AI: The dominant approach behind AI images and video: the model starts from random noise and, step by step, removes it until a coherent result matching your prompt appears.
Digital human / digital twin · AI video: A photorealistic AI replica of a real person, trained once and reused as an on-camera presenter.

Embedding · AI: A list of numbers (a vector) that captures the meaning of text, an image or audio, so the model can compare and combine different inputs.
Establishing shot · Video: A wide opening shot that sets the location and context of a scene before you cut in closer.
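
The Embedding entry says vectors let a model compare inputs. One common way to compare two vectors is cosine similarity, sketched below. The three-number vectors are made up for illustration; real embeddings come from a model and usually have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for real embeddings (assumption, not model output)
sunset_text = [0.9, 0.1, 0.3]
sunset_image = [0.8, 0.2, 0.4]
spreadsheet_text = [0.1, 0.9, 0.0]

print(cosine_similarity(sunset_text, sunset_image))
print(cosine_similarity(sunset_text, spreadsheet_text))
```

A score near 1 means the vectors point in nearly the same direction; a score near 0 means they are unrelated. In this toy data the two sunset vectors score higher against each other than against the spreadsheet vector.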

Fine-tuning · AI: Further training a base model on specific data to specialise it — for a particular style, brand or person.
First & last frame · AI video: Supplying a start frame and/or an end frame that the model animates between, giving you precise control over a shot's beginning and end.
Foundation model · AI: A large, general-purpose model trained on broad data that can be adapted to many downstream tasks.
Frame interpolation · AI video: Generating in-between frames to raise the frame rate or smooth motion — for example turning 24fps into silky 60fps.
Frame rate (FPS) · Video: Frames per second — how many still images play each second. 24fps feels cinematic, 30fps is standard for the web, and 60fps looks ultra-smooth for motion and sports.

GAN · AI: Generative Adversarial Network — an earlier method where a generator network and a discriminator (critic) network compete. Largely replaced by diffusion for high-quality video.
Guidance scale (CFG) · AI: How strictly the model follows your prompt versus improvising. Higher values stick closer to the words; lower values give the model more creative freedom.
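
CFG stands for classifier-free guidance. In its commonly used form, the model makes two noise predictions at each step, one with the prompt and one without, and the guidance scale controls how far the result is pushed toward the prompted one. The sketch below shows that arithmetic on plain lists; `apply_guidance` and the sample numbers are illustrative assumptions, and real tools apply this to large tensors inside the sampler.

```python
def apply_guidance(uncond, cond, scale):
    """Blend unconditional and prompt-conditioned predictions.

    guided = uncond + scale * (cond - uncond)
    """
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]  # made-up prediction without the prompt
cond = [1.0, 3.0]    # made-up prediction with the prompt

print(apply_guidance(uncond, cond, 1.0))  # [1.0, 3.0]
print(apply_guidance(uncond, cond, 2.0))  # [2.0, 5.0]
```

A scale of 1 simply returns the prompted prediction; larger values exaggerate the difference the prompt makes, which is why high settings follow the words more literally.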

Hallucination · AI: When a model produces confident output that is wrong or invented — like garbled text, extra fingers, or impossible motion.
HDR (High Dynamic Range) · Video: Video that carries a wider range of brightness and colour than standard (SDR), for more lifelike highlights, shadows and richer tones.

Image-to-video (I2V) · AI video: Bringing a still image to life as a video, often guided by a prompt that describes the motion you want.
Inference · AI: Running an already-trained model to produce an output — for example, generating your video from a prompt. This is what you pay for per generation.
Inpainting / outpainting · AI video: Filling in part of a frame (inpainting) or extending beyond its edges (outpainting). In video, used to remove, replace or expand regions over time.

Keyframe · Video: In editing, a marked frame that sets a value (position, scale, opacity) the software animates between. In compression, a full reference frame that nearby frames are rebuilt from.
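
The editing sense of Keyframe can be sketched as interpolation between marked values. The example below uses straight linear interpolation and assumes keyframes are sorted with strictly increasing frame numbers; `value_at` is a hypothetical helper, and real editors typically offer easing curves as well.

```python
def value_at(keyframes, frame):
    """Linearly interpolate a value at `frame`.

    keyframes: list of (frame_number, value) pairs, sorted by frame_number.
    Frames before the first or after the last keyframe hold the nearest value.
    """
    if not keyframes:
        raise ValueError("at least one keyframe is required")
    if frame <= keyframes[0][0]:
        return keyframes[0][1]
    if frame >= keyframes[-1][0]:
        return keyframes[-1][1]
    for (f0, v0), (f1, v1) in zip(keyframes, keyframes[1:]):
        if f0 <= frame <= f1:
            t = (frame - f0) / (f1 - f0)  # progress through this segment, 0 to 1
            return v0 + t * (v1 - v0)


# Fade opacity from 0.0 at frame 0 to 1.0 at frame 24
opacity_keys = [(0, 0.0), (24, 1.0)]
print(value_at(opacity_keys, 12))  # 0.5
```

Two keyframes are enough to describe the whole fade; the software fills in every frame between them.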

Latent space · AI: A compressed mathematical representation where the model actually works. Generation happens here first, then gets decoded into visible pixels.
Letterboxing · Video: Black bars added above and below a video so it fits a different aspect ratio without cropping the picture. Bars added at the sides instead are called pillarboxing.
Lip sync · AI video: Matching a character or avatar's mouth movements to spoken audio so it looks like they are really saying the words.
LoRA · AI: Low-Rank Adaptation — a lightweight way to teach a model a new style, character or concept with a small add-on file, instead of retraining the whole model.
Lower third · Video: Text placed in the lower part of the frame, typically a speaker's name and title, or a caption.
LUT (Look-Up Table) · Video: A preset that remaps colours to apply a specific look in one click, or to convert footage between colour spaces.
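
To see why a LoRA file is small, it helps to count numbers. LoRA leaves a layer's original weight matrix frozen and learns two thin matrices whose product is added to it. The sketch below compares the counts for a single layer; the 4096×4096 layer size and rank 8 are illustrative assumptions, not figures from any particular model.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Numbers in one full weight matrix of shape (d_out, d_in)."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Numbers in the two thin LoRA matrices: (d_out, rank) and (rank, d_in)."""
    return rank * (d_out + d_in)


d_out, d_in, rank = 4096, 4096, 8  # illustrative sizes only

print(full_params(d_out, d_in))        # 16777216
print(lora_params(d_out, d_in, rank))  # 65536
print(full_params(d_out, d_in) // lora_params(d_out, d_in, rank))  # 256
```

Under these assumptions the add-on stores 256 times fewer numbers for that layer than retraining the full matrix would.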

Model · AI: A trained AI system that turns an input — like a text prompt — into an output, like a video. Different models have different strengths, speeds and prices.
Motion control / motion brush · AI video: Tools that let you direct where and how things move in a generated clip, instead of leaving it entirely to the model.
Multimodal · AI: A model that understands or generates more than one type of data at once — for example text, image, video and audio together.

Negative prompt · AI: A description of what you do NOT want in the output. It steers the model away from unwanted objects, styles or artefacts.
Neural network · AI: A model loosely inspired by the brain: layers of connected 'neurons' that learn patterns from data. It is the foundation of modern generative AI.

Open-weight model · AI: A model whose weights are published so anyone can run, study or fine-tune it (e.g. on fal or locally), as opposed to a closed model reached only through an API.

Parameters (weights) · AI: The internal numbers a model learns during training. They store what the model 'knows'; more parameters can mean more capability.
Prompt · AI: The instruction you give the model — usually text, sometimes plus an image — describing the video you want it to create.
Prompt engineering · AI: The craft of wording prompts so the model reliably produces the result you intend, including subject, style, camera and mood.

Reference image · AI video: An image you give the model to guide the subject, character or style of the generated video.
Render / rendering · Video: Processing a project into a finished video file — or, in AI, the model generating frames into a final clip.
Resolution · Video: The pixel dimensions of each frame, written width × height (e.g. 1920×1080). More pixels means more detail. Common tiers are 720p (HD), 1080p (Full HD), 4K and 8K.
RLHF · AI: Reinforcement Learning from Human Feedback — training that uses people's preferences to align a model's outputs with what humans actually want.

Sampling steps · AI: How many iterations a diffusion model takes to turn noise into the final frame. More steps can mean higher quality but slower, costlier generation.
Seed · AI: The starting random number for a generation. Reusing the same seed with the same prompt, model and settings usually reproduces the same result — handy for consistency and small tweaks.
Shot · Video: A single continuous piece of footage. Common types include the wide shot, the medium shot and the close-up.
Storyboard · Video: A planned sequence of sketches or frames mapping out each shot before you produce or generate a video.
Style transfer · AI video: Applying the visual style of one reference to your own footage or generation.
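
The Seed entry can be illustrated with Python's standard random number generator. This is a toy stand-in, not a video model: it only shows that the same seed yields the same starting noise, which is the property generators rely on. `initial_noise` is a hypothetical name.

```python
import random


def initial_noise(seed: int, length: int = 4):
    """Return a short list of Gaussian noise values determined by the seed."""
    rng = random.Random(seed)  # private generator, unaffected by other code
    return [rng.gauss(0.0, 1.0) for _ in range(length)]


same_a = initial_noise(42)
same_b = initial_noise(42)
other = initial_noise(43)

print(same_a == same_b)  # True
print(same_a == other)   # False
```

In a real generator the starting noise is only one ingredient, so changing the model version, step count or other settings can still change the result even when the seed is reused.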

Talking head · AI video: A video centred on a person speaking to camera — the classic use case for AI avatars and presenters.
Temporal consistency · AI video: Keeping characters, objects and style stable from frame to frame so the video doesn't flicker, warp or morph unnaturally.
Text-to-speech (TTS) · AI video: Turning written text into natural spoken audio with a synthetic voice — the engine behind AI voiceovers.
Text-to-video (T2V) · AI video: Generating a video clip directly from a written description — no camera, actors or stock footage required.
Token · AI: The smallest chunk of input a model processes — a piece of a word for text, or a patch or frame for video.
Training · AI: Teaching a model by showing it huge amounts of data and gradually adjusting its internal parameters until it produces good results.
Transformer · AI: A neural-network architecture built on 'attention,' which weighs how parts of the input relate. It underpins large language models and many modern video models.
Transition · Video: How one shot changes into the next — a hard cut, a dissolve, a fade, or a wipe.

Upscaling · AI video: Using AI to increase a video’s resolution — say 1080p to 4K — adding plausible detail rather than just stretching pixels.

Video extension · AI video: Continuing a clip past its original length by generating additional frames that follow on naturally.
Video-to-video (V2V) · AI video: Transforming an existing clip into a new style or look while preserving its original motion and timing.
Voice cloning · AI video: Recreating a specific person's voice from a short sample so it can speak new text in that same voice.

Watermark · Video: A logo or text overlaid on a video to mark ownership. Many free AI generators add one; paid plans usually remove it.
World model · AI video: An AI that builds an internal simulation of how scenes, objects and physics behave, helping it generate longer, more coherent and consistent video.

From prompt to render: the language of AI video

Making video with AI sits at the crossroads of two worlds — decades of filmmaking and video-production vocabulary, and the fast-moving language of machine learning. This dictionary brings both together, plus the new terms unique to generative video, so you can read any tool, tutorial or model card with confidence.

Start with the fundamentals: aspect ratio, resolution, frame rate and codecs decide how your video looks and where it plays. Then comes the AI layer: models, diffusion, prompts, seeds and LoRAs shape what gets generated. Finally come the AI-video specifics: text-to-video, image-to-video, lip sync, avatars, temporal consistency and world models describe what today’s generators can actually do.

Every definition is written in plain English, no maths required. Search by keyword, filter by topic, or browse A–Z — then put the vocabulary to work in the Vivideo studio.
