/2025/07/11/video-generate-design-notes

Design Notes on Video Generate

Video Generate started from a simple question: how much of the short-video production loop can be made explicit enough for software to handle?

I did not want it to be a magic box that says “AI video” and hides every decision. The useful shape is more mechanical: text becomes a script, the script becomes image prompts, images become material for a generated video, and the final result is stored somewhere that can be downloaded and inspected.

1. Keep the Pipeline Visible

The core design idea is a pipeline, not a chat window.

Short-video generation has several different jobs inside it:

  • understand the source text
  • decide the structure of the video
  • write a script
  • generate matching images
  • assemble the video
  • return a result the user can actually use

If these steps are hidden behind one vague button, debugging becomes difficult. When the output is bad, I need to know which stage failed. Was the script weak? Were the image prompts too generic? Did the media service fail? Did the storage URL break?

So the interface and backend both need to respect the same structure. A one-click flow can still exist, but internally it should remain a sequence of observable steps.
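
As a rough illustration of what "a sequence of observable steps" could look like, here is a minimal sketch. It is not the project's actual code: the `Pipeline` and `StageResult` names are hypothetical, and each stage is assumed to be a plain function that takes the previous stage's output.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class StageResult:
    name: str                    # which stage ran
    ok: bool                     # whether it finished without raising
    output: Any = None           # what it produced, kept for inspection
    error: Optional[str] = None  # error message if it failed


@dataclass
class Pipeline:
    # Each stage is a (name, function) pair (hypothetical shape).
    stages: List[Tuple[str, Callable[[Any], Any]]]
    history: List[StageResult] = field(default_factory=list)

    def run(self, source: Any) -> Any:
        value = source
        for name, fn in self.stages:
            try:
                value = fn(value)
            except Exception as exc:
                # Record the failing stage before re-raising, so the
                # history answers "which stage failed?" directly.
                self.history.append(StageResult(name, False, error=str(exc)))
                raise RuntimeError(f"stage {name!r} failed: {exc}") from exc
            self.history.append(StageResult(name, True, output=value))
        return value
```

A one-click flow would call `run()` once, while the UI or a log reads `history` to show which stage produced what.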

2. Use AI for Drafting, Not Ownership

The AI part should reduce blank-page work. It should not take ownership away from the user.

For this kind of project, I care less about whether the model can produce something surprising, and more about whether it can produce something structured. A good generated script is not just fluent text. It has scenes, timing, visual direction, and enough constraints for the next step to use.

That is why the design leans toward a service layer instead of scattered prompt calls. The application should treat model output as intermediate data, then pass it into image and video generation in a controlled way.
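
To make "model output as intermediate data" concrete, the sketch below parses a script into typed scenes and rejects anything the next stage could not use. The JSON shape and the field names (`narration`, `visual_prompt`, `duration_s`) are assumptions for illustration, not the format the project actually uses.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Scene:
    narration: str      # text spoken or shown in this scene
    visual_prompt: str  # direction handed to image generation
    duration_s: float   # how long the scene should last, in seconds


def parse_script(raw: str) -> List[Scene]:
    """Turn raw model output into scenes, or raise ValueError."""
    data = json.loads(raw)  # invalid JSON raises a ValueError subclass
    items = data.get("scenes") if isinstance(data, dict) else None
    if not isinstance(items, list) or not items:
        raise ValueError("script has no scenes")

    scenes = []
    for i, item in enumerate(items):
        try:
            scene = Scene(
                narration=str(item["narration"]).strip(),
                visual_prompt=str(item["visual_prompt"]).strip(),
                duration_s=float(item["duration_s"]),
            )
        except (KeyError, TypeError, ValueError) as exc:
            raise ValueError(f"scene {i}: malformed fields ({exc!r})") from exc
        if not scene.narration or not scene.visual_prompt:
            raise ValueError(f"scene {i}: empty narration or visual prompt")
        if scene.duration_s <= 0:
            raise ValueError(f"scene {i}: duration must be positive")
        scenes.append(scene)
    return scenes
```

The point of validating here is that a weak or malformed script fails at the script stage, instead of surfacing later as a confusing image or video error.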

3. Separate Product Flow From Provider Details

DeepSeek, Alibaba Cloud image generation, ICE (Alibaba Cloud's media production service), and OSS (Object Storage Service) are provider-specific pieces. The user should not have to think about them while using the app.

But the code should still make those boundaries clear:

  • DeepSeek writes the script
  • image generation creates visual assets
  • ICE assembles the video
  • OSS stores the generated media

This separation matters because cloud APIs change, credentials expire, and generated media workflows tend to fail partway through: the images may be ready while the video job has timed out. If the code treats everything as one large action, small failures become hard to repair.
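
One way to express these boundaries in code is to give each provider role its own small interface and let the product flow depend only on those. The sketch below is an assumption about shape, not the real integration: the protocol names, method names, and the `generate_video` function are hypothetical, and no actual DeepSeek, ICE, or OSS SDK calls are shown.

```python
from typing import List, Protocol


class ScriptWriter(Protocol):      # role played by DeepSeek
    def write_script(self, source_text: str) -> List[str]: ...


class ImageGenerator(Protocol):    # role played by image generation
    def generate(self, prompt: str) -> bytes: ...


class VideoAssembler(Protocol):    # role played by ICE
    def assemble(self, images: List[bytes]) -> bytes: ...


class MediaStore(Protocol):        # role played by OSS
    def put(self, key: str, data: bytes) -> str: ...


def generate_video(
    text: str,
    writer: ScriptWriter,
    images: ImageGenerator,
    assembler: VideoAssembler,
    store: MediaStore,
    job_id: str,
) -> str:
    """Run the product flow against whatever providers are passed in."""
    prompts = writer.write_script(text)              # one prompt per scene
    frames = [images.generate(p) for p in prompts]   # one image per prompt
    video = assembler.assemble(frames)
    # The returned value is whatever URL the store reports for the object.
    return store.put(f"{job_id}/final.mp4", video)
```

With this shape, swapping a provider or testing with in-memory fakes means writing one new class, and the flow itself does not change.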

4. Prefer a Small Web App Over a Heavy Studio

I wanted the app to feel closer to a task tool than a video editor. The user brings text, starts generation, watches progress, and downloads the result.

That choice keeps the scope honest. A full editor would need timelines, manual trimming, asset libraries, preview states, undo, and export settings. Those are real features, but they would change the project into something else.

The useful first version is narrower: make the automated path complete before adding manual editing controls.

5. Make Progress a First-Class State

Video generation is not instant. If the page only waits silently, the product feels broken even when the backend is still working.

Progress display is not decoration here. It is part of the contract. The user needs to know that the system accepted the input, is generating a script, is creating assets, is waiting for video output, and has either completed or failed.

That also helps development. When the UI names the state a job was in when it failed, the backend failure is easier to locate and reproduce.
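
The states listed above can be written down as an explicit state machine. This is a minimal sketch under my own naming: `JobState`, `ALLOWED`, and `advance` are hypothetical, and the real app may track more or different states.

```python
from enum import Enum


class JobState(Enum):
    ACCEPTED = "accepted"
    WRITING_SCRIPT = "writing_script"
    CREATING_ASSETS = "creating_assets"
    WAITING_FOR_VIDEO = "waiting_for_video"
    COMPLETED = "completed"
    FAILED = "failed"


# Every working state can move one step forward or fail.
# COMPLETED and FAILED are terminal.
ALLOWED = {
    JobState.ACCEPTED: {JobState.WRITING_SCRIPT, JobState.FAILED},
    JobState.WRITING_SCRIPT: {JobState.CREATING_ASSETS, JobState.FAILED},
    JobState.CREATING_ASSETS: {JobState.WAITING_FOR_VIDEO, JobState.FAILED},
    JobState.WAITING_FOR_VIDEO: {JobState.COMPLETED, JobState.FAILED},
    JobState.COMPLETED: set(),
    JobState.FAILED: set(),
}


def advance(current: JobState, new: JobState) -> JobState:
    """Return the new state, or raise if the transition is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new
```

Because the backend can only report states from this list, the progress display and the failure logs end up using the same vocabulary.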

Current Rule

For Video Generate, the design rule is:

turn text into a visible production pipeline

The project is useful when it makes the hidden steps of AI video generation concrete enough to inspect, retry, and improve.