    AI & Tech

Salesforce AI Introduces FOFPred: A Language-Driven Future Optical Flow Prediction Framework that Enables Improved Robot Control and Video Generation

By Naveed Ahmad · January 21, 2026 · 7 Mins Read


The Salesforce AI research team presents FOFPred, a language-driven future optical flow prediction framework that connects large vision-language models with diffusion transformers for dense motion forecasting in control and video generation settings. FOFPred takes several input images and a natural language instruction such as ‘moving the bottle from right to left’ and predicts 4 future optical flow frames that describe how each pixel is expected to move over time.

Paper: https://arxiv.org/pdf/2601.10781

Future optical flow as a motion representation

Optical flow is the apparent per-pixel displacement between two frames. FOFPred focuses on future optical flow, which means predicting dense displacement fields for future frames given only current observations and text, without access to future images at inference.

Future optical flow is a compact, motion-only representation. It removes static appearance and retains only pixel-level motion, so it is well suited as an intermediate state for robot control policies and as a conditioning signal for video diffusion models. Compared to predicting future RGB frames, it reduces the complexity of the output distribution and avoids modeling textures and high-frequency details that are not required for motion planning.

To plug into existing latent diffusion infrastructure, the research team encodes optical flow as RGB images. They map flow magnitude and direction from polar form into HSV channels, then convert to RGB. The scaling of each channel is tuned so that consecutive flow frames are visually smooth and resemble animated graphics. A standard Flux.1 variational autoencoder then encodes and decodes these flow images.
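This flow-to-RGB step is essentially the standard HSV flow visualization. The sketch below shows one way to map magnitude and direction into HSV channels with OpenCV and NumPy; the paper tunes the per-channel scaling for the Flux.1 VAE, so the `max_mag` normalization here is only an illustrative assumption.

```python
import numpy as np
import cv2

def flow_to_rgb(flow: np.ndarray, max_mag: float = 20.0) -> np.ndarray:
    """Encode a dense optical flow field (H, W, 2, float32) as an RGB image.

    Magnitude and direction are mapped into HSV channels and then converted
    to RGB, following the standard flow-visualization scheme. The scaling is
    illustrative; the paper tunes it so consecutive flow frames look smooth.
    """
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1], angleInDegrees=True)
    hsv = np.zeros((*flow.shape[:2], 3), dtype=np.uint8)
    hsv[..., 0] = (ang / 2).astype(np.uint8)            # hue encodes direction (OpenCV hue range 0-179)
    hsv[..., 1] = 255                                   # saturation fixed at maximum
    hsv[..., 2] = np.clip(mag / max_mag * 255, 0, 255)  # value encodes magnitude
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)
```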

Unified VLM-Diffusion backbone

FOFPred uses a unified architecture that combines a frozen vision-language model, a frozen VAE and a trainable diffusion transformer. The pipeline is:

    • Qwen2.5-VL is used as the vision-language encoder to jointly encode the caption and visual inputs.
    • The Flux.1 VAE encodes the input images and the training optical flow targets into latent tensors.
    • An OmniGen-style diffusion transformer, DiT, takes projected visual and textual features as conditional inputs and generates latent future flow sequences.

Only the DiT and small MLP projectors are trained. The Qwen2.5-VL and Flux.1 weights stay frozen, which lets the model reuse image editing pretraining and multimodal reasoning ability from prior work. Temporal modeling is added by extending the RoPE positional encoding and attention blocks from two-dimensional spatial positions to full spatio-temporal positions across input and output frame sequences. This provides full spatio-temporal attention without adding extra parameters, so the DiT can reuse OmniGen image pretraining directly.
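A minimal PyTorch sketch of this freezing scheme and conditioning path is given below. Here `vlm`, `vae`, and `dit` are placeholder modules standing in for Qwen2.5-VL, the Flux.1 VAE, and the OmniGen-style DiT; the projector sizes and call signatures are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class FOFPredBackbone(nn.Module):
    """Sketch of the unified VLM-Diffusion backbone: frozen VLM and VAE,
    trainable DiT plus a small MLP projector."""

    def __init__(self, vlm: nn.Module, vae: nn.Module, dit: nn.Module,
                 vlm_dim: int = 3584, dit_dim: int = 3072):
        super().__init__()
        self.vlm, self.vae, self.dit = vlm, vae, dit
        # Only the DiT and this small MLP projector receive gradients.
        self.proj = nn.Sequential(nn.Linear(vlm_dim, dit_dim), nn.GELU(),
                                  nn.Linear(dit_dim, dit_dim))
        for frozen in (self.vlm, self.vae):      # VLM and VAE stay frozen
            for p in frozen.parameters():
                p.requires_grad_(False)

    def forward(self, images, text_tokens, noisy_flow_latents, timestep):
        with torch.no_grad():
            cond = self.vlm(images=images, text=text_tokens)   # multimodal features
            img_latents = self.vae.encode(images)              # visual condition latents
        # The DiT denoises the future-flow latents given the projected conditions.
        return self.dit(noisy_flow_latents, timestep,
                        context=self.proj(cond), image_latents=img_latents)
```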


Training on noisy web videos with relative optical flow

The core model is trained on web-scale human activity videos with paired captions. The research team uses the Something-Something V2 dataset and the EgoDex egocentric manipulation dataset to obtain around 500,000 video-caption pairs.

Training uses an end-to-end flow matching objective in latent space. Future optical flow sequences are first computed offline, then encoded by the VAE and used as targets in a flow matching diffusion loss for the DiT. During training the method also applies classifier-free guidance on both text and visual conditions and masks some frames and viewpoints to improve robustness.
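To make the objective concrete, here is a minimal rectified-flow-style flow matching loss in latent space with random condition dropout for classifier-free guidance; the noise schedule, dropout rate, and the DiT call signature are assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(dit, flow_latents, cond, cfg_drop_prob: float = 0.1):
    """Sketch of a latent flow matching loss for the trainable DiT."""
    b = flow_latents.shape[0]
    noise = torch.randn_like(flow_latents)
    # Sample a time in [0, 1] per example and broadcast over the latent dims.
    t = torch.rand(b, device=flow_latents.device).view(b, *([1] * (flow_latents.dim() - 1)))
    x_t = (1 - t) * flow_latents + t * noise   # linear path between data and noise
    target = noise - flow_latents              # velocity the model should predict
    # Classifier-free guidance training: randomly drop the conditioning.
    if torch.rand(()).item() < cfg_drop_prob:
        cond = torch.zeros_like(cond)
    pred = dit(x_t, t.flatten(), context=cond)
    return F.mse_loss(pred, target)
```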

A key contribution is the relative optical flow calculation used to build clean training targets from noisy egocentric videos. For each frame pair the method (see the sketch after this list):

    1. Computes dense optical flow with an off-the-shelf estimator.
    2. Estimates camera motion via homography using deep features.
    3. Uses projective geometry to subtract camera motion and obtain object-centric relative flow vectors.
    4. Filters frame pairs by selecting those where the top k percent of flow magnitudes exceed a threshold, which focuses training on segments with meaningful motion.

These steps are run offline at lower resolution for efficiency, then recomputed at original resolution for the final targets. The ablation study shows that static frame targets or raw flow without camera motion removal hurt downstream performance, while disentangled relative flow targets give the best results.
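The sketch below walks through the four steps with OpenCV. The paper uses an off-the-shelf deep flow estimator and deep features for the homography; for brevity this sketch substitutes Farneback flow and ORB features, and the threshold values are illustrative.

```python
import numpy as np
import cv2

def relative_flow(frame1_gray, frame2_gray, topk_pct: float = 10.0, mag_thresh: float = 1.0):
    """Sketch of relative-flow target construction for one grayscale frame pair."""
    h, w = frame1_gray.shape
    # 1. Dense optical flow between consecutive frames.
    flow = cv2.calcOpticalFlowFarneback(frame1_gray, frame2_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # 2. Camera motion via a homography fitted to sparse feature matches.
    orb = cv2.ORB_create()
    k1, d1 = orb.detectAndCompute(frame1_gray, None)
    k2, d2 = orb.detectAndCompute(frame2_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    src = np.float32([k1[m.queryIdx].pt for m in matches])
    dst = np.float32([k2[m.trainIdx].pt for m in matches])
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    # 3. Subtract the camera-induced flow to get object-centric relative flow.
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    grid = np.stack([xs, ys], axis=-1).reshape(-1, 1, 2)
    cam_flow = (cv2.perspectiveTransform(grid, H) - grid).reshape(h, w, 2)
    rel_flow = flow - cam_flow
    # 4. Keep the pair only if the top-k percent of magnitudes exceed a threshold.
    mag = np.linalg.norm(rel_flow, axis=-1)
    keep = np.percentile(mag, 100.0 - topk_pct) > mag_thresh
    return rel_flow, keep
```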


Language-driven robot manipulation

The first downstream use case is robot control. FOFPred is finetuned on robot video-caption data to predict future optical flow from both fixed and wrist-mounted cameras. On top of FOFPred, the research team attaches a diffusion policy network that takes predicted flow, text and robot state, and outputs continuous actions. This setup follows prior diffusion policy work but uses future optical flow instead of predicted RGB frames as the core representation.
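A minimal sketch of how such a flow-conditioned diffusion policy head could be trained is shown below. The `denoiser` network, the toy linear noise schedule, the feature encoders, and the action-chunk layout are all illustrative assumptions, not the paper's policy architecture.

```python
import torch
import torch.nn.functional as F

def diffusion_policy_loss(denoiser, actions, flow_feat, text_feat, robot_state,
                          n_steps: int = 100):
    """Sketch of training a diffusion-policy head conditioned on FOFPred's
    predicted flow plus text and robot state.
    actions: (B, T, action_dim) ground-truth action chunk."""
    b = actions.shape[0]
    cond = torch.cat([flow_feat, text_feat, robot_state], dim=-1)
    t = torch.randint(0, n_steps, (b,), device=actions.device)
    alpha_bar = (1.0 - t.float() / n_steps).view(b, 1, 1)   # toy noise schedule
    noise = torch.randn_like(actions)
    noisy_actions = alpha_bar.sqrt() * actions + (1 - alpha_bar).sqrt() * noise
    pred_noise = denoiser(noisy_actions, t, cond)            # predict the added noise
    return F.mse_loss(pred_noise, noise)
```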

On the CALVIN ABCD benchmark, which evaluates long-horizon zero-shot chains of 5 language-specified manipulation tasks, FOFPred reaches an average chain length of 4.48. VPP reaches 4.33 and DreamVLA reaches 4.44 under the same protocol. FOFPred also attains a Task 5 success rate of 78.7 percent, the best among reported methods. In a low-data setting with 10 percent of CALVIN demonstrations, FOFPred still reaches 3.43 average length, higher than VPP's 3.25.

On RoboTwin 2.0, a dual-arm manipulation benchmark with 5 tasks that require both arms, FOFPred attains an average success rate of 68.6 percent. The VPP baseline reaches 61.8 percent under identical training settings. FOFPred improves success on every task in the subset.


Motion-aware text-to-video generation

The second downstream task is motion control in text-to-video generation. The research team builds a two-stage pipeline by connecting FOFPred with the Go-with-the-Flow video diffusion model. FOFPred takes an initial frame and a language description of motion, predicts a sequence of future flow frames, and interpolates them into a dense motion field. Go-with-the-Flow then uses this motion field and the initial frame to synthesize the final video, enforcing the described motion pattern.
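As a small illustration of the interpolation step, the sketch below upsamples the 4 predicted flow frames into a per-frame motion field for the video model. The interpolation mode and the way Go-with-the-Flow actually consumes the motion field are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def upsample_flow_sequence(flow_frames: torch.Tensor, n_video_frames: int) -> torch.Tensor:
    """Interpolate 4 predicted flow frames (B, 4, 2, H, W) into a dense
    per-frame motion field (B, n_video_frames, 2, H, W)."""
    b, t, c, h, w = flow_frames.shape
    x = flow_frames.permute(0, 2, 1, 3, 4)                  # (B, 2, T, H, W)
    x = F.interpolate(x, size=(n_video_frames, h, w),
                      mode="trilinear", align_corners=True)  # upsample along time
    return x.permute(0, 2, 1, 3, 4)                          # (B, T_video, 2, H, W)
```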

On the motion-heavy Something-Something V2 benchmark, the FOFPred plus Go-with-the-Flow pipeline improves over the CogVideoX baseline under identical conditions. The method reaches SSIM 68.4, PSNR 22.26, LPIPS 28.5, FVD 75.39, KVD 11.38, and motion fidelity 0.662, consistently better than CogVideoX. Importantly, FOFPred uses only language and a single frame at inference, whereas several controllable video baselines require hand or object masks or trajectories as extra inputs.


Key Takeaways

    1. FOFPred reframes motion prediction as language-driven future optical flow, predicting 4 dense optical flow frames from several current images and a text instruction, which provides a compact, motion-only representation for downstream tasks.
    2. The model uses a unified VLM-Diffusion backbone, with Qwen2.5-VL as a frozen vision-language encoder, Flux.1-VAE as a frozen latent encoder for images and flow, and an OmniGen-style DiT as the only trained component, with spatio-temporal RoPE-based attention.
    3. Training relies on large-scale web and egocentric video from Something-Something-V2 and EgoDex, and builds relative optical flow targets by estimating ego-motion via homography, subtracting camera flow and filtering for high-motion segments, which significantly improves downstream performance.
    4. In robot manipulation, FOFPred acts as a motion backbone for a diffusion policy head and achieves state-of-the-art or better results on CALVIN ABCD and RoboTwin 2.0, including 4.48 average task chain length on CALVIN and 68.6 percent average success on RoboTwin, outperforming VPP and DreamVLA variants.
    5. For text-to-video generation, connecting FOFPred to Go-with-the-Flow yields better SSv2 metrics than CogVideoX, with higher SSIM and PSNR, lower FVD and KVD, and improved motion fidelity, while requiring only language and a single frame at inference, making FOFPred a reusable motion controller for both robotics and video synthesis pipelines.

Check out the Paper, Model and Repo.


Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.



