Tag: multimodal

25 articles available

Source | Jul 10, 2026

Crossmodal Speech Emotion Analysis With Audio And Generated Transcripts

Why combining audio with generated multilingual transcripts matters for speech emotion analysis, and where errors and cost tradeoffs remain.

agi

Source | Jul 8, 2026

Radiology AI for Draft Reporting in Clinical Workflow

Examines Harrison.Rad 1.5 as a radiology draft-reporting model, focusing on workflow value, supervision, and deployment risks.

llm

Source | Jul 4, 2026

Five-Modal MKGR for Cold-Start PPI Prediction

MKGR combines one sequence modality and four knowledge graphs to improve cold-start PPI prediction over prior baselines.

hardware

Source | Jul 3, 2026

OCB Tests Native Office Understanding Beyond PDF QA

OCB evaluates native Office file understanding, revealing document AI limits beyond PDF-based QA.

llm

Source | Jun 29, 2026

MMG-Pop Rethinks Social Popularity Prediction Across Platforms

MMG-Pop uses multimodal and temporal graph signals from Bluesky and Reddit to reassess social popularity prediction.

agi

Source | Jun 26, 2026

Rethinking Trust in Video Reasoning Under Visual Corruption

Examines the Blind Trust Problem in video reasoning and a reliability-based strategy for frame and tool selection.

hardware

Source | Jun 24, 2026

CineCap And The Challenge Of Cinematic Video Captioning

CineCap targets cinematic video captioning, focusing on camera motion, shot size, angle, and structured scene reasoning.

llm

Source | Jun 24, 2026

IV-CoT Separates Structure Planning From Visual Rendering

IV-CoT targets structural prompt fidelity in text-to-image generation by separating layout planning from appearance rendering.

hardware

Source | Jun 23, 2026

Safety-Aware Evaluation for LLM Driver Intervention Messages

Why LLM driver intervention messages should be judged by risk alignment, urgency, and actionability, not text similarity alone.

hardware

Source | Jun 18, 2026

See First, Answer Later in Multimodal LLM Alignment

Discusses a paper on pre-aligning multimodal LLMs so that they gather sufficient visual evidence before answering.

llm

Source | Jun 12, 2026

CAPED Reduces Privacy Exposure in Mobile GUI Agents

CAPED filters mobile screenshots before remote agents see them, reducing incidental privacy exposure while preserving task utility.

agi

Community | Jun 12, 2026

Choosing Between Subtitle and Vision Video Summarization

A practical guide to choosing subtitle-only or multimodal frame analysis for video summary apps, with tradeoffs in quality, cost, latency, and evaluation.

hardware

Source | Jun 4, 2026

Structure-Aware Retrieval Matters for Enterprise Document RAG

In enterprise document RAG, retrieval granularity often matters more than reasoning; this piece explains why structure-aware search helps.

llm

Source | May 28, 2026

MOV-Bench Reveals Gaps in Multi-Hop Video Reasoning

MOV-Bench highlights evaluation gaps in multi-hop audio-visual reasoning and shows consistent gains from agentic search.

hardware

Source | May 20, 2026

Multi-Image Jailbreaks Expose Multimodal LLM Safety Gaps

Multi-image prompts can bypass single-image filters, exposing structural safety gaps in multimodal LLM defenses.

agi

Source | Mar 20, 2026

Speaker Diarization Expands to Film and TV

Speaker diarization is moving from meetings to film and TV, where off-screen speech, noise, and subtitle drift matter.

hardware

Community | Mar 11, 2026

When Prompts Shrink, Video Creation Becomes Pipeline Operations

As prompts shrink, video work shifts from generating to operating: lock identity with references, storyboard panel prompts, set multimodal priority rules, and track rights risk.

agi

Source | Mar 7, 2026

Multimodal Clinical Reasoning Needs Controlled Evaluation, Not Scores

In multimodal clinical reasoning, reported gains don’t guarantee safety; prioritize controlled evaluation, grounding, and auditable failure modes.

agi

Community | Mar 4, 2026

When Image Preprocessing Breaks Multimodal Geolocation Reliability

Resizing, tiling, and tokenization can shift what models see, turning map/geography misreads into repeatable product risk.

llm

Community | Feb 2, 2026

How to Resolve Multimodal Feature Access Errors for Subscribers

Analyzes permission sync errors that limit multimodal features for paid users, with practical fixes such as session renewal.

llm

Community | Jan 22, 2026

ChitChats Tool Enhances Multimodal Interaction With GPT-5.2-Codex Support

ChitChats leverages GPT-5.2-Codex to provide multimodal character interactions with real-time streaming and large-scale image processing.

adobe acrobat

News | Jan 21, 2026

Adobe Acrobat AI Transforms PDFs Into Presentations and Podcasts

Adobe Acrobat AI converts PDF documents into presentations and podcasts with an accurate source attribution engine.

gemma 3

Trusted | Jan 16, 2026

Google Unveils Gemma 3 for Efficient On-Device Multimodal AI

Gemma 3 delivers high-speed multimodal inference on local devices with 128K context window and efficient MatFormer architecture.

medgemma

Trusted | Jan 16, 2026

Google MedGemma: Empowering Healthcare With Open Source Multimodal AI Models

Google unveils MedGemma, an open-source medical AI offering high performance and local deployment for data sovereignty.