ChitChats Tool Enhances Multimodal Interaction With GPT-5.2-Codex Support

TL;DR

Processes up to 500 images and a 50MB payload per request through multimodal support based on GPT-5.2-Codex.
Improves user experience and reduces latency by applying real-time token-level streaming technology.
Increases developer accessibility through an open-source repository that allows for local runtime configuration and agent parameter adjustments.

Example: A user sends a hand-drawn clothing design to the chat window. The character examines the drawing and offers opinions on the fabric feel or color harmony, continuing the story while reflecting the creative intent in the conversation.

A shift is occurring where visual information is being combined with text-centric character dialogue. Artificial Intelligence (AI) now identifies the atmosphere of screenshots sent by users or suggests revisions for design drafts. A multimodal environment that simultaneously processes different types of data, such as text and images, is becoming the standard for character interaction.

The ChitChats tool enhances the immersion of character dialogue by integrating image recognition capabilities into the OpenAI Codex environment. The focus of this technical integration lies in the ability to process large-scale image data and optimize response speed through real-time token-level streaming. By overcoming the visual perception limitations of existing text-oriented agents, users can experience multidimensional conversations.

Current Status

OpenAI Codex (based on GPT-5.2-Codex) officially supports multimodal data processing. Developers can input images in PNG and JPEG formats via API and CLI. Inputted images are analyzed as context alongside prompts, rather than being treated as simple attachments. According to technical specifications, up to 500 individual images can be inputted per request, with the total payload size limited to 50MB.

The ChitChats tool utilizes real-time streaming technology to leverage Codex's performance. Instead of waiting for a complete response, generated tokens are immediately outputted to the screen. Compared to other tools such as Claude Code, this factor reduces perceived latency. Currently, this technology is being distributed as an open-source repository, allowing users to configure runtimes directly in local environments and adjust agent parameters.

Analysis

Integrating visual information into character dialogue serves as a mechanism to strengthen the character's persona. When a character reviews and reflects photos or screenshots shared by the user in the conversation, the user feels a bond, sensing that the AI understands the situation. In particular, real-time streaming technology maintains the flow of conversation, preventing a loss of immersion caused by mechanical delays.

However, technical limitations and points for review exist. The cost of processing high-resolution images, which can reach up to 2,805 tokens per image in high-detail mode, can be an operational burden in large-scale dialogue sessions. Furthermore, while there are descriptions stating that the ChitChats tool provides faster response speeds than Claude Code, there is a lack of quantitative comparative data to prove specifically reduced times. It should be considered that streaming efficiency may vary depending on actual network environments or local computational resources.

The 50MB payload limit could potentially cause bottlenecks in professional design collaboration scenarios where high-quality images or large volumes of screenshots should be processed simultaneously. For multimodal agents to become practical assistants, additional verification is required regarding not only visual processing speed but also whether they can hierarchically understand complex text or layouts within images.

Practical Application

Developers and users can move beyond text commands to utilize visual assets. By leveraging ChitChats' open-source structure, one can build a local runtime to create personalized multimodal characters. Web publishers, for example, can input screenshots of layout errors encountered during coding into the Codex CLI to receive suggestions for fixes.

To-Do Today:

Check access permissions and token usage for the GPT-5.2-Codex model in the OpenAI API dashboard.
Clone the ChitChats open-source repository to install the runtime in a local environment and test image input features.
Check the consistency of responses by adding instructions to the character agent's system prompt to utilize image analysis results in conversation.

FAQ

Q: What are the image file formats and size limits supported by the Codex API? A: It supports PNG and JPEG formats, and the total payload size per request should be within 50MB. Up to 500 individual images can be included in a single request.

Q: How is the token cost calculated for high-resolution image input? A: In low-detail settings, 85 tokens are consumed per image. In high-detail settings, images are adjusted to within 2048x2048 pixels and calculated in patch units, with a budget of up to 1,536 tokens per image.

Q: What advantages does ChitChats' streaming technology have compared to Claude Code? A: Real-time token-level streaming allows for immediate confirmation of text and visual feedback without having to wait for the entire response, providing an advantage in terms of conversational continuity. However, specific speed differences may vary depending on the usage environment.

Conclusion

The combination of OpenAI Codex and ChitChats has expanded character dialogue into the domain of visual information. The 50MB payload and real-time streaming technology are indicators of the practicality of multimodal interaction. Future challenges include how precisely these technologies demonstrate visual understanding in actual user experiences and whether they create enough value to offset token costs. Conversations with characters have now entered the stage of "seeing" beyond "reading."

References

🛡️ Images and vision | OpenAI API
🛡️ Codex CLI features - OpenAI for developers

Aionda