JPO’s AI Update Raises the Bar for MLLM Patent Filings
The market reaction to the Japan Patent Office’s mid-2026 revision of its AI-related invention guidance should not stop at the headline that multimodal large language models (MLLMs) and cross-modal generative AI are now being addressed more explicitly. The practical shift is deeper. For applicants, the revision changes how the specification should be written, where the technical contribution needs to be located, and how inventive-step and description arguments are likely to be tested in prosecution.
The direction is not surprising. The JPO had already added 10 new AI-related examination cases in 2024, and its March 2026 report on AI-related invention trends widened the lens to generative AI, multimodal AI and prompt engineering. Read together, the message is fairly clear: Japanese examination practice is moving away from being impressed by the mere presence of a large model and toward asking what concrete technical problem the cross-modal system actually solves.
The tighter point is not whether AI is present, but where the technical effect really sits
For years, many AI filings have leaned too heavily on a familiar structure: input data, model processing, output result. Once the application moves into MLLM territory, that formula starts to look thin. A system that combines text, image, audio, video or sensor data does not become more patentable simply because more modalities are involved. What matters is whether the applicant can show that the architecture improved something technical in a way that can be explained and, ideally, measured.
That is where prosecution pressure is likely to build. Examiners are more likely to ask which part of the workflow reduced latency, improved matching accuracy, lowered false triggers, stabilised device-side execution, improved retrieval precision or made downstream control more reliable. In other words, “we used a powerful multimodal model” will not carry much weight by itself. “This cross-modal arrangement produced a concrete technical improvement in a system, device or processing chain” is much closer to the answer applicants will need.
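As a rough illustration of what "measured" could mean in practice, the sketch below compares a baseline run with a cross-modal run on two of the system-level metrics named above, false triggers and latency. Everything in it is an assumption made for illustration: the function names, the dictionary layout and, above all, the numbers, which are placeholders and not results from any real system or filing.

```python
from statistics import median


def false_trigger_rate(fired, should_fire):
    """Share of cases that should NOT fire in which the system fired anyway."""
    negatives = [f for f, s in zip(fired, should_fire) if not s]
    if not negatives:
        return 0.0
    return sum(negatives) / len(negatives)


def summarize(run, should_fire):
    """Collect system-level metrics for one configuration (hypothetical layout)."""
    return {
        "false_trigger_rate": false_trigger_rate(run["fired"], should_fire),
        "median_latency_ms": median(run["latency_ms"]),
    }


# Placeholder ground truth and observations, invented purely for illustration.
should_fire = [True, False, False, True, False, False]
baseline = {
    "fired": [True, True, False, True, True, False],
    "latency_ms": [120, 135, 128, 140, 131, 126],
}
cross_modal = {
    "fired": [True, False, False, True, True, False],
    "latency_ms": [98, 105, 101, 110, 99, 103],
}

baseline_summary = summarize(baseline, should_fire)
cross_modal_summary = summarize(cross_modal, should_fire)
print(baseline_summary)
print(cross_modal_summary)
```

The point of a comparison like this is that it ties the alleged effect to system behaviour (triggers and response time) rather than to the perceived quality of the generated output.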
What weakens many MLLM applications is not the model label, but an under-described chain of operations
A growing number of filings read well on the surface because they use the right vocabulary: multimodal large models, vision-language models, retrieval-augmented generation, agentic flows. But many still leave the core chain of operations underspecified. How are different modalities aligned? How is prompting structured? How do retrieval and generation interact? What constraints are applied before output is sent to a device, workflow or user-facing action? Where does feedback enter the loop? Those are no longer drafting ornaments. They are where inventive-step and description support can start to wobble.
The JPO’s 2024 case additions already showed a sharper interest in inventive step, description requirements and patent eligibility for AI-related technologies. In an MLLM setting, that scrutiny is likely to become more exacting, not less. If a claim simply lists text, image and audio handling together, but does not explain why the modalities work cooperatively rather than being merely juxtaposed, the application risks being treated as a straightforward extension of known model capabilities into an existing workflow.
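The distinction between modalities that cooperate and modalities that are merely juxtaposed can be made concrete with a small sketch. It is one possible reading of the distinction, not the JPO's test and not any applicant's method: the class, both functions, the thresholds and the sample scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModalityEvidence:
    """Confidence score in [0, 1] from one modality-specific encoder (hypothetical)."""
    modality: str
    score: float


def juxtaposed_trigger(evidence, threshold=0.8):
    """Modalities sit side by side: any single score can fire the action."""
    return any(e.score >= threshold for e in evidence)


def cooperative_trigger(evidence, threshold=0.8, floor=0.5):
    """Modalities constrain each other: the mean score must clear the
    threshold and no single modality may fall below the floor."""
    if not evidence:
        return False
    scores = [e.score for e in evidence]
    return min(scores) >= floor and sum(scores) / len(scores) >= threshold


# Illustrative scores only: the audio encoder is confident, the image encoder is not.
sample = [ModalityEvidence("audio", 0.92), ModalityEvidence("image", 0.31)]
print(juxtaposed_trigger(sample))   # True: audio alone fires the action
print(cooperative_trigger(sample))  # False: the weak image score vetoes it
```

In the second function the image score changes what the audio score is allowed to do, which is the kind of interaction a specification would need to describe if it wants the combination to read as more than known capabilities placed next to each other.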
Cross-modal generative AI will push disclosure costs higher
This is one of the most practical consequences of the revision. Single-modality applications can sometimes survive with a relatively conventional software-process narrative. Cross-modal generative systems are harder to support that way. Applicants will increasingly need to say more about input sources, alignment logic, embedding structures, constraint handling, correction steps, human feedback, deployment conditions and resource boundaries. If they do not, broad claims may struggle on support and enablement; narrow claims may leave too much commercial value behind.
There is also a familiar but underestimated tension here. Many companies place the real know-how in prompt structures, data cleaning, post-processing rules, tool-calling order and safety-threshold control, yet file only a high-level schematic description. That trade-off becomes more expensive in MLLM cases because practical implementation often depends on exactly those middle-layer details. Hold too much back and prosecution becomes vulnerable. Reveal too much and applicants worry about exposing operational know-how. That means filing strategy in Japan is likely to require more segmentation and more discipline than a one-template-fits-all AI approach can offer.
Applicants should rethink claim layering and prosecution timing in Japan now
The real filing question is no longer just whether an MLLM-based invention can be claimed in Japan. It is how to split the invention sensibly. A single application that tries to capture model-level capabilities, vertical use cases, terminal-side control logic and data-processing mechanisms all at once can become difficult to defend later. In many cases, a more durable strategy is to separate the core inference or alignment mechanism, the industry-specific implementation and the hardware- or control-linked technical execution into different filing layers. That leaves more room for prosecution and divisional choices without surrendering the technical backbone too early.
Timing matters as well. Applicants would be wise not to wait for a first office action before thinking seriously about measurable effects, fallback embodiments and comparative support. MLLM and cross-modal generative AI cases are especially likely to attract questions such as: why is this not an ordinary optimisation, why is the description not sufficiently supportive, and why does the alleged effect stop at output quality rather than system performance? The signal from this JPO update is not that Japan is closing the door on MLLM filings. It is that model names and workflow diagrams, on their own, are unlikely to be enough any longer.



