Rethinking AI Architectures, Multimodality, and Product-Driven Research
- Jayant Upadhyaya
Artificial intelligence has made rapid progress over the last decade. Large language models, transformers, and massive datasets have reshaped how machines process information. However, many researchers believe that current approaches still fall short of building systems that resemble human intelligence in efficiency, adaptability, and long-term reasoning.
One company working on these deeper challenges is Cartesia, led by CEO Karan Goel. Founded by former Stanford researchers, the company focuses on new AI architectures, multimodal learning, and real-world products, especially in voice-based applications.
This article explores the ideas behind Cartesia’s work, including why architecture research still matters, how multimodal intelligence should be understood, why audio is a powerful starting point, and how research and product development must work together in AI startups.
Why Architecture Research Still Matters in AI
Over the past ten years, much of AI progress has followed a clear pattern. Researchers discovered strong model designs, such as transformers and self-attention, and then focused on scaling them. Larger datasets, more computing power, and better engineering produced better results.
This approach has been extremely successful. It led directly to the rise of large language models and the modern AI industry. However, it also created a sense that the core problems were already solved and that future gains would come mostly from scale.
Architecture research asks a different question. Instead of asking how big a model should be, it asks what kind of model should exist in the first place.
During their time as graduate students around 2019 and 2020, the founders of Cartesia began questioning whether transformers alone could lead to systems closer to human intelligence. They looked ahead and asked what challenges would remain even if these models were scaled to their limits.
This question matters because human intelligence is very different from current AI. Humans are efficient. They can learn from limited data, combine information across senses, interact with the world, and operate over long time periods. Even an average human can handle complex tasks that require memory, context, and action.
The belief behind Cartesia’s research is that existing architectures may not be enough to reach this level of intelligence, no matter how much they are scaled.
Limits of the Transformer Paradigm
Transformers are powerful, but they have structural limits. At their core, they are context-based systems. They work by attending to large amounts of information stored in memory and retrieving what seems relevant.
This makes them very good at recall. They can answer questions based on facts present in their context. However, this also makes them inefficient in some ways. They rely heavily on raw information rather than building compressed, abstract representations of the world.
Human intelligence works differently. Humans do not store every detail in raw form. Instead, they compress information into concepts and abstractions. These abstractions allow reasoning over long periods and across many situations.
Transformers struggle with this type of compression. Their context windows act more like large storage buffers than long-term memory systems. This creates limits in how they scale to long time horizons, rich interaction, and complex multimodal understanding.
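To make the retrieval picture concrete, here is a minimal NumPy sketch of single-head dot-product attention. The dimensions, data, and names are all illustrative, not any particular model's code.

```python
import numpy as np

def attention(query, keys, values):
    """One query attends over a buffer of stored keys/values.

    The context window behaves like a storage buffer: every past item
    is kept in raw form, and the output is a similarity-weighted
    average of everything stored.
    """
    scores = keys @ query / np.sqrt(query.shape[-1])  # relevance of each stored item
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                          # softmax over the buffer
    return weights @ values                           # weighted retrieval

# Toy context of 5 stored items, each a 4-dim vector.
rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 4))
values = rng.normal(size=(5, 4))
query = rng.normal(size=(4,))
print(attention(query, keys, values))
```

Notice that the buffer of keys and values grows linearly with context length; nothing in this mechanism compresses the past.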
Recognizing these limits led Cartesia’s founders to explore alternative architectures.
State-Space and Recurrent Models as an Alternative
One research direction explored by the team involves state-space models, a class of recurrent architectures designed to handle information over time more efficiently. These models aim to process sequences while maintaining a compact internal state.
Recurrent models naturally support compression. Instead of holding all information explicitly, they update a hidden state that summarizes what has been seen so far. This creates a fuzzier but more abstract representation of the world.
There is a tradeoff here. Compression reduces detail, but it also creates structure. By losing some fidelity, models gain the ability to reason at a higher level.
Transformers and state-space models sit at opposite ends of a spectrum. Transformers prioritize exact recall and retrieval. Recurrent models prioritize abstraction and long-term structure.
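As a rough illustration of the contrast, here is a toy linear state-space recurrence in NumPy. The matrices and dimensions are made up for the example; real state-space models (S4-style architectures, for instance) add learned structure on top of this skeleton.

```python
import numpy as np

def ssm_scan(A, B, C, xs):
    """Linear state-space recurrence: the entire history is summarized
    in a fixed-size hidden state h, updated once per input.

        h_t = A @ h_{t-1} + B @ x_t
        y_t = C @ h_t

    Memory cost is O(state_dim), independent of sequence length --
    a lossy but compact summary, unlike an attention buffer.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:
        h = A @ h + B @ x
        ys.append(C @ h)
    return np.stack(ys)

rng = np.random.default_rng(0)
state_dim, in_dim = 8, 4
A = 0.9 * np.eye(state_dim)          # stable toy dynamics
B = rng.normal(size=(state_dim, in_dim))
C = rng.normal(size=(in_dim, state_dim))
xs = rng.normal(size=(100, in_dim))  # a length-100 sequence
print(ssm_scan(A, B, C, xs).shape)   # (100, 4): state stayed 8-dim throughout
```

The fixed-size state is exactly the compression described above: cheap and abstract, but lossy.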
Modern research increasingly explores hybrid architectures that combine these strengths. These hybrids aim to balance precise recall with compressed reasoning, offering a middle ground between raw memory and abstraction.
The long-term question is not which existing model is best, but what the ultimate architecture should look like for multimodal, interactive, long-lived intelligence.
Compression as a Core Idea of Intelligence
One central idea guiding this research is compression. Compression is not just a technical trick. It is a fundamental part of intelligence.
To reason over large amounts of information, a system must reduce complexity. It must group details into concepts. It must link different forms of information into shared meaning.
Consider a simple object like a cup. A human understands what a cup is in text, in images, in physical interaction, and in sound. All these representations are different, but they are unified into a single concept.
This unification requires compression. The system must build an internal representation that captures meaning without storing every detail.
AI systems that cannot compress in this way remain limited. They may retrieve information accurately, but they struggle to generalize, interact, and adapt over long periods.
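One way to picture this unification, purely as a sketch: separate encoders project each modality into a single low-dimensional concept space, and the small shared dimension is the compression bottleneck. The weights below are random stand-ins; in practice the encoders would be trained (for example, contrastively) so that matching concepts actually align.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality encoders: each maps its native feature
# space into one shared, low-dimensional "concept" space. The small
# shared dimension is the compression step: detail is lost,
# structure is kept.
W_text  = rng.normal(size=(16, 300))   # 300-dim text features   -> 16-dim concept
W_audio = rng.normal(size=(16, 1024))  # 1024-dim audio features -> 16-dim concept

def embed(W, features):
    z = W @ features
    return z / np.linalg.norm(z)       # unit-normalize for cosine comparison

text_cup  = embed(W_text,  rng.normal(size=300))
audio_cup = embed(W_audio, rng.normal(size=1024))

# With trained, aligned encoders, two views of the same concept would
# score near 1.0; these random weights only show the shape of the
# computation.
print(float(text_cup @ audio_cup))
```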
Understanding Multimodality Beyond Images and Video
Multimodality is often misunderstood. Many people think of it only as combining text with images or video. While those are important, multimodality is broader.
At its core, multimodality means learning from both signals and symbols. Signals are continuous inputs, such as audio waves or video frames. Symbols are discrete representations, such as text or tokens.
Even speech transcription is a multimodal task. It involves mapping a continuous audio signal to discrete text symbols. This process requires alignment, abstraction, and representation learning.
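The signal/symbol distinction is easy to see in code. In this illustrative snippet, a waveform is a dense array of continuous samples, while text is a short sequence of ids from a finite vocabulary; transcription has to bridge the two.

```python
import numpy as np

# A signal: continuous-valued samples on a time axis
# (one second of a 440 Hz tone at 16 kHz).
sr = 16_000
t = np.arange(sr) / sr
waveform = np.sin(2 * np.pi * 440 * t)  # float values in [-1, 1]

# Symbols: discrete ids drawn from a finite vocabulary.
vocab = {"<pad>": 0, "h": 1, "i": 2}
text_tokens = [vocab["h"], vocab["i"]]  # "hi" as integer tokens

# Transcription maps ~16,000 real numbers per second onto a handful
# of discrete symbols -- alignment and abstraction, not lookup.
print(waveform.dtype, waveform.shape, text_tokens)
```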
When two modalities are involved, many problems arise:
- Generating one modality from another
- Aligning parts of one modality with parts of another
- Understanding meaning beyond direct translation
These challenges make multimodal learning a rich and difficult area of research.
Why Audio Is a Strategic Starting Point
Cartesia chose to focus on audio and text not because the pairing is simple, but because it is grounded. Audio is a real-world signal that interacts directly with humans. It involves time, rhythm, emotion, and context.
Audio-text problems capture many of the same challenges found in other multimodal domains, such as video, robotics, and physical interaction. If a model can learn to handle audio properly, many of the same techniques can transfer elsewhere.
Audio also allows focused progress. Multimodal intelligence is a huge space. Trying to solve everything at once leads to shallow solutions. By focusing on one modality pair, research can go deeper.
The belief is that the right approach to audio-text modeling will generalize. The same principles can later apply to video, robotics, and other signal-based domains.
Tokenization and the Problem of Representation
A major challenge in multimodal AI is how to represent signals. Most systems rely on tokenization. Signals are converted into discrete tokens before being processed by models.
This approach has limits. Tokenization often involves hand-engineered steps that reduce flexibility. Important information may be lost before the model even begins learning.
Cartesia’s research explores a different idea. Instead of forcing signals into predefined tokens, models should learn representations internally. The goal is to remove rigid token boundaries and allow end-to-end learning of abstraction.
In this view, the model learns how to represent audio, video, or other signals in a way that best supports reasoning and prediction. This makes the system more adaptable and transferable.
If done correctly, this approach could work across many domains. Audio, video, and robotics all involve signals that must be understood over time.
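A hedged sketch of the contrast: part (a) discretizes a waveform with a fixed, hand-chosen grid of 256 levels, while part (b) stands in for a learned encoder whose projection would be trained end to end. All shapes and names here are illustrative assumptions, not Cartesia's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
waveform = rng.normal(size=16_000).astype(np.float32)  # toy 1 s of audio

# (a) Hand-engineered tokenization: quantize samples into a fixed
# 256-entry codebook. The grid is chosen up front, so any detail
# finer than the grid is discarded before a model ever sees the data.
tokens = np.clip(((waveform + 1) / 2 * 255).astype(int), 0, 255)

# (b) Learned representation: a strided linear encoder whose weights
# are trainable, so the model decides end to end what to keep.
# (Illustrative shapes; a real encoder would be deeper.)
frame, hop, latent_dim = 400, 160, 64
W = rng.normal(size=(latent_dim, frame)) * 0.01        # trainable in practice
frames = np.stack([waveform[i:i + frame]
                   for i in range(0, len(waveform) - frame + 1, hop)])
latents = frames @ W.T                                  # continuous, no fixed grid

print(tokens[:8])      # discrete ids, fixed up front
print(latents.shape)   # (98, 64): continuous vectors, learned jointly
```

The design difference is where information loss happens: in (a) it is decided by the engineer before training; in (b) it is decided by the model during training.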
Building AI for Long-Term Interaction
Another key motivation behind this research is interaction over long time scales. Many AI systems are designed for short tasks. They answer questions or complete single actions.
Human intelligence is different. Humans can be onboarded into a role and improve over years. They remember context, adapt to new situations, and interact continuously with others.
Cartesia uses the example of an AI call center agent. This is not a simple task. The agent must understand speech, respond appropriately, learn from experience, and handle many different users over time.
This kind of system requires:
- Long context handling
- Consistent behavior
- Ability to improve gradually
- Rich interaction with humans
These requirements expose weaknesses in current architectures and motivate new approaches.
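The shape of such a system can be sketched in a few lines. Everything below (the class name, fields, and the crude truncation step) is a hypothetical stand-in for what would, in a real agent, be a learned compression of the interaction history.

```python
from dataclasses import dataclass

@dataclass
class LongLivedAgent:
    """Illustrative shape of a long-running agent: the raw transcript
    is not kept; each turn is folded into a bounded summary state,
    mirroring the compression idea above. All names are hypothetical."""
    summary: str = ""
    turns_seen: int = 0
    max_summary_chars: int = 500  # bounded, regardless of conversation length

    def observe(self, user_utterance: str) -> None:
        self.turns_seen += 1
        # A real system would use a learned compressor; here we just
        # append and truncate to keep the state size fixed.
        self.summary = (self.summary + " | " + user_utterance)[-self.max_summary_chars:]

    def respond(self) -> str:
        # The response is conditioned on the compact state, not raw history.
        return f"[turn {self.turns_seen}] acting on state: {self.summary[-60:]!r}"

agent = LongLivedAgent()
for utterance in ["my order is late", "order #1234", "ship to the new address"]:
    agent.observe(utterance)
    print(agent.respond())
```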
Intelligence Is Not Just High IQ
There is a common belief that intelligence is about solving math or physics problems. While those are impressive skills, they are not the full picture.
Most real-world intelligence involves dealing with people, systems, and context. It involves communication, coordination, and decision-making under uncertainty.
AI systems today are often strong in narrow tasks but weak in general interaction. This is not due to a lack of data or compute alone; it also reflects architectural and training limitations. Multimodality and interaction are key missing pieces.
Research Versus Product Development
Running a research-driven company is very different from doing academic research. In academia, many ideas can coexist. Different researchers pursue different goals. Exploration is encouraged.
In a startup, focus is essential. There is usually room for only one core vision. Random exploration can be expensive and distracting.
The challenge is to balance exploration with execution. Teams must feel free to think creatively, but they must also work toward a clear product goal.
Products bring discipline. Customers do not care about architectures. They care about results. If a method does not improve the user experience, it does not matter how interesting it is.
This creates intellectual honesty. Ideas must be tested against reality, not just published or discussed.
Why Product Focus Improves Research Quality
A product-driven approach forces hard questions:
- Does this actually work better?
- Does it help users?
This pressure prevents self-deception. It encourages precision rather than hype.
Research for its own sake is valuable, but research tied to products must meet higher standards. Claims must survive real-world use.
This does not mean novelty is bad. It means novelty must be justified.
Startup Gravity Applies to Research Companies Too
Some believe that research companies are exempt from normal startup rules. This belief is dangerous.
All companies face constraints. Resources are limited. Focus matters. Execution matters.
The lessons taught by startup accelerators and experienced founders apply just as much to research-driven companies. Ignoring these lessons leads to wasted effort.
Research ambition must be paired with operational discipline.
Conclusion
AI progress is not finished. Scaling existing models is powerful, but it is not the final answer. Architecture, compression, multimodality, and interaction remain open challenges.
By focusing on new architectures, grounded multimodal problems, and real products, companies like Cartesia aim to push AI toward more human-like intelligence.
The future of AI will not be defined only by bigger models, but by better understanding of how intelligence works, how signals become meaning, and how systems interact with the world over time.
That future requires both deep research and honest product thinking.