Imagine this: four AI experts sitting around a poker table, debating the toughest questions in real time. That’s exactly what Consilium does, the multi-LLM platform I built for the Gradio Agents & MCP Hackathon: AI models discuss complex questions and reach consensus through structured debate.
The platform acts as an MCP (Model Context Protocol) server that integrates with the visual Gradio interface and with Cline (Claude Desktop had issues and could not be coordinated reliably). The core idea has always been LLMs reaching consensus through discussion; that’s where the name Consilium comes from. Other decision modes, such as majority voting and ranked choice, were added later to refine the collaboration.
From concept to architecture
This was not my original hackathon idea. I initially wanted to build a simple MCP server for talking to a project on RevenueCat. However, when I noticed how much richer the answers became once several models discussed a question together, I rethought the plan and pivoted to a multi-LLM platform.
The timing turned out to be perfect. Shortly after the hackathon, Microsoft released the AI Diagnostic Orchestrator (MAI-DxO), an AI physician panel whose members take on different roles, such as a “Dr. Challenger” agent, and iteratively diagnose patients. Paired with OpenAI’s o3, the setup correctly solved 85.5% of the medical-diagnosis benchmark cases, while practicing physicians achieved only about 20% accuracy. That is exactly what Consilium demonstrates: when multiple AI perspectives collaborate, they can dramatically outperform individual analysis.
With the concept settled, I needed something that could serve both as an MCP server and as an eye-catching Hugging Face Space demo. Initially I considered the standard Gradio chat component, but I wanted the submission to stand out. The idea was to seat the LLMs around a boardroom table with speech bubbles. I couldn’t style a standard table convincingly, so I switched to a poker-style round table, which was instantly recognizable as a table. This approach also let me submit to two hackathon tracks: building a custom Gradio component and an MCP server.
Building a visual foundation
The custom Gradio component became the centerpiece of the submission. A poker-style roundtable where the participants sit, with speech bubbles showing responses, thinking status, and research activity, quickly caught the attention of visitors to the Space. Thanks to Gradio’s excellent developer experience, component development was remarkably smooth, though I ran into one documentation gap around PyPI publishing, which led to my first contribution to the Gradio project.
roundtable = consilium_roundtable(
    label="AI Expert Roundtable",
    label_icon="https://huggingface.co/front/assets/huggingface_logo-noborder.svg",
    value=json.dumps({
        "participants": [],
        "messages": [],
        "currentSpeaker": None,
        "thinking": [],
        "showBubbles": [],
        "avatarImages": avatar_images
    })
)
The visual design proved robust throughout the hackathon. After the initial implementation, only features such as user-defined avatars and center-table text were added; the core interaction model never changed.
If you’re interested in creating your own custom Gradio components, take a look at the Custom Components in 5 Minutes guide. The title doesn’t lie: the basic setup really takes only five minutes.
Session state management
The visual roundtable maintains state through a session-based dictionary in which each user gets isolated state storage via user_sessions[session_id]. The core state object tracks the participants, messages, currentSpeaker, thinking, and showBubbles arrays, updated via update_visual_state() callbacks. Whenever a model is thinking, making a statement, or researching, the engine pushes an incremental state update to the frontend: it appends to the messages array and toggles the speaker/thinking flags, producing a real-time visual flow without complex state machines. In effect, it is direct JSON state mutation kept in sync between backend processing and frontend rendering.
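As a minimal sketch of this session-state pattern (the helper names and update semantics here are my illustration of the description above, not the actual implementation):

```python
import json

# Per-session state store: each session_id maps to its own isolated state dict.
user_sessions = {}

def get_or_create_session(session_id: str) -> dict:
    """Return the state dict for a session, creating it on first access."""
    if session_id not in user_sessions:
        user_sessions[session_id] = {
            "participants": [],
            "messages": [],
            "currentSpeaker": None,
            "thinking": [],
            "showBubbles": [],
        }
    return user_sessions[session_id]

def update_visual_state(session_id: str, **changes) -> str:
    """Mutate the session state and return the JSON payload the frontend renders."""
    state = get_or_create_session(session_id)
    for key, value in changes.items():
        if key == "messages":
            state["messages"].append(value)  # incremental append, not replace
        else:
            state[key] = value
    return json.dumps(state)

# Example: a model starts thinking, then posts a statement.
update_visual_state("abc", currentSpeaker="Mistral", thinking=["Mistral"])
payload = update_visual_state("abc",
                              messages={"speaker": "Mistral", "text": "..."},
                              thinking=[])
```

Each call produces a complete JSON snapshot, so the frontend only ever has to render the latest state.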
Productive discussions between LLMs
During implementation, I realized there was no actual discussion between the LLMs because they lacked clear roles. They received the full context of the ongoing discussion but didn’t know how to contribute anything meaningful. Introducing distinct roles created productive discussion dynamics. This is what they looked like after a few adjustments:
self.roles = {
    'standard': "You provide expert analysis with clear reasoning and evidence.",
    'expert_advocate': "You are a passionate expert advocating your professional position. Present compelling evidence with confidence.",
    'critical_analyst': "You are a critical analyst. Identify flaws, risks, and weaknesses in arguments with analytical precision.",
    'strategic_advisor': "You are a strategic advisor. Focus on practical implementation, real-world constraints, and actionable insights.",
    'research_specialist': "You are a research specialist with deep domain knowledge. Provide authoritative analysis and evidence-based insights.",
    'innovation_catalyst': "You are an innovation catalyst. Challenge conventional thinking and propose breakthrough approaches."
}
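To make the mechanism concrete, here is a hedged sketch of how such a role string could be combined with the shared discussion context into a prompt (the wording and the `build_prompt` helper are my illustration, not the project’s exact code):

```python
# Two of the roles described above, used for illustration.
roles = {
    'standard': "You provide expert analysis with clear reasoning and evidence.",
    'critical_analyst': "You are a critical analyst. Identify flaws, risks, and weaknesses in arguments with analytical precision.",
}

def build_prompt(role: str, question: str, discussion_log: list) -> str:
    """Prepend the role persona, then the question and the discussion so far."""
    context = "\n".join(discussion_log) if discussion_log else "(no prior statements)"
    return (
        f"{roles[role]}\n\n"
        f"Question under discussion: {question}\n\n"
        f"Discussion so far:\n{context}\n\n"
        "Give your next contribution."
    )

prompt = build_prompt('critical_analyst',
                      "Should we migrate to microservices?",
                      ["Mistral: The monolith limits scaling."])
```

Because every participant sees the same context but a different persona, the same transcript yields genuinely different contributions.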
This solved the discussion problem but raised a new question: how do you determine consensus or identify the strongest argument? I implemented a lead-analyst system that lets users pick one LLM to synthesize the final result and assess whether consensus has been reached.
I also wanted users to control the communication structure, so I added two alternative modes beyond the default full context sharing:

Ring: each LLM receives only the response of the previous participant.
Star: all messages flow through the lead analyst as the central coordinator.
Finally, discussions need an endpoint, so I implemented configurable rounds (1-5). Testing showed that more rounds make consensus more likely, at higher computational cost.
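Putting the rounds, communication modes, and lead analyst together, the overall flow can be sketched as a toy loop (the `query_model` stub and the `"CONSENSUS"` marker are my assumptions standing in for the real LLM API calls and consensus check):

```python
def query_model(model, question, visible):
    """Stand-in for a real LLM API call; returns a dummy reply."""
    return f"{model} answers '{question}' seeing {len(visible)} prior messages"

def run_discussion(question, participants, lead_analyst, mode="full", max_rounds=3):
    """Toy discussion loop: participants speak each round, then the lead
    analyst synthesizes and decides whether consensus was reached."""
    log = []
    for round_num in range(1, max_rounds + 1):
        for model in participants:
            if mode == "full":
                visible = log                    # everyone sees everything
            elif mode == "ring":
                visible = log[-1:]               # only the previous response
            else:                                # "star"
                visible = [m for m in log if m["speaker"] == lead_analyst]
            reply = query_model(model, question, visible)
            log.append({"speaker": model, "round": round_num, "text": reply})
        verdict = query_model(lead_analyst,
                              f"Synthesize and check consensus: {question}", log)
        if "CONSENSUS" in verdict:
            break
    return verdict, log
```

The loop terminates early on consensus or after the configured number of rounds, matching the 1-5 round setting described above.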
LLM selection and research integration
The current model selection includes Mistral Large, DeepSeek-R1, Meta-Llama-3.3-70B, and QwQ-32B. Notable models such as Claude Sonnet or OpenAI’s o3 are missing, but this reflects hackathon credit availability and sponsor-prize considerations rather than a technical limitation.
self.models = {
    'mistral': {
        'name': 'Mistral Large',
        'api_key': mistral_key,
        'available': bool(mistral_key)
    },
    'sambanova_deepseek': {
        'name': 'DeepSeek-R1',
        'api_key': sambanova_key,
        'available': bool(sambanova_key)
    },
    # ...
}
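The `bool(<key>)` pattern means a model simply disappears from the selection when its key is missing. A hedged sketch of that discovery step (the environment-variable names here are my guesses, not the project’s actual configuration):

```python
import os

def discover_models(env=os.environ):
    """Mark each model 'available' only if its API key is set."""
    keys = {
        'mistral': env.get('MISTRAL_API_KEY'),          # assumed variable name
        'sambanova_deepseek': env.get('SAMBANOVA_API_KEY'),  # assumed variable name
    }
    return {mid: {'api_key': k, 'available': bool(k)} for mid, k in keys.items()}

# Example: only the Mistral key is configured.
models = discover_models({'MISTRAL_API_KEY': 'sk-test'})
active = [m for m, cfg in models.items() if cfg['available']]
```

This keeps the roundtable functional with whatever subset of providers the user has credits for.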
For models that support function calling, I integrated a dedicated research agent that appears as another participant at the roundtable. Rather than giving each model direct web access, this agent-based approach makes it visually clear when external sources are consulted and guarantees consistent access for every function-calling model.
def handle_function_calls(self, completion, original_prompt: str, calling_model: str) -> str:
    """Unified function-call handler with enhanced research capabilities"""
    message = completion.choices[0].message
    if not hasattr(message, 'tool_calls') or not message.tool_calls:
        return message.content
    for tool_call in message.tool_calls:
        function_name = tool_call.function.name
        arguments = json.loads(tool_call.function.arguments)
        result = self._execute_research_function(function_name, arguments, calling_model)
The research agent has access to five sources: web search, Wikipedia, arXiv, GitHub, and SEC EDGAR. The tools are built on an extensible base-class architecture for future expansion, with a focus on freely accessible resources.
class BaseTool(ABC):
    """Base class for all research tools"""
    def __init__(self, name: str, description: str):
        self.name = name
        self.description = description
        self.last_request_time = 0
        self.rate_limit_delay = 1.0

    @abstractmethod
    def search(self, query: str, **kwargs) -> str:
        """Main search method - implemented by subclasses"""
        pass

    def score_research_quality(self, research_result: str, source: str = "web") -> Dict[str, float]:
        """Score research based on recency, authority, specificity, and relevance"""
        quality_scores = {
            "recency": self._check_recency(research_result),
            "authority": self._check_authority(research_result, source),
            "specificity": self._check_specificity(research_result),
            "relevance": self._check_relevance(research_result)
        }
        return quality_scores
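A hedged example of extending such a base class (the `WikipediaTool` subclass and the `_respect_rate_limit` helper are my illustration; the real tools call live APIs). The base class is repeated in reduced form so the example is self-contained:

```python
import time
from abc import ABC, abstractmethod

class BaseTool(ABC):
    """Reduced base class for research tools, mirroring the pattern above."""
    def __init__(self, name: str, description: str):
        self.name = name
        self.description = description
        self.last_request_time = 0.0
        self.rate_limit_delay = 1.0

    def _respect_rate_limit(self):
        """Sleep if the previous request was made too recently."""
        elapsed = time.time() - self.last_request_time
        if elapsed < self.rate_limit_delay:
            time.sleep(self.rate_limit_delay - elapsed)
        self.last_request_time = time.time()

    @abstractmethod
    def search(self, query: str, **kwargs) -> str:
        pass

class WikipediaTool(BaseTool):
    """Illustrative subclass; a real version would query the Wikipedia API."""
    def __init__(self):
        super().__init__("wikipedia", "Encyclopedic background lookups")

    def search(self, query: str, **kwargs) -> str:
        self._respect_rate_limit()
        return f"[wikipedia] results for: {query}"

tool = WikipediaTool()
result = tool.search("multi-agent debate")
```

Adding a new source is then just another subclass implementing `search()`, which is what makes the architecture extensible.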
Because research operations can be time-intensive, the speech bubbles display progress indicators and time estimates to keep users engaged during longer research tasks.
Discovering the Open Floor Protocol
After the hackathon, Deborah Dahl introduced me to the Open Floor Protocol, which aligns perfectly with the roundtable approach. The protocol defines a standardized JSON message format for cross-platform agent communication. An important differentiator from other agent-to-agent protocols is that all agents maintain constant conversational awareness, as if they were all sitting at the same table. Another feature I haven’t seen in any other protocol is that a floor manager can dynamically invite agents to the floor and remove them from it.
The protocol’s interaction patterns map directly onto Consilium’s architecture:

Delegation: handing control from one agent to another.
Channeling: passing messages along without modification.
Orchestration: behind-the-scenes coordination of multi-agent collaboration.
I have since integrated Open Floor Protocol support so that users can add OFP-compliant agents to the roundtable. You can follow this development at https://huggingface.co/spaces/azettl/consilium_ofp
Lessons learned and future implications
The hackathon introduced me to research on multi-agent debate that I had never encountered before, including foundational work such as “Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate.” The community experience was amazing: participants actively supported each other through feedback and collaboration. Seeing my roundtable component integrated into another hackathon project was one of the highlights of working on Consilium.
I continue to work on Consilium: with an extended model selection, Open Floor Protocol integration, and configurable agent roles, the platform should eventually support every multi-agent debate scenario imaginable.
Building Consilium reinforced my conviction that the future of AI lies not just in stronger individual models, but in systems that enable effective AI collaboration. As specialized small language models become more efficient and resource-friendly, I believe task-specific SLM roundtables with dedicated research agents may offer a compelling alternative to generic large language models for many use cases.

