OCX 26 Experience
OC for AI

AI-Powered Trace Analysis

Adding MCP to TMLL

Matthew Khouzam

Ericsson Research · Open Community Experience 2026

Agenda

  • Tracing, Trace Compass & TMLL
  • What is MCP & why it matters
  • Performance, cost & privacy
  • How to start: create a CLI, wrap it in MCP
  • Debug with Theia & MCP Inspector
  • See results in Kiro, Theia & Gemini
  • Key takeaways

Who Am I?

  • Matthew Khouzam, Principal Researcher & Open Source Developer, Ericsson, Montréal
  • 16 years in FOSS: Eclipse Trace Compass, Linux tracing, TMLL
  • University collaborations (Polytechnique Montréal, Concordia)
  • Former (and future ;) ) board of directors member, CoC committee member, former chair of eCDT

I have the privilege of being paid by Ericsson to make the world a better place through open source.

Tracing & Trace Compass

Context before the code

What is Tracing?

  • Recording timestamped events from a running system: kernel, userspace, network
  • Non-intrusive: nanosecond overhead, always-on capable
  • Produces massive datasets: a few minutes → gigabytes of events

What is Trace Compass?

  • Open-source Eclipse project for trace visualization & analysis
  • Supports LTTng, CTF, perf, ftrace, and more
  • Powerful, but requires expertise to use effectively
  • Trace Server Protocol (TSP) exposes analysis via REST API
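
A quick way to see TSP in action, a minimal sketch assuming a trace server on its default port 8080 and the endpoint paths from the published TSP spec:

import requests

# Talk to a locally running trace server over the Trace Server Protocol (TSP)
base = "http://localhost:8080/tsp/api"
print(requests.get(f"{base}/health").json())       # liveness check
print(requests.get(f"{base}/experiments").json())  # experiments the server knows about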

Why Add MCP?

  • Users shouldn't need to be tracing experts to get insights
  • "Find CPU anomalies in this trace" is easier than navigating 20 views
  • AI handles the how, the user focuses on the what

Personal motivation: make trace analysis accessible to everyone, not just the experts who built it

What's New in Trace Compass

  • CTF2 support: JSON-based trace metadata, human-readable and AI-parseable
  • Opens the door to smarter AI-driven analysis
  • AI can read trace structure without binary parsing

JSON metadata means AI agents can understand trace schemas natively
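
For a flavour, CTF2 metadata is a stream of JSON fragments; per the CTF2 specification it opens with a preamble fragment like the one below (the trace, stream, and event record class fragments that follow are elided):

{"type": "preamble", "version": 2}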

TMLL, The ML Layer

  • Anomaly detection: iforest, z-score, IQR, moving average, combined
  • Change point analysis: single, z-score, voting, PCA
  • Correlation analysis: Pearson, Spearman, Kendall
  • Memory leak detection, idle resources, capacity planning

Python ML library on top of the Trace Server Protocol: powerful, but requires code to use

What is MCP?

Model Context Protocol, the universal adapter between AI and tools

MCP in 30 Seconds

  • Open protocol: JSON-RPC over stdio or HTTP
  • Write a tool once → use from any MCP-compatible agent
  • Adopted by OpenAI, Anthropic, Google, all major IDEs

AI Agent → MCP Server → Your Tool / API
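
On the wire, one tool invocation is a single JSON-RPC request; the method and params shape come from the MCP spec, while the tool name and arguments anticipate the TMLL server shown later (values illustrative):

{"jsonrpc": "2.0", "id": 1, "method": "tools/call",
 "params": {"name": "detect_anomalies",
            "arguments": {"experiment_id": "exp-42", "method": "iforest"}}}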

Why MCP Matters for You

Deterministic tools + AI reasoning = best of both worlds

Deterministic Execution

  • AI decides what to call, your tool decides how to execute
  • No hallucinated analysis, real code runs on real data
  • Reproducible results: same input → same output, every time

Save Tokens

  • Don't paste raw data into the prompt, let the tool process it server-side
  • Return summaries, not megabytes of CSV
  • Progressive discovery: only load schemas for tools you actually use (~80% savings)

Improve Your Existing Tools

  • Your CLI already works, MCP makes it AI-accessible
  • No rewrite needed: wrap, don't replace
  • Users who can't write Python can now use your tool via natural language

MCP doesn't replace your tools, it gives them a new audience

Performance & Cost

Why not just feed the trace to the LLM?

Raw Trace → LLM vs MCP + TMLL

Approach         | Tokens | Cost (est.)        | Result
babeltrace → LLM | ~1B+   | $1000s per query   | Hallucinated, non-deterministic
MCP + TMLL       | ~2K    | fraction of a cent | Deterministic ML analysis

A 1 GB kernel trace ≈ billions of text tokens, which exceeds every context window

  • LLMs can't reliably do statistics on raw events
  • MCP approach: AI sends one tool call, gets deterministic results back
  • Trace Server + TMLL do the heavy lifting, AI just interprets

MCP Overhead

Operation            | Direct TMLL API | MCP (via CLI)
Experiment creation  | ~200 ms         | ~800 ms
Anomaly detection    | ~2 s            | ~3 s
Correlation analysis | ~1.5 s          | ~2.5 s

  • ~1 s overhead per call (subprocess + Python startup)
  • Negligible next to LLM inference latency
  • Future: in-process mode eliminates the cost

Runs Locally, Privacy & Free Compute

  • Trace Compass, TMLL, and the MCP server all run on your machine
  • Your traces never leave your network: critical for regulated, proprietary, or customer data
  • Uses compute you already own, no new cloud bill, no GPU rental
  • The AI only sees the small, aggregated result, not the raw events

We already had the hardware, the server, and the library. MCP just unlocked them.

How to Start

Step 1: Create a CLI

Create a CLI

#!/usr/bin/env python3
"""tmll_cli.py, 12 subcommands wrapping the TMLL library."""
import argparse
from tmll.tmll_client import TMLLClient
# The AnomalyDetection import and the get_experiment() helper live
# elsewhere in the file; elided here to fit the slide.

def detect_anomalies(args):
    client = TMLLClient(args.host, args.port)
    experiment = get_experiment(client, args.experiment)
    outputs = experiment.find_outputs(keyword=args.keywords, type=['xy'])
    ad = AnomalyDetection(client, experiment, outputs)
    result = ad.find_anomalies(method=args.method)
    # shape assumed: result.anomalies maps each output to its anomaly list
    total = sum(len(a) for a in result.anomalies.values())
    print(f"Found {total} anomalies across {len(result.anomalies)} outputs")

# ... 11 more subcommands ...

Standard argparse, nothing MCP-specific yet
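
So you can exercise it from a shell before any AI is involved; the subcommand name and -k/-m flags below are inferred from the MCP wrapper on the next slide, so treat the invocation as illustrative:

./tmll_cli.py anomaly <experiment-id> -k "cpu usage" -m iforest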

Step 2: Wrap CLI in MCP

~615 lines of Python total

The MCP Wrapper

from mcp.server.fastmcp import FastMCP
mcp = FastMCP("tmll-cli-mcp-server")

@mcp.tool()
def detect_anomalies(experiment_id: str,
                     keywords: list[str] | None = None,
                     method: str | None = None) -> str:
    """Detect anomalies in trace data using ML methods."""
    args = build_args({
        "keywords": ("-k", keywords or ["cpu usage"]),
        "method": ("-m", method or "iforest"),
    })
    return run_cli("anomaly", experiment_id, *args)

Any CLI → MCP tool in 5 lines of glue
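
The build_args and run_cli helpers aren't shown on the slide; here is a plausible sketch, assuming the CLI takes the subcommand and experiment ID positionally as in the call above (everything else is guesswork):

import subprocess
import sys

def build_args(mapping: dict) -> list[str]:
    """Flatten {name: (flag, value)} pairs into a CLI argument list."""
    args: list[str] = []
    for flag, value in mapping.values():
        if value is None:
            continue  # unset option, let the CLI apply its own default
        values = value if isinstance(value, list) else [value]
        args += [flag, *map(str, values)]
    return args

def run_cli(command: str, experiment_id: str, *args: str) -> str:
    """Run the argparse CLI as a subprocess and return its output as text."""
    proc = subprocess.run(
        [sys.executable, "tmll_cli.py", command, experiment_id, *args],
        capture_output=True, text=True, timeout=300,
    )
    return proc.stdout if proc.returncode == 0 else f"Error: {proc.stderr}"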

The Stack

TMLL, Python ML library for trace analysis
↓ wrapped as
CLI, argparse, 12 commands
↓ wrapped as
MCP Server, FastMCP + subprocess
↓ used by
Any AI agent: Kiro, Gemini, Theia, Goose…

Why MCP → CLI → Library?

  • Separation of concerns: you can run and test the CLI independently, no AI needed
  • MCP is new (2024); building on a proven CLI layer puts it on tested footing
  • One path to maintain: why have two code paths when one works for both humans and AI?

12 Tools Exposed

  • ensure_server
  • create_experiment
  • list_experiments
  • list_outputs
  • fetch_data
  • delete_experiment
  • detect_anomalies
  • detect_memory_leak
  • detect_changepoints
  • analyze_correlation
  • detect_idle_resources
  • plan_capacity
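
A typical agent session chains a handful of these, for example (tool names from the list above, arguments purely illustrative):

ensure_server()
create_experiment(traces=["my_kernel_trace"])    # returns an experiment_id
detect_anomalies(experiment_id, method="iforest")
plan_capacity(experiment_id)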

Progressive Discovery

Eager Loading (old)

  • All 12 tool schemas sent upfront
  • ~2,200 tokens consumed before the user says anything

Progressive Discovery (new)

  • Schemas loaded on demand
  • Typical session: 2–3 tools → ~80% savings

FastMCP's progressive discovery means the AI only pays for what it uses
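
Back-of-envelope: 2,200 tokens for 12 schemas is ~185 tokens per tool, so a 2–3 tool session costs roughly 370–550 tokens, in line with the ~80% saving quoted above.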

MCP Apps

Rich content returned inline

Images in AI Output

from mcp.server.fastmcp import Image  # re-exported by the MCP Python SDK

@mcp.tool()
def plot_xy_with_anomalies(experiment_id: str,
                           as_image: bool = True) -> Image:
    """Detect anomalies and return annotated charts."""
    # ... run analysis, render a matplotlib figure into `buf` (a BytesIO) ...
    return Image(data=buf.getvalue(), format="png")

The AI doesn't just describe the anomaly, it shows you the chart
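
For completeness, one way the elided rendering step could fill in buf, a generic headless-matplotlib sketch with nothing TMLL-specific in it:

import io

import matplotlib
matplotlib.use("Agg")  # render off-screen; the MCP server has no display
import matplotlib.pyplot as plt

def render_png(fig: plt.Figure) -> io.BytesIO:
    """Serialize a matplotlib figure into an in-memory PNG buffer."""
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight")
    plt.close(fig)  # release the figure once serialized
    return buf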

Debugging MCP

Theia IDE & MCP Inspector

Theia, AI Agent History

Theia MCP debugging view

MCP Inspector

MCP Inspector, test tools interactively without an AI client
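
One way to launch it against this server (the inspector package name is the official one; the server filename is illustrative):

npx @modelcontextprotocol/inspector python tmll_mcp_server.py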

See the Results

One MCP server, every AI client

Kiro IDE

Kiro running TMLL MCP

Eclipse Theia & Gemini CLI

Theia MCP results

Theia, inline anomaly chart

Gemini CLI resource usage

Gemini, 1 GB trace, 85K tokens, 5 tool calls

Key Takeaways

  • ~615 lines of Python to make TMLL AI-accessible
  • Pattern: Library → CLI → MCP works for any tool
  • One MCP server → works across Kiro, Gemini, Theia, Goose
  • Progressive discovery saves ~80% of token overhead
  • Runs locally, your traces stay on your machine, on hardware you already own
  • MCP Apps: charts and images inline in AI responses
  • The bottleneck is never the MCP glue; it's the ML analysis and LLM inference
  • Many MCP tools in the wild are sub-optimal; don't judge the technology by a few bad implementations

MCP is a protocol, not a product. The quality is in your hands.


Thank You

matthew.khouzam@ericsson.com

github.com/eclipse-tmll/tmll/pull/16

Merged last night!

Copyright © Eclipse Foundation AISBL and contributors. Made available under CC-BY-SA 4.0 International.
