SeCom: Redefining Memory Management in Conversational AI

Foreword

I’ve recently been diving into memory management for dialog-based AI, especially how to construct and retrieve memories in long-term conversations. During my exploration I came across an eye-opening ICLR 2025 paper, **“SeCom: On Memory Construction and Retrieval for Personalized Conversational Agents”**, a collaboration between Microsoft and Tsinghua University.

SeCom solves a core problem: How can an agent effectively manage and retrieve historical information in prolonged conversations? In this post I’ll unpack the method’s key ideas and technical innovations, hoping to spark inspiration for researchers working in this arena.

1. Why Should We Care About Dialog Memory Management?

1.1 Real-World Challenges in Long Conversations

Anyone who chats with LLMs regularly has probably experienced this: once a conversation grows long, the agent seems to “forget” earlier context or respond incoherently. That’s the memory problem in action.

Even with long-context models, super-long dialogs increase compute cost and often degrade quality. Key challenges include:

  • Context length limits: Token budgets remain finite.
  • Information relevance: History contains plenty of facts irrelevant to the current query.
  • Semantic coherence: Related information may be scattered across non-contiguous turns.
  • Personalization: The agent must remember user preferences and interaction patterns.

1.2 A Quick Landscape of Existing Approaches

The community’s strategies roughly split into three camps:

  1. “Give Me Everything” (full history)
    • Complete information, zero recall loss.
    • But like moving an entire library just to find one book—computational overkill.
  2. “Bullet-Point Digest” (summaries)
    • Compact and efficient.
    • Risk of omitting crucial details during abstraction.
  3. “Precision Strike” (retrieval-based)
    • Fetch only what you need, exactly when you need it.
    • Success hinges on choosing the right retrieval granularity—precisely the issue SeCom addresses.

1.2.1 Retrieval-Augmented Generation (RAG) in Dialog

RAG faces dialog-specific hurdles:

  • Chunking strategy: How to segment a dialog into retrievable units.
  • Relevance estimation: Harder than in static docs due to dialog dynamics.
  • Temporal dependency: Order matters; turns refer to earlier context.

1.3 The Granularity Dilemma

Memories are usually indexed at the turn level or at the whole-conversation level, with summarization as a common third option. Each of these breaks down:

  • Turn-level → fragments context, loses dependencies, retrieval recall suffers.
  • Conversation-level → topic mixture, lots of noise, retrieval becomes coarse.
  • Summaries → irreversible information loss.

SeCom’s insight: dialog naturally contains paragraph-level thematic boundaries. Segmenting at this “just-right” granularity preserves coherence without exploding memory size.

2. Inside SeCom

2.1 Two Key Insights

  1. Paragraph-like topic shifts exist in dialog just as they do in essays.
  2. Natural language is redundant: filler words, confirmations, and small talk add noise, and removing them boosts retrieval precision.

Hence SeCom = Segmentation + Compression.

2.2 System Pipeline

History → [Segmenter] → Paragraph-level units → [Compressor] → Denoised memories → [Retriever] → Relevant context → [Generator] → Final reply

Technically, the pipeline runs in four steps (a code sketch follows the list):

  1. Segmenter $f_{\mathcal I}$ splits the dialog.
  2. Compressor $f_{comp}$ denoises each segment.
  3. Retriever $f_R$ ranks memories for the current user utterance $u^*$.
  4. LLM $f_{LLM}$ produces the answer based on top-N memories.
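To make the flow concrete, here is a minimal Python sketch of the pipeline. The callables `segment`, `compress`, `retrieve`, and `generate` are placeholders standing in for $f_{\mathcal I}$, $f_{comp}$, $f_R$, and $f_{LLM}$; the function and its parameter names are mine for illustration, not from the paper.

```python
from typing import Callable, List

def secom_reply(
    history: List[str],                                   # full dialog history, one string per turn
    user_query: str,                                      # current user utterance u*
    segment: Callable[[List[str]], List[List[str]]],      # f_I: turns -> paragraph-level segments
    compress: Callable[[str], str],                       # f_comp: denoise one segment
    retrieve: Callable[[str, List[str]], List[str]],      # f_R: rank memories for the query
    generate: Callable[[str, List[str]], str],            # f_LLM: answer from the top-N memories
    top_n: int = 3,
) -> str:
    """Minimal sketch of the SeCom flow: segment -> compress -> retrieve -> generate."""
    segments = segment(history)                                 # paragraph-level units
    memories = [compress("\n".join(seg)) for seg in segments]   # denoised memory bank
    context = retrieve(user_query, memories)[:top_n]            # most relevant memories
    return generate(user_query, context)                        # final reply
```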

2.3 How to Segment Without Labels

SeCom leverages GPT-4 in a zero-shot fashion: craft a prompt asking the model to mark topic boundaries and output span indices. No training data required.
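As a rough illustration, a zero-shot segmenter built on the OpenAI chat API could look like the sketch below. The prompt text, the `segment_zero_shot` helper, and the "start-end" output format are assumptions of mine; the paper's actual segmentation prompt is more carefully engineered.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompt; the paper's real segmentation prompt differs.
SEGMENT_PROMPT = """You are given a conversation as numbered turns.
Group consecutive turns that discuss the same topic into segments.
Output one segment per line as "start-end" (inclusive turn indices), nothing else.

{conversation}"""

def segment_zero_shot(turns: list[str], model: str = "gpt-4") -> list[tuple[int, int]]:
    """Ask the LLM to mark topic boundaries and return (start, end) turn spans."""
    conversation = "\n".join(f"[{i}] {turn}" for i, turn in enumerate(turns))
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": SEGMENT_PROMPT.format(conversation=conversation)}],
        temperature=0,
    )
    spans = []
    for line in resp.choices[0].message.content.splitlines():
        parts = line.strip().split("-")
        if len(parts) == 2 and all(p.strip().isdigit() for p in parts):
            spans.append((int(parts[0]), int(parts[1])))
    return spans
```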

When limited gold data are available, a reflection-based loop iteratively refines the guidelines using WindowDiff scores and GPT-4 reasoning.
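For reference, WindowDiff slides a fixed-size window over the predicted and gold boundary sequences and counts the windows in which the number of boundaries disagrees; lower is better. A minimal implementation (NLTK also ships one as `nltk.metrics.windowdiff`):

```python
def window_diff(reference: str, hypothesis: str, k: int) -> float:
    """WindowDiff (Pevzner & Hearst, 2002) between two boundary strings such as
    "01001000", where each character corresponds to a turn and '1' marks a
    segment boundary after that turn. Returns a value in [0, 1]; 0 = identical."""
    assert len(reference) == len(hypothesis), "segmentations must align turn-for-turn"
    n = len(reference)
    disagreements = sum(
        reference[i:i + k].count("1") != hypothesis[i:i + k].count("1")
        for i in range(n - k + 1)
    )
    return disagreements / (n - k + 1)

# Example: predicted vs. gold segmentation over 8 turns, window size k=3
print(window_diff("01001000", "01000100", k=3))  # 2 of 6 windows disagree -> ~0.33
```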

An incremental segmenter decides on-the-fly whether a new turn merges into the previous segment or starts a fresh one.

2.4 Denoising via LLMLingua-2

LLMLingua-2 scores token importance and keeps only the most informative $(1-r)$ fraction of tokens (e.g., 25%). Empirically, retaining just 25% of the tokens preserves over 95% of the key information, lifts retrieval GPT4Score by +9.46, and yields a roughly 4× speed-up.
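Concretely, compressing one segment with the LLMLingua-2 package looks roughly like the sketch below. The model name follows the examples in the LLMLingua repository and the result key reflects my understanding of its API, so double-check against the official README.

```python
from llmlingua import PromptCompressor

# LLMLingua-2 token-classification compressor (model name taken from the repo's examples)
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

segment_text = (
    "user: Can you recommend a good sci-fi novel? "
    "assistant: Sure! If you liked The Three-Body Problem, try Project Hail Mary. ..."
)

# rate is the fraction of tokens to keep, i.e. (1 - r); 0.25 keeps ~25% of the tokens.
result = compressor.compress_prompt(segment_text, rate=0.25)
denoised_memory = result["compressed_prompt"]
```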

2.5 Hybrid Retrieval

BM25 (sparse) and MPNet (dense) scores are linearly combined:

$$\text{score}_{\text{hybrid}}=\alpha\,\text{BM25}+(1-\alpha)\,\text{MPNet},\quad \alpha=0.6$$
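A sketch of such a hybrid scorer, using `rank_bm25` for the sparse side and a Sentence-Transformers MPNet encoder for the dense side. Since raw BM25 scores and cosine similarities live on different scales, I min-max normalize both before mixing; that normalization step is my assumption, not something stated above.

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

def hybrid_scores(query: str, memories: list[str], alpha: float = 0.6) -> np.ndarray:
    """Blend sparse (BM25) and dense (MPNet cosine) relevance scores."""
    # Sparse side: BM25 over whitespace-tokenized memories
    bm25 = BM25Okapi([m.lower().split() for m in memories])
    sparse = np.asarray(bm25.get_scores(query.lower().split()), dtype=float)

    # Dense side: cosine similarity with an MPNet sentence encoder
    encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
    q_emb = encoder.encode(query, convert_to_tensor=True)
    m_emb = encoder.encode(memories, convert_to_tensor=True)
    dense = util.cos_sim(q_emb, m_emb).cpu().numpy().ravel()

    def minmax(x: np.ndarray) -> np.ndarray:
        return (x - x.min()) / (x.max() - x.min() + 1e-9)

    return alpha * minmax(sparse) + (1 - alpha) * minmax(dense)

# Rank memories for the current utterance and keep the top 3
# top_ids = np.argsort(-hybrid_scores(u_star, memories))[:3]
```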

3. Final Thoughts

3.1 What SeCom Teaches Us

  • Simplicity Wins: Segment + Compress, nothing fancy, yet highly effective.
  • Understand the Problem First: The authors nailed the granularity pain-point before designing a solution.

Future directions:

  • Personalized segmentation tuned to each user’s dialog style.
  • Real-time adaptation of compression and segmentation based on quality metrics.

References

This post is based on the ICLR 2025 paper “SeCom: On Memory Construction and Retrieval for Personalized Conversational Agents” from Microsoft and Tsinghua University. Please refer to the original publication and the accompanying open-source repository for implementation details.