CueZen

How Do ID-Free Models Stack Up? A Performance Benchmark Against Our Current ID-Based RecSys

In our previous post, “From IDs to Meaning: The Case for Semantic Embeddings in Recommendation,” we made the case for moving beyond ID-based representations to metadata-driven approaches, and explained why this shift is promising for domains like digital health and personalized health, where personalization needs to go beyond static IDs. By leveraging semantic embeddings derived from descriptive metadata using Large Language Models (LLMs), ID-free approaches aim to address common challenges such as cold-start problems, limited generalizability, and deployment complexity.

In this post, we move from concept to evidence. Our central question is: How do standard recommendation architectures perform when powered by ID-free semantic embeddings, compared to our current ID-based Knowledge Graph Attention Network (KGAT) recommender model?

Models Compared

For this evaluation, we selected models that reflect two distinct approaches to user and nudge representation: Baseline Models (both non-embedding and ID-based) and ID-Free Enhanced Models.

Baseline Models

These models serve as foundational benchmarks for evaluating the value of more advanced, semantic-embedding-driven approaches:

  • Random: A basic, non-personalized baseline that sets the lower bound for performance, whereby nudges are randomly selected for users.
  • Popular: A simple heuristic that recommends items based on how frequently they were interacted with across the entire available history. This provides a static, global view of popularity and serves as a non-personalized baseline. While it doesn’t use embeddings or user-specific information, it offers a useful reference point for evaluating the performance of learned models.
  • Knowledge Graph Attention Network (KGAT): Our current production model, KGAT, uses the traditional ID-based approach in a Graph Neural Network (GNN). It assigns unique numerical identifiers to each entity (e.g. users, nudges, markers, segments) within a knowledge graph and learns dedicated embeddings for each. An attention mechanism allows the model to capture complex, multi-hop relationships, making it especially effective at leveraging structured knowledge and explicit interaction histories, thereby serving as a strong ID-based benchmark [1]. This model has been successfully personalizing nudges for millions of participants daily, with very good results [2].

ID-Free Enhanced Models Leveraging Metadata

These models represent traditional, well-established recommendation architectures, but with a key change: instead of learning embeddings from arbitrary IDs, they are adapted to operate on pre-computed semantic embeddings. These embeddings are generated using LLMs applied to rich metadata—such as user health profiles, behavioral signals, and nudge content—allowing us to directly assess how well semantic inputs perform when integrated into widely adopted architectures. The following models were selected as they represented a spectrum of RecSys architectures from collaborative filtering to sequential recommenders.

  • BPR (Bayesian Personalized Ranking): A classic pairwise ranking model for collaborative filtering, optimized for implicit feedback [3].
  • NeuMF (Neural Matrix Factorization): A neural network-based approach for collaborative filtering, combining matrix factorization and multi-layer perceptron layers [4].
  • SimpleX: A lightweight yet performant collaborative filtering model that incorporates historical user-item interaction sequences [5].
  • SASRec (Self-Attentive Sequential Recommendation): A state-of-the-art sequential recommendation model that leverages self-attention over item sequences to capture dynamic behavioral patterns [6].
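To make the adaptation concrete, here is a minimal sketch of a BPR-style pairwise ranker operating on frozen, precomputed semantic embeddings: instead of learning a per-ID embedding table, only a small projection matrix is trained. All names and data here are illustrative, not our production code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Precomputed semantic embeddings (stand-ins for LLM-derived vectors).
n_users, n_items, dim = 5, 8, 16
user_emb = rng.normal(size=(n_users, dim))
item_emb = rng.normal(size=(n_items, dim))

# A learnable projection replaces the per-ID embedding table.
W = rng.normal(scale=0.1, size=(dim, dim))

def score(u, i):
    """Dot product between projected user and item semantic vectors."""
    return user_emb[u] @ W @ item_emb[i]

def bpr_step(u, pos, neg, lr=0.01):
    """One BPR update: push score(u, pos) above score(u, neg)."""
    global W
    x = score(u, pos) - score(u, neg)
    sig = 1.0 / (1.0 + np.exp(x))  # gradient of -log(sigmoid(x)) w.r.t. x is -sigmoid(-x)
    grad = -sig * np.outer(user_emb[u], item_emb[pos] - item_emb[neg])
    W -= lr * grad

# Train on a few synthetic (user, positive item, negative item) triples.
triples = [(0, 1, 2), (0, 1, 3), (1, 4, 2)]
for _ in range(200):
    for u, p, n in triples:
        bpr_step(u, p, n)

assert score(0, 1) > score(0, 2)  # preferred item now ranks higher
```

Because the semantic vectors are fixed inputs, new users or nudges get useful representations immediately, which is the cold-start advantage the ID-free setup is after.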

Experiment Setup

To ensure a robust and fair comparison, we conducted our experiments on our proprietary digital health recommendation dataset. The dataset encompasses 10 days of user interactions with various health nudges, alongside rich metadata for both users (e.g., demographics, health conditions, aggregated tracker data) and nudges (e.g., content text, categories, target behaviors).

The following table summarizes the dataset statistics used for all models during training, validation, and testing.

Statistic        Overall   Train   Validation   Test
# Users          3,069     2,558   445          446
# Nudges         70        68      47           42
# Interactions   4,490     3,599   445          446

Table 1: Dataset Statistics for Overall, Train, Validation, and Test sets.

The ID-based KGAT model, unlike the other models, represents users, nudges, and related entities within a structured knowledge graph or health graph. As a result, its input includes a larger set of interconnected entities and relation types:

Statistic   KGAT Input
# Nodes     79,447
# Edges     564,429

Table 2: Knowledge Graph Structure for the KGAT Model.

Our evaluation followed a standard protocol:

  • Data Split: The dataset was split based on time into training, validation, and test sets using an 80/10/10 ratio. For sequential models, interactions were chronologically ordered, with the last interaction used for testing.
  • Hyperparameter Tuning: Optimal hyperparameters for each model were determined based on maximizing NDCG@3 on the validation set. 100 trials (hyperparameter combinations) were evaluated for each model, using Asynchronous Successive Halving Algorithm (ASHA) [7] to optimize the search.
  • Training and Evaluation: All models were trained until convergence, with early stopping implemented to prevent overfitting. Performance metrics were computed on the held-out test set.
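As a concrete illustration of the protocol above, a minimal sketch of the chronological 80/10/10 split (toy data; column names are illustrative, not our production schema):

```python
import pandas as pd

# Toy interaction log with one interaction per day.
interactions = pd.DataFrame({
    "user_id": [1, 2, 1, 3, 2, 1, 3, 2, 1, 3],
    "nudge_id": [10, 11, 12, 10, 13, 11, 12, 10, 13, 11],
    "timestamp": pd.date_range("2024-01-01", periods=10, freq="D"),
})

# Chronological 80/10/10 split: order by time, then slice.
interactions = interactions.sort_values("timestamp").reset_index(drop=True)
n = len(interactions)
train = interactions.iloc[: int(n * 0.8)]
valid = interactions.iloc[int(n * 0.8): int(n * 0.9)]
test = interactions.iloc[int(n * 0.9):]

# For sequential models, the last interaction per user can instead be held out.
leave_one_out_test = interactions.groupby("user_id").tail(1)
```

Splitting by time rather than at random prevents leakage of future interactions into training, which matters when user attributes change daily.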

Key Differences in Metadata and Input Representation

To fully appreciate the distinct behaviors and performances observed in our benchmark, it is important to understand the fundamental differences in how our ID-based and ID-Free Enhanced models consume and represent data—especially in how they capture semantics, structure, and temporal context.

The ID-based KGAT model builds a knowledge graph where users, nudges, and related entities (e.g., markers, segments) are represented as unique nodes. Relationships between these nodes (e.g., has_marker, in_segment) form the graph’s edges. This structure allows the model to learn from multi-hop connections and structured metadata, but it encodes user and nudge attributes as a fixed snapshot, based on their state at the beginning of the input window.

In contrast, ID-free models use semantic embeddings that are dynamically generated from rich metadata at the time of each interaction. For example, a user’s embedding is derived from their behavioral attributes at the time the nudge was sent, while a nudge’s embedding reflects its actual content at the time (in case of any edits). This enables the model to adapt to temporal changes and deliver personalized recommendations using up-to-date information.
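To illustrate the idea, a toy sketch of flattening metadata into text and embedding it. The `embed` function below is a deterministic hashing-trick stand-in for a real LLM embedding call, so the example runs offline; the metadata fields are invented for illustration.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in for an LLM embedding API: a deterministic hashing-trick
    bag-of-words vector, normalized to unit length."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# User metadata as of the moment the nudge was sent (illustrative fields).
user_text = "age_band:40-49 condition:prediabetes avg_daily_steps:4200 sleep:poor"
nudge_text = "Take a 10-minute walk after lunch to help manage blood sugar"

user_vec, nudge_vec = embed(user_text), embed(nudge_text)
similarity = float(user_vec @ nudge_vec)  # cosine similarity of unit vectors
```

Because the text is regenerated whenever attributes change, the resulting vectors track the user's current state rather than a fixed snapshot.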

The comparison below summarizes key differences in how users and nudges are represented across the two approaches:

Aspect: Interaction Data
  • KGAT: Considers only the distinct nudges a user has interacted with (duplicates removed).
  • ID-Free Enhanced Models: Includes all interactions, including repeated nudges, in the user's history. Sequential models also capture the order of interactions.

Aspect: Temporal Adaptability
  • KGAT: Static
    • Users and nudges are represented as a point-in-time snapshot based on their attributes at the start of the input window.
    • Limits the number of days of interaction data to avoid misalignment between interactions and associated user or nudge attributes (e.g. user health behaviors that change daily).
  • ID-Free Enhanced Models: Dynamic
    • Semantic embeddings of users and nudges are updated as their attributes change.
    • Interactions are mapped to user and nudge embeddings using point-in-time joins, to associate an interaction event with the temporally-correct representation of the user and nudge.
    • No restriction on the amount of historical data used, since representations can reflect current states at any time.

Aspect: User Representation
  • KGAT: Graph-based connections to attributes (markers) and segments.
  • ID-Free Enhanced Models: Semantic embeddings from the user's current attributes.

Aspect: Nudge Representation
  • KGAT: Graph-based connections to nudge attributes and target segments.
  • ID-Free Enhanced Models: Semantic embeddings from the actual nudge content.
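The point-in-time join mentioned above can be sketched with pandas' `merge_asof`, which matches each interaction to the most recent user state at or before the event, never a future one. The data and column names are illustrative:

```python
import pandas as pd

# Interaction events and a slowly changing user-attribute table (toy data).
events = pd.DataFrame({
    "user_id": [1, 1],
    "event_time": pd.to_datetime(["2024-01-03", "2024-01-08"]),
})
user_states = pd.DataFrame({
    "user_id": [1, 1],
    "state_time": pd.to_datetime(["2024-01-01", "2024-01-05"]),
    "avg_steps": [3000, 6000],
})

# Point-in-time join: each event picks up the latest state known at event time.
joined = pd.merge_asof(
    events.sort_values("event_time"),
    user_states.sort_values("state_time"),
    left_on="event_time",
    right_on="state_time",
    by="user_id",
)
# joined["avg_steps"] is [3000, 6000]: the Jan 3 event sees the Jan 1 state.
```

This is what guarantees each training example pairs an interaction with the temporally correct embedding, rather than leaking a later state backward in time.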

Evaluation Metrics

We assessed performance using standard top-K recommendation metrics, specifically at K=3. These metrics quantify the quality of the top-ranked recommendations:

  • NDCG@3 (Normalized Discounted Cumulative Gain): Measures the ranking quality, assigning higher scores to relevant items that appear earlier in the recommendation list.
  • Precision@3: The proportion of recommended items at K=3 that are relevant.
  • Recall@3: The proportion of all relevant items successfully retrieved within the top K=3 recommendations.
  • MAP (Mean Average Precision): Averages the precision at each rank where a relevant item appears, then averages over users, providing a comprehensive summary of overall ranking quality.
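For reference, these metrics can be computed for a single user as follows (a simplified sketch; the reported numbers average these values over all test users):

```python
import numpy as np

def metrics_at_k(ranked, relevant, k=3):
    """Top-K metrics for one user: `ranked` is the recommendation list,
    `relevant` is the set of ground-truth items."""
    topk = ranked[:k]
    hits = [int(item in relevant) for item in topk]

    precision = sum(hits) / k
    recall = sum(hits) / len(relevant)

    # NDCG@K: discounted gain of hits vs. the ideal ordering.
    dcg = sum(h / np.log2(i + 2) for i, h in enumerate(hits))
    idcg = sum(1 / np.log2(i + 2) for i in range(min(k, len(relevant))))
    ndcg = dcg / idcg

    # Average precision: precision at each rank where a relevant item appears.
    all_hits = [int(item in relevant) for item in ranked]
    precs = [sum(all_hits[: i + 1]) / (i + 1) for i, h in enumerate(all_hits) if h]
    ap = sum(precs) / len(relevant)
    return precision, recall, ndcg, ap

p, r, ndcg, ap = metrics_at_k(["a", "b", "c", "d"], {"a", "c"})
# hits at K=3 are [1, 0, 1], so Precision@3 = 2/3 and Recall@3 = 1.0
```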

The Results: Unpacking the Benchmark

Let’s dive into how the models performed. The chart below presents the recommendation performance for each model across the key metrics.

Model Performance: ID-Based vs. ID-Free Models

Key Observations

  • Value of Personalization: KGAT and all ID-Free models consistently performed better than the Random and Popular baselines across all metrics, reaffirming the importance of personalization and context-aware recommendations.
  • Baselines Provide Context: As expected, the Random baseline delivered the lowest performance across all metrics, establishing a clear lower bound. The Popular baseline significantly outperformed Random, demonstrating the effectiveness of popularity-based heuristics on this particular dataset.
  • KGAT’s Performance: Our ID-based KGAT model remained a strong performer compared to the non-personalized baselines, demonstrating the power of modeling deep, multi-hop relationships through attention over a knowledge graph. However, it was generally outperformed by the ID-free models, highlighting the added value of semantic inputs and dynamic context.
  • ID-Free Models Showcase Potential: The ID-Free Enhanced models (BPR, NeuMF, SimpleX, SASRec) generally performed on par with or better than KGAT, indicating the potential of semantic embeddings in capturing richer behavioral context.
    • Among the ID-Free models, BPR, SimpleX, and SASRec showed similar levels of performance, demonstrating their ability to effectively leverage semantic embeddings.
    • NeuMF clearly stood out in this experiment, delivering the strongest performance across all metrics. Its hybrid architecture—combining matrix factorization with neural layers—appears particularly well-suited to capturing the semantic richness of ID-Free embeddings.

Discussion: Insights from the Benchmark

These results suggest promising potential for ID-free embeddings to reshape how we approach recommendations—especially in dynamic, highly personalized domains like digital health, where user behaviors and preferences change on a daily basis. We repeated the experiments across multiple date ranges and consistently observed similar performance patterns, reinforcing the reliability of these insights.

Several key insights stand out:

  • The Power of Semantic Understanding: ID-free embeddings—derived from descriptive metadata like nudge content and user health profiles—enable models to capture richer, more meaningful relationships than purely ID-based representations. This semantic grounding supports better generalization and adaptability, which is critical in contexts where user behaviors and nudge content are constantly evolving.
  • Enhancing Existing Architectures: A key takeaway is that traditional, well-understood architectures like NeuMF and SASRec can achieve strong results when powered by high-quality ID-free embeddings. This opens up opportunities to modernize recommendation pipelines without requiring wholesale changes to core infrastructure or training paradigms.
  • Complementary Strengths Across Models: While the ID-Free Enhanced models performed better than KGAT, KGAT remained a strong contender—particularly in its ability to model complex, multi-hop relationships through structured knowledge graphs and health graphs. Its strength lies in leveraging curated domain knowledge and explicit relationships, which can be particularly valuable when semantic metadata is limited or noisy. Meanwhile, ID-free models benefit from greater adaptability to changing user contexts and can scale easily without requiring graph maintenance. These findings suggest that future improvements may come from hybrid approaches that combine the structured reasoning of graph-based models with the flexibility and semantic depth of ID-free embeddings.

What’s Next?

This benchmark marks a pivotal step in our journey toward more adaptive and scalable recommendation systems. The strong performance of ID-Free Enhanced models signals a promising direction for the future of personalized digital health nudging. Our next steps will focus on:

  • Evaluating the performance of the ID-free models across varying training data sizes to understand trade-offs between model performance, training time, and resource consumption.
  • Conducting ablation studies and detailed evaluations to explore factors such as the robustness of semantic embeddings, the impact of negative sampling strategies, optimal sequence length, pruning historical items, and the choice of embedding models and role of metadata.

These explorations will help us further assess the practical value of ID-free modeling and unlock its full potential—guiding our efforts to build an even more personalized, effective, and scalable nudge engine for digital health and personalized health.

References:

[1] X. Wang et al., “KGAT: Knowledge Graph Attention Network for Recommendation,” in KDD ’19, 2019.
[2] J. Chiam et al., “NudgeRank: Digital Algorithmic Nudging for Personalized Health,” in KDD ’24, 2024.
[3] S. Rendle et al., “BPR: Bayesian Personalized Ranking from Implicit Feedback,” in UAI ’09, 2009.
[4] X. He et al., “Neural Collaborative Filtering,” in WWW ’17, 2017.
[5] K. Mao et al., “SimpleX: A Simple and Strong Baseline for Collaborative Filtering,” in CIKM ’21, 2021.
[6] W.-C. Kang et al., “Self-Attentive Sequential Recommendation,” in ICDM ’18, 2018.
[7] L. Li et al., “A System for Massively Parallel Hyperparameter Tuning,” in MLSys 2020, 2020.
