Over the past few years, researchers have explored the use of reinforcement learning (RL) for sequential recommendation problems. However, since RL techniques commonly target optimizing long-term rewards, it is surprising that RL-based models are reported to be competitive with traditional supervised models when evaluated under the myopic next-item prediction protocol. A recent study suggests that the reported performance gains of combining RL with supervised learning, as done in the Self-Supervised Q-Learning (SQN) framework, may not actually come from learning an optimal policy, but from the RL component helping to learn embeddings that encode the users' past interactions. Given these observations, we reassess the performance of RL-enhanced sequential recommendation in the SQN framework. While we were able to reproduce the results reported in the respective papers, we found that properly tuned supervised models such as GRU4Rec substantially outperform the RL-based models proposed in the literature. Our analyses furthermore reveal significant inconsistencies in evaluation protocols across the literature, and show that relying on third-party implementations of existing models may lead to unreliable conclusions. Overall, more research and alternative evaluation schemes seem required to fully leverage the power of RL for sequential recommendation tasks.
Source code available at: https://github.com/dilina-r/rl-rec
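For context, the SQN framework discussed above augments a supervised sequential model with a Q-learning head on top of a shared session encoder. The sketch below illustrates the combined objective in plain NumPy; all function and variable names are ours for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sqn_loss(h_t, h_next, target_item, reward, W_sup, W_q, gamma=0.5):
    """Illustrative SQN objective: supervised cross-entropy plus one-step TD error.

    h_t, h_next : hidden states for the current and next step (shared encoder)
    target_item : index of the ground-truth next item (also the RL "action")
    reward      : per-event reward (e.g. higher for purchases than for clicks)
    W_sup, W_q  : output weights of the supervised and Q-value heads
    """
    # Supervised head: next-item cross-entropy over the catalog.
    probs = softmax(h_t @ W_sup)
    ce = -np.log(probs[target_item] + 1e-12)

    # Q-learning head: one-step TD error towards r + gamma * max_a Q(h_{t+1}, a).
    q = h_t @ W_q
    q_next = h_next @ W_q
    td_target = reward + gamma * q_next.max()
    td = (q[target_item] - td_target) ** 2

    # SQN trains both heads jointly, so gradients of both terms
    # flow back into the shared encoder that produced h_t.
    return ce + td
```

The key point is that both loss terms share the same encoder, which is why the RL head can shape the learned embeddings even when evaluation is purely next-item prediction.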
GRU Protocol Hyperparameters
SKNN Parameters
| Parameter | Yoochoose | RetailRocket | Diginetica | Values |
| --- | --- | --- | --- | --- |
k | 500 | 100 | 100 | {50, 100, 500, 1000, 1500} |
sample_size | 2500 | 500 | 500 | {500, 1000, 2500, 5000, 10000} |
similarity | cosine | cosine | cosine | {'jaccard', 'cosine', 'binary', 'tanimoto'} |
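The SKNN parameters above interact roughly as follows: `sample_size` bounds how many past sessions are considered, `similarity` scores them against the current session, and the `k` nearest neighbours vote on candidate items. A minimal sketch with binary cosine similarity (our illustration, not the benchmark implementation):

```python
import math
from collections import defaultdict

def sknn_scores(current, past_sessions, k=100, sample_size=500):
    """Score candidate items for the current session (a list of item ids)."""
    cur = set(current)
    # sample_size bounds how many (most recent) sessions are considered.
    candidates = past_sessions[-sample_size:]
    # Cosine similarity between binary item-set vectors.
    sims = []
    for sess in candidates:
        s = set(sess)
        inter = len(cur & s)
        if inter:
            sims.append((inter / math.sqrt(len(cur) * len(s)), s))
    # Keep only the k most similar neighbour sessions.
    sims.sort(key=lambda x: x[0], reverse=True)
    scores = defaultdict(float)
    for sim, s in sims[:k]:
        for item in s - cur:        # recommend items not yet in the session
            scores[item] += sim     # vote weighted by neighbour similarity
    return dict(scores)
```

VSKNN (next table) refines this scheme with position-decay weighting of session items and optional IDF weighting, which is what the `weighting`, `weighting_score`, and `idf_weighting` parameters control.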
VSKNN Parameters
| Parameter | Yoochoose | RetailRocket | Diginetica | Values |
| --- | --- | --- | --- | --- |
k | 500 | 500 | 100 | {50, 100, 500, 1000, 1500} |
sample_size | 5000 | 500 | 2500 | {500, 1000, 2500, 5000, 10000} |
weighting | log | log | quadratic | {'linear', 'same', 'div', 'log', 'quadratic'} |
weighting_score | quadratic | linear | quadratic | {'linear', 'same', 'div', 'log', 'quadratic'} |
idf_weighting | 1 | 5 | 2 | {0, 1, 2, 5, 10} |
GRU4Rec Parameters
Fixed Parameters: n_epochs=20, n_sample=2048, logq=1.0
| Parameter | Yoochoose | RetailRocket | Diginetica | Values |
| --- | --- | --- | --- | --- |
layers | 128 | 96 | 128 | min=64, max=512, step=32 |
batch_size | 208 | 208 | 224 | min=32, max=256, step=16 |
learning_rate | 0.13 | 0.14 | 0.25 | min=0.01, max=0.25, step=0.005 |
dropout_p_embed | 0.5 | 0.45 | 0.35 | min=0, max=0.5, step=0.05 |
dropout_p_hidden | 0 | 0 | 0.4 | min=0, max=0.7, step=0.05 |
momentum | 0.5 | 0.5 | 0 | min=0, max=0.9, step=0.05 |
sample_alpha | 0.2 | 0.2 | 0.3 | min=0, max=1.0, step=0.1 |
bpreg | 1.05 | 0.9 | 0.6 | min=0, max=2.0, step=0.05 |
elu_param | 0 | 0 | 0.5 | {0.5, 1, 0} |
constrained_embedding | True | True | False | {True, False} |
loss | 'cross-entropy' | 'bpr-max' | 'cross-entropy' | {'bpr-max', 'cross-entropy'} |
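As an illustration, the tuned Yoochoose configuration from the table above can be collected into a single parameter dictionary. The key names mirror the table; whether they map one-to-one onto the keyword arguments of a particular GRU4Rec implementation is an assumption, not something this table guarantees.

```python
# Tuned GRU4Rec hyperparameters for Yoochoose, transcribed from the table above.
# Key names follow the table, not any specific API.
gru4rec_yoochoose = {
    "layers": 128,
    "batch_size": 208,
    "learning_rate": 0.13,
    "dropout_p_embed": 0.5,
    "dropout_p_hidden": 0.0,
    "momentum": 0.5,
    "sample_alpha": 0.2,
    "bpreg": 1.05,
    "elu_param": 0,
    "constrained_embedding": True,
    "loss": "cross-entropy",
    # Fixed across all three datasets:
    "n_epochs": 20,
    "n_sample": 2048,
    "logq": 1.0,
}
```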
NARM Parameters
Fixed Parameters: epochs=50
| Parameter | Yoochoose | RetailRocket | Diginetica | Values |
| --- | --- | --- | --- | --- |
batch_size | 144 | 144 | 128 | min=64, max=512, step=16 |
emb_dim | 160 | 384 | 96 | min=64, max=512, step=32 |
lr | 0.0006 | 0.0001 | 0.0006 | min=0.0001, max=0.1, step=0.00005 |
GRU(RL), SQN-GRU and EVAL-GRU Parameters
Fixed Parameters: epochs=40
| Parameter | Yoochoose | RetailRocket | Diginetica | Values |
| --- | --- | --- | --- | --- |
batch_size | 512 | 512 | 160 | min=64, max=512, step=64 |
discount | 0 | 0.2 | 0.2 | min=0, max=1.0, step=0.1 |
hidden_factor | 64 | 512 | 512 | min=64, max=512, step=32 |
lr | 0.003 | 0.0005 | 0.001 | min=0.001, max=0.1, step=0.0005 |
SAS(RL), SQN-SAS and EVAL-SAS Parameters
Fixed Parameters: epochs=40, num_blocks=1, num_heads=1
| Parameter | Yoochoose | RetailRocket | Diginetica | Values |
| --- | --- | --- | --- | --- |
batch_size | 512 | 128 | 244 | min=64, max=512, step=64 |
discount | 0.4 | 0.7 | 0.6 | min=0, max=1.0, step=0.1 |
dropout | 0.5 | 0.5 | 0.3 | min=0, max=0.9, step=0.1 |
hidden_factor | 64 | 384 | 256 | min=64, max=512, step=32 |
lr | 0.001 | 0.0015 | 0.0055 | min=0.001, max=0.1, step=0.0005 |
SQN Protocol Hyperparameters

SKNN Parameters
| Parameter | Yoochoose | RetailRocket | Diginetica | Values |
| --- | --- | --- | --- | --- |
k | 100 | 50 | 50 | {50, 100, 500, 1000, 1500} |
sample_size | 1000 | 1000 | 500 | {500, 1000, 2500, 5000, 10000} |
similarity | jaccard | cosine | cosine | {'jaccard', 'cosine', 'binary', 'tanimoto'} |
VSKNN Parameters
| Parameter | Yoochoose | RetailRocket | Diginetica | Values |
| --- | --- | --- | --- | --- |
k | 1500 | 1000 | 500 | {50, 100, 500, 1000, 1500} |
sample_size | 2500 | 5000 | 2500 | {500, 1000, 2500, 5000, 10000} |
weighting | same | log | linear | {'linear', 'same', 'div', 'log', 'quadratic'} |
weighting_score | quadratic | div | same | {'linear', 'same', 'div', 'log', 'quadratic'} |
idf_weighting | 5 | 5 | 0 | {0, 1, 2, 5, 10} |
GRU4Rec Parameters
Fixed Parameters: n_epochs=20, n_sample=2048, logq=1.0
| Parameter | Yoochoose | RetailRocket | Diginetica | Values |
| --- | --- | --- | --- | --- |
layers | 96 | 384 | 96 | min=64, max=512, step=32 |
batch_size | 144 | 32 | 144 | min=32, max=256, step=16 |
learning_rate | 0.05 | 0.035 | 0.14 | min=0.01, max=0.25, step=0.005 |
dropout_p_embed | 0.2 | 0.4 | 0.45 | min=0, max=0.5, step=0.05 |
dropout_p_hidden | 0.25 | 0.2 | 0.25 | min=0, max=0.7, step=0.05 |
momentum | 0.55 | 0.1 | 0.5 | min=0, max=0.9, step=0.05 |
sample_alpha | 0.6 | 0.4 | 0.1 | min=0, max=1.0, step=0.1 |
bpreg | 0.45 | 0.05 | 1.75 | min=0, max=2.0, step=0.05 |
elu_param | 0 | 0 | 0.5 | {0.5, 1, 0} |
constrained_embedding | True | True | True | {True, False} |
loss | 'cross-entropy' | 'cross-entropy' | 'cross-entropy' | {'bpr-max', 'cross-entropy'} |
NARM Parameters
Fixed Parameters: epochs=50
| Parameter | Yoochoose | RetailRocket | Diginetica | Values |
| --- | --- | --- | --- | --- |
batch_size | 96 | 144 | 176 | min=64, max=512, step=16 |
emb_dim | 224 | 256 | 512 | min=64, max=512, step=32 |
lr | 0.0006 | 0.0001 | 0.0001 | min=0.0001, max=0.1, step=0.00005 |
GRU(RL), SQN-GRU and EVAL-GRU Parameters
Fixed Parameters: epochs=40
| Parameter | Yoochoose | RetailRocket | Diginetica | Values |
| --- | --- | --- | --- | --- |
batch_size | 64 | 256 | 512 | min=64, max=512, step=64 |
discount | 0.5 | 0.4 | 0 | min=0, max=1.0, step=0.1 |
hidden_factor | 64 | 384 | 128 | min=64, max=512, step=32 |
lr | 0.001 | 0.0015 | 0.005 | min=0.001, max=0.1, step=0.0005 |
SAS(RL), SQN-SAS and EVAL-SAS Parameters
Fixed Parameters: epochs=40, num_blocks=1, num_heads=1
| Parameter | Yoochoose | RetailRocket | Diginetica | Values |
| --- | --- | --- | --- | --- |
batch_size | 512 | 512 | 384 | min=64, max=512, step=64 |
discount | 0.2 | 0.5 | 0.2 | min=0, max=1.0, step=0.1 |
dropout | 0.7 | 0.6 | 0.6 | min=0, max=0.9, step=0.1 |
hidden_factor | 384 | 256 | 320 | min=64, max=512, step=32 |
lr | 0.001 | 0.0001 | 0.0001 | min=0.001, max=0.1, step=0.0005 |
Reproducing Results from the ICML'24 Paper by Labarca et al.
The tables below compare our reproduced metrics with the results stated in the ICML'24 paper "On the Unexpected Effectiveness of Reinforcement Learning for Sequential Recommendation". We use the original code, hyperparameters, and data provided by the authors.
Reproduced results for click events on the Yoochoose (RC15) and RetailRocket datasets.

| Model | Reproduced Yoochoose H@20 | Reproduced Yoochoose N@20 | Reproduced RetailRocket H@20 | Reproduced RetailRocket N@20 | Paper Yoochoose H@20 | Paper Yoochoose N@20 | Paper RetailRocket H@20 | Paper RetailRocket N@20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SAS | 0.494 | 0.276 | 0.233 | 0.145 | 0.496 | 0.275 | 0.231 | 0.145 |
| SQN-SAS | 0.498 | 0.273 | 0.260 | 0.161 | 0.505 | 0.280 | 0.260 | 0.162 |
| EVAL-SAS | 0.508 | 0.282 | 0.257 | 0.159 | 0.502 | 0.278 | 0.258 | 0.161 |
| GRU | 0.439 | 0.239 | 0.175 | 0.108 | 0.427 | 0.231 | 0.175 | 0.109 |
| SQN-GRU | 0.458 | 0.252 | 0.200 | 0.128 | 0.457 | 0.251 | 0.200 | 0.127 |
| EVAL-GRU | 0.456 | 0.251 | 0.201 | 0.128 | 0.457 | 0.251 | 0.201 | 0.128 |
Reproduced results for purchase/buy events on the Yoochoose (RC15) and RetailRocket datasets.

| Model | Reproduced Yoochoose H@20 | Reproduced Yoochoose N@20 | Reproduced RetailRocket H@20 | Reproduced RetailRocket N@20 | Paper Yoochoose H@20 | Paper Yoochoose N@20 | Paper RetailRocket H@20 | Paper RetailRocket N@20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SAS | 0.561 | 0.317 | 0.355 | 0.222 | 0.560 | 0.315 | 0.351 | 0.218 |
| SQN-SAS | 0.612 | 0.341 | 0.430 | 0.271 | 0.581 | 0.327 | 0.423 | 0.263 |
| EVAL-SAS | 0.567 | 0.317 | 0.412 | 0.256 | 0.622 | 0.348 | 0.420 | 0.260 |
| GRU | 0.546 | 0.299 | 0.235 | 0.145 | 0.561 | 0.310 | 0.231 | 0.146 |
| SQN-GRU | 0.576 | 0.324 | 0.271 | 0.175 | 0.576 | 0.322 | 0.270 | 0.174 |
| EVAL-GRU | 0.576 | 0.320 | 0.269 | 0.172 | 0.578 | 0.323 | 0.272 | 0.176 |
Results when trained without validation sessions
The original implementations of NARM and the SQN framework exclude the 'validation' sessions from the final training of the model. NARM uses the validation data to select the best-performing model across epochs, which is then used in the final performance evaluation. In the SQN framework, the models are routinely evaluated on the validation data during training, but it makes no contribution to the final model. We trained the NARM and RL-enhanced models in the same manner as their respective original implementations, using our SQN-protocol data splits and tuned hyperparameters.
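The NARM-style selection described above, i.e. train for one epoch, score on the held-out validation sessions, and keep the best checkpoint for the final test, can be sketched as follows. The `train_fn` and `eval_fn` callables are placeholders, not the original training code.

```python
def train_with_validation(model, train_fn, eval_fn, n_epochs=50):
    """Keep the checkpoint that scores best on the validation sessions.

    train_fn(model, epoch) -> model state after one pass over the training sessions
    eval_fn(state)         -> validation score (e.g. MRR@20 on validation sessions)
    """
    best_score, best_state = float("-inf"), None
    for epoch in range(n_epochs):
        state = train_fn(model, epoch)   # one pass over the training sessions
        score = eval_fn(state)           # validation picks the checkpoint...
        if score > best_score:
            best_score, best_state = score, state
    return best_state                    # ...used for the final test evaluation
```

By contrast, in the SQN-style setup the equivalent of `eval_fn` is only logged for monitoring, and the last epoch's model is used regardless of validation score.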
Below we report the results for the Yoochoose and RetailRocket datasets.
Yoochoose

| Model | Clicks H@10 | Clicks M@10 | Clicks H@20 | Clicks M@20 | Purchases H@10 | Purchases M@10 | Purchases H@20 | Purchases M@20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NARM | 0.263 | 0.143 | 0.308 | 0.147 | 0.235 | 0.155 | 0.261 | 0.157 |
| SAS(RL) | 0.254 | 0.124 | 0.304 | 0.128 | 0.171 | 0.101 | 0.193 | 0.118 |
| SAS-SQN | 0.258 | 0.127 | 0.310 | 0.130 | 0.175 | 0.106 | 0.198 | 0.108 |
| SAS-EVAL | 0.259 | 0.127 | 0.309 | 0.130 | 0.176 | 0.102 | 0.202 | 0.104 |
| GRU(RL) | 0.226 | 0.113 | 0.268 | 0.116 | 0.190 | 0.107 | 0.216 | 0.109 |
| GRU-SQN | 0.222 | 0.112 | 0.260 | 0.115 | 0.180 | 0.100 | 0.205 | 0.101 |
| GRU-EVAL | 0.217 | 0.109 | 0.257 | 0.112 | 0.180 | 0.099 | 0.203 | 0.101 |
RetailRocket

| Model | Clicks H@10 | Clicks M@10 | Clicks H@20 | Clicks M@20 | Purchases H@10 | Purchases M@10 | Purchases H@20 | Purchases M@20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NARM | 0.322 | 0.274 | 0.328 | 0.275 | 0.705 | 0.597 | 0.718 | 0.598 |
| SAS(RL) | 0.317 | 0.208 | 0.357 | 0.210 | 0.639 | 0.528 | 0.661 | 0.530 |
| SAS-SQN | 0.319 | 0.209 | 0.361 | 0.212 | 0.644 | 0.533 | 0.667 | 0.534 |
| SAS-EVAL | 0.318 | 0.209 | 0.360 | 0.212 | 0.645 | 0.526 | 0.666 | 0.527 |
| GRU(RL) | 0.207 | 0.117 | 0.244 | 0.120 | 0.358 | 0.219 | 0.409 | 0.222 |
| GRU-SQN | 0.224 | 0.131 | 0.263 | 0.133 | 0.390 | 0.250 | 0.433 | 0.253 |
| GRU-EVAL | 0.235 | 0.137 | 0.273 | 0.139 | 0.428 | 0.276 | 0.477 | 0.279 |