Over the past few years, researchers have explored the use of reinforcement learning (RL) for sequential recommendation problems. However, since RL techniques commonly target optimizing long-term rewards, it is surprising that RL-based models are reported to be competitive with traditional supervised models when evaluated under the myopic next-item prediction protocol. A recent study suggests that the reported performance gains of combining RL with supervised learning, as done in the Self-Supervised Q-Learning (SQN) framework, may not actually come from learning an optimal policy, but from the RL component helping to learn embeddings that encode the users' past interactions. Given these observations, we reassess the performance of RL-enhanced sequential recommendation in the SQN framework. While we were able to reproduce the results reported in the respective papers, we found that properly tuned supervised models such as GRU4Rec substantially outperform the RL-based models proposed in the literature. Our analyses furthermore reveal significant inconsistencies in evaluation protocols across the literature, and show that relying on third-party implementations of existing models may lead to unreliable conclusions. Overall, more research and alternative evaluation schemes seem required to fully leverage the power of RL for sequential recommendation tasks.
Source code available at: https://github.com/dilina-r/rl-rec
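For context, the SQN framework discussed above augments a supervised sequential model with a Q-learning head on top of a shared session encoder. The sketch below illustrates the combined objective in plain NumPy; all function and variable names are ours for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sqn_loss(h_t, h_next, target_item, reward, W_sup, W_q, gamma=0.5):
    """Illustrative SQN objective: supervised cross-entropy plus one-step TD error.

    h_t, h_next : hidden states for the current and next step (shared encoder)
    target_item : index of the ground-truth next item (also the RL "action")
    reward      : per-event reward (e.g. higher for purchases than for clicks)
    W_sup, W_q  : output weights of the supervised and Q-value heads
    """
    # Supervised head: next-item cross-entropy over the catalog.
    probs = softmax(h_t @ W_sup)
    ce = -np.log(probs[target_item] + 1e-12)

    # Q-learning head: one-step TD error towards r + gamma * max_a Q(h_{t+1}, a).
    q = h_t @ W_q
    q_next = h_next @ W_q
    td_target = reward + gamma * q_next.max()
    td = (q[target_item] - td_target) ** 2

    # SQN trains both heads jointly, so gradients of both terms
    # flow back into the shared encoder that produced h_t.
    return ce + td
```

The key point is that both loss terms share the same encoder, which is why the RL head can shape the learned embeddings even when evaluation is purely next-item prediction.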
GRU Protocol Hyperparameters
SKNN Parameters
| Parameter | Yoochoose | RetailRocket | Diginetica | Values |
| --- | --- | --- | --- | --- |
k | 500 | 100 | 100 | {50, 100, 500, 1000, 1500} |
sample_size | 2500 | 500 | 500 | {500, 1000, 2500, 5000, 10000} |
similarity | cosine | cosine | cosine | {'jaccard', 'cosine', 'binary', 'tanimoto'} |
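The SKNN parameters above interact roughly as follows: `sample_size` bounds how many past sessions are considered, `similarity` scores them against the current session, and the `k` nearest neighbours vote on candidate items. A minimal sketch with binary cosine similarity (our illustration, not the benchmark implementation):

```python
import math
from collections import defaultdict

def sknn_scores(current, past_sessions, k=100, sample_size=500):
    """Score candidate items for the current session (a list of item ids)."""
    cur = set(current)
    # sample_size bounds how many (most recent) sessions are considered.
    candidates = past_sessions[-sample_size:]
    # Cosine similarity between binary item-set vectors.
    sims = []
    for sess in candidates:
        s = set(sess)
        inter = len(cur & s)
        if inter:
            sims.append((inter / math.sqrt(len(cur) * len(s)), s))
    # Keep only the k most similar neighbour sessions.
    sims.sort(key=lambda x: x[0], reverse=True)
    scores = defaultdict(float)
    for sim, s in sims[:k]:
        for item in s - cur:        # recommend items not yet in the session
            scores[item] += sim     # vote weighted by neighbour similarity
    return dict(scores)
```

VSKNN (next table) refines this scheme with position-decay weighting of session items and optional IDF weighting, which is what the `weighting`, `weighting_score`, and `idf_weighting` parameters control.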
VSKNN Parameters
| Parameter | Yoochoose | RetailRocket | Diginetica | Values |
| --- | --- | --- | --- | --- |
k | 500 | 500 | 100 | {50, 100, 500, 1000, 1500} |
sample_size | 5000 | 500 | 2500 | {500, 1000, 2500, 5000, 10000} |
weighting | log | log | quadratic | {'linear', 'same', 'div', 'log', 'quadratic'} |
weighting_score | quadratic | linear | quadratic | {'linear', 'same', 'div', 'log', 'quadratic'} |
idf_weighting | 1 | 5 | 2 | {0, 1, 2, 5, 10} |
GRU4Rec Parameters
Fixed Parameters: n_epochs=20, n_sample=2048, logq=1.0
| Parameter | Yoochoose | RetailRocket | Diginetica | Values |
| --- | --- | --- | --- | --- |
layers | 128 | 96 | 128 | min=64, max=512, step=32 |
batch_size | 208 | 208 | 224 | min=32, max=256, step=16 |
learning_rate | 0.13 | 0.14 | 0.25 | min=0.01, max=0.25, step=0.005 |
dropout_p_embed | 0.5 | 0.45 | 0.35 | min=0, max=0.5, step=0.05 |
dropout_p_hidden | 0 | 0 | 0.4 | min=0, max=0.7, step=0.05 |
momentum | 0.5 | 0.5 | 0 | min=0, max=0.9, step=0.05 |
sample_alpha | 0.2 | 0.2 | 0.3 | min=0, max=1.0, step=0.1 |
bpreg | 1.05 | 0.9 | 0.6 | min=0, max=2.0, step=0.05 |
elu_param | 0 | 0 | 0.5 | {0.5, 1, 0} |
constrained_embedding | True | True | False | {True, False} |
loss | 'cross-entropy' | 'bpr-max' | 'cross-entropy' | {'bpr-max', 'cross-entropy'} |
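As an illustration, the tuned Yoochoose configuration from the table above can be collected into a single parameter dictionary. The key names mirror the table; whether they map one-to-one onto the keyword arguments of a particular GRU4Rec implementation is an assumption, not something this table guarantees.

```python
# Tuned GRU4Rec hyperparameters for Yoochoose, transcribed from the table above.
# Key names follow the table, not any specific API.
gru4rec_yoochoose = {
    "layers": 128,
    "batch_size": 208,
    "learning_rate": 0.13,
    "dropout_p_embed": 0.5,
    "dropout_p_hidden": 0.0,
    "momentum": 0.5,
    "sample_alpha": 0.2,
    "bpreg": 1.05,
    "elu_param": 0,
    "constrained_embedding": True,
    "loss": "cross-entropy",
    # Fixed across all three datasets:
    "n_epochs": 20,
    "n_sample": 2048,
    "logq": 1.0,
}
```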
NARM Parameters
Fixed Parameters: epochs=50
| Parameter | Yoochoose | RetailRocket | Diginetica | Values |
| --- | --- | --- | --- | --- |
batch_size | 144 | 144 | 128 | min=64, max=512, step=16 |
emb_dim | 160 | 384 | 96 | min=64, max=512, step=32 |
lr | 0.0006 | 0.0001 | 0.0006 | min=0.0001, max=0.1, step=0.00005 |
GRU(RL), SQN-GRU and EVAL-GRU Parameters
Fixed Parameters: epochs=40
| Parameter | Yoochoose | RetailRocket | Diginetica | Values |
| --- | --- | --- | --- | --- |
batch_size | 512 | 512 | 160 | min=64, max=512, step=64 |
discount | 0 | 0.2 | 0.2 | min=0, max=1.0, step=0.1 |
hidden_factor | 64 | 512 | 512 | min=64, max=512, step=32 |
lr | 0.003 | 0.0005 | 0.001 | min=0.001, max=0.1, step=0.0005 |
SAS(RL), SQN-SAS and EVAL-SAS Parameters
Fixed Parameters: epochs=40, num_blocks=1, num_heads=1
| Parameter | Yoochoose | RetailRocket | Diginetica | Values |
| --- | --- | --- | --- | --- |
batch_size | 512 | 128 | 244 | min=64, max=512, step=64 |
discount | 0.4 | 0.7 | 0.6 | min=0, max=1.0, step=0.1 |
dropout | 0.5 | 0.5 | 0.3 | min=0, max=0.9, step=0.1 |
hidden_factor | 64 | 384 | 256 | min=64, max=512, step=32 |
lr | 0.001 | 0.0015 | 0.0055 | min=0.001, max=0.1, step=0.0005 |
SQN Protocol Hyperparameters

SKNN Parameters
| Parameter | Yoochoose | RetailRocket | Diginetica | Values |
| --- | --- | --- | --- | --- |
k | 100 | 50 | 50 | {50, 100, 500, 1000, 1500} |
sample_size | 1000 | 1000 | 500 | {500, 1000, 2500, 5000, 10000} |
similarity | jaccard | cosine | cosine | {'jaccard', 'cosine', 'binary', 'tanimoto'} |
VSKNN Parameters
| Parameter | Yoochoose | RetailRocket | Diginetica | Values |
| --- | --- | --- | --- | --- |
k | 1500 | 1000 | 500 | {50, 100, 500, 1000, 1500} |
sample_size | 2500 | 5000 | 2500 | {500, 1000, 2500, 5000, 10000} |
weighting | same | log | linear | {'linear', 'same', 'div', 'log', 'quadratic'} |
weighting_score | quadratic | div | same | {'linear', 'same', 'div', 'log', 'quadratic'} |
idf_weighting | 5 | 5 | 0 | {0, 1, 2, 5, 10} |
GRU4Rec Parameters
Fixed Parameters: n_epochs=20, n_sample=2048, logq=1.0
| Parameter | Yoochoose | RetailRocket | Diginetica | Values |
| --- | --- | --- | --- | --- |
layers | 96 | 384 | 96 | min=64, max=512, step=32 |
batch_size | 144 | 32 | 144 | min=32, max=256, step=16 |
learning_rate | 0.05 | 0.035 | 0.14 | min=0.01, max=0.25, step=0.005 |
dropout_p_embed | 0.2 | 0.4 | 0.45 | min=0, max=0.5, step=0.05 |
dropout_p_hidden | 0.25 | 0.2 | 0.25 | min=0, max=0.7, step=0.05 |
momentum | 0.55 | 0.1 | 0.5 | min=0, max=0.9, step=0.05 |
sample_alpha | 0.6 | 0.4 | 0.1 | min=0, max=1.0, step=0.1 |
bpreg | 0.45 | 0.05 | 1.75 | min=0, max=2.0, step=0.05 |
elu_param | 0 | 0 | 0.5 | {0.5, 1, 0} |
constrained_embedding | True | True | True | {True, False} |
loss | 'cross-entropy' | 'cross-entropy' | 'cross-entropy' | {'bpr-max', 'cross-entropy'} |
NARM Parameters
Fixed Parameters: epochs=50
| Parameter | Yoochoose | RetailRocket | Diginetica | Values |
| --- | --- | --- | --- | --- |
batch_size | 96 | 144 | 176 | min=64, max=512, step=16 |
emb_dim | 224 | 256 | 512 | min=64, max=512, step=32 |
lr | 0.0006 | 0.0001 | 0.0001 | min=0.0001, max=0.1, step=0.00005 |
GRU(RL), SQN-GRU and EVAL-GRU Parameters
Fixed Parameters: epochs=40
| Parameter | Yoochoose | RetailRocket | Diginetica | Values |
| --- | --- | --- | --- | --- |
batch_size | 64 | 256 | 512 | min=64, max=512, step=64 |
discount | 0.5 | 0.4 | 0 | min=0, max=1.0, step=0.1 |
hidden_factor | 64 | 384 | 128 | min=64, max=512, step=32 |
lr | 0.001 | 0.0015 | 0.005 | min=0.001, max=0.1, step=0.0005 |
SAS(RL), SQN-SAS and EVAL-SAS Parameters
Fixed Parameters: epochs=40, num_blocks=1, num_heads=1
| Parameter | Yoochoose | RetailRocket | Diginetica | Values |
| --- | --- | --- | --- | --- |
batch_size | 512 | 512 | 384 | min=64, max=512, step=64 |
discount | 0.2 | 0.5 | 0.2 | min=0, max=1.0, step=0.1 |
dropout | 0.7 | 0.6 | 0.6 | min=0, max=0.9, step=0.1 |
hidden_factor | 384 | 256 | 320 | min=64, max=512, step=32 |
lr | 0.001 | 0.0001 | 0.0001 | min=0.001, max=0.1, step=0.0005 |
Reproducing Results from the ICML'24 Paper by Labarca et al.
The tables below compare our reproduced metrics with the results stated in the ICML'24 paper "On the Unexpected Effectiveness of Reinforcement Learning for Sequential Recommendation". We use the original code, hyperparameters, and data provided by the authors.
Reproduced results for click events on the Yoochoose (RC15) and RetailRocket datasets.

| Model | Reproduced Yoochoose H@20 | Reproduced Yoochoose N@20 | Reproduced RetailRocket H@20 | Reproduced RetailRocket N@20 | Paper Yoochoose H@20 | Paper Yoochoose N@20 | Paper RetailRocket H@20 | Paper RetailRocket N@20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SAS | 0.494 | 0.276 | 0.233 | 0.145 | 0.496 | 0.275 | 0.231 | 0.145 |
| SQN-SAS | 0.498 | 0.273 | 0.260 | 0.161 | 0.505 | 0.280 | 0.260 | 0.162 |
| EVAL-SAS | 0.508 | 0.282 | 0.257 | 0.159 | 0.502 | 0.278 | 0.258 | 0.161 |
| GRU | 0.439 | 0.239 | 0.175 | 0.108 | 0.427 | 0.231 | 0.175 | 0.109 |
| SQN-GRU | 0.458 | 0.252 | 0.200 | 0.128 | 0.457 | 0.251 | 0.200 | 0.127 |
| EVAL-GRU | 0.456 | 0.251 | 0.201 | 0.128 | 0.457 | 0.251 | 0.201 | 0.128 |
Reproduced results for purchase/buy events on the Yoochoose (RC15) and RetailRocket datasets.

| Model | Reproduced Yoochoose H@20 | Reproduced Yoochoose N@20 | Reproduced RetailRocket H@20 | Reproduced RetailRocket N@20 | Paper Yoochoose H@20 | Paper Yoochoose N@20 | Paper RetailRocket H@20 | Paper RetailRocket N@20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SAS | 0.561 | 0.317 | 0.355 | 0.222 | 0.560 | 0.315 | 0.351 | 0.218 |
| SQN-SAS | 0.612 | 0.341 | 0.430 | 0.271 | 0.581 | 0.327 | 0.423 | 0.263 |
| EVAL-SAS | 0.567 | 0.317 | 0.412 | 0.256 | 0.622 | 0.348 | 0.420 | 0.260 |
| GRU | 0.546 | 0.299 | 0.235 | 0.145 | 0.561 | 0.310 | 0.231 | 0.146 |
| SQN-GRU | 0.576 | 0.324 | 0.271 | 0.175 | 0.576 | 0.322 | 0.270 | 0.174 |
| EVAL-GRU | 0.576 | 0.320 | 0.269 | 0.172 | 0.578 | 0.323 | 0.272 | 0.176 |
Results when trained without validation sessions
The original implementations of NARM and the SQN framework exclude the 'validation' sessions from the final training of the model. NARM uses the validation data to select the best-performing model across epochs, which is then used in the final performance evaluation. In the SQN framework, the models are routinely evaluated on the validation data during training, but it makes no contribution to the final model. We trained the NARM and RL-enhanced models in the same manner as their respective original implementations, using our SQN-protocol data splits and tuned hyperparameters.
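The NARM-style selection described above, i.e. train for one epoch, score on the held-out validation sessions, and keep the best checkpoint for the final test, can be sketched as follows. The `train_fn` and `eval_fn` callables are placeholders, not the original training code.

```python
def train_with_validation(model, train_fn, eval_fn, n_epochs=50):
    """Keep the checkpoint that scores best on the validation sessions.

    train_fn(model, epoch) -> model state after one pass over the training sessions
    eval_fn(state)         -> validation score (e.g. MRR@20 on validation sessions)
    """
    best_score, best_state = float("-inf"), None
    for epoch in range(n_epochs):
        state = train_fn(model, epoch)   # one pass over the training sessions
        score = eval_fn(state)           # validation picks the checkpoint...
        if score > best_score:
            best_score, best_state = score, state
    return best_state                    # ...used for the final test evaluation
```

By contrast, in the SQN-style setup the equivalent of `eval_fn` is only logged for monitoring, and the last epoch's model is used regardless of validation score.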
Below we report the results for the Yoochoose and RetailRocket datasets.
Yoochoose

| Model | Clicks H@10 | Clicks M@10 | Clicks H@20 | Clicks M@20 | Purchases H@10 | Purchases M@10 | Purchases H@20 | Purchases M@20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NARM | 0.263 | 0.143 | 0.308 | 0.147 | 0.235 | 0.155 | 0.261 | 0.157 |
| SAS(RL) | 0.254 | 0.124 | 0.304 | 0.128 | 0.171 | 0.101 | 0.193 | 0.118 |
| SAS-SQN | 0.258 | 0.127 | 0.310 | 0.130 | 0.175 | 0.106 | 0.198 | 0.108 |
| SAS-EVAL | 0.259 | 0.127 | 0.309 | 0.130 | 0.176 | 0.102 | 0.202 | 0.104 |
| GRU(RL) | 0.226 | 0.113 | 0.268 | 0.116 | 0.190 | 0.107 | 0.216 | 0.109 |
| GRU-SQN | 0.222 | 0.112 | 0.260 | 0.115 | 0.180 | 0.100 | 0.205 | 0.101 |
| GRU-EVAL | 0.217 | 0.109 | 0.257 | 0.112 | 0.180 | 0.099 | 0.203 | 0.101 |
RetailRocket

| Model | Clicks H@10 | Clicks M@10 | Clicks H@20 | Clicks M@20 | Purchases H@10 | Purchases M@10 | Purchases H@20 | Purchases M@20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NARM | 0.322 | 0.274 | 0.328 | 0.275 | 0.705 | 0.597 | 0.718 | 0.598 |
| SAS(RL) | 0.317 | 0.208 | 0.357 | 0.210 | 0.639 | 0.528 | 0.661 | 0.530 |
| SAS-SQN | 0.319 | 0.209 | 0.361 | 0.212 | 0.644 | 0.533 | 0.667 | 0.534 |
| SAS-EVAL | 0.318 | 0.209 | 0.360 | 0.212 | 0.645 | 0.526 | 0.666 | 0.527 |
| GRU(RL) | 0.207 | 0.117 | 0.244 | 0.120 | 0.358 | 0.219 | 0.409 | 0.222 |
| GRU-SQN | 0.224 | 0.131 | 0.263 | 0.133 | 0.390 | 0.250 | 0.433 | 0.253 |
| GRU-EVAL | 0.235 | 0.137 | 0.273 | 0.139 | 0.428 | 0.276 | 0.477 | 0.279 |