Reassessing the Effectiveness of Reinforcement Learning based Recommender Systems for Sequential Recommendation

Additional Online Material: Source Code, Full Results, Hyperparameters


Dilina Chandika Rajapakse, Trinity College Dublin, Ireland, rajapakd[at]tcd.ie
Dietmar Jannach, University of Klagenfurt, Austria, dietmar.jannach[at]aau.at

Over the past few years, researchers have explored the use of reinforcement learning (RL) for sequential recommendation problems. However, since RL techniques commonly aim to optimize long-term rewards, it is surprising that RL-based models are reported to be competitive with traditional supervised models when evaluated under the myopic next-item prediction protocol. A recent study suggests that the reported performance gains of combining RL with supervised learning techniques, as done in the Self-Supervised Q-Learning (SQN) framework, may not actually come from learning an optimal policy, but rather from the RL component helping to learn embeddings that encode the users' past interactions. Given these observations, we aimed to reassess the performance of RL-enhanced sequential recommendations in the SQN framework. While we were able to reproduce the results reported in the respective papers, we found that properly tuned supervised learning models like GRU4Rec substantially outperform the proposed RL models from the literature. Our analyses furthermore revealed that there is a significant inconsistency in terms of evaluation protocols in the literature, and that the use of third-party implementations of existing models may lead to unreliable conclusions. Overall, more research and alternative evaluation schemes seem to be required to fully leverage the power of RL for sequential recommendation tasks.


Source code available at: https://github.com/dilina-r/rl-rec

Yoochoose - Full Results (Clicks only)

Model     H@5    M@5    N@5    H@10   M@10   N@10   H@20   M@20   N@20
SKNN      0.395  0.226  0.268  0.527  0.244  0.311  0.646  0.253  0.341
VSKNN     0.460  0.263  0.312  0.592  0.281  0.355  0.699  0.289  0.382
GRU4Rec   0.477  0.285  0.333  0.609  0.303  0.376  0.713  0.310  0.402
NARM      0.475  0.285  0.332  0.609  0.303  0.375  0.715  0.310  0.402
SAS(RL)   0.466  0.278  0.324  0.595  0.295  0.366  0.700  0.302  0.393
SAS-SQN   0.466  0.277  0.324  0.596  0.295  0.366  0.703  0.302  0.394
SAS-EVAL  0.467  0.277  0.324  0.598  0.295  0.367  0.704  0.302  0.394
GRU(RL)   0.402  0.241  0.281  0.520  0.257  0.319  0.622  0.264  0.345
GRU-SQN   0.417  0.248  0.290  0.539  0.265  0.330  0.643  0.272  0.356
GRU-EVAL  0.419  0.249  0.291  0.544  0.266  0.332  0.647  0.273  0.358
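
The column labels H@k, M@k and N@k are not defined on this page. Assuming they denote hit rate, mean reciprocal rank and NDCG at cutoff k for a single held-out next item (an assumption; the authoritative definitions are those in the paper and the linked source code), the following minimal Python sketch shows how such values can be derived from the rank of the target item. The function names and the example ranks are illustrative and are not taken from the repository.

```python
import math

def rank_of_target(scores, target):
    """1-based rank of the target item when items are sorted by descending score."""
    target_score = scores[target]
    # Every item scored strictly higher than the target pushes it one position down.
    return 1 + sum(1 for s in scores.values() if s > target_score)

def metrics_at_k(ranks, k):
    """Average hit rate, MRR and NDCG at cutoff k over a list of 1-based target ranks."""
    n = len(ranks)
    hit = sum(1 for r in ranks if r <= k) / n
    mrr = sum(1.0 / r for r in ranks if r <= k) / n
    # With one relevant item per prediction the ideal DCG is 1, so NDCG equals DCG.
    ndcg = sum(1.0 / math.log2(r + 1) for r in ranks if r <= k) / n
    return hit, mrr, ndcg

# Made-up target ranks for four test predictions (illustration only).
example_ranks = [1, 3, 12, 25]
for k in (5, 10, 20):
    h, m, n = metrics_at_k(example_ranks, k)
    print(f"H@{k}={h:.3f}  M@{k}={m:.3f}  N@{k}={n:.3f}")
```

Under these assumed definitions, a target ranked below position k contributes zero to all three metrics, which is why each metric can only grow as k increases within a row of the tables.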

RetailRocket - Full Results (Views only)

Model     H@5    M@5    N@5    H@10   M@10   N@10   H@20   M@20   N@20
SKNN      0.462  0.328  0.362  0.535  0.338  0.385  0.593  0.342  0.400
VSKNN     0.466  0.331  0.365  0.537  0.341  0.388  0.596  0.345  0.403
GRU4Rec   0.419  0.279  0.314  0.518  0.292  0.346  0.610  0.299  0.369
NARM      0.438  0.327  0.355  0.499  0.336  0.375  0.556  0.340  0.389
SAS(RL)   0.358  0.237  0.267  0.437  0.248  0.293  0.506  0.253  0.310
SAS-SQN   0.367  0.247  0.277  0.447  0.258  0.303  0.519  0.263  0.321
SAS-EVAL  0.372  0.250  0.281  0.453  0.261  0.307  0.523  0.266  0.324
GRU(RL)   0.317  0.211  0.238  0.379  0.220  0.258  0.433  0.224  0.272
GRU-SQN   0.332  0.226  0.252  0.398  0.235  0.274  0.455  0.239  0.288
GRU-EVAL  0.336  0.228  0.255  0.397  0.237  0.275  0.454  0.240  0.289

Diginetica - Full Results (Views only)

Model     H@5    M@5    N@5    H@10   M@10   N@10   H@20   M@20   N@20
SKNN      0.278  0.166  0.193  0.386  0.180  0.228  0.502  0.188  0.257
VSKNN     0.334  0.245  0.267  0.419  0.257  0.295  0.513  0.263  0.319
GRU4Rec   0.453  0.296  0.335  0.556  0.310  0.369  0.655  0.317  0.394
NARM      0.415  0.265  0.302  0.524  0.280  0.338  0.628  0.287  0.364
SAS(RL)   0.373  0.226  0.263  0.477  0.240  0.297  0.572  0.247  0.320
SAS-SQN   0.377  0.230  0.266  0.482  0.244  0.300  0.581  0.251  0.325
SAS-EVAL  0.374  0.227  0.263  0.478  0.241  0.297  0.576  0.248  0.322
GRU(RL)   0.232  0.138  0.161  0.305  0.148  0.185  0.376  0.152  0.203
GRU-SQN   0.281  0.165  0.194  0.379  0.178  0.225  0.467  0.184  0.248
GRU-EVAL  0.284  0.165  0.195  0.379  0.178  0.225  0.469  0.184  0.248

Yoochoose Clicks - Full Results

Model     H@5    M@5    N@5    H@10   M@10   N@10   H@20   M@20   N@20
SKNN      0.362  0.220  0.255  0.462  0.233  0.288  0.542  0.239  0.308
VSKNN     0.402  0.243  0.283  0.498  0.256  0.314  0.580  0.262  0.335
GRU4Rec   0.426  0.258  0.300  0.533  0.272  0.335  0.617  0.278  0.356
NARM      0.373  0.220  0.258  0.478  0.235  0.292  0.562  0.240  0.314
SAS(RL)   0.359  0.212  0.249  0.464  0.226  0.283  0.551  0.233  0.305
SAS-SQN   0.367  0.215  0.252  0.469  0.228  0.286  0.560  0.235  0.309
SAS-EVAL  0.366  0.218  0.255  0.474  0.233  0.290  0.560  0.239  0.312
GRU(RL)   0.300  0.176  0.207  0.390  0.188  0.236  0.463  0.193  0.254
GRU-SQN   0.305  0.178  0.210  0.396  0.191  0.239  0.473  0.196  0.259
GRU-EVAL  0.307  0.180  0.211  0.400  0.192  0.241  0.476  0.198  0.261

Yoochoose Purchases - Full Results

Model     H@5    M@5    N@5    H@10   M@10   N@10   H@20   M@20   N@20
SKNN      0.619  0.400  0.455  0.739  0.416  0.494  0.807  0.421  0.511
VSKNN     0.513  0.322  0.370  0.616  0.336  0.403  0.707  0.342  0.426
GRU4Rec   0.394  0.271  0.301  0.476  0.282  0.328  0.537  0.286  0.344
NARM      0.413  0.254  0.293  0.521  0.268  0.328  0.604  0.274  0.349
SAS(RL)   0.326  0.206  0.236  0.411  0.217  0.263  0.474  0.222  0.279
SAS-SQN   0.329  0.207  0.237  0.407  0.218  0.263  0.475  0.222  0.280
SAS-EVAL  0.335  0.216  0.245  0.412  0.226  0.270  0.476  0.230  0.287
GRU(RL)   0.270  0.159  0.186  0.357  0.170  0.214  0.425  0.175  0.232
GRU-SQN   0.286  0.170  0.199  0.366  0.181  0.225  0.444  0.186  0.244
GRU-EVAL  0.285  0.172  0.200  0.373  0.184  0.229  0.445  0.189  0.247

RetailRocket Clicks - Full Results

Model     H@5    M@5    N@5    H@10   M@10   N@10   H@20   M@20   N@20
SKNN      0.303  0.215  0.237  0.355  0.222  0.254  0.399  0.225  0.265
VSKNN     0.368  0.282  0.303  0.416  0.288  0.319  0.454  0.291  0.328
GRU4Rec   0.378  0.290  0.312  0.431  0.297  0.329  0.480  0.300  0.341
NARM      0.343  0.274  0.291  0.381  0.279  0.303  0.418  0.281  0.313
SAS(RL)   0.293  0.210  0.231  0.350  0.218  0.249  0.403  0.221  0.263
SAS-SQN   0.296  0.218  0.237  0.348  0.225  0.254  0.398  0.228  0.267
SAS-EVAL  0.296  0.216  0.236  0.349  0.223  0.253  0.399  0.227  0.266
GRU(RL)   0.156  0.108  0.120  0.187  0.113  0.130  0.218  0.115  0.138
GRU-SQN   0.196  0.134  0.149  0.235  0.139  0.162  0.271  0.141  0.171
GRU-EVAL  0.192  0.134  0.148  0.232  0.139  0.161  0.268  0.141  0.170

RetailRocket Purchases - Full Results

Model     H@5    M@5    N@5    H@10   M@10   N@10   H@20   M@20   N@20
SKNN      0.399  0.296  0.322  0.455  0.303  0.340  0.503  0.307  0.352
VSKNN     0.700  0.625  0.644  0.728  0.629  0.653  0.752  0.631  0.659
GRU4Rec   0.781  0.729  0.742  0.806  0.733  0.750  0.824  0.734  0.755
NARM      0.623  0.532  0.555  0.659  0.537  0.567  0.692  0.540  0.575
SAS(RL)   0.634  0.514  0.544  0.689  0.522  0.562  0.725  0.524  0.571
SAS-SQN   0.655  0.553  0.579  0.695  0.559  0.592  0.721  0.561  0.599
SAS-EVAL  0.657  0.549  0.576  0.695  0.554  0.588  0.723  0.556  0.596
GRU(RL)   0.180  0.127  0.141  0.216  0.132  0.152  0.254  0.135  0.162
GRU-SQN   0.238  0.168  0.186  0.285  0.175  0.201  0.331  0.178  0.213
GRU-EVAL  0.236  0.164  0.182  0.282  0.170  0.197  0.330  0.173  0.209

Diginetica Clicks - Full Results

Model     H@5    M@5    N@5    H@10   M@10   N@10   H@20   M@20   N@20
SKNN      0.278  0.166  0.193  0.386  0.180  0.228  0.502  0.188  0.257
VSKNN     0.334  0.245  0.267  0.419  0.257  0.295  0.513  0.263  0.319
GRU4Rec   0.453  0.296  0.335  0.556  0.310  0.369  0.655  0.317  0.394
NARM      0.415  0.265  0.302  0.524  0.280  0.338  0.628  0.287  0.364
SAS(RL)   0.373  0.226  0.263  0.477  0.240  0.297  0.572  0.247  0.320
SAS-SQN   0.377  0.230  0.266  0.482  0.244  0.300  0.581  0.251  0.325
SAS-EVAL  0.374  0.227  0.263  0.478  0.241  0.297  0.576  0.248  0.322
GRU(RL)   0.232  0.138  0.161  0.305  0.148  0.185  0.376  0.152  0.203
GRU-SQN   0.281  0.165  0.194  0.379  0.178  0.225  0.467  0.184  0.248
GRU-EVAL  0.284  0.165  0.195  0.379  0.178  0.225  0.469  0.184  0.248

SKNN Parameters

Parameter    Yoochoose  Retailrocket  Diginetica  Values
k            500        100           100         {50, 100, 500, 1000, 1500}
sample_size  2500       500           500         {500, 1000, 2500, 5000, 10000}
similarity   cosine     cosine        cosine      {'jaccard', 'cosine', 'binary', 'tanimoto'}

VSKNN Parameters

Parameter        Yoochoose  Retailrocket  Diginetica  Values
k                500        500           100         {50, 100, 500, 1000, 1500}
sample_size      5000       500           2500        {500, 1000, 2500, 5000, 10000}
weighting        log        log           quadratic   {'linear', 'same', 'div', 'log', 'quadratic'}
weighting_score  quadratic  linear        quadratic   {'linear', 'same', 'div', 'log', 'quadratic'}
idf_weighting    1          5             2           {0, 1, 2, 5, 10}

GRU4Rec Parameters

Fixed Parameters: n_epochs=20, n_sample=2048, logq=1.0

Parameter              Yoochoose        Retailrocket     Diginetica       Values
layers                 128              96               128              min=64, max=512, step=32
batch_size             208              208              224              min=32, max=256, step=16
learning_rate          0.13             0.14             0.25             min=0.01, max=0.25, step=0.005
dropout_p_embed        0.5              0.45             0.35             min=0, max=0.5, step=0.05
dropout_p_hidden       0                0                0.4              min=0, max=0.7, step=0.05
momentum               0.5              0.5              0                min=0, max=0.9, step=0.05
sample_alpha           0.2              0.2              0.3              min=0, max=1.0, step=0.1
bpreg                  1.05             0.9              0.6              min=0, max=2.0, step=0.05
elu_param              0                0                0.5              {0.5, 1, 0}
constrained_embedding  True             True             False            {True, False}
loss                   'cross-entropy'  'bpr-max'        'cross-entropy'  {'bpr-max', 'cross-entropy'}
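
The Values column lists either an explicit set of options or a range given as min, max and step. Assuming that both range endpoints are inclusive (an assumption; this page does not say so), the sketch below expands a few of the GRU4Rec ranges into candidate values and draws one configuration at random. How configurations were actually selected during tuning is not described here, so `sample_config` is purely illustrative, and both helper names are hypothetical.

```python
import random

def expand_range(lo, hi, step):
    """Return lo, lo + step, ..., hi, assuming both endpoints are inclusive."""
    n_steps = int(round((hi - lo) / step))
    # Rounding removes floating-point drift such as 0.15000000000000002.
    return [round(lo + i * step, 10) for i in range(n_steps + 1)]

# Ranges copied from the GRU4Rec parameter table above.
search_space = {
    "layers": expand_range(64, 512, 32),
    "batch_size": expand_range(32, 256, 16),
    "dropout_p_embed": expand_range(0, 0.5, 0.05),
    "loss": ["bpr-max", "cross-entropy"],
}

def sample_config(space, rng):
    """Draw one value per parameter uniformly at random (illustrative only)."""
    return {name: rng.choice(values) for name, values in space.items()}

config = sample_config(search_space, random.Random(0))
print(len(search_space["layers"]), "candidate values for layers")
print(config)
```

Read this way, the `layers` range covers 15 candidate sizes from 64 to 512, and the tuned values in the table (for example 128 or 96) fall on that grid.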

NARM Parameters

Fixed Parameters: epochs=50

Parameter   Yoochoose  Retailrocket  Diginetica  Values
batch_size  144        144           128         min=64, max=512, step=16
emb_dim     160        384           96          min=64, max=512, step=32
lr          0.0006     0.0001        0.0006      min=0.0001, max=0.1, step=0.00005

GRU(RL), SQN-GRU and EVAL-GRU Parameters

Fixed Parameters: epochs=40

Parameter      Yoochoose  Retailrocket  Diginetica  Values
batch_size     512        512           160         min=64, max=512, step=64
discount       0          0.2           0.2         min=0, max=1.0, step=0.1
hidden_factor  64         512           512         min=64, max=512, step=32
lr             0.003      0.0005        0.001       min=0.001, max=0.1, step=0.0005

SAS(RL), SQN-SAS and EVAL-SAS Parameters

Fixed Parameters: epochs=40, num_blocks=1, num_heads=1

Parameter      Yoochoose  Retailrocket  Diginetica  Values
batch_size     512        128           244         min=64, max=512, step=64
discount       0.4        0.7           0.6         min=0, max=1.0, step=0.1
dropout        0.5        0.5           0.3         min=0, max=0.9, step=0.1
hidden_factor  64         384           256         min=64, max=512, step=32
lr             0.001      0.0015        0.0055      min=0.001, max=0.1, step=0.0005

SKNN Parameters

Parameter    Yoochoose  Retailrocket  Diginetica  Values
k            100        50            50          {50, 100, 500, 1000, 1500}
sample_size  1000       1000          500         {500, 1000, 2500, 5000, 10000}
similarity   jaccard    cosine        cosine      {'jaccard', 'cosine', 'binary', 'tanimoto'}

VSKNN Parameters

Parameter        Yoochoose  Retailrocket  Diginetica  Values
k                1500       1000          500         {50, 100, 500, 1000, 1500}
sample_size      2500       5000          2500        {500, 1000, 2500, 5000, 10000}
weighting        same       log           linear      {'linear', 'same', 'div', 'log', 'quadratic'}
weighting_score  quadratic  div           same        {'linear', 'same', 'div', 'log', 'quadratic'}
idf_weighting    5          5             0           {0, 1, 2, 5, 10}

GRU4Rec Parameters

Fixed Parameters: n_epochs=20, n_sample=2048, logq=1.0

Parameter              Yoochoose        Retailrocket     Diginetica       Values
layers                 96               384              96               min=64, max=512, step=32
batch_size             144              32               144              min=32, max=256, step=16
learning_rate          0.05             0.035            0.14             min=0.01, max=0.25, step=0.005
dropout_p_embed        0.2              0.4              0.45             min=0, max=0.5, step=0.05
dropout_p_hidden       0.25             0.2              0.25             min=0, max=0.7, step=0.05
momentum               0.55             0.1              0.5              min=0, max=0.9, step=0.05
sample_alpha           0.6              0.4              0.1              min=0, max=1.0, step=0.1
bpreg                  0.45             0.05             1.75             min=0, max=2.0, step=0.05
elu_param              0                0                0.5              {0.5, 1, 0}
constrained_embedding  True             True             True             {True, False}
loss                   'cross-entropy'  'cross-entropy'  'cross-entropy'  {'bpr-max', 'cross-entropy'}

NARM Parameters

Fixed Parameters: epochs=50

Parameter   Yoochoose  Retailrocket  Diginetica  Values
batch_size  96         144           176         min=64, max=512, step=16
emb_dim     224        256           512         min=64, max=512, step=32
lr          0.0006     0.0001        0.0001      min=0.0001, max=0.1, step=0.00005

GRU(RL), SQN-GRU and EVAL-GRU Parameters

Fixed Parameters: epochs=40

Parameter      Yoochoose  Retailrocket  Diginetica  Values
batch_size     64         256           512         min=64, max=512, step=64
discount       0.5        0.4           0           min=0, max=1.0, step=0.1
hidden_factor  64         384           128         min=64, max=512, step=32
lr             0.001      0.0015        0.005       min=0.001, max=0.1, step=0.0005

SAS(RL), SQN-SAS and EVAL-SAS Parameters

Fixed Parameters: epochs=40, num_blocks=1, num_heads=1

Parameter      Yoochoose  Retailrocket  Diginetica  Values
batch_size     512        512           384         min=64, max=512, step=64
discount       0.2        0.5           0.2         min=0, max=1.0, step=0.1
dropout        0.7        0.6           0.6         min=0, max=0.9, step=0.1
hidden_factor  384        256           320         min=64, max=512, step=32
lr             0.001      0.0001        0.0001      min=0.001, max=0.1, step=0.0005

Reproducing Results from the ICML'24 Paper by Labarca et al.

The tables below compare our reproduced metrics with the results reported in the ICML'24 paper titled "On the Unexpected Effectiveness of Reinforcement Learning for Sequential Recommendation". We use the original code, hyperparameters and data provided by the authors.


Reproduced results for click events on the Yoochoose (RC15) and RetailRocket datasets.

          Reproduced Results          ICML'24 Paper Results
          Yoochoose     RetailRocket  Yoochoose     RetailRocket
Model     H@20   N@20   H@20   N@20   H@20   N@20   H@20   N@20
SAS       0.494  0.276  0.233  0.145  0.496  0.275  0.231  0.145
SQN-SAS   0.498  0.273  0.260  0.161  0.505  0.280  0.260  0.162
EVAL-SAS  0.508  0.282  0.257  0.159  0.502  0.278  0.258  0.161
GRU       0.439  0.239  0.175  0.108  0.427  0.231  0.175  0.109
SQN-GRU   0.458  0.252  0.200  0.128  0.457  0.251  0.200  0.127
EVAL-GRU  0.456  0.251  0.201  0.128  0.457  0.251  0.201  0.128
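
To make the size of the reproduction gap for click events easier to read, the following short Python sketch computes, for each model, the largest absolute difference between our reproduced value and the value reported in the ICML'24 paper. The numbers are copied from the click-event table above; the helper name `max_abs_gap` is illustrative.

```python
# Each tuple holds (H@20, N@20) on Yoochoose followed by (H@20, N@20) on RetailRocket.
reproduced = {
    "SAS":      (0.494, 0.276, 0.233, 0.145),
    "SQN-SAS":  (0.498, 0.273, 0.260, 0.161),
    "EVAL-SAS": (0.508, 0.282, 0.257, 0.159),
    "GRU":      (0.439, 0.239, 0.175, 0.108),
    "SQN-GRU":  (0.458, 0.252, 0.200, 0.128),
    "EVAL-GRU": (0.456, 0.251, 0.201, 0.128),
}
reported = {
    "SAS":      (0.496, 0.275, 0.231, 0.145),
    "SQN-SAS":  (0.505, 0.280, 0.260, 0.162),
    "EVAL-SAS": (0.502, 0.278, 0.258, 0.161),
    "GRU":      (0.427, 0.231, 0.175, 0.109),
    "SQN-GRU":  (0.457, 0.251, 0.200, 0.127),
    "EVAL-GRU": (0.457, 0.251, 0.201, 0.128),
}

def max_abs_gap(ours, theirs):
    """Largest absolute difference across the four metric values of one model."""
    return max(abs(a - b) for a, b in zip(ours, theirs))

# Round to three decimals, the precision used in the table.
gaps = {model: round(max_abs_gap(reproduced[model], reported[model]), 3)
        for model in reproduced}
worst = max(gaps, key=gaps.get)
print(gaps)
print("largest gap:", worst, gaps[worst])
```

For click events the largest gap is 0.012 (GRU, H@20 on Yoochoose); all other models differ by at most 0.007.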

Reproduced results for purchase (buy) events on the Yoochoose (RC15) and RetailRocket datasets.

          Reproduced Results          ICML'24 Paper Results
          Yoochoose     RetailRocket  Yoochoose     RetailRocket
Model     H@20   N@20   H@20   N@20   H@20   N@20   H@20   N@20
SAS       0.561  0.317  0.355  0.222  0.560  0.315  0.351  0.218
SQN-SAS   0.612  0.341  0.430  0.271  0.581  0.327  0.423  0.263
EVAL-SAS  0.567  0.317  0.412  0.256  0.622  0.348  0.420  0.260
GRU       0.546  0.299  0.235  0.145  0.561  0.310  0.231  0.146
SQN-GRU   0.576  0.324  0.271  0.175  0.576  0.322  0.270  0.174
EVAL-GRU  0.576  0.320  0.269  0.172  0.578  0.323  0.272  0.176

Results when trained without validation sessions

The original implementations of NARM and of the SQN framework exclude the 'validation' sessions from the final training of the model. NARM uses the validation data to select the best-performing model across epochs, which is then used in the final performance evaluation. In the SQN framework, the models are routinely evaluated on the validation data, but this data does not contribute to model performance. We trained the NARM and RL-enhanced models in the same manner as their respective original implementations, using our SQN-protocol data splits and tuned hyperparameters. Below we report the results for the Yoochoose and RetailRocket datasets.

          Yoochoose - Clicks          Yoochoose - Purchases
Model     H@10   M@10   H@20   M@20   H@10   M@10   H@20   M@20
NARM      0.263  0.143  0.308  0.147  0.235  0.155  0.261  0.157
SAS(RL)   0.254  0.124  0.304  0.128  0.171  0.101  0.193  0.118
SAS-SQN   0.258  0.127  0.310  0.130  0.175  0.106  0.198  0.108
SAS-EVAL  0.259  0.127  0.309  0.130  0.176  0.102  0.202  0.104
GRU(RL)   0.226  0.113  0.268  0.116  0.190  0.107  0.216  0.109
GRU-SQN   0.222  0.112  0.260  0.115  0.180  0.100  0.205  0.101
GRU-EVAL  0.217  0.109  0.257  0.112  0.180  0.099  0.203  0.101

          RetailRocket - Clicks       RetailRocket - Purchases
Model     H@10   M@10   H@20   M@20   H@10   M@10   H@20   M@20
NARM      0.322  0.274  0.328  0.275  0.705  0.597  0.718  0.598
SAS(RL)   0.317  0.208  0.357  0.210  0.639  0.528  0.661  0.530
SAS-SQN   0.319  0.209  0.361  0.212  0.644  0.533  0.667  0.534
SAS-EVAL  0.318  0.209  0.360  0.212  0.645  0.526  0.666  0.527
GRU(RL)   0.207  0.117  0.244  0.120  0.358  0.219  0.409  0.222
GRU-SQN   0.224  0.131  0.263  0.133  0.390  0.250  0.433  0.253
GRU-EVAL  0.235  0.137  0.273  0.139  0.428  0.276  0.477  0.279
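
The two ways of handling validation sessions described in this section differ only in which trained model state is carried forward to the final test evaluation. The sketch below illustrates that difference under simplifying assumptions: the function name `pick_checkpoint` and the per-epoch validation scores are hypothetical and are not taken from the NARM or SQN code, and using the last epoch for the SQN framework is our reading of "monitoring only".

```python
def pick_checkpoint(val_scores, use_validation_for_selection):
    """Return the 0-based epoch whose model state is evaluated on the test set.

    val_scores holds one validation score per epoch (higher is better).
    """
    if use_validation_for_selection:
        # NARM-style: keep the epoch that scored best on the validation sessions.
        return max(range(len(val_scores)), key=lambda epoch: val_scores[epoch])
    # SQN-style (assumed): validation is only monitored, so the last epoch is used.
    return len(val_scores) - 1

# Hypothetical validation scores for three epochs (illustration only, not measured).
val_scores = [0.21, 0.27, 0.25]
print("NARM-style epoch:", pick_checkpoint(val_scores, True))
print("SQN-style epoch:", pick_checkpoint(val_scores, False))
```

In both cases the validation sessions are never used as training data, which is the property the tables in this section are meant to isolate.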