In this paper, we propose a novel Deep Reinforcement Learning framework for news recommendation. Online personalized news recommendation is a highly challenging problem due to the dynamic nature of news features and user preferences. Although some online recommendation models have been proposed to address the dynamic nature of news recommendation, these methods have three major issues. First, they only try to model current reward (e.g., Click Through Rate). Second, very few studies consider to use user feedback other than click / no click labels (e.g., how frequent user returns) to help improve recommendation. Third, these methods tend to keep recommending similar news to users, which may cause users to get bored. Therefore, to address the aforementioned challenges, we propose a Deep Q-Learning based recommendation framework, which can model future reward explicitly. We further consider user return pattern as a supplement to click / no click label in order to capture more user feedback information. In addition, an effective exploration strategy is incorporated to find new attractive news for users. Extensive experiments are conducted on the offline dataset and online production environment of a commercial news recommendation application and have shown the superior performance of our methods.
Keywords: Reinforcement learning, Deep Q-Learning, News recommendation ACM Reference Format: 
Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, Yang Xiang Nicholas Jing Yuan, Xing Xie, and Zhenhui Li. 2018. DRN: A Deep Reinforcement Learning Framework for News Recommendation. In WWW 2018: The 2018 Web Conference, April 23–27, 2018, Lyon, France. ACM, New York, NY, USA 10 Pages. https://doi.org/10.1145/3178876.3185994 
The explosive growth of online content and services has provided tons of choices for users. For instance, one of the most popular online services, news aggregation services, such as Google News [15] can provide overwhelming volume of content than the amount that users can digest. Therefore, personalized online content recommendation are necessary to improve user experience.
Several groups of methods are proposed to solve the online personalized news recommendation problem, including content based methods [19, 22, 33], collaborative filtering based methods [11, 28, 34], and hybrid methods [12, 24, 25]. Recently, as an extension and integration of previous methods, deep learning models [8, 45, 52] have become the new state-of-art methods due to its capability of modeling complex user item (i.e., news) interactions. However, these methods can not effectively address the following three challenges in news recommendation.
First, the dynamic changes in news recommendations are difficult to handle. The dynamic change of news recommendation can be shown in two folds. First, news become outdated very fast. In our dataset, the average time between the time that one piece of news is published and the time of its last click is 4.1 hours. Therefore, news features and news candidate set are changing rapidly. Second, users’ interest on different news might evolve during time. For instance, Figure 1 displays the categories of news that one user has read in 10 weeks. During the first few weeks, this user prefers to read about “Politics” (green bar in Figure 1), but his interest gradually moves to “Entertainment” (purple bar in Figure 1) and “Technology” (grey bar in Figure 1) over time. Therefore, it is necessary to update the model periodically. Although there are some online recommendation methods [11, 24] that can capture the dynamic change of news features and user preference through online model updates, they only try to optimize the current reward (e.g., Click Through Rate), and hence ignore what effect the current recommendation might bring to the future. An example showing the necessity of considering future is given in Example 1.1.
Example 1.1.When a user Mike requests for news, the recommendation agent foresees that he has almost the same probability to click on two pieces of news: one about a thunderstorm alert, and the other about a basketball player Kobe Bryant. However, according to Mike's reading preference, features of the news, and reading patterns of other users, our agent speculates that, after reading about the thunderstorm, Mike will not need to read news about this alert anymore, but he will probably read more about basketball after reading the news about Kobe. This suggests, recommending the latter piece of news will introduce larger future reward. Therefore, considering future rewards will help to improve recommendation performance in the long run.
 
Second, current recommendation methods [23, 35, 36, 43] usually only consider the click / no click labels or ratings as users’ feedback. However, how soon one user will return to this service [48] will also indicate how satisfied this user is with the recommendation. Nevertheless, there has been little work in trying to incorporate user return pattern to help improve recommendation.
The third major issue of current recommendation methods is its tendency to keep recommending similar items to users, which might decrease users’ interest in similar topics. In the literature, some reinforcement learning methods have already proposed to add some randomness (i.e., exploration) into the decision to find new items. State-of-art reinforcement learning methods usually apply the simple ϵ-greedy strategy [31] or Upper Confidence Bound (UCB) [23, 43] (mainly for Multi-Armed Bandit methods). However, both strategies could harm the recommendation performance to some extent in a short period. ϵ-greedy strategy may recommend the customer with totally unrelated items, while UCB can not get a relatively accurate reward estimation for an item until this item has been tried several times. Hence, it is necessary to do more effective exploration.
Therefore, in this paper, we propose a Deep Reinforcement Learning framework that can help to address these three challenges in online personalized news recommendation. First, in order to better model the dynamic nature of news characteristics and user preference, we propose to use Deep Q-Learning (DQN) [31] framework. This framework can consider current reward and future reward simultaneously. Some recent attempts using reinforcement learning in recommendation either do not model the future reward explicitly (MAB-based works [23, 43]), or use discrete user log to represent state and hence can not be scaled to large systems (MDP-based works [35, 36]). In contrast, our framework uses a DQN structure and can easily scale up. Second, we consider user return as another form of user feedback information, by maintaining an activeness score for each user. Different from existing work [48] that can only consider the most recent return interval, we consider multiple historical return interval information to better measure the user feedback. In addition, different from [48], our model can estimate user activeness at any time (not just when user returns). This property enables the experience replay update used in DQN. Third, we propose to apply a Dueling Bandit Gradient Descent(DBGD) method [16, 17, 49] for exploration, by choosing random item candidates in the neighborhood of the current recommender. This exploration strategy can avoid recommending totally unrelated items and hence maintain better recommendation accuracy.
Our deep reinforcement recommender system can be shown as Figure 2. We follow the common terminologies in reinforcement learning [37] to describe the system. In our system, user pool and news pool make up the environment, and our recommendation algorithms play the role of agent. The state is defined as feature representation for users and action is defined as feature representation for news. Each time when a user requests for news, a state representation (i.e., features of users) and a set of action representations (i.e., features of news candidates) are passed to the agent. The agent will select the best action (i.e., recommending a list of news to user) and fetch user feedback as reward. Specifically, the reward is composed of click labels and estimation of user activeness. All these recommendation and feedback log will be stored in the memory of the agent. Every one hour, the agent will use the log in the memory to update its recommendation algorithm.
 
Our contribution can be summarized as below:
The rest of the paper is organized as follows. Related work is discussed in Section 2. Then, in Section 3 we present the problem definitions. Our method is introduced in Section 4. After that, the experimental results are shown in Section 5. Finally, brief conclusions are given in Section 6.
Recommender systems [3, 4] have been investigated extensively because of its direct connection to profits of products. Recently, due to the explosive grow of online content, more and more attention has been drawn to a special application of recommendation – online personalized news recommendation. Conventional news recommendation methods can be divided into three categories. Content-based methods [19, 22, 33] will maintain news term frequency features (e.g., TF-IDF) and user profiles (based on historical news). Then, recommender will select news that is more similar to user profile. In contrast, collaborative filtering methods [11] usually make rating prediction utilizing the past ratings of current user or similar users [28, 34], or the combination of these two [11]. To combine the advantages of the former two groups of methods, hybrid methods [12, 24, 25] are further proposed to improve the user profile modeling. Recently, as an extension and integration of previous methods, deep learning models [8, 45, 52] have shown much superior performance than previous three categories of models due to its capability of modeling complex user-item relationship. Different from the effort for modeling the complex interaction between user and item, our algorithm focuses on dealing with the dynamic nature of online news recommendation, and modeling of future reward. However, these feature construction and user-item modeling techniques can be easily integrated into our methods.
2.2.1 Contextual Multi-Armed Bandit models. A group of work [5, 7, 23, 40, 44, 50] begin to formulate the problem as a Contextual Multi-Armed Bandit (MAB) problem, where the context contains user and item features. [23] assumes the expected reward is a linear function of the context. [39] uses an ensemble of bandits to improve the performance, [40] proposes a parameter-free model, and [50] addresses the time-varying interest of users. Recently, some people try to combine bandit with clustering based collaborative filtering [14], and matrix factorization [6, 21, 32, 42, 43, 51], in order to model more complex user and item relationship, and utilize the social network relationship in determining the reward function. However, our model is significantly different from these works, because by applying Markov Decision Process, our model is able to explicitly model future rewards. This will benefit the recommendation accuracy significantly in the long run.
2.2.2 Markov Decision Process models. There are also some literature trying to use Markov Decision Process to model the recommendation process. In contrast to MAB-based methods, MDP-based methods can not only capture the reward of current iteration, but also the potential reward in the future iterations. [26, 27, 35, 36, 38] try to model the item or n-gram of items as state (or observation in Partially Observed MDP), and the transition between items (recommendation for the next item) as the action. However, this can not scale to large dataset, because when the item candidate set becomes larger, the size of state space will grow exponentially. In addition, the state transitions data is usually very sparse, and can only be used to learn the model parameters corresponding to certain state transitions. Therefore, the model is really hard to learn. Different from the literature, we propose a MDP framework with continuous state and action representation, which enables the system to scale up and the effective learning of model parameters by using all the state, action, reward tuples.
We define our problem as follows:
When a user $\mathsf $ sends a news request to the recommendation agent $\mathsf $ at time t, given a candidate set $\mathsf $ of news, our algorithm is going to select a list $\mathsf $ of top-k appropriate news for this user. The notations used in this paper are summarized in Table 1.
Table 1: Notations| Notation | Meaning | 
|---|---|
| $\mathsf $ | Agent | 
| $\mathsf $ , $\mathsf $ | User, User set | 
| $\mathsf $ | Action | 
| $\mathsf $ | State | 
| $\mathsf $ | Reward | 
| $\mathsf $ , $\mathsf $ | News, Candidate news pool | 
| $\mathsf $ | List of news to recommend | 
| $\mathsf $ | List of feedback from users | 
| $\mathsf $ | Deep Q-Network | 
| $\mathsf $ | Parameters of Deep Q-Network | 
Personalized news recommendation has attracted a lot of attention in recent years [11, 23, 45]. The current methods can be generally categorized as content based methods [19, 22, 33], collaborative filtering based methods [11, 28, 34], and hybrid methods [12, 24, 25]. Recently, many deep learning models [8, 45, 52] are further proposed in order to model more complex user item interactions. News recommendation problem becomes even more challenging when it happens in an online scenario due to three reasons. First, online learning are needed due to the highly dynamic nature of news characteristics and user preference. Second, only using click / no click labels will not capture users’ full feedback towards news. Third, traditional recommendation methods tend to recommend similar items and will narrow down user's reading choices. This will make users bored and lead to decrease of user satisfaction in the long run.
To address these three challenges, we propose a DQN-based Deep Reinforcement Learning framework to do online personalized news recommendation. Specifically, we use a continuous state feature representation of users and continuous action feature representation of items as the input to a multi-layer Deep Q-Network to predict the potential reward (e.g., whether user will click on this piece of news). First, this framework can deal with the highly dynamic nature of news recommendation due to the online update of DQN. Meanwhile, DQN is different from common online methods, because of its capability to speculate future interaction between user and news. Second, we propose to combine user activeness (i.e., how frequent a user returns to the App after one recommendation) and click labels as the feedback from users. Third, we propose to apply Dueling Bandit Gradient Descent exploration strategy [16, 49] to our algorithm which can both improve recommendation diversity and avoid the harm to recommendation accuracy induced by classical exploration strategies like ϵ-greedy [31] and Upper Confidence Bound [23].
Our method is significantly different from the MAB group of methods [5, 7, 23, 40, 44, 50] due to its explicit modeling of future rewards, and different from previous MDP methods [27, 35, 36, 38] using user log due to its continuous representation of state and action, and the capability to scale to large systems.
In this section, we will first introduce the model framework in Section 4.1. Then, we will illustrate the feature construction in Section 4.2 and the deep reinforcement learning model in Section 4.3. After that, the design of user activeness consideration is discussed in Section 4.4. Finally, the exploration module is introduced in Section 4.5.

As shown in Figure 3, our model is composed of offline part and online part. In offline stage, four kinds of features (will be discussed in Section 4.2) are extracted from news and users. A multi-layer Deep Q-Network is used to predict the reward (i.e., a combination of user-news click label and user activeness) from these four kinds of features. This network is trained using the offline user-news click logs. Then, during the online learning part, our recommendation agent $\mathsf $ will interact with users and update the network in the following way:
In order to predict whether user will click one specific piece of news or not, we construct four categories of features:
In order to focus on the analysis of the reinforcement learning recommendation framework, we did not try to add more features, e.g., textual features [45]. But they can be easily integrated into our framework for better performance.
Considering the previous mentioned dynamic feature of news recommendation and the need to estimate future reward, we apply a Deep Q-Network (DQN) [31] to model the probability that one user may click on one specific piece of news. Under the setting of reinforcement learning, the probability for a user to click on a piece of news (and future recommended news) is essentially the reward that our agent can get. Therefore, we can model the total reward as Equation 1.
\beginAs shown in Figure 4, we feed the four categories of features into the network. User features and Context features are used as state features, while User news features and Context features are used as action features. On one hand, the reward for taking action $\mathsf $ at certain state $\mathsf $ is closely related to all the features. On the other hand, the reward that determined by the characteristics of the user himself (e.g., whether this user is active, whether this user has read enough news today) is more impacted by the status of the user and the context only. Based on this observation, like [47], we divide the Q-function into value function $V(\mathsf )$ and advantage function $A(\mathsf , \mathsf )$ , where $V(\mathsf )$ is only determined by the state features, and $A(\mathsf , \mathsf )$ is determined by both the state features and the action features.
 
Traditional recommender systems only focus on optimizing CTR-like metrics (i.e., only utilizing click / no click labels), which only depicts part of the feedback information from users. The performance of recommendation might also influence whether users want to use the application again, i.e., better recommendation will increase the frequency for users to interact with the application. Therefore, the change of user activeness should also be considered properly.
Users request for news in a non-uniform pattern. Users usually read news for a short period (e.g., 30 minutes), during which they will request or click news with high frequency. Then they might leave the application and return to the application when they want to read more news after several hours. A user return happens when a user requests for news (users will always request for news before they click on news, therefore, user click is also implicitly considered).
We use survival models [18, 30] to model user return and user activeness. Survival analysis [18, 30] has been applied in the field of estimating user return time [20]. Suppose T is the time until next event (i.e., user return) happens, then the hazard function (i.e., instantaneous rate for the event to happen) can be defined as Equation 3 [1, 30]
\beginIn our problem, we simply set λ(t) = λ 0, which means each user has a constant probability to return. Every time we detect a return of user, we will set S(t) = S(t) + Sa for this particular user. The user activeness score will not exceed 1. For instance, as shown in Figure 5, user activeness for this specific user starts to decay from S 0 at time 0. At timestamp t 1, the user returns and this results in a Sa increase in the user activeness. Then, the user activeness continues to decay after t 1. Similar things happen at t 2, t 3, t 4 and t 5. Note that, although this user has a relatively high request frequency during t 4 to t 9, the maximum user activeness is truncated to 1.
The parameters S 0, Sa , λ 0, T 0 are determined according to the real user pattern in our dataset. S 0 is set to 0.5 to represent the random initial state of a user (i.e., he or she can be either active or inactive). We can observe the histogram of the time interval between every two consecutive requests of users as shown in Figure 6. We observe that besides reading news multiple times in a day, people usually return to the application on a daily regular basis. So we set T 0 to 24 hours. The decaying parameter λ 0 is set to 1.2 × 10 − 5 second − 1 according to Equation 4 and Equation 5. In addition, the user activeness increase Sa for each click is set to 0.32 to make sure user will return to the initial state after one daily basis request, i.e., $S_0 e^ <-\lambda _0 T_0>+ S_a = S_0$ .
The click / no click label $\mathsf _$ and the user activeness $\mathsf _$ are combined as in Equation 6.
\beginAlthough we use survival models here to estimate the user activeness, other alternatives like Poisson point process [ 13] can also be applied and should serve similar function.  
 

The most straightforward strategies to do exploration in reinforcement learning are ϵ-greedy [31] and UCB [23]. ϵ-greedy will randomly recommend new items with a probability of ϵ, while UCB will pick items that have not been explored for many times (because these items may have larger variance). It is evident that these trivial exploration techniques will harm the recommendation performance in a short period. Therefore, rather than doing random exploration, we apply a Dueling Bandit Gradient Descent algorithm [16, 17, 49] to do the exploration. Intuitively, as shown in Figure 7, the agent $\mathsf $ is going to generate a recommendation list $\mathsf $ using the current network $\mathsf $ and another list $\tilde<\mathsf >$ using an explore network $\tilde<\mathsf >$ . The parameters $\tilde<\mathsf >$ of network $\tilde<\mathsf >$ can be obtained by adding a small disturb $ \mathsf \mathsf $ (Equation 7) to the parameters $\mathsf $ of the current network $\mathsf $ .
\beginwhere α is the explore coefficient, and rand(− 1, 1) is a random number between -1 and 1. Then, the agent $\mathsf $ will do a probabilistic interleave [ 16] to generate the merged recommendation list $\hat<\mathsf >$ using $\mathsf $ and $\tilde<\mathsf >$ . To determine the item for each position in the recommendation list $\hat<\mathsf >$ , the probabilistic interleave approach basically will first randomly select between list $\mathsf $ and $\tilde<\mathsf >$ . Suppose $\mathsf $ is selected, then an item $\mathsf $ from $\mathsf $ will be put into $\hat<\mathsf >$ with a probability determined by its ranking in $\mathsf $ (items with top rankings will be selected with higher probability). Then, list $\hat<\mathsf >$ will be recommended to user $\mathsf $ and agent $\mathsf $ will obtain the feedback $\mathsf $ . If the items recommended by the explore network $\tilde<\mathsf >$ receive a better feedback, the agent $\mathsf $ will update the network $\mathsf $ towards $\tilde<\mathsf >$ , with the parameters of the network being updated as Equation 8
\beginOtherwise, the agent $\mathsf 
We conduct experiment on a sampled offline dataset collected from a commercial news recommendation application and deploy our system online to the App for one month. Each recommendation algorithm will give out its recommendation when a news request arrives and user feedback will be recorded (click or not). The basic statistics for the sampled data is as in Table 2. In the first offline stage, the training data and testing data are separated by time order (the last two weeks are used as testing data), to enable the online models to learn the sequential information between different sessions better. During the second online deploying stage, we use the offline data to pre-train the model, and run all the compared methods in the real production environment.
Table 2: Statistics of the sampled dataset| Stage | Duration | # of users | # of news | 
|---|---|---|---|
| Offline stage | 6 months | 541,337 | 1,355,344 | 
| Online stage | 1 month | 64,610 | 157,088 | 
As shown in Figure 8, the dataset is very skewed. The number of requests for each user follows a long tail distribution and most users only request news for less than 500 times. The number of times each news are pushed also follow a long tail distribution and most news are pushed to user less than 200 times.
 