Prévia do material em texto
2017 2nd International Conference on Software, Multimedia and Communication Engineering (SMCE 2017)
ISBN: 978-1-60595-458-5
Ratings Distribution Recommendation Model-based Collaborative
Filtering Recommendation Algorithm
Tao-tao PAN*, Qin-rang LIU and Chang LIU
China National Digital Switching System Engineering and Technological R&D Center, Zhengzhou
450002, China
*Corresponding author
Keywords: Collaborative filtering, Similarity, Popular ratings, Rating scale.
Abstract. In order to solve the problem of the popular item ratings interfering in similarity calculation,
we proposed the ratings distribution recommendation model. Based on this model, we designed a new
collaborative filtering algorithm. According to ratings distribution, this algorithm firstly get the
amount of information carried (The Shannon Entropy). Then, it calculated the rating weights to filter
into traditional similarity calculation. The experimental results show that the algorithm can
effectively alleviate the above problem and improve the performance of the algorithm.
Introduction
With the development of Internet technology, the problem of information overload is becoming more
and more serious. It is very difficult for people to get the information they need from the ocean of
knowledge
[1]
. In this context, the recommendation system arises at the historic moment. According to
the different methods used in recommendation, recommendation system can be divided into
content-based recommendation system, collaborative filtering recommendation system and hybrid
recommendation system. Collaborative filtering algorithm first finds a set of users with similar interests
and goals of the user as the neighbor users, then according to the neighbor users on the item's rating to
predict the target user rating and recommendation. Collaborative filtering algorithm is widely used in
recommendation system because it is not limited by the content of the recommended item
[2]
.
Despite the great success of collaborative filtering technology in the field of recommender systems,
there are still serious problems such as data sparsity
[3]
, which affect the performance of
recommendation system. In order to improve the accuracy of recommendation, scholars have put
forward some solutions. Huang Chuangguang
[4]
proposed a method to adaptively select the
recommended target user group and on the basis of selected subgroups of uncertain neighbor trust as
method, finally get the neighbor. Kaleli
[5]
uses the information of the user to get the information
entropy of the user and the item rating, so as to adjust the choice of the nearest neighbor. The nearest
neighbor selection is more reasonable by clustering method, in the study of Wang
[6]
. Some scholars
study from the perspective of similarity calculation. Jang
[7]
and Xu
[8]
used the matrix filling to
alleviate the sparsity, but the rating filled the subjectivity is too strong. Guo
[9]
adopt the method of
matrix decomposition, the decomposition process will lose useful information. Luo
[10]
is introduced
to improve the number of common rating similarity calculation accuracy.
On the basis of previous studies, based on the rating of the individualized trend of value
distribution, this paper analyzes the following problem: the interference of the popular item ratings to
the similarity calculation, that is, the distribution probability is greater than the rating of the threshold
Q rating. The popular ratings represent the public preference of the item. When the user's rating of the
item is a popular rating, that is, the user's preference for this item belongs to the public preference, the
significance of this evaluation rating is small, cannot effectively reflect the user's personal preferences.
On the contrary, it can clearly reflect the user's individual preferences. The traditional similarity
calculation does not consider the interference of the popular item rating. When the data is sparse, the
interference is very obvious.
375
In order to solve the above problem, this paper designs a recommendation model based on the
traditional recommendation model. In the model, the amount of the information of the rating in the
item is estimated and the weight of the rating is distinguished, so as to reduce the interference to the
similarity calculation. Based on this model, this paper proposes a collaborative filtering algorithm
(RDRM-CF), which is based on the distribution model of the rating value distribution. The algorithm
can effectively filter the interference of the popular item ratings and improve the accuracy of the
algorithm.
Related Concepts and Model Design
Related Concepts and Definitions
Aiming at the problems existing in the traditional algorithm, this paper defines the concept of "item
rating heat value".
(Definitions 1)Item rating heat: Item i in the rating value j (j=1, 2, 3, 4, 5) the frequency of use,
set to j
id .
(Definitions 2)Item rating using probability: The heat value of the rating j in the item i
accounts for the proportion of the sum of all the ratings, set to jip , and
5
1
1ji
j
p
. The greater the
probability of the use of the rating, the greater the interference in the calculation of similarity
(Definitions 3)Item rating probability vector: In the item, the probability of 5 scoring values
(1, 2, 3, 5,), which is denoted as
ip ,
1 2 3 4 5( , , , , )i i i i i ip p p p p p , is a vector, which reflects the personalized
distribution trend of the rating value.
(Definitions 4)Item rating information content: The amount of information carried by the
rating value j in the item i (i.e., Shannon entropy) is set to j
iH . The amount of information is a
measure of the amount of information in the field of statistics, if the probability of occurrence of the
event, the less the amount of information, that is, the amount of information is inversely proportional
to the probability of occurrence. Assuming that the probability of occurrence of an event is p, the
amount of information for this event is shown in equation (1):
( ) log( )H X p (1)
(Definitions 5) Item rating weight: The weight of the rating j in the item i, which represents
the degree of credibility of the rating value, is set to jitr .
j
itr and
j
iH are proportional to the larger the
j
iH , the greater the
j
itr .
(Definitions 6) Item rating weight vector: The weights of the 5 ratings in the item (1, 2, 3, 5)
were recorded in
iTr ,
1 2 3 4 5( , , , , )i i i i i iTr tr tr tr tr tr .
Model Design
The traditional algorithm does not take into account the interference of the item rating heat value in
the calculation of similarity, which can not guarantee the reliability of the results of similarity
calculation. In order to solve the above problem, this paper designs a recommendation model of rating
value distribution, as shown in figure 1. In the statistics module statistics, the weight of the rating will
be get. In similarity module the rating weight is embedded into the formula of similarity calculation,
in order to avoid the interference of the popular item rating.
376
user-item rating
matrix
Statistical Module
Ratings
Probability of each
rating in the Item
The Shannon
entropy of each
rating in the Item
The weight of
each rating in the
Item
The weight vector
of each Item
Ratings
Similarity
Similarity
Module
Recom
m
endation M
odule
Figure 1. Rating value distribution recommendation model.
Algorithm Design
The Statistics of Each Rating Value in the Item
If there are m users and n items in the system, the rating information can be expressed by the rating
matrix X. X which represents the user u on the item I rating, with the 1~5 5 grades to express the user'spreference for the item, the user - item rating matrix as shown in formula (2):
1,1 1,
,1 ,
n
m m n
r r
R
r r
(2)
In the item, the distribution of the rating value presents a personalized trend, that is, the different
values of the items in the item are different. The introduction of parameter
,u i
t , [1, ]u m , if ,u ir j ,
{1,,2,3,4,5}j , then , 1u it , else , 0u it . The value of the i value of the item j
j
id as shown in formula
(3).
,
1
m
j
i u i
u
d t
(3)
Probability Statistics of Each Rating in the Item
According to the value of the value of the calculation of the probability of heat, the item i rating value
of the probability of jip , as shown in the formula (4) :
1 2 3 4 5
j
j i
i
i i i i i
d
p
d d d d d
(4)
Because the problem of data sparseness can affect the accuracy of the probability calculation of the
rating value, this paper introduces the adjustment parameter t, t is a certain non negative number, the
improved formula (4) as shown in formula (5):
1 2 3 4 5( ) 5
j
j i
i
i i i i i
d t
p
d d d d d t
(5)
The Amount of Information in Each Item in the Item
According to the formula (1), it is known that the amount of information of each rating is shown in
formula (6):
2log
j j
i iH p
(6)
377
Weight Calculation of Each Rating in the Item
The core idea of this section is to divide the corresponding weights according to the size of the value
of the information, in order to filter the interference. The greater the probability of a rating value, the
smaller the amount of information, the smaller the meaning, so the smaller the value of the rating. The
formula of the rating value j
itr and its probability
j
itr is shown in (7):
1 1 2 1= log , 0
j j j
i i itr k H k p k
(7)
At the same time, due to the presence of
1 2 3 4 5( ) 5
j
j i
i
i i i i i
d t
p
d d d d d t
. Because of parameter t,
j
ip is generally in the 0.2 floating up and down. In order to make the weight formula with a uniform
standard, so that =0.2jip , its weight is 1, the weight formula into formula (8) shown:
2
2
log
log 5
j
j i
i
p
tr
(8)
The weight vector of each rating information in the item i is shown in formula (9):
1 2 3 4 5( , , , , )i i i i i iTr tr tr tr tr tr (9)
Similarity Calculation
The traditional similarity calculation algorithm commonly used methods are the following three types:
cosine similarity, modified cosine similarity and Pearson correlation coefficient. The similarity
calculation method is improved based on the correlation coefficient of Pearson. The user u and user V
use the Pearson correlation coefficient to calculate the similarity as shown in formula (10):
,
, ,
, ,
2 2
, ,
( )( )
( , )
( ) ( )
i u v
i u v i u v
u i u v i v
I I
u i u v i v
I I I I
r r r r
sim u v
r r r r
(10)
,u ir , ,v ir , respectively, that the user u, V on the item I rating value. ur , vr , respectively, that the user u,
the average rating of v.
uI , vI , respectively, indicating that the user u, v has rating the item. ,u vI said
the user u, v joint scoring items. The similarity calculation using Pearson correlation coefficient, the
weight of each default item rating the same, but did not take into account the "interference rating
value" in the similarity calculation process, the calculation result is not accurate enough. In this paper,
the item rating value information is used to measure the weight of the different scoring values in the
item, and the weight of the item rating weight vector iTr . is divided into different weights in the item I
by iTr . The improved similarity calculation method is shown in formula (11):
, ,
,
, ,
, ,
, ,
2 2
, ,
( )( )
( , )
( ( )) ( ( ))
u i v i
i u v
u i v i
i u v i u v
r r
u i u v i v i i
I I
r r
i u i u i v i v
I I I I
r r r r tr tr
tsim u v
tr r r tr r r
(11)
,u ir
itr is the weight of the rating value for the user u in the item i. ,u ir is the weight of the user's v
rating ,v ir . Use the value of the amount of information to adjust the weights of the different items in
the item, so that the similarity of the results of higher credibility.
378
Experimental Results and Analysis
Data Sets and Comparison Algorithms
Movielens_100K was used in the experiment. Movielens_100K stores 943 users for the 1682 films of
the 100000 scoring. In the experiment, 20% of the data sets were randomly selected as the test set and
the other as the training set, and then the performance of the following three algorithms was compared
with that of the 80% algorithms.
(1) The traditional collaborative filtering recommendation algorithm (Per-CF) based on the
similarity of Pearson correlation coefficient;
(2) Collaborative filtering recommendation algorithm (LFRM-CF) based on low pass filtering
recommendation model proposed by [11];
(3) A collaborative filtering recommendation algorithm (RDRM-CF) proposed in this paper based
on the rating value distribution recommendation model.
Evaluation Criterion
In this paper, the average absolute deviation (MAE) is used as the evaluation standard to evaluate
the accuracy of the proposed algorithm:
, ,
1
N
u i u i
i
p q
MAE
N
(12)
,u ip , which represents the user's u forecast rating on the item i, ,u iq said the user u the true rating of
the item i, N said the user u has been the number of items.
Experiment and Result Analysis
Experiment 1 Selection of threshold λ and parameter t
In order to select the appropriate parameter t, MAE is the best. The experiment makes the nearest
neighbor number N=30, take t=5, 10, 15, 20, 25, 35, 40,50 and observe the value of MAE, the
experimental results in Movielens_100K data set are shown in Figure 4:
Figure 4. Relationship between t and MAE.
As can be seen from Figure 4: focus on the Movielens_100K data, for different number of nearest
neighbors, when t=35, the best RDRM-CF algorithm, several experiments are the best when t=34, so
the follow-up experiment t=34.
Experiment 2 The influence of nearest neighbor number on algorithm accuracy
Change the value of the number of neighbors K, Movielens_100K data sets to compare the
accuracy of the three algorithms, the experimental results shown in Figure 5.
379
Figure 5. The relationship between the k and MAE.
As can be seen from Figure 5:
(1) The accuracy of the three algorithms increases gradually with the increase of k, and the accuracy
of the algorithm tends to be stable around k=40.
(2)The accuracy of RDRM-CF algorithm is higher than the other two algorithms.
Analysis of experimental results:
(1) When the number of neighbors increases, the useful information increases, and the accuracy of
the algorithm is improved. When the number of neighbors is higher than 40, the accuracy of the
algorithm tends to be stable. Therefore, the k=40.
(2) Movielens_100K belongs to the high sparse data set, and the interference of the item heat rating
is very obvious, which leads to the decrease of the accuracy of the similarity calculation. The
RDRM-CF algorithm through the statistics item rating information value to reasonably classify the
weight rating and filter the interference of the popular item ratings to the similarity calculation, so
RDRM-CF algorithm accuracy ratio the other two methods.
Experiment 3 The influence of sparsity on algorithm accuracy
In order to further analyze the data set of sparsity effects on RDRM-CF algorithm, the random user
rating to reduce set different sparsity, focuses on the comparison of three algorithms in
Movielens_100K data accuracy, the experimental results are shown in figure 6.
Figure 6. The relationship between sparsity and MAE sparsity.
As can be seen from figure 6:
With the increase of sparsity, the accuracy of the three algorithms is gradually decreased, but the
accuracy of the algorithm RDRM-CF is also decreased, but it has been higher than the other two
algorithms.
Analysis of experimental results:
With the increase of the sparsity, the rating information is reduced and the rating of the items is
decreased rapidly, and the reliability of the result is reduced. At the same time, with the increases of
sparsity, the interference of the popular item ratings to similarity calculation will be more obvious,
and the RDRM-CF algorithm can reasonably divide the different weight of ratings, so the
performance of RDRM-CF has been better than the other two algorithms.
Conclusion
Traditional algorithm does not take into account the interference of the popular item rating to the
similarity calculation, which leads to the low accuracy of recommendation. In order to solve the
problem, this paper designs a recommendation model of rating distribution. The model can
reasonably distinguish rating weights of each item value. Based on this model, the RDRM-CF
380
algorithm is proposed. The experimental results show that the RDRM -CF algorithm improves the
performance compared with the traditional algorithm.
Reference
[1] He C, Parra D, Verbert K. Interactive recommender systems: A survey of the state of the art and
future research challenges and opportunities [J]. Expert Systems with Applications, 2016, 56: 9-27.
[2] Su X, Khoshgoftaar T M. A survey of collaborative filtering techniques [J]. Advances in artificial
intelligence, 2009, 2009: 4.
[3] Kim H N, Ji A T, Ha I, et al. Collaborative filtering based on collaborative tagging for enhancing
the quality of recommendation[J]. Electronic Commerce Research and Applications, 2010, 9(1):
73-83.
[4] Huang Chuangguang, Yin Jian, Wang Jing, et al. Uncertain neighbor’s collaborative filtering
recommendation algorithm [J]. Chinese Journal of Computers, 2010, 33(8): 1369-1377.
[5] Kaleli C. An entropy-based neighbor selection approach for collaborative filtering [J].
Knowledge-Based Systems, 2014, 56: 273-280.
[6] Wang Z, Yu X, Feng N, et al. An improved collaborative movie recommendation system using
computational intelligence [J]. Journal of Visual Languages & Computing, 2014, 25(6): 667-675.
[7] Jang S, Yang J, Kim D K. Minimum MSE design for multiuser MIMO relay. IEEE
Communications Letters, 2010, 14(9): 812-814
[8] Xu R, Wang S, Zheng X, et al. Distributed collaborative filtering with singular ratings for large
scale recommendation [J]. Journal of Systems and Software, 2014, 95: 231-241.
[9] Guo G, Zhang J, Yorke-Smith N. TrustSVD: Collaborative Filtering with Both the Explicit and
Implicit Influence of User Trust and of Item Ratings[C]//AAAI. 2015: 123-129.
[10] Luo X, Xia Y, Zhu Q, et al. Boosting the K-Nearest-Neighborhood based incremental
collaborative filtering [J]. Knowledge-Based Systems, 2013, 53: 90-99.
[11] Wang Xingmao. Research on collaborative filtering recommendation algorithm based on user
[D]. The PLA Information Engineering University, 2015.
381