Prévia do material em texto
2017 2nd International Conference on Software, Multimedia and Communication Engineering (SMCE 2017) ISBN: 978-1-60595-458-5 Ratings Distribution Recommendation Model-based Collaborative Filtering Recommendation Algorithm Tao-tao PAN*, Qin-rang LIU and Chang LIU China National Digital Switching System Engineering and Technological R&D Center, Zhengzhou 450002, China *Corresponding author Keywords: Collaborative filtering, Similarity, Popular ratings, Rating scale. Abstract. In order to solve the problem of the popular item ratings interfering in similarity calculation, we proposed the ratings distribution recommendation model. Based on this model, we designed a new collaborative filtering algorithm. According to ratings distribution, this algorithm firstly get the amount of information carried (The Shannon Entropy). Then, it calculated the rating weights to filter into traditional similarity calculation. The experimental results show that the algorithm can effectively alleviate the above problem and improve the performance of the algorithm. Introduction With the development of Internet technology, the problem of information overload is becoming more and more serious. It is very difficult for people to get the information they need from the ocean of knowledge [1] . In this context, the recommendation system arises at the historic moment. According to the different methods used in recommendation, recommendation system can be divided into content-based recommendation system, collaborative filtering recommendation system and hybrid recommendation system. Collaborative filtering algorithm first finds a set of users with similar interests and goals of the user as the neighbor users, then according to the neighbor users on the item's rating to predict the target user rating and recommendation. Collaborative filtering algorithm is widely used in recommendation system because it is not limited by the content of the recommended item [2] . Despite the great success of collaborative filtering technology in the field of recommender systems, there are still serious problems such as data sparsity [3] , which affect the performance of recommendation system. In order to improve the accuracy of recommendation, scholars have put forward some solutions. Huang Chuangguang [4] proposed a method to adaptively select the recommended target user group and on the basis of selected subgroups of uncertain neighbor trust as method, finally get the neighbor. Kaleli [5] uses the information of the user to get the information entropy of the user and the item rating, so as to adjust the choice of the nearest neighbor. The nearest neighbor selection is more reasonable by clustering method, in the study of Wang [6] . Some scholars study from the perspective of similarity calculation. Jang [7] and Xu [8] used the matrix filling to alleviate the sparsity, but the rating filled the subjectivity is too strong. Guo [9] adopt the method of matrix decomposition, the decomposition process will lose useful information. Luo [10] is introduced to improve the number of common rating similarity calculation accuracy. On the basis of previous studies, based on the rating of the individualized trend of value distribution, this paper analyzes the following problem: the interference of the popular item ratings to the similarity calculation, that is, the distribution probability is greater than the rating of the threshold Q rating. The popular ratings represent the public preference of the item. When the user's rating of the item is a popular rating, that is, the user's preference for this item belongs to the public preference, the significance of this evaluation rating is small, cannot effectively reflect the user's personal preferences. On the contrary, it can clearly reflect the user's individual preferences. The traditional similarity calculation does not consider the interference of the popular item rating. When the data is sparse, the interference is very obvious. 375 In order to solve the above problem, this paper designs a recommendation model based on the traditional recommendation model. In the model, the amount of the information of the rating in the item is estimated and the weight of the rating is distinguished, so as to reduce the interference to the similarity calculation. Based on this model, this paper proposes a collaborative filtering algorithm (RDRM-CF), which is based on the distribution model of the rating value distribution. The algorithm can effectively filter the interference of the popular item ratings and improve the accuracy of the algorithm. Related Concepts and Model Design Related Concepts and Definitions Aiming at the problems existing in the traditional algorithm, this paper defines the concept of "item rating heat value". (Definitions 1)Item rating heat: Item i in the rating value j (j=1, 2, 3, 4, 5) the frequency of use, set to j id . (Definitions 2)Item rating using probability: The heat value of the rating j in the item i accounts for the proportion of the sum of all the ratings, set to jip , and 5 1 1ji j p . The greater the probability of the use of the rating, the greater the interference in the calculation of similarity (Definitions 3)Item rating probability vector: In the item, the probability of 5 scoring values (1, 2, 3, 5,), which is denoted as ip , 1 2 3 4 5( , , , , )i i i i i ip p p p p p , is a vector, which reflects the personalized distribution trend of the rating value. (Definitions 4)Item rating information content: The amount of information carried by the rating value j in the item i (i.e., Shannon entropy) is set to j iH . The amount of information is a measure of the amount of information in the field of statistics, if the probability of occurrence of the event, the less the amount of information, that is, the amount of information is inversely proportional to the probability of occurrence. Assuming that the probability of occurrence of an event is p, the amount of information for this event is shown in equation (1): ( ) log( )H X p (1) (Definitions 5) Item rating weight: The weight of the rating j in the item i, which represents the degree of credibility of the rating value, is set to jitr . j itr and j iH are proportional to the larger the j iH , the greater the j itr . (Definitions 6) Item rating weight vector: The weights of the 5 ratings in the item (1, 2, 3, 5) were recorded in iTr , 1 2 3 4 5( , , , , )i i i i i iTr tr tr tr tr tr . Model Design The traditional algorithm does not take into account the interference of the item rating heat value in the calculation of similarity, which can not guarantee the reliability of the results of similarity calculation. In order to solve the above problem, this paper designs a recommendation model of rating value distribution, as shown in figure 1. In the statistics module statistics, the weight of the rating will be get. In similarity module the rating weight is embedded into the formula of similarity calculation, in order to avoid the interference of the popular item rating. 376 user-item rating matrix Statistical Module Ratings Probability of each rating in the Item The Shannon entropy of each rating in the Item The weight of each rating in the Item The weight vector of each Item Ratings Similarity Similarity Module Recom m endation M odule Figure 1. Rating value distribution recommendation model. Algorithm Design The Statistics of Each Rating Value in the Item If there are m users and n items in the system, the rating information can be expressed by the rating matrix X. X which represents the user u on the item I rating, with the 1~5 5 grades to express the user'spreference for the item, the user - item rating matrix as shown in formula (2): 1,1 1, ,1 , n m m n r r R r r (2) In the item, the distribution of the rating value presents a personalized trend, that is, the different values of the items in the item are different. The introduction of parameter ,u i t , [1, ]u m , if ,u ir j , {1,,2,3,4,5}j , then , 1u it , else , 0u it . The value of the i value of the item j j id as shown in formula (3). , 1 m j i u i u d t (3) Probability Statistics of Each Rating in the Item According to the value of the value of the calculation of the probability of heat, the item i rating value of the probability of jip , as shown in the formula (4) : 1 2 3 4 5 j j i i i i i i i d p d d d d d (4) Because the problem of data sparseness can affect the accuracy of the probability calculation of the rating value, this paper introduces the adjustment parameter t, t is a certain non negative number, the improved formula (4) as shown in formula (5): 1 2 3 4 5( ) 5 j j i i i i i i i d t p d d d d d t (5) The Amount of Information in Each Item in the Item According to the formula (1), it is known that the amount of information of each rating is shown in formula (6): 2log j j i iH p (6) 377 Weight Calculation of Each Rating in the Item The core idea of this section is to divide the corresponding weights according to the size of the value of the information, in order to filter the interference. The greater the probability of a rating value, the smaller the amount of information, the smaller the meaning, so the smaller the value of the rating. The formula of the rating value j itr and its probability j itr is shown in (7): 1 1 2 1= log , 0 j j j i i itr k H k p k (7) At the same time, due to the presence of 1 2 3 4 5( ) 5 j j i i i i i i i d t p d d d d d t . Because of parameter t, j ip is generally in the 0.2 floating up and down. In order to make the weight formula with a uniform standard, so that =0.2jip , its weight is 1, the weight formula into formula (8) shown: 2 2 log log 5 j j i i p tr (8) The weight vector of each rating information in the item i is shown in formula (9): 1 2 3 4 5( , , , , )i i i i i iTr tr tr tr tr tr (9) Similarity Calculation The traditional similarity calculation algorithm commonly used methods are the following three types: cosine similarity, modified cosine similarity and Pearson correlation coefficient. The similarity calculation method is improved based on the correlation coefficient of Pearson. The user u and user V use the Pearson correlation coefficient to calculate the similarity as shown in formula (10): , , , , , 2 2 , , ( )( ) ( , ) ( ) ( ) i u v i u v i u v u i u v i v I I u i u v i v I I I I r r r r sim u v r r r r (10) ,u ir , ,v ir , respectively, that the user u, V on the item I rating value. ur , vr , respectively, that the user u, the average rating of v. uI , vI , respectively, indicating that the user u, v has rating the item. ,u vI said the user u, v joint scoring items. The similarity calculation using Pearson correlation coefficient, the weight of each default item rating the same, but did not take into account the "interference rating value" in the similarity calculation process, the calculation result is not accurate enough. In this paper, the item rating value information is used to measure the weight of the different scoring values in the item, and the weight of the item rating weight vector iTr . is divided into different weights in the item I by iTr . The improved similarity calculation method is shown in formula (11): , , , , , , , , , 2 2 , , ( )( ) ( , ) ( ( )) ( ( )) u i v i i u v u i v i i u v i u v r r u i u v i v i i I I r r i u i u i v i v I I I I r r r r tr tr tsim u v tr r r tr r r (11) ,u ir itr is the weight of the rating value for the user u in the item i. ,u ir is the weight of the user's v rating ,v ir . Use the value of the amount of information to adjust the weights of the different items in the item, so that the similarity of the results of higher credibility. 378 Experimental Results and Analysis Data Sets and Comparison Algorithms Movielens_100K was used in the experiment. Movielens_100K stores 943 users for the 1682 films of the 100000 scoring. In the experiment, 20% of the data sets were randomly selected as the test set and the other as the training set, and then the performance of the following three algorithms was compared with that of the 80% algorithms. (1) The traditional collaborative filtering recommendation algorithm (Per-CF) based on the similarity of Pearson correlation coefficient; (2) Collaborative filtering recommendation algorithm (LFRM-CF) based on low pass filtering recommendation model proposed by [11]; (3) A collaborative filtering recommendation algorithm (RDRM-CF) proposed in this paper based on the rating value distribution recommendation model. Evaluation Criterion In this paper, the average absolute deviation (MAE) is used as the evaluation standard to evaluate the accuracy of the proposed algorithm: , , 1 N u i u i i p q MAE N (12) ,u ip , which represents the user's u forecast rating on the item i, ,u iq said the user u the true rating of the item i, N said the user u has been the number of items. Experiment and Result Analysis Experiment 1 Selection of threshold λ and parameter t In order to select the appropriate parameter t, MAE is the best. The experiment makes the nearest neighbor number N=30, take t=5, 10, 15, 20, 25, 35, 40,50 and observe the value of MAE, the experimental results in Movielens_100K data set are shown in Figure 4: Figure 4. Relationship between t and MAE. As can be seen from Figure 4: focus on the Movielens_100K data, for different number of nearest neighbors, when t=35, the best RDRM-CF algorithm, several experiments are the best when t=34, so the follow-up experiment t=34. Experiment 2 The influence of nearest neighbor number on algorithm accuracy Change the value of the number of neighbors K, Movielens_100K data sets to compare the accuracy of the three algorithms, the experimental results shown in Figure 5. 379 Figure 5. The relationship between the k and MAE. As can be seen from Figure 5: (1) The accuracy of the three algorithms increases gradually with the increase of k, and the accuracy of the algorithm tends to be stable around k=40. (2)The accuracy of RDRM-CF algorithm is higher than the other two algorithms. Analysis of experimental results: (1) When the number of neighbors increases, the useful information increases, and the accuracy of the algorithm is improved. When the number of neighbors is higher than 40, the accuracy of the algorithm tends to be stable. Therefore, the k=40. (2) Movielens_100K belongs to the high sparse data set, and the interference of the item heat rating is very obvious, which leads to the decrease of the accuracy of the similarity calculation. The RDRM-CF algorithm through the statistics item rating information value to reasonably classify the weight rating and filter the interference of the popular item ratings to the similarity calculation, so RDRM-CF algorithm accuracy ratio the other two methods. Experiment 3 The influence of sparsity on algorithm accuracy In order to further analyze the data set of sparsity effects on RDRM-CF algorithm, the random user rating to reduce set different sparsity, focuses on the comparison of three algorithms in Movielens_100K data accuracy, the experimental results are shown in figure 6. Figure 6. The relationship between sparsity and MAE sparsity. As can be seen from figure 6: With the increase of sparsity, the accuracy of the three algorithms is gradually decreased, but the accuracy of the algorithm RDRM-CF is also decreased, but it has been higher than the other two algorithms. Analysis of experimental results: With the increase of the sparsity, the rating information is reduced and the rating of the items is decreased rapidly, and the reliability of the result is reduced. At the same time, with the increases of sparsity, the interference of the popular item ratings to similarity calculation will be more obvious, and the RDRM-CF algorithm can reasonably divide the different weight of ratings, so the performance of RDRM-CF has been better than the other two algorithms. Conclusion Traditional algorithm does not take into account the interference of the popular item rating to the similarity calculation, which leads to the low accuracy of recommendation. In order to solve the problem, this paper designs a recommendation model of rating distribution. The model can reasonably distinguish rating weights of each item value. Based on this model, the RDRM-CF 380 algorithm is proposed. The experimental results show that the RDRM -CF algorithm improves the performance compared with the traditional algorithm. Reference [1] He C, Parra D, Verbert K. Interactive recommender systems: A survey of the state of the art and future research challenges and opportunities [J]. Expert Systems with Applications, 2016, 56: 9-27. [2] Su X, Khoshgoftaar T M. A survey of collaborative filtering techniques [J]. Advances in artificial intelligence, 2009, 2009: 4. [3] Kim H N, Ji A T, Ha I, et al. Collaborative filtering based on collaborative tagging for enhancing the quality of recommendation[J]. Electronic Commerce Research and Applications, 2010, 9(1): 73-83. [4] Huang Chuangguang, Yin Jian, Wang Jing, et al. Uncertain neighbor’s collaborative filtering recommendation algorithm [J]. Chinese Journal of Computers, 2010, 33(8): 1369-1377. [5] Kaleli C. An entropy-based neighbor selection approach for collaborative filtering [J]. Knowledge-Based Systems, 2014, 56: 273-280. [6] Wang Z, Yu X, Feng N, et al. An improved collaborative movie recommendation system using computational intelligence [J]. Journal of Visual Languages & Computing, 2014, 25(6): 667-675. [7] Jang S, Yang J, Kim D K. Minimum MSE design for multiuser MIMO relay. IEEE Communications Letters, 2010, 14(9): 812-814 [8] Xu R, Wang S, Zheng X, et al. Distributed collaborative filtering with singular ratings for large scale recommendation [J]. Journal of Systems and Software, 2014, 95: 231-241. [9] Guo G, Zhang J, Yorke-Smith N. TrustSVD: Collaborative Filtering with Both the Explicit and Implicit Influence of User Trust and of Item Ratings[C]//AAAI. 2015: 123-129. [10] Luo X, Xia Y, Zhu Q, et al. Boosting the K-Nearest-Neighborhood based incremental collaborative filtering [J]. Knowledge-Based Systems, 2013, 53: 90-99. [11] Wang Xingmao. Research on collaborative filtering recommendation algorithm based on user [D]. The PLA Information Engineering University, 2015. 381