Privacy Preserving MFI Based Similarity Measure for Hierarchical Document Clustering

P. Rajesh, G. Narasimha, N. Saisumanth
Published Date: July 05, 2012
Volume 2, Issue 4
Pages 7 - 12

Keywords: maximal frequent item set, apriori algorithm, hierarchical document clustering, equivalence relation
P. Rajesh, G. Narasimha, N. Saisumanth, "Privacy Preserving MFI Based Similarity Measure for Hierarchical Document Clustering". International Journal of Research in Computer Science, 2 (4): pp. 7-12, July 2012. doi:10.7815/ijorcs.24.2012.033


The rapid growth of the World Wide Web poses great challenges for researchers seeking to improve search efficiency over the internet. Web document clustering has therefore become an important research topic, since it helps surface the most relevant documents among the huge volumes of results returned in response to a simple query. In this paper, we first propose a novel approach that precisely defines clusters based on maximal frequent item sets (MFI) mined with the Apriori algorithm. We then use the same MFI-based similarity measure for hierarchical document clustering. Considering only maximal frequent item sets reduces the dimensionality of the document set. Second, we preserve the privacy of open web documents by avoiding duplicate documents, thereby protecting the individual copyrights of documents. This is achieved using an equivalence relation.
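The following is a minimal Python sketch of the pipeline outlined in the abstract, not the authors' implementation. The abstract does not define the similarity formula or the equivalence relation, so two assumptions are made here: similarity is taken as the Jaccard overlap of the MFIs each document contains, and two documents are treated as equivalent (duplicates) when they have identical term sets. All function names and the toy documents are hypothetical.

```python
from itertools import combinations

def frequent_itemsets(docs, min_support):
    """Level-wise (Apriori-style) search for term sets found in >= min_support docs."""
    level = [frozenset([t]) for t in sorted({t for d in docs for t in d})]
    frequent = []
    while level:
        # Keep only candidates that meet the support threshold.
        level = [c for c in level if sum(c <= d for d in docs) >= min_support]
        frequent.extend(level)
        # Join step: merge pairs of frequent k-sets into (k+1)-set candidates.
        level = list({a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1})
    return frequent

def maximal(frequent):
    """Keep item sets that have no frequent proper superset (the MFIs)."""
    mfis = [s for s in frequent if not any(s < t for t in frequent)]
    return sorted(mfis, key=sorted)  # deterministic order for indexing

def mfi_profile(doc, mfis):
    """Represent a document by the indices of the MFIs it fully contains."""
    return frozenset(i for i, m in enumerate(mfis) if m <= doc)

def similarity(p, q):
    """Assumed measure: Jaccard overlap of two MFI profiles."""
    union = p | q
    return len(p & q) / len(union) if union else 0.0

def duplicate_classes(docs):
    """Assumed equivalence relation: documents with identical term sets."""
    classes = {}
    for idx, d in enumerate(docs):
        classes.setdefault(frozenset(d), []).append(idx)
    return list(classes.values())

# Toy corpus: each document is a set of terms; document 2 duplicates document 0.
docs = [
    {"web", "search", "cluster"},
    {"web", "search", "query"},
    {"web", "search", "cluster"},
    {"web", "query", "index"},
]
mfis = maximal(frequent_itemsets(docs, min_support=2))
profiles = [mfi_profile(d, mfis) for d in docs]
```

In this toy corpus, five distinct terms reduce to two MFIs, which illustrates the dimensionality reduction the abstract mentions. The profiles could then be passed to any hierarchical clustering routine, after keeping one representative from each class returned by `duplicate_classes`.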

