IJACEN Fast Data Clustering and Outlier Detection using K-Means Clustering on Apache Spark

Journal Paper

Paper Title : Fast Data Clustering and Outlier Detection using K-Means Clustering on Apache Spark

Author : Yadigar Erdem, Caner Ozcan

Article Citation : Yadigar Erdem, Caner Ozcan (2017), "Fast Data Clustering and Outlier Detection using K-Means Clustering on Apache Spark", International Journal of Advance Computational Engineering and Networking (IJACEN), pp. 86-90, Volume-5, Issue-7

Abstract : The components that form the information society are now present in all areas of our lives. As computers have become more important in daily life, the information they collect has taken on meaningful and specific qualities. Not only has the amount of information increased, but so has the speed of access to it. Big Data is data gathered from different sources such as social media posts, blogs, photos, videos and log files, transformed into a meaningful and workable form. Clustering Big Data with machine learning methods is very useful: clustering separates the data into groups so that very similar records fall in the same group. Once the dataset has been divided into clusters, outlier detection is used to find fraudulent data. This study aims to make data clustering and outlier detection on Big Data faster by using the K-means clustering method on Apache Spark. Because clustering Big Data can be time consuming, the study uses Apache Spark's fast cluster computing architecture, with the aim of performing fault-tolerant, reliable, consistent and fast clustering. MLlib, one of Spark's components, has a relatively small code size and is easy to use; its goal is to make practical machine learning scalable and useful. The K-means method included in MLlib, which is used in this study, provides a successful analysis of Big Data. The results are presented in tables and graphs using a sample dataset.

Index Terms: Apache Spark, Big Data, K-means Clustering, Outlier Detection.
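The two steps the abstract describes, K-means clustering followed by outlier detection within the resulting clusters, can be illustrated with a minimal single-machine sketch. This is plain Python, not Spark MLlib, and it is not the authors' implementation. The abstract does not state how outliers are identified, so the rule used here (a point is an outlier when its distance to its centroid exceeds a multiple of the cluster's mean distance) is an assumption, as are the function names `kmeans` and `find_outliers`, the `factor` parameter, and the choice to seed centroids with the first `k` points.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's k-means on a list of coordinate tuples.

    Assumption: the first k points seed the centroids, which keeps the
    sketch deterministic. Production libraries use smarter seeding.
    """
    centroids = [tuple(p) for p in points[:k]]

    def assign():
        # Assignment step: index of the nearest centroid for each point.
        return [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]

    for _ in range(iterations):
        assignments = assign()
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    # Final assignment against the final centroids.
    return centroids, assign()


def find_outliers(points, centroids, assignments, factor=2.0):
    """Return points lying unusually far from their own centroid.

    Assumption: "unusually far" means more than `factor` times the mean
    point-to-centroid distance of that cluster; 2.0 is an arbitrary default.
    """
    distances = [
        math.dist(p, centroids[a]) for p, a in zip(points, assignments)
    ]
    # Mean point-to-centroid distance per cluster.
    mean_distance = {}
    for c in range(len(centroids)):
        cluster = [d for d, a in zip(distances, assignments) if a == c]
        if cluster:
            mean_distance[c] = sum(cluster) / len(cluster)
    return [
        p
        for p, d, a in zip(points, distances, assignments)
        if d > factor * mean_distance[a]
    ]
```

Calling `kmeans(points, 2)` and passing its two return values to `find_outliers` gives the list of flagged points. On a cluster, the assignment and update steps are the parts that a framework such as Spark distributes across workers.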

Type : Research paper

Published : Volume-5, Issue-7


Published on 2017-09-08
