Fast Data Clustering and Outlier Detection using K-Means Clustering on Apache Spark
The components of the information society are now present in every area of our lives. As computers have grown in
importance, the information they collect has taken on meaningful and specific qualities. Not only has the amount of
information increased, but so has the speed of access to it. Big Data is the result of transforming data collected from
different sources, such as social media posts, blogs, photos, videos, and log files, into a meaningful and workable form.
Machine learning methods for clustering are very useful on Big Data. Clustering partitions the data into groups so that very
similar records are placed in the same group. Once a dataset has been partitioned, outlier detection is used to find anomalous
records, such as fraudulent data. This study aims to speed up data clustering and outlier detection on Big Data by running the
K-means clustering method on Apache Spark. Because clustering Big Data can be time-consuming, the fast cluster-computing
architecture of Apache Spark is used, with the goal of a fault-tolerant, reliable, consistent, and fast clustering process.
MLlib, the machine learning library among the Spark components, has a relatively small code base and is easy to use; its goal
is to make practical machine learning scalable and useful. The K-means method included in MLlib, which is used in this study,
provides a successful analysis of Big Data. The results, obtained on a sample dataset, are presented in tables and graphs.
Index Terms— Apache Spark, Big Data, K-means Clustering, Outlier Detection.
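
The two steps named in the abstract, clustering followed by outlier detection, can be illustrated with a minimal single-machine sketch. It is written in plain Python and is not the Spark MLlib code used in the study; in the study, MLlib's K-means would take the place of the `kmeans` function. The function names, the toy dataset, the choice of the first k points as initial centers, and the outlier rule (flag any point whose distance to its nearest center exceeds `factor` times the mean of those distances) are all assumptions made for illustration, since the abstract does not state how outliers are scored.

```python
import math


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_center(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, k, iterations=10):
    """Lloyd's K-means; the first k points serve as initial centers (assumption)."""
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: put every point into the group of its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest_center(p, centers)].append(p)
        # Update step: move each center to the mean of its group.
        # An empty group keeps its previous center.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers


def find_outliers(points, centers, factor=3.0):
    """Flag points lying farther than factor * mean distance from their center.

    The threshold rule is an illustrative assumption, not the study's method.
    """
    distances = [
        math.sqrt(squared_distance(p, centers[nearest_center(p, centers)]))
        for p in points
    ]
    threshold = factor * sum(distances) / len(distances)
    return [p for p, d in zip(points, distances) if d > threshold]


# Hypothetical toy data: two tight groups and one distant point.
points = [(0, 0), (10, 10), (0, 1), (1, 0), (1, 1),
          (10, 11), (11, 10), (11, 11), (30, 30)]
centers = kmeans(points, k=2)
outliers = find_outliers(points, centers)
print("centers:", centers)
print("outliers:", outliers)
```

On this toy data the only flagged point is (30, 30). The sketch also shows why the threshold matters: the distant point pulls its cluster center toward itself, so a rule that is too strict would start flagging ordinary members of that cluster.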