Accurate Spam Detection Using ANN and ABC Algorithm
Social networks give users a way to stay in contact with their friends. Their growing popularity lets users gather large amounts of personal information about their friends. Visiting social networking sites is the fourth most popular activity on the Internet. Statistics from January 2017 put the number of internet users at around 3.7 billion, and about 73% of these users are active on some kind of online social media (OSM) website. OSM platforms take different forms depending on the type of content available on the site, such as personal blogs, microblogging services, forums, video sharing portals, and image hosting and sharing websites. Facebook is an online social media and social networking site, while Twitter, Sina Weibo and Tumblr are social networking and micro-blogging sites.
Unfortunately, this wealth of information, together with the ease of accessing it, attracts the attention of malicious groups. This is why these networks have been invaded by spammers.
Spam detection systems for social sites are designed to detect spammers using a machine learning approach.
1. YouTube
It is the world's leading user-driven video content provider and has become a major platform for broadcasting multimedia information. A major contributor to its success is the user-to-user social experience that differentiates it from traditional broadcast content.
Figure 1: Snapshot of a video and channel metadata on YouTube
Twitter is the most popular online micro-blogging and social networking website. Like other online social media websites, Twitter has low publishing barriers and permits users to post content in the form of tweets. Unlike YouTube, Twitter lets users post text and attach multimedia content such as images, URLs, and videos as external entities. A snapshot of the Twitter website, along with the different activities performed on it, is shown in the figure below.
Figure 2: Snapshot of Twitter
Text posts on Twitter are known as tweets. They are limited to 140 characters and are therefore called microposts. Whereas on YouTube tags are provided as additional information, on Twitter tags are an integral part of the tweet and are called hashtags (# followed by the keyword). Hashtags are among the most significant and commonly used entities on Twitter and serve several purposes, e.g. making tweets easy to find for other users who share the same interests, conveying opinions, and highlighting the post or its key phrases.
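As a minimal illustration of the "# followed by the keyword" convention, the sketch below pulls hashtags out of a tweet with a regular expression. The pattern, the function name `extract_hashtags` and the example tweet are assumptions made for illustration; they are not taken from the work described here, and Twitter's own hashtag rules are more detailed.

```python
import re

# Assumption: a hashtag is '#' followed by letters, digits, or underscores.
HASHTAG_PATTERN = re.compile(r"#(\w+)")

def extract_hashtags(tweet: str) -> list[str]:
    """Return the hashtag keywords in a tweet, lowercased, without the '#'."""
    return [tag.lower() for tag in HASHTAG_PATTERN.findall(tweet)]

# Hypothetical tweet used only to exercise the function.
example = "Win a FREE phone now! #Giveaway #free_stuff http://example.com"
print(extract_hashtags(example))
```

Lowercasing the keywords makes `#Giveaway` and `#giveaway` count as the same tag, which is usually what a text feature extractor wants.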
1. Spam Types
- Email Spam: Unwanted messages that are generated automatically and sent to a collected list of e-mail addresses are known as e-mail spam. With the widespread use of the internet, email has become one of the fastest forms of communication, so protecting email from spammers has become a critical issue. A spam protection mechanism is shown in the figure below. In this figure, a spam filter is used to separate spam mail from normal mail. This is made possible by training the system so that it can distinguish between genuine and unwanted mail.
Figure 3: Prevention mechanism to Email Spam
2. Phishing Spam
The main aim of a phishing spammer is to collect sensitive information such as usernames, passwords, and account details by masquerading as a legitimate entity in an electronic communication.
Figure 4: Phishing Spam
Figure 4 illustrates an example of phishing spam, using online banking as the example. Comparing the two images shows that the fake and real site URLs differ, and that the fake site does not provide the security code that the real site provides.
A computer virus is a malware program that, when executed, replicates itself by modifying other programs and inserting its own code into them. When this replication succeeds, the affected areas are said to be infected with a computer virus.
2. Spam Detection Techniques
Figure 5: Spam Detection Technique
1. Data Acquisition: Retrieving data from social networks requires exploring user groups to collect different types of information, such as links between users, uploaded and downloaded content, ratings, and post reviews. Three main schemes are used for mining data from social networks:
i. Network Traffic Analysis: This is a typical traffic sniffing approach that captures packets from a network link and then examines the request-response pairs in the network traces of a user's communication with a social network.
ii. Ad-hoc applications: In this model, a user does not interact directly with the application servers, because the social network infrastructure provides an interface layer between the user and the application.
iii. Crawling: This is a popular data acquisition technique for social sites. It mainly involves querying the user information that is available on social media.
2. Pre-processing: Real-world data is noisy, incomplete and inconsistent. It is therefore necessary to clean the collected data so that the actual information can be extracted from it. Lowercasing, lemmatization, punctuation removal, and short word removal are some of the processes used in the pre-processing step.
3. Feature Extraction: Every sentence is broken into a list of words. In MATLAB, the tokenize command has been used to perform the tokenization operation.
4. Feature Optimization: The artificial bee colony technique is used to optimize the extracted text features.
i. Artificial Bee Colony (ABC): In this algorithm, artificial bees fly through a large search space, select food sources based on their own experience and that of their nest mates, and adjust their positions accordingly. A number of bees (scouts) fly out and select food sources at random, without using experience. If the nectar value of a new source is higher than that of the previous one, the bees memorize the new position and forget the previous one.
5. Classification: An artificial neural network (ANN) is used to differentiate between spammers and genuine users. ANNs are biologically inspired computer programs designed to mimic the way in which the human brain processes information. Naïve Bayes is a supervised classification technique based on Bayes' theorem. It can work on large datasets with high accuracy and speed. In SVM, each data item is plotted as a point in n-dimensional space, with each feature value giving one coordinate, and a hyperplane is determined that separates normal text from spam text. In this research, SVM is used to separate the category of normal text from the spam text.
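The ABC behaviour described in step 4 (local search around food sources, greedy replacement when the nectar value is higher, and scouts that pick random sources) can be sketched as a search over 0/1 feature masks. This is a generic, simplified ABC written for illustration, not the implementation used in this work. The function names, the bit-flip neighbourhood, the parameter defaults and the toy fitness function are all assumptions; in a real system the fitness would more plausibly be a classifier's validation score on the selected features.

```python
import random

def abc_feature_selection(fitness, n_features, n_sources=10, limit=10,
                          n_iter=100, seed=0):
    """Search for a 0/1 feature mask that maximizes fitness(mask)."""
    rng = random.Random(seed)
    best_mask, best_score = None, float("-inf")

    def random_source():
        return [rng.randint(0, 1) for _ in range(n_features)]

    def record(k):
        # Remember the best food source seen so far.
        nonlocal best_mask, best_score
        if scores[k] > best_score:
            best_mask, best_score = sources[k][:], scores[k]

    def try_improve(k):
        # Visit a neighbouring source (one flipped bit); greedy selection
        # keeps it only if its nectar (fitness) is higher.
        candidate = sources[k][:]
        i = rng.randrange(n_features)
        candidate[i] = 1 - candidate[i]
        score = fitness(candidate)
        if score > scores[k]:
            sources[k], scores[k], trials[k] = candidate, score, 0
            record(k)
        else:
            trials[k] += 1

    sources = [random_source() for _ in range(n_sources)]
    scores = [fitness(s) for s in sources]
    trials = [0] * n_sources
    for k in range(n_sources):
        record(k)

    for _ in range(n_iter):
        # Employed bees: one local search per food source.
        for k in range(n_sources):
            try_improve(k)
        # Onlooker bees: prefer sources with more nectar.
        low = min(scores)
        weights = [s - low + 1e-9 for s in scores]
        for k in rng.choices(range(n_sources), weights=weights, k=n_sources):
            try_improve(k)
        # Scout bees: abandon exhausted sources for random new ones.
        for k in range(n_sources):
            if trials[k] > limit:
                sources[k] = random_source()
                scores[k] = fitness(sources[k])
                trials[k] = 0
                record(k)
    return best_mask, best_score

RELEVANT = {0, 2, 5}  # hypothetical indices of the informative features

def toy_fitness(mask):
    # Reward each relevant feature kept, penalize each irrelevant one.
    hits = sum(bit for i, bit in enumerate(mask) if i in RELEVANT)
    extras = sum(bit for i, bit in enumerate(mask) if i not in RELEVANT)
    return hits - 0.5 * extras

best_mask, best_score = abc_feature_selection(toy_fitness, n_features=8)
print(best_mask, best_score)
```

The `limit` parameter controls how many failed improvement attempts a source tolerates before a scout replaces it. Small values favour exploration, large values favour refining existing sources.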
The objectives of this work have been identified as follows
4. Proposed Model
Figure 6: Proposed Model
1. Input: Twitter posts are used as the input data.
2. Processing:
i. Normalization converts the text to lower case.
ii. Punctuation removal eliminates commas, brackets, full stops and other sentence separators.
iii. Stop word removal filters out words such as is, am, here, there, and those.
iv. Tokenization breaks the sentences down into single words. It also supports weighting each string according to its letters.
3. Classification: The ABC algorithm is used to identify optimal feature sets from spam and non-spam files, and an artificial neural network (ANN) is used to differentiate between spammers and real users.
4. Evaluation: The F-measure is a measure of a test's accuracy. It is the weighted harmonic mean of the precision and recall of the test.
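The processing steps listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the stop word set contains just the five words named in the text, tokenization is a plain whitespace split performed before stop word filtering, and the function name `preprocess` is hypothetical. Note that stripping all punctuation also removes the `#` and `@` characters, which a real tweet pipeline might want to keep.

```python
import string

# Assumption: a tiny illustrative stop word list; the text names only
# "is", "am", "here", "there", and "those".
STOP_WORDS = {"is", "am", "here", "there", "those"}

# Translation table that deletes every ASCII punctuation character.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(tweet: str) -> list[str]:
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    text = tweet.lower()                                # normalization
    text = text.translate(PUNCTUATION_TABLE)            # punctuation removal
    tokens = text.split()                               # tokenization
    return [t for t in tokens if t not in STOP_WORDS]   # stop word removal

# Hypothetical tweet used only to exercise the function.
print(preprocess("Here is a FREE prize, click (now)!"))
```

The resulting token list is the kind of input a feature optimizer and classifier would then consume.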
Figure 7: Comparison of accuracy vs. number of test samples
Figure 7 shows the accuracy of the proposed algorithm (ABC with ANN) alongside two existing classification algorithms, Naïve Bayes and SVM. The graph makes clear that when the machine learning scheme is trained with the optimized features, the accuracy of the system is higher than that of the individual classifiers. The average accuracies of the proposed work, Naïve Bayes and SVM are 99.14%, 94.26% and 91.36% respectively. This is a relative increase of 5.18% over Naïve Bayes and 8.52% over SVM.
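The reported increases are relative to each baseline, not differences in percentage points. A short check, using only the average accuracies quoted above (the helper name `relative_gain` is introduced here for illustration):

```python
def relative_gain(proposed: float, baseline: float) -> float:
    """Percentage increase of `proposed` over `baseline`."""
    return (proposed - baseline) / baseline * 100

# Average accuracies (in percent) as reported in the text.
accuracy = {"ABC+ANN": 99.14, "Naive Bayes": 94.26, "SVM": 91.36}

for name in ("Naive Bayes", "SVM"):
    gain = relative_gain(accuracy["ABC+ANN"], accuracy[name])
    print(f"{name}: {gain:.2f}%")
```

The same formula reproduces the percentage improvements quoted for precision, recall and F-measure from their respective averages.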
Figure 8: Comparison of precision vs. number of test samples
The precision of the proposed model is compared with that of the existing classifiers, Naïve Bayes and SVM. The graph makes clear that spam detection using the ABC algorithm hybridized with ANN performs better than Naïve Bayes and SVM. The average precision values for the proposed work, Naïve Bayes and SVM are 0.98, 0.95 and 0.94 respectively. Thus, when classifying spam in the designed model, ANN with the ABC algorithm improves precision by 3.16% relative to Naïve Bayes and by 4.26% relative to SVM.
Figure 9: Comparison of recall vs. number of test samples
The recall rates obtained by Naïve Bayes, SVM and the proposed work are depicted in Figure 9. The red, blue and green lines represent the recall rates observed over seven test samples for Naïve Bayes, the proposed work and SVM respectively. The average recall values for the three classification algorithms (ABC+ANN, Naïve Bayes and SVM) are 0.977, 0.944 and 0.942 respectively. The recall rate of the proposed work is thus 3.5% higher than that of Naïve Bayes and 3.72% higher than that of SVM.
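For readers unfamiliar with the two metrics, the sketch below computes precision and recall for the spam class using their standard definitions (true positives over predicted positives, and true positives over actual positives). The label lists are hypothetical and are not data from the experiments reported here.

```python
def precision_recall(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Precision and recall for the positive (spam = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels for eight tweets (1 = spam, 0 = genuine).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
print(precision_recall(y_true, y_pred))
```

In this toy case one genuine tweet is flagged as spam and one spam tweet is missed, so both metrics come out to 3/4.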
Figure 10: Comparison of F-measure vs. number of test samples
The F-measure is the harmonic mean of precision and recall. The F-measure values for the proposed work, Naïve Bayes and SVM are depicted in Figure 10. The average F-measure values for the proposed work, Naïve Bayes and SVM are 0.98, 0.94 and 0.93 respectively. The F-measure of the proposed work is thus 4.26% higher than that of Naïve Bayes and 5.38% higher than that of SVM.
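Written out, the weighted harmonic mean and its balanced special case are as follows, where P is precision, R is recall and β is the weight given to recall (the symbols are introduced here for illustration). The worked line plugs in the average precision and recall reported above for the proposed work, assuming the balanced case β = 1.

```latex
F_\beta = \frac{(1+\beta^2)\, P\, R}{\beta^2 P + R},
\qquad
F_1 = \frac{2 P R}{P + R}

% Worked example with P = 0.98 and R = 0.977:
F_1 = \frac{2 \times 0.98 \times 0.977}{0.98 + 0.977}
    = \frac{1.91492}{1.957}
    \approx 0.98
```

This agrees with the average F-measure of 0.98 reported for the proposed work. An average of per-sample F-measures need not equal the F-measure computed from average precision and recall, so small differences for the other classifiers are to be expected.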
6. Conclusions and Future Work
There are millions of users on social networks around the world. The ease of access to users, as well as to the information stored in their profiles, attracts spammers and other malicious users. In this research, we focused on the detection of social spam, specifically on the Twitter micro-blogging site. An ANN approach with ABC has been applied to differentiate between spam and non-spam tweets. In accordance with Twitter's spam policy, features of the text data are extracted using a tokenization mechanism. The experimental results demonstrate the effectiveness of the proposed work in terms of computed parameters such as error, execution time, accuracy, precision, recall and F-measure. The experiments show that pre-processing and optimization, combined with a classification technique, increase the accuracy of the spam detection system. The proposed system achieves an accuracy of about 99.14% in detecting spam on Twitter. Finally, a comparison between the proposed technique and existing classification algorithms (Naïve Bayes and SVM) has been provided. It shows that the proposed scheme (ABC with ANN) performs well compared to the individual classifiers (Naïve Bayes and SVM).
In future, the work can be extended to other social media datasets to identify spam.