IMPROVING MALWARE RETRIEVAL USING SEMANTIC-AWARE METRIC LEARNING

Santosh Kumar Kande

Keywords:

Malware, Information Retrieval, Semantic Space, Deep Learning, Multilabel

Abstract

This research presents an approach for improving the performance of a Malware Retrieval (MR) system through semantic-aware metric learning. The study uses labeled datasets obtained from VirusTotal, combining expert-verified labels with automated labels from antivirus engines. The MR system is trained with several models, including single-label and multi-label baselines, and the study introduces center-loss models with semantic components. Quantitative and qualitative evaluations show that the center-loss models outperform the baselines, especially in precision. In addition, class variance analysis confirms that center loss improves the discriminative power of the representation vectors. These results indicate that MR systems can incorporate semantic understanding and achieve improved performance in malware retrieval tasks.
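The abstract does not give the loss or the variance measure in detail, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. It uses the common formulation of center loss (half the squared Euclidean distance between each representation vector and its class center, averaged here over the samples) and a simple per-class variance (mean squared distance to the class mean). The function names, the toy vectors, and the single-label setting are all hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: names, data, and the single-label setting are
# assumptions, not the paper's implementation.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def center_loss(embeddings, labels, centers):
    """Average over samples of 0.5 * ||x_i - c_{y_i}||^2."""
    total = sum(squared_distance(e, centers[y]) for e, y in zip(embeddings, labels))
    return 0.5 * total / len(embeddings)

def class_variances(embeddings, labels):
    """Per-class mean squared distance of members to their class mean."""
    groups = {}
    for e, y in zip(embeddings, labels):
        groups.setdefault(y, []).append(e)
    result = {}
    for y, members in groups.items():
        dim = len(members[0])
        mean = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        result[y] = sum(squared_distance(m, mean) for m in members) / len(members)
    return result

# Toy representation vectors for two hypothetical malware families (0 and 1).
embeddings = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0]]
labels = [0, 0, 1, 1]
centers = {0: [1.0, 0.0], 1: [10.0, 11.0]}

loss = center_loss(embeddings, labels, centers)
variances = class_variances(embeddings, labels)
print(loss, variances)
```

A lower per-class variance means samples of the same family sit closer together in the representation space, which is the property a class variance analysis of this kind would examine.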
