Interesting project ideas - PSJoshi/Notes GitHub Wiki

Detecting malware even when it is encrypted - Machine Learning for network HTTPS analysis

With the growing volume of malware traffic carried over HTTPS, it is a challenge to find new features and methods that detect malware without decrypting the traffic. A detection method that does not need to decrypt the traffic is cheaper (because no traffic interceptor is needed), faster, and more private, respecting the original idea of HTTPS. Our research goal is to detect malware HTTPS connections using only data from Bro IDS logs [1], without decrypting the traffic.

We created and extracted our features from the logs that Bro IDS generates from a pcap file. Bro offers information about flows, SSL handshakes and X.509 certificates. These three types of data give us enough information to build strong features and train machine learning algorithms that detect malicious HTTPS traffic with good accuracy.

Our machine learning algorithm uses 30 different features. These features are divided into features for flows, features for SSL handshakes and features for X.509 certificates. One of our main contributions is that our data model is based on connection 4-tuples. A connection 4-tuple aggregates the group of flows which share the same SrcIP, DstIP, DstPort, and protocol. Therefore, each connection summarizes the behavior of the malware while connecting to the same C&C server. Such aggregation proved paramount for the success of our method.
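The 4-tuple aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the field names (`src_ip`, `dst_ip`, `dst_port`, `proto`, `duration`, `orig_bytes`, `resp_bytes`) are simplified stand-ins for flow-log columns, and the three summary values are assumed examples rather than the 30 features used in the research.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_4tuple(flows):
    """Group flow records by (src_ip, dst_ip, dst_port, proto) and summarize each group."""
    groups = defaultdict(list)
    for flow in flows:
        key = (flow["src_ip"], flow["dst_ip"], flow["dst_port"], flow["proto"])
        groups[key].append(flow)

    summaries = {}
    for key, members in groups.items():
        # Illustrative per-connection features (assumed, not the paper's feature set).
        summaries[key] = {
            "flow_count": len(members),
            "mean_duration": mean(f["duration"] for f in members),
            "total_bytes": sum(f["orig_bytes"] + f["resp_bytes"] for f in members),
        }
    return summaries


# Hypothetical flow records using documentation-range IP addresses.
flows = [
    {"src_ip": "10.0.0.5", "dst_ip": "203.0.113.7", "dst_port": 443, "proto": "tcp",
     "duration": 1.0, "orig_bytes": 500, "resp_bytes": 1500},
    {"src_ip": "10.0.0.5", "dst_ip": "203.0.113.7", "dst_port": 443, "proto": "tcp",
     "duration": 3.0, "orig_bytes": 700, "resp_bytes": 1300},
    {"src_ip": "10.0.0.5", "dst_ip": "198.51.100.2", "dst_port": 443, "proto": "tcp",
     "duration": 0.5, "orig_bytes": 100, "resp_bytes": 200},
]

summaries = aggregate_by_4tuple(flows)
for key, summary in summaries.items():
    print(key, summary)
```

Each key in `summaries` then stands for one connection, so a classifier would see one row per 4-tuple instead of one row per flow.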

A core part of our research was the production and selection of correct datasets. We used 13 datasets from the CTU-13 malware dataset [2] and 55 malware datasets from the Stratosphere Malware Capture Facility Project (done by Maria Jose Erquiaga) [3], and we produced 20 normal datasets of our own. Each dataset was processed to extract the Bro files from the original pcap files. Afterwards, each dataset was labeled using our expert knowledge. The amount of malware and normal traffic in our entire dataset is balanced.

Our detection method consisted of using and comparing several machine learning algorithms to learn how normal HTTPS traffic differs from malware HTTPS traffic based on our behavioral features. Our results show that malware HTTPS behavior is distinct from normal HTTPS behavior and that our methods are able to detect malware with good accuracy without decrypting the traffic.

Malicious network traffic datasets:

Interesting papers

Malware detection by analyzing network traffic with neural network - http://ecmlpkdd2017.ijs.si/papers/paperID193.pdf
Detection of https based malware traffic - https://dspace.cvut.cz/bitstream/handle/10467/68528/F3-BP-2017-Strasak-Frantisek-strasak_thesis_2017.pdf
Finding bots in network traffic - http://conferences.sigcomm.org/co-next/2012/eproceedings/conext/p349.pdf

Bro IDS projects

Brothon - https://pypi.python.org/pypi/brothon
bat - https://pypi.python.org/pypi/bat/0.3.1
Bat Github page - https://github.com/Kitware/bat/blob/master/notebooks/Anomaly_Detection.ipynb

Site search

Google Site Search will be shut down in less than two weeks, and its interface will no longer be available. It's time to look for alternatives. Elastic Site Search (from the creators of Elasticsearch) is one possible Google Site Search alternative that is easy to implement and provides relevant search right out of the box. Of course, you can also write your own search engine in Python, Go, or Ruby, including a web crawler that indexes all the web pages.

Desired features:

Replace existing Google Site Search
Index your website content and make it searchable
Guide web visitors to the content that's most relevant to them
Leverage robust search analytics to make informed decisions
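For the "write your own search engine" route, the indexing part can be sketched as a small inverted index. This is only a toy under stated assumptions: the page paths and texts are hypothetical, crawling is left out, and queries are treated as a plain AND over lowercase word tokens with no ranking.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each token to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for token in tokenize(text):
            index[token].add(url)
    return index


def search(index, query):
    """Return the pages that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical pages, as a crawler might have collected them.
pages = {
    "/notes/bro": "Parsing Bro IDS logs with Python",
    "/notes/search": "Building a site search engine in Python",
    "/notes/vpn": "Android VPN apps and privacy",
}

index = build_index(pages)
print(search(index, "python"))
print(search(index, "bro logs"))
```

A real replacement would also need relevance ranking and query analytics, which this sketch does not attempt.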

Benign and Malicious bots detection in Web traffic

Nowadays, a major portion of web traffic is attributed to automated robots (bots). Some of them perform useful tasks such as crawling and indexing web pages (e.g. Google/Bing bots) and are categorized as good bots. Others may be malicious and should be banned, as they may affect the performance of the website and the privacy of its information. The risks posed by malicious bots that intentionally try to evade intrusion protection systems and other security mechanisms cannot be ignored. There is an urgent need to identify these bots and characterize their behavior. A machine learning based approach that automatically detects and classifies benign and malicious bots in web traffic is desired to ease the burden on security administrators.
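As a starting point for such a classifier, per-client behavioral features could be derived from web server logs. The sketch below assumes already-parsed log records with hypothetical fields `ip`, `user_agent`, `path` and `status`; the four features are illustrative guesses at useful signals, not a validated feature set.

```python
from collections import defaultdict


def client_features(log_records):
    """Compute simple behavioral features per (ip, user_agent) client."""
    per_client = defaultdict(list)
    for record in log_records:
        per_client[(record["ip"], record["user_agent"])].append(record)

    features = {}
    for client, records in per_client.items():
        total = len(records)
        features[client] = {
            "requests": total,
            # Well-behaved crawlers usually fetch robots.txt; many others do not.
            "fetched_robots_txt": any(r["path"] == "/robots.txt" for r in records),
            # Share of responses with a 4xx/5xx status code.
            "error_ratio": sum(r["status"] >= 400 for r in records) / total,
            "distinct_paths": len({r["path"] for r in records}),
        }
    return features


# Hypothetical parsed access-log records.
log_records = [
    {"ip": "198.51.100.4", "user_agent": "ExampleBot/1.0", "path": "/robots.txt", "status": 200},
    {"ip": "198.51.100.4", "user_agent": "ExampleBot/1.0", "path": "/index.html", "status": 200},
    {"ip": "203.0.113.9", "user_agent": "Mozilla/5.0", "path": "/admin", "status": 403},
    {"ip": "203.0.113.9", "user_agent": "Mozilla/5.0", "path": "/wp-login.php", "status": 404},
    {"ip": "203.0.113.9", "user_agent": "Mozilla/5.0", "path": "/index.html", "status": 200},
    {"ip": "203.0.113.9", "user_agent": "Mozilla/5.0", "path": "/login", "status": 200},
]

features = client_features(log_records)
for client, values in features.items():
    print(client, values)
```

The resulting feature rows, together with labels, could then be fed to whichever learning algorithm is chosen.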

Useful links:

User authentication using system calls

Traditional authentication mechanisms require users to enter long and complex passwords. However, such passwords are difficult for humans to remember, and users often forget them. If a username/password pair is compromised through phishing or other means, an illegitimate user can access the computer without raising any suspicion. But if password-based verification of a user's identity is supplemented by other means, such as observing the individual's on-screen behavior, there is much less chance that an illegitimate user can escape detection.

In principle, a variety of activities on a computer can be monitored in order to authenticate users, such as keyboard activity, mouse movements, program running sequences and so on. It is also possible to monitor system calls, or requests for service made to the operating system. In fact, most of a user’s interactions with the desktop applications on a computer will result in one or more system calls. The hypothesis is that the sequence of calls generated by a user’s interaction with a computer can be used to uniquely identify him/her. The same idea can also be extended to identify malicious programs.
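One simple way to explore this hypothesis is to turn a system call trace into an n-gram frequency profile and compare profiles with cosine similarity. This is a sketch under assumptions: the call names are illustrative Linux-style names, the traces are made up, and n-gram profiles are only one of several possible sequence models.

```python
from collections import Counter
from math import sqrt


def ngram_profile(calls, n=3):
    """Count the overlapping n-grams in a sequence of system call names."""
    return Counter(tuple(calls[i:i + n]) for i in range(len(calls) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count profiles (0.0 if either is empty)."""
    dot = sum(a[gram] * b[gram] for gram in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical traces: an enrolled user's typical activity and two later sessions.
enrolled = ngram_profile(["open", "read", "write", "close"] * 5)
similar_session = ngram_profile(
    ["open", "read", "write", "close"] * 4 + ["socket", "connect", "send", "recv"]
)
different_session = ngram_profile(["socket", "connect", "send", "recv"] * 5)

print(cosine_similarity(enrolled, similar_session))
print(cosine_similarity(enrolled, different_session))
```

A session whose similarity to the enrolled profile falls below some threshold would be flagged for further checks; choosing that threshold is an open question here.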

There are a number of options for collecting system call traces. On Windows, there are Rohitab API Monitor and Microsoft Detours, a C++ API that allows for the monitoring and interception of system calls. In the Linux world, there is sysdig. Note that Microsoft Detours has both a free Express version and a Professional version. The primary advantages of the Professional version over the Express version are that it supports 64-bit programs and all Windows processors. However, since the plan is to concentrate on the x86 architecture, the Express version of Detours is more than adequate.

This blog entry summarizes the concept very well - https://www.coveros.com/monitoring-system-calls-for-active-authentication-with-detours/

In fact, the Defense Advanced Research Projects Agency (DARPA) invited proposals from various organizations for its Active Authentication program. Its main goal was to develop “novel ways of validating an identity of [a] person that focus on the unique aspects of the individual through the use of software-based bio-metrics.”

"Modelling behavior through system call sequences" is an active research topic and I plan to pursue it.

Privacy and Security risks in Android VPN apps

Millions of users rely on mobile VPN solutions for various reasons, such as circumventing censorship, accessing geo-blocked content, or accessing an organization's network in a secure way. Users are more concerned about security and privacy when operating from public environments. However, most users are unaware of the security/privacy settings and have no knowledge of their mobile traffic flows. A detailed, comprehensive analysis is needed that evaluates commonly used Android VPN applications on the basis of their permissions and their static/dynamic behaviour.

Good paper - https://www.ftc.gov/system/files/documents/public_comments/2018/08/ftc-2018-0052-d-0036-155000.pdf

Empirical analysis of Android Security Apps

Third-party security apps have become an integral part of the Android ecosystem. Many users install these applications to protect their devices from malware. However, these apps often demand access to resources such as storage, text messages and browser history, which may contain sensitive personal information. The behaviour of these applications needs to be studied and evaluated from different aspects, such as metadata, permissions, and static and/or dynamic analysis.

Good paper - https://arxiv.org/pdf/2007.03905.pdf

Automated forensic disk collection in cloud

Many cloud solutions provide scaling capabilities on demand. Many organizations take advantage of this when launching their services for the first time: interest among users is very high initially, then subsides until only serious users remain. From a security point of view, you should be able to quickly gather forensically sound disk and memory evidence if there is a security incident. The incident response (IR) team must be able to collect and analyze evidence quickly while maintaining accuracy for the time period surrounding the event. In a cloud environment, it is challenging and time consuming for the IR team to collect all the relevant evidence because there are large numbers of machine instances and accounts. If evidence is collected manually, one instance at a time, precious time is lost that could have been spent analyzing and responding to the event. Delays in data collection allow the attacker to keep working through systems in pursuit of malicious goals.

IR teams look for indicators of compromise (IoCs) to identify potentially suspicious activity within networks that warrants further investigation. Typically, these include file hashes, domains, IP addresses, or user agent strings. IoCs are used by many endpoint services/applications to help you discover potentially malicious activity in the network.
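Matching events against such indicators can be sketched as a set lookup per indicator type. Everything concrete below is a placeholder: the hash is a dummy value, the domain and IP addresses come from reserved documentation ranges, and the event field names (`sha256`, `domain`, `ip`, `user_agent`) are assumed rather than taken from any particular product.

```python
# Placeholder indicators, stored in lowercase for case-insensitive matching.
IOCS = {
    "sha256": {"a" * 64},
    "domain": {"bad.example"},
    "ip": {"203.0.113.9"},
    "user_agent": {"evilscanner/0.1"},
}


def match_iocs(event, iocs=IOCS):
    """Return (field, value) pairs of the event that match a known indicator."""
    hits = []
    for field, bad_values in iocs.items():
        value = event.get(field)
        if value is not None and value.lower() in bad_values:
            hits.append((field, value))
    return hits


# Hypothetical log events.
suspicious = {"ip": "203.0.113.9", "domain": "bad.example", "user_agent": "Mozilla/5.0"}
clean = {"ip": "198.51.100.20", "domain": "intranet.example", "user_agent": "Mozilla/5.0"}

print(match_iocs(suspicious))
print(match_iocs(clean))
```

Hits are reported in the order of the indicator types in `IOCS`, so an event that matches several indicators yields several pairs.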

The IR team must gather a point-in-time copy of relevant forensic data to determine the root cause and evaluate the likelihood of malicious event(s). This process involves gathering snapshots of all attached volumes, a live dump of the system’s memory, instance metadata, and any logs that relate to the instance. These sources help the IR team identify and work towards a root cause.

It is important to take a point-in-time snapshot of an instance as close in time to the incident as possible. A delay in capturing the snapshot can alter evidence or make it unusable, because data is changed or deleted in the meantime. To speed up the snapshot process, you need a way to automate the collection and delivery of potentially hundreds of disk images while ensuring that each snapshot is collected in the same way and without compromising the integrity of the evidence.
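One common way to support the integrity requirement (an assumption here, since no specific mechanism is named above) is to record a cryptographic hash of every collected image at acquisition time and re-check it later. The sketch below uses SHA-256 from Python's standard library; the file name and contents are made up, and a temporary file stands in for a real disk image.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large disk images need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths):
    """Record the SHA-256 digest of each collected evidence file."""
    return {str(path): sha256_of_file(path) for path in paths}


def verify_manifest(manifest):
    """Return the paths whose current digest no longer matches the manifest."""
    return [path for path, expected in manifest.items()
            if sha256_of_file(path) != expected]


# Demonstration with a stand-in file instead of a real snapshot.
with tempfile.TemporaryDirectory() as tmp:
    image = os.path.join(tmp, "volume-snapshot.img")
    with open(image, "wb") as handle:
        handle.write(b"disk image bytes")

    manifest = build_manifest([image])
    unchanged = verify_manifest(manifest)   # nothing modified yet

    with open(image, "ab") as handle:
        handle.write(b"tampered")
    changed = verify_manifest(manifest)     # the modified file is reported

print(unchanged, changed)
```

In an automated pipeline, the manifest would be written once when the snapshots are taken and stored separately from the images themselves.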

Need to think through approaches for automating forensic disk data collection across the various options.

Interesting link:

Memory dump processing (Volatility) automation - https://github.com/vavarachen/volatility_automation

friTap – Decrypting TLS Traffic on the Fly

In recent years, obtaining decrypted network traffic for forensic purposes and analysis has become a more and more challenging task, both for forensic researchers as well as law enforcement agencies. Current techniques such as SSL pinning may render established analysis approaches like MitM proxies useless and prevent investigators and researchers from getting insights into encrypted traffic – even with full access to the device. In many cases, the time-consuming process of reverse engineering the application of interest remained the only option to obtain the keys for decrypting the network traffic, which lays the foundation for further protocol research and tool development.

In this talk, we present friTap, a methodical approach to intercepting the generation of encryption keys used by TLS in order to decrypt all the traffic an application sends. friTap is an open source framework built on top of FRIDA and is able to decrypt TLS traffic on all major operating systems, including different CPU architectures.

Our approach enables researchers in network forensics to analyze widely used proprietary network protocols in advance in order to gain insight into their structure, identify existing artifacts and finally develop methods and tools to aid future forensic analyses. To support this process, friTap provides an easy-to-use way for researchers to create the decrypted test data they need.