Bayesian Classification for Android Applications
A. Feature Extraction
Many features can be extracted from an application: permissions, strings, API calls, Android Market data, and so on. We select features based on two considerations: (i) the category of an application is determined by its functions; (ii) a feature is suitable for classification only if it reflects the functions of an application. Based on an in-depth analysis of the relationship between application categories and features, we extract three kinds of features: (i) the permissions actually used by the application, (ii) the strings contained in the application that reflect its functions, and (iii) the description of the application on Android Market.
1) Used Permissions Extraction: To implement application functions, Android developers call the APIs provided by the SDK and declare the corresponding permissions in the manifest.xml file. Permissions are therefore a good feature for Android application categorization. However, because of limited development experience, incomplete documentation, and other reasons, some permissions declared in manifest.xml are never actually used by the application, so the declared permissions do not fully reflect the real characteristics of the application. Hence, unlike previous work that uses the permissions declared in manifest.xml, we extract the permissions that are actually used. Au et al. generated a call graph of the Android source code and derived from it the permissions required by each API. By extracting the APIs called by an application and combining them with the mapping data provided by Au et al., we obtain the actually used permissions. The extraction steps are as follows:
1) Disassemble the Android application APK to get the classes.dex file.
2) Disassemble the classes.dex file into its constituent .smali files.
3) Mine the .smali files to extract the called APIs.
4) Combine the extracted APIs with the permission mapping data provided by Au et al. to get the actually used permissions.
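The last two steps of the permission extraction (mining the .smali files and joining the called APIs with a permission mapping) can be sketched as follows. This is a minimal illustration and not the implementation used in this paper. It assumes the application has already been disassembled into .smali files. The function names are hypothetical, and the mapping is assumed to be a plain dictionary from fully qualified method names to permission names, which is not necessarily the format of the data published by Au et al.

```python
import re
from pathlib import Path

# Matches smali invoke instructions such as
#   invoke-virtual {v0}, Landroid/telephony/TelephonyManager;->getDeviceId()Ljava/lang/String;
# and captures the class descriptor and the method name.
INVOKE_RE = re.compile(r"^\s*invoke-\S+\s+\{[^}]*\},\s*L([^;]+);->([^(\s]+)\(")


def called_apis_in_text(smali_text):
    """Return the set of 'package.Class.method' names invoked in one smali file."""
    apis = set()
    for line in smali_text.splitlines():
        match = INVOKE_RE.match(line)
        if match:
            class_name = match.group(1).replace("/", ".")
            apis.add(f"{class_name}.{match.group(2)}")
    return apis


def called_apis_in_dir(smali_root):
    """Collect invoked APIs from every .smali file below smali_root."""
    apis = set()
    for path in Path(smali_root).rglob("*.smali"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        apis |= called_apis_in_text(text)
    return apis


def used_permissions(apis, api_to_permissions):
    """Join called APIs with an API-to-permission mapping (hypothetical dict format)."""
    permissions = set()
    for api in apis:
        permissions |= set(api_to_permissions.get(api, ()))
    return permissions
```

Calls to methods that do not appear in the mapping contribute no permission, so only the permissions reachable through the application's own API calls are reported, regardless of what the manifest declares.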
2) String Extraction: The strings contained in an application are mainly of two kinds. The first kind consists of strings embedded in program components to indicate the function of a component (e.g., Chats, Moments, Contacts). According to our analysis, one of the most important characteristics of these strings is that they contain fewer than three words. The second kind consists of simple sentences that prompt the user with interactive information (e.g., Network unavailable, Check network). The first kind of string reflects the functions of the program and is therefore an important feature for Android application categorization.
In contrast, according to our analysis, over 90% of the strings of the second kind are prompt messages whose meanings are common to most applications. Moreover, because developers have individual writing styles, most messages with the same meaning are phrased in completely different ways. These strings do not improve the classification result; instead, they reduce classification accuracy. Thus, in this paper, we use the first kind of strings as classification features and filter out the second kind. Components are typically defined in the layout files under the resources/layout* directories, and the strings referenced by the components are typically defined in resource files (e.g., strings.xml). Hence, by extracting the strings that contain fewer than three words and are referenced by components in the layout files, we obtain string features that reflect application functions. We also use stanfordnlp to remove stop words and to stem the remaining words. As shown in Fig. 2, the string extraction steps are as follows:
1) Disassemble the Android application to get the resource files.
2) Extract the string features by analyzing all the layout files under the resources/layout* directories.
3) Use stanfordnlp to remove stop words and stem words.
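The layout-based filtering described above can be illustrated with a short sketch. It is not the implementation used in this paper. It assumes the resource files have already been decoded to plain XML, that components reference strings through attribute values of the form @string/name, and that the caller supplies the file contents as text. The function names are hypothetical, and the stop-word removal and stemming step is omitted.

```python
import xml.etree.ElementTree as ET


def parse_string_resources(strings_xml_text):
    """Map each <string name="..."> entry of a strings.xml file to its text."""
    root = ET.fromstring(strings_xml_text)
    return {
        element.get("name"): "".join(element.itertext()).strip()
        for element in root.iter("string")
        if element.get("name")
    }


def referenced_string_names(layout_xml_text):
    """Return the names of all @string/... resources referenced in a layout file."""
    root = ET.fromstring(layout_xml_text)
    names = set()
    for element in root.iter():
        for value in element.attrib.values():
            if value.startswith("@string/"):
                names.add(value[len("@string/"):])
    return names


def short_component_strings(layout_texts, strings, max_words=3):
    """Keep referenced strings that contain fewer than max_words words."""
    features = set()
    for layout_text in layout_texts:
        for name in referenced_string_names(layout_text):
            value = strings.get(name)
            if value and len(value.split()) < max_words:
                features.add(value)
    return features
```

Only strings that a layout component actually references are considered, and the word-count threshold then drops longer sentences. Short prompts of one or two words still pass this filter, so the threshold is a heuristic rather than an exact separation of the two kinds of strings.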
3) Application Description Extraction: To let users know what an application does, its publisher introduces it on Android Market through pictures, text, or video. The text description contains a large number of features that reflect the functions of the application and is therefore an important feature for Android application categorization. To retrieve descriptions from Android Market, we use an open-source, unofficial API called android-market-api.
B. Bayesian Classification
The Bayesian method is an efficient supervised learning algorithm. Its most important capability is that, given the historical datasets we provide, it can calculate the probability that a new sample belongs to a certain category (e.g., the probability that an application belongs to a certain category). The Bayesian method is also highly scalable: the number of parameters it requires is linear in the number of variables (features/predictors) in a learning problem. Because the Bayesian method is efficient and well suited to our dataset, we employ it as the classification algorithm in this paper.
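To make the classification step concrete, the following sketch shows one common Bayesian classifier, a multinomial naive Bayes model with Laplace smoothing. The text above does not specify the exact variant, so the naive independence assumption, the smoothing parameter alpha, the class name NaiveBayes, and the category labels in the usage example are all assumptions made for illustration. Each application is represented as a list of feature tokens (used permissions, short strings, description words).

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over token features (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                          # Laplace smoothing constant
        self.class_counts = Counter()               # training samples per category
        self.feature_counts = defaultdict(Counter)  # token counts per category
        self.vocabulary = set()                     # all tokens seen in training

    def fit(self, samples, labels):
        for features, label in zip(samples, labels):
            self.class_counts[label] += 1
            self.feature_counts[label].update(features)
            self.vocabulary.update(features)
        return self

    def log_scores(self, features):
        """Return log P(category) + sum of log P(token | category) per category."""
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, count in self.class_counts.items():
            counts = self.feature_counts[label]
            denominator = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(count / total)
            for token in features:
                if token in self.vocabulary:  # tokens unseen in training are ignored
                    score += math.log((counts[token] + self.alpha) / denominator)
            scores[label] = score
        return scores

    def predict(self, features):
        scores = self.log_scores(features)
        return max(scores, key=scores.get)
```

A small usage example with hypothetical categories:

```python
model = NaiveBayes().fit(
    [["android.permission.CAMERA", "photo", "filter"],
     ["android.permission.SEND_SMS", "chats", "contacts"],
     ["android.permission.CAMERA", "gallery"]],
    ["Photography", "Communication", "Photography"],
)
category = model.predict(["photo", "android.permission.CAMERA"])
```

The model stores one count per (category, token) pair, which is why the number of parameters grows linearly with the number of features.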