PHP Vulnerability Detection Based on Taint Analysis
INTRODUCTION
PHP is widely used for Web services, but the language has inherent weaknesses that make it easy to introduce vulnerabilities during programming, most commonly SQL injection and XSS (Cross-Site Scripting) vulnerabilities. According to the 2017 data from OWASP (Open Web Application Security Project), injection attacks rank first among the top ten Web application security vulnerabilities, and Cross-Site Scripting ranks third. Both vulnerabilities arise mainly because the validity of user input is not properly verified, so this type of vulnerability is called a tainted vulnerability.
Several PHP static analysis tools are in use in industry, such as Pixy, RIPS, PHP-Sat, WAP, and Fortify SCA. Pixy is an open-source static detection tool for PHP written in Java. It detects vulnerabilities effectively through data flow analysis, but it supports only PHP 4 syntax and not the syntax features of PHP 5 and later. Pixy mainly targets XSS vulnerabilities and performs less well on other vulnerability types. RIPS is a static analysis tool for PHP that is itself written in PHP and works on the token stream. It uses PHP's built-in token_get_all() to split the PHP code into tokens, converts them into an intermediate format that is easier to analyze, and then applies taint analysis to detect vulnerabilities. WAP (Web Application Protection) is a PHP vulnerability detection tool written in Java. It analyzes the source code by taint analysis, and its advantage is that it can automatically repair the vulnerabilities it identifies.
In addition, many researchers have designed related PHP static analysis tools. Balzarotti built on Pixy and added string analysis to obtain an approach that combines static and dynamic analysis of PHP code.
Rimsa used e-SSA as an intermediate representation to perform taint analysis of PHP. Y. Zheng and X. Zhang proposed path- and context-sensitive interprocedural analysis methods to detect vulnerabilities. However, these tools still have many problems: an incomplete intermediate representation loses a lot of important information, some tools are not fully compatible with the syntax features of newer PHP versions, and the false negative and false positive rates are relatively high.
In this paper, building on this earlier work, we use PHP-Parser, which is compatible with the recent and widely used PHP 5 and PHP 7, to perform lexical and grammatical analysis of PHP. PHP-Parser produces an AST (Abstract Syntax Tree) with very complete information. We then build the CFG (Control Flow Graph) from the AST. Finally, we perform fine-grained taint propagation analysis and detect possible vulnerabilities. The experimental results show that this method does find a number of vulnerabilities.
LEXICAL AND GRAMMATICAL ANALYSIS
For taint analysis, the PHP source code must first be transformed into an intermediate representation that suits the analysis. In this paper, we use the CFG as the intermediate representation. The CFG is generated from the AST, which reflects the program structure, so we first perform lexical and grammatical analysis of the PHP source code.
There are several ways to generate the AST. Some approaches use HHVM, a virtual machine that implements the PHP language; others are based on ANTLR, a parser generator that can automatically build an AST from its input. In this paper, we use PHP-Parser to perform lexical and grammatical analysis of PHP. PHP-Parser is mainly used to generate ASTs, which is a great help for PHP static analysis. The AST is large, with roughly 140 different node types. To simplify the analysis, they are grouped into three categories:
(1) Statement nodes, which do not return a value and cannot occur in an expression. For example, a class definition is a statement.
(2) Expression nodes, which return a value and can therefore occur in other expressions, e.g. $var and func().
(3) Scalar nodes, such as 'string' or magic constants like __FILE__.
A few nodes belong to none of these groups, such as names and call arguments.
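The grouping above can be illustrated with a minimal sketch. It assumes that each AST node carries a type string with a category prefix, in the style of the strings PHP-Parser reports for its nodes (for example Stmt_Class or Expr_Variable); the exact strings used here and the helper name classify_node are illustrative assumptions, not part of the tool described in this paper.

```python
def classify_node(node_type: str) -> str:
    """Map a node type string to one of the categories listed above.

    The prefixes are an assumption modelled on PHP-Parser style type
    names; classify_node itself is a hypothetical helper.
    """
    if node_type.startswith("Stmt_"):
        return "statement"      # no value, cannot appear inside an expression
    if node_type.startswith("Scalar_"):
        return "scalar"         # literals and magic constants
    if node_type.startswith("Expr_"):
        return "expression"     # yields a value, can nest in other expressions
    return "other"              # e.g. names and call arguments


if __name__ == "__main__":
    for name in ["Stmt_Class", "Expr_Variable", "Scalar_String", "Name", "Arg"]:
        print(name, "->", classify_node(name))
```

A later analysis pass could use such a category to decide, for instance, whether a node becomes a CFG node of its own (statements) or is only inspected as an operand (expressions and scalars).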
THE CFG GENERATION ALGORITHM
Given the PHP source code, we first convert it to an AST with PHP-Parser. The AST is not well suited to data flow analysis directly, because a program contains branches, loops, jumps, conditional expressions, and other structures that imply discontinuous control flow. Therefore, we transform the AST into a CFG and then perform data flow analysis on the CFG. The specific process is as follows:
(1) Obtain the AST array generated by PHP-Parser. If it contains only one statement and that statement is a class definition, extract the functions inside the class, pass each of them to the CFG constructor construct_cfg(), and build the CFG of that PHP function. Then build the CFG of the main program of the file.
(2) Construct CFG nodes for the statements in the AST, including assignment expressions, unset, global, break, and return, as well as function calls and method calls. These statements are relatively simple and are handled by calling CFGNodeStmt() directly. Conditional and loop statements are more complex, so they are handled separately.
(3) Create an empty entry node as the current node. Then traverse each statement in the AST array variable $stmts and determine its type. For example, if the statement is an assignment, first build a statement node and connect the current node to it, with the current node as the parent and the assignment node as the child. The assignment node then becomes the current node, and the next statement is processed.
(4) When all statements have been traversed, construct an exit node and connect it to the last processed node, which becomes the parent of the exit node. Finally, return the constructed CFG.
This algorithm converts the AST into a CFG, on which the data flow analysis is performed.
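Steps (3) and (4) can be sketched for the simple-statement case as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: statements are represented as plain strings rather than AST nodes, conditionals and loops are omitted, and the class CFGNode and the helpers link and walk are hypothetical names. Only construct_cfg borrows its name from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class CFGNode:
    """One CFG node; `label` stands in for the statement it represents."""
    label: str
    parents: List["CFGNode"] = field(default_factory=list)
    children: List["CFGNode"] = field(default_factory=list)


def link(parent: CFGNode, child: CFGNode) -> None:
    """Add a directed edge parent -> child."""
    parent.children.append(child)
    child.parents.append(parent)


def construct_cfg(stmts: List[str]) -> CFGNode:
    """Build a straight-line CFG for simple statements; return the entry node."""
    entry = CFGNode("ENTRY")          # step (3): empty entry node
    current = entry
    for stmt in stmts:
        node = CFGNode(stmt)          # build a statement node
        link(current, node)           # current node is the parent
        current = node                # the new node becomes the current node
    exit_node = CFGNode("EXIT")       # step (4): exit node after the last statement
    link(current, exit_node)
    return entry


def walk(entry: CFGNode) -> List[str]:
    """Follow first-child edges from the entry node and collect the labels."""
    labels = []
    node = entry
    while True:
        labels.append(node.label)
        if not node.children:
            return labels
        node = node.children[0]


if __name__ == "__main__":
    cfg = construct_cfg(["$id = $_GET['id']", "echo $id"])
    print(" -> ".join(walk(cfg)))
```

Storing both parents and children on each node makes the later forward analysis straightforward, since a node can look up the states of its predecessors directly.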
TAINT ANALYSIS PROCESS
Taint analysis is a practical form of data flow analysis. Over the past few decades, data flow analysis has been an important research direction in information security, and a lot of work has gone into developing data flow strategies. In this paper, we perform a flow-sensitive forward taint analysis on the generated CFG. We first obtain the generated mainCFG and functionCFG and initialize the pre-defined taint source information. We then perform fine-grained taint analysis on the mainCFG and functionCFG, analyzing one CFG node at a time until all nodes have been traversed. Finally, we print the taint types and the tainted variables to determine the vulnerabilities.
A. Identify the Taint Source
A taint source is a point where untrustworthy data is introduced directly into the system. Identifying the taint sources is the precondition for taint propagation analysis. This paper uses a heuristic strategy to identify them: all data that enters the program from outside is considered tainted, because it may carry a malicious attack. When a sensitive function in the program receives untrustworthy data supplied by a user, a tainted vulnerability may result. We therefore need to model both the user input parameters and the MySQL sensitive functions. Direct user input is obtained through the superglobal variables, which include $_GET, $_POST, $_COOKIE, and so on. The MySQL sensitive functions include mysql_query, mysql_fetch_array, etc. These are added as pre-defined taint sources in two categories: the first contains the superglobal variables, and the second contains the SQL functions. We then define two functions, one per category, which take the parameters propagated through the CFG and determine whether they appear among the pre-defined taint sources of that category.
If a parameter appears among the pre-defined taint sources, it may contain tainted data and is marked as a taint source. We also determine which category it belongs to: the first category is marked userTainted and the second is marked secretTainted.
B. Perform Flow-sensitive Forward Taint Analysis on the CFG
After the taint sources have been defined, we perform taint propagation analysis on the generated CFG, that is, we track the propagation path of tainted data through the program. In this paper, the taint propagation analysis on the CFG proceeds as follows. First, we construct the CFG taint map, an abstract object store created with PHP's built-in SplObjectStorage class. The map holds the set of tainted variables for each CFG node. The nodes of the CFG are placed in a queue. Initially the entry node is the current node, and we check whether the taint map already contains it. If it does not, the current node is added to the taint map and taint analysis is performed on it. We then traverse the CFG: we take the child nodes of the current node and analyze them in turn, until all nodes have been traversed and analyzed.
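The two steps above, source identification and forward propagation over the CFG, can be sketched together. This is a simplified model under several assumptions: each CFG node is reduced to an assignment of the form target = f(operands), a plain dictionary plays the role of the taint map, and the node class and helper names are hypothetical. The traversal is also a standard fixed-point worklist that revisits a node whenever the state of a predecessor changes, which is a swapped-in technique rather than the single visit per node described in the text; it is used here so that the sketch stays correct on branches and loops. Only the source names and the labels userTainted and secretTainted come from the text.

```python
from collections import deque
from typing import Dict, Iterable, Optional, Set

USER_SOURCES = {"$_GET", "$_POST", "$_COOKIE"}           # superglobals named in the text
SECRET_SOURCES = {"mysql_query", "mysql_fetch_array"}     # SQL functions named in the text

State = Dict[str, Set[str]]   # variable name -> set of taint labels


def source_label(name: str) -> Optional[str]:
    """Return the taint category of a pre-defined source, or None."""
    if name in USER_SOURCES:
        return "userTainted"
    if name in SECRET_SOURCES:
        return "secretTainted"
    return None


class Node:
    """Hypothetical CFG node modelling `target = f(operands)`."""

    def __init__(self, name, target=None, operands=()):
        self.name = name
        self.target = target
        self.operands = list(operands)
        self.parents = []
        self.children = []


def link(parent: Node, child: Node) -> None:
    parent.children.append(child)
    child.parents.append(parent)


def merge(states: Iterable[State]) -> State:
    """Join predecessor states: a variable keeps every label seen on any path."""
    merged: State = {}
    for state in states:
        for var, labels in state.items():
            merged.setdefault(var, set()).update(labels)
    return merged


def transfer(node: Node, state: State) -> State:
    """Compute the taint state after executing `node`."""
    out = {var: set(labels) for var, labels in state.items()}
    if node.target is None:
        return out
    labels: Set[str] = set()
    for op in node.operands:
        label = source_label(op)
        if label:
            labels.add(label)                # operand is a taint source
        labels |= state.get(op, set())       # operand is an already tainted variable
    if labels:
        out[node.target] = labels
    else:
        out.pop(node.target, None)           # overwritten with untainted data
    return out


def analyze(entry: Node) -> Dict[Node, State]:
    """Forward, flow-sensitive propagation; returns the taint map per node."""
    taint_map: Dict[Node, State] = {}
    queue = deque([entry])
    while queue:
        node = queue.popleft()
        in_state = merge(taint_map[p] for p in node.parents if p in taint_map)
        out_state = transfer(node, in_state)
        if taint_map.get(node) != out_state:  # new or changed: revisit successors
            taint_map[node] = out_state
            queue.extend(node.children)
    return taint_map
```

For a chain of nodes modelling $id = $_GET[...], $row = mysql_fetch_array(...), and $msg = $id . $row, the state after the last node marks $msg with both labels, while a later assignment of a constant to $id removes $id from the state. That removal is what makes the sketch flow-sensitive.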
TESTING AND EVALUATION
DVWA (Damn Vulnerable Web Application) is used as the test set. DVWA is a Web application developed in PHP and MySQL. It is an experimental vulnerability platform that covers SQL injection, XSS, and other vulnerabilities. Because the number of vulnerabilities it contains is known, false positives and false negatives can be identified accurately. We tested 16 programs covering SQL injection and XSS. The programs are divided into four groups of four programs each, and the vulnerability levels within each group are low, medium, high, and impossible. The low, medium, and high programs each contain one vulnerability, while the impossible program is a secure program that contains none. The actual detection results are shown in Table 1, where Y indicates that a vulnerability was detected and N indicates that none was detected.
CONCLUSIONS
This paper presents a method for detecting vulnerabilities in PHP code based on taint analysis. It uses the PHP-Parser syntax parser, which currently supports all language features from PHP 5.2 to PHP 7.1. We generate the CFG from the AST that the parser produces. We then track program parameters, variables, and other external input, mark the type of each input, and propagate the taint toward the various vulnerability-prone functions. Finally, we examine the reported taint types and tainted variables to identify the vulnerabilities. The analysis method currently supports only procedural programs, and further research is needed to analyze object-oriented code. The types of vulnerabilities it can detect are also still limited. These issues will be studied further in future work.