Information Leveraging from Business Invoices

Rule based Information Leveraging from Business Invoices


Abstract: Invoices are exchanged between business organizations on a day-to-day basis, and they all contain similar kinds of information: What is the name of the issuing company? To whom is the invoice issued? What is the amount of the invoice? What is the mode of payment? Capturing and structuring this information can help answer critical supply chain and cash flow questions and help business analysts make better decisions. To represent data with irregular or hidden structure, semi-structured data allows a "schema-less" description format in which the data is more loosely constrained than in a conventional database system. In this paper, rule-based annotation is used to add more structure to business invoices in order to simplify and standardize the storage and retrieval of business information.

Business data today is so large and growing so quickly that it falls under what is now termed Big Data. Big Data spans every type of business data, from small Excel or Word files of a few kilobytes to multimedia files of several gigabytes and beyond. All of this data needs to be stored in a data warehouse (DW), an integrated repository of data that assists decision makers and business analysts in the decision-making process [1][2]. Big Data is characterized by high volume, high velocity and high variety, together with veracity, i.e. the trustworthiness and accountability of the data in question. This is the popular 4V model of Big Data given by IBM [3].
In terms of variety, all business data can be grouped under three major heads: structured data, unstructured data (USD) and semi-structured data. All of these are part of the big data warehouse. Structured data has a definite, fixed schema in which it is stored, and all the basic traditional data mining operations can be applied to it to process information. USD is free-flowing data with no defined schema; it is too loosely organized to be stored as cells and values. Various techniques and mechanisms for processing unstructured data are available in the literature, and some of the existing data-handling techniques are extensively reviewed in [4]. These techniques can be broadly grouped under the umbrella term Data Analytics [5]. Semi-structured data, finally, can be understood as a bridge between structured and unstructured data: it has some inherent structure that can be exploited to extract useful knowledge, and it can be handled through a mix of traditional data mining operations and contemporary data analytics operations.
Data can be immensely valuable to an organization, yet it can be tricky to tell which type of data one is dealing with. There are three broad categories: unstructured, semi-structured and structured. The human body offers a useful metaphor. If an organization is a human body, then the data entering through the eyes is unstructured data: it is full of important information, but it cannot be understood or analyzed until the brain processes it. The same holds for organizations: unstructured data enters through day-to-day activities, and everyone can see it, but no one can understand it until it is compared or analyzed in some way. Air entering through the nose corresponds to semi-structured data: it is already separated from other, useless input and so has a little structure, but it is not usable until the lungs extract oxygen from it, which symbolizes the extraction of useful data. Finally, the combination of ears, nose and eyes produces structured data, since the data entering through all of them is processed and verified simultaneously and has a definite structure. The amount of data available in all kinds of formats has grown rapidly in recent years. It ranges from unstructured data in file systems to highly structured data in database systems. This paper is concerned with semi-structured data, whose implicit structure is not as rigid and regular as that of a relational database system but nevertheless exists and needs to be extracted [6]. Specifically, the paper considers semi-structured documents and defines rules for processing the information held in an invoice, which is inherently semi-structured.
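To make the notion of implicit structure concrete, the following is a minimal Python sketch. It is an illustration only and does not reproduce the rules defined in this paper. The field labels (`From:`, `To:`, `Total:`, `Payment Mode:`), the `RULES` table, the `extract_fields` helper and the sample invoice text are all hypothetical and assume a plain-text invoice with one labelled field per line.

```python
import re

# Hypothetical label-based rules: each maps a field name to a pattern that
# captures the value following an assumed label at the start of a line.
FLAGS = re.MULTILINE | re.IGNORECASE
RULES = {
    "issuer": re.compile(r"^From:[ \t]*(.+)$", FLAGS),
    "recipient": re.compile(r"^(?:Bill To|To):[ \t]*(.+)$", FLAGS),
    "amount": re.compile(r"^Total(?: Amount)?:[ \t]*([\d,]+(?:\.\d+)?)", FLAGS),
    "payment_mode": re.compile(r"^Payment Mode:[ \t]*(.+)$", FLAGS),
}


def extract_fields(invoice_text):
    """Apply every rule to the text; fields with no matching line become None."""
    record = {}
    for field, pattern in RULES.items():
        match = pattern.search(invoice_text)
        record[field] = match.group(1).strip() if match else None
    return record


# Made-up sample invoice, used only to exercise the sketch.
SAMPLE_INVOICE = """From: Acme Supplies Ltd
To: Globex Corporation
Total: 1,250.00
Payment Mode: Bank Transfer
"""

if __name__ == "__main__":
    print(extract_fields(SAMPLE_INVOICE))
```

The labels give the text just enough regularity for simple rules to turn it into a structured record, while a missing label yields an empty field, with no schema violation.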
Although the proposed approach is heuristic and based on propositional logic, the implementation of the rules is an ongoing task that is not covered in this paper. The authors are still developing a complete engineering system that takes an invoice in electronic format as input and checks for its inherent limited structure, followed by extraction, storage and processing. The rest of the paper is organized as follows. Section II describes the related work. Section III presents rules for extracting facts from business invoices and derives their proofs from propositional logic. Section IV concludes the paper and discusses future work.