PREDICTING ASTHMA-RELATED EMERGENCY DEPARTMENT VISITS USING BIG DATA
Asthma is one of the most prevalent and costly chronic conditions in the United States which cannot be cured. However accurate and timely surveillance data could allow for timely and targeted interventions at the community or individual level. Current national asthma disease surveillance systems can have data availability lags of up to two weeks. Rapid progress has been made in gathering non-traditional, digital information to perform disease surveillance.
We introduce a novel method of using multiple data sources for predicting the number of asthma related emergency department (ED) visits in a specific area. Twitter data, Google search interests and environmental sensor data were collected for this purpose. Our preliminary findings show that our model can predict the number of asthma ED visits based on near-real-time environmental and social media data with approximately 70% precision. The results can be helpful for public health surveillance, emergency department preparedness, and, targeted patient interventions.
Asthma is one of the most prevalent and costly chronic conditions in the United States, with 25 million people affected. Asthma accounts for about two million emergency department (ED) visits, half a million hospitalizations, and 3,500 deaths, and incurs more than 50 billion dollars in direct medical costs annually. Moreover, asthma is a leading cause of loss productivity with nearly 11 million missed school days and more than 14 million missed work days every year due to asthma. Although asthma cannot be cured, many of its adverse events can be prevented by appropriate medication use and avoidance of environmental triggers. The prediction of population- and individual-level risk for asthma adverse events using accurate and timely surveillance data could guide timely and targeted interventions, to reduce the societal burden of asthma. At the population level, current national asthma disease surveillance programs rely on weekly reports to the Centers for Disease Control and Prevention (CDC) of data collected from various local resources by state health departments.
Notoriously, such data have a lag-time of weeks, therefore providing retrospective information that is not amenable to proactive and timely preventive interventions. At the individual level, known predictors of asthma ED visits and hospitalizations include past acute care utilization, medication use, and sociodemographic characteristics. Common data sources for these variables include electronic medical records (EMR), medical insurance claims data, and population surveys, all of which, also, are subject to significant time lag. In an ongoing quality improvement project for asthma care, Parkland Center for Clinical Innovation (PCCI) researchers have built an asthma predictive model relying on a combination of EMR and claim data to predict the risk for asthma-related ED visits within three months of data collection [Unpublished reports from PCCI]. Although the model performance (C-statistic 72%) and prediction timeframe (three months) are satisfying, a narrower prediction timeframe potentially could provide additional risk-stratification for more efficiency and timeliness in resource deployment. For instance, resources might be prioritized to first serve patients at high risk for an asthma ED visit within 2 weeks of data collection, while being safely deferred for patients with a later predicted high-risk period.
Novel sources of timely data on population- and individual-level asthma activities are needed to provide additional temporal and geographical granularity to asthma risk stratification. Short of collecting information directly from individual patients (a time- and resource-intensive endeavor), readily available public data will have to be repurposed intelligently to provide the required information. There has been increasing interest in gathering non-traditional, digital information to perform disease surveillance. These include diverse datasets such as those stemming from social media, internet search, and environmental data. Twitter is an online social media platform that enables users to post and read 140-character messages called “tweets”. It is a popular data source for disease surveillance using social media since it can provide nearly instant access to real-time social opinions. More importantly, tweets are often tagged by geographic location and time stamps potentially providing information for disease surveillance.
Another notable non-traditional disease surveillance systemhas been a data-aggregating tool called Google Flu Trends which uses aggregated search data to estimate flu activity. Google Trends was quite successful in its estimation of influenza-like illness. It is based on Google’s search engine which tracks how often a particular search-term is entered relative to the total search-volume across a particular area. This enables access to the latest data from web search interest trends on a variety of topics, including diseases like asthma. Air pollutants are known triggers for asthma symptoms and exacerbations. The United States Environmental Protection Agency (EPA) provides access to monitored air quality data collected at outdoor sensors across the country which could be used as a data source for asthma prediction. Meanwhile, as health reform progresses, the quantity and variety of health records being made available electronically are increasing dramatically. In contrast to traditional disease surveillance systems, these new data sources have the potential to enable health organizations to respond to chronic conditions, like asthma, in real time. This in turn implies that health organizations can appropriately plan for staffing and equipment availability in a flexible manner. They can also provide early warning signals to the people at risk for asthma adverse events, and enable timely, proactive, and targeted preventive and therapeutic interventions.
USE OF HANGEUL TWITTER TO TRACK AND PREDICT HUMAN INFLUENZA INFECTION
AUTHOR: Kim, Eui-Ki, et al.
PUBLISH: PloS one vol. 8, no.7, e69305, 2013.
Influenza epidemics arise through the accumulation of viral genetic changes. The emergence of new virus strains coincides with a higher level of influenza-like illness (ILI), which is seen as a peak of a normal season. Monitoring the spread of an epidemic influenza in populations is a difficult and important task. Twitter is a free social networking service whose messages can improve the accuracy of forecasting models by providing early warnings of influenza outbreaks. In this study, we have examined the use of information embedded in the Hangeul Twitter stream to detect rapidly evolving public awareness or concern with respect to influenza transmission and developed regression models that can track levels of actual disease activity and predict influenza epidemics in the real world. Our prediction model using a delay mode provides not only a real-time assessment of the current influenza epidemic activity but also a significant improvement in prediction performance at the initial phase of ILI peak when prediction is of most importance.
A NEW AGE OF PUBLIC HEALTH: IDENTIFYING DISEASE OUTBREAKS BY ANALYZING TWEETS
AUTHOR: Krieck, Manuela, Johannes Dreesman, Lubomir Otrusina, and Kerstin Denecke.
PUBLISH: In Proceedings of Health Web-Science Workshop, ACM Web Science Conference. 2011.
Traditional disease surveillance is a very time consuming reporting process. Cases of notifiable diseases are reported to the different levels in the national health care system before actions can be taken. But, early detection of disease activity followed by a rapid response is crucial to reduce the impact of epidemics. To address this challenge, alternative sources of information are investigated for disease surveillance. In this paper, the relevance of twitter messages outbreak detection is investigated from two directions. First, Twitter messages potentially related to disease outbreaks are retrospectively searched and analyzed. Second, incoming twitter messages are assessed with respect to their relevance for outbreak detection. The studies show that twitter messages can be – to a certain extent – highly relevant for early detecting hints to public health threats. According to the law on German Protection against Infection Act (Infektionsschutzgesetz (IfSG), 2001) the traditional disease surveillance relies on data from mandatory reporting of cases by physicians and laboratories. They inform local county health departments (Landkreis) which in turn report to state health departments (Land). At the end of the reporting pipeline, the national surveillance institute (Robert Koch Institute) is informed about the outbreak. It is clear that these different stages of reporting take time and delay a timely reaction.
TOWARDS DETECTING INFLUENZA EPIDEMICS BY ANALYZING TWITTER MESSAGES
AUTHOR: Culotta, Aron.
PUBLISH: In Proceedings of the first workshop on social media analytics, pp. 115-122. ACM, 2010.
Rapid response to a health epidemic is critical to reduce loss of life. Existing methods mostly rely on expensive surveys of hospitals across the country, typically with lag times of one to two weeks for influenza reporting, and even longer for less common diseases. In response, there have been several recently proposed solutions to estimate a population’s health from Internet activity, most notably Google’s Flu Trends service, which correlates search term frequency with influenza statistics reported by the Centers for Disease Control and Prevention (CDC). In this paper, we analyze messages posted on the micro-blogging site Twitter.com to determine if a similar correlation can be uncovered. We propose several methods to identify influenza-related messages and compare a number of regression models to correlate these messages with CDC statistics. Using over 500,000 messages spanning 10 weeks, we find that our best model achieves a correlation of .78 with CDC statistics by leveraging a document classifier to identify relevant messages.
Existing methods in the increased availability of information in the Web, in the last years, a new research area has been developed, namely Infodemiology. It can be defined as the “science of distribution and determinants of information in an electronic medium, specifically the Internet, or in a population, with the ultimate aim to inform public health and public policy”. As part of this research area, several kinds of data have been studied for their applicability in the context of disease surveillance. Google flu trends exploit the search behavior to monitor the current flurelated disease activity. It could be shown by Carneiro and Mylonakis that Google Flu Trends can detect regional outbreaks of influenza 7–10 days before conventional Centers for Disease Control and Prevention surveillance systems.
Google messages and their relevance for disease outbreak detection has been reported already that especially tweets are useful to predict outbreaks such as a Norovirus outbreak at a university analysed twitter news during the influenza epidemic 2009. They compared the use of the term “H1N1” and “swine flu” over the time. Furthermore, they analysed the content of the tweets (ten content concepts) and validated twitter as a the real time content. They analysed the data via Infovigil an infosurveillance system by using an automated coding. To find out if there is a relationship between automated and manual coding, the tweets were evaluated by a Pearson´s correlation. Chew et al. found a significant correlation between both coding in seven content concept it needs to be investigated whether this source might be of relevance for detecting disease outbreaks in Germany. Therefore, only German keywords are exploited to identify Twitter messages. Further, we are not only interested in influenza-like illnesses as the studies available so far, but also in other infectious diseases (e.g. Norovirus and Salmonella).
Existing methods have a common format: [username] [text] [date time client]. The length is restricted to 140 characters. In terms of linguistics, each twitter user can write as he or she likes. Thus, the variety reaches from complete sentences to listing of keywords. Hashtags, i.e. terms that are combined with a hash (e.g. #flu) denote topics and are primarily utilized by experienced users categories google according to their contents in more details, google messages can • Provide information, • Express opinions or • Report personal issues is provided, the authority of that information cannot normally not be determined, so it might be unverified information. Opinions are often expressed with humor or sarcasm and may be highly contradictive in the emotions that are expressed.
Our proposed methods to leverage social media, internet search, and environmental air quality data to estimate ED visits for asthma in a relatively discrete geographic area (a metropolitan area) within a relatively short time period (days) to this end, we have gathered asthma related ED visits data, social media data from Twitter, internet users’ search interests from Google and pollution sensor data from the EPA, all from the same geographic area and time period, to create a model for predicting asthma related ED visits. This work is different from extant studies that typically predict the spread of contagious diseases using social media such as Twitter. Unlike influenza or other viral diseases, asthma is a non-communicable health condition and we demonstrate the utility and value of linking big data from diverse sources in developing predictive models for non-communicable diseases with a specific focus on asthma.
Research studies have explored the use of novel data sources to propose rapid, cost-effective health status surveillance methodologies. Some of the early studies rely on document classification suggesting that Twitter data can be highly relevant for early detection of public health threats. Others employ more complex linguistic analysis, such as the Ailment Topic Aspect Model which is useful for syndrome surveillance. This type of analysis is useful for demonstrating the significance of social media as a promising new data source for health surveillance. Other recent studies have linked social media data with real world disease incidence to generate actionable knowledge useful for making health care decisions. These include which analyzed Twitter messages related to influenza and correlated them with reported CDC statistics validated Twitter as a real-time content, sentiment, and public attention trend-tracking tool. Collier employed supervised classifiers (SVM and Naive Bayes) to classify tweets into four self-reported protective behavior categories. This study adds to evidence supporting a high degree of correlation between pre-diagnostic social media signals and diagnostic influenza case data.
Our work uses a combination of data from multiple sources to predict the number of asthma-related ED visits in near real-time. In doing so, we exploit geographic information associated with each dataset. We describe the techniques to process multiple types of datasets, to extract signals from each, integrate, and feed into a prediction model using machine learning algorithms, and demonstrate the feasibility of such a prediction.
The main contributions of this work are:
- Analysis of tweets with respect to their relevance for disease surveillance,
- Content analysis and content classification of tweets,
- Linguistic analysis of disease-reporting twitter messages,
- Recommendations on search patterns for tweet search in the context of disease surveillance.
HARDWARE & SOFTWARE REQUIREMENTS:
v Processor – Pentium –IV
- Speed – 1 GHz
- RAM – 256 MB (min)
- Hard Disk – 20 GB
- Floppy Drive – 44 MB
- Key Board – Standard Windows Keyboard
- Mouse – Two or Three Button Mouse
- Monitor – SVGA
- Operating System : Windows XP or Win7
- Front End : JAVA JDK 1.7
- Back End : MYSQL Server
- Server : Apache Tomact Server
- Script : JSP Script
- Document : MS-Office 2007