SUGM 2015 - Canberra

Stata User Group Meeting Presentations

Demetris Christodoulou, University of Sydney

xtcluster: A partially heterogeneous framework for short panel data models

xtcluster implements the partially heterogeneous framework proposed by Sarafidis and Weber (2015). The algorithm classifies individuals into panel data regression clusters, such that within each cluster the slope coefficients are homogeneous and intra-cluster heterogeneity is attributed to the presence of individual- and time-specific effects. The slope coefficients differ across clusters. The optimal number of clusters and the associated optimal partition are determined using a model information criterion that is consistent for fixed T as N grows large. The proposed method relies on the data to suggest any clustering structure that might exist. Hence, it can be particularly useful when there is no a priori information about a potential clustering structure, or when one is interested in examining how far a structure that might be meaningful according to some economic measure lies from the structure that is optimal from a statistical point of view.
See full presentation

Jo Dipnall, Deakin University

Identifying biomarkers in epidemiological studies using a fusion of data mining and traditional statistical techniques in Stata

Background: Epidemiological studies generally incorporate vast numbers of variables. There are a multitude of techniques for variable selection in data mining, machine learning and traditional statistics with varying accuracy. The aim of this study was to incorporate these techniques in Stata to identify key biomarkers, from a large number measured, and explore their associations with depression. Methods: Data from the National Health and Nutrition Examination Survey (2009-2010) were utilised (n=5,227, mean age=43 yr). Depressive symptoms were measured using the Patient Health Questionnaire-9. Blood and urine samples were taken and large numbers of biomarkers measured (n=67). Anthropometric measurements, demographics and medications were determined. Lifestyle and health conditions were obtained via a questionnaire. A 4-step analysis process was performed incorporating multiple imputation, a Stata boosted regression plugin and traditional statistical techniques. Covariates included sex, age, race, smoking, food security, PIR, BMI, diabetes, inactivity and medications. The final model controlled for confounders and effect moderators. All analysis was managed within Stata’s project and macro do environment. Conclusion: Four out of a possible 67 biomarkers were identified as being associated with depressive symptoms. Implementing this research’s complex analysis strategy entirely from within Stata eliminated cross-platform errors and ensured easy replication of the results.
See full presentation

James Hurley, Ballarat Health Services

Use of graphical commands to eyeball the data for pneumonia prevention in intensive care units

There are >200 published studies of methods to prevent infections acquired in the Intensive Care Unit (ICU) such as pneumonia and bacteremia. The topical application of combinations of various antibiotics to the upper airway appears to be the most effective method (>40 studies). Surprisingly, within these studies of topical antibiotics as the prevention method, the incidence of pneumonia and bacteremia among the control groups is as much as double that among control groups within studies of methods other than topical antibiotics. Why? Graphics such as those obtained with metandi, a command for meta-analysis of diagnostic tests, offer a ‘novel’ approach to modelling the relationship between control group rate and intervention effect size within controlled trials. Stata offers a broad range of commands to study statistical relationships, but an outstanding feature is the range of graphical commands available that enable the data to be ‘eyeballed’. In this talk I will demonstrate, using graphs produced by ‘metan’, ‘metandi’, ‘funnelcompar’, ‘ellip’, and good old ‘twoway (scatter)’, that the relationship between control group incidence and effect size in this context is not simple. Is it “cause and effect” or the other way around? Ref: Hurley JC (2014) Topical antibiotics as a major contextual hazard toward bacteremia within selective digestive decontamination studies: a meta-analysis. BMC Infect Dis 14:714. http://www.biomedcentral.com/1471-2334/14/714/
See full presentation

Susan Kim, Flinders University

Using interrupted time series analysis to examine the effectiveness of the Comprehensive Stroke Unit model

Stroke care in a comprehensive stroke unit (CSU) is the gold standard. Care for stroke patients often involves neurologists as well as other physicians with stroke care expertise and training, i.e. stroke physicians. The aim of this study is to examine whether the CSU results in better outcomes irrespective of the physician. Data from patients with ischemic stroke admitted to a single centre between 2000 and 2014 were analysed retrospectively. Three system changes were made during this time: (1) patients were initially seen by a neurologist and transferred to a stroke physician from 2004; (2) advent of a stroke-trained neurologist in 2007; (3) a CSU model with care by a single stroke physician led by a Stroke Director from 2010. Interrupted time series analysis was used to model the changes in patients’ outcomes and complication rates over time using monthly aggregated data. The percentage of patients discharged to rehabilitation facilities changed significantly after each implementation (p<0.01) and significantly fewer patients developed aspiration pneumonia after 2010 (p=0.045). More patients were sent to rehabilitation facilities and fewer developed complications after the CSU model was introduced, so better outcomes can be achieved via the CSU model of care even when staffed by non-neurologist stroke physicians.
See full presentation

Kodjo Kondo, University of New England

Count model selection and post-estimation towards the evaluation of composite flour technology

This paper presents Stata estimation and post-estimation analyses in identifying determinants of the probability and extent of adoption of composite flour technology in bread baking in the Dakar Region of Senegal (West Africa). The technology is promoted to limit dependency on imported wheat. A hurdle regression model is estimated using socio-economic and production data collected from 150 bakers in 2014. The hurdle model, which was preferred over the negative binomial and the zero-inflated negative binomial models, allows us to disentangle factors affecting the adoption decisions from those influencing the quantities used. Findings indicate that the ownership of a 50 kg mixer, training programs on composite flour production and the number of bakeries owned positively affect adoption decisions, while the quantity decisions are influenced by membership in the baker federation and the expected output. The wheat and millet flour price ratio positively affects both decisions. These results imply that efforts to increase the adoption rate and its extent should promote the 50 kg mixers, intensify the professional training on composite flour production, institutionalize the use of composite flour and contribute to making local flour cheaper than wheat flour by intensifying local cereal production.
See full presentation

JinJing Li and Yohannes Kinfu, University of Canberra

rdecompose: Outcome decomposition for aggregate data

Social, behavioural and health scientists frequently apply methods for decomposing changes or differences in outcome variables into components of change. A number of Stata commands, such as those based on the Blinder-Oaxaca approach, have been developed over the years to facilitate this exercise using unit-level data. However, despite the abundance of aggregate data and the wide use of corresponding aggregate data decomposition techniques, there are no comparable user-developed Stata commands for decomposing changes or differences using aggregate-level data. In this paper, we introduce a new Stata command for aggregate data decomposition, based on Gupta's reformulation, and demonstrate applications from a wide range of settings that include demography, epidemiology and health economics. Our command also extends existing approaches to allow any number of factors and various functional relationships that are not available on any other platform.
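To make the idea of aggregate data decomposition concrete, here is a minimal Python sketch of the simplest case: an aggregate rate that is the product of two factors, R = a * b, compared across two populations. Each factor's effect is its change weighted by the average level of the other factor, which is the two-factor form commonly associated with Gupta-style decomposition. This is an illustration only; it is not the rdecompose command, and the function name and the numbers are hypothetical.

```python
def decompose_two_factor(a1, b1, a2, b2):
    """Split the change in R = a * b between population 1 and population 2
    into the part attributable to factor a and the part attributable to b.

    Hypothetical helper for illustration; not part of any Stata command.
    """
    # Effect of a: change in a, holding b at its average level.
    effect_a = (b1 + b2) / 2 * (a2 - a1)
    # Effect of b: change in b, holding a at its average level.
    effect_b = (a1 + a2) / 2 * (b2 - b1)
    return effect_a, effect_b


# Made-up example values.
effect_a, effect_b = decompose_two_factor(a1=0.5, b1=0.5, a2=0.75, b2=0.25)
total_change = 0.75 * 0.25 - 0.5 * 0.5
print(effect_a, effect_b, total_change)
```

In the two-factor case the two effects add up exactly to the total change in the rate, with no residual term. Extending this to any number of factors and to other functional relationships, as the abstract describes, requires more bookkeeping than this sketch shows.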
See full presentation

Rosie Meng, Flinders University

Model comparison for analysis of population surveillance data

Objective: To evaluate the relative merits of different approaches to the analysis of population-level bowel cancer surveillance data using available Stata routines. The focus is on selecting models to suit the research questions and the ease of interpretation. Methods: Outcomes of colonoscopies for colorectal cancer surveillance were obtained from the South Australian Southern Cooperative Program for the Prevention of Colorectal Cancer (SCOOP). Research questions were to identify whether patient and adenoma characteristics were associated with the degree of neoplasia advancement at the next surveillance colonoscopy. Amongst 379 patients with a diagnosis of low or high risk adenoma at index colonoscopy, the first surveillance colonoscopies were performed between 06-Dec-2001 and 21-Dec-2010. Five regression models were constructed: 1) Cox cause-specific model (stcox); 2) Cox model with stratification; 3) parametric survival model (streg); 4) competing risks survival model (stcrreg); 5) multinomial logistic regression (mlogit). Results: The 4 survival models generally had good agreement and were also consistent with Kaplan-Meier curves, but results from mlogit differed significantly from the rest. Conclusions: Survival analysis is preferred for surveillance data, especially when follow-up time varies considerably between individuals. A cause-specific Cox model may be preferred over a competing risks model to ease result interpretation.
See full presentation

Philip Morrison, Victoria University of Wellington

Applications of "margins" in social science

Although introduced in Stata 11 and 12, margins and marginsplot are not as widely used in social science as they could be. In advocating wider use of these tools, this paper illustrates their application in two studies based on Statistics New Zealand Surveys. The first models loneliness and its relationship to social connection and draws on the 8000 unit records from the New Zealand General Social Survey, NZGSS2012. The second asks how job satisfaction is associated with job insecurity, based on the 20,000 records of permanent employees within the pooled 2008 and 2012 Survey of Working Life (SoWL). The margins and associated commands greatly expand our ability to assess the effects (associations) of the (usually categorical) attributes of respondents on outcomes of policy interest. The paper focusses on the additional insights gained, especially when margins is combined with more recent user additions (e.g. coefplot and marginscontplot).
See full presentation

Normand Peladeau, Provalis Research

Text analysis using WordStat 7 within Stata

WordStat for Stata offers advanced text analytics features, allowing Stata 13 and 14 users to analyze text stored in both short and long string variables using numerous text mining features, such as topic modeling, document clustering, automatic classification, as well as state-of-the-art dictionary-based content analysis. Extracted themes may then be related to structured data using various statistics and graphic displays. WordStat also offers a tool to create a Stata project from lists of documents (including .DOC, HTML and PDF files) and to automatically extract numerical data, categorical data and dates from them.

Rebecca Pope, StataCorp

Treatment effects for survival-time outcomes: Theory and applications using Stata 14

The potential-outcomes framework for estimating treatment effects from observational data treats the unobserved outcome as a missing data problem. When we extend this framework to the analysis of survival-time outcomes, we also allow for data that is missing due to censoring. This requires us to make additional assumptions and changes the properties of some of the estimators. Beginning with a brief review of key concepts of survival-time data, I discuss potential outcomes in the context of survival analysis. I also explain some of the advantages to using treatment-effects analysis relative to traditional survival analysis. Alongside a brief overview of some of the estimators that are implemented in Stata 14, I demonstrate the application of survival treatment-effects analysis. Examples include analysis of single- and multivalued-treatments and postestimation checking of model assumptions.
See full presentation

Steve Quinn, Flinders University

The Hjort-Hosmer goodness-of-fit statistic for binary regression

The statistic most commonly used to evaluate the adequacy of a logistic regression model is the Hosmer-Lemeshow statistic[1]. The authors proposed a goodness-of-fit test based on partitioning the fitted probabilities into a number of groups and comparing observed events to expected events within each group. They showed via simulations that the resulting statistic follows a chi-squared distribution with degrees of freedom approximately equal to the number of groups minus two. The Hjort-Hosmer statistic[2] also assesses model adequacy and is based on partial sums of residuals that are sorted by their corresponding fitted values. The basic idea is that if a model is correctly fitted, then the partial sums should vary randomly about zero, and better model fit should correspond to smaller maximal partial sums. In this talk the Hosmer-Lemeshow and Hjort-Hosmer statistics are compared in binary regression models with different links, and we describe the hjorthos command, which calculates the Hjort-Hosmer statistic.
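The two ideas in this abstract can be sketched in a few lines of Python. The first function groups observations by fitted probability and compares observed with expected events in each group, in the usual Hosmer-Lemeshow form. The second computes only the raw building block of the partial-sum approach: the largest absolute running sum of residuals sorted by fitted value. It is not the standardized Hjort-Hosmer statistic and not the hjorthos command; both function names and the toy data are hypothetical.

```python
def hosmer_lemeshow(y, p, groups=10):
    """Hosmer-Lemeshow chi-squared statistic for 0/1 outcomes y and
    fitted probabilities p, using equal-sized groups of sorted p."""
    pairs = sorted(zip(p, y))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        expected = sum(prob for prob, _ in chunk)
        observed = sum(obs for _, obs in chunk)
        mean_p = expected / size
        denom = size * mean_p * (1 - mean_p)
        if denom > 0:  # skip groups whose fitted probabilities are all 0 or 1
            stat += (observed - expected) ** 2 / denom
    return stat


def max_partial_sum(y, p):
    """Largest absolute partial sum of residuals y - p, taken in order
    of increasing fitted value (unstandardized illustration only)."""
    running, largest = 0.0, 0.0
    for prob, obs in sorted(zip(p, y)):
        running += obs - prob
        largest = max(largest, abs(running))
    return largest


# Toy data: four observations, two groups.
y = [0, 0, 1, 1]
p = [0.25, 0.25, 0.75, 0.75]
print(hosmer_lemeshow(y, p, groups=2), max_partial_sum(y, p))
```

Following the abstract, the first statistic would be referred to a chi-squared distribution with roughly groups minus two degrees of freedom, while a well-fitting model should keep the second quantity small.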
See full presentation

Bill Rising, StataCorp

Bayesian analysis using Stata

Bayesian analysis made its official Stata debut with the release of Stata 14. In this talk, we will explore some simple applications to demonstrate the basics of Stata's user interface and suite of commands for Bayesian analysis.
See full presentation

Malcolm Rosier, SDAS

A practical introduction to Stata 14 item response theory

Stata 14 includes a module on Item Response Theory (IRT). We discuss basic characteristics of measurement in the social sciences, show how traditional measurement techniques and IRT are related, and discuss merits, constraints and uses of IRT. The IRT procedure produces a calibrated scale of the underlying (latent) dimension at the interval level of measurement. The same scale is used to obtain a measure of the difficulty of each item and of the ability of each person. We illustrate the One-Parameter and Two-Parameter Logistic Models by analysing a mathematics achievement test with dichotomous responses, scored correct or incorrect. We then introduce the IRT procedures applied to ordered categorical data. We apply the Rating Scale Model (RSM) and the Graded Response Model (GRM) to attitude scale data.
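As a small illustration of the dichotomous models named above, the Python sketch below evaluates the logistic item response function: the probability of a correct answer given a person's ability and an item's difficulty, both on the same latent scale. In the one-parameter model every item shares one discrimination value; in the two-parameter model each item has its own. The function name, the default discrimination of 1.0 and the example values are assumptions made for illustration, not output from Stata's IRT commands.

```python
import math


def prob_correct(ability, difficulty, discrimination=1.0):
    """Probability of a correct response under a logistic IRT model.

    With a common discrimination this is the one-parameter form;
    letting discrimination vary by item gives the two-parameter form.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


# A person whose ability equals the item difficulty has an even chance.
print(prob_correct(ability=0.0, difficulty=0.0))

# A more discriminating item separates abilities around its difficulty more sharply.
print(prob_correct(1.0, 0.0, discrimination=1.0))
print(prob_correct(1.0, 0.0, discrimination=2.0))
```

Because ability and difficulty enter only through their difference, a single calibrated scale serves for both items and persons, which is the point the abstract makes.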
See full presentation

Markus Schaffner, Queensland University of Technology

Statdoc: Document and explore

Statdoc is a small utility program written in Java that automatically documents data analysis projects. It is modelled after similar tools used in software development and as such supports good coding standards. The program can run stand-alone or from within Stata and produces a set of static html files that reveal information about the files in a given folder structure. Statdoc automatically discovers as much information as possible about the data, the variables, script files, and output files that it can identify and highlights the links between them. It features an enhanced documenting comment type, which allows supporting meta-information to be recorded. This way it allows the user to organise projects with ease and helps to uncover information about other people’s projects. The utility is aimed at real-world research projects where a multitude of data sources, script files, and outputs are not uncommon. Since the documentation is produced as static html files, it also facilitates sharing the complete information about a project on the web, helping efforts to make the data analysis process more transparent. Statdoc is available as an open source project on GitHub (for more information and examples see https://github.com/mas802/statdoc).
See full presentation

Tyman Stanford, The University of Adelaide

An assessment of current software: Parameter estimate accuracy for generalized linear mixed models with binary outcome data.

Generalized Linear Mixed Models (GLMMs) are a widely used class of models that assume the expected value of an outcome variable is determined by a linear combination of predictor variables, via an invertible link function, with both fixed and random model coefficients. Estimation of the model coefficients has improved with increased computational power; the current gold standard to estimate GLMM coefficients requires adaptive Gauss-Hermite quadrature approximation of the profiled likelihood function, usually a multi-dimensional integral, to obtain (approximate) maximum likelihood solutions. The performance of widely used software packages in estimating fixed and random coefficients with a Bernoulli outcome variable is the focus of this work. The packages surveyed, many with multiple routines available to perform GLMM parameter estimation, are Stata, R, SAS, ADMB, SPSS and Matlab. The GLMM routines in these packages are applied to multiple simulated datasets with known parameters to determine the accuracy of parameter estimates of both fixed effects and the variance components. The effect of increasing the number of adaptive Gauss-Hermite quadrature integral approximation points on the bias and precision of the estimates, as well as the effect on model selection using AIC, will be presented. The computational time taken to generate model parameter estimates using simulated data is also presented, an additional consideration in practice.
See full presentation

Bill Tyler, Charles Darwin University

Causal inference and treatment effect: An integrative framework for evaluation research

The increased popularity of quasi-experimental designs with observational data in policy-oriented evaluation studies, while enriching the environment of Stata applications, has complicated the options available to health and other social science researchers. In cross-cultural policy-related research, the tensions between multilevel and counterfactual modeling present particular problems for satisfying evidential criteria for both efficacy and effectiveness within what is often viewed as a homogeneous field for educational and child development policy. This presentation offers an integrative framework for interrogating the options for extending propensity-score analysis and other counterfactual approaches to multilevel modeling. The utility of this framework is illustrated from issues arising from ongoing evaluation projects in the areas of indigenous school-based interventions in remote community settings in Northern Australia.
See full presentation

Richard Woodman, Flinders University

Comparison of structural equation models with a binary outcome using Stata and Mplus

Structural Equation Modelling (SEM) is a powerful technique for examining complex relational structures and potential causal pathways. Although many software packages including AMOS, Stata, Mplus, LISREL and R provide routines for SEM with continuous outcomes, not all are capable of handling categorical data. In addition, there are differences between packages with regard to the availability of desirable SEM features including model fit indices, tests of group invariance, direct and indirect effect estimates, modification indices and estimation approaches. Mplus software is widely used in the Social Sciences and is considered by many as the gold-standard software for SEM. Stata introduced SEM in version 12 and implemented SEM for categorical outcomes in version 13. This presentation will describe and compare the available estimation options of Stata and Mplus for SEM using a clinical dataset that includes the binary outcome of coronary artery disease (CAD). We used cross-sectional data on 242 individuals with CAD and 218 individuals without CAD to examine the potential causal pathways and direct and indirect effects of homocysteine on CAD. Data were available for systolic blood pressure, triglycerides, and cholesterol sub-fractions. Body mass index, blood urea nitrogen, C-reactive protein and uric acid were used as markers of insulin sensitivity, renal function, inflammation and oxidative stress respectively. In addition to discussing the available estimation features of the two packages, this talk also compares the respective syntaxes and path diagramming features.
See full presentation

Stata User Group Meeting 2015 - Canberra

24 - 25 September 2015