Data management professionals can help statisticians and statistical programmers reach greater efficiency. Many of the problems in the clinical datasets that trouble statisticians are largely avoidable. This article describes the benefits of preventing problems, identifies database design efficiencies that can reduce the time and effort required of programmers and statisticians, and explores proactive steps that can lead to significant cost savings. Giving us “good” data, telling us what you know about the data, and making the database efficient to use can save time. In some cases, time-savers for your colleagues also save time for your data management staff.

We recognize that data managers can provide statisticians with “good” data only when good data exists. When source data is missing or is clearly in error, no amount of data managing can overcome that deficiency. For the purposes of this article, unless otherwise noted, we assume that accurate source data exists. It is up to clinic staff and data monitors to make that assumption true.

Data problems

What kinds of errors are actually found by statisticians? What are the “ghosties” and “ghoulies” in data, the problems that statisticians see but wish they did not?

Computerized checks on the range of values for a single variable are called range checks or univariate checks. Each data value is compared with a previously specified high limit, a low limit, or both. Data values outside the limits are flagged as failing the range check. Range checks are a well-known means of reviewing data for “bad values,” yet statisticians still see, in locked databases, data that should have failed range check tests. Examples taken from real data include a heart rate of 800 beats per minute, blood parameters inconsistent with life, and an efficacy score of 600 (range 0–100).
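A range check of the kind described above can be sketched in a few lines of Python. This is only an illustration, not the method of any particular data management system: the variable names and the heart rate limits are assumptions made for the example, and only the 0–100 efficacy range comes from the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RangeLimit:
    """A low limit, a high limit, or both, for one variable."""
    low: Optional[float] = None
    high: Optional[float] = None


# Hypothetical limits. The heart rate bounds are illustrative assumptions;
# the efficacy score range of 0-100 is the one quoted in the article.
LIMITS: Dict[str, RangeLimit] = {
    "heart_rate_bpm": RangeLimit(low=20, high=250),
    "efficacy_score": RangeLimit(low=0, high=100),
}


def range_check(record: Dict[str, Optional[float]]) -> List[str]:
    """Return the names of variables whose values fall outside their limits."""
    failures = []
    for name, limit in LIMITS.items():
        value = record.get(name)
        if value is None:
            continue  # missing values are left to a separate check
        if limit.low is not None and value < limit.low:
            failures.append(name)
        elif limit.high is not None and value > limit.high:
            failures.append(name)
    return failures
```

Calling `range_check({"heart_rate_bpm": 800, "efficacy_score": 600})` flags both variables, while a record with a heart rate of 72 and a score of 55 passes. The check is only as good as its limits, which is why the choice of limits matters as much as running the check at all.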
Fortunately, most data managers perform computerized range checks on the data; unfortunately, some data management processes do not implement range checking or fail to use appropriate checking limits.

Perhaps the most common “ghosties” lurking in databases are internal inconsistencies. Internal inconsistencies occur when the data value for one variable is logically inconsistent with the data value for another variable in the same database. Logical inconsistencies can be detected with multivariate checking programs. These checks are more complicated than range checks, and defining them requires a deeper understanding of the nature of the data. Some data problems that could be identified by the judicious use of multivariate computer checks are shown in the Multivariate Data Check box. Data values that individually pass range-check tests may still fail multivariate checks.

Poor data coding also besets some databases. I have seen datasets coded so that a generic drug and its brand name equivalent were coded to different preferred terms. I have also seen inconsistent adverse experience (AE) coding in datasets. Statisticians are not trained to find errors in medical coding, however, so coding errors of that type may escape our attention.

Perhaps the most serious “data error” that statisticians find is the absence of a variable: not missing values for variables, but actually missing variables. Typically a variable is missing because the case report form (CRF) did not capture the requisite information. My experience is that missing times (clock times or calendar dates) are among the most common missing variables. Statisticians are often called upon to perform survival analysis, which is an analysis of the time from one event (such as study drug administration) to some other event (such as a clinical outcome). All too often, the CRF design fails to capture the time of one of the events with sufficient accuracy, so the analysis cannot be performed.
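The effect of recorded precision on a time-to-event variable can be shown with a small Python sketch. The dosing and event times below are invented for illustration, and the function names are hypothetical.

```python
from datetime import datetime


def elapsed_hours(start: datetime, end: datetime) -> float:
    """Time from start to end in hours, using the full date and clock time."""
    return (end - start).total_seconds() / 3600.0


def elapsed_days_from_dates(start: datetime, end: datetime) -> int:
    """Time from start to end in whole days, as if the CRF had
    captured only the calendar dates of the two events."""
    return (end.date() - start.date()).days


# Invented example: one dosing time and two outcomes on the same day.
dose = datetime(2001, 3, 14, 8, 30)
event_a = datetime(2001, 3, 14, 9, 15)   # 45 minutes after dosing
event_b = datetime(2001, 3, 14, 20, 30)  # 12 hours after dosing
```

With clock times, the two outcomes are 0.75 and 12 hours after dosing. With dates alone, both collapse to 0 days, and the distinction the analysis needs is gone.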
For long-term survival times, dates of the two events provide adequately detailed information for analysis. Shorter-term survival times require hours and minutes, or even seconds, for meaningful data analysis. A variable is often missing on a CRF because the protocol does not explicitly identify the times as a variable of record, and the persons designing the CRF may lack sufficient statistical expertise to recognize that the protocol-specified analyses require the time data.

Passing the buck or saving a buck

When data management locks a database that is still dirty, the data problems are passed off to the statisticians and programmers. The data problems do not go away. In fact, they become more expensive to fix. It costs more, in the long run, for data management to pass the buck, however inadvertently, to statisticians.

First, consider the typical steps in the life of clinical data, from a biostatistician’s perspective.

Step 1. Source data are recorded at the clinical site.
Step 2. The study coordinator records the data on the CRF at the clinical site.
Step 3. The monitor reviews the data at the clinical site.
Step 4. Data managers enter and process the data, finally locking the database.
Step 5. The statisticians start summarizing and analyzing the data.

The first three steps generally occur in the same place with minimal lag time. Next, the data management process is implemented. Because CRFs provided to data management are rarely pristine, timelines and budgets are set to account for a query process. Frequently the queries about a particular subject’s CRF are performed in a timely manner, but it may be weeks, months, or even years before all the subjects have completed the trial and the database is locked. Statisticians and statistical programmers may get their first look at the data only near the end of a study.
Here, at the fifth and final step, statisticians may encounter some of the problems discussed earlier: range check or coding problems and internal inconsistencies. Problems discovered this late in the process have expensive consequences. Time is saved when data managers clean the data before passing it on to the statisticians, and time is money. It is simply cost-efficient to provide excellent data management.

Sources of additional cost

Problem processing. When statisticians detect a data problem, they must first investigate it, then document it and write a memo requesting resolution. This processing takes time for both statisticians and data managers. When data managers must query a study site, even more documentation of the request for information is needed. In many cases data managers could have done this work initially at a much lower cost.

Query resolution. When a data issue requires a query to the site, getting the cooperation of the site may be difficult or impossible because of a long time lag. The site staff members, at best, are now working on other projects and generally are not motivated to research the problem. The personnel involved in the study may have left the site or, worse yet, the site may have closed.

Response documentation. When responses are provided, they must be documented at each step of the process. If the site provides new data, or data management has detected an entry error, then the dataset may have to be unlocked, with all the documentation that unlocking requires. When data management provides the new data, the statisticians must rerun programs. The programs and possibly the clinical study report (CSR) must then document the issue and its resolution.

Analysis. Frequently, while the statisticians await resolution of the data issues, the medical team is clamoring for results. To meet these urgent requests, the statisticians and programmers may implement temporary patches to the programs that summarize and analyze the affected data.
Later, when they have the final data value, they must correct and rerun the programs.

Duplicated effort. When the data problem cannot be resolved (for example, when source data is no longer available), then the statistician may have to perform the analyses twice: once with the suspicious data included and once with it omitted. The results of both analyses must be written up. There is no allotment in either the timeline or the budget for statisticians to do their work twice.

Consistency

The first five items in the list lead to efficiencies in several ways. Programs that are developed for one protocol can be reused, with minor modifications, for similar protocols, which is an obvious time-saver. Furthermore, inconsistencies can introduce errors. For example, suppose that SEX is coded as M/F for one protocol and the coding is incorporated as part of a program (IF SEX = “M” THEN . . .). When the program is copied for use with a subsequent protocol, the program may run, but not properly, if SEX has been coded as 0/1 in the second protocol. Programs can be modified, of course, but the objective is to minimize the number of modifications necessary.

Consider the construction of ISS/ISE (integrated summary of safety/integrated summary of efficacy) databases. When ISS/ISE generation gets underway, efficiencies are critical. The submission clock is ticking loudly, and millions of dollars in profits can be lost to time delays. This is when database consistencies pay off.

When ISS/ISE databases are compiled, datasets for the same type of data are pooled across relevant studies. It is easier if the datasets to be pooled have the same name: not essential, but easier. It is much easier to merge the data if the datasets being pooled have a similar structure. One can restructure datasets, but doing so costs time. Variables that have been coded inconsistently can be missed, or noticed only late in the process. At a minimum, that can cost time to recode the common variables consistently.
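The recoding step can be sketched in Python as follows. The study identifiers and field names are hypothetical, and the meaning assigned to 0 and 1 is an assumption made for the example; in practice it must come from each study's documentation. The point of the sketch is that an explicit per-study mapping fails loudly on an unexpected code instead of running "but not properly."

```python
from typing import Dict, Hashable, List

# Hypothetical per-study code lists. The 0/1 assignment below is an
# assumption for illustration only, not a convention from the article.
SEX_MAPS: Dict[str, Dict[Hashable, str]] = {
    "STUDY01": {"M": "M", "F": "F"},
    "STUDY02": {0: "F", 1: "M"},
}


def harmonize_sex(study_id: str, raw_value: Hashable) -> str:
    """Translate a study-specific SEX code to the pooled M/F coding."""
    mapping = SEX_MAPS[study_id]
    if raw_value not in mapping:
        # Refuse to guess: an unknown code is an error, not a default.
        raise ValueError(f"{study_id}: unexpected SEX code {raw_value!r}")
    return mapping[raw_value]


def pool(studies: Dict[str, List[dict]]) -> List[dict]:
    """Pool subject records across studies with a consistent SEX coding."""
    pooled = []
    for study_id, records in studies.items():
        for rec in records:
            pooled.append({
                "study": study_id,
                "subject": rec["subject"],
                "sex": harmonize_sex(study_id, rec["sex"]),
            })
    return pooled
```

A program written against the pooled coding can then test `sex == "M"` for every study. Had the second study been passed through the first study's logic unchanged, no record would have matched and the program would still have run without complaint.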
When the same variable has different attributes in different datasets, errors can occur and go unnoticed, at least initially. For example, characters/fields can be truncated, so that site 001 becomes site 00. Again, at best, time is wasted reconciling the attributes. Clearly, time is saved when programmers need not rework variable attributes.

Providing consistencies in naming conventions is a little-recognized efficiency that data managers can provide to statisticians and programmers. Anyone programming the data can always look up the variable name (they do it all the time), but it takes time. Programmers who are handed a database with consistent naming need not spend so much time checking on variable names or fixing draft programs that failed to work because a variable name was “remembered” incorrectly. If the programmers know, for example, that all dates have the suffix “DT,” it is simply easier to program. Easier is faster and saves money.

In short, data managers can help clinical programmers and statisticians save time by providing consistent datasets. Programmers, however, can also help themselves. Frequently, the master datasets provided by data management are modified to create analysis datasets to use in generating the tables and analyses. The same efficiencies introduced by consistent datasets are available to programmers of the analysis datasets. We statisticians have only ourselves to blame if we fail to attend to the consistencies in our analysis datasets as we move from protocol to protocol.

Communication

Clinical monitors and data managers could forestall questions from statisticians by telling them in advance what is known about the database. Statisticians sometimes find nuggets of information on the comment page of the CRF. Data managers often dislike comment pages because they are hard to enter. Statisticians, however, often gain considerable insight into the conduct of a study from notes recorded by the study coordinator and the monitor.
I know of a case in which insight gained on the comment page of a protocol was instrumental in changing the analysis not only of that protocol, but of several previously analyzed protocols. The statisticians were unaware of an irregularity in the conduct of the studies, and it was only a comment in the third of four protocols that alerted them to the irregularity. Data summaries in the previous protocols were revised to reflect the newly acquired knowledge. Some statisticians encourage the inclusion of comment pages in the CRF. Careful CRF development, however, and ongoing communication between monitors and statisticians, might avoid such problems.

To preclude the necessity for questions, data managers can communicate directly with statisticians using special missing-value codes (.A = not applicable or .D = not done, for example) and data status flags. A status flag variable can be created for every CRF variable, and the status flag values can distinguish between various data management situations. For example, data managers can flag a problematic value (an out-of-range value, for example, or a required value that is missing) to indicate that “data management knows this value is wrong, but it is the best we can do.” This saves time (both documentation time and calendar time awaiting an answer) for the statistician, who need not query data management about the value. Furthermore, data management is spared receiving and responding to these questions from the statisticians.

Pay now or pay more later

First consider the question of how much money can be saved when data is delivered in clean, consistent databases. Sponsors have already paid a great deal of money to have the data processed. Frequently, they have to pay more, later in the process, when programmers and statisticians detect data errors. Similarly, a portion of the expense of conducting a protocol is spent on summarizing and analyzing the data.
Some expenses will always be incurred, but why pay more than necessary, and why build delays into the process? Time is money, particularly when an ISS/ISE project is under way.

How much can be saved by implementing high-quality data management that provides genuinely clean data in consistently constructed databases? No one really knows, but we can make some guesses. I saw one project that took four to six extra months to complete because the statisticians discovered data errors after the database was locked. Data errors in many small projects incur extra costs in the $3,000–$20,000 range. When ISS/ISE databases need to be developed, reconciling inconsistent datasets frequently requires vast amounts of costly time. Weeks can be spent on such efforts for even relatively small submissions. If statisticians and programmers have to set aside a project to await resolution of data errors, when they return to it later they must spend costly time reacquainting themselves with the data and the data problems.

Although we cannot estimate cost savings precisely, we do know that they can be substantial. If we estimate, quite conservatively, that better data management can save even one hour of statistical and programming staff time per week, that is 2.5% of their time. Even accounting for the additional data management time and effort taken to ensure clean data and a consistent database, the investment at the beginning more than pays for itself in the long run.

In addition to those noted above, other costs can also become an issue. Missing or incorrect data is itself a cost. Data creates the information of a study. When a data value is missing, or set to missing because it is obviously grossly incorrect, some information from the study is being thrown away. To the extent that the missing data value could have been replaced by good data had a query been generated, or been generated in a timely manner (remember, sites do close), the missing data value is a data management issue.
On the other hand, it is clearly incumbent upon the clinic staff to generate accurate and complete data to the extent possible. Data managers cannot enter data that does not exist, nor can they be expected to find all recording errors. A systolic blood pressure of 102, for example, could be misrecorded or misentered as 120 and no range check program or programmer would notice the error. Regardless of the cause of an unnoticed incorrect data value, incorrect values inflate the variance. Inflating the variance is tantamount to decreasing the sample size, a waste of study information. In short, unnoticed data errors and missing data waste study resources.

Implementing efficiencies

Companies spend a great deal of money on data management, but they can increase efficiencies and thus reduce the direct costs of drug development and shorten timelines.

Involve a statistician in the data management process. When statisticians review CRFs, they can ensure that all essential information is collected and make a case for not collecting nonessential information. As noted, one relatively common mistake in CRF design is the failure to record a time that is needed for statistical analysis. Statisticians and statistical programmers can also help data managers identify relevant multivariate data checks that can be used to find errors early (before the data trail is cold) in order to gain greater efficiencies.

Statisticians can also be useful in helping data managers decide which questionable data values are worth querying. When a large number of possible queries are involved, a sponsor may reasonably elect to document a decision to make no attempts to resolve some types of data issues: those for which resolutions would change neither the summary tables nor the conclusions of the study. In such cases, a statistician can help assess what types of data issue resolutions will and will not affect the summaries.

Implement a top-notch data management system.
Often companies agree to error rates of 1/1,000 or even 5/1,000 analyzable data fields. A very good, properly operated data management system (one that provides pre-entry review, range checks, comprehensive multivariate checks, and true double-entry with third-party review) can consistently yield error rates no higher than 1/10,000 analyzable fields. Because it is so much less expensive to provide clean data to start with, the overall cost of clean data is competitive. Sponsors pay for data management anyway. They may as well pay for clean data to begin with. Additional efficiencies can be realized if data management departments facilitate communication with statisticians by including a comment page, status flags, and special missing characters.

Standardize databases. Reusing programs saves an enormous amount of time. The ability to seamlessly merge databases to prepare ISS/ISE databases can save a staggering amount of effort.

In short, high-quality data management is cost-efficient.

Katherine L. Monti, PhD, is senior statistical scientist and director of the Massachusetts office, Rho, Inc., 199 Wells Ave., Suite 302, Newton, MA 02459, (617) 965-8000, fax (617) 965-8014, e-mail: kmonti@rhoworld.com, and is an adjunct associate professor, Biostatistics, at the University of North Carolina, Chapel Hill. This article is based on the author’s presentation at the Drug Information Association’s 16th Annual Symposium and Exhibition on Clinical Data Management, Atlantic City, 14 March 2001.

SIDEBAR: Standards That Increase Programming Efficiencies

Dataset names
When data value assignment is a data management task, use common values for the data in different databases (that is, standardize the assigned values). For example:
Standardize the following attributes of a common variable:
Standardize variable names across protocols; maintain this consistency for the same variable in different datasets in the same database (and across databases where practical).
Select consistent prefixes and suffixes for variable names.
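One way to audit standards like these before pooling is to compare variable attributes across datasets automatically. The Python sketch below is a hedged illustration: the dataset names, variable names, and the choice of type and length as the attributes to compare are all assumptions, since the sidebar's own attribute list is not reproduced here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Attrs = Tuple[str, int]  # (type, length); an assumed pair of attributes

# Hypothetical metadata: dataset name -> variable name -> attributes.
METADATA: Dict[str, Dict[str, Attrs]] = {
    "STUDY01.DEMOG": {"SITE": ("char", 3), "VISITDT": ("date", 8)},
    "STUDY02.DEMOG": {"SITE": ("char", 2), "VISITDT": ("date", 8)},
}


def attribute_conflicts(
    metadata: Dict[str, Dict[str, Attrs]]
) -> Dict[str, Dict[Attrs, List[str]]]:
    """Return, for each variable whose attributes differ between datasets,
    a mapping from each attribute set to the datasets that use it."""
    seen = defaultdict(lambda: defaultdict(list))
    for dataset, variables in metadata.items():
        for var, attrs in variables.items():
            seen[var][attrs].append(dataset)
    return {
        var: dict(by_attrs)
        for var, by_attrs in seen.items()
        if len(by_attrs) > 1
    }
```

For the metadata above, only SITE is reported, because its length is 3 in one dataset and 2 in the other. That is exactly the mismatch that would truncate site 001 to 00 if the shorter length were applied during a merge.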