Author

Racine, Wisconsin, United States
We (my wife and I) are celebrating the 11th anniversary of HAPLR and, more importantly, our 38th anniversary. The HAPLR system uses data provided by 9,000 public libraries in the United States to create comparative rankings in broad population categories. HAPLR provides a comparative rating system that librarians, trustees, and the public can use to improve and extend library services. I am the director of the Waukesha County Federated Library System.

Monday, November 23, 2009

The LJ Index and Misbehaving Data

For more see: http://www.haplr-index.com/outliers_and_misbehaving_data_in.htm

Why did LJ decide to use the “outlier” numbers that gave San Diego County a questionable five-star rating? Did this decision cost other libraries star ratings?

Why does the LJ Index “Score Calculation Algorithm” allow one measurement to swamp the score? Is this data “misbehavior” intentional, as one of the authors suggests in the piece linked below?
http://www.libraryjournal.com/article/CA6636731.html

In the LJ Index calculations, San Diego County’s extraordinarily high Public Internet Use score (889% above the group average!) cancels its relatively low scores for Circulation (48% below), Visits (29% below), and Program Attendance (20% below). In the latest LJ Index, San Diego ranked 4th and received 5 stars.
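To see how a single measure can swamp a composite, here is a minimal Python sketch. It assumes, purely for illustration, a score equal to the mean of each measure’s ratio to the peer-group average; whatever the exact LJ Index formula, any composite built from ratios to a group average behaves this way. The ratios simply restate the percentages quoted above.

    # Ratios to the peer-group average, restated from the figures above.
    measures = {
        "public_internet_use": 9.89,  # 889% above the group average
        "circulation": 0.52,          # 48% below
        "visits": 0.71,               # 29% below
        "program_attendance": 0.80,   # 20% below
    }

    composite = sum(measures.values()) / len(measures)
    print(round(composite, 2))  # 2.98 -- nearly triple the group average

    # With a merely average internet figure, the same library lands
    # well below the group average:
    measures["public_internet_use"] = 1.0
    print(round(sum(measures.values()) / len(measures), 2))  # 0.76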

LJ’s February edition omitted San Diego County Library because it had not reported Public Internet Use sessions. In the November edition, called Round Two, the library received five stars after reporting 16.5 million “Public Internet Use” sessions. Newer data on the California State Library web site reports a more likely 1.4 million. Did San Diego County, among many others, report hits rather than sessions? Didn’t the numbers surprise LJ?

For 16.5 million sessions to be correct, visitors would have had to use the internet terminals an average of 4.2 times on every single visit to the library! That is highly unlikely. IMLS, the federal agency that publishes the data, has “edit checks” that are supposed to alert data coordinators to numbers that are out of range; a sketch of such checks follows the link below. Somehow, 132 libraries in 38 states were reported as having almost every visitor use the Public Internet Terminals on every visit. For one library, IMLS published a remarkable 8 sessions for every visit. Did the process work for the latest data?
http://harvester.census.gov/imls/publib.asp
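Here is a minimal Python sketch of two such checks: the out-of-range test argued for above, and the year-over-year comparison IMLS describes in its response (quoted in the comment below). The thresholds, field names, and San Diego’s visit count (implied by the 4.2 sessions-per-visit figure) are assumptions for illustration, not IMLS’s actual edit-check rules.

    # Hypothetical edit checks; thresholds and field names are assumed.
    def sessions_per_visit(lib):
        return lib["sessions"] / lib["visits"] if lib["visits"] else 0.0

    def out_of_range(lib, max_sessions_per_visit=1.0):
        """Flag more internet sessions than visits can plausibly support."""
        return sessions_per_visit(lib) > max_sessions_per_visit

    def big_year_over_year_change(current, prior, max_change=0.5):
        """Flag a change of more than 50% against the prior year."""
        return abs(current - prior) / prior > max_change

    # Visit count implied by the 4.2 sessions-per-visit figure above.
    san_diego = {"name": "San Diego County",
                 "visits": 3_900_000, "sessions": 16_500_000}
    print(out_of_range(san_diego))  # True -- about 4.2 sessions per visit

    # Using the State Library's 1.4 million as a stand-in prior-year value:
    print(big_year_over_year_change(16_500_000, 1_400_000))  # True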

How does this affect the LJ Index Star Libraries roster? With the more reasonable 1.4 million number, wouldn’t San Diego’s score fall from 989 to 450? Rather than earning 5 stars as the 4th-ranked of 36 libraries, it would fall to 22nd and receive no stars. Isn’t that precisely what will happen in Round Three?

Am I wrong that this single correction changes the scores of every other library in the group? Because each library’s score is computed relative to the group average, correcting San Diego’s number moves that average and, with it, everyone else’s score. In all, 29 of the 36 libraries would change rankings if just this one outlier were corrected. Isn’t that a lot of volatility for a single data element?
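The ripple effect is easy to demonstrate with a toy example (hypothetical numbers, not the actual LJ Index data). Because each score here is the library’s value divided by the group mean, correcting one outlier moves the mean and, with it, every other library’s score:

    # Hypothetical group: one correction changes everyone's score.
    def scores(raw):
        mean = sum(raw.values()) / len(raw)
        return {name: round(value / mean, 2) for name, value in raw.items()}

    group = {"A": 2.0, "B": 3.0, "C": 4.0, "SDCL": 40.0}
    print(scores(group))  # {'A': 0.16, 'B': 0.24, 'C': 0.33, 'SDCL': 3.27}

    group["SDCL"] = 4.0   # correct the outlier
    print(scores(group))  # {'A': 0.62, 'B': 0.92, 'C': 1.23, 'SDCL': 1.23}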

Should LJ have left San Diego County Library out of the mix because of the questionable data? At the very least, shouldn’t they have acknowledged the problems? The LJ authors have certainly given me enough grief over the years for not warning sufficiently about the vagaries of HAPLR data.

In his blog piece Ain’t Misbehavin’!, LJ Index co-author Ray Lyons writes, “LJ Index scores are not well behaved. That is, they don’t conform to neat and tidy intervals the way HAPLR scores range from about 30 to 930.” Lyons says the LJ Index is more informative than percentile-based rankings like HAPLR, but he also notes that it has a “challenging problem” with outliers that can distort the ratings. Is that what happened here? Aren’t there other examples in other spending categories? The sketch after the link below illustrates the contrast.
http://libperformance.com/2009/04/11/aint-mishavin-uneven-lj-index-score-ranges-are-more-informative/
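Percentile-style scores like HAPLR’s are bounded and evenly spaced by construction; scores built from ratios to a group mean, as assumed above for the LJ Index, are unbounded, so a single outlier stretches the scale. A minimal Python sketch with hypothetical data values makes the contrast concrete:

    # Hypothetical values, one extreme outlier.
    raw = [10, 12, 15, 20, 200]

    # Percentile-style (the HAPLR approach): rank mapped onto a fixed scale.
    ranked = sorted(raw)
    print({v: 100 * (ranked.index(v) + 1) / len(ranked) for v in raw})
    # {10: 20.0, 12: 40.0, 15: 60.0, 20: 80.0, 200: 100.0} -- tidy intervals

    # Ratio-to-mean: unbounded, and the outlier dominates.
    mean = sum(raw) / len(raw)
    print({v: round(v / mean, 2) for v in raw})
    # {10: 0.19, 12: 0.23, 15: 0.29, 20: 0.39, 200: 3.89}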

1 comment:

  1. Thomas Hennen
    6014 Spring Street,
    Racine, WI 53406


    Dear Mr. Hennen,

    We appreciate your comments on the FY2007 Public Libraries Survey. A top-priority goal for IMLS is to administer the library survey to provide data that is used and useful for researchers, practitioners and policy makers.

    You have asked an important question about how data is collected and how edit checks are made -- and in particular the validity of the responses to questions about the number of computer users.

    The “number of computer users” variable is estimated in different ways by different libraries. Some libraries use a sampling method which involves extrapolation of users based on physical observation over a given time period; other libraries use the output statistics from reservation software.

    As you know, variables that require observation of a certain behavior, such as visitation or counts of computer uses, have a higher probability of measurement error than administrative variables such as fiscal, staffing and collection numbers.

    To date, the biggest emphasis in edit checks for that data item has been comparing the current year to the prior year and examining the difference. So in cases where the difference between prior-year and current-year data is not significant, the result is not questioned. If the difference is significant, the state data coordinator will be contacted for clarification and review of the element.

    Edit checks are an essential and time-consuming part of the survey process and we try to strike a balance between the twin goals of accuracy and timeliness. We are fortunate to have committed state data coordinators who are deeply engaged in the data collection and know their libraries well. The State Data Coordinators strive for accurate data and share data collection methods to promote best practices. Also, IMLS hosts an annual conference for data collectors with the purpose of improving practices, identifying issues and developing strategies to ensure that data quality continually improves.

    IMLS will continue to review the edit check tolerances for this and other data elements, and to encourage more consistent, standardized methods for determining the number of visitors and computer users at the branch and system level. These are important measures and the variability you point out is something we all want to address.

    Thank you for your interest in and use of the data. We want this data to be used and we want users to raise questions; this dialogue will ultimately improve the survey product and contribute to improved library practice.

    We appreciate your input and welcome any ideas you have for improving the quality of this and other data elements.

    Sincerely,

    Mamie Bittner
    Deputy Director, Office of Policy, Planning Research and Communication
    Institute of Museum and Library Services

