Definition from the AP Stylebook, 2016

-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

data journalism

  Data sources used in stories should be vetted for integrity and validity. When evaluating a data set, consider the following questions:

-What is the original source for the data? How reliable is it? Can we get answers to questions about it?

-Is this the most current version of the data set? How often is the data updated? How many years of data have been collected?

-Why was the data collected? Was it for purposes of advocacy? Might that affect the data’s reliability or completeness?

-Does the data make intuitive sense? Are there anomalies (outliers, blank values, different types of data in the same field) that would invalidate the analysis?

-What rules and regulations affect the gathering (and interpretation) of the data?

-Is there an alternative source for comparison? Does the data for a parallel industry, organization or region look similar? If not, what could explain the discrepancy?

-Is there a data dictionary or record layout document for the data set? This document would describe the fields, the types of data they contain and details such as the meaning of codes in the data and how missing data is indicated. If the data collectors used a data entry form, is the form available to review? For example, if the data entry was performed by inspectors, is it possible to see the form they used to collect the data and any directions they received about how to enter the data?
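
The anomaly question in the list above (outliers, blank values, different types of data in the same field) can be checked mechanically before any analysis begins. What follows is a minimal sketch in Python, not part of the Stylebook entry. The function name profile_field, the sample values and the 1.5 x interquartile-range outlier rule are all assumptions chosen for illustration; other rules may suit a given data set better.

```python
import statistics
from collections import Counter


def profile_field(values):
    """Count blanks, tally value types, and flag outliers in one field."""
    blanks = 0
    kinds = Counter()
    numbers = []
    for value in values:
        # Treat None and empty or whitespace-only strings as blank values.
        if value is None or (isinstance(value, str) and not value.strip()):
            blanks += 1
            continue
        try:
            numbers.append(float(value))
            kinds["number"] += 1
        except (TypeError, ValueError):
            # Anything that cannot be read as a number is counted as text.
            kinds["text"] += 1

    outliers = []
    if len(numbers) >= 4:
        # Assumed rule of thumb: flag values more than 1.5 IQRs outside
        # the first and third quartiles.
        q1, _, q3 = statistics.quantiles(numbers, n=4)
        spread = q3 - q1
        low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
        outliers = [n for n in numbers if n < low or n > high]

    return {"blanks": blanks, "kinds": dict(kinds), "outliers": outliers}


# Hypothetical inspection scores, as they might arrive in a raw export.
scores = ["12", "15", "14", "", "13", "n/a", "400", None, "16"]
report = profile_field(scores)
print(report)
```

A flagged value is a prompt to ask the data's source about it, not proof of an error; the data dictionary may explain it.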

  Data and the results of analysis must be represented accurately in stories and visualizations. Any limitations of the data must also be conveyed. If one point in the analysis is drawn from a subset of the data or a different data set altogether, explain why this was done.

  Use statistics that include a meaningful base for comparison (per capita, per dollar). Data should reflect the appropriate population for the topic: for example, use voting-age population as a base for stories on demographic voting patterns. Avoid percentage and percent change comparisons from a small base. Rankings should include raw numbers to provide a sense of relative importance.
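
The guidance on bases for comparison can be illustrated with a short Python sketch. It is not part of the Stylebook entry: the function names, the place names and figures, and the min_base cutoff of 30 are hypothetical, and the right cutoff for "a small base" is a judgment call that depends on the data.

```python
def rate_per(count, population, per=100_000):
    """Express a raw count as a rate per `per` people."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count * per / population


def percent_change(old, new, min_base=30):
    """Return the percent change, or None when the base is too small.

    min_base is an assumed threshold for illustration only.
    """
    if old < min_base:
        return None
    return (new - old) * 100 / old


# Hypothetical figures: the town has fewer incidents but a far higher rate.
town_rate = rate_per(50, 20_000)
city_rate = rate_per(400, 1_000_000)

# A rise from 2 to 4 is "100 percent" but rests on a tiny base.
small_base = percent_change(2, 4)
larger_base = percent_change(200, 250)

print(town_rate, city_rate, small_base, larger_base)
```

When percent_change returns None, reporting the raw numbers ("rose from 2 to 4") is usually clearer than a percentage.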

  When comparing dollar amounts across time, be sure to adjust for inflation. When using averages (that is, adding together a group of numbers and dividing the sum by the quantity of numbers in the group), be wary of extreme, outlier values that may unfairly skew the result. It may be better to use the median (the middle number among all the numbers being considered) if there is a large difference between the average (mean) and the median.
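
Both checks in the paragraph above reduce to simple arithmetic, sketched here in Python. This is an illustration under stated assumptions, not part of the Stylebook entry: the price index values are made up rather than real index figures, the salaries are hypothetical, and gap_ratio is an assumed threshold for what counts as "a large difference" between mean and median.

```python
from statistics import mean, median


def adjust_for_inflation(amount, index_then, index_now):
    """Convert a past dollar amount into current dollars using a price index."""
    if index_then <= 0:
        raise ValueError("index_then must be positive")
    return amount * index_now / index_then


def summarize_center(values, gap_ratio=0.25):
    """Return the mean, the median, and whether the median looks preferable.

    gap_ratio is an assumed cutoff: the median is suggested when the mean
    differs from it by more than this fraction of the median.
    """
    avg = mean(values)
    mid = median(values)
    prefer_median = abs(avg - mid) > gap_ratio * abs(mid)
    return {"mean": avg, "median": mid, "prefer_median": prefer_median}


# Made-up index values: prices rose 25 percent between the two periods.
then_in_todays_dollars = adjust_for_inflation(40_000, 100.0, 125.0)

# Hypothetical salaries with one extreme value that pulls the mean upward.
salaries = [30_000, 32_000, 35_000, 38_000, 500_000]
summary = summarize_center(salaries)

print(then_in_todays_dollars, summary)
```

In the salary example the mean is more than three times the median, which is the kind of gap that suggests reporting the median instead.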

  Correlations should not be treated as a causal relationship. Where possible, control for outside factors that may be affecting both variables in the correlation. Use round numbers where possible, particularly to avoid a false appearance of precision. Be clear about limitations of sample size in reporting on data sets. See the polls and surveys section for more specific guidance on margin of error.
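
The advice on round numbers can be applied consistently with a small helper. The Python sketch below is an illustration only, not part of the Stylebook entry; the function name round_sig and the default of two significant figures are assumptions, and the sample values are hypothetical.

```python
from math import floor, log10


def round_sig(value, sig=2):
    """Round a number to `sig` significant figures."""
    if value == 0:
        return 0
    # Position of the leading digit decides how many places to keep.
    digits = sig - 1 - floor(log10(abs(value)))
    return round(value, digits)


# Hypothetical counts: report "about 49,000" rather than 48,713.
attendance = round_sig(48_713)
budget = round_sig(1_234_567)

print(attendance, budget)
```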

  Try not to include too many numbers in a single sentence or paragraph.