Module #1: Basic Data Analysis, Newsroom Math


Sections in this Module

  • Basics of Data Analysis
  • Numbers in the Newsroom
  • Excel Exercise: Transit Data and Calculating a Rate
  • Review: Mac OSX Basics

Basics of Data Analysis

Transparency
Reliability: How sure are we that we got the right answer? That we’ve done everything correctly?
Replicability: If we had to do it all again, would we get the same answer? If someone else did it, would they?
Transparency: If our results are challenged, can we show exactly what we’ve done to defend it?
–Matt Waite

Data Analysis

— Review methodology with one or more other data people
— Check results to other available comparable data
— Ensure all record counts are consistent across stages
— Check averages
— Examine outputs to ensure logical consistency (do things that should add up to 100% add up to 100%?)
— Recheck all coding line by line if possible or in aggregate if not
— Re-read all programs/scripts
— Re-run entire analysis from scratch
— Check each number against analysis or source material prior to publication
— Recheck each number against analysis or source material on each draft

Credit: Daniel Lathrop, Dallas Morning News

AP Stylebook Entry on Data Journalism

Data sources used in stories should be vetted for integrity and validity. When evaluating a data set, consider the following questions:
–What is the original source for the data? How reliable is it? Can we get answers to questions about it?
–Is this the most current version of the data set? How often is the data updated? How many years of data have been collected?
–Why was the data collected? Was it for purposes of advocacy? Might that affect the data’s reliability or completeness? Does the data make intuitive sense? Are there anomalies (outliers, blank values, different types of data in the same field) that would invalidate the analysis?
–What rules and regulations affect the gathering (and interpretation) of the data?
–Is there an alternative source for comparison? Does the data for a parallel industry, organization or region look similar? If not, what could explain the discrepancy?
–Is there a data dictionary or record layout document for the data set? This document would describe the fields, the types of data they contain and details such as the meaning of codes in the data and how missing data is indicated. If the data collectors used a data entry form, is the form available to review? For example, if the data entry was performed by inspectors, is it possible to see the form they used to collect the data and any directions they received about how to enter the data?
Data and the results of analysis must be represented accurately in stories and visualizations. Any limitations of the data must also be conveyed. If one point in the analysis is drawn from a subset of the data or a different data set altogether, explain why this was done.
Use statistics that include a meaningful base for comparison (per capita, per dollar). Data should reflect the appropriate population for the topic: for example, use voting-age population as a base for stories on demographic voting patterns. Avoid percentage and percent change comparisons from a small base. Rankings should include raw numbers to provide a sense of relative importance.
When comparing dollar amounts across time, be sure to adjust for inflation. When using averages (that is, adding together a group of numbers and dividing the sum by the quantity of numbers in the group), be wary of extreme, outlier values that may unfairly skew the result. It may be better to use the median (the middle number among all the numbers being considered) if there is a large difference between the average (mean) and the median.
Correlations should not be treated as a causal relationship. Where possible, control for outside factors that may be affecting both variables in the correlation. Use round numbers where possible, particularly to avoid a false appearance of precision. Be clear about limitations of sample size in reporting on data sets. See the polls and surveys section for more specific guidance on margin of error.
Try not to include too many numbers in a single sentence or paragraph.
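The stylebook's warning about averages and outliers can be illustrated with a minimal sketch. The salary figures below are hypothetical and chosen only to show how one extreme value pulls the mean away from the median.

```python
from statistics import mean, median

# Hypothetical salaries for illustration only; not from any real data set.
salaries = [32_000, 35_000, 38_000, 41_000, 45_000, 400_000]

average = mean(salaries)   # add the numbers, divide by how many there are
middle = median(salaries)  # the middle value once the numbers are sorted

print(f"Mean:   ${average:,.0f}")
print(f"Median: ${middle:,.0f}")
```

With a gap this large between the two, the median is likely the fairer summary of a "typical" salary in this made-up group.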

A refresher on AP Stylebook on numbers

How to Lie With Statistics

https://www.datasciencecentral.com/profiles/blogs/how-to-lie-with-visualizations-statistics-causation-vs

 

Sheffo, Catherine. “How to Avoid 10 Common Mistakes in Data Reporting.” 
American Press Institute (blog), August 9, 2016. https://www.americanpressinstitute.org/publications/data-reporting-common-mistakes/

Writing Assignment

Write a minimum of two paragraphs on the Basics of Data Analysis readings. Discuss the two items that impressed you the most and explain why.
Due XXXX, 11:59 pm on Blackboard. 

Numbers in the Newsroom

Sarah Cohen, Math Diva

Sarah Cohen’s “Numbers in the Newsroom” is a classic in journalism numeracy. She has been a Pulitzer-winning journalist at The Washington Post, a Duke University professor and a data journalist at The New York Times, and is now a professor at Arizona State University. That’s why we read her book.

* Limit yourself to 8-12 digits, including dates such as 2012, in a single paragraph.
–This allows us to stress the most important numbers

–Simplify your story using rates, ratios or percentages. “One in four” = ratio or rate. “Forty percent” = ratio or rate. 235 deaths per 100,000 is another. See pg. 11

*Memorize some common numbers on your beat: Population of Fayetteville. Population of Arkansas. Population of the U.S. Per capita income Arkansas and U.S.

* Round off! Unless you’re dealing with really small numbers, decimal points may not be meaningful. “I’m a big fan of rounding,” Cohen said.

* To make a very small number more understandable, divide it into 1. For example, .0081 is the proportion of the U.S. population who die every year. 1/.0081 translates to about 1 in every 124 Americans dying each year.

* If you have a story filled with numbers, and not people, it needs to be really, really short.

* Portion of whole – For example, at the time of the Million Man March in 1995, a turnout of 1 million black men would have represented 1/12th of all the black men in the country at the time.
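Two of the tips above, dividing a small proportion into 1 and expressing a count as a rate per 100,000, can be sketched in a few lines. The function names are made up for this illustration, and the count and population in the second example are hypothetical. Note that the rounded proportion .0081 works out to roughly 1 in 123; the 124 quoted above may come from a less rounded proportion.

```python
def one_in_n(proportion):
    """Turn a small proportion into the N of a "1 in N" phrase."""
    return 1 / proportion


def per_100k(count, population):
    """Express a count as a rate per 100,000 people."""
    return count / population * 100_000


death_proportion = 0.0081  # proportion quoted in the tips above
n = one_in_n(death_proportion)
print(f"About 1 in {n:.0f}")

# Hypothetical count and population, for illustration only.
rate = per_100k(470, 200_000)
print(f"{rate:.0f} per 100,000")
```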

Rates and Ratios

Numbers in the Newsroom: Rates and Ratios

Class exercise: Cohen: Think in ratios – construct a ratio on the poverty beat. Memorize common numbers on the beat:

Use the Census Poverty Data: US Ark Counties Poverty ACS_16_5YR_DP03_with_ann-1w6iwss

–In the spirit of “memorizing numbers on your beat,” find three statistics about poverty in this dataset

–Construct a rate or ratio about the number of households earning $15,000 to $24,999 for the U.S., Arkansas, and the counties with the highest and lowest percentages in this category. Remember: “Percents are fractions. Fractions are percents.”
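As a rough illustration of "percents are fractions," the sketch below turns a percentage into the nearest simple fraction. The helper name and the cap of 10 on the denominator are assumptions made for this example, and 11.3 is a hypothetical percentage, not a value from the Census file.

```python
from fractions import Fraction


def percent_to_fraction(percent, max_denominator=10):
    """Return the simple fraction closest to the given percentage."""
    # Parsing the text form keeps decimals exact (11.3 becomes 113/10).
    exact = Fraction(str(percent)) / 100
    return exact.limit_denominator(max_denominator)


quarter = percent_to_fraction(25)
two_fifths = percent_to_fraction(40)
# 11.3 is a hypothetical percentage used only for illustration.
roughly_one_ninth = percent_to_fraction(11.3)

print(quarter, two_fifths, roughly_one_ninth)
```

A result such as 1/9 reads naturally in a story as "about one in nine households."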

Cohen Numbers in Newsroom - Common mistakes.pdf 
Write a paragraph with at least two questions or observations.

Excel Exercise: Transit Data and Calculating a Rate

Basic Excel:
http://www.interhacktives.com/2015/11/02/quick-tips-excel-google-sheets/

This exercise involves calculating train fatality rates.
Click here for the instructions:
Exercise4

Click here for the data:
transit

Notes:
–Create data dictionary, backup, do four corners test
–Be very careful when copying different blocks of data to a new sheet: mix-ups are easy
–Copy labels over and then delete them just to be sure all is aligned
–Class walkthrough with 2008 – 2009 derailments
–Be very specific about the headers: Total Derailments 2009, Vehicle Revenue Miles
–Word Wrap for headers
–We are constructing two derailment rates, one for 2009 and another for 2008.
–Results are 0? Wait, check the decimal tool
–Results to two decimals. Rarely more than that
–Copy of acronym definitions to data dictionary
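The note about results showing as 0 can be illustrated with a small sketch. The derailment count and mileage below are hypothetical, and scaling the rate per 1 million vehicle revenue miles is an assumption made here for readability; the exercise instructions may use a different base.

```python
# Hypothetical figures for one agency; not values from the transit file.
derailments_2009 = 3
vehicle_revenue_miles_2009 = 61_500_000

# The raw rate is tiny, which is why a spreadsheet cell showing only
# two decimals can display it as 0.
raw_rate = derailments_2009 / vehicle_revenue_miles_2009

# Assumption: scale per 1 million miles so the number is readable.
rate_per_million = raw_rate * 1_000_000

print(f"Raw rate: {raw_rate:.2f}")
print(f"Per million miles: {rate_per_million:.2f}")
```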

Exercise #1:
–Calculate derailment rates for 2008-2013 and determine the average rate. Which agency had the highest average rate?

Exercise #2:
–Calculate the rate of fatalities (excluding suicides) by total miles (vehicle revenue miles)
–Copy all of the Total Heavy Rail Fatality Sum, excluding suicides, and all of the Vehicle Revenue Miles (VRM)
–Create rates for each year, then average them
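The "create rates for each year, then average them" step might look like the sketch below. The agencies and figures are made up, only three years are shown to keep it short, and the per-million-miles scaling is an assumption rather than something the exercise specifies.

```python
from statistics import mean

# Hypothetical fatalities (excluding suicides) and vehicle revenue miles
# for two made-up agencies; the real values come from the transit file.
data = {
    "Agency A": {"fatalities": [2, 1, 3],
                 "vrm": [50_000_000, 52_000_000, 51_000_000]},
    "Agency B": {"fatalities": [5, 4, 6],
                 "vrm": [300_000_000, 310_000_000, 305_000_000]},
}


def average_rate(fatalities, vrm, per=1_000_000):
    """Compute one rate per year, then average the yearly rates."""
    yearly = [f / m * per for f, m in zip(fatalities, vrm)]
    return mean(yearly)


averages = {name: average_rate(d["fatalities"], d["vrm"])
            for name, d in data.items()}

# Rank agencies from highest to lowest average rate.
ranking = sorted(averages, key=averages.get, reverse=True)
for name in ranking:
    print(f"{name}: {averages[name]:.2f} per million miles")
```

Agency B has more fatalities in every year, yet Agency A ranks first once mileage is taken into account, which is the point of using a rate.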

Which city has the highest rate of fatalities (excluding suicides) over the last six years, and where does Chicago rank?

Exercise #3:
Over the six years, did Chicago transit have more derailments than other major city transit systems? Is it getting better or worse?

Which year was the worst for all major transit in terms of fatalities (excluding suicides)?

How many suicides happened at CTA in 2013?

What questions should I ask the DOT data clerks regarding the data?
What other data might be useful to mine after this story runs?

Resources: Excel Formulas in NICAR Coursepack

transit

Relative Risk

“Black applicants are denied mortgages at twice the rate of whites with similar incomes.”

If 20 smokers per thousand contract cancer, and yet non-smokers have a cancer rate of only 10 per thousand, the relative risk of smoking is 2.

“More than” or “less than” wording takes an extra step: compute the difference between the smokers’ rate and the non-smokers’ rate.

Example: Relative Risk; Figuring Rates – Numbers in Newsroom; MathCrib-Doig
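The smoking example can be worked through in a short sketch, using only the two rates given above. It shows both the relative risk and the extra step needed for "more than" wording.

```python
smokers_rate = 20 / 1000     # 20 per thousand, from the example above
nonsmokers_rate = 10 / 1000  # 10 per thousand

# Relative risk: one rate divided by the other ("twice the rate").
relative_risk = smokers_rate / nonsmokers_rate

# "More than" wording: take the difference first, then divide.
percent_more = (smokers_rate - nonsmokers_rate) / nonsmokers_rate * 100

print(f"Relative risk: {relative_risk:.0f}")
print(f"Smokers' rate is {percent_more:.0f}% more than non-smokers'")
```

A relative risk of 2 means "twice the rate" or "100% more than," not "200% more than."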

Excel Exercises

Click here for: Basics and Sorting in Excel
Click here for: CityBudget.xls

Make Two Folders: Original. Working.
Duplicate Spreadsheet: Right Click | Duplicate
Data Dictionary: Who are you and where did you come from
Copy sheets, Rename Tabs
Copying Formulas: The Black + Sign
Sorting
Brain Storm: Story Ideas from Sorting Difference
Formatting Data in Dollars
Percentage Change: NOO!
Part of Whole: Anchoring Values - i.e. $C$17
Basic Chart
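The "part of whole" item in the list above relies on dividing every row by the same fixed total, which is what an anchored reference such as $C$17 does in Excel. A rough equivalent, using hypothetical department budgets rather than the figures in CityBudget.xls, might look like this:

```python
# Hypothetical department budgets; CityBudget.xls holds the real figures.
budget = {
    "Police": 4_000_000,
    "Fire": 2_500_000,
    "Parks": 1_000_000,
    "Library": 500_000,
}

# The total plays the role of the anchored cell (for example $C$17):
# every row divides by this same fixed value.
total = sum(budget.values())

shares = {dept: amount / total * 100 for dept, amount in budget.items()}
for dept, share in shares.items():
    print(f"{dept}: {share:.1f}% of total")
```

The shares should add up to 100%, which is a quick logical check on the work.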

Click here for the data: UrbanPop
Click here for assignment: Exercise #1 
Answer these questions:
Sorting
–Which urban agglomeration was the largest in 1950?
–Which is expected to be the largest in 2030?
Percentage Change
Formula: (New Number - Old Number) / Old Number * 100, and use the % symbol
Create a column
What is the difference?
–copy formula
What is the percentage change?
–copy formula
 Percentage Change
–Which had greatest rate of change between 1950-2015?
–Are any urban areas expected to lose population from 2010 to 2030?
–If so, how many and which one is expected to lose the most?
–Which United States urban area is expected to have the largest percent increase from 2015 to 2030?
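The percentage change formula used in this exercise can be sketched as a small function. The population figures are hypothetical and stand in for values from the UrbanPop file.

```python
def percent_change(old, new):
    """(new - old) / old * 100, the percentage change formula."""
    return (new - old) / old * 100


# Hypothetical populations (in thousands); UrbanPop holds the real figures.
pop_1950 = 1_200
pop_2015 = 3_000

difference = pop_2015 - pop_1950
change = percent_change(pop_1950, pop_2015)

print(f"Difference: {difference:,}")
print(f"Percentage change: {change:.0f}%")
```

A negative result would indicate an urban area that lost population over the period.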


Refresher on Mac OSX operating system
Here is a short video course that you can skim through and get up to speed on how to use the Apple operating system, OSX.
https://www.linkedin.com/learning/macos-mojave-essential-training/understand-macos-the-foundation-of-working-with-a-mac?u=50849081
I would hammer through the following as soon as possible.
Chs. 1, 3 are important
Chapter 2: Finder will be crucial.
Ch. 5 on downloading from the web is important
Ch. 4, 13 should be skimmed
Chs 6-11 aren’t important for our class

To Add:

Quiz
Excel Quiz Due Sept. 7, 11:59 p.m.
https://learn.uark.edu/webapps/assessment/take/launchAssessment.jsp?course_id=_244039_1&content_id=_7327798_1&mode=cpview
Quiz: Basic Excel. See Blackboard. Quiz due Saturday, Sept. 7


Excel Exercises:
NICAR Coursepack
Intro Excel w Exercise #1


Exercise Filtering: Crime Rates and Ratios
--Find Average Crime Rate Statewide
--Filter above and below average
--Find Average Population
--Filter above and below average
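The filtering steps above amount to computing an average and then splitting the rows around it. A minimal sketch with made-up counties and crime rates follows. One caution: averaging county rates weights every county equally, so it may differ from a true statewide rate computed from total crimes and total population.

```python
from statistics import mean

# Hypothetical crime rates per 1,000 residents for made-up counties.
crime_rates = {
    "County A": 42.0,
    "County B": 18.5,
    "County C": 30.0,
    "County D": 25.5,
}

average_rate = mean(list(crime_rates.values()))

# Equivalent of filtering the column above and below the average.
above = {c: r for c, r in crime_rates.items() if r > average_rate}
below = {c: r for c, r in crime_rates.items() if r < average_rate}

print(f"Average rate: {average_rate:.1f}")
print("Above average:", sorted(above))
print("Below average:", sorted(below))
```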
NICAR coursepack: Pivot Tables
In class exercise: MLB Salaries
QUESTIONS
1) Did the National League or the American League pay more in salaries? Who has the higher average salary?
2) Which division pays the most in salaries? The least?
3) Which team had the most players on the roster?
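A pivot table groups rows by a category and then sums, averages or counts a value for each group. The sketch below does the same thing by hand for the league questions, using hypothetical player rows rather than the real MLB salary file.

```python
from collections import defaultdict

# Hypothetical player rows (league, team, salary); not real MLB data.
players = [
    ("AL", "Team A", 5_000_000),
    ("AL", "Team A", 1_000_000),
    ("AL", "Team B", 3_000_000),
    ("NL", "Team C", 4_000_000),
    ("NL", "Team C", 2_000_000),
    ("NL", "Team D", 9_000_000),
]

totals = defaultdict(int)  # sum of salaries per league
counts = defaultdict(int)  # number of players per league
for league, team, salary in players:
    totals[league] += salary
    counts[league] += 1

averages = {league: totals[league] / counts[league] for league in totals}

for league in sorted(totals):
    print(f"{league}: total ${totals[league]:,}, "
          f"average ${averages[league]:,.0f}")
```

Grouping by team or division instead of league only requires changing which field is used as the key.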
In-class exercise WorldBank
Using the WorldBank data, build a Pivot Table.
--Trick:  Shift+Ctrl+8 
--Produce a list ranking the countries with the most companies debarred, sorted descending. Copy the results and paste into a new tab.
--Produce a list of the firms that have more than one debarment, sorted descending. Copy the results and paste into a new tab.
What is the most common violation, and how many times did it occur?
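The two ranked lists in this exercise are counts per category sorted in descending order. A minimal sketch with made-up firm and country names (the WorldBank file holds the real records) could look like this:

```python
from collections import Counter

# Hypothetical (firm, country) rows; the WorldBank file holds the real list.
rows = [
    ("Firm A", "Country X"),
    ("Firm B", "Country X"),
    ("Firm A", "Country X"),
    ("Firm C", "Country Y"),
    ("Firm D", "Country Z"),
    ("Firm C", "Country Y"),
    ("Firm A", "Country X"),
]

# Count rows per country; most_common() sorts in descending order.
by_country = Counter(country for _, country in rows).most_common()

# Count rows per firm, then keep only firms that appear more than once.
by_firm = Counter(firm for firm, _ in rows)
repeat_firms = [(firm, n) for firm, n in by_firm.most_common() if n > 1]

print(by_country)
print(repeat_firms)
```

The same counting approach answers the "most common violation" question if the violation column is used as the key.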


Pivot Table
Class Exercise: Student Loans - Pivot Table
Analyze Student Loan Data
Sort the Data:
-Sort by Schools with Largest Enrollment. Write a text answer with the top five schools by enrollment.
-Sort by Schools with Highest Median Debt for Graduates. Write a text answer with the top five schools by Median Debt for Graduates.
-Sort by Schools with Highest Median Debt for Students Who Withdrew. Write a text answer with the top five schools by Highest Median Debt for Students Who Withdrew.
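Each sorting task above is a descending sort followed by taking the first five rows. The sketch below uses hypothetical schools and field names; the student loan file holds the real figures and column headers.

```python
# Hypothetical schools; the student loan file holds the real figures.
schools = [
    {"name": "School A", "enrollment": 12_000, "median_debt": 21_000},
    {"name": "School B", "enrollment": 30_000, "median_debt": 18_500},
    {"name": "School C", "enrollment": 8_000, "median_debt": 27_000},
    {"name": "School D", "enrollment": 22_000, "median_debt": 24_000},
    {"name": "School E", "enrollment": 15_000, "median_debt": 19_000},
    {"name": "School F", "enrollment": 5_000, "median_debt": 31_000},
]


def top_five(rows, field):
    """Sort rows by the given field, largest first, and keep five."""
    return sorted(rows, key=lambda row: row[field], reverse=True)[:5]


by_enrollment = [s["name"] for s in top_five(schools, "enrollment")]
by_debt = [s["name"] for s in top_five(schools, "median_debt")]

print("Top five by enrollment:", by_enrollment)
print("Top five by median debt:", by_debt)
```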




 


Megan Putney, Mike's Hard Lemonade
https://training.uark.edu/professional-development/courses/tableau.php