Module #6: Writing About Data

Writing Style Notes
 
Don’t use this “respectively” construction. 
It confuses the readers and leads to data errors.
Other private trade schools also made the top 10 list, such as Philander Smith in Little Rock and Bryan University in Rogers, with increases of 81 percent and 74 percent respectively.
 
ITT Technical Institute, the University of Phoenix, Philander Smith College and Bryan University are all classified as private schools and are the top four schools with graduated student loan debt in the state. Their debt has increased 106 percent, 102 percent, 81 percent and 74 percent respectively since 2012.

Common Errors – Math
Percent vs Percentage Point
At Lyon College, 67 percent of non-first-generation students paid back their loans within five years, while only 53 percent of first-generation students did the same, which results in a 14 percent POINT difference. The median debt for both types of students was the same though, at $12,000.
You mean “percentage point.” 14 percent of 67 is 9.4.
Steve Doig – MathCrib-Doig
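A quick way to check the percent vs. percentage point distinction is to compute both. This is a minimal sketch using the Lyon College figures quoted above; the function names are made up for illustration.

```python
def percentage_point_gap(rate_a, rate_b):
    """Difference between two percentages, expressed in percentage points."""
    return rate_a - rate_b


def percent_difference(rate_a, rate_b):
    """How much larger rate_a is than rate_b, expressed in percent."""
    return (rate_a - rate_b) / rate_b * 100


non_first_gen = 67  # percent who repaid within five years
first_gen = 53      # percent who repaid within five years

gap_points = percentage_point_gap(non_first_gen, first_gen)
gap_relative = percent_difference(non_first_gen, first_gen)
fourteen_pct_of_67 = 0.14 * non_first_gen

print(f"Gap: {gap_points} percentage points")
print(f"Non-first-generation rate is {gap_relative:.1f} percent higher")
print(f"14 percent of 67 is {fourteen_pct_of_67:.1f}")
```

The three results are different numbers, which is why the wording matters.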

AP Style with Numbers

College Scorecard Data

Learning to Love Data Dictionary


https://collegescorecard.ed.gov/data/
All 1,826 columns explained here.
https://collegescorecard.ed.gov/assets/FullDataDocumentation.pdf
https://collegescorecard.ed.gov/assets/CollegeScorecardDataDictionary.xlsx
Group Assignment: Examine The Data Dictionary. Sketch out the broad categories describing the data. Hint: There’s a section on demographics – race, ethnicity. 
Consult this article for ideas about the types of data you can examine
http://libertystreeteconomics.newyorkfed.org/2016/09/the-changing-role-of-the-community-college-and-for-profit-college-borrowers.html
Questions:
What are the broad groupings?
What questions do you want to answer with this data and which data fields should we look at next?
Compose your thoughts at the bottom of this Google Doc:

Student Loan Data

Private vs. Public Colleges: How Does The Debt Load Differ? 
Return of the Data Dictionary: Student Loan Dataset
https://collegescorecard.ed.gov/data/
All 1,826 columns explained here.
1) Find the field that distinguishes between public and private colleges:
https://collegescorecard.ed.gov/assets/FullDataDocumentation.pdf
https://collegescorecard.ed.gov/assets/CollegeScorecardDataDictionary.xlsx
2) Build a new spreadsheet from this raw data: AR2016ALL-1jwseiw
Name your new sheet ARDebt9_19  
It will have these fields:
      A) The private / public colleges field
      B) Latitude and Longitude
      C) The 8 data fields we have used so far in ARDebt.
      D) A data dictionary
 
3) Make Charts in Tableau
Import ARDebt9_19 into Tableau.
Create a chart of the overall median debt, median private, median public college debt.
Create another one, public, private for graduates and discontinued students.
Revel in your nerd powers
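If you want to double-check the Tableau medians outside Tableau, the same grouping can be sketched in a few lines of Python. The rows below are made up for illustration, and "school_type" is a hypothetical stand-in for whichever public/private field you found in step 1.

```python
from collections import defaultdict
from statistics import median

# Made-up rows for illustration only; real values come from ARDebt9_19.
rows = [
    {"school": "College A", "school_type": "Public", "grad_debt_mdn": 21000},
    {"school": "College B", "school_type": "Public", "grad_debt_mdn": 19500},
    {"school": "College C", "school_type": "Private", "grad_debt_mdn": 27000},
    {"school": "College D", "school_type": "Private", "grad_debt_mdn": 25000},
    {"school": "College E", "school_type": "Private", "grad_debt_mdn": 31000},
]


def median_debt_by_type(rows):
    """Group each school's median debt by school type, then take the median of each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["school_type"]].append(row["grad_debt_mdn"])
    return {school_type: median(values) for school_type, values in groups.items()}


overall = median(row["grad_debt_mdn"] for row in rows)
by_type = median_debt_by_type(rows)

print("Overall:", overall)
for school_type, value in by_type.items():
    print(school_type, value)
```

Keep in mind this is a median of school-level medians, not the median across all borrowers, so describe it that way in your copy.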

Add Context

Context
Overall averages, medians for particular topics.
What is the median student loan debt in 2016?
What is the median student loan debt for for-profit colleges in 2016?
What is the average default rate?


Context #1
Add the Quick Facts for city population, demographics.
Little Rock: African Americans make up 42 percent of Little Rock’s population. https://www.census.gov/quickfacts/fact/table/littlerockcityarkansas,US/PST045217
Add typical salary from Occupational Employment Statistics database for Arkansas
https://www.bls.gov/oes/current/oes_ar.htm

Data Adventure

 
 
FY 2015 national cohort default rate is 10.8%, down from 11.5% in the previous year. 
 
By comparison, Arkansas’ cohort default rates dropped from 12.2% to 11.2%, inching closer to the national average. Five years ago, when ASLA initiated efforts to lower the state’s default rate, Arkansas ranked near the bottom of all 50 states at 49th. During that period, Arkansas’ default rate had declined from 19% in 2010 to the more manageable 12.2% in 2014.
 
Arkansas’ 351,000 student loan borrowers carried more than $10.8 billion in debt at the end of 2017.
 
According to USDE data, Arkansas’ 351,000 college tuition borrowers carried an average of $26,799 in student loan debt per student, slightly less than the $27,857 average balance held by the other 43.6 million Americans who are still paying for college.
 
By far, college-goers attending the University of Arkansas at Fayetteville received the largest portion of federal student loan proceeds in the 2017-2018 academic year, when $107.6 million was handed out to enrollees at the state’s largest university.
 
Arkansas State University in Jonesboro received the second-largest share of federal student loans at nearly $91.9 million in the 2017-2018 academic year.
 
Question: Where would we find these numbers in our current dataset?
https://collegescorecard.ed.gov/assets/FullDataDocumentation.pdf
https://collegescorecard.ed.gov/assets/CollegeScorecardDataDictionary.xlsx
New Data – or Is It?
https://collegescorecard.ed.gov/data/changelog/
 

 
Fine Print:
1 NSLDS-derived data elements will also be updated in the fall of 2018.



More Student Loan Data
Dept of Ed Release
https://www.ed.gov/news/press-releases/national-student-loan-cohort-default-rate-falls
 
Instructions in Default Rates
https://www2.ed.gov/offices/OSFAP/defaultmanagement/instructions.html
 
Title IV Data
https://studentaid.ed.gov/sa/about/data-center/student/title-iv
 
College InSight – Detailed Debt Data
http://college-insight.org/#explore/go&h=6d38087e46818e79ac32f5fab17eea08
 
Details on Student Loan Delinquencies by Type of Loan
https://studentaid.ed.gov/sa/about/data-center/student/portfolio
Federal Student Loan Portfolio: The office of Federal Student Aid is responsible for directly managing or overseeing an outstanding federal student loan portfolio comprised of billions of dollars in Title IV loans and representing millions of borrowers. This federal student loan portfolio includes Direct Loans, Federal Family Education Loans (FFEL), and Perkins Loans with outstanding balances. The reports below provide information about the federal student loan portfolio.
 
 
Total disbursements by institution
https://studentaid.ed.gov/sites/default/files/fsawg/datacenter/library/DL_Dashboard_AY2017_2018_Q3.xls
 
PowerStats Tool
The National Center for Education Statistics has a PowerStats tool that requires registration. I got a password in a minute.
https://nces.ed.gov/datalab/powerstats/default.aspx
Once you log in, you can build your own tables with the National Postsecondary Student Aid Study: 2016 Undergraduates.

Portfolio-by-Location-by-Debt-Size
DL-by-Delinquency-Location
Portfolio-by-Location-by-Age***
Portfolio-by-Location
https://studentaid.ed.gov/sa/sites/default/files/fsawg/datacenter/library/Portfolio-by-Location-by-Age.xls

Data Updates

From Fall 2018
All — CollegeScorecard has put out updated data that requires us to update our data visualizations. Many of the charts were built with Sept. 26 data. There is an Oct. 30 dataset that makes enough changes on default rates, debt and other metrics that we should use it. These differences became apparent in the editing process. Everyone used College Scorecard data so we all have to deal with the update.
Second, anyone using Debt_Mdn should be using GRAD_DEBT_MDN instead in their charts. The reason? Debt_Mdn is artificially reducing the scale of the student loan debt.
Debt_Mdn is aggregated by institution. 
GRAD_DEBT_MDN is aggregated by individuals. So someone who transferred from Hendrix to UofA would have that Hendrix debt factored in when they graduate from UofA.
Below is a video on how to update your CollegeScorecard visualizations in Tableau.
Here are the basic steps:
1) Create a new Tableau workbook and import the Oct. 30 data. It is here: ARDebt10-30-18
2) Open your existing Tableau workbook that used the Sept 26 data. Select the visualization, select its tab, left click and copy
3) Move to the new Tableau workbook with the Oct30 data. Select a blank worksheet. Left click and paste
Your visualization has now been pasted with the Sept. 26 data.
4) On the Tableau toolbar, Select Data | Replace Data Source. It will say existing source is Sept 26. Select Oct 30 data as replacement
5) BOOM. You are done.
Things that can go wrong:
–Ensure your data that you want to chart is listed in the measures pane. Grad_Debt_Mdn or CDR3 may need to be converted to measures for this to work properly.
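Before rebuilding charts, it can help to see which values actually changed between the two releases. This is a rough sketch with hypothetical numbers, not real College Scorecard figures; the school names and the dictionary layout are assumptions for illustration.

```python
# Hypothetical values for illustration; not real College Scorecard figures.
sept_release = {
    "College A": {"GRAD_DEBT_MDN": 21000, "CDR3": 0.112},
    "College B": {"GRAD_DEBT_MDN": 25000, "CDR3": 0.095},
}
oct_release = {
    "College A": {"GRAD_DEBT_MDN": 21500, "CDR3": 0.112},
    "College B": {"GRAD_DEBT_MDN": 25000, "CDR3": 0.090},
}


def changed_fields(old, new):
    """List (school, field, old value, new value) for every value that differs."""
    changes = []
    for school, old_fields in old.items():
        new_fields = new.get(school, {})
        for field, old_value in old_fields.items():
            new_value = new_fields.get(field)
            if new_value != old_value:
                changes.append((school, field, old_value, new_value))
    return changes


changes = changed_fields(sept_release, oct_release)
for school, field, old_value, new_value in changes:
    print(f"{school}: {field} changed from {old_value} to {new_value}")
```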

Module #8: Data Cleaning

Census Data and Data Cleaning



Filtering
Reading Data Dictionaries


Data Cleaning in Excel:
=Trim, Paste Special, Values, Transpose, Find and Replace
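For comparison, here is roughly what three of those Excel tools do, sketched in Python. The rows are made up, and the helper names are hypothetical; the point is only to show the logic behind =TRIM, Find and Replace, and Transpose.

```python
# Made-up rows for illustration, with stray spaces like a raw Census download.
raw_rows = [
    ["Geography", "  Total ", "White  alone"],
    ["Benton County, Arkansas ", "1000", "800"],
    [" Pulaski County, Arkansas", "2000", "1100"],
]


def trim(cell):
    """Like =TRIM: drop leading/trailing spaces and collapse repeated spaces."""
    return " ".join(cell.split())


def clean(rows):
    """Trim every cell, then find-and-replace the county suffix with nothing."""
    trimmed = [[trim(cell) for cell in row] for row in rows]
    return [[cell.replace(" County, Arkansas", "") for cell in row] for row in trimmed]


def transpose(rows):
    """Like Paste Special | Transpose: rows become columns."""
    return [list(column) for column in zip(*rows)]


cleaned = clean(raw_rows)
flipped = transpose(cleaned)

for row in flipped:
    print(row)
```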

Basic Population – Race Census Data Download Instructions

Census: data.census.gov


https://data.census.gov

Data Cleaning Exercises
Pivot Tables, Data Cleaning: =Trim, Paste Special, Values, Transpose, Find and Replace



1) Advanced Search
–Topics | Geography | Years | Surveys | Codes

2) Topics | Race and Ethnicity | White
–Note that the “White” filter displays below

3) Geography | County | Arkansas | All Counties in Arkansas
–Note that the “All counties in Arkansas” filter displays

4) Search!

5) Select Table Named RACE
American Community Survey
Total Population
TableID: B02001

6) Switch to 2016: ACS 5-Year Estimates Detailed Tables

7) Customize Table. Download. Make Sure to Download 2016: ACS 5-Year Estimates Detailed Tables


Clean Census Data
1) Create Data Dictionary
2) Duplicate Sheet
3) Four corners select and copy
4) New Sheet. Paste Special | Transpose
–the races are now the rows
–Filter by Estimate: Contains Estimate, Delete
5) Edit Headers: White, Black, Hispanic
6) Check totals – do they add up?
7) Delete these columns: Two races including Some other race; Two races excluding Some other race, and three or more races
8) Save and Load to Tableau
9) Build an Arkansas Population Map by Race

Build Arkansas Population Map By Race
--Clean the County Field ", Arkansas"
--Create Calculations for Percentage Population by Race: Calculated Fields
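The calculated field is just part divided by whole. Here is a minimal sketch of that math, plus the "do the totals add up?" check from the cleaning steps. The counts and the simplified race fields are made up; real values come from table B02001.

```python
# Made-up counts for illustration; real values come from table B02001.
counties = [
    {"county": "Benton County, Arkansas", "total": 1000, "white": 800, "black": 50, "other": 150},
    {"county": "Pulaski County, Arkansas", "total": 2000, "white": 1100, "black": 700, "other": 200},
]
RACE_FIELDS = ("white", "black", "other")


def percent_of_total(part, total):
    """Share of the total population, as a percentage rounded to one decimal."""
    return round(part / total * 100, 1)


def summarize(row):
    """Clean the county label, compute each race's share and check the totals."""
    name = row["county"].replace(", Arkansas", "")
    shares = {race: percent_of_total(row[race], row["total"]) for race in RACE_FIELDS}
    adds_up = sum(row[race] for race in RACE_FIELDS) == row["total"]
    return {"county": name, "shares": shares, "adds_up": adds_up}


summaries = [summarize(row) for row in counties]
for summary in summaries:
    print(summary)
```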

Income by Race



https://data.census.gov

Advanced Search
Filters | Geography
Counties | Arkansas | All counties
Filters | Topics | Income and Poverty
Filters | Topics | Race and Ethnicity
Filters | Years | 2016

Filters | Text Search in Find a Filter: “Income” | Select “Income (Households, Families, Individuals)”

Search


Download White Only, Black Only, Hispanic or Latino Householder

Your tables will say this:
HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2016 INFLATION-ADJUSTED DOLLARS) (WHITE ALONE HOUSEHOLDER)
Survey/Program: American Community Survey
Product:
2016: ACS 1-Year Estimates Detailed Tables


Tables: B19001A, B19001B, B19001I 

Download – select .csv


Tableau

Clean Data as described in previous lesson
Combine the three tables in Tableau, using the income bracket as the common field.
Create a chart

PAST TUTORIALS IN AMERICAN FACT FINDER.

NEED TO REVISE


Household income data for counties and state and national. Gender and demographics of low-wage workers
American FactFinder
https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml
Advanced Search | Show Me All
Topics | People | Poverty
Geographies | County | Arkansas | All Counties Within Arkansas
Select Table S1701, Poverty Status in the Past 12 Months
Modify Table
—Select top Filter
—Total and Percent Below Poverty Level
—Select second Filter
—Keep Estimate, do not check margin of Error
Download

The 30:00 mark of this recorded webinar shows how to use FactFinder:
https://www.census.gov/data/training-workshops/recorded-webinars/measuring-america.html

Selected Economic Characteristics DP03 2012-2016 American Community Survey 5-Year Estimates

Standard Data Cleaning

Data is: ACS_16_5YR_DP03 DP03 SELECTED ECONOMIC CHARACTERISTICS   2012-2016 American Community Survey 5-Year Estimates
Copy the main data sheet and name the copy “wages below $25k.”
Delete all data fields except the headers and these columns:
HC01_VC74 Estimate; INCOME AND BENEFITS (IN 2016 INFLATION-ADJUSTED DOLLARS) – Total households
HC01_VC75 Estimate; INCOME AND BENEFITS (IN 2016 INFLATION-ADJUSTED DOLLARS) – Total households – Less than $10,000
HC03_VC75 Percent; INCOME AND BENEFITS (IN 2016 INFLATION-ADJUSTED DOLLARS) – Total households – Less than $10,000
HC01_VC76 Estimate; INCOME AND BENEFITS (IN 2016 INFLATION-ADJUSTED DOLLARS) – Total households – $10,000 to $14,999
HC03_VC76 Percent; INCOME AND BENEFITS (IN 2016 INFLATION-ADJUSTED DOLLARS) – Total households – $10,000 to $14,999
HC01_VC77 Estimate; INCOME AND BENEFITS (IN 2016 INFLATION-ADJUSTED DOLLARS) – Total households – $15,000 to $24,999
HC03_VC77 Percent; INCOME AND BENEFITS (IN 2016 INFLATION-ADJUSTED DOLLARS) – Total households – $15,000 to $24,999
HC01_VC85 Estimate; INCOME AND BENEFITS (IN 2016 INFLATION-ADJUSTED DOLLARS) – Total households – Median household income (dollars)

–Rotate header rows, wrap text.
Shrink the verbiage from “Estimate; INCOME AND BENEFITS (IN 2016 INFLATION-ADJUSTED DOLLARS) – Total households” to “Total households.” The shortened headers:
Total households
%Total households
Total households – <$10k
%Total households – <$10k
Total households – $10k to $14,999
%Total households – $10k to $14,999
Total households – $15,000 to $24,999
%Total households – $15,000 to $24,999
Median household income $
–Specify Arkansas-state
Then find/replace to eliminate “County, Arkansas” from geography labels.
Create a Total Under $25k column:
Add Total households – <$10k + Total households – $10k to $14,999 + Total households – $15,000 to $24,999
Create % Under $25k column (total Under $25k / total households)
Copy formulas down
Check math
When satisfied, copy and paste values
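The Under $25k math can be sketched like this, which is also a handy way to check your spreadsheet formulas. The household counts and the short field names are made up for illustration; real values come from the DP03 table.

```python
# Made-up household counts for illustration; real values come from ACS table DP03.
counties = [
    {"county": "County A", "total_households": 10000,
     "under_10k": 900, "10k_to_14999": 700, "15k_to_24999": 1400},
    {"county": "County B", "total_households": 4000,
     "under_10k": 500, "10k_to_14999": 300, "15k_to_24999": 600},
]
LOW_BRACKETS = ("under_10k", "10k_to_14999", "15k_to_24999")


def under_25k(row):
    """Return (households under $25k, percent of all households under $25k)."""
    total_under = sum(row[bracket] for bracket in LOW_BRACKETS)
    percent_under = round(total_under / row["total_households"] * 100, 1)
    return total_under, percent_under


results = {row["county"]: under_25k(row) for row in counties}
for county, (total_under, percent_under) in results.items():
    print(f"{county}: {total_under} households, {percent_under} percent under $25k")
```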

More on Data Cleaning Census spreadsheets
 
–Download the view and the data versions of large spreadsheets: one to guide you, the other to do the work.
–Merge / unmerge cells
–Find-Replace
— =CONCATENATE(B3, B4).
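Census downloads often split a column label across two header rows, which is where CONCATENATE comes in. Here is a sketch of the same idea in Python; the header labels are hypothetical. Note that =CONCATENATE(B3, B4) joins with no space, so in Excel you would write =CONCATENATE(B3, " ", B4) to get the spacing shown here.

```python
# Hypothetical two-row header, like rows 3 and 4 of a Census spreadsheet.
top_row = ["", "Total", "Total", "Below poverty level"]
bottom_row = ["Geography", "Estimate", "Margin of Error", "Estimate"]


def merge_headers(top, bottom, separator=" "):
    """Join each pair of stacked header cells into one label, skipping blanks."""
    return [
        separator.join(part for part in (upper, lower) if part)
        for upper, lower in zip(top, bottom)
    ]


headers = merge_headers(top_row, bottom_row)
print(headers)
```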

Census Demographic data


Household income data for counties and state and national. Gender and demographics of low-wage workers
American FactFinder
https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml
Advanced Search | Show Me All
Topics | People | Poverty | Poverty (added to Your Selections)
Geographies | County | Arkansas | All Counties Within Arkansas
Geographies | United States
Geographies | Arkansas
Select Table S1701, Poverty Status in the Past 12 Months
Modify Table
—Select top Filter
—Total and Percent Below Poverty Level
—Select second Filter
—Keep Estimate, do not check margin of Error
Download
—Use the Data
Download Again
—View the Data
—Excel spreadsheet
Questions about categories and definitions:
See “Table Notes” to far right on factfinder website after you’ve generated a table.
https://www2.census.gov/programs-surveys/acs/tech_docs/subject_definitions/2016_ACSSubjectDefinitions.pdf
Read: “Poverty Status in the Past 12 Months”
“Poverty Status of Households”
Definitions: Working Poor
–Poverty thresholds:
The actual poverty thresholds vary with the makeup of the family. In 2015, the weighted average poverty threshold for a family of four was $24,257; for a family of nine or more people, the threshold was $49,177; and for one person (see Unrelated individuals), it was $12,082. Poverty thresholds are updated each year to reflect changes in the Consumer Price Index for All Urban Consumers (CPI-U). Thresholds do not vary geographically. (For more information, see “Income and poverty in the United States: 2015.”)
https://www.bls.gov/opub/reports/working-poor/2015/home.htm#unrelatedindividual
Weighted Average Poverty Thresholds in 2015 by Size of Family (Dollars)
One person 12,082
Two people 15,391
Three people 18,871
Four people 24,257
Five people 28,741
Six people 32,542
Seven people 36,998
Eight people 41,029
Nine people or more 49,177
Source: U.S. Census Bureau.
https://www.census.gov/content/dam/Census/library/publications/2016/demo/p60-256.pdf
https://www.census.gov/data/tables/time-series/demo/income-poverty/historical-poverty-thresholds.html
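The threshold table above works as a simple lookup. This is a rough sketch using the 2015 weighted averages listed in these notes; because actual thresholds vary with the makeup of the family, treat it as a quick screen and not an official determination. The function names are made up.

```python
# 2015 weighted average poverty thresholds by family size, from the table above.
# Key 9 stands for "nine people or more."
THRESHOLDS_2015 = {
    1: 12082, 2: 15391, 3: 18871, 4: 24257, 5: 28741,
    6: 32542, 7: 36998, 8: 41029, 9: 49177,
}


def poverty_threshold(family_size):
    """Look up the weighted average threshold, capping family size at nine."""
    if family_size < 1:
        raise ValueError("family_size must be at least 1")
    return THRESHOLDS_2015[min(family_size, 9)]


def is_below_poverty(income, family_size):
    """True if income falls below the weighted average threshold."""
    return income < poverty_threshold(family_size)


print(poverty_threshold(4))
print(is_below_poverty(20000, 4))
print(is_below_poverty(25000, 4))
```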


Cleaned and downloaded 2011-2015 estimates with detailed poverty metrics
Ark Counties full income search 5-10-17 ACS_15_5YR_DP03

Students assigned geographical location for Census data.

Questions:

–Number and Percentage of Minimum Wage Households?

–Compare to National, State Averages

–Produce basic Tableau chart

Data Cleaning and Joining Exercise


--You will join the CollegeScorecard data with 2017 Census data on average household income.
The goal is to compare the average student debt in a college to the income in the surrounding town.
This in-class task will stretch over two class sessions, unless you are a DATA KING OR QUEEN
This exercise builds on the data analysis, cleaning and visualization skills you learned this semester.


Task 1: Retrieve the Census Data
Use Excel
--Examine the data dictionary.
--Examine the data, the range of incomes and number of cities, towns and places
--Create a copy of the Census sheet for the data cleaning
Task 2: Data Cleaning
You will need to match the town in the Census to the city in College Scorecard.
Look at the "city" column in the College Scorecard data: ARDebt17_10_23.csv
--Tip: Data cleaning tools in Excel: Text to columns and find and replace
Task 3: Joining
--Join the Census data to the ARDebt17_10_23.csv in Tableau
--Chart the Income and the Grad debt by the 10 largest public schools
Task 4: Analysis
--Construct a Ratio of Grad Debt to Per Capita Income. Map it
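The cleaning, joining and ratio steps can be sketched together. Everything below is made up for illustration: the school names, dollar figures and field names are assumptions, and the place-name pattern assumes Census labels like "Fayetteville city, Arkansas." Check your own download before relying on it.

```python
import re

# Made-up values for illustration only.
colleges = [
    {"school": "College A", "city": "Fayetteville", "grad_debt_mdn": 21000},
    {"school": "College B", "city": "Little Rock", "grad_debt_mdn": 24000},
    {"school": "College C", "city": "Conway", "grad_debt_mdn": 18000},
]
census = [
    {"place": "Fayetteville city, Arkansas", "per_capita_income": 28000},
    {"place": "Little Rock city, Arkansas", "per_capita_income": 32000},
]


def clean_place(place):
    """Strip the place type and state so the label matches the college city."""
    return re.sub(r"\s+(city|town|CDP),\s*Arkansas$", "", place.strip())


def debt_to_income(colleges, income_by_city):
    """Join on city and compute grad debt divided by per capita income."""
    joined = []
    for college in colleges:
        income = income_by_city.get(college["city"])
        ratio = round(college["grad_debt_mdn"] / income, 2) if income else None
        joined.append({**college, "per_capita_income": income, "debt_to_income": ratio})
    return joined


income_by_city = {clean_place(row["place"]): row["per_capita_income"] for row in census}
joined = debt_to_income(colleges, income_by_city)
for row in joined:
    print(row["school"], row["city"], row["debt_to_income"])
```

Rows with no match come back as None. Those are the cities to look at again in the cleaning step.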


 

Module #2: Organize Your Data

This module addresses:
–Best practices in data management
–Organizational tips for files
–Data documentation skills

Staying organized is a key problem for beginning data students.
You can’t find files.
You have duplicate files and struggle to find the latest version.
Your data software fails because it can’t find your files.
You can’t remember where you got the source data or what the headers mean.
You waste hours with this stuff when you really should be reporting.

I want to put an end to this nightmare. These organizational tools below are essential.

Storage

Create a folder on your hard drive, call it Dataspring21, and put all class materials in that folder. 
--Within Dataspring21, create the following folders.
     1) Final_Work
     2) Older_Files
Other people will have these standard folders in their projects. It is up to you whether to use them:
--Data
--Scripts
--Storage
--Output

BACK IT UP. Early, often, always, constantly.

Purchase a 2 TB USB external storage drive, such as this

Organize Your Data: Finder

Finder, not always up for the job

1. Sort by grid, by date.
–This allows you to see the latest version of your files.

2. Path name.
–Follow this convention: Description of File With Some Detail, Date. If you are editing something, put your initials at the end.
–e.g.: Covid_Master_File_Jan_11_2021-rsw

3. Copying File Paths from the Mac Finder.
Navigate to the file or folder you wish to copy.
Right-click (or Control+Click, or a Two-Finger click on trackpads) on the file or folder in the Mac Finder
While in the right-click menu, hold down the OPTION key to reveal the “Copy (item name) as Pathname” option; it replaces the standard Copy option
Once selected, the file or folder’s path is now in the clipboard, ready to be pasted anywhere
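If you save files from a script, the naming convention in item 2 can be generated automatically. This is a small sketch; the function name is hypothetical, and it simply reproduces the Covid_Master_File example above.

```python
from datetime import date

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def build_filename(description, file_date, initials=""):
    """Build Description_Mon_D_YYYY, with -initials at the end if you edited it."""
    words = "_".join(description.split())
    stamp = f"{MONTHS[file_date.month - 1]}_{file_date.day}_{file_date.year}"
    name = f"{words}_{stamp}"
    return f"{name}-{initials}" if initials else name


filename = build_filename("Covid Master File", date(2021, 1, 11), "rsw")
print(filename)
```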

Data Diary

This is a brief description of the source of your data, the actions you took, the commands you ran and the thinking behind what you are doing.
Data Biography Template
Interviewing Data
Sample Data Diary Entry

Data Dictionary

For all of your spreadsheets, create a separate tab and include the following:
--Full name of the data set
--Full URL
--Date of data
--Any code book to define the column or row headings

For your Tableau and R scripts, note the same in a Data Dictionary file that you can access 
easily in cloud storage. Suggestion: Evernote (free version, storage limits); Notepad in iOS; 
Note in Gmail; Google Doc; Notes in iOS; Stickies in iOS; etc.
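For scripts, one option is to keep the data dictionary as a small structured file next to your code. This is a sketch under assumptions: the field names are made up, and the column descriptions are paraphrased placeholders that you should verify against the official College Scorecard data dictionary.

```python
import json


def make_data_dictionary(name, url, data_date, columns):
    """Bundle the four items every data dictionary should carry."""
    return {
        "dataset_name": name,
        "url": url,
        "date_of_data": data_date,
        "columns": columns,
    }


entry = make_data_dictionary(
    name="College Scorecard, Arkansas extract",
    url="https://collegescorecard.ed.gov/data/",
    data_date="2018-10-30",
    columns={
        # Placeholder descriptions; confirm wording in the official dictionary.
        "GRAD_DEBT_MDN": "Median debt of students who completed",
        "CDR3": "Three-year cohort default rate",
    },
)

text = json.dumps(entry, indent=2)
print(text)
```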

Data Diary Examples
The following material was posted on NICAR-L, a listserv for data journalists. There are some great examples of how the pros use data diaries / data dictionaries in their workflow.

1) Geoff
This is a great question, and I’m finding as I think through my response that it’s helpful to remind myself of good practices.

I use Jupyter notebooks for when I’m doing analysis or exploration in Python or SQL and R Markdown for when I’m doing it in R. However, I would stress that any data diary you keep and keep in a detailed way that is useful to you and others later, regardless of format, is better than the one you don’t.

https://github.com/newsapps/public-notebooks/blob/master/Shooting%20victims%20by%20block.ipynb is an example of a representative but not great notebook for a small data task.

A few things that I try (but don’t always succeed) to do:

– Link to the source data, summary reports and codebooks near the top of my notebook. This is both a convenience to me, because I refer to these often, and especially to others who may not have seen those things before.
– Put a high level summary of why I’m interested in the data and what I’m trying to find at the top of the notebook. This keeps me focused as I’m doing my exploration and also is helpful for others who might be skimming.
– Keep a parking lot of questions (or potential concerns about validity or cleanliness of data) near the top of the notebook. That way I can quickly capture things I think about as I’m exploring or analyzing the data, while still staying focused.
– Near the end of my day (or the first thing the next morning), do a quick pass over a notebook I worked on during the day. Do my notes still make sense? Are they as clear as they could be? If not, try to clean them up.  If I don’t have time at the moment, I at least leave a “TODO” note to flag the section as needing some love.
– Share the notebook with someone else as early as possible, even if you’re still in-progress. This is the most helpful way to know if I’m capturing your process with enough granularity. Or maybe I’m getting too granular. If so, is there a way to summarize  process and findings at the top of a section?
– If using code, don’t give a play-by-play of the code in text. Instead, describe what I’m trying to find out, why it’s important and why I’m taking a particular approach. Also note any assumptions my code is making.

Hopefully this is helpful.

Best,
Geoff
2) Christian McDonald
Oh, do I have feelings about this one…
I keep a data diary for myself that has everything from notes about public information requests, notes about where I got data, descriptions of what I did, sql queries and all kinds of things. I sometimes also make a data report that is really RESULTS of what I learned, as opposed to how I got there in the data diary. The data report is more for other reporters, editors and maybe sources, but the diary is for me, so less formal.
These days I’m trying to script more of my work using Jupyter Notebooks, which then tends to be a mix of the two. It has info about where the data came from and the code that made the result. Sometimes it is written for future me, sometimes for the public. Generally, I’ll still keep a personal data diary just for my future self, ‘cause I can’t remember what I did yesterday much less last week.
Data diaries I tend to write in markdown files on my machine so code doesn’t get wigged with curly-quote translations. Data reports are typically Google Docs or Jupyter Notebooks on Github.

Module #1: Basic Data Analysis, Newsroom Math


Sections in this Module

  • Basics of Data Analysis
  • Numbers in the Newsroom
  • Excel Exercise: Transit Data and Calculating a Rate
  • Review: Mac OSX Basics

Basics of Data Analysis

Transparency
Reliability: How sure are we that we got the right answer? That we’ve done everything correctly?
Replicability: If we had to do it all again, would we get the same answer? If someone else did it, would they?
Transparency: If our results are challenged, can we show exactly what we’ve done to defend it?
–Matt Waite

Data Analysis

— Review methodology with one or more other data people
— Check results to other available comparable data
— Ensure all record counts are consistent across stages
— Check averages
— Examine outputs to ensure logical consistency (do things that should add up to 100% add up to 100%?)
— Recheck all coding line by line if possible or in aggregate if not
— Re-read all programs/scripts
— Re-run entire analysis from scratch
— Check each number against analysis or source material prior to publication
— Recheck each number against analysis or source material on each draft

Credit: Daniel Lathrop, Dallas Morning News

AP Stylebook Entry on Data Journalism

Data sources used in stories should be vetted for integrity and validity. When evaluating a data set, consider the following questions:
–What is the original source for the data? How reliable is it? Can we get answers to questions about it?
–Is this the most current version of the data set? How often is the data updated? How many years of data have been collected?
–Why was the data collected? Was it for purposes of advocacy? Might that affect the data’s reliability or completeness? Does the data make intuitive sense? Are there anomalies (outliers, blank values, different types of data in the same field) that would invalidate the analysis?
–What rules and regulations affect the gathering (and interpretation) of the data?
–Is there an alternative source for comparison? Does the data for a parallel industry, organization or region look similar? If not, what could explain the discrepancy?
–Is there a data dictionary or record layout document for the data set? This document would describe the fields, the types of data they contain and details such as the meaning of codes in the data and how missing data is indicated. If the data collectors used a data entry form, is the form available to review? For example, if the data entry was performed by inspectors, is it possible to see the form they used to collect the data and any directions they received about how to enter the data?
Data and the results of analysis must be represented accurately in stories and visualizations. Any limitations of the data must also be conveyed. If one point in the analysis is drawn from a subset of the data or a different data set altogether, explain why this was done.
Use statistics that include a meaningful base for comparison (per capita, per dollar). Data should reflect the appropriate population for the topic: for example, use voting-age population as a base for stories on demographic voting patterns. Avoid percentage and percent change comparisons from a small base. Rankings should include raw numbers to provide a sense of relative importance.
When comparing dollar amounts across time, be sure to adjust for inflation. When using averages (that is, adding together a group of numbers and dividing the sum by the quantity of numbers in the group), be wary of extreme, outlier values that may unfairly skew the result. It may be better to use the median (the middle number among all the numbers being considered) if there is a large difference between the average (mean) and the median.
Correlations should not be treated as a causal relationship. Where possible, control for outside factors that may be affecting both variables in the correlation. Use round numbers where possible, particularly to avoid a false appearance of precision. Be clear about limitations of sample size in reporting on data sets. See the polls and surveys section for more specific guidance on margin of error.
Try not to include too many numbers in a single sentence or paragraph.

A refresher on AP Stylebook on numbers

How to Lie With Statistics

https://www.datasciencecentral.com/profiles/blogs/how-to-lie-with-visualizations-statistics-causation-vs

 

Sheffo, Catherine. “How to Avoid 10 Common Mistakes in Data Reporting.” 
American Press Institute (blog), August 9, 2016. https://www.americanpressinstitute.org/publications/data-reporting-common-mistakes/

Writing Assignment

Write a minimum two paragraphs on the Basics of Data analysis readings. Discuss two items that impressed you the most and explain why.
Due XXXX, 11:59 pm on Blackboard. 

 Numbers in the Newsroom

Sarah Cohen, Math Diva

Sarah Cohen’s “Numbers in the Newsroom” is a classic in journalism numeracy. Cohen won a Pulitzer Prize as a journalist at The Washington Post, taught at Duke University, worked as a data journalist at The New York Times and is now a professor at Arizona State University. That’s why we read her book.

* Limit yourself to 8-12 digits, including dates such as 2012, in a single paragraph.
–This allows us to stress the most important numbers

–Simplify your story using rates, ratios or percentages. “One in four” = ratio or rate. “Forty percent” = ratio or rate. 235 deaths per 100,000 is another. See pg. 11

*Memorize some common numbers on your beat: Population of Fayetteville. Population of Arkansas. Population of the U.S. Per capita income Arkansas and U.S.

*Round off! Unless you’re dealing with really small numbers, decimal points may not be meaningful. “I’m a big fan of rounding,” Cohen said.

* To make a very small number more understandable, divide it into 1. For example, .0081 is the proportion of the U.S. population who die every year. 1/.0081 is about 123, so roughly 1 in every 123 Americans dies each year.

* If you have a story filled with numbers, and not people, it needs to be really, really short.

* Portion of whole – For example, at the time of the Million Man March in 1995, a turnout of 1 million black men would have represented 1/12th of all the black men in the country at the time.
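The "divide it into 1" trick and the per-100,000 rate are both one-line calculations. Here is a minimal sketch; the function names are made up, the .0081 proportion comes from the notes above, and the 47 deaths in a town of 20,000 are invented to show how a rate like 235 per 100,000 is built.

```python
def one_in_n(proportion):
    """Turn a proportion into the N of '1 in N', rounded to a whole number."""
    return round(1 / proportion)


def per_100k(events, population):
    """Express a count as a rate per 100,000 people."""
    return round(events / population * 100_000)


deaths_one_in = one_in_n(0.0081)   # proportion quoted in the notes above
quarter_one_in = one_in_n(0.25)    # 25 percent is "one in four"
town_rate = per_100k(47, 20_000)   # made-up town: 47 deaths, 20,000 people

print(f"1 in {deaths_one_in}")
print(f"1 in {quarter_one_in}")
print(f"{town_rate} per 100,000")
```

With the rounded proportion .0081, the division gives about 123.5, so the result shifts by one depending on how much the proportion was rounded first.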

Rates and Ratios

Numbers in the Newsroom: Rates and Ratios

Class exercise: Cohen: Think in ratios – construct a ratio on the poverty beat. Memorize common numbers on the beat:

Use the Census Poverty Data: US Ark Counties Poverty ACS_16_5YR_DP03_with_ann-1w6iwss

–In the spirit of “memorizing numbers on your beat,” find three statistics about poverty in this dataset

–Construct a rate or ratio about the number of households earning $15,000 to $24,999 for the U.S., Arkansas, and the counties with the highest and lowest percentages in this category. Remember: “Percents are fractions. Fractions are percents.”

Cohen Numbers in Newsroom - Common mistakes.pdf 
Write a paragraph with at least two questions or observations.

Excel Exercise: Transit Data and Calculating a Rate

Basic Excel:
http://www.interhacktives.com/2015/11/02/quick-tips-excel-google-sheets/

This exercise involves calculating train fatality rates.
Click here for the instructions:
Exercise4

Click here for the data:
transit

Notes:
–Create data dictionary, backup, do four corners test
–Be very careful about copying different blocks of data to a new sheet: mixups
–Copy labels over and then delete them just to be sure all is aligned
–Class walkthrough with 2008 – 2009 derailments
–Be very specific about the headers: Total Derailments 2009, Vehicle Revenue Miles
–Word Wrap for headers
–We are constructing two derail rates, one in 2009 and another in 2008.
–Results are 0? Wait, check the decimal tool
–Results to two decimals. Rarely more than that
–Copy acronym definitions to the data dictionary

Exercise #1:
–Calculate derailment rates for 2008-2013 and determine the average rate. Which agency had the highest average rate?

Exercise #2:
–Calculate the rate of fatalities (excluding suicides) by total miles (vehicle revenue miles)
–Copy all of the Total Heavy Rail Fatality Sum, excluding suicides and all of the Vehicle Revenue Miles (VRM)
–Create rates for each year, then average them

Which city has the highest rate of fatalities (excluding suicides) over the last six years
and where does Chicago rank?
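The rate math in these exercises is events divided by vehicle revenue miles, then averaged across years. The sketch below uses made-up counts for one hypothetical agency. It scales the rate to "per million miles," an assumption on my part, because the raw per-mile figure is so small that it displays as 0 at two decimals.

```python
from statistics import mean

# Made-up counts for one hypothetical agency; real values come from the transit data.
agency = {
    "events": {2008: 4, 2009: 6},
    "vehicle_revenue_miles": {2008: 50_000_000, 2009: 60_000_000},
}


def rate_per_million_miles(events, miles):
    """Events per 1 million vehicle revenue miles, rounded to two decimals."""
    return round(events / miles * 1_000_000, 2)


yearly_rates = {
    year: rate_per_million_miles(agency["events"][year],
                                 agency["vehicle_revenue_miles"][year])
    for year in agency["events"]
}
average_rate = round(mean(yearly_rates.values()), 2)

print(yearly_rates)
print("Average:", average_rate)
```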

Exercise #3:
Over the six years, did Chicago transit have more derailments than other major city transit systems? Is it getting better or worse?

Which year was the worst for all major transit in terms of fatalities (excluding suicides)?

How many suicides happened at CTA in 2013?

What questions should I ask the DOT data clerks regarding the data?
What other data might be useful to mine after this story runs?

Resources: Excel Formulas in NICAR Coursepack

transit

Relative Risk

“Black applicants are denied mortgages at twice the rate of whites with similar incomes.”

If 20 smokers per thousand contract cancer, and yet non-smokers have a cancer rate of only 10 per thousand, the relative risk of smoking is 2.

“More than” or “less than” phrasing takes an extra step: first compute the difference between the two rates (smokers vs. non-smokers).
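The smoking example can be worked through in a few lines of Python. This is only a sketch with hypothetical function names; it shows that "twice the rate" and "100 percent more than" describe the same pair of numbers.

```python
def relative_risk(exposed_rate, baseline_rate):
    """Ratio of the two rates: how many times the baseline."""
    return exposed_rate / baseline_rate


def percent_more(exposed_rate, baseline_rate):
    """The extra step: difference between the rates, as a percent of the baseline."""
    return (exposed_rate - baseline_rate) / baseline_rate * 100


smokers = 20      # cancer cases per thousand smokers (from the example above)
non_smokers = 10  # cancer cases per thousand non-smokers

print(relative_risk(smokers, non_smokers))  # 2.0, "twice the rate"
print(percent_more(smokers, non_smokers))   # 100.0, "100 percent more than"
```

A relative risk of 2 is "twice the rate," not "200 percent more." The two phrasings are easy to confuse, which is why the difference has to be computed separately.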

Example: Relative Risk; Figuring Rates – Numbers in Newsroom; MathCrib-Doig

Excel Exercises

Click here for: Basics and Sorting in Excel
Click here for: CityBudget.xls

Make Two Folders: Original. Working.
Duplicate Spreadsheet: Right Click | Duplicate
Data Dictionary: Who are you and where did you come from
Copy sheets, Rename Tabs
Copying Formulas: The Black + Sign
Sorting
Brainstorm: Story Ideas from Sorting Difference
Formatting Data in Dollars
Percentage Change: NOO! (New minus Old, divided by Old)
Part of Whole: Anchoring Values - i.e. $C$17
Basic Chart
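As a rough illustration of the "part of whole" step, the Python sketch below divides every line item by one fixed total, which is what anchoring the total cell (for example $C$17) does when a formula is copied down a column. The departments and dollar amounts are invented and do not come from CityBudget.xls.

```python
# Hypothetical budget lines, in thousands of dollars (not from CityBudget.xls)
budget = {"Police": 400, "Fire": 250, "Parks": 100, "Streets": 250}

# The total plays the role of the anchored cell: every row divides by this same value.
total = sum(budget.values())

shares = {dept: amount / total for dept, amount in budget.items()}

for dept, fraction in shares.items():
    print(f"{dept}: {fraction:.1%} of the budget")
```

Without the anchor, a copied spreadsheet formula would shift the denominator down a row each time, and the shares would no longer add up to 100 percent.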

Click here for the data: UrbanPop
Click here for assignment: Exercise #1 
Answer these questions:
Sorting
–Which urban agglomeration was the largest in 1950?
–Which is expected to be the largest in 2030?
Percentage Change
Formula: (New Number - Old Number) / Old Number * 100. In Excel, leave off the * 100 if you apply the % format, which multiplies by 100 for display.
Create column
What is the difference?
–copy formula
What is the percentage change?
–copy formula
Percentage Change
–Which had greatest rate of change between 1950-2015?
–Are any urban areas expected to lose population from 2010 to 2030?
–If so, how many and which one is expected to lose the most?
–Which United States urban area is expected to have the largest percent increase from 2015 to 2030?
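The percentage change formula, (New - Old) / Old * 100, can be sketched in Python to check spreadsheet results. The city names and populations are hypothetical and are not drawn from the UrbanPop file.

```python
def percent_change(old, new):
    """(New - Old) / Old * 100"""
    return (new - old) / old * 100


# Hypothetical populations in thousands: name -> (earlier year, later year)
areas = {
    "City A": (1_000, 1_500),
    "City B": (2_000, 1_800),
    "City C": (500, 1_250),
}

changes = {name: percent_change(old, new) for name, (old, new) in areas.items()}

fastest = max(changes, key=changes.get)                    # greatest rate of change
shrinking = [name for name, c in changes.items() if c < 0]  # areas losing population

for name, change in changes.items():
    print(f"{name}: {change:.1f}%")
print("Fastest growing:", fastest, "| Losing population:", shrinking)
```

City A adds the same number of people as City C loses ground to in raw terms, yet City C has the larger percentage change because it starts from a smaller base.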


Refresher on the macOS operating system
Here is a short video course that you can skim through to get up to speed on how to use the Apple operating system, macOS.
https://www.linkedin.com/learning/macos-mojave-essential-training/understand-macos-the-foundation-of-working-with-a-mac?u=50849081
I would hammer through the following as soon as possible.
Chs. 1, 3 are important
Ch. 2: Finder will be crucial.
Ch. 5 on downloading from the web is important
Chs. 4, 13 should be skimmed
Chs. 6-11 aren’t important for our class

To Add:

Quiz
Excel Quiz Due Sept. 7, 11:59 p.m.
https://learn.uark.edu/webapps/assessment/take/launchAssessment.jsp?course_id=_244039_1&content_id=_7327798_1&mode=cpview
Quiz: Basic Excel. See Blackboard. Quiz due Saturday, Sept. 7


Excel Exercises:
NICAR Coursepack
Intro Excel w Exercise #1


Exercise Filtering: Crime Rates and Ratios
--Find Average Crime Rate Statewide
--Filter above and below average
--Find Average Population
--Filter above and below average
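The filtering steps can be sketched in Python as follows. The counties and counts are invented, the rate is expressed per 1,000 residents as an assumption, and the "statewide average" here is the plain average of the county rates, which is what averaging a spreadsheet column produces.

```python
from statistics import mean

# Hypothetical rows: (county, population, reported crimes)
rows = [
    ("County A", 10_000, 250),
    ("County B", 50_000, 1_000),
    ("County C", 20_000, 800),
]
PER = 1_000  # assumed scale: crimes per 1,000 residents

rates = {name: crimes * PER / population for name, population, crimes in rows}
average = mean(rates.values())

# The two "filters": counties above the average and counties at or below it
above = sorted(name for name, rate in rates.items() if rate > average)
below = sorted(name for name, rate in rates.items() if rate <= average)

print(f"Average rate: {average:.2f} per {PER:,}")
print("Above average:", above)
print("At or below average:", below)
```

County B reports the most crimes but has the lowest rate, which is the reason to filter on rates rather than raw counts.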
NICAR coursepack: Pivot Tables
In class exercise: MLB Salaries
QUESTIONS
1) Did the National League or the American League pay more in salaries? Which league has the higher average salary?
2) Which division pays the most in salaries? The least?
3) Which team had the most players on the roster?
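A pivot table groups rows by one column and summarizes another. The Python sketch below does the same thing by hand, assuming a made-up roster; the players, teams, and salaries are placeholders and the `pivot` helper is hypothetical, not part of any library.

```python
from collections import defaultdict

# Hypothetical rows: (player, team, league, salary)
players = [
    ("P1", "Team X", "AL", 5_000_000),
    ("P2", "Team X", "AL", 1_000_000),
    ("P3", "Team Y", "NL", 4_000_000),
    ("P4", "Team Z", "NL", 2_000_000),
    ("P5", "Team Z", "NL", 3_000_000),
]


def pivot(rows, key_index, value_index):
    """Group rows by one column; return {key: (count, total, average)}."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[value_index])
    return {key: (len(vals), sum(vals), sum(vals) / len(vals))
            for key, vals in groups.items()}


by_league = pivot(players, key_index=2, value_index=3)  # league in rows, salary in values
by_team = pivot(players, key_index=1, value_index=3)    # count doubles as roster size

for league, (count, total, average) in by_league.items():
    print(f"{league}: {count} players, ${total:,} total, ${average:,.0f} average")
```

In this invented data the NL pays more in total while both leagues have the same average salary, so "pays more" and "higher average" are separate questions.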
In-class exercise WorldBank
Using the WorldBank data, build a Pivot Table.
--Trick: Shift+Ctrl+8 selects the current data region
--Produce a list ranking the countries with the most companies debarred, sorted descending. Copy the results and paste into a new tab.
--Produce a list of the firms that have more than one debarment, sorted descending. Copy the results and paste into a new tab.
What is the most common violation, and how many times did it occur?
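Each of these questions is a count of how often a value appears in one column. A short Python sketch, assuming invented records and field order (the real WorldBank file will have its own column names):

```python
from collections import Counter

# Hypothetical rows: (firm, country, grounds for the sanction)
records = [
    ("Firm A", "Country 1", "Fraud"),
    ("Firm A", "Country 1", "Corruption"),
    ("Firm B", "Country 2", "Fraud"),
    ("Firm C", "Country 1", "Fraud"),
]

# Countries ranked by number of records, sorted descending
by_country = Counter(country for _, country, _ in records).most_common()

# Firms that appear more than once, sorted descending
firm_counts = Counter(firm for firm, _, _ in records)
repeat_firms = [(firm, n) for firm, n in firm_counts.most_common() if n > 1]

# The most common violation and how many times it occurred
top_violation = Counter(grounds for _, _, grounds in records).most_common(1)[0]

print(by_country)
print(repeat_firms)
print(top_violation)
```

`most_common()` returns the counts already sorted from largest to smallest, which matches the "sorted descending" step in the pivot table.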


Pivot Table
Class Exercise: Student Loans - Pivot Table
Analyze Student Loan Data
Sort the Data:
-Sort by Schools with Largest Enrollment. Write a text answer with the top five schools by enrollment.
-Sort by Schools with Highest Median Debt for Graduates. Write a text answer with the top five schools by Median Debt for Graduates.
-Sort by Schools with Highest Median Debt for Students Who Withdrew. Write a text answer with the top five schools by Highest Median Debt for Students Who Withdrew.
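The three sorts follow one pattern: sort descending on a column, then keep the first five rows. Here is a minimal Python sketch with invented schools and figures; the column order and the `top_five` helper are assumptions for illustration.

```python
# Hypothetical rows: (school, enrollment, median debt for graduates,
#                     median debt for students who withdrew)
schools = [
    ("School A", 12_000, 21_000, 9_000),
    ("School B", 3_000, 27_000, 12_000),
    ("School C", 25_000, 19_000, 8_000),
    ("School D", 800, 31_000, 15_000),
    ("School E", 5_000, 24_000, 7_000),
    ("School F", 1_500, 18_000, 11_000),
]


def top_five(rows, column):
    """Sort descending on one column and return the first five school names."""
    ranked = sorted(rows, key=lambda row: row[column], reverse=True)
    return [row[0] for row in ranked[:5]]


print("Largest enrollment:", top_five(schools, 1))
print("Highest median debt, graduates:", top_five(schools, 2))
print("Highest median debt, withdrew:", top_five(schools, 3))
```

Real data often has blank or suppressed values in the debt columns; those rows would need to be set aside before sorting, which this sketch does not handle.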

Megan Putney, Mike's Hard Lemonade
https://training.uark.edu/professional-development/courses/tableau.php