Module #2: Organize Your Data

This module addresses:
–Best practices in data management
–Organizational tips for files
–Data documentation skills

Staying organized is a key problem for beginning data students.
You can’t find files.
You have duplicate files and struggle to find the latest version.
Your data software fails because it can’t find your files.
You can’t remember where you got the source data or what the headers mean.
You waste hours with this stuff when you really should be reporting.

I want to put an end to this nightmare. The organizational tools below are essential.

Storage

Create a folder on your hard drive, call it Dataspring21, and put all class materials in that folder. 
--Within Dataspring21, create the following folders.
     1) Final_Work
     2) Older_Files
Other people use these standard folders in their projects. It is up to you whether to add them:
--Data
--Scripts
--Storage
--Output
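
If you would rather script the folder setup than click through Finder, here is a minimal Python sketch. The folder names come from the lists above; the function name create_class_folders and the choice of where to put Dataspring21 are my assumptions, not a class requirement.

```python
from pathlib import Path

# Folder names taken from the lists above.
CLASS_FOLDERS = ["Final_Work", "Older_Files"]
OPTIONAL_FOLDERS = ["Data", "Scripts", "Storage", "Output"]


def create_class_folders(base_dir, include_optional=False):
    """Create Dataspring21 and its subfolders under base_dir; return the root path."""
    root = Path(base_dir) / "Dataspring21"
    names = CLASS_FOLDERS + (OPTIONAL_FOLDERS if include_optional else [])
    for name in names:
        # exist_ok=True means running this twice will not raise an error
        # or touch folders that already exist.
        (root / name).mkdir(parents=True, exist_ok=True)
    return root
```

For example, create_class_folders(Path.home() / "Documents", include_optional=True) would build the full set inside your Documents folder.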

BACK IT UP. Early, often, always, constantly.

Purchase a 2 TB USB external storage drive.
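
One way to make backing up a habit is to script it. This is a rough sketch, assuming your external drive is mounted at a path you pass in yourself; the function name back_up_folder and the dated folder naming are hypothetical choices, not a standard.

```python
import shutil
from datetime import date
from pathlib import Path


def back_up_folder(source, backup_root, today=None):
    """Copy source into backup_root as <name>_<YYYY-MM-DD>; return the copy's path."""
    source = Path(source)
    today = today or date.today()
    target = Path(backup_root) / f"{source.name}_{today.isoformat()}"
    # dirs_exist_ok=True (Python 3.8+) lets a second backup on the same day
    # overwrite files in the first copy instead of raising an error.
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target
```

Each run on a new day produces a separate dated copy, so an older version is still there if you need to go back.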

Organize Your Data: Finder

Finder is not always up to the job.

1. Sort by grid, by date.
–This allows you to see the latest version of your files.

2. File name.
–Follow this convention: Description of File With Some Detail, Date. If you are editing something, put your initials at the end.
–e.g.: Covid_Master_File_Jan_11_2021-rsw

3. Copying File Paths from the Mac Finder.
Navigate to the file or folder you wish to copy.
Right-click (or Control+Click, or a Two-Finger click on trackpads) on the file or folder in the Mac Finder.
While in the right-click menu, hold down the OPTION key to reveal the “Copy (item name) as Pathname” option, which replaces the standard Copy option.
Once selected, the file or folder’s path is on the clipboard, ready to be pasted anywhere.
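
The three Finder habits above (newest files first, a consistent file name, a full path you can paste) can also be sketched in Python. This is only an illustration: the function names are hypothetical, and the naming function assumes the Description_Date-initials pattern from the example file name.

```python
from datetime import date
from pathlib import Path

# Spelled out so the month abbreviation does not depend on system language settings.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def build_file_name(description, file_date, initials=""):
    """Join the description words and the date with underscores; add initials if given."""
    stamp = f"{MONTHS[file_date.month - 1]}_{file_date.day}_{file_date.year}"
    name = "_".join(description.split() + [stamp])
    return f"{name}-{initials}" if initials else name


def newest_first(folder):
    """Return the files in folder, most recently modified first."""
    files = [p for p in Path(folder).iterdir() if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)


def full_path(path):
    """Return the absolute path as text, ready to paste into a script."""
    return str(Path(path).resolve())


print(build_file_name("Covid Master File", date(2021, 1, 11), "rsw"))
# prints Covid_Master_File_Jan_11_2021-rsw
```

Pasting the result of full_path into your R or Python script is one way to avoid the "software can’t find my file" problem.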

Data Diary

This is a brief description of the source of your data, the actions you took, the commands you ran, and the thinking behind what you are doing.
Data Biography Template
Interviewing Data
Sample Data Diary Entry
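
If you keep your diary as a plain text or markdown file, a small helper can add a dated entry each time you work. This is a sketch under my own assumptions: the function name add_diary_entry and the entry layout (source, actions, commands, thinking) are hypothetical and simply mirror the description above.

```python
from datetime import datetime
from pathlib import Path


def add_diary_entry(diary_path, source, actions, commands=(), thinking="", when=None):
    """Append one dated entry to the diary file and return the text that was added."""
    when = when or datetime.now()
    lines = [f"## {when:%Y-%m-%d %H:%M}", f"Source: {source}", "Actions:"]
    lines += [f"- {action}" for action in actions]
    if commands:
        lines.append("Commands:")
        # Indent commands so they stay readable as code in a markdown file.
        lines += [f"    {command}" for command in commands]
    if thinking:
        lines.append(f"Thinking: {thinking}")
    entry = "\n".join(lines) + "\n\n"
    # Mode "a" appends, so earlier entries are never overwritten.
    with Path(diary_path).open("a", encoding="utf-8") as diary:
        diary.write(entry)
    return entry
```

Because the file is opened in append mode, the diary grows as a running log rather than being replaced each session.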

Data Dictionary

For all of your spreadsheets, create a separate tab and include the following:
--Full name of the data set
--Full URL
--Date of data
--Any code book to define the column or row headings
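
The four items above can also be written out by a script. The sketch below saves them as a small CSV file that you could import as a separate tab; the function name write_data_dictionary, the row labels and the CSV layout are my assumptions, and any name or URL you pass in is your own.

```python
import csv
from pathlib import Path


def write_data_dictionary(path, name, url, data_date, codebook):
    """Write the data set details and column definitions to a CSV; return the rows."""
    rows = [
        ("Full name of data set", name),
        ("Full URL", url),
        ("Date of data", data_date),
        ("", ""),  # blank row between the details and the code book
        ("Column", "Definition"),
    ]
    # codebook is a dict that maps each column heading to its meaning.
    rows += list(codebook.items())
    with Path(path).open("w", newline="", encoding="utf-8") as out:
        csv.writer(out).writerows(rows)
    return rows
```

Keeping the code book as a dictionary in the script means the definitions travel with the code that uses the columns.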

For your Tableau and R scripts, note the same in a Data Dictionary file that you can access
easily in cloud storage. Suggestions: Evernote (free version, storage limits); Notepad in iOS;
Note in Gmail; Google Doc; Notes in iOS; Stickies in macOS; etc.

Data Diary Examples
The following material was posted on NICAR-L, a listserv for data journalists. There are some great examples of how the pros use data diaries / data dictionaries in their workflow.

1) Geoff
This is a great question, and I’m finding as I think through my response that it’s helpful to remind myself of good practices.

I use Jupyter notebooks for when I’m doing analysis or exploration in Python or SQL and R Markdown for when I’m doing it in R. However, I would stress that any data diary you keep, and keep in a detailed way that is useful to you and others later, is better than one you don’t keep, regardless of format.

https://github.com/newsapps/public-notebooks/blob/master/Shooting%20victims%20by%20block.ipynb is an example of a representative but not great notebook for a small data task.

A few things that I try (but don’t always succeed) to do:

– Link to the source data, summary reports and codebooks near the top of my notebook. This is both a convenience to me, because I refer to these often, and especially to others who may not have seen those things before.
– Put a high level summary of why I’m interested in the data and what I’m trying to find at the top of the notebook. This keeps me focused as I’m doing my exploration and also is helpful for others who might be skimming.
– Keep a parking lot of questions (or potential concerns about validity or cleanliness of data) near the top of the notebook. That way I can quickly capture things I think about as I’m exploring or analyzing the data, while still staying focused.
– Near the end of my day (or the first thing the next morning), do a quick pass over a notebook I worked on during the day. Do my notes still make sense? Are they as clear as they could be? If not, try to clean them up. If I don’t have time at the moment, I at least leave a “TODO” note to flag the section as needing some love.
– Share the notebook with someone else as early as possible, even if I’m still in progress. This is the most helpful way to know if I’m capturing my process with enough granularity. Or maybe I’m getting too granular. If so, is there a way to summarize the process and findings at the top of a section?
– If using code, don’t give a play-by-play of the code in text. Instead, describe what I’m trying to find out, why it’s important and why I’m taking a particular approach. Also note any assumptions my code is making.

Hopefully this is helpful.

Best,
Geoff

2) Christian McDonald
Oh, do I have feelings about this one…

I keep a data diary for myself that has everything from notes about public information requests, notes about where I got data, descriptions of what I did, SQL queries and all kinds of things. I sometimes also make a data report that is really the RESULTS of what I learned, as opposed to how I got there in the data diary. The data report is more for other reporters, editors and maybe sources, but the diary is for me, so less formal.

These days I’m trying to script more of my work using Jupyter Notebooks, which then tends to be a mix of the two. It has info about where the data came from and the code that made the result. Sometimes it is written for future me, sometimes for the public. Generally, I’ll still keep a personal data diary just for my future self, ’cause I can’t remember what I did yesterday much less last week.

Data diaries I tend to write in markdown files on my machine so code doesn’t get wigged with curly-quote translations. Data reports are typically Google Docs or Jupyter Notebooks on GitHub.
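
A note on the curly-quote problem mentioned above: if code pasted from a word processor or Google Doc stops working, the quotation marks may have been converted to curly ones. Here is a minimal sketch for turning them back into straight quotes; the function name straighten_quotes is hypothetical, and it handles only the four most common curly quote characters.

```python
# Map the four common curly quote characters to their straight equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote, also used as an apostrophe
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
})


def straighten_quotes(text):
    """Return text with curly quotes replaced by straight quotes."""
    return text.translate(QUOTE_MAP)
```

Run it on code snippets only, not on prose, where the curly quotes are usually intentional.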