I thought the first point in the American Press Institute article was very important: you should not give too much credit to the data you have, nor should you rely on it too heavily. I’m definitely guilty of assuming that statistics I read in academia or journalism are perfect, and of taking them at face value. This tip is a good reminder to take data with a grain of salt and to consider where it came from and how it was collected.
Tip #3 in the same article was also immensely helpful. I’d heard of cleaning data before but never knew what that actually meant or how it worked. I had no idea that you can simply run your file through something like OpenRefine, which reconciles cells that are written differently but mean the same thing. Last semester I took a project-based class that involved analyzing massive spreadsheets of donations to Super PACs. Something like OpenRefine would have been great, because we were essentially told to fix discrepancies by hand or just disregard them.
There is no inherent factual purity in numbers. We can’t take anything at face value. My statistics professor in graduate school once put it this way: statistics help support your rhetoric. That was a bold statement, but he made his point: numbers, like everything else, are subject to spin.
Data cleaning is a real chore, and having tools like Google Refine (now called OpenRefine) makes it easier. It’s also something you have to approach very carefully, since you can introduce errors in the “cleaning” process. We will be using Google Refine in the coming weeks.
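To make "cells written differently for the same meaning" a bit more concrete, here is a rough Python sketch of one way such discrepancies can be grouped: reduce each cell to a normalized key and see which spellings collide. It is loosely modeled on the key-collision style of clustering that OpenRefine offers, but it is a simplified illustration and not the tool's actual implementation. The function names and the donor names are made up for the example.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Reduce a cell to a rough key (hypothetical helper, not OpenRefine's code)."""
    # Fold accented characters to plain ASCII and lowercase everything.
    text = unicodedata.normalize("NFKD", value)
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    # Drop punctuation so "Corp." and "Corp" look the same.
    text = re.sub(r"[^\w\s]", "", text)
    # Sort and de-duplicate the words so word order does not matter.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group spellings that share a key; keep only groups with variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    # A key with a single spelling needs no attention.
    return {key: sorted(found) for key, found in groups.items() if len(found) > 1}


# Made-up donor names, for illustration only.
donors = [
    "Acme Corp.", "ACME CORP", "acme corp", "Corp Acme",
    "Smith, John", "John Smith", "Jane Doe",
]

for key, spellings in cluster(donors).items():
    print(key, "->", spellings)
```

Note that the sketch only suggests groups; it does not merge anything. That is deliberate. "Corp Acme" lands in the same group as "Acme Corp." because word order is ignored, and whether those are really the same donor is a judgment a person has to make. This is exactly where errors can slip in during cleaning.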