by Larry Felton Johnson, Editor and Publisher, Cobb County Courier
This is a rant about a pretty technical topic, but I think it’s an important one if we think government transparency and the citizens right of easy access to information is a virtue.
In addition to being the Editor and Publisher at the Courier, I spend a lot of my time reading and analyzing government documents. Under any reasonable definition of transparency departments and agencies of all levels of government should want to make access to information as easy as possible, not just for journalists, but for all residents.
This is the entire basis of Georgia’s Open Records Act, which although it could use some strengthening, is a good and effective law.
Which brings me to PDF. PDF, or Portable Document Format was developed by Adobe in 1993.
Here’s what the Adobe web page states about PDF:
PDF is an abbreviation that stands for Portable Document Format. It’s a versatile file format created by Adobe that gives people an easy, reliable way to present and exchange documents – regardless of the software, hardware, or operating systems being used by anyone who views the document.
If all you’re interested in is the appearance of documents, and the ability to create documents that look the same no matter what device you are using, PDF is a wonderful format.
But for the dissemination of useful information so that both journalists and residents can become the watchdogs in the public interest, it’s a nightmare.
In this column I’m going to focus on the worst aspect of PDF: its lack of a consistent way of creating numeric tables. These include everything from budgets, to tables of public employee salaries, to crime statistics.
Bear with me as I describe the fatal flaw in PDF with respect to tables.
I work with census data in a programming language called R, which was developed to work with numbers. I recently decided to look for a way of extracting tables from PDF documents with R, and was told something I’d already figured out:
In pdftools, an R library for working with PDF, the documentation has this comment on PDF tables:
Data scientists are often interested in data from tables. Unfortunately the pdf format is pretty dumb and does not have notion of a table (unlike for example HTML). Tabular data in a pdf file is nothing more than strategically positioned lines and text, which makes it difficult to extract the raw data with
pdftools
.
[An earlier version of this article misformatted the above quote to include the paragraph below … apologies for the error]
Since the Adobe documentation claims that tables are a structured data type, I decided to see if Adobe’s own PDF tool, Adobe Acrobat, would handle tables any better than various alternatives I’ve tried, since there is an option within Acrobat of exporting documents to Excel format.
So I downloaded the PDF file that’s been the most ongoing source of frustration to me, Cobb County’s new business listings. To take a look at the original PDF on the county website, follow this link.
Here’s a screenshot of how Adobe’s rendering of the table looked when opened in a spreadsheet:
This is manageable with about a half hour of work on my part in terms of simple display on the Courier. All I’d have to do is insert lines between the fields to create enough spacing to make it clear where one entry ended and another began (in this case a little over 40 inserted lines).
Here’s what it looks like if I don’t insert the lines.
ATLANTA, GA 30339 | |||
OCC038184 | COBB MKT LLC | MAHENDI KOTADIYA | 12/15/2023 |
COBB MKT LLC | 1890 S COBB DR | GROCERY STORE | |
MARIETTA, GA 30068 | |||
OCC038088 | DESIGN TEAM COMPANY | CHRIS PIERCE | 12/13/2023 |
LLC | 255 OLD MORRIS CHAPEL RD | SIGN ERECTION | |
ADAMSVILLE, TN 38310 | |||
OCC038159 | DIANA MARSH LEGACY | DIANA MARSH | 12/12/2023 |
REAL ESTATE LLC | 2692 BEECHER DR | REAL ESTATE BROKERS | |
DIANA MARSH LEGACY | AUSTELL, GA 30106 | ||
REAL ESTATE LLC | |||
OCC038182 | DIGITAL BUSINESS | RENEE DUCRE | 12/15/2023 |
CONSULTING GROUP LLC | 1510 WATERFORD CT | CONSULTANT – EDUCATION | |
DIGITAL BUSINESS | MARIETTA, GA 30068 | ||
CONSULTING GROUP LLC | |||
License # | D.B.A / Business Name | Owner / Business Address | Issue Date / SIC Description |
OCC038169 | FINDLEY ROOFING | ADAM BUCZEK | 12/13/2023 |
FINDLEY VERTEX ROOFING | 4181 JVL INDUSTRIAL PARK | ROOFING CONTRACTOR |
It’s more-or-less readable if you’re careful, but it’s headache-inducing without spaces.
And if you’re visually impaired and use a screenreader, the audio output would be an indecipherable mess as it reads each line straight across. Think about the first line of the Findley roofing entry. It would read “OCC038169 Findley Roofing Adam Buczek twelve slash thirteen slash 2023.”
But what if I wanted to extract the names and addresses of business owners? I could spend an afternoon writing a script that recognizes how many columns over the addresses are, and that does matches on the addresses the way they’re usually formatted, but if this table were a real table, in a well-designed spreadsheet, the name and address would have its own cell (or maybe several depending on how you wanted to structure it), rather than in adjacent rows with an irregular number of lines.
Those of you who use spreadsheets a lot would also realize that it would be possible to view and manipulate this spreadsheet in dozens of different ways using pivot tables if it began as a well-structured spreadsheet, not as a bad export of a PDF.
This isn’t just a minor flaw if viewed with the lens of government transparency. It should be a disqualifying characteristic for any format used for distributing public information.
This isn’t 1993, when Adobe released PDF. We’re about to enter 2024 and governments at all levels are using a format that sabotages the ability to go to an agency or departmental website, retrieve a document, load the relevant tables into a spreadsheet program, sort the data, view the data filtered in various ways, and build graphs and charts from them.
PDF is an antiquated, hopelessly flawed format that is the enemy of government transparency. Agencies and departments should have tabular data available online in either Excel format or CSV (comma separated values).
Maybe next time I’ll write about the other transparency crimes committed with PDF (PDFs of jpgs or pngs of text, where you can’t cut and paste or read it with a screen reader).