
Redact interview transcripts before you share them — destructively, on your own machine

Qualitative researchers face a different redaction problem than lawyers do. Direct identifiers (names, addresses, phone numbers) are the easy half. The harder half is the quasi-identifier: a small town plus an occupation plus a year that, in your study population, points at one person. FileHop handles the file-layer step. You decide what to mark — or your IRB-approved data-management protocol does. Mac and Windows desktop. No upload.

Why qualitative data is harder to de-identify than most other data

Most de-identification advice is written for structured data — a row in a table with a 'name' column, an 'SSN' column, an 'address' column. The recipe is mechanical: drop the columns, generalise the date, truncate the ZIP. Qualitative data is structured differently. The identifying information is woven into the prose — a participant mentions her supervisor in passing, references the strike year, names the small city she moved to in graduate school. Removing 'her name' from the transcript header does not help if the surrounding two sentences are unique to her.
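For contrast, the mechanical recipe for structured data fits in a few lines. The sketch below is illustrative only: the column names (name, ssn, address, birth_date, zip) and the choice to keep the birth year and the first three ZIP digits are assumptions made for this example, not a statement of any regulatory standard.

```python
from datetime import date

# Hypothetical column names; a real dataset will have its own schema.
DIRECT_IDENTIFIER_COLUMNS = {"name", "ssn", "address"}


def deidentify_row(row: dict) -> dict:
    """Apply the drop / generalise / truncate recipe to one table row."""
    # Drop the direct-identifier columns outright.
    out = {k: v for k, v in row.items() if k not in DIRECT_IDENTIFIER_COLUMNS}
    # Generalise the date: keep only the year.
    if "birth_date" in out:
        out["birth_year"] = out.pop("birth_date").year
    # Truncate the ZIP: keep only the first three digits.
    if "zip" in out:
        out["zip3"] = out.pop("zip")[:3]
    return out


row = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "address": "1 Example St",
    "birth_date": date(1984, 6, 2),
    "zip": "59801",
    "answer": "I moved here for graduate school.",
}
print(deidentify_row(row))
```

Nothing in this recipe looks inside the free-text answer column, which is exactly where a transcript keeps its identifying detail.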

Researchers call this the deductive-disclosure problem (Kaiser, 2009, Qualitative Health Research). The risk is not that someone reads a name and recognises it. The risk is that someone who already has context — a colleague at the same institution, a member of the same small community, a journalist with public-record access — pieces together quasi-identifiers from the surrounding text and arrives at a named person. The UK Data Service uses a canonical example: a participant who says she is 'living in the city of Preston in Lancashire' can be generalised to 'living in a countryside location in the North West of England' without losing analytical purchase. The original phrasing, in some study contexts, would not survive deposit.

Worked illustration

One sentence, one person

Consider a transcript line: 'I was the only Latina dean at any community college in the Mountain West region when I was hired in 2021.' Remove the name from the transcript header. The line still points at one person, and that person can be identified with a few minutes of searching public sources such as LinkedIn or the Chronicle of Higher Education. The mistake is not the redaction; the mistake is treating redaction as a search-and-replace pass on direct identifiers without engaging the quasi-identifier problem.

Quasi-identifiers in interview transcripts typically include: an occupation in a small institutional context (the only X at Y), a year combined with a regional event (the 2019 strike, the post-Katrina cohort), a distinctive verbal tic that travels (catchphrases, regional dialect markers a colleague would recognise), a third-party mention (the participant's supervisor's name, a co-worker's surname, a sibling's profession), an organisational affiliation (a specific community college, a named clinic, a particular research consortium), and a temporal sequence specific to the participant (entered the field in 1987, moved to administration in 2003, retired in 2024). Building the redaction list takes the researcher's expertise: there is no regex that catches 'small town plus occupation plus year'.
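The worked illustration can be reproduced in a few lines. This is a minimal sketch of a name-only search-and-replace pass; the participant name "Maria Example" and the function name redact_names are invented for the example.

```python
import re


def redact_names(text: str, names: list[str]) -> str:
    """Replace each listed name with a placeholder (case-insensitive, literal)."""
    if not names:
        return text
    pattern = re.compile("|".join(re.escape(n) for n in names), re.IGNORECASE)
    return pattern.sub("[REDACTED]", text)


transcript = (
    "Interviewee: Maria Example\n"
    "I was the only Latina dean at any community college in the "
    "Mountain West region when I was hired in 2021."
)

cleaned = redact_names(transcript, ["Maria Example"])
print(cleaned)
```

The header name is gone, but the sentence that combines occupation, region, and year is untouched, because no pattern in the name list describes it.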

What 'destructive' redaction means in a transcript PDF (and why find-and-replace in Word is not enough)

A PDF page is not a picture. It is a small program that tells a PDF reader how to draw the page: set font to Times, move to position X/Y, show this string of glyphs, draw this clip path, render this image. The text underneath every black bar you see on screen lives in that program as a Tj or TJ text-showing operator carrying the actual character codes. If you only draw a black rectangle on top, the original characters are still in the file. Anyone who selects the redacted region with the cursor and copies it into a text editor gets the original text back.
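To see why an overlay is not a redaction, consider a hand-written and heavily simplified content stream. The sketch below assumes uncompressed content and plain literal strings; real PDFs usually compress their streams and may use hex strings or font-specific encodings, so this tiny parser (literal_strings, a name made up for this example) is an illustration, not an extraction tool.

```python
import re

# A simplified page program: two lines of text, then a filled black
# rectangle drawn over the second line. Nothing here deletes the text.
content_stream = b"""
BT
/F1 12 Tf
72 700 Td
(Interviewer: and your supervisor was?) Tj
0 -16 Td
(Participant: Dr. Patel, at the clinic.) Tj
ET
0 0 0 rg
72 680 260 14 re
f
"""


def literal_strings(stream: bytes) -> list[str]:
    """Return the text shown by simple `(...) Tj` operators."""
    matches = re.findall(rb"\(([^()]*)\)\s*Tj", stream)
    return [m.decode("latin-1") for m in matches]


for line in literal_strings(content_stream):
    print(line)
```

The black rectangle (the re and f operators) only adds drawing instructions after the text; the string carrying the supervisor's name is still in the stream and is recovered by the loop.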

Find-and-replace in Word, then 'Save As PDF', is the most common qualitative-research version of this failure. If the document has change-tracking on, the replaced text is preserved in the revision history. If the document has comments, the original phrase may be quoted in a comment thread. If the export 'flattens' annotations, some flattens preserve the text layer and others rasterize the page — and you cannot tell which from looking at the output. The visible page looks right; the file does not.

Destructive redaction rewrites the program. The text-showing operator is found, the characters inside the redaction rectangle are deleted from the operator, and the operator is rewritten with the surviving (non-redacted) characters. If the entire run is inside a redaction, the operator is removed. If half the run is inside, the operator is split. FileHop's redactor goes one step further: after the destructive pass, it re-opens the output file and walks it from scratch, looking for any glyph that survived under any redaction rectangle. If anything survives, the function fails closed: it returns an error AND deletes the output file before you can save it. (Source: services/pdf/redactor.rs, function verify_redaction.) The on-disk source transcript is never modified; it stays where it was while a new file is written.
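A toy model makes the remove-or-split behaviour concrete. The sketch below is not FileHop's code: it assumes a one-dimensional layout in which every glyph is 6 units wide, and the names Glyph, redact_run, and verify are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    char: str
    x0: float  # left edge
    x1: float  # right edge


def overlaps(glyph: Glyph, lo: float, hi: float) -> bool:
    """True if the glyph intersects the open interval (lo, hi)."""
    return glyph.x0 < hi and glyph.x1 > lo


def redact_run(run: list[Glyph], lo: float, hi: float) -> list[list[Glyph]]:
    """Drop glyphs under the redaction; return the surviving segments.

    A run fully inside the redaction yields no segments (operator removed);
    a run partly inside yields one or two segments (operator split).
    """
    segments, current = [], []
    for glyph in run:
        if overlaps(glyph, lo, hi):
            if current:
                segments.append(current)
                current = []
        else:
            current.append(glyph)
    if current:
        segments.append(current)
    return segments


def verify(segments: list[list[Glyph]], lo: float, hi: float) -> bool:
    """Re-walk the output: pass only if no glyph survives under the redaction."""
    return not any(overlaps(g, lo, hi) for seg in segments for g in seg)


def layout(text: str, width: float = 6.0) -> list[Glyph]:
    return [Glyph(c, i * width, (i + 1) * width) for i, c in enumerate(text)]


text = "with Dr. Patel today"
start = text.index("Dr. Patel")
lo, hi = start * 6.0, (start + len("Dr. Patel")) * 6.0

segments = redact_run(layout(text), lo, hi)
print(["".join(g.char for g in seg) for seg in segments], verify(segments, lo, hi))
```

The verification step is deliberately independent of the removal step: it looks only at the output and the rectangle, which is what lets a failure be detected rather than assumed away.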

There is one additional protection that matters specifically for transcripts. Most redactions in a transcript are short tokens — a first name, a surname, an employer name, a hometown. Even some redactors that destroy the underlying glyphs leak the redacted token's WIDTH as a side channel (the PDF text operators carry per-glyph advance widths; if a redactor replaces the redacted run with a single advance of the original width, the residual width can be cross-referenced against font metrics to reconstruct the characters). This is documented in Bland 2023 (PETS) and is especially informative for short name-tokens — the worst case for the glyph-width side channel. FileHop addresses this by collapsing each redacted run to a single advance AND quantizing it to a 500-unit grid (the REDACTION_TJ_QUANTUM constant in the source). Most consumer redactors do not address this attack.

“Destructive PDF redaction: permanently removes text glyphs, image pixels, contained vector paths, and inline images inside redaction regions, then re-walks the output to confirm nothing redactable survives. Fails closed when content cannot be redacted faithfully.”
Source comment — services/pdf/redactor.rs

The glyph-position side channel — and FileHop's structural answer

When a redacted text run is rewritten, its total advance width is collapsed to one number AND quantized to a 500-unit grid (the REDACTION_TJ_QUANTUM constant in the source). The comment in the code reads: "Collapsing a run to one number already defeats per-glyph recovery; quantizing the total coarsens the residual length side channel so the exact width of the removed text cannot be measured from the stream." This addresses the Bland 2023 attack directly — and it matters most for transcripts, because the average redacted token is a short proper noun (first name, surname, place name, employer) where width-based reconstruction is most informative.

In plain English: most consumer redactors that DO remove glyphs still leave behind a number that says, in effect, "insert this much horizontal space here." That number is the width of the redacted text in font units. With the font and the width, an attacker can often guess the original word. FileHop collapses the whole redacted run to a single number and rounds that number to a coarse grid, so the residual width is no longer a faithful measurement of what used to be there.
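The collapse-and-quantize idea can be sketched numerically. REDACTION_TJ_QUANTUM and the 500-unit grid come from the description above; everything else is an assumption for this example, including the per-glyph widths and the choice to round up (the text does not say which way the real code rounds).

```python
REDACTION_TJ_QUANTUM = 500  # grid size quoted above


def collapse_and_quantize(advances: list[int], quantum: int = REDACTION_TJ_QUANTUM) -> int:
    """Collapse per-glyph advances to one total, rounded up to the grid.

    Rounding up is an assumption made for this sketch.
    """
    total = sum(advances)
    return -(-total // quantum) * quantum  # integer ceiling to a multiple of quantum


# Hypothetical advance widths for two different short tokens.
token_a = [722, 556, 556]        # total 1834
token_b = [611, 500, 500, 278]   # total 1889

print(collapse_and_quantize(token_a), collapse_and_quantize(token_b))
```

Both tokens leave the same residual number, so width alone no longer distinguishes them. The grid still bounds the length coarsely, which is why the text describes the side channel as coarsened rather than eliminated.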

The FileHop transcript redaction workflow (4 steps, on your own machine)

All four steps run inside the FileHop desktop app on your computer. The transcript file does not transit our servers at any point during this workflow. Mac and Windows.

    Step 1: Open the transcript (PDF, or DOCX converted locally to PDF)

    Drag the transcript PDF into FileHop or use File → Open. If your transcript is a .docx (typical export from Word, Otter.ai, Rev, Trint, or a manually-typed transcription), convert it to PDF inside FileHop first (File → Convert → DOCX to PDF; the conversion runs locally on your machine, no upload). The original .docx and the converted PDF both stay on your computer; FileHop writes a new redacted PDF at save time and never overwrites either source. If your transcript is an export from a qualitative-coding tool (NVivo, ATLAS.ti, MAXQDA, Dedoose, Quirkos, QualCoder), export it as a PDF or DOCX from the QDAS tool first, then bring it in. FileHop does not reach into QDAS project files directly — it works on exports.

    Step 2: Mark what to redact — by drawing, or by text search

    You have two ways to mark redaction regions. (a) Draw: select the Redact tool and drag a rectangle over each region you want destroyed. Use this mode for arbitrary content — a sentence carrying a quasi-identifier, a paragraph that names a third party, a passage that combines a location and a year and an occupation in a way your IRB protocol requires you to remove. (b) Search-and-mark: type a string and FileHop finds every occurrence in the transcript and queues a redaction mark at each match. Searches are case-insensitive. Use this mode for repeated direct identifiers — a participant's name, a recurring employer name, an institution, a third party who is named 18 times across the transcript. What FileHop does NOT do at this step is pattern-based auto-detection — FileHop will not 'find all names' or 'find all employer mentions' or 'find all quasi-identifiers' for you. Quasi-identifier identification is a researcher / IRB judgment call; there is no regex for it. If you want pattern-based direct-identifier auto-detection across a large corpus, look at QualCoder de-identification plugins, Reduct, or commercial de-id services — most of which are SaaS (accept the upload posture trade). FileHop's lane is the careful, surgical, local redaction of a transcript a researcher is about to share or deposit.

    Step 3: Apply the redactions

    When every region is marked, click Apply Redactions. FileHop rewrites the page content streams: each text operator that falls under a redaction rectangle is rewritten with the redacted glyphs removed; the removed run's total advance width is collapsed to a single quantized number (500-unit grid) so that the side channel documented in Bland 2023 does not leak the redacted token's original width; image XObjects under a redaction rectangle have the pixels inside that rectangle painted solid black; inline images entirely inside a redaction are dropped; vector paths fully inside a redaction are dropped. If a page uses a Type3 font (rare in modern transcripts but possible with older oral-history archive scans), FileHop aborts the redaction with an error rather than reporting a partial redaction as success. FileHop also sanitizes the Info dictionary in the same pass, so the redacted transcript does not need a separate metadata-strip step to remove the transcription vendor's name or the QDAS project metadata from the file properties.

    Step 4: Save (and let FileHop verify the result automatically)

    Choose Save As and pick an output path. FileHop writes the new PDF, then automatically re-opens that output file from scratch and re-walks every page that had a redaction. If any text glyph survives under any redaction rectangle, OR any inline image survives, OR any pixel in a painted image is not solid black inside the painted region, the function returns an error AND deletes the output file before you can use it. You will see an error in the app and no redacted file will be saved. (Source: services/pdf/redactor.rs, verify_redaction at line 1543: "text is still present under a redaction" / "an inline image is still present under a redaction" / "image content is still present under a redaction".) The unredacted master transcript on disk is untouched; if verification fails you still have the master. If verification passes, you have a redacted file you can share with a co-author, attach to a data appendix, deposit to an archive, or include in an IRB amendment — with the file-layer guarantee that the marked content is destructively gone.
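The mark, apply, and verify-on-save sequence can be sketched on plain text. This is a stand-in under stated assumptions, not FileHop's implementation: it works on characters rather than PDF glyphs and rectangles, and the names mark, apply_marks, save_fail_closed, and RedactionVerificationError are invented for the example.

```python
import os
import re
import tempfile


class RedactionVerificationError(Exception):
    """Raised when marked content is still present in the written output."""


def mark(text: str, needle: str) -> list[tuple[int, int]]:
    """Step 2: case-insensitive literal search; returns (start, end) offsets."""
    if not needle:
        return []
    return [m.span() for m in re.finditer(re.escape(needle), text, re.IGNORECASE)]


def apply_marks(text: str, marks: list[tuple[int, int]]) -> str:
    """Step 3: delete the marked spans. Marks must not overlap."""
    for start, end in sorted(marks, reverse=True):
        text = text[:start] + text[end:]
    return text


def save_fail_closed(text: str, out_path: str, needles: list[str]) -> None:
    """Step 4: write, re-read from disk, delete the output if anything survived."""
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write(text)
    with open(out_path, encoding="utf-8") as fh:
        written = fh.read()
    if any(mark(written, needle) for needle in needles):
        os.remove(out_path)
        raise RedactionVerificationError("marked text is still present in the output")


source = "Dr. Patel said so. I told DR. PATEL twice."
redacted = apply_marks(source, mark(source, "dr. patel"))

with tempfile.TemporaryDirectory() as folder:
    out_path = os.path.join(folder, "redacted.txt")
    save_fail_closed(redacted, out_path, ["dr. patel"])
    print(os.path.exists(out_path))
```

The design point is the order of operations in save_fail_closed: the check reads back what was actually written, and a failed check removes the file before raising, so a caller never ends up holding a partially redacted output.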

Verify it actually worked (5-minute checklist — works on any redactor's output)

FileHop's automatic re-walk verification gives you fail-closed assurance for the FileHop path. The habits below are tool-agnostic and work on a redacted transcript PDF from any source — Word's redaction tool, Acrobat, ATLAS.ti's anonymize export, FileHop, anything. Run them on every redacted transcript before you send, share, or deposit. Five minutes; do it cold; do it on every transcript, not just a sample.

  1. Copy-paste test. Open the redacted transcript in your default PDF reader. Select all with Cmd+A (Mac) or Ctrl+A (Windows), copy with Cmd+C or Ctrl+C, then paste into a plain-text editor (TextEdit in plain-text mode on Mac, Notepad on Windows; NOT Word, which can hide structure, and NOT Notes, which renders link text). If any redacted name, location, or identifier appears in the text editor, the redaction is overlay-only and the file is broken. Throw it away.
  2. pdftotext / save-as-text test. From your PDF reader, use File → Export As → Text or, on the command line, pdftotext file.pdf out.txt (pdftotext is the canonical extraction tool and ships with Xpdf / Poppler; many qualitative researchers will already have it from a previous corpus-analysis workflow). Open the .txt file. Search (Cmd+F / Ctrl+F) for the specific strings you redacted: every participant name, every employer name, every place name. If any redacted string appears, the redaction is broken.
  3. Second-reader test. Open the PDF in a DIFFERENT reader than the one you used to redact it (if you redacted in Acrobat, open in Preview or Edge; if you redacted in FileHop, open in Acrobat or Preview). Repeat the copy-paste test in the second reader. Different readers extract text via different paths; if one passes and another fails, you have an overlay-only redaction.
  4. Search-the-file test (advanced; for archive-deposit-grade transcripts). Use a command-line tool to dump the raw object stream (qpdf --qdf --object-streams=disable file.pdf out.pdf; then grep out.pdf for the redacted strings). If they appear in the raw stream, the redaction is broken. A clean grep is not proof on its own, because text set in fonts with custom or multi-byte encodings is not stored as readable literal strings, so treat this test as a complement to tests 1 to 3 rather than a replacement. This step is overkill for casual co-author sharing but is the most thorough check on this list for deposit-grade transcripts going to a public archive (ICPSR, Syracuse QDR, UK Data Archive).
  5. Visual cold-open test. Close the file. Reopen it. Scroll every page. Confirm no leftover comment bubbles, no track-change residue, no highlight annotation that escaped flattening, no sidebar markup. A real destructive redaction will show opaque black or empty whitespace where the redactor intended; an overlay redaction sometimes reveals annotation handles when the redaction layer is selected.
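If you run the pdftotext test on many transcripts, it can be scripted. The sketch below assumes pdftotext (Xpdf or Poppler) is installed and on your PATH; the function names find_leaks and extract_text are made up for this example. Whitespace is normalised before matching so that a name broken across two lines is still caught.

```python
import subprocess


def find_leaks(extracted_text: str, redacted_terms: list[str]) -> list[str]:
    """Return the redacted terms that still appear in the extracted text."""
    haystack = " ".join(extracted_text.split()).lower()
    return [
        term for term in redacted_terms
        if " ".join(term.split()).lower() in haystack
    ]


def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return its output ("-" sends the text to stdout)."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


# Example with text that would have come from extract_text("redacted.pdf"):
sample = "Participant: my supervisor was Dr.\nPatel, at the clinic."
print(find_leaks(sample, ["Dr. Patel", "Preston"]))
```

An empty list means none of the listed strings were extracted; it says nothing about strings you forgot to list, so keep the term list in step with your redaction log.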

What your redacted transcript needs around it (the file is one piece of an IRB-protocol companion)

The redacted transcript is one file. An IRB-approved data-management protocol typically expects three more alongside it:

  1. The UNREDACTED master transcript, stored in IRB-restricted storage with access controls (encrypted-at-rest, institutional secure-research-storage, restricted-access shared drive, or, in some clinical workflows, a HIPAA-compliant institutional server). The unredacted master is the source of truth for verifying any future query about what was redacted, what was preserved, and whether the redacted file matches the original on every page.
  2. The REDACTION LOG: a per-transcript record of what was marked and a brief reason. Some IRB offices accept a single corpus-level log ('all participant names, all employer names, all geographic references at the city level or below, generalised to region per UK Data Service convention'); others want a per-transcript line-numbered log. FileHop does not auto-generate this log; you maintain it.
  3. The DATA-MANAGEMENT PROTOCOL document itself: the IRB-approved statement of how transcripts will be created, redacted, stored, shared, and (if applicable) deposited. The redacted transcript is the OUTPUT; the protocol is the procedural justification.

Some jurisdictions and contexts add a fourth or fifth file. Education researchers handling student interview data under FERPA may need to map each redaction to a category-of-record reason. Clinical / health researchers under HIPAA's Safe Harbor or Expert Determination methods may need a determination memo from a qualified statistician confirming the de-identification meets the relevant standard. EU-based or EU-data research under GDPR (Recital 26 on anonymous information, Article 4(5) on pseudonymisation, and Article 29 Working Party Opinion 05/2014) may need a documented data-protection impact assessment (DPIA) reference. Archive deposit to ICPSR, Syracuse Qualitative Data Repository, the UK Data Service, or a journal's data-supplement repository typically has a deposit-specific de-identification checklist on top. Read the repository's policy before the redaction pass, because the policy informs what to mark.

FileHop handles the FILE-LAYER step. The redaction log is your job. The unredacted master storage is your institution's job (encrypted-at-rest secure research storage, restricted-access drives, institutional HIPAA-compliant servers). The IRB-approved protocol is the prior step — usually established at the consent-and-protocol-design phase, not bolted on afterward. This guide is COMPLEMENTARY to your IRB protocol, not a replacement for it.
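Since the redaction log is maintained by hand, a small script can at least keep its format consistent. The sketch below writes a per-transcript CSV; the column names and the sample entries are hypothetical, and it records a category and reason rather than the removed text. Whether your log should also record the removed text (in restricted storage) is a question for your protocol.

```python
import csv
import io

# Hypothetical columns; adapt to what your IRB office asks for.
LOG_FIELDS = ["transcript_id", "page", "category", "reason"]


def write_log(entries: list[dict], stream) -> None:
    """Write redaction-log entries as CSV with a header row."""
    writer = csv.DictWriter(stream, fieldnames=LOG_FIELDS)
    writer.writeheader()
    writer.writerows(entries)


entries = [
    {"transcript_id": "T07", "page": 2, "category": "third-party name",
     "reason": "supervisor named by participant"},
    {"transcript_id": "T07", "page": 5, "category": "quasi-identifier",
     "reason": "occupation plus region plus hire year"},
]

buffer = io.StringIO()
write_log(entries, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open(path, "w", newline="", encoding="utf-8") in place of the StringIO buffer.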

What this workflow does NOT do

  • Not IRB / FERPA / HIPAA / GDPR certified. FileHop does not certify that any specific redaction meets any specific institution's IRB-approved data-management protocol, any specific FERPA de-identification standard, any specific HIPAA Safe Harbor or Expert Determination method, or the GDPR standard for anonymous information (Recital 26). The redaction is destructive at the PDF content-stream level, the output is automatically re-verified before save, and the tool-agnostic verification checklist above is the researcher's own second check. The combination is engineering-defensible at the file layer; whether it satisfies your specific institution, IRB, repository, or regulator is a question for you, your IRB office, and (if applicable) your data-protection officer.
  • Not a substitute for an IRB-approved data-management protocol. This article describes a file-layer hygiene step. The decision of WHAT to mark — which quasi-identifiers count as identifying in your study population, how to generalise locations, when to pseudonymise instead of redact, what the redaction log needs to record, how to store the unredacted master — is set by your IRB-approved protocol and your researcher expertise. The article will NOT make those decisions for you, and FileHop will NOT make them for you either.
  • Not pattern-based auto-detection. FileHop redacts what you mark — by drawing rectangles or by literal-string search. There is no AI 'find all names / all phone numbers / all employers' pass. For pattern-based direct-identifier auto-detection across a large corpus, look at QualCoder de-id plugins, Reduct, or commercial de-identification services (most of which are SaaS — accept the upload posture trade).
  • Not quasi-identifier detection. Quasi-identifier identification is fundamentally a researcher / IRB judgment call and cannot be solved by pattern detection of any kind. No regex catches 'small town plus occupation plus year'. The article teaches the framework so you can make the call; the tool does not make the call for you.
  • Not pseudonymisation. FileHop destroys the marked identifier; it does NOT replace 'Dr. Patel' with 'Dr. P1' and consistently use 'Dr. P1' for every subsequent mention. If your protocol requires consistent pseudonym replacement (the convention in many longitudinal-interview studies that want to preserve narrative coherence), the right tools are ATLAS.ti's anonymize feature, QualCoder's deidentification plugin, or a careful find-and-replace workflow with a pseudonym-mapping document maintained alongside.
  • Not for redacting audio or video interview recordings. FileHop works on text transcripts (PDF, or DOCX → PDF). It does NOT redact audio file content or video frames. For audio / video redaction, look at Reduct, Trint's redaction features, Sonix's PII detection, or Descript's overdub-based redaction.
  • Not on iPad / Linux / web. FileHop runs on macOS and Windows desktop only. Linux is common in some research subfields (computational social science, digital humanities, computational linguistics). For those workflows, run the redaction step in a Mac/Windows VM on the Linux host, or on a colleague's institutional Mac/Windows machine. The verification checklist above still applies on any platform.
  • Not a replacement for NVivo / ATLAS.ti / MAXQDA / Dedoose / QualCoder. Those tools own qualitative ANALYSIS. FileHop is the file-hygiene step BETWEEN analysis-export and sharing/depositing — not a replacement for the analysis software.
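For readers whose protocol calls for pseudonymisation rather than redaction, the "careful find-and-replace with a pseudonym-mapping document" mentioned in the list above can be sketched as follows. The mapping, the sample text, and the function name pseudonymise are invented for this example; the sketch matches exact spellings only, so variants (a bare surname, a nickname) each need their own key entry.

```python
import re


def pseudonymise(text: str, key: dict[str, str]) -> str:
    """Replace every real name in `key` with its pseudonym, consistently."""
    if not key:
        return text
    # Longest names first so "Dr. Patel" is tried before any shorter entry.
    names = sorted(key, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b"
    )
    return pattern.sub(lambda match: key[match.group(0)], text)


# The key file belongs in restricted storage, never beside the shared transcript.
pseudonym_key = {"Dr. Patel": "Dr. P1", "Nurse Okafor": "Nurse P2"}

text = "Dr. Patel hired me. Later Nurse Okafor and Dr. Patel disagreed."
print(pseudonymise(text, pseudonym_key))
```

The mapping is what makes this pseudonymisation rather than anonymisation: anyone holding the key can reverse it, so the output is only as protected as the key file.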

Why this workflow runs locally

The 'no upload' line in the headline is the central promise of this guide. Here is what it means in practice for IRB-restricted material, with the limits stated honestly.

  • All four redaction steps run inside the FileHop desktop app on your computer. The transcript, the marked redaction rectangles, the search strings (if you used search-and-redact), and the output file all stay on your machine. Nothing transits our servers during the redaction itself.
  • No telemetry on file contents. We do not log which transcript you redacted, what strings you searched, or what regions you marked.
  • No AI training on your files.
  • Open output format. FileHop writes standard PDF that opens in any reader your co-author, repository, or IRB reviewer uses.
  • The on-disk source transcript is never modified. FileHop writes a new file at save time, and the verification re-walk operates on that new file; if verification fails, the new file is deleted and your master is untouched.
  • Honest scope on cloud features: cloud OCR is opt-in and clearly labelled in the app. If you don't turn it on, no part of the file leaves your computer. OCR is only relevant to this workflow if you're redacting a scanned transcript with no text layer (rare — almost all modern transcripts are text-native; relevant mainly for archival oral-history work with typewritten transcripts that were scanned).

FAQs

What is the difference between anonymisation and pseudonymisation in qualitative research?
Anonymisation destroys the identifier — the participant's name is gone from the file, irretrievable. Pseudonymisation replaces the identifier with a consistent stand-in — 'Dr. Patel' becomes 'Dr. P1' everywhere, and the mapping is recorded in a separate key file that lives in IRB-restricted storage. Both are common in qualitative research; the right choice depends on your IRB protocol, your analytical needs (consistent narrative coherence across the corpus typically calls for pseudonymisation), and your archive-deposit policy. FileHop does anonymisation (destructive redaction) only. For pseudonymisation, ATLAS.ti's anonymize feature, QualCoder's deidentification plugin, or a careful find-and-replace with a pseudonym key document are the standard tools.
How do I share an interview transcript with a co-author without revealing the participant?
Decide what to mark based on your IRB-approved protocol and the quasi-identifier framework above (direct identifiers like names + indirect identifiers / quasi-identifiers like small-population context). Run the FileHop 4-step workflow. Run the tool-agnostic verification checklist on the output. Send the redacted file (not the master) to your co-author. Keep the unredacted master in IRB-restricted storage. Keep a redaction log. If your protocol restricts co-author access further than the redacted file does, share via your institution's secure-research-file-transfer system rather than email.
What are quasi-identifiers and why do they matter for transcripts?
Quasi-identifiers are pieces of information that, individually, do not identify a participant (a year, a job title, a neighbourhood, a school district, a regional event) but in combination, in a small enough population, point at a single person. Karen Kaiser (Qualitative Health Research, 2009) calls this the deductive-disclosure risk. Qualitative data is especially exposed because it is contextually rich: a participant's three-sentence answer can carry an occupation, a year, a location, a third-party reference, and a distinctive verbal tic all at once. The mistake is treating de-identification as a name-redaction pass. The harder work is the quasi-identifier judgment, and only your researcher expertise (or your IRB's) can supply it.
Does find-and-replace in Word actually remove the original text from the exported PDF?
Not always. If the document has change-tracking enabled (Review → Track Changes), the original text is preserved in the revision history and can survive the PDF export. If the document has comments, the original phrase may be quoted in a comment thread. If the export 'flattens' annotations, some flatten implementations preserve the text layer and others rasterize the page — and you cannot tell which from looking at the output. The safer pattern: finalize the .docx (accept all changes, remove comments via Word's Document Inspector under File → Info → Check for Issues → Inspect Document), export to PDF, then run destructive redaction on the PDF and verify with the copy-paste / pdftotext checklist.
Can ATLAS.ti / NVivo / MAXQDA's anonymize feature produce a fully de-identified PDF for sharing?
Their anonymize features handle the analytical-tool-internal anonymisation (consistent pseudonym assignment within the project, codebook-aware text replacement). When you export to PDF from those tools, what survives in the export depends on the export pipeline. Run the tool-agnostic verification checklist (copy-paste, pdftotext) on any QDAS export before you share or deposit it — that is how you find out whether the export preserved the underlying text layer or destructively replaced it. FileHop's role for QDAS users is as the post-export file-hygiene step: re-mark anything the QDAS export missed and run the destructive redaction with verification.
What does FERPA require for redacting student interview transcripts?
FERPA regulates personally identifiable information from education records and requires either de-identification (so that a reasonable person cannot identify the student with reasonable certainty) or written consent for disclosure. The U.S. Department of Education's Privacy Technical Assistance Center (PTAC) publishes a de-identification terms reference. FileHop handles the file-layer step; whether your specific redaction satisfies FERPA's de-identification standard is a determination your IRB office (and, in some cases, the school's FERPA officer) makes. The article is complementary to that determination, not a substitute.
What does HIPAA require for redacting clinical interview transcripts?
HIPAA's de-identification methods are Safe Harbor (remove the 18 specified identifiers per 45 CFR § 164.514(b)(2)) or Expert Determination (a qualified statistician determines that the risk of re-identification is very small). Safe Harbor is a checklist; Expert Determination is a methodology. FileHop handles the file-layer destruction of marked content; the Safe Harbor checklist is the IRB or privacy officer's mapping job, and the Expert Determination memo (if used) comes from a qualified statistician. FileHop does not certify HIPAA compliance and does not replace either method.
What about GDPR — am I anonymising or pseudonymising under EU law?
Under GDPR Recital 26 (which places anonymous information outside the regulation), Article 4(5) (which defines pseudonymisation), and the Article 29 Working Party Opinion 05/2014 on Anonymisation Techniques, anonymisation must be IRREVERSIBLE: the data subject can no longer be re-identified. Pseudonymised data is still personal data because the mapping key can re-identify the subject. Many qualitative-research workflows that call their output 'anonymised' are technically pseudonymised under GDPR. If your study is EU-based or includes EU data subjects, your data-protection officer's determination on which standard applies is the controlling document. FileHop handles the file-layer destruction; the legal determination is your DPO's job.
How do I redact a transcript that has track changes still in it?
First, accept or reject all tracked changes in Word (Review → Accept All / Reject All), or run Word's Document Inspector (File → Info → Check for Issues → Inspect Document), which removes change-tracking metadata, comments, and document properties. THEN export to PDF. THEN run the FileHop destructive redaction. Skipping this clean-up step risks carrying tracked revisions or comments into the PDF, where they can survive even after the visible content is redacted.
Can I do this on an iPad or Linux machine?
FileHop runs on macOS and Windows desktop only. On iPad, the closest equivalents are Adobe Acrobat or PDF Expert (both have real redaction tools — verify with the copy-paste test). On Linux, the closest equivalents are Master PDF Editor or qpdf-based CLI workflows; the verification checklist above (especially the pdftotext and qpdf steps) is the most important habit to preserve on Linux.
What about scanned transcripts from oral-history archives?
If the transcript is a scan with no text layer (typewritten in 1978, scanned in 2002, no OCR), search-and-redact will not work — there are no characters to search. You can still draw rectangles manually to redact the image regions; FileHop paints the underlying pixels to solid black and verifies the result on the image-XObject path. If you want search-and-redact on a scanned transcript, you need OCR first. FileHop's OCR is currently cloud-based (opt-in); if that posture is not acceptable for IRB-restricted material, OCR upstream in a tool you trust (Tesseract locally, Adobe Acrobat's OCR locally) and bring the searchable PDF back.

Before you share, deposit, or attach

Mark the direct identifiers your IRB protocol requires. Mark the quasi-identifiers your researcher expertise tells you are at risk in your study population. Apply. Save. Run the copy-paste / pdftotext / second-reader checklist before you share, deposit, or attach — every time, on every transcript, regardless of which tool produced the redaction. The redacted file is one piece; your unredacted master, your redaction log, and your IRB-approved data-management protocol are the rest. If you do this work regularly, the persona page at /for/researchers/ walks the broader workflow set.