Masking incidental variants in pipelines

On weekly 96-sample germline panels (NextSeq 2000, 2x150), we sometimes see pathogenic hits outside the consented gene list. Our SOP says don’t report beyond scope, but the pipeline still calls them and they creep into QC summaries. Are you masking at the BED/VCF stage or leaving full calls and handling it downstream, and how are you documenting the ethical rationale for audits?

We keep the full VCF for archive, but immediately generate a ‘reportable-only’ VCF with bcftools view -R whitelist.bed and point MultiQC/LIMS at that, so off‑scope variants never appear in QC. For audit, the run README notes the filter and links the consent language; we also reference GA4GH consent codes as the ethical basis: https://www.ga4gh.org/genomic-data-toolkit/regulatory-ethics-toolkit/consent-codes/. Small caveat: keep capture/on‑target metrics sourced from the unfiltered BAM so coverage QC stays honest.
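
For anyone who wants to see what that whitelist filter does at the record level, here is a minimal Python sketch. It is an illustration only, not a replacement for bcftools: the function names (load_bed, in_scope, reportable_only) are made up for this post, and it assumes uncompressed, tab-delimited VCF lines and a plain three-column BED.

```python
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into {chrom: [(start, end), ...]} (0-based, half-open)."""
    regions = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions[chrom].append((int(start), int(end)))
    return regions


def in_scope(vcf_line, regions):
    """Return True if the record's REF span overlaps any whitelist interval."""
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, ref = fields[0], int(fields[1]), fields[3]
    start = pos - 1          # VCF POS is 1-based, BED starts are 0-based
    end = start + len(ref)   # half-open end of the REF allele
    return any(start < e and end > s for s, e in regions.get(chrom, []))


def reportable_only(vcf_lines, regions):
    """Keep header lines and in-scope records; drop off-scope records."""
    return [
        line for line in vcf_lines
        if line.startswith("#") or in_scope(line, regions)
    ]
```

The off-by-one between BED and VCF coordinates is the usual place a hand-rolled filter goes wrong, which is one reason to leave the real filtering to bcftools and use something like this only to sanity-check the result.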

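We do “masking at the BED/VCF stage” by tagging off-scope records FILTER=OFF_SCOPE right after joint genotyping (GATK VariantFiltration with --mask, --mask-name OFF_SCOPE and --filter-not-in-mask against the whitelist BED; SelectVariants --exclude-intervals would drop the records outright rather than tag them), so MultiQC/LIMS never count them as long as they only read PASS records. For audit, the report header includes the whitelist BED SHA and pipeline git tag, and we archive the unfiltered VCF separately. Small caveat: keep coverage metrics from BAMs, not the filtered VCF.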
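
A rough sketch of how the audit header lines could be generated, in Python. The header keys (whitelist_bed_sha256, pipeline_git_tag) and function names are hypothetical, and SHA-256 is an assumption since the post only says "SHA"; adapt to whatever your report template expects.

```python
import hashlib


def bed_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_lines(bed_path, git_tag):
    """Build VCF-style '##key=value' header lines for the audit trail.

    The key names are illustrative, not part of any standard.
    """
    return [
        f"##whitelist_bed_sha256={bed_sha256(bed_path)}",
        f"##pipeline_git_tag={git_tag}",
    ]
```

Hashing the exact BED file used in the run, rather than recording only its filename, lets an auditor confirm which scope definition was applied even if the file is later edited in place.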

Quick example: at the gVCF merge step we tag records outside the whitelist as INFO/CONSENT=OUT and no-call their genotypes (bcftools +setGT in.vcf.gz -- -t q -n . -i 'INFO/CONSENT="OUT"'; plugin options go after the --, and the ^ exclusion prefix only works with -t/-T, not -r), and our dashboards ignore those: privacy blinkers for the pipeline. @philipK56 your dual-file approach is close, but this avoids extra artifacts; in the SOP we cite a brief “data minimization” rationale and include the pipeline graph showing the no-call step for audit.
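
To make the no-call step concrete, here is a small Python sketch of what happens to a single off-scope record. The function name mask_record is hypothetical, it assumes diploid calls (so the missing genotype is written as ./.), and deciding which records are off-scope is left to the caller.

```python
def mask_record(vcf_line, tag="CONSENT=OUT"):
    """No-call every sample genotype and append an INFO tag to one record.

    Assumes a tab-delimited VCF data line; other FORMAT fields are left as-is.
    A matching ##INFO definition for the tag must be added to the VCF header.
    """
    fields = vcf_line.rstrip("\n").split("\t")

    # INFO is column 8 (index 7); replace the "." placeholder or append.
    info = fields[7]
    fields[7] = tag if info in (".", "") else f"{info};{tag}"

    # FORMAT is column 9 (index 8); sample columns follow it.
    if len(fields) > 9:
        keys = fields[8].split(":")
        if "GT" in keys:
            gt_index = keys.index("GT")
            for i in range(9, len(fields)):
                values = fields[i].split(":")
                if gt_index < len(values):
                    values[gt_index] = "./."  # diploid no-call (assumption)
                    fields[i] = ":".join(values)
    return "\t".join(fields)
```

One thing the sketch makes visible: only GT is blanked. Fields such as AD or PL, if present, can still reveal the genotype, so it may be worth checking whether your masking step needs to clear those as well.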