AFI Terra Pipeline

Pipeline version: 0.4.1  |  Post-pipeline filter: V4  |  Workflows: AFI_16S_Main / AFI_16S_Batch  |  Platform: Terra (Broad Institute)  |  Last updated: May 13, 2026

What's new in this revision: Sections 13 and 14 are new. Section 13 documents the V4 post-pipeline decontamination filter (including the species-level Burkholderia pseudomallei safeguard and the positive-control spike-in bypass), and Section 14 summarizes analytical validation against a 48-sample reference panel plus an 86-sample AFI study cohort. Section 12 has been expanded to document Tier 2 "Probable" thresholds. All other sections are unchanged from the v0.4.1 user guide.

1. Overview

The afi_terra pipeline detects and identifies Rickettsiales organisms — primarily Orientia, Rickettsia, and related genera — from 16S rRNA metagenomic sequencing data. It runs on Terra, the Broad Institute's cloud execution platform, using WDL 1.0 workflows backed by Docker images.

Target organisms

The pipeline is designed to detect genera in the order Rickettsiales. For Orientia and Rickettsia specifically, a confirmatory 16S rRNA alignment step provides higher specificity. All other genera detected by the classifier (e.g., Leptospira, Burkholderia, Anaplasma) are reported from Centrifuger read counts alone.

Two workflows

Workflow | WDL file | Use case
AFI_16S_Main | wdl/AFI_16S_Main.wdl | Single sample
AFI_16S_Batch | wdl/AFI_16S_Batch.wdl | Full sequencing run (scatter/gather)

For clinical use, run the batch workflow. It processes all samples in a run together, automatically computes a per-run NTC background, and produces a consolidated run_summary.tsv. The Terra Sheet Builder desktop app (Section 8) generates the required sample sheet TSV without manual editing.

Two sample modes

Mode | Use case | Primary output
routine | Clinical samples with unknown organisms | routine_summary.tsv
validation | Samples with known expected organisms | validation_summary.tsv

A single batch run can contain both modes simultaneously.

2. Architecture and Data Flow

Processing steps

Each sample passes through up to eight numbered steps (step 7 has separate validation and routine variants). Steps 1–5 run in Phase 1; steps 6–7 run in Phase 2 after per-run NTC backgrounds are computed; step 8 is the final batch-level gather.

Step 1  Human read dehosting        NCBI SRA Scrubber (optional, default on)
Step 2  QC / adapter trimming       fastp v0.23.4
Step 3  Taxonomic classification    Centrifuger -> kreport -> genus counts TSV
Step 4  16S alignment               minimap2 -> BAM
Step 5  Alignment metrics           extract_rick16s_metrics.py -> align_metrics.tsv
        ---- batch gather: BuildNTCBackground (NTC/NC samples only; per run_id) ----
        ---- batch gather: MatchNTCBackground (assigns each sample its run's NTC) ----
Step 6  NTC-aware interpretation    call_taxa.py -> calls.tsv + taxa_evidence.tsv
Step 7a Validation summary          compare_expected (validation mode only)
Step 7b Routine summary             summarize_routine (routine mode only)
Step 8  Run summary                 run_summary.tsv (batch only)

Batch workflow phases

Phase 1 scatter — runs steps 1–5 for all samples in parallel.

BuildNTCBackground — collects all NTC/NC metrics and computes one ntc_background_<run_id>.tsv per distinct run_id.

MatchNTCBackground — maps each sample to the NTC background for its run_id. This task fails if any run_id has no NTC/NC sample.

Phase 2 scatter — runs steps 6–7 for all samples using matched NTC backgrounds.

BuildRunSummary — merges all per-sample summaries into a single run_summary.tsv.
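
The two gather tasks above can be illustrated with a minimal Python sketch. It is not the pipeline's task code: the record layout (dicts with sample_id, run_id, sample_type and a per-genus mapped_reads mapping) and the rule of keeping the maximum NTC read count per genus are assumptions made for illustration. Only the per-run grouping and the failure when a run_id has no NTC/NC sample come from the description above.

```python
from collections import defaultdict

NTC_TYPES = {"NTC", "NC"}


def build_ntc_backgrounds(samples):
    """Group NTC/NC metrics by run_id.

    Assumption: the background for a genus is the maximum mapped-read count
    seen in any NTC/NC sample of that run.
    """
    backgrounds = defaultdict(dict)
    for sample in samples:
        if sample["sample_type"] not in NTC_TYPES:
            continue
        background = backgrounds[sample["run_id"]]
        for genus, reads in sample["mapped_reads"].items():
            background[genus] = max(background.get(genus, 0), reads)
    return dict(backgrounds)


def match_ntc_backgrounds(samples, backgrounds):
    """Map every sample_id to its run's background; fail if a run has none."""
    missing = sorted({s["run_id"] for s in samples} - set(backgrounds))
    if missing:
        raise ValueError("run_id(s) with no NTC/NC sample: " + ", ".join(missing))
    return {s["sample_id"]: backgrounds[s["run_id"]] for s in samples}
```

Raising an error instead of silently falling back to an empty background mirrors the documented behavior of MatchNTCBackground, which fails rather than calling samples without a control.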

3. Reference Files and Databases

16S Rickettsiales panel

File | Description
rickettsiales_panel_16S.fa | 16S reference sequences, original
rickettsiales_panel_16S.clean.fa | Cleaned version (recommended)
rickettsiales_panel_16S.mmi | Pre-built minimap2 index

Recommendation: Use rickettsiales_panel_16S.clean.fa or the pre-built .mmi index. The pre-built index avoids index-build overhead at runtime.

Centrifuger database

The database (~67 GiB compressed) is stored in GCS as a TAR.GZ archive. Provide two inputs:

Input | Example | Description
centrifuger_db | centrifuger_bact_arch_plus_rickettsiales | Index prefix string
centrifuger_db_archives | ["gs://bucket/...tar.gz"] | Array of GCS paths

VM sizing: The Centrifuger task requires centrifuger_memory = "96G" and centrifuger_disks = "local-disk 375 HDD". The default Terra VM (1 CPU / 2 GB RAM) is far too small and will cause the task to fail.

4. Importing Workflows into Terra

Option 1: Dockstore (recommended)

  1. In Dockstore, link the GitHub repository PHemarajata/afi_terra.
  2. Create or refresh a workflow version from the desired Git tag or branch.
  3. In Terra, navigate to Workflows and click Find a Workflow.
  4. Select Dockstore as the source and import via TRS. Terra resolves all imports automatically.

Option 2: Direct ZIP upload

If Dockstore is not available, create a ZIP preserving relative directory paths:

zip -r afi_terra_wdl.zip \
  wdl/AFI_16S_Batch.wdl \
  wdl/AFI_16S_Main.wdl \
  wdl/tasks/ \
  NCBI_scrub_PE/tasks/quality_control/read_filtering/task_ncbi_scrub.wdl

Important: Uploading AFI_16S_Batch.wdl on its own will fail, because it imports AFI_16S_Main.wdl and all task WDLs via relative paths.

5. Running the Single-Sample Workflow

Note: For production clinical runs, use the batch workflow. The single-sample workflow requires a pre-computed ntc_background.tsv to be supplied manually.

Required inputs

Input | Type | Description
sample_id | String | Unique sample identifier
r1_fastq | File | GCS path to R1 FASTQ (.fastq.gz)
r2_fastq | File | GCS path to R2 FASTQ (.fastq.gz)
rickettsiales_panel | File | GCS path to 16S reference FASTA or .mmi index
ntc_background | File | GCS path to NTC background TSV; use placeholder for first pass
centrifuger_db | String | Centrifuger index prefix
centrifuger_db_archives | Array[File] | GCS paths to TAR.GZ archives

Mode and type inputs

Input | Default | Allowed values
mode | routine | routine, validation
sample_type | clinical | NTC, NC, PC_MIX8, PC_SINGLE, MIXED4, clinical, PC
use_human_scrub | true | true, false

Launching in Terra

  1. Navigate to Workflows and select AFI_16S_Main.
  2. Choose Run workflow with inputs defined by file paths.
  3. Upload or paste your input JSON, then click Run Analysis.
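
As an illustration of step 3, the following Python sketch assembles an input JSON from the inputs listed in this section. It assumes the usual Cromwell/Terra key convention of "<workflow name>.<input name>" and that the workflow declared in wdl/AFI_16S_Main.wdl is named AFI_16S_Main; the helper name build_main_inputs and every gs:// path shown are hypothetical placeholders. Check the generated keys against the inputs Terra displays for the imported workflow.

```python
import json

# Assumption: input keys are "<workflow name>.<input name>" and the workflow
# is named AFI_16S_Main. Adjust if the WDL declares a different name.
WORKFLOW = "AFI_16S_Main"


def build_main_inputs(sample_id, r1_fastq, r2_fastq, rickettsiales_panel,
                      ntc_background, centrifuger_db, centrifuger_db_archives,
                      mode="routine", sample_type="clinical", use_human_scrub=True):
    """Return a dict suitable for json.dumps() as a single-sample input file."""
    if mode not in {"routine", "validation"}:
        raise ValueError(f"unsupported mode: {mode}")
    inputs = {
        "sample_id": sample_id,
        "r1_fastq": r1_fastq,
        "r2_fastq": r2_fastq,
        "rickettsiales_panel": rickettsiales_panel,
        "ntc_background": ntc_background,
        "centrifuger_db": centrifuger_db,
        "centrifuger_db_archives": list(centrifuger_db_archives),
        "mode": mode,
        "sample_type": sample_type,
        "use_human_scrub": use_human_scrub,
        # VM sizing from Section 3; the Terra default VM is too small.
        "centrifuger_memory": "96G",
        "centrifuger_disks": "local-disk 375 HDD",
    }
    return {f"{WORKFLOW}.{name}": value for name, value in inputs.items()}


example = build_main_inputs(
    sample_id="SAMPLE01",  # hypothetical sample and paths
    r1_fastq="gs://YOUR_BUCKET/fastq/SAMPLE01_R1.fastq.gz",
    r2_fastq="gs://YOUR_BUCKET/fastq/SAMPLE01_R2.fastq.gz",
    rickettsiales_panel="gs://YOUR_BUCKET/ref/rickettsiales_panel_16S.clean.fa",
    ntc_background="gs://YOUR_BUCKET/ref/run3_ntc_background.tsv",
    centrifuger_db="centrifuger_bact_arch_plus_rickettsiales",
    centrifuger_db_archives=["gs://YOUR_BUCKET/db/centrifuger_db.tar.gz"],
)
print(json.dumps(example, indent=2))
```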

6. Running the Batch Workflow

The batch workflow (AFI_16S_Batch) processes one or more sequencing runs in a single Terra submission. It scatters each sample through processing, automatically builds a per-run NTC background, and produces a consolidated run summary.

Multi-run support: A single batch submission can include samples from multiple sequencing runs. Assign each sample a run_id matching its run; NTC backgrounds are computed independently per run_id.

Per-sample arrays

Input | Type | Description
run_ids | Array[String] | Sequencing run identifier per sample
sample_ids | Array[String] | Sample identifiers
r1_fastqs | Array[File] | R1 FASTQ GCS paths
r2_fastqs | Array[File] | R2 FASTQ GCS paths
sample_types | Array[String] | Sample type for each sample
modes | Array[String] | routine or validation per sample
expected_taxa | Array[String] | Expected taxa string; "" for routine samples

Sample types in a batch run

sample_type | Description | Contributes to NTC background
NTC | No-template control | Yes
NC | Negative control (alias for NTC) | Yes
PC_MIX8 | 8-organism positive control | No
PC_SINGLE | Single-organism positive control | No
clinical | Clinical patient sample | No
PC | Generic positive control | No

NTC requirement: Every distinct run_id in the batch must have at least one sample with sample_type of NTC or NC; otherwise MatchNTCBackground will fail.

7. The Two-Pass Pattern

The batch workflow handles NTC backgrounds entirely automatically — no manual intervention is needed. The two-pass procedure below applies only to the single-sample workflow or exceptional reprocessing scenarios.

Step 1: First pass — placeholder NTC background

Supply wdl/inputs/ntc_background.placeholder.tsv (all-zero thresholds) as ntc_background. Collect the align_metrics output for each NTC/NC sample from Terra.

Step 2: Build run-specific NTC background

python3 scripts/build_ntc_background_from_metrics.py \
  --ntc-metrics path/to/NTC_S11.align_metrics.tsv \
  --ntc-metrics path/to/NTC_S12.align_metrics.tsv \
  --out wdl/inputs/run3.ntc_background.tsv

Step 3: Upload and re-run

gsutil cp wdl/inputs/run3.ntc_background.tsv gs://YOUR_BUCKET/ref/run3_ntc_background.tsv

Re-submit the sample with ntc_background pointing to the real file. The second pass produces NTC-corrected calls.

8. Terra Sheet Builder (GUI)

The Terra Sheet Builder is a desktop GUI application that generates a Terra-compatible sample-set TSV for AFI_16S_Batch without manually editing JSON or TSV files. It is the recommended way to prepare batch inputs for routine clinical use.

Getting the application

Binaries are built automatically by GitHub Actions. Download the artifact for your platform from Actions → Build Terra Sheet Builder on the repository page:

Platform | Artifact
Linux (amd64) | Single-file ELF executable
Windows (x64) | .exe, no console window
macOS Intel | .app bundle, zipped
macOS Apple Silicon (arm64) | .app bundle, zipped

Run from source (requires Python 3.10+):

pip install PySide6
python3 tools/terra_sheet_builder/terra_sheet_builder.py

Step-by-Step Walkthrough

The following walkthrough covers generating a Terra metadata sheet for the AFI_16S_Batch workflow for a two-run batch.

Step 1: Specify the number of runs in this batch

On the first screen, enter how many sequencing runs you want to process together. In this example, there are two runs to process. Use the spinner to set the count.

[Screenshot: Step 1 – run count screen]

Step 2: Click Continue

After setting the run count, click Continue to advance to the run-naming step.

[Screenshot: Step 2 – continue button]

Step 3: Name each run

In the run-naming step, enter a unique name for each run. Each name is used as the run_id value for all samples in that run. Spaces are not allowed; use underscores or hyphens (e.g., run_2024_11). In this example, runs are named Run 1 and Run 2. Click Next when done.

[Screenshot: Step 3 – run names]

Step 4: Select FASTQ files for each run

In this step, add FASTQ files to each run. You can import from pre-generated TSV files or browse to the folder where R1 and R2 FASTQ files are located. Click Browse Folder to select a directory — the app auto-discovers all R1/R2 pairs using the _R1_/_R2_ naming convention.

[Screenshot: Step 4 – FASTQ selection options]

Three methods to add samples:
  • Browse folder — auto-discovers all R1/R2 FASTQ pairs in a directory
  • Add files… — multi-file dialog; app auto-pairs R1+R2 by name pattern
  • Import from TSV… — global import with run_id and sample_id columns
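
The auto-pairing behavior can be sketched as follows. This is an illustration of pairing by the _R1_/_R2_ naming convention, not the Sheet Builder's own code; in particular, deriving the sample name from the text before the last _R1_ token is an assumption, and pair_fastqs is a hypothetical helper name.

```python
def pair_fastqs(filenames):
    """Pair R1/R2 FASTQ files that differ only in the _R1_/_R2_ token.

    Returns {sample_name: (r1_file, r2_file)}. Files without a mate are skipped.
    Assumption: the sample name is everything before the last "_R1_" token.
    """
    names = set(filenames)
    pairs = {}
    for name in sorted(names):
        prefix, token, suffix = name.rpartition("_R1_")
        if not token:
            continue  # not an R1 file
        mate = f"{prefix}_R2_{suffix}"
        if mate in names:
            pairs[prefix] = (name, mate)
    return pairs


files = [
    "16901195_S5_L001_R1_001.fastq.gz",
    "16901195_S5_L001_R2_001.fastq.gz",
    "orphan_S9_L001_R1_001.fastq.gz",  # no R2 mate, so it is not paired
]
print(pair_fastqs(files))
```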

Step 5: Browse to the FASTQ folder and click Open

Navigate to the folder containing the paired read files. Select the folder and click Open. The app will scan for all valid R1/R2 FASTQ pairs automatically.

[Screenshot: Step 5 – folder browser dialog]

Step 6: Files are sorted into the metadata table automatically

The app sorts discovered files into the metadata table based on the specified runs. Repeat the folder selection for the remaining runs to be processed.

[Screenshot: Step 6 – files auto-sorted into table]

Step 7: Verify sample names and file names, then click Next

Tables with sample names and file names are populated automatically. Review the table to confirm all entries are correct. Once ready, click Next to proceed to sample metadata entry.

[Screenshot: Step 7 – populated sample table]

Step 8: Review the sample metadata page

This page requires additional information. Note that the Analysis Date field has already been pre-filled automatically with today's date.

[Screenshot: Step 8 – sample metadata page]

Step 9: Enter the Terra table name and operator initials

Fill in the Terra metadata table name (lowercase letters, digits, and underscores; at most 32 characters; must start with a letter) and the operator's initials. Both are recorded in the analysis_comments column of the output TSV for traceability.

[Screenshot: Step 9 – table name and initials]

Step 10: Designate the NTC (negative control) for each run

Below the header fields, the sample table lists all samples for both runs. Each run must have at least one NTC and one positive control. In the sample_type column, find the NTC sample for Run 1 and select NTC from the dropdown. The sample_type for clinical samples remains clinical.

[Screenshot: Step 10 – selecting NTC sample type]

Step 11: Confirm the NTC designation

The NTC sample for Run 1 is now designated. The sample_type dropdown shows NTC for that row. Continue to designate the positive control for Run 1.

[Screenshot: Step 11 – NTC confirmed]

Step 12: Begin changing the positive control sample type

For the positive control sample in Run 1, click the sample_type dropdown to change it from clinical to the appropriate control type.

[Screenshot: Step 12 – opening PC dropdown]

Step 13: Open the sample type dropdown

Click the dropdown control for the positive control row to open the list of available sample types.

[Screenshot: Step 13 – dropdown open]

Step 14: Select the positive control type

Change the sample_type to the type of control used. In this example, PC_MIX8 is selected — a positive control containing DNA from 8 bacterial species.

[Screenshot: Step 14 – PC_MIX8 selected]

sample_type | mode (auto-filled) | expected_taxa (auto-filled)
PC_MIX8 | validation | Bacillus;Listeria;Staphylococcus;Enterococcus;Limosilactobacillus;Salmonella;Escherichia;Pseudomonas
clinical | routine | (empty)
NTC / NC | routine | (empty)

Step 15: Expected taxa auto-populate for PC_MIX8

Once PC_MIX8 is selected, the expected_taxa field auto-fills with the genera of the 8 bacterial species expected from that control. This tells the validation workflow which taxa should be identified.

[Screenshot: Step 15 – expected taxa auto-filled]

Step 16: Validation catches missing controls before export

If an operator forgets to designate the NTC and positive control for Run 2 and clicks Validate, the Export TSV button remains greyed out. The app has detected missing required controls.

[Screenshot: Step 16 – Export TSV greyed out]

Step 17: Review and fix the validation error

The app generates a descriptive error message stating that Run 2 has no NTC and no positive control specified. Click OK to acknowledge. Correct the missing designations for Run 2 before proceeding.

[Screenshot: Step 17 – validation error dialog]

Validation checks performed:
  • No duplicate sample_id values across all runs
  • Each run has at least one NTC or NC sample
  • Each run has at least one positive control (PC_MIX8, PC_SINGLE, MIXED4, or PC)
  • All validation-mode rows have a non-empty expected_taxa field
  • Table name matches ^[a-z][a-z0-9_]{0,31}$
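
The checks listed above are simple enough to express in a few lines. The following is a minimal sketch and not the Sheet Builder's implementation: the row layout (dicts with sample_id, run_id, sample_type, mode and expected_taxa) and the wording of the error messages are assumptions.

```python
import re
from collections import Counter

TABLE_NAME_RE = re.compile(r"[a-z][a-z0-9_]{0,31}")
NTC_TYPES = {"NTC", "NC"}
PC_TYPES = {"PC_MIX8", "PC_SINGLE", "MIXED4", "PC"}


def validate_sheet(table_name, rows):
    """Return a list of human-readable errors; an empty list means export is allowed."""
    errors = []
    if not TABLE_NAME_RE.fullmatch(table_name):
        errors.append(f"invalid table name: {table_name!r}")

    counts = Counter(row["sample_id"] for row in rows)
    for sample_id in sorted(s for s, n in counts.items() if n > 1):
        errors.append(f"duplicate sample_id: {sample_id}")

    types_by_run = {}
    for row in rows:
        types_by_run.setdefault(row["run_id"], set()).add(row["sample_type"])
    for run_id, types in sorted(types_by_run.items()):
        if not types & NTC_TYPES:
            errors.append(f"{run_id}: no NTC or NC sample")
        if not types & PC_TYPES:
            errors.append(f"{run_id}: no positive control")

    for row in rows:
        if row["mode"] == "validation" and not row.get("expected_taxa", "").strip():
            errors.append(f"{row['sample_id']}: validation mode requires expected_taxa")
    return errors
```

Collecting every error before returning, instead of stopping at the first one, lets an operator fix all missing designations in a single pass.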

Step 18: Validate (all checks must pass before export)

After correcting all issues, click Validate again. If all checks pass, the Export TSV button becomes clickable, and the sheet is ready to be exported and uploaded to Terra.

[Screenshot: Step 18 – Export TSV button enabled]

Step 19: Export TSV and save the file

Click Export TSV and save the file to a location you can easily locate. The file will be named using the table name you entered in the header fields.

[Screenshot: Step 19 – Export TSV dialog]

Step 20: Verify the exported file in Microsoft Excel

Open the exported TSV in Microsoft Excel to confirm the format is correct and compatible with Terra.bio. Check that the entity:<table_name>_id column is the first column, all sample fields are present, and the analysis_comments column appears as the last column.

[Screenshot: Step 20 – TSV verified in Excel]

Importing the TSV into Terra

  1. In your Terra workspace, go to Data → Import Data → Upload TSV.
  2. Select the exported TSV file.
  3. Terra will create or update a sample set entity table with the name you provided.

The exported file uses entity:<table_name>_id as the first column (required by Terra) followed by all sample fields. An analysis_comments column appended as the last column contains <table_name> | <date> | <initials> for traceability.
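
To make the column layout concrete, here is a small Python sketch that writes a file in this shape. It is an illustration only: using sample_id as the entity ID value, the ISO date format, and the helper name write_terra_tsv are assumptions rather than the Sheet Builder's documented behavior.

```python
import csv
import io
from datetime import date


def write_terra_tsv(table_name, initials, rows, fields, analysis_date=None):
    """Return TSV text: entity ID column first, analysis_comments column last.

    Assumptions: the entity ID is the row's sample_id and the date is ISO formatted.
    """
    analysis_date = analysis_date or date.today().isoformat()
    comment = f"{table_name} | {analysis_date} | {initials}"
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow([f"entity:{table_name}_id", *fields, "analysis_comments"])
    for row in rows:
        writer.writerow([row["sample_id"], *(row.get(f, "") for f in fields), comment])
    return buffer.getvalue()


text = write_terra_tsv(
    "afi_run1", "PH",  # hypothetical table name and initials
    rows=[{"sample_id": "S1", "run_id": "run_1", "sample_type": "clinical"}],
    fields=["run_id", "sample_type"],
    analysis_date="2026-05-13",
)
print(text)
```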

9. Helper Scripts

All helper scripts are in the scripts/ directory. They run locally (not in Terra) and are used to prepare inputs, post-process outputs, and manage Docker images.

build_batch_inputs_json.py

Converts a sample sheet TSV or AFI mapping TSV into a Terra-ready batch input JSON.

python3 scripts/build_batch_inputs_json.py \
  --sample-sheet wdl/inputs/batch_samples.template.tsv \
  --out-json wdl/inputs/batch_run1.json \
  --rickettsiales-panel gs://YOUR_BUCKET/ref/rickettsiales_panel_16S.clean.fa \
  --centrifuger-db centrifuger_bact_arch_plus_rickettsiales

build_ntc_background_from_metrics.py

Builds a run-specific NTC background TSV from one or more NTC align_metrics.tsv output files.

python3 scripts/build_ntc_background_from_metrics.py \
  --ntc-metrics path/to/NTC1.align_metrics.tsv \
  --ntc-metrics path/to/NTC2.align_metrics.tsv \
  --out wdl/inputs/run_ntc_background.tsv

make_terra_import_sheet.py

CLI companion to the Terra Sheet Builder GUI. Generates or validates a Terra-compatible sample import TSV and optionally writes an annotated Excel workbook.

python3 scripts/make_terra_import_sheet.py --template --output my_run.tsv
python3 scripts/make_terra_import_sheet.py --input samples.csv --output terra_import.tsv --excel

10. Output Files Reference

Per-sample outputs

File | Description
<id>.genus_counts.tsv | Genus-level Centrifuger read counts
<id>.bam / .bai | minimap2 alignment, sorted and indexed
<id>.align_metrics.tsv | Per-genus alignment metrics (mapped_reads, max_breadth)
calls.tsv | NTC-aware taxa calls (the core interpretive output)
taxa_evidence.tsv | Detailed per-genus evidence with justification strings
routine_summary.tsv | Detected taxa list (routine mode only)
validation_summary.tsv | Concordance summary (validation mode only)

Batch-only outputs

File | Description
ntc_background_<run_id>.tsv | Auto-computed NTC background per distinct run_id
run_summary.tsv | Batch-level summary with per-run PC8 validity; one row per sample

11. Interpreting Calls and Summaries

Call values — alignment source (Orientia and Rickettsia)

Call | Meaning
Confirmed | Strong positive: reads and breadth meet confirmation thresholds and are well above NTC
Probable | Moderate positive: lower thresholds met and reads exceed NTC
Not_Confirmed | Equivocal: sufficient reads but at or below NTC level; cannot distinguish from noise
Negative | Insufficient reads

Call values — Centrifuger source (all other genera)

Call | Meaning
Detected | Reads exceed floor threshold and are well above NTC
Not_Detected | Negative or below threshold

PC8 validity

For PC_MIX8 samples, a pass/fail flag is computed. At least 6 of the 8 expected taxa must be detected for a passing result. In run_summary.tsv, run_pc8_valid carries this value for every row, allowing immediate run-level QC assessment.
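
The pass/fail rule reduces to a set intersection. The sketch below uses the eight expected genera listed for PC_MIX8 in Section 8 and the 6-of-8 rule stated above; the function name pc8_valid is hypothetical and the pipeline's own implementation may differ in detail.

```python
PC_MIX8_EXPECTED = (
    "Bacillus", "Listeria", "Staphylococcus", "Enterococcus",
    "Limosilactobacillus", "Salmonella", "Escherichia", "Pseudomonas",
)


def pc8_valid(detected_genera, expected=PC_MIX8_EXPECTED, min_hits=6):
    """True when at least min_hits of the expected genera were detected."""
    hits = set(expected) & set(detected_genera)
    return len(hits) >= min_hits


# Six expected genera plus one unexpected genus still passes.
print(pc8_valid(["Bacillus", "Listeria", "Staphylococcus", "Enterococcus",
                 "Salmonella", "Escherichia", "Ralstonia"]))
```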

12. Detection Thresholds

Configurable thresholds (Tier 1 "Confirmed" and Centrifuger "Detected")

Input | Default | Description
align_confirm_reads | 100 | Minimum mapped reads for Confirmed alignment call
align_confirm_breadth | 0.25 | Minimum breadth of coverage for Confirmed call
align_fold | 5.0 | Minimum fold over NTC reads for Confirmed call
cfr_floor | 500 | Minimum Centrifuger reads for Detected call
cfr_fold | 5.0 | Minimum fold over NTC reads for Detected call

Hardcoded thresholds (Tier 2 "Probable" and equivocal calls)

The Tier 2 order-level Rickettsiales rescue and the equivocal-call thresholds are defined in scripts/call_taxa.py. These are not currently exposed as workflow inputs but are documented here for transparency.

Call | Reads | Breadth | NTC requirement
Confirmed | ≥ align_confirm_reads (100) | ≥ align_confirm_breadth (0.25) | ≥ align_fold × NTC reads (5×)
Probable | ≥ 50 | ≥ 0.20 | > NTC reads (no fold requirement)
Not_Confirmed | ≥ 50 | (any) | ≤ NTC reads
Negative | < 50 | n/a | n/a

How the two tiers map to clinical reporting. Confirmed calls report the specific Rickettsiales genus (Orientia or Rickettsia). Probable calls report "Rickettsiales detected (genus uncertain; confirmatory species-specific qPCR recommended)". This result is clinically actionable for empirical doxycycline coverage in endemic regions in cases where V1–V3 cannot reliably discriminate Rickettsiales at the genus level.
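
The threshold tables in this section can be read as a small decision function. The sketch below is an interpretation of those tables, not a copy of scripts/call_taxa.py. Two points are assumptions: fold comparisons are treated as "greater than or equal", and a sample with at least 50 reads above NTC but breadth below 0.20 (a case the table does not cover) is returned as Not_Confirmed.

```python
def alignment_call(reads, breadth, ntc_reads,
                   confirm_reads=100, confirm_breadth=0.25, fold=5.0,
                   probable_reads=50, probable_breadth=0.20):
    """Tiered call for Orientia/Rickettsia from alignment metrics."""
    if reads < probable_reads:
        return "Negative"
    if reads <= ntc_reads:
        return "Not_Confirmed"
    if (reads >= confirm_reads and breadth >= confirm_breadth
            and reads >= fold * ntc_reads):
        return "Confirmed"
    if breadth >= probable_breadth:
        return "Probable"
    # Assumption: not specified in the table; treated as equivocal here.
    return "Not_Confirmed"


def centrifuger_call(reads, ntc_reads, floor=500, fold=5.0):
    """Call for all other genera from Centrifuger read counts."""
    if reads >= floor and reads >= fold * ntc_reads:
        return "Detected"
    return "Not_Detected"


print(alignment_call(reads=120, breadth=0.30, ntc_reads=10))
print(alignment_call(reads=80, breadth=0.22, ntc_reads=10))
print(centrifuger_call(reads=600, ntc_reads=150))
```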

13. V4 Decontamination Filter (post-pipeline)

What this is. The pipeline's per-genus NTC subtraction (Section 2) and Centrifuger/alignment thresholds (Section 12) handle run-specific contamination. The V4 filter is an additional post-processing step that targets cross-study reagent / skin / water contaminants documented in landmark low-biomass microbiome reviews (Salter 2014, Glassing 2016, Lauder 2016, de Goffau 2018, Tan 2023). It runs on the pipeline's .calls.tsv outputs and does not modify pipeline behavior or thresholds.

The filter is implemented as a standalone Python script (afi_decontamination_filter_v4.py) that ingests .calls.tsv rows with positive calls (Detected / Confirmed / Probable), applies the tiered rules below, and emits a report of which detections were kept and which were removed.

Tier A — High-confidence kit / skin / water contaminants

The following 11 genera are removed on detection in clinical and study samples. Membership is documented in five or more independent low-biomass contamination reviews and confirmed in this dataset's NTC profiles.

Genus | Most common source
Pseudomonas | Kits, water, Taq, airborne
Ralstonia | Kits, ultrapure water biofilm
Bradyrhizobium | Kits, ultrapure water
Sphingomonas | Kits, water systems
Stenotrophomonas | Kits, water, PCR reagents
Methylobacterium | Kits, ultrapure water systems
Acinetobacter | Kits, PCR reagents, foot traffic
Cutibacterium | Skin (alcohol-resistant), airborne
Staphylococcus | Skin, airborne (coagulase-negative spp.)
Corynebacterium | Kits, skin, airborne
Brevundimonas | Kits, water; up to 285,740 NTC reads observed in run 6_and_7

Tier A exception — Burkholderia species-level safeguard

The genus Burkholderia is a Tier-A-equivalent kit contaminant (the genus is dominated in low-biomass samples by B. cepacia complex species), but it also contains B. pseudomallei, the etiologic agent of melioidosis — a "cannot-miss" diagnosis endemic in Thailand. For every sample with a Burkholderia genus detection, the filter:

  1. Opens the Centrifuger kreport (<sample>.centrifuger.kreport.tsv) and parses rows at the species rank (S).
  2. Sums reads assigned to Burkholderia pseudomallei (NCBI taxonomy ID 28450) at species rank.
  3. Retrieves the run's NTC species-level B. pseudomallei read count for comparison.
  4. Preserves the detection only if both: (i) species reads ≥ 500, AND (ii) species reads > the run-NTC species max.
  5. When preserved, the retained record stores the species-level read count, not the genus total.

Why this matters. Earlier versions of the filter (V3) used a substring match on the literal string "pseudomallei" anywhere in the kreport. That match triggered on the parent clade pseudomallei_group, and the filter then erroneously reported the genus-level Burkholderia read count. In one study sample, this conflated 54,126 genus-level reads (dominated by B. cepacia complex species) with only 8 reads of B. pseudomallei at species rank. V4 reads the species-rank row directly and preserves the detection only when species reads are ≥ 500 and exceed the same-run NTC species count.
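
The safeguard steps above can be sketched in Python as follows. This is an illustration, not the code in afi_decontamination_filter_v4.py. It assumes the kreport follows the Kraken-style tab-separated layout (percentage, clade reads, directly assigned reads, rank code, taxonomy ID, name) and that the clade-level read column is the one being summed; the function names are hypothetical.

```python
BPM_TAXID = "28450"  # NCBI taxonomy ID for Burkholderia pseudomallei


def species_rank_reads(kreport_lines, taxid=BPM_TAXID):
    """Sum clade-level reads for taxid on rows whose rank code is exactly 'S'.

    Assumption: Kraken-style columns (pct, clade reads, taxon reads, rank, taxid, name).
    """
    total = 0
    for line in kreport_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 6:
            continue  # skip blank or malformed rows
        rank, row_taxid = cols[3].strip(), cols[4].strip()
        if rank == "S" and row_taxid == taxid:
            total += int(cols[1])
    return total


def keep_burkholderia(sample_lines, ntc_reports, floor=500):
    """Return (keep, species_reads) for one sample against its run's NTC reports."""
    sample_reads = species_rank_reads(sample_lines)
    ntc_max = max((species_rank_reads(r) for r in ntc_reports), default=0)
    keep = sample_reads >= floor and sample_reads > ntc_max
    return keep, sample_reads
```

Matching on the rank code and taxonomy ID, rather than on the organism name, is what avoids the V3 failure mode: a parent clade whose name contains "pseudomallei" is never counted, and the value returned is the species-level count rather than the genus total.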

Tier B — NTC-only organisms

Nine genera observed only in NTC samples across all sequencing runs are removed globally: Cereibacter, Thioclava, Bdellovibrio, Saltatorellus, Pseudogemmobacter, Minisyncoccus, Rhodoluna, Microbacterium, Arcanobacterium.

Tier 1 — Ultra-low-abundance environmental noise

Sixteen genera with median per-sample abundance < 0.5% across the dataset are removed: Shigella, Metapseudomonas, Stutzerimonas, Capsulimonas, Chamaesiphon, Chloroflexus, Flavihumibacter, Hymenobacter, Limnoglobus, Methylovirgula, Microvirga, Pelagovum, Pseudonocardia, Rufibacter, Salmonella, Spirosoma.

Tier 1 audit note. Mycoplasmopsis and Nitrospira were removed from this list in V4 after review of their actual abundance distributions revealed both at substantially higher abundance than the <0.5% threshold (Mycoplasmopsis mean 39.78%, max 70.55% across 4 study samples). Both are now retained as candidate signal.

Tier 2 — Marginal organisms (conditional removal)

Twenty-seven genera with median 0.5–2.0% abundance are retained only if detected in ≥ 2 samples AND each detection is ≥ 1.0% abundance. Below those thresholds, they are removed as likely contaminants.
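
The conditional rule can be sketched as below. The 27 marginal genera are not enumerated in this guide, so the genus set is passed in as a parameter and the names in the example (GenusA, GenusB, GenusC) are placeholders; the input layout of (sample_id, genus, abundance percent) tuples is likewise an assumption.

```python
def tier2_retained(detections, marginal_genera, min_samples=2, min_abundance=1.0):
    """Return the marginal genera that survive the Tier 2 rule.

    detections: iterable of (sample_id, genus, abundance_pct) tuples.
    A genus is retained only if it is detected in at least min_samples samples
    and every one of those detections is at or above min_abundance percent.
    """
    per_genus = {}
    for sample_id, genus, pct in detections:
        if genus in marginal_genera:
            per_genus.setdefault(genus, {})[sample_id] = pct
    return {
        genus
        for genus, by_sample in per_genus.items()
        if len(by_sample) >= min_samples
        and all(pct >= min_abundance for pct in by_sample.values())
    }


detections = [
    ("s1", "GenusA", 1.2), ("s2", "GenusA", 1.5),  # retained
    ("s1", "GenusB", 3.0),                          # only one sample: removed
    ("s1", "GenusC", 1.4), ("s2", "GenusC", 0.6),  # one detection below 1.0%: removed
]
print(tier2_retained(detections, {"GenusA", "GenusB", "GenusC"}))
```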

Positive-control spike-in bypass

For samples designated as positive controls, Tier A removal is bypassed for any organism that is a documented spike-in for that sample. Without this bypass, the P-aeru_S5_L001 PC would fail because Pseudomonas is in Tier A — but Pseudomonas aeruginosa is precisely the organism the PC is designed to detect. The bypass restores PC interpretability without weakening Tier A removal in clinical samples.

PC sample | Type | Expected genera (filter bypassed for these)
E-coli_S4_L001 | PC_SINGLE | Escherichia
P-aeru_S5_L001 | PC_SINGLE | Pseudomonas
S-pneumo_S2_L001 | PC_SINGLE | Streptococcus
S-suis_S3_L001 | PC_SINGLE | Streptococcus
PC_MIX8 replicates (5) | PC_MIX8 | Bacillus, Enterococcus, Escherichia, Limosilactobacillus, Listeria, Pseudomonas, Salmonella, Staphylococcus (ZymoBIOMICS Microbial Community Standard)
Mixed_S6_L001 | MIXED4 | Escherichia, Pseudomonas, Streptococcus

Under the bypass, a PC is concordant only if all expected spike-in organisms are detected AND retained by the filter.
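
A minimal sketch of Tier A removal with the spike-in bypass is shown below. It covers only the 11 Tier A genera listed earlier in this section (the Burkholderia safeguard and the other tiers are left out), and the function names and argument layout are assumptions rather than the filter script's interface.

```python
TIER_A = frozenset({
    "Pseudomonas", "Ralstonia", "Bradyrhizobium", "Sphingomonas",
    "Stenotrophomonas", "Methylobacterium", "Acinetobacter", "Cutibacterium",
    "Staphylococcus", "Corynebacterium", "Brevundimonas",
})
PC_TYPES = frozenset({"PC_MIX8", "PC_SINGLE", "MIXED4", "PC"})


def apply_tier_a(detected, sample_type, expected_spike_ins=()):
    """Split detected genera into (kept, removed) under Tier A.

    The bypass applies only to positive controls, and only to their
    documented spike-in genera.
    """
    bypass = set(expected_spike_ins) if sample_type in PC_TYPES else set()
    kept = [g for g in detected if g not in TIER_A or g in bypass]
    removed = [g for g in detected if g in TIER_A and g not in bypass]
    return kept, removed


def pc_concordant(detected, sample_type, expected_spike_ins):
    """All expected spike-ins must be detected and retained."""
    kept, _ = apply_tier_a(detected, sample_type, expected_spike_ins)
    return set(expected_spike_ins) <= set(kept)


print(apply_tier_a(["Pseudomonas", "Cutibacterium"], "PC_SINGLE", ["Pseudomonas"]))
print(apply_tier_a(["Pseudomonas", "Cutibacterium"], "clinical"))
```

Because the bypass set is empty for every non-PC sample type, passing expected genera for a clinical sample has no effect, which keeps Tier A removal intact where it matters.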

Running the filter

python3 afi_decontamination_filter_v4.py

The script reads .calls.tsv and Centrifuger kreport files from a configured run directory and writes DECONTAMINATION-FILTER-REPORT-V4.txt with a per-sample / per-detection breakdown of filter actions. A companion appendix generator (generate_appendices.py) produces Markdown and Excel tables with biomass distribution, NTC cross-checks, and confidence labels.

14. Analytical Validation Summary

Scope. The pipeline (with V4 filter applied to clinical samples and PC bypass applied to positive controls) was validated against a 48-sample reference panel and then applied to a study cohort of 86 patient blood specimens from acute febrile illness (AFI) cases with positive blood culture but failed subculture recovery. This section summarizes performance and key findings; per-sample tables are in the companion appendix files.

Validation panel composition

Category | n | Description
Clinical (reference-lab confirmed) | 33 | E. coli (5), O. tsutsugamushi (6), Rickettsia (4), Leptospira (4), B. pseudomallei (5), S. pneumoniae (3), S. suis (3), C. burnetii (2), Yersinia (1)
Positive controls (PC_SINGLE / PC_MIX8 / MIXED4) | 10 | 4 PC_SINGLE, 5 PC_MIX8 (ZymoBIOMICS Standard), 1 MIXED4
No-template controls (NTC) | 5 | Water through extraction + library prep
Total | 48 | Across 9 sequencing runs

Sample-level analytical performance

Category | Concordant | Total | Rate | 95% CI
Clinical (target detected AND retained by V4) | 21 | 33 | 63.6% | 46.6–77.8%
Positive controls (all expected spike-ins detected + retained) | 10 | 10 | 100% | 72.2–100%
NTCs (no TAC bacterial target genus retained) | 5 | 5 | 100% | 56.6–100%
Sample-level analytical performance (clinical + PC) | 31 | 43 | 72.1% | 57.3–83.3%
Overall validation accuracy | 36 | 48 | 75.0% | 61.2–85.1%

The 72.1% sample-level analytical performance is the appropriate headline figure for regulatory documentation and matches the legacy APHL bioinformatic validation report for the same panel.
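
The guide does not state which interval method was used for the 95% CI column. The figures in the table are, however, numerically consistent with the Wilson score interval, so the following sketch can be used to re-derive them from the concordant and total counts. Treat the choice of method as an inference from the numbers, not a documented fact.

```python
from math import sqrt
from statistics import NormalDist


def wilson_ci(successes, total, confidence=0.95):
    """Wilson score interval for a binomial proportion, as (low, high) fractions."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


for label, k, n in [("clinical", 21, 33), ("PC", 10, 10), ("NTC", 5, 5),
                    ("clinical + PC", 31, 43), ("overall", 36, 48)]:
    low, high = wilson_ci(k, n)
    print(f"{label}: {k}/{n} = {k / n:.1%} ({low:.1%} to {high:.1%})")
```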

AFI study cohort (n=86)

  • 71 / 86 samples (82.6%) carried ≥ 1 positive call (Centrifuger Detected, Tier 1 Confirmed, or Tier 2 Probable).
  • 15 / 86 samples (17.4%) had zero detected taxa — consistent with pre-sequencing failure (DNA extraction, library prep, or sequencing depth).
  • V4 removed 70 / 217 detections (32.3%): predominantly Cutibacterium, Staphylococcus, Brevundimonas, Acinetobacter, Corynebacterium.
  • 147 detections retained across the cohort.

Key cohort findings

Rickettsiales rescue evidence in 11 / 86 samples (~13%). One Tier 1 genus-level Orientia detection (16901195_S5_L001, breadth 0.3235, ~12× NTC headroom — the highest-confidence Rickettsiales call in the cohort), and ten Tier 2 order-level "Rickettsiales detected" rescues (breadth 0.21–0.24). This pattern aligns with the expected endemic epidemiology of northeastern Thailand, where Rickettsiales (obligate intracellular, cannot be cultured on routine blood agar) account for a substantial fraction of AFI presentations. All 11 samples warrant species-specific qPCR confirmation.
Mycoplasmopsis retained in 4 samples (mean 39.78%, max 70.55%). Mycoplasma-class organisms are cell-wall-deficient and require sterol-supplemented media plus 1–3 weeks of incubation — exactly the profile that explains positive blood culture signal with failed aerobic subculture. Confirmation by Mycoplasma-specific PCR is recommended.
No study sample contains B. pseudomallei above the species-level threshold. The genus-level Burkholderia signal in the cohort is dominated by B. cepacia complex species (kit contaminants). The validation-panel B. pseudomallei samples (13,744–65,016 species-level reads) remain authentic positives — this specific cohort simply does not contain melioidosis as detected by 16S.

Additional candidate detections (require orthogonal confirmation): Leptospira in one sample at 22.78% (with a same-run NTC contamination caveat — the NTC carried 78,691 Leptospira reads, ~10× the study sample); Brucella in three samples at near-noise abundance (mean 0.98%, max 1.96%); Streptococcus in five samples (V1–V3 cannot resolve species).

Companion documents (manuscript-ready package)

File | Content
MANUSCRIPT-FINAL-DRAFT.md | Long-form manuscript draft (~7,800 words): Title, Abstract, Introduction, Methods, Results, Discussion, Limitations, Conclusion, References, Appendices
MANUSCRIPT-CONDENSED-DRAFT.md | JCM-style condensed draft (~3,900 words)
MANUSCRIPT-WALKTHROUGH-THAI.md | Thai-language walkthrough for the AFI wet-lab team
APPENDIX-VALIDATION-PANEL.md / APPENDIX-STUDY-SAMPLES.md | Per-sample detection tables (Markdown)
APPENDICES.xlsx | Same per-sample tables in Excel (4 sheets: validation detections, validation summary, study detections, study summary)
figure_a_sankey.html / figure_a_sankey.png | Figure 1: Sankey diagram of pre- vs. post-V4 filter genus distribution
DECONTAMINATION-FILTER-REPORT-V4.txt | Raw V4 filter output with summary statistics and Burkholderia species evidence per sample

Methodological revision log

The following corrections were identified during a critical review and are reflected throughout this documentation and the manuscript drafts:

Item | Pre-revision (V3) | Post-revision (V4)
B. pseudomallei safeguard | Substring match on "pseudomallei" → reported genus-level read count | Species-rank kreport parsing with ≥ 500 read floor and NTC comparison
Mycoplasmopsis | Removed under Tier 1 "ultra-low abundance" (mis-categorized) | Retained (mean 39.78% across 4 samples)
Brevundimonas | Not in any tier; retained at mean 16% across 9 samples | Added to Tier A; removed
Study cohort size | "56 samples" (counted only Centrifuger-Detected entries) | 86 samples (71 with detections, 15 zero-detection pre-seq failures)
Rickettsiales in cohort | "0 samples" (Centrifuger-only view) | 11 samples (with Minimap2 rescue rows included)
Sample 00618_S7_L001 | "Failed rescue → Discordant" | Tier 2 Probable rescue → Concordant
P-aeru_S5_L001 PC | Failed (Pseudomonas removed by Tier A) | Passes with PC spike-in bypass
Reference citations (Glassing, Lauder, Tan) | Garbled author names | Verified against PubMed (PMID 27239228, 27338728, 36997797)

15. Docker Images

Input | Default image | Description
afi_core_docker | phemarajata614/afi-terra:0.4.1 | Core analysis image (Python, samtools, minimap2)
fastp_docker | staphb/fastp:0.23.4 | QC trimming
centrifuger_docker | phemarajata614/centrifuger:1.1.0 | Centrifuger classifier

Building and pushing the AFI core image

bash scripts/build_push_afi_core_image.sh phemarajata614 0.4.1 linux/amd64

16. Troubleshooting

Centrifuger task fails or segfaults
The default Terra VM is too small. Set centrifuger_memory = "96G" and centrifuger_disks = "local-disk 375 HDD" in your input JSON.

MatchNTCBackground fails
Every run_id in the batch must include at least one sample with sample_type of NTC or NC. Check the Sheet Builder validation output for missing controls.

Import resolution errors in Terra
Do not upload only the main WDL file. Either use Dockstore (recommended) or package all WDL files and task imports into a ZIP archive before uploading.

Export TSV button stays greyed out
Click Validate first. The button enables only after all validation checks pass. Review the error messages and correct missing controls, duplicate sample IDs, or invalid table name format.