AFI Terra Pipeline

Pipeline version: 0.4.1 | Post-pipeline filter: V4 | Workflows: AFI_16S_Main / AFI_16S_Batch | Platform: Terra (Broad Institute) | Last updated: May 13, 2026

What's new in this revision: Sections 13 and 14 are new. Section 13 covers the V4 post-pipeline decontamination filter (including the species-level Burkholderia pseudomallei safeguard and the positive-control spike-in bypass). Section 14 summarizes analytical validation against a 48-sample reference panel plus an 86-sample AFI study cohort. Section 12 has been expanded to document Tier 2 "Probable" thresholds. All other sections are unchanged from the v0.4.1 user guide.

1. Overview

The afi_terra pipeline detects and identifies Rickettsiales organisms — primarily Orientia, Rickettsia, and related genera — from 16S rRNA metagenomic sequencing data. It runs on Terra, the Broad Institute's cloud execution platform, using WDL 1.0 workflows backed by Docker images.

Target organisms

The pipeline is designed to detect genera in the order Rickettsiales. For Orientia and Rickettsia specifically, a confirmatory 16S rRNA alignment step provides higher specificity. All other genera detected by the classifier (e.g., Leptospira, Burkholderia, Anaplasma) are reported from Centrifuger read counts alone.

Two workflows

Workflow	WDL file	Use case
`AFI_16S_Main`	`wdl/AFI_16S_Main.wdl`	Single sample
`AFI_16S_Batch`	`wdl/AFI_16S_Batch.wdl`	Full sequencing run (scatter/gather)

For clinical use, run the batch workflow. It processes all samples in a run together, automatically computes a per-run NTC background, and produces a consolidated run_summary.tsv. The Terra Sheet Builder desktop app (Section 8) generates the required sample sheet TSV without manual editing.

Two sample modes

Mode	Use case	Primary output
`routine`	Clinical samples with unknown organisms	`routine_summary.tsv`
`validation`	Samples with known expected organisms	`validation_summary.tsv`

A single batch run can contain both modes simultaneously.

2. Architecture and Data Flow

Processing steps

Each sample passes through up to eight steps. Steps 1–5 run in Phase 1; steps 6–7 run in Phase 2 after per-run NTC backgrounds are computed; step 8 is the final batch-level gather.

Step 1  Human read dehosting        NCBI SRA Scrubber (optional, default on)
Step 2  QC / adapter trimming       fastp v0.23.4
Step 3  Taxonomic classification    Centrifuger -> kreport -> genus counts TSV
Step 4  16S alignment               minimap2 -> BAM
Step 5  Alignment metrics           extract_rick16s_metrics.py -> align_metrics.tsv
        ---- batch gather: BuildNTCBackground (NTC/NC samples only; per run_id) ----
        ---- batch gather: MatchNTCBackground (assigns each sample its run's NTC) ----
Step 6  NTC-aware interpretation    call_taxa.py -> calls.tsv + taxa_evidence.tsv
Step 7a Validation summary          compare_expected (validation mode only)
Step 7b Routine summary             summarize_routine (routine mode only)
Step 8  Run summary                 run_summary.tsv (batch only)

Batch workflow phases

Phase 1 scatter — runs steps 1–5 for all samples in parallel.

BuildNTCBackground — collects all NTC/NC metrics and computes one ntc_background_<run_id>.tsv per distinct run_id.

MatchNTCBackground — maps each sample to the NTC background for its run_id. This task fails if any run_id has no NTC/NC sample.

Phase 2 scatter — runs steps 6–7 for all samples using matched NTC backgrounds.

BuildRunSummary — merges all per-sample summaries into a single run_summary.tsv.
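The gather-and-match logic between the two scatter phases can be sketched in a few lines of Python. This is an illustrative sketch only, not the WDL task code: the function name `match_ntc_backgrounds` and the tuple layout are hypothetical, and the sketch does not compute the background values themselves. It only shows how each sample is tied to its run's background and why a run without an NTC/NC sample fails.

```python
from collections import defaultdict

NTC_TYPES = {"NTC", "NC"}  # sample types that contribute to the NTC background


def match_ntc_backgrounds(samples):
    """Map every sample_id to the background file name for its run_id.

    `samples` is a list of (run_id, sample_id, sample_type) tuples
    (a hypothetical layout chosen for this sketch).
    Raises ValueError if any run_id has no NTC/NC sample.
    """
    ntc_by_run = defaultdict(list)
    for run_id, sample_id, sample_type in samples:
        if sample_type in NTC_TYPES:
            ntc_by_run[run_id].append(sample_id)

    all_runs = {run_id for run_id, _, _ in samples}
    missing = sorted(all_runs - set(ntc_by_run))
    if missing:
        raise ValueError("run_id(s) without an NTC/NC sample: " + ", ".join(missing))

    # One background per distinct run_id, named like the batch output file.
    return {sample_id: f"ntc_background_{run_id}.tsv"
            for run_id, sample_id, _ in samples}
```

Failing early here mirrors the documented behavior: Phase 2 cannot interpret a sample whose run has no background to compare against.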

3. Reference Files and Databases

16S Rickettsiales panel

File	Description
`rickettsiales_panel_16S.fa`	16S reference sequences, original
`rickettsiales_panel_16S.clean.fa`	Cleaned version (recommended)
`rickettsiales_panel_16S.mmi`	Pre-built minimap2 index

Recommendation: Use rickettsiales_panel_16S.clean.fa or the pre-built .mmi index. The pre-built index avoids index-build overhead at runtime.

Centrifuger database

The database (~67 GiB compressed) is stored in GCS as a TAR.GZ archive. Provide two inputs:

Input	Example	Description
`centrifuger_db`	`centrifuger_bact_arch_plus_rickettsiales`	Index prefix string
`centrifuger_db_archives`	`["gs://bucket/...tar.gz"]`	Array of GCS paths

VM sizing: The Centrifuger task requires centrifuger_memory = "96G" and centrifuger_disks = "local-disk 375 HDD". The default Terra VM (1 CPU / 2 GB RAM) is far too small and will cause the task to fail.

4. Importing Workflows into Terra

Option 1: Dockstore (recommended)

In Dockstore, link the GitHub repository PHemarajata/afi_terra.
Create or refresh a workflow version from the desired Git tag or branch.
In Terra, navigate to Workflows and click Find a Workflow.
Select Dockstore as the source and import via TRS. Terra resolves all imports automatically.

Option 2: Direct ZIP upload

If Dockstore is not available, create a ZIP preserving relative directory paths:

zip -r afi_terra_wdl.zip \
  wdl/AFI_16S_Batch.wdl \
  wdl/AFI_16S_Main.wdl \
  wdl/tasks/ \
  NCBI_scrub_PE/tasks/quality_control/read_filtering/task_ncbi_scrub.wdl

Important: Uploading AFI_16S_Batch.wdl on its own will fail. It imports AFI_16S_Main.wdl and all task WDLs via relative paths.

5. Running the Single-Sample Workflow

Note: For production clinical runs, use the batch workflow. The single-sample workflow requires a pre-computed ntc_background.tsv to be supplied manually.

Required inputs

Input	Type	Description
`sample_id`	String	Unique sample identifier
`r1_fastq`	File	GCS path to R1 FASTQ (.fastq.gz)
`r2_fastq`	File	GCS path to R2 FASTQ (.fastq.gz)
`rickettsiales_panel`	File	GCS path to 16S reference FASTA or .mmi index
`ntc_background`	File	GCS path to NTC background TSV; use placeholder for first pass
`centrifuger_db`	String	Centrifuger index prefix
`centrifuger_db_archives`	Array[File]	GCS paths to TAR.GZ archives

Mode and type inputs

Input	Default	Allowed values
`mode`	`routine`	`routine`, `validation`
`sample_type`	`clinical`	`NTC`, `NC`, `PC_MIX8`, `PC_SINGLE`, `MIXED4`, `clinical`, `PC`
`use_human_scrub`	`true`	`true`, `false`

Launching in Terra

Navigate to Workflows and select AFI_16S_Main.
Choose Run workflow with inputs defined by file paths.
Upload or paste your input JSON, then click Run Analysis.

6. Running the Batch Workflow

The batch workflow (AFI_16S_Batch) processes one or more sequencing runs in a single Terra submission. It scatters each sample through processing, automatically builds a per-run NTC background, and produces a consolidated run summary.

Multi-run support: A single batch submission can include samples from multiple sequencing runs. Assign each sample a run_id matching its run; NTC backgrounds are computed independently per run_id.

Per-sample arrays

Input	Type	Description
`run_ids`	Array[String]	Sequencing run identifier per sample
`sample_ids`	Array[String]	Sample identifiers
`r1_fastqs`	Array[File]	R1 FASTQ GCS paths
`r2_fastqs`	Array[File]	R2 FASTQ GCS paths
`sample_types`	Array[String]	Sample type for each sample
`modes`	Array[String]	`routine` or `validation` per sample
`expected_taxa`	Array[String]	Expected taxa string; `""` for routine samples
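The per-sample inputs are parallel arrays: element i of every array describes the same sample. A minimal sketch of assembling them from per-sample records follows. The function name, the row keys, and the `AFI_16S_Batch.` key prefix (the usual workflow-qualified input naming) are assumptions made for illustration; scripts/build_batch_inputs_json.py remains the supported way to produce the real input JSON.

```python
import json


def rows_to_batch_inputs(rows, workflow="AFI_16S_Batch"):
    """Turn per-sample dicts into parallel arrays keyed as '<workflow>.<input>'.

    Row keys (run_id, sample_id, ...) are hypothetical names for this sketch.
    """
    columns = {
        "run_ids": "run_id",
        "sample_ids": "sample_id",
        "r1_fastqs": "r1_fastq",
        "r2_fastqs": "r2_fastq",
        "sample_types": "sample_type",
        "modes": "mode",
        "expected_taxa": "expected_taxa",
    }
    inputs = {}
    for array_name, row_key in columns.items():
        if row_key == "expected_taxa":
            # Routine samples carry "" so every array keeps the same length.
            values = [row.get(row_key, "") for row in rows]
        else:
            values = [row[row_key] for row in rows]
        inputs[f"{workflow}.{array_name}"] = values
    return inputs


if __name__ == "__main__":
    example = [
        {"run_id": "run_2024_11", "sample_id": "NTC_S11", "sample_type": "NTC",
         "mode": "routine",
         "r1_fastq": "gs://YOUR_BUCKET/fastq/NTC_S11_R1_001.fastq.gz",
         "r2_fastq": "gs://YOUR_BUCKET/fastq/NTC_S11_R2_001.fastq.gz"},
    ]
    print(json.dumps(rows_to_batch_inputs(example), indent=2))
```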

Sample types in a batch run

sample_type	Description	Contributes to NTC background
`NTC`	No-template control	Yes
`NC`	Negative control (alias for NTC)	Yes
`PC_MIX8`	8-organism positive control	No
`PC_SINGLE`	Single-organism positive control	No
`MIXED4`	Mixed positive control	No
`clinical`	Clinical patient sample	No
`PC`	Generic positive control	No

NTC requirement: Every distinct run_id in the batch must have at least one sample with sample_type of NTC or NC; otherwise MatchNTCBackground will fail.

7. The Two-Pass Pattern

The batch workflow handles NTC backgrounds entirely automatically — no manual intervention is needed. The two-pass procedure below applies only to the single-sample workflow or exceptional reprocessing scenarios.

Step 1: First pass — placeholder NTC background

Supply wdl/inputs/ntc_background.placeholder.tsv (all-zero thresholds) as ntc_background. Collect the align_metrics output for each NTC/NC sample from Terra.

Step 2: Build run-specific NTC background

python3 scripts/build_ntc_background_from_metrics.py \
  --ntc-metrics path/to/NTC_S11.align_metrics.tsv \
  --ntc-metrics path/to/NTC_S12.align_metrics.tsv \
  --out wdl/inputs/run3.ntc_background.tsv

Step 3: Upload and re-run

gsutil cp wdl/inputs/run3.ntc_background.tsv gs://YOUR_BUCKET/ref/run3_ntc_background.tsv

Re-submit the sample with ntc_background pointing to the real file. The second pass produces NTC-corrected calls.

8. Terra Sheet Builder (GUI)

The Terra Sheet Builder is a desktop GUI application that generates a Terra-compatible sample-set TSV for AFI_16S_Batch without manually editing JSON or TSV files. It is the recommended way to prepare batch inputs for routine clinical use.

Getting the application

Binaries are built automatically by GitHub Actions. Download the artifact for your platform from Actions → Build Terra Sheet Builder on the repository page:

Platform	Artifact
Linux (amd64)	Single-file ELF executable
Windows (x64)	`.exe`, no console window
macOS Intel	`.app` bundle, zipped
macOS Apple Silicon (arm64)	`.app` bundle, zipped

Run from source (requires Python 3.10+):

pip install PySide6
python3 tools/terra_sheet_builder/terra_sheet_builder.py

Step-by-Step Walkthrough

The following walkthrough covers generating a Terra metadata sheet for the AFI_16S_Batch workflow for a two-run batch.

Specify the number of runs in this batch

On the first screen, enter how many sequencing runs you want to process together. In this example, there are two runs to process. Use the spinner to set the count.

Click Continue

After setting the run count, click Continue to advance to the run-naming step.

Name each run

In Step 2, enter a unique name for each run. Names are used as the run_id value for all samples in that run. Spaces are not allowed; use underscores or hyphens (e.g., run_2024_11). In this example, the runs are named Run_1 and Run_2 (referred to below as Run 1 and Run 2). Click Next when done.

Select FASTQ files for each run

In this step, add FASTQ files to each run. You can import from pre-generated TSV files or browse to the folder where R1 and R2 FASTQ files are located. Click Browse Folder to select a directory — the app auto-discovers all R1/R2 pairs using the _R1_/_R2_ naming convention.

Three methods to add samples:

Browse folder — auto-discovers all R1/R2 FASTQ pairs in a directory
Add files… — multi-file dialog; app auto-pairs R1+R2 by name pattern
Import from TSV… — global import with run_id and sample_id columns
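The auto-pairing by the _R1_/_R2_ naming convention can be illustrated with a short Python sketch. This is not the Sheet Builder's actual code: `pair_fastqs` is a hypothetical helper, and deriving the sample name from the text before the `_R1_` token is an assumption about how names are formed.

```python
from pathlib import Path


def pair_fastqs(filenames):
    """Pair files whose names differ only by the _R1_/_R2_ token.

    Returns (pairs, unpaired). `pairs` maps a sample stem (the text before
    the read token) to an (r1, r2) tuple; `unpaired` lists files without a mate.
    """
    r1_by_mate = {}  # expected R2 file name -> R1 path
    r2_by_name = {}  # actual R2 file name -> R2 path
    for name in filenames:
        base = Path(name).name
        if "_R1_" in base:
            r1_by_mate[base.replace("_R1_", "_R2_", 1)] = name
        elif "_R2_" in base:
            r2_by_name[base] = name

    pairs, unpaired = {}, []
    for mate_name, r1 in r1_by_mate.items():
        if mate_name in r2_by_name:
            stem = mate_name.split("_R2_", 1)[0]
            pairs[stem] = (r1, r2_by_name.pop(mate_name))
        else:
            unpaired.append(r1)
    unpaired.extend(r2_by_name.values())  # R2 files that never found an R1
    return pairs, sorted(unpaired)
```

Reporting unpaired files separately, rather than dropping them silently, makes it easier to spot a missing or misnamed read file before the sheet is exported.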

Browse to the FASTQ folder and click Open

Navigate to the folder containing the paired read files. Select the folder and click Open. The app will scan for all valid R1/R2 FASTQ pairs automatically.

Files are sorted into the metadata table automatically

The app sorts discovered files into the metadata table based on the specified runs. Repeat the folder selection for the remaining runs to be processed.

Verify sample names and file names, then click Next

Tables with sample names and file names are populated automatically. Review the table to confirm all entries are correct. Once ready, click Next to proceed to sample metadata entry.

Review the sample metadata page

This page requires additional information. Note that the Analysis Date field is pre-filled with today's date.

Enter the Terra table name and operator initials

Fill in the Terra metadata table name (lowercase letters, digits, and underscores only; at most 32 characters; must start with a letter) and the operator's initials. These are recorded in the analysis_comments column of the output TSV for traceability.

Designate the NTC (negative control) for each run

Below the header fields, the sample table lists all samples for both runs. Each run must have at least one NTC and one positive control. In the sample_type column, find the NTC sample for Run 1 and select NTC from the dropdown. The sample_type for clinical samples remains clinical.

Confirm the NTC designation

The NTC sample for Run 1 is now designated. The sample_type dropdown shows NTC for that row. Continue to designate the positive control for Run 1.

Begin changing the positive control sample type

For the positive control sample in Run 1, click the sample_type dropdown to change it from clinical to the appropriate control type.

Open the sample type dropdown

Click the dropdown control for the positive control row to open the list of available sample types.

Select the positive control type

Change the sample_type to the type of control used. In this example, PC_MIX8 is selected — a positive control containing DNA from 8 bacterial species.

sample_type	mode (auto-filled)	expected_taxa (auto-filled)
`PC_MIX8`	`validation`	Bacillus;Listeria;Staphylococcus;Enterococcus;Limosilactobacillus;Salmonella;Escherichia;Pseudomonas
`clinical`	`routine`	(empty)
`NTC` / `NC`	`routine`	(empty)

Expected taxa auto-populate for PC_MIX8

Once PC_MIX8 is selected, the expected_taxa field auto-fills with the 8 bacterial genera expected from that control. This tells the validation workflow which taxa should be identified.

Validation catches missing controls before export

If an operator forgets to designate the NTC and positive control for Run 2 and clicks Validate, the Export TSV button remains greyed out. The app has detected missing required controls.

Review and fix the validation error

The app generates a descriptive error message stating that Run 2 has no NTC and no positive control specified. Click OK to acknowledge. Correct the missing designations for Run 2 before proceeding.

Validation checks performed:

No duplicate sample_id values across all runs
Each run has at least one NTC or NC sample
Each run has at least one positive control (PC_MIX8, PC_SINGLE, MIXED4, or PC)
All validation-mode rows have a non-empty expected_taxa field
Table name matches ^[a-z][a-z0-9_]{0,31}$
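The checks listed above translate directly into code. The following Python sketch shows one way they could be expressed; it is an approximation for illustration, not the Sheet Builder's implementation, and the function name and row keys are hypothetical.

```python
import re
from collections import Counter

TABLE_NAME_RE = re.compile(r"^[a-z][a-z0-9_]{0,31}$")
NTC_TYPES = {"NTC", "NC"}
PC_TYPES = {"PC_MIX8", "PC_SINGLE", "MIXED4", "PC"}


def validate_sheet(table_name, rows):
    """Return a list of error strings; an empty list means every check passed."""
    errors = []
    if not TABLE_NAME_RE.match(table_name):
        errors.append(f"invalid table name: {table_name!r}")

    # Duplicate sample_id values are checked across all runs, not per run.
    for sample_id, n in Counter(r["sample_id"] for r in rows).items():
        if n > 1:
            errors.append(f"duplicate sample_id: {sample_id}")

    # Collect the sample types present in each run.
    types_by_run = {}
    for r in rows:
        types_by_run.setdefault(r["run_id"], set()).add(r["sample_type"])
    for run_id, types in types_by_run.items():
        if not types & NTC_TYPES:
            errors.append(f"{run_id}: no NTC or NC sample")
        if not types & PC_TYPES:
            errors.append(f"{run_id}: no positive control")

    for r in rows:
        if r["mode"] == "validation" and not r.get("expected_taxa", "").strip():
            errors.append(f"{r['sample_id']}: validation row without expected_taxa")
    return errors
```

Returning every error at once, instead of stopping at the first, matches the walkthrough above, where a single message reports both the missing NTC and the missing positive control for Run 2.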

Validate — all checks must pass before export

After correcting all issues, click Validate again. If all checks pass, the Export TSV button becomes clickable and the sheet is ready to be exported and uploaded to Terra.

Export TSV and save the file

Click Export TSV and save the file somewhere easy to find. The file is named after the table name you entered in the header fields.

Verify the exported file in Microsoft Excel

Open the exported TSV in Microsoft Excel to confirm the format is correct and compatible with Terra.bio. Check that the entity:<table_name>_id column is the first column, all sample fields are present, and the analysis_comments column appears as the last column.

Importing the TSV into Terra

In your Terra workspace, go to Data → Import Data → Upload TSV.
Select the exported TSV file.
Terra will create or update a sample set entity table with the name you provided.

The exported file uses entity:<table_name>_id as the first column (required by Terra) followed by all sample fields. An analysis_comments column appended as the last column contains <table_name> | <date> | <initials> for traceability.
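The column layout described above can be sketched as follows. This is a hedged illustration of the file shape only: `build_terra_tsv` is a hypothetical helper, and using sample_id as the entity identifier is an assumption, not a statement of what the Sheet Builder writes.

```python
import csv
import io


def build_terra_tsv(table_name, rows, date, initials):
    """Render rows as a TSV with the entity id column first and comments last.

    ASSUMPTION: each row's sample_id is used as the entity identifier.
    """
    id_col = f"entity:{table_name}_id"
    fields = [key for key in rows[0] if key != "sample_id"]
    comment = f"{table_name} | {date} | {initials}"

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([id_col, *fields, "analysis_comments"])
    for row in rows:
        writer.writerow([row["sample_id"], *(row[f] for f in fields), comment])
    return out.getvalue()
```

A quick way to check an exported file against this layout is to read its header line and confirm that the first field starts with `entity:` and the last field is `analysis_comments`.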

9. Helper Scripts

All helper scripts are in the scripts/ directory. They run locally (not in Terra) and are used to prepare inputs, post-process outputs, and manage Docker images.

`build_batch_inputs_json.py`

Converts a sample sheet TSV or AFI mapping TSV into a Terra-ready batch input JSON.

python3 scripts/build_batch_inputs_json.py \
  --sample-sheet wdl/inputs/batch_samples.template.tsv \
  --out-json wdl/inputs/batch_run1.json \
  --rickettsiales-panel gs://YOUR_BUCKET/ref/rickettsiales_panel_16S.clean.fa \
  --centrifuger-db centrifuger_bact_arch_plus_rickettsiales

`build_ntc_background_from_metrics.py`

Builds a run-specific NTC background TSV from one or more NTC align_metrics.tsv output files.

python3 scripts/build_ntc_background_from_metrics.py \
  --ntc-metrics path/to/NTC1.align_metrics.tsv \
  --ntc-metrics path/to/NTC2.align_metrics.tsv \
  --out wdl/inputs/run_ntc_background.tsv

`make_terra_import_sheet.py`

CLI companion to the Terra Sheet Builder GUI. Generates or validates a Terra-compatible sample import TSV and optionally writes an annotated Excel workbook.

python3 scripts/make_terra_import_sheet.py --template --output my_run.tsv
python3 scripts/make_terra_import_sheet.py --input samples.csv --output terra_import.tsv --excel

10. Output Files Reference

Per-sample outputs

File	Description
`<id>.genus_counts.tsv`	Genus-level Centrifuger read counts
`<id>.bam` / `.bai`	minimap2 alignment, sorted and indexed
`<id>.align_metrics.tsv`	Per-genus alignment metrics (`mapped_reads`, `max_breadth`)
`calls.tsv`	NTC-aware taxa calls — the core interpretive output
`taxa_evidence.tsv`	Detailed per-genus evidence with justification strings
`routine_summary.tsv`	Detected taxa list (routine mode only)
`validation_summary.tsv`	Concordance summary (validation mode only)

Batch-only outputs

File	Description
`ntc_background_<run_id>.tsv`	Auto-computed NTC background per distinct `run_id`
`run_summary.tsv`	Batch-level summary with per-run PC8 validity — one row per sample

11. Interpreting Calls and Summaries

Call values — alignment source (Orientia and Rickettsia)

Call	Meaning
Confirmed	Strong positive: reads and breadth meet confirmation thresholds and are well above NTC
Probable	Moderate positive: lower thresholds met and reads exceed NTC
Not_Confirmed	Equivocal: sufficient reads but at or below NTC level; cannot distinguish from noise
Negative	Insufficient reads

Call values — Centrifuger source (all other genera)

Call	Meaning
Detected	Reads exceed floor threshold and are well above NTC
Not_Detected	Negative or below threshold

PC8 validity

For PC_MIX8 samples, a pass/fail flag is computed. At least 6 of the 8 expected taxa must be detected for a passing result. In run_summary.tsv, run_pc8_valid carries this value for every row, allowing immediate run-level QC assessment.
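As a minimal sketch of the 6-of-8 rule, assuming the expected genera are those listed for PC_MIX8 in Section 8 (the function name `pc8_valid` is hypothetical and is not taken from the pipeline code):

```python
PC_MIX8_EXPECTED = (
    "Bacillus", "Listeria", "Staphylococcus", "Enterococcus",
    "Limosilactobacillus", "Salmonella", "Escherichia", "Pseudomonas",
)


def pc8_valid(detected_genera, expected=PC_MIX8_EXPECTED, min_detected=6):
    """Return True when at least `min_detected` expected genera were detected."""
    hits = set(expected) & set(detected_genera)
    return len(hits) >= min_detected
```

For example, a PC_MIX8 sample in which Listeria and Enterococcus drop out still passes (6 of 8 detected), while losing a third expected genus fails the run-level check.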

12. Detection Thresholds

Configurable thresholds (Tier 1 "Confirmed" and Centrifuger "Detected")

Input	Default	Description
`align_confirm_reads`	100	Minimum mapped reads for Confirmed alignment call
`align_confirm_breadth`	0.25	Minimum breadth of coverage for Confirmed call
`align_fold`	5.0	Minimum fold over NTC reads for Confirmed call
`cfr_floor`	500	Minimum Centrifuger reads for Detected call
`cfr_fold`	5.0	Minimum fold over NTC reads for Detected call
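The Centrifuger rule combines a read floor with a fold-over-NTC requirement. A minimal sketch follows, assuming both comparisons are inclusive (≥); the function name is hypothetical and the real logic lives in scripts/call_taxa.py.

```python
def centrifuger_call(reads, ntc_reads, cfr_floor=500, cfr_fold=5.0):
    """Return 'Detected' when reads clear both the floor and the NTC fold."""
    # ASSUMPTION: both thresholds are inclusive.
    if reads >= cfr_floor and reads >= cfr_fold * ntc_reads:
        return "Detected"
    return "Not_Detected"
```

With the defaults, a genus at 600 reads is Detected when the run's NTC carries 100 reads (600 ≥ 5 × 100) but Not_Detected when the NTC carries 200 reads (600 < 5 × 200), even though it clears the 500-read floor in both cases.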

Hardcoded thresholds (Tier 2 "Probable" and equivocal calls)

The Tier 2 order-level Rickettsiales rescue and the equivocal-call thresholds are defined in scripts/call_taxa.py. These are not currently exposed as workflow inputs but are documented here for transparency.

Call	Reads	Breadth	NTC requirement
Confirmed	≥ `align_confirm_reads` (100)	≥ `align_confirm_breadth` (0.25)	≥ `align_fold` × NTC reads (5×)
Probable	≥ 50	≥ 0.20	> NTC reads (no fold requirement)
Not_Confirmed	≥ 50	(any)	≤ NTC reads
Negative	< 50	—	—

How the two tiers map to clinical reporting. Confirmed calls report the specific Rickettsiales genus (Orientia or Rickettsia). Probable calls report "Rickettsiales detected (genus uncertain; confirmatory species-specific qPCR recommended)" — clinically actionable for empirical doxycycline coverage in endemic regions where V1–V3 cannot reliably discriminate Rickettsiales at the genus level.
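The threshold table above can be read as a decision procedure. The sketch below is one possible encoding, not the code in scripts/call_taxa.py. The parameter names `probable_reads` and `probable_breadth` are hypothetical (the source says these values are hardcoded), and the table does not say what happens when reads are ≥ 50 and above NTC but breadth is below 0.20; the sketch labels its handling of that case as an assumption.

```python
def alignment_call(reads, breadth, ntc_reads,
                   confirm_reads=100, confirm_breadth=0.25, fold=5.0,
                   probable_reads=50, probable_breadth=0.20):
    """Classify an Orientia/Rickettsia alignment result per the table above."""
    if reads < probable_reads:
        return "Negative"
    if reads <= ntc_reads:
        # Enough reads, but indistinguishable from the run's NTC.
        return "Not_Confirmed"
    if (reads >= confirm_reads and breadth >= confirm_breadth
            and reads >= fold * ntc_reads):
        return "Confirmed"
    if breadth >= probable_breadth:
        return "Probable"
    # ASSUMPTION: the table leaves this case (above NTC, breadth < 0.20)
    # unspecified; this sketch treats it as equivocal.
    return "Not_Confirmed"
```

Note that a sample can meet the Confirmed read and breadth thresholds and still be reported as Probable if it does not reach the 5× fold over NTC, because the Probable tier only requires reads above the NTC count.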

13. V4 Decontamination Filter (post-pipeline)

What this is. The pipeline's per-genus NTC subtraction (Section 2) and Centrifuger/alignment thresholds (Section 12) handle run-specific contamination. The V4 filter is an additional post-processing step that targets cross-study reagent / skin / water contaminants documented in landmark low-biomass microbiome reviews (Salter 2014, Glassing 2016, Lauder 2016, de Goffau 2018, Tan 2023). It runs on the pipeline's .calls.tsv outputs and does not modify pipeline behavior or thresholds.

The filter is implemented as a standalone Python script (afi_decontamination_filter_v4.py) that ingests .calls.tsv rows with positive calls (Detected / Confirmed / Probable), applies the tiered rules below, and emits a report of which detections were kept and which were removed.

Tier A — High-confidence kit / skin / water contaminants

The following 11 genera are removed on detection in clinical and study samples. Membership is documented in five or more independent low-biomass contamination reviews and confirmed in this dataset's NTC profiles.

Genus	Most common source
Pseudomonas	Kits, water, Taq, airborne
Ralstonia	Kits, ultrapure water biofilm
Bradyrhizobium	Kits, ultrapure water
Sphingomonas	Kits, water systems
Stenotrophomonas	Kits, water, PCR reagents
Methylobacterium	Kits, ultrapure water systems
Acinetobacter	Kits, PCR reagents, foot traffic
Cutibacterium	Skin (alcohol-resistant), airborne
Staphylococcus	Skin, airborne (coagulase-negative spp.)
Corynebacterium	Kits, skin, airborne
Brevundimonas	Kits, water; up to 285,740 NTC reads observed in run 6_and_7

Tier A exception — Burkholderia species-level safeguard

The genus Burkholderia is a Tier-A-equivalent kit contaminant (the genus is dominated in low-biomass samples by B. cepacia complex species), but it also contains B. pseudomallei, the etiologic agent of melioidosis — a "cannot-miss" diagnosis endemic in Thailand. For every sample with a Burkholderia genus detection, the filter:

Opens the Centrifuger kreport (<sample>.centrifuger.kreport.tsv) and parses rows at the species rank (S).
Sums reads assigned to Burkholderia pseudomallei (NCBI taxonomy ID 28450) at species rank.
Retrieves the run's NTC species-level B. pseudomallei read count for comparison.
Preserves the detection only if both: (i) species reads ≥ 500, AND (ii) species reads > the run-NTC species max.
When preserved, the retained record stores the species-level read count, not the genus total.

Why this matters. Earlier versions of the filter (V3) used a substring match on the literal string "pseudomallei" anywhere in the kreport, which triggered on the parent clade pseudomallei_group and then erroneously reported the genus-level Burkholderia read count. In one study sample, this conflated 54,126 genus-level reads (dominated by B. cepacia complex species) with only 8 reads of B. pseudomallei at species rank. V4 parses the species rank correctly and gates preservation on species reads ≥ 500 against the same-run NTC species count.
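The difference between a substring match and species-rank parsing is easiest to see in code. The sketch below assumes the kreport follows the common Kraken-style layout of six tab-separated columns (percentage, clade reads, direct reads, rank code, taxonomy ID, name). That layout, the choice of the clade-read column, and both function names are assumptions for illustration; this is not the code of afi_decontamination_filter_v4.py.

```python
BPSEUDOMALLEI_TAXID = "28450"  # NCBI taxonomy ID for Burkholderia pseudomallei


def species_reads(kreport_text, taxid=BPSEUDOMALLEI_TAXID):
    """Sum clade reads for `taxid` over rows whose rank code is exactly 'S'.

    ASSUMPTION: Kraken-style columns
    (pct, clade_reads, direct_reads, rank, taxid, name).
    """
    total = 0
    for line in kreport_text.splitlines():
        cols = line.split("\t")
        if len(cols) < 6:
            continue
        clade_reads, rank, row_taxid = cols[1], cols[3], cols[4]
        # Matching on rank and taxid ignores parent clades such as the
        # pseudomallei group, which a substring match would hit.
        if rank.strip() == "S" and row_taxid.strip() == taxid:
            total += int(clade_reads)
    return total


def keep_burkholderia(sample_species_reads, ntc_species_max, floor=500):
    """Preserve only if species reads meet the floor AND exceed the run NTC."""
    return sample_species_reads >= floor and sample_species_reads > ntc_species_max
```

Because the retained value is the species-level sum, a sample with a large genus-level Burkholderia count but only a handful of B. pseudomallei reads is not preserved under this rule.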

Tier B — NTC-only organisms

Nine genera observed only in NTC samples across all sequencing runs are removed globally: Cereibacter, Thioclava, Bdellovibrio, Saltatorellus, Pseudogemmobacter, Minisyncoccus, Rhodoluna, Microbacterium, Arcanobacterium.

Tier 1 — Ultra-low-abundance environmental noise

Sixteen genera with median per-sample abundance < 0.5% across the dataset are removed: Shigella, Metapseudomonas, Stutzerimonas, Capsulimonas, Chamaesiphon, Chloroflexus, Flavihumibacter, Hymenobacter, Limnoglobus, Methylovirgula, Microvirga, Pelagovum, Pseudonocardia, Rufibacter, Salmonella, Spirosoma.

Tier 1 audit note. Mycoplasmopsis and Nitrospira were removed from this list in V4 after review of their actual abundance distributions revealed both at substantially higher abundance than the <0.5% threshold (Mycoplasmopsis mean 39.78%, max 70.55% across 4 study samples). Both are now retained as candidate signal.

Tier 2 — Marginal organisms (conditional removal)

Twenty-seven genera with median 0.5–2.0% abundance are retained only if detected in ≥ 2 samples AND each detection is ≥ 1.0% abundance. Below those thresholds, they are removed as likely contaminants.
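The conditional rule can be written as a single predicate. This is a sketch under one reading of the text above, namely that every detection of the genus must be at or above 1.0% abundance; the function name is hypothetical.

```python
def retain_marginal_genus(abundances_pct, min_samples=2, min_abundance=1.0):
    """Decide whether a marginal genus is retained.

    `abundances_pct` holds the per-sample abundance (in percent) for each
    sample in which the genus was detected.
    """
    return (len(abundances_pct) >= min_samples
            and all(a >= min_abundance for a in abundances_pct))
```

Under this reading, a genus seen at 1.2% and 1.5% in two samples is retained, while one seen at 1.2% and 0.8% is removed because a single detection falls below the abundance threshold.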

Positive-control spike-in bypass

For samples designated as positive controls, Tier A removal is bypassed for any organism that is a documented spike-in for that sample. Without this bypass, the P-aeru_S5_L001 PC would fail because Pseudomonas is in Tier A — but Pseudomonas aeruginosa is precisely the organism the PC is designed to detect. The bypass restores PC interpretability without weakening Tier A removal in clinical samples.

PC sample	Type	Expected genera (filter bypassed for these)
`E-coli_S4_L001`	PC_SINGLE	Escherichia
`P-aeru_S5_L001`	PC_SINGLE	Pseudomonas
`S-pneumo_S2_L001`	PC_SINGLE	Streptococcus
`S-suis_S3_L001`	PC_SINGLE	Streptococcus
`PC_MIX8` replicates (5)	PC_MIX8	Bacillus, Enterococcus, Escherichia, Limosilactobacillus, Listeria, Pseudomonas, Salmonella, Staphylococcus (ZymoBIOMICS Microbial Community Standard)
`Mixed_S6_L001`	MIXED4	Escherichia, Pseudomonas, Streptococcus

Under the bypass, a positive control is concordant only if every expected spike-in organism is both detected AND retained by the filter.
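A minimal sketch of Tier A removal with the spike-in bypass follows, using the 11 genera from the Tier A table. The function names are hypothetical, the Burkholderia safeguard is handled separately and is not modeled here, and this is not the filter's actual code.

```python
TIER_A = frozenset({
    "Pseudomonas", "Ralstonia", "Bradyrhizobium", "Sphingomonas",
    "Stenotrophomonas", "Methylobacterium", "Acinetobacter", "Cutibacterium",
    "Staphylococcus", "Corynebacterium", "Brevundimonas",
})
PC_TYPES = {"PC_MIX8", "PC_SINGLE", "MIXED4", "PC"}


def apply_tier_a(detected, sample_type, expected_spike_ins=()):
    """Split detected genera into (kept, removed) under Tier A.

    Expected spike-ins are exempt only when the sample is a positive control.
    """
    bypass = set(expected_spike_ins) if sample_type in PC_TYPES else set()
    kept = [g for g in detected if g not in TIER_A or g in bypass]
    removed = [g for g in detected if g in TIER_A and g not in bypass]
    return kept, removed


def pc_concordant(kept, expected_spike_ins):
    """True when every expected spike-in survived the filter."""
    return set(expected_spike_ins) <= set(kept)
```

Tying the bypass to the sample type means the same Pseudomonas detection is retained in a Pseudomonas positive control and removed in a clinical sample, which is the behavior described above.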

Running the filter

python3 afi_decontamination_filter_v4.py

The script reads .calls.tsv and Centrifuger kreport files from a configured run directory and writes DECONTAMINATION-FILTER-REPORT-V4.txt with a per-sample / per-detection breakdown of filter actions. A companion appendix generator (generate_appendices.py) produces Markdown and Excel tables with biomass distribution, NTC cross-checks, and confidence labels.

14. Analytical Validation Summary

Scope. The pipeline (with V4 filter applied to clinical samples and PC bypass applied to positive controls) was validated against a 48-sample reference panel and then applied to a study cohort of 86 patient blood specimens from acute febrile illness (AFI) cases with positive blood culture but failed subculture recovery. This section summarizes performance and key findings; per-sample tables are in the companion appendix files.

Validation panel composition

Category	n	Description
Clinical (reference-lab confirmed)	33	E. coli (5), O. tsutsugamushi (6), Rickettsia (4), Leptospira (4), B. pseudomallei (5), S. pneumoniae (3), S. suis (3), C. burnetii (2), Yersinia (1)
Positive controls (PC_SINGLE / PC_MIX8 / MIXED4)	10	4 PC_SINGLE, 5 PC_MIX8 (ZymoBIOMICS Standard), 1 MIXED4
No-template controls (NTC)	5	Water through extraction + library prep
Total	48	Across 9 sequencing runs

Sample-level analytical performance

Category	Concordant	Total	Rate	95% CI
Clinical (target detected AND retained by V4)	21	33	63.6%	46.6–77.8%
Positive controls (all expected spike-ins detected + retained)	10	10	100%	72.2–100%
NTCs (no TAC bacterial target genus retained)	5	5	100%	56.6–100%
Sample-level analytical performance (clinical + PC)	31	43	72.1%	57.3–83.3%
Overall validation accuracy	36	48	75.0%	61.2–85.1%

The 72.1% sample-level analytical performance is the appropriate headline figure for regulatory documentation and matches the legacy APHL bioinformatic validation report for the same panel.
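The source does not name the method used for the 95% confidence intervals. The tabulated bounds appear consistent with Wilson score intervals, so the following sketch uses that method as an inference for readers who want to re-derive the figures; the function name is hypothetical.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Return the (lower, upper) Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at p = 0 or p = 1.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    rows = [("Clinical", 21, 33), ("Positive controls", 10, 10),
            ("NTCs", 5, 5), ("Clinical + PC", 31, 43), ("Overall", 36, 48)]
    for label, k, n in rows:
        lo, hi = wilson_interval(k, n)
        print(f"{label}: {k}/{n} = {k / n:.1%} ({lo:.1%} to {hi:.1%})")
```

The Wilson interval behaves sensibly for small groups with perfect concordance (10/10 and 5/5), where a simple normal approximation would collapse to a zero-width interval at 100%.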

AFI study cohort (n=86)

71 / 86 samples (82.6%) carried ≥ 1 positive call (Centrifuger Detected, Tier 1 Confirmed, or Tier 2 Probable).
15 / 86 samples (17.4%) had zero detected taxa — consistent with pre-sequencing failure (DNA extraction, library prep, or sequencing depth).
V4 removed 70 / 217 detections (32.3%): predominantly Cutibacterium, Staphylococcus, Brevundimonas, Acinetobacter, Corynebacterium.
147 detections retained across the cohort.

Key cohort findings

Rickettsiales rescue evidence in 11 / 86 samples (~13%). One Tier 1 genus-level Orientia detection (16901195_S5_L001, breadth 0.3235, ~12× NTC headroom — the highest-confidence Rickettsiales call in the cohort), and ten Tier 2 order-level "Rickettsiales detected" rescues (breadth 0.21–0.24). This pattern aligns with the expected endemic epidemiology of northeastern Thailand, where Rickettsiales (obligate intracellular, cannot be cultured on routine blood agar) account for a substantial fraction of AFI presentations. All 11 samples warrant species-specific qPCR confirmation.

Mycoplasmopsis retained in 4 samples (mean 39.78%, max 70.55%). Mycoplasma-class organisms are cell-wall-deficient and require sterol-supplemented media plus 1–3 weeks of incubation — exactly the profile that explains positive blood culture signal with failed aerobic subculture. Confirmation by Mycoplasma-specific PCR is recommended.

No study sample contains B. pseudomallei above the species-level threshold. The genus-level Burkholderia signal in the cohort is dominated by B. cepacia complex species (kit contaminants). The validation-panel B. pseudomallei samples (13,744–65,016 species-level reads) remain authentic positives — this specific cohort simply does not contain melioidosis as detected by 16S.

Additional candidate detections (require orthogonal confirmation): Leptospira in one sample at 22.78% (with a same-run NTC contamination caveat — the NTC carried 78,691 Leptospira reads, ~10× the study sample); Brucella in three samples at near-noise abundance (mean 0.98%, max 1.96%); Streptococcus in five samples (V1–V3 cannot resolve species).

Companion documents (manuscript-ready package)

File	Content
`MANUSCRIPT-FINAL-DRAFT.md`	Long-form manuscript draft (~7,800 words): Title, Abstract, Introduction, Methods, Results, Discussion, Limitations, Conclusion, References, Appendices
`MANUSCRIPT-CONDENSED-DRAFT.md`	JCM-style condensed draft (~3,900 words)
`MANUSCRIPT-WALKTHROUGH-THAI.md`	Thai-language walkthrough for the AFI wet-lab team
`APPENDIX-VALIDATION-PANEL.md` / `APPENDIX-STUDY-SAMPLES.md`	Per-sample detection tables (Markdown)
`APPENDICES.xlsx`	Same per-sample tables in Excel (4 sheets: validation detections, validation summary, study detections, study summary)
`figure_a_sankey.html` / `figure_a_sankey.png`	Figure 1: Sankey diagram of pre- vs. post-V4 filter genus distribution
`DECONTAMINATION-FILTER-REPORT-V4.txt`	Raw V4 filter output with summary statistics and Burkholderia species evidence per sample

Methodological revision log

The following corrections were identified during a critical review and are reflected throughout this documentation and the manuscript drafts:

Item	Pre-revision (V3)	Post-revision (V4)
B. pseudomallei safeguard	Substring match on "pseudomallei" → reported genus-level read count	Species-rank kreport parsing with ≥ 500 read floor and NTC comparison
Mycoplasmopsis	Removed under Tier 1 "ultra-low abundance" (mis-categorized)	Retained (mean 39.78% across 4 samples)
Brevundimonas	Not in any tier; retained at mean 16% across 9 samples	Added to Tier A; removed
Study cohort size	"56 samples" (counted only Centrifuger-Detected entries)	86 samples (71 with detections, 15 zero-detection pre-seq failures)
Rickettsiales in cohort	"0 samples" (Centrifuger-only view)	11 samples (with minimap2 rescue rows included)
Sample 00618_S7_L001	"Failed rescue → Discordant"	Tier 2 Probable rescue → Concordant
P-aeru_S5_L001 PC	Failed (Pseudomonas removed by Tier A)	Passes with PC spike-in bypass
Reference citations (Glassing, Lauder, Tan)	Garbled author names	Verified against PubMed (PMID 27239228, 27338728, 36997797)

15. Docker Images

Input	Default image	Description
`afi_core_docker`	`phemarajata614/afi-terra:0.4.1`	Core analysis image (Python, samtools, minimap2)
`fastp_docker`	`staphb/fastp:0.23.4`	QC trimming
`centrifuger_docker`	`phemarajata614/centrifuger:1.1.0`	Centrifuger classifier

Building and pushing the AFI core image

bash scripts/build_push_afi_core_image.sh phemarajata614 0.4.1 linux/amd64

16. Troubleshooting

Centrifuger task fails or segfaults
The default Terra VM is too small. Set centrifuger_memory = "96G" and centrifuger_disks = "local-disk 375 HDD" in your input JSON.

MatchNTCBackground fails
Every run_id in the batch must include at least one sample with sample_type of NTC or NC. Check the Sheet Builder validation output for missing controls.

Import resolution errors in Terra
Do not upload only the main WDL file. Either use Dockstore (recommended) or package all WDL files and task imports into a ZIP archive before uploading.

Export TSV button stays greyed out
Click Validate first. The button enables only after all validation checks pass. Review the error messages and correct missing controls, duplicate sample IDs, or invalid table name format.