Acoustic Pattern Matching
“Acoustic pattern matching is search—except your ‘strings’ are waveforms and your distance metric is learned.”
1. Problem Statement
Acoustic pattern matching is the speech version of “find occurrences of a pattern in a long sequence”: the pattern is audio, the sequence is audio, and “match” means similar enough under real-world variability.
Common product needs:
- Keyword spotting (KWS) without full ASR (wake words, commands, brand names).
- Acoustic event detection (sirens, alarms, gunshots, glass breaking).
- Audio deduplication / near-duplicate detection (re-uploads, content ID, spam).
- Phonetic search (“find clips that sound like X” even when spelling differs).
- Template matching (“did the agent read the compliance disclosure?”).
1.1 What makes acoustic matching hard?
Audio is “high entropy input” with multiple nuisance factors:
- Time variability: the same word stretches/compresses; silence insertions happen.
- Channel variability: microphone frequency response, codec artifacts, packet loss concealment.
- Noise variability: background, reverberation, overlapping speakers.
- Speaker variability: accent, pitch, speaking style.
So the real problem is not exact equality; it’s:
Find target segments whose representation is close to the query after discounting allowable distortions.
1.2 Output contract (what you should return)
In production, you rarely want just a boolean. You want:
- Best match span: start/end time (or frame indices).
- Score: similarity/distance, plus a calibrated confidence if possible.
- Evidence: which model/version, which template ID, which thresholds.
- Failure reason: “no_speech”, “too_noisy”, “low_confidence”, etc.
1.3 Constraints you should design around
Different deployments push you into different architectures:
- Offline search: accuracy and throughput matter; latency is less strict.
- Interactive query: p95 latency must be low (hundreds of milliseconds to a couple of seconds).
- On-device streaming: compute/energy budgets dominate; false alarms are expensive.
2. Fundamentals
2.1 Representation: what exactly are we matching?
Matching raw waveforms is fragile (phase, channel). Most systems match either:
- Handcrafted features: MFCC, log-mel spectrograms.
- Learned embeddings: self-supervised speech/audio representations (frame-level or segment-level).
- Hybrid pipelines: embeddings for fast retrieval, alignment for verification/localization.
This mirrors the idea behind “Wildcard Matching”: you define a state space (time positions) and compute whether/how they align, except the “character comparison” is a distance function.
2.2 Two broad approaches
1) Classical signal matching
- log-mel/MFCC + Dynamic Time Warping (DTW) for elastic alignment
- template matching / correlation variants
2) Embedding-based matching
- map each clip/window into a vector space
- retrieve top-K via vector search (ANN)
- optionally refine/localize with alignment or a second-stage scorer
2.3 What DTW is doing (intuition)
Think of the query feature sequence $Q$ and a candidate sequence $X$. We build a cost matrix $C[i,j] = d(Q_i, X_j)$ and find a monotonic path from $(0,0)$ to $(T_q, T_x)$ minimizing cumulative cost.
Why this helps:
- if the user says the keyword slower, the DTW path “lingers” on some frames
- if the user says it faster, the path advances more quickly
DTW is basically a DP over a state machine: state = (i, j), transitions = advance one or both sequences.
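To make that DP concrete, here is a minimal, unoptimized sketch of the recurrence, assuming frames are rows of (T, D) arrays and plain L2 is the frame distance (production code adds the constraints covered in Section 5.2):

```python
import numpy as np

def dtw_cost(Q: np.ndarray, X: np.ndarray) -> float:
    """Minimal DTW: Q is (Tq, D), X is (Tx, D); returns cumulative cost.

    State = (i, j); transitions advance Q, X, or both (the three mins).
    """
    d = np.linalg.norm(Q[:, None, :] - X[None, :, :], axis=-1)  # pairwise frame distances
    Tq, Tx = d.shape
    D = np.full((Tq + 1, Tx + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Tq + 1):
        for j in range(1, Tx + 1):
            D[i, j] = d[i - 1, j - 1] + min(D[i - 1, j],      # query lingers
                                            D[i, j - 1],      # target lingers
                                            D[i - 1, j - 1])  # both advance
    return float(D[Tq, Tx])
```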
2.4 A practical comparison table (what to use when)
| Approach | Strengths | Weaknesses | Best for |
|---|---|---|---|
| MFCC/log-mel + DTW | interpretable, no training required, handles time warping | alignment cost, sensitive to channel unless normalized | prototyping, small corpora, compliance templates |
| Segment embeddings + ANN | fast at scale, good robustness if trained well | needs data + training, localization is non-trivial | large corpus search, event retrieval |
| Hybrid (ANN → refine) | scale + precision + localization | more moving parts | production search/detection systems |
2.5 Distance metrics, normalization, and “what invariance do you want?”
When you say “distance between two audio patterns”, you are really choosing:
- the feature space (MFCC/log-mel/embeddings)
- the distance in that space (L2, cosine, learned scorer)
- the invariances you want to tolerate (tempo, channel, noise)
Practical guidance:
- Cosine similarity is common for embeddings (after L2 normalization).
- L2 / Euclidean distance is common for handcrafted features and some embeddings.
- A learned verifier (small cross-attention model) can outperform DTW, but raises complexity and needs labeled verification data.
Normalization matters more than people expect:
- per-utterance mean/variance normalization (CMVN) can stabilize MFCC/log-mel distances
- consistent resampling (e.g., always 16kHz) avoids “spectral shift” bugs
If you see “mysterious drift”, suspect the frontend: AGC/VAD changes, resampling changes, or codec changes will shift distributions even if the model weights didn’t change.
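As an illustration, a minimal per-utterance CMVN helper, assuming (D, T) feature matrices like those produced by the extractors in Section 5:

```python
import numpy as np

def cmvn(feats: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-utterance mean/variance normalization of a (D, T) feature matrix.

    Normalizes each feature dimension across time; eps avoids division
    by zero on silent or constant channels.
    """
    mean = feats.mean(axis=1, keepdims=True)
    std = feats.std(axis=1, keepdims=True)
    return (feats - mean) / (std + eps)
```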
2.6 Subsequence matching vs whole-sequence matching
There are two different matching questions:
- Whole-sequence: “Does this entire clip match the query template?”
- Subsequence: “Where inside this long clip does the query occur?”
Subsequence matching is the common production need (search and detection), and it changes your algorithmic choices:
- DTW variants that allow “free start/end”
- windowing (slide a window and score)
- or hybrid retrieval: retrieve candidate regions, then align locally
This distinction is easy to miss in interviews and in production designs, but it determines whether your system can return an actionable time span.
3. Architecture
It helps to separate two archetypes:
3.1 Offline corpus search (query vs an archive)
```
Query audio
    |
    v
Feature/Embedding extraction
    |
    v
Candidate retrieval (ANN / coarse index)
    |
    v
Verification + localization (alignment / rescoring)
    |
    v
Result: (clip_id, span, score, evidence)
```
Key knobs:
- K (how many candidates you refine)
- coarse embedding window length (recall vs speed)
- refinement scorer type (DTW vs learned cross-attention)
3.2 Streaming detection (continuous audio)
```
Mic stream
    |
    v
Framing (10–20ms) + log-mel
    |
    v
Lightweight gate (VAD + tiny detector)
    |
    v
Trigger + verify (optional)
    |
    v
Action + telemetry (privacy-safe)
```
Streaming constraints force discipline:
- bounded runtime per second of audio
- stable false alarm rate (FAR) across device segments
- robust handling of transport artifacts (packet loss, jitter)
3.3 Production components (what you actually need to ship)
If you want this to run reliably, the “matching model” is only one component. A practical production system usually includes:
- Frontend/feature service
  - consistent resampling, framing, normalization
  - versioned configs (treat as a release artifact)
- Embedding service (optional)
  - batch embedding jobs for offline corpora
  - streaming embedding for live audio (on-device or server-side)
- Index store
  - ANN index + metadata store (clip_id → timestamps, segment labels, tenant ACLs)
  - sharding by tenant/time to control blast radius
- Verifier
  - alignment scorer (DTW) or learned verifier
  - returns localization spans + evidence
- Policy + thresholding
  - per-segment thresholds (device/route/codec)
  - warn/block/fallback behaviors (especially for streaming triggers)
- Observability + RCA tooling
  - score distribution dashboards
  - false alarm sampling (privacy-safe)
  - change-log correlation (frontend/model/index updates)
If you skip these, you often end up with a “model that works in a notebook” but is not operable in a product.
4. Model Selection
4.1 Feature choice: MFCC vs log-mel (rule of thumb)
- MFCC compresses spectral structure; historically strong for speech tasks.
- Log-mel is simpler, closer to the raw spectral energy, and often pairs better with neural models.
If you want a strong baseline quickly, log-mel is usually enough.
4.2 DTW baselines (when they win)
Pros:
- requires no labels
- alignment path is interpretable (great for debugging)
Cons:
- cost grows with sequence length
- can be brittle if the frontend changes (resampling/AGC/VAD)
DTW wins when:
- you have a handful of templates
- you need explainability (“why did it match?”)
- your corpus is small enough to afford alignment
4.3 Embeddings: frame-level vs segment-level
Two common design patterns:
- Segment embeddings: one vector per window/clip (fast retrieval).
  - Pros: ANN search scales well.
  - Cons: localization requires sliding windows or second-stage scoring.
- Frame embeddings: one vector per frame (alignment still needed).
  - Pros: great for localization and phonetic-ish matching.
  - Cons: heavier compute and storage.
Many mature systems do:
segment-level retrieval → frame-level or alignment refinement.
4.4 Verification scorers: DTW vs learned “cross” models
Once you have candidates, you need a verifier that answers:
“Is this candidate truly a match, and where is the match span?”
Options:
- DTW on log-mel/MFCC: strong, interpretable, no training required.
- Siamese embedding + threshold: fast but can be overconfident on confusables.
- Learned cross-attention verifier: best accuracy, but needs labeled verification data and careful safety/latency engineering.
A common progression:
- ship DTW verifier first (fastest path to something reliable)
- later replace or augment with a learned verifier when false alarms become the bottleneck
4.5 Keyword spotting vs “query-by-example” (QbE)
Two different product modes:
- KWS (fixed set of keywords)
  - you can train a classifier specifically for the target keywords
  - usually the best latency/accuracy/cost trade-off for wake words
- QbE (user provides an example query)
  - you can’t pretrain on every possible query
  - embeddings + alignment methods become attractive
If your problem statement is “match this arbitrary query sound”, you’re in QbE land, and your system will look more like retrieval than like classification.
4.6 When you need phonetic awareness
Some “sounds-like” searches are really phonetic:
- names with multiple spellings
- accents shifting phoneme realizations
In these cases, systems sometimes incorporate:
- phoneme posteriorgrams (PPGs) or ASR-derived intermediate features
- hybrid approaches: ASR lattice search + acoustic verification
This is where the boundary between “acoustic pattern matching” and “ASR” becomes blurry: you may choose to use a lightweight recognizer as a feature extractor rather than doing full decoding.
5. Implementation (Practical Building Blocks)
The goal here is to show the primitives. In production, you’ll optimize and batch heavily.
5.1 Log-mel extraction (strong default)
```python
import numpy as np
import librosa

def log_mel(x: np.ndarray, sr: int, n_mels: int = 64) -> np.ndarray:
    """
    Returns log-mel spectrogram with shape (n_mels, frames).
    Uses a 10ms hop by default.
    """
    hop = int(0.01 * sr)
    S = librosa.feature.melspectrogram(y=x, sr=sr, n_mels=n_mels, hop_length=hop)
    return librosa.power_to_db(S, ref=np.max)
```
Production notes:
- normalize consistently (per-utterance vs global mean/var)
- keep the frontend versioned (small changes shift score distributions)
5.2 DTW distance baseline
```python
import numpy as np
import librosa

def dtw_distance(q: np.ndarray, x: np.ndarray) -> float:
    """
    DTW distance between feature sequences q and x (shape: D x T).
    Lower is better.
    """
    D, _ = librosa.sequence.dtw(X=q, Y=x, metric="euclidean")
    return float(D[-1, -1])
```
Important: unconstrained DTW can be expensive. In production you typically add (a banded variant with early abandon is sketched after this list):
- window constraints (band around diagonal)
- early abandon (stop if already worse than best)
- length caps or downsampled frames for long clips
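A sketch of the first two constraints, assuming (T, D) frame sequences; `band` and `best_so_far` are illustrative parameter names, not a library API:

```python
import numpy as np

def banded_dtw(Q: np.ndarray, X: np.ndarray, band: int = 32,
               best_so_far: float = np.inf) -> float:
    """DTW with a Sakoe-Chiba-style band and early abandon.

    Q, X: (T, D) frame sequences. `band` limits |i - j| after scaling
    for unequal lengths; returns np.inf if the run is abandoned.
    """
    Tq, Tx = len(Q), len(X)
    scale = Tx / Tq
    D = np.full((Tq + 1, Tx + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Tq + 1):
        center = int(i * scale)
        lo, hi = max(1, center - band), min(Tx, center + band)
        row_min = np.inf
        for j in range(lo, hi + 1):
            cost = np.linalg.norm(Q[i - 1] - X[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            row_min = min(row_min, D[i, j])
        if row_min > best_so_far:  # early abandon: cannot beat the current best
            return np.inf
    return float(D[Tq, Tx])
```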
5.3 Localization: turn “distance” into “time span”
A simple (but compute-heavy) baseline, sketched in code after this list:
- slide a window across the target
- compute distance to the query per window
- pick the best window as the match span
This baseline is incredibly useful for:
- correctness testing
- debugging learned systems (“is the embedding retrieval missing obvious matches?”)
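A minimal version of that baseline, reusing the `dtw_distance` helper from Section 5.2 (features are (D, T); `hop_frames` is an illustrative knob):

```python
import numpy as np

def localize_by_sliding(q_feats: np.ndarray, x_feats: np.ndarray,
                        hop_frames: int = 10):
    """Slide a query-length window across target features (D, T);
    return (start_frame, end_frame, best_distance).

    Assumes a dtw_distance(q, x) function like the one in Section 5.2.
    """
    Tq, Tx = q_feats.shape[1], x_feats.shape[1]
    best = (0, Tq, np.inf)
    for start in range(0, max(1, Tx - Tq + 1), hop_frames):
        dist = dtw_distance(q_feats, x_feats[:, start:start + Tq])
        if dist < best[2]:
            best = (start, start + Tq, dist)
    return best
```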
5.4 Subsequence DTW (“find the query inside a long clip”)
Sliding windows are the simplest way to localize, but they’re expensive because you score many overlapping windows. Subsequence DTW variants let you answer:
“What is the best-aligned span of the target, and what is its cost?”
Intuition:
- you allow the DTW path to start at any column in the target (free start)
- and end at any column (free end)
In practice, many teams implement:
- windowed DTW (only run DTW inside a small candidate region)
- or use ANN to propose candidate spans, then DTW inside those spans
This is the same production principle as “Pattern Matching in ML”: do a cheap coarse match everywhere, then a precise verifier on a bounded candidate set.
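librosa exposes free start/end directly via `subseq=True`; a sketch that returns a frame span plus cost (note the warping path `wp` is returned end-to-start, so its last row holds the match start):

```python
import numpy as np
import librosa

def subsequence_dtw_span(q: np.ndarray, x: np.ndarray):
    """Best-aligned span of target x for query q (both shaped D x T).

    subseq=True lets the warping path start and end at any column of x,
    so the match is localized, not just scored.
    """
    D, wp = librosa.sequence.dtw(X=q, Y=x, subseq=True, metric="euclidean")
    wp = np.asarray(wp)                  # path in reverse order: wp[0] is the end
    start, end = int(wp[-1, 1]), int(wp[0, 1])
    cost = float(D[-1, end])             # accumulated cost at the matched end column
    return start, end, cost
```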
5.5 Embedding retrieval (conceptual, but production-real)
At scale you do not run DTW against everything. You do:
1) embed the query
2) ANN retrieve candidates
3) refine candidates (alignment / learned verifier)
The “gotchas” are mostly in chunking and metadata:
- What window length do you embed? (0.5s, 1s, 2s)
- How do you pool over time? (mean pooling, attention pooling)
- How do you store span metadata so you can localize after retrieval?
Common pattern (a minimal sketch follows the list):
- store embeddings for overlapping windows of each clip (with window_start_ms)
- retrieve top-K windows
- refine locally around that window_start_ms to get precise start/end
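A conceptual sketch of that pattern; `embed` is a hypothetical model call, and brute-force cosine search stands in for a real ANN index:

```python
import numpy as np

def index_clip(clip_feats: np.ndarray, embed, win: int = 100, hop: int = 50,
               frame_ms: int = 10):
    """Embed overlapping windows of (D, T) features, keeping span metadata.

    embed: hypothetical model call mapping a (D, win) slice to a 1-D vector.
    """
    T = clip_feats.shape[1]
    entries = []
    for start in range(0, max(1, T - win + 1), hop):
        vec = embed(clip_feats[:, start:start + win])
        vec = vec / (np.linalg.norm(vec) + 1e-8)  # L2-normalize for cosine
        entries.append({"window_start_ms": start * frame_ms, "vec": vec})
    return entries

def top_k(query_vec: np.ndarray, entries: list, k: int = 10) -> list:
    """Brute-force cosine retrieval; a real system swaps in an ANN index."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-8)
    return sorted(entries, key=lambda e: -float(q @ e["vec"]))[:k]
```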
5.6 Calibration: turning raw scores into decisions
Raw distances/similarities are not probabilities. To ship reliable thresholds, you typically calibrate:
- per segment (device/route/codec)
- per model version
Simple calibration options:
- choose thresholds by targeting a fixed FAR on a held-out “background” set
- fit a lightweight calibrator (e.g., logistic regression) to map score → probability
Operationally, score calibration is what prevents “it worked last month” failures when the traffic mix shifts.
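Both options take only a few lines; a sketch, assuming higher scores mean “more match-like”:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def threshold_for_far(background_scores: np.ndarray, target_far: float) -> float:
    """Threshold hitting `target_far` on a held-out background (non-match) set:
    exactly target_far of background scores lie above the (1 - FAR) quantile."""
    return float(np.quantile(background_scores, 1.0 - target_far))

def fit_calibrator(scores: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    """Map raw scores to match probabilities; use predict_proba at serve time."""
    return LogisticRegression().fit(scores.reshape(-1, 1), labels)
```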
6. Training Considerations (for Embedding Models)
6.1 What labels mean
Be explicit about your similarity definition:
- “same keyword” similarity
- “same event class” similarity
- “same speaker” similarity (different task!)
Mixing these objectives without care creates embeddings that are hard to threshold.
6.2 Loss functions that work
Common choices:
- contrastive loss
- triplet loss
- classification head + embedding from penultimate layer
6.3 Hard negative mining (critical in practice)
Random negatives are too easy. You want negatives that trigger false alarms:
- TV speech and music
- confusable words (“Alexa” vs “Alexis”)
- codec artifacts that mimic phonetic edges
Practical loop (sketched below):
- run the current model on a corpus
- collect top-scoring wrong matches
- train the next version using those as hard negatives
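In code, the loop is little more than “score, sort, keep the top wrong answers” (all names here are illustrative):

```python
def mine_hard_negatives(score_fn, candidates, positive_ids, top_n=1000):
    """Keep the highest-scoring non-matches as hard negatives.

    score_fn: current model's scorer, candidate -> float (hypothetical).
    candidates: dicts with at least an "id" field.
    """
    negatives = [c for c in candidates if c["id"] not in positive_ids]
    negatives.sort(key=lambda c: -score_fn(c))  # most confusable first
    return negatives[:top_n]
```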
6.4 Augmentation: make robustness real
High-impact augmentations (a noise-mixing sketch follows the list):
- additive noise at varying SNRs
- room impulse responses (reverberation)
- codec simulation (OPUS/AAC, packet loss)
- gain changes + clipping simulation
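The first of these is easy to get right in a few lines; a sketch for mixing noise at a target SNR, assuming mono float waveforms:

```python
import numpy as np

def add_noise_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into clean speech at a target SNR.

    Noise is tiled/trimmed to the clean length, then scaled so that
    10 * log10(P_clean / P_noise) == snr_db.
    """
    noise = np.resize(noise, clean.shape)
    p_clean = np.mean(clean ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise
```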
6.5 On-device vs server training constraints
If you must ship on-device:
- quantization becomes part of the model design
- you care about streaming inference and memory footprint
- you often need tiny models + strong frontends
7. Production Deployment
7.1 Version and cache embeddings like features
Operationally, embeddings are “features for search”. Store:
- embedding vector
- model version
- frontend config version (sr, hop, normalization)
- timestamp / TTL
7.2 Indexing and scaling strategies
At 1M clips, one index can work. As you scale:
- shard indexes by tenant/time/region
- use multi-stage retrieval (coarse → fine)
- keep re-embedding and index rebuilds as first-class ops workflows
7.3 Segment-aware thresholds (avoid one global threshold)
Thresholds drift by:
- device bucket
- input route (Bluetooth vs built-in mic)
- codec
- environment
A single global threshold usually guarantees:
- too many false alarms on some segments
- too many misses on others
Segment-aware calibration is often the difference between “model works” and “product ships”.
7.4 Privacy constraints
In many speech products:
- raw audio cannot be uploaded by default
Design patterns:
- on-device matching
- privacy-safe telemetry (aggregates, histograms of scores)
- opt-in debug cohorts for sample-level RCA
7.5 Monitoring and “score health” dashboards
If you ship this, you should expect regressions that look like “model got worse” but are actually:
- frontend changes (resampling, AGC, VAD config)
- device mix shifts (new phone releases)
- codec rollouts
So you want dashboards that track:
- score distributions (p50/p90/p99) over time
- trigger rate (for streaming) per segment
- top matched templates (for multi-template systems)
- index recall proxies (are we retrieving good candidates?)
- latency breakdown (feature extraction vs retrieval vs verification)
The most important chart is often a simple one:
score histogram by segment (device_bucket × route × codec)
If that shifts abruptly after a rollout, you have a concrete “what changed” signal.
7.6 A minimal RCA packet (what to attach to incidents)
When someone pages you with “false alarms spiked”, you want:
- which segments changed
- which model/frontend versions are live
- which thresholds were used
- a small privacy-safe sample of triggered cases (or aggregated feature snapshots)
- recent change log events (app version rollout, codec change)
This is the same “RCA-first” design principle you use in ML data validation pipelines.
7.7 Case study: codec rollout that looked like a KWS regression
Scenario:
- a Bluetooth codec profile update increases compression artifacts
- false alarms rise sharply, but only on route=bluetooth
Without segment dashboards:
- the team blames the KWS model and starts retraining
With segment-aware monitoring:
- you see the spike localized to bluetooth + codec bucket
- mitigation is a faster path: adjust thresholds for that segment or route to a safer verifier
This is a recurring theme in speech systems: input pipeline changes masquerade as model failures.
8. Streaming / Real-Time (When the Input Never Stops)
8.1 Sliding window detection
Streaming detection usually turns the problem into repeated window matching (a minimal sketch follows this list):
- compute features every 10–20ms
- maintain a rolling context (e.g., 1–2 seconds)
- score each step and trigger when score crosses a threshold
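A minimal sketch of that loop; `score_fn` is a hypothetical window scorer, and `context_frames` approximates 1.5 s at a 10 ms hop:

```python
from collections import deque

import numpy as np

class StreamingMatcher:
    """Keep a rolling window of frame features and score it each step."""

    def __init__(self, score_fn, context_frames: int = 150, threshold: float = 0.7):
        self.score_fn = score_fn          # hypothetical: (D, T) features -> score
        self.buf = deque(maxlen=context_frames)
        self.threshold = threshold

    def push(self, frame_feats: np.ndarray) -> bool:
        """Feed one frame's features; return True when the score triggers."""
        self.buf.append(frame_feats)
        if len(self.buf) < self.buf.maxlen:
            return False                  # not enough context yet
        window = np.stack(self.buf, axis=1)  # (D, context_frames)
        return self.score_fn(window) >= self.threshold
```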
8.2 False alarm rate is the product metric
For wake words and alerts, FAR matters more than “average accuracy”. In practice you tune to something like:
- FAR per hour of audio
- FAR per device segment
8.3 Budgeting compute
A simple but effective budget design:
- VAD gate first (skip scoring on non-speech)
- cheap detector first (tiny conv/transformer)
- optional expensive verifier only on triggers
8.4 Streaming-specific failure sources (transport and buffering)
Streaming audio has failure modes that offline files don’t:
- jitter buffer underruns (holes)
- packet loss concealment (PLC) artifacts
- time stretching/compression in real-time resamplers
If you see bursty false alarms:
- check transport metrics (loss/PLC rate)
- correlate with network conditions and app versions
8.5 Ring buffers and “trigger alignment”
When you trigger on a score threshold at time $t$, you usually want to output audio from $[t-\Delta, t+\Delta]$. That means you need:
- a ring buffer of recent audio/features
- a trigger alignment policy (how many frames before/after)
This detail matters for user experience:
- too short and the verification stage misses the actual keyword
- too long and you increase cost/latency and privacy risk
8.6 Streaming verification patterns
Two common production patterns:
- two-stage: tiny streaming detector → on-trigger run a heavier verifier on buffered audio
- multi-threshold: a lower “wake up” threshold and a higher “confirm” threshold to reduce false alarms
These patterns are simple but extremely effective at controlling FAR without sacrificing recall.
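A sketch of the multi-threshold pattern combined with two-stage verification; `verify_fn` and the stream tuple format are illustrative:

```python
def confirmed_triggers(stream, verify_fn, wake_thr=0.5, confirm_thr=0.8):
    """Yield timestamps of confirmed triggers from a score stream.

    stream: iterable of (timestamp, cheap_score, buffered_audio) tuples.
    verify_fn: hypothetical heavier scorer run only on buffered audio.
    """
    for ts, score, audio in stream:
        if score < wake_thr:
            continue                      # below the cheap wake-up threshold
        if verify_fn(audio) >= confirm_thr:
            yield ts                      # heavy verifier confirms the match
```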
9. Quality Metrics
9.1 Retrieval/detection metrics
- precision/recall at fixed FAR
- ROC/DET curves
- top-K retrieval recall (is the true match in your candidate set?)
9.2 Localization metrics
- start/end time error (ms)
- span IoU vs ground truth (a reference helper is sketched below)
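Span IoU is worth standardizing early so teams compare localization numbers consistently; a reference helper:

```python
def span_iou(pred: tuple, gold: tuple) -> float:
    """Temporal IoU between (start, end) spans, in any consistent time unit."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0
```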
9.3 Systems metrics
- p95/p99 latency
- CPU/GPU cost per query
- embedding cache hit rate
- index recall vs speed (ANN tuning knob)
9.4 Evaluation harness: how you prevent regressions
A robust evaluation suite usually includes:
- background audio set (hours of non-keyword audio) to measure FAR
- hard negative set (TV, music, confusable phrases)
- device/route segments (bluetooth vs built-in mic, noisy vs quiet)
- streaming simulation (packet loss + jitter + PLC artifacts)
For retrieval/search:
- evaluate candidate retrieval recall separately from verifier accuracy
- track “top-K recall” because K is a budget knob and often the true bottleneck
Operational rule:
If an incident happened, add a test case (or a segment slice) so it can’t surprise you again.
10. Common Failure Modes (and How to Debug Them)
10.1 Channel/codec mismatch
Symptoms:
- regression concentrated in one input route or codec
Mitigations:
- train with codec simulation
- segment dashboards by route/codec/app version
- fast rollback of model + frontend configs
10.2 Confusable negatives (false alarms)
Symptoms:
- triggers on TV speech or music
- specific “nearby” words cause most false alarms
Mitigations:
- hard negative mining
- two-stage verification
- per-segment threshold calibration
10.3 Calibration drift
Symptoms:
- score histograms shift over time
- thresholds that worked last month stop working after frontend changes
Mitigations:
- monitor score distributions
- shadow-mode evaluations for new releases
- treat “frontend changes” as model releases (version + canary + rollback)
10.4 Debugging playbook (fast triage)
When matching quality regresses, triage in this order:
- Scope
- is it global or only certain segments (device, route, codec)?
- is it only streaming or also offline?
- Frontend
- did sample rate distribution change?
- did VAD/AGC config change?
- did codec/packet loss metrics change?
- Model/index
- did the embedding model version change?
- did the index rebuild happen (or fail partially)?
- did K or ANN parameters change (speed vs recall trade-off)?
- Thresholds
- did any thresholds change (or did traffic shift so that a fixed threshold is now wrong)?
This order is practical because frontend and segmentation changes are the most common root causes.
10.5 Failure mode: candidate retrieval collapse (looks like “verifier got worse”)
In hybrid pipelines, a common incident is:
- ANN recall drops (bad index, wrong normalization, mismatched embedding versions)
- verifier never sees the true match
- on-call sees “recall dropped” but blames the verifier
Mitigation:
- separately monitor candidate recall (top-K recall on an eval set)
- attach embedding version + index version to every score
- build a “canary retrieval” job that runs continuously to detect index health issues
11. State-of-the-Art
Modern systems increasingly rely on:
- self-supervised audio/speech embeddings as the representation layer
- streaming small transformers for on-device KWS/event detection
- hybrid retrieval + refinement for scalable search with localization
11.1 Practical implications of SSL embeddings
Self-supervised learning (SSL) changes the build-vs-buy calculus:
- you can often get a strong representation without labeling millions of examples
- fine-tuning can be lightweight (task-specific heads, small datasets)
But SSL doesn’t remove systems work:
- you still need stable frontends
- you still need calibration and drift monitoring
- you still need hard negative mining for your product’s confusables
11.2 Hybrid search is still the “sweet spot”
Even with powerful embeddings, pure vector similarity often fails on edge cases:
- short queries (wake words) can match many unrelated segments
- channel/codec artifacts create spurious similarities
So hybrid pipelines persist:
- ANN gives scale
- alignment/verification gives precision and localization
11.3 Suggested starting point (if you’re building this from scratch)
If you want a reliable MVP:
- Define the output contract (span + score + reason code).
- Implement a DTW baseline on log-mel with window constraints.
- Add segment dashboards (route/codec/device buckets).
- Only then add embeddings + ANN to scale.
The main lesson:
The hard parts are not only modeling; they are indexing, thresholds, budgets, monitoring, and safe rollouts.
12. Key Takeaways
- Acoustic pattern matching is pattern matching over time-series: you must handle time warping, channel, and noise.
- DTW on log-mel/MFCC is a powerful baseline: interpretable and surprisingly strong.
- Embeddings + ANN are the scaling story; localization usually requires refinement or windowing.
- Production success is dominated by segment-aware thresholds, calibration monitoring, and runtime budgets.
12.1 Connections to other topics (the shared theme)
The shared theme today is pattern matching and state machines:
- “Wildcard Matching” frames matching as a state machine solved by DP; DTW is the acoustic analog of DP over alignment states.
- “Pattern Matching in ML” highlights safe, budgeted matching at scale; acoustic systems use the same coarse→fine pattern (ANN retrieval → verification).
- “Scaling Multi-Agent Systems” emphasizes budgets, observability, and rollback; those same control-plane ideas are what make matching systems shippable and reliable.
If you can describe your matcher as “states + transitions + budgets + observability”, you’ll build systems that behave predictably under real traffic.
Originally published at: arunbaby.com/speech-tech/0054-acoustic-pattern-matching