Real-time pitch detection on a laptop mic: what actually works

By Webeleon · 8 min read · Jul 28 2026

If you want a practice tool to hear, through a laptop mic, what note you just played, reach for a time-domain autocorrelation method (the YIN and McLeod families) running in an AudioWorklet, gated by a clarity score, and smoothed across a few frames. The obvious first instinct, "take an FFT and find the tallest peak", is the one that will waste your afternoon. Here is the reasoning behind that, and where the honest limits sit.

The problem is narrower than transcription

Full polyphonic transcription — hand it a recording of a band and get back a score — is an open research problem. That is not the problem a practice tool has. The tool has a much smaller one: a single player, one note at a time, on an instrument whose pitch range is known in advance, in a context where the tool often already knows which note it is hoping to hear. That narrowing is the whole game. Every technique below gets easier the moment you stop pretending you need a general solution and accept the constrained one.

So the question is not "what is the state of the art in pitch tracking" — it is "what is the cheapest thing that reliably estimates the fundamental frequency of one monophonic, pitched sound, fast enough to feel live, from a mediocre mic in a noisy room." Those adjectives — cheap, reliable, live, monophonic, noisy — are what actually decide the design.

FFT peak-picking is the obvious approach, and it disappoints

The textbook mental model is: sound is a sum of sinusoids, the fundamental is the lowest one, so transform to the frequency domain and read off the lowest strong peak. In practice two things go wrong.

The first is resolution. An FFT's frequency spacing is the sample rate divided by the window length. To tell a low guitar note from its neighbour you need fine spacing, which means a long window, which means more latency and less responsiveness. You are trading the exact thing a live tool cares about to buy accuracy at the bottom of the range, where you need it most.
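
To put numbers on that trade, here is a small sketch of the arithmetic. The 48 kHz capture rate and the window lengths are assumptions chosen for illustration; the two note frequencies are the low E and F of a standard-tuned guitar.

```typescript
// Spacing between adjacent FFT bins, in Hz: sample rate / window length.
function fftBinSpacingHz(sampleRateHz: number, windowLength: number): number {
  return sampleRateHz / windowLength;
}

// How much audio the window has to collect before it can be analysed, in ms.
function windowDurationMs(sampleRateHz: number, windowLength: number): number {
  return (windowLength / sampleRateHz) * 1000;
}

const captureRateHz = 48000; // assumed capture rate
const lowE2Hz = 82.41; // lowest string on a standard-tuned guitar
const lowF2Hz = 87.31; // one semitone above it
console.log(`semitone gap at the bottom: ${(lowF2Hz - lowE2Hz).toFixed(2)} Hz`);

for (const windowLength of [2048, 4096, 8192, 16384]) {
  const spacing = fftBinSpacingHz(captureRateHz, windowLength).toFixed(2);
  const duration = windowDurationMs(captureRateHz, windowLength).toFixed(1);
  console.log(`${windowLength} samples: ${spacing} Hz per bin, ${duration} ms of audio`);
}
```

Under these assumptions the bins only become narrower than the 4.9 Hz semitone gap at 16384 samples, which is roughly a third of a second of audio.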

The second, and worse, is that the loudest partial is frequently not the fundamental. Plucked and bowed strings pour energy into their overtones; on a guitar the second or third harmonic often towers over the fundamental, and a thin laptop mic rolls off the low end further. Point a "tallest peak" detector at that and it confidently reports a note an octave (second harmonic) or an octave plus a fifth (third harmonic) too high. You can bolt on harmonic-product-spectrum tricks to fold the overtones back onto the fundamental, and they help, but you are now patching a method that was fighting the signal from the start.
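
A toy illustration of that failure, as a sketch only: the tone below is synthetic, its partial amplitudes (0.2, 1.0, 0.5) are invented to mimic a weak fundamental, and `magnitudeAt`, `synthString` and `tallestPartialHz` are hypothetical helper names. It evaluates single DFT terms at the three partials rather than running a full FFT.

```typescript
// Magnitude of one frequency component, computed as a single DFT term.
function magnitudeAt(signal: Float64Array, sampleRateHz: number, freqHz: number): number {
  let re = 0;
  let im = 0;
  for (let n = 0; n < signal.length; n++) {
    const phase = (2 * Math.PI * freqHz * n) / sampleRateHz;
    re += signal[n] * Math.cos(phase);
    im -= signal[n] * Math.sin(phase);
  }
  return Math.hypot(re, im) / signal.length;
}

// Synthetic "string": weak fundamental, strong second harmonic (made-up amplitudes).
function synthString(f0Hz: number, sampleRateHz: number, length: number): Float64Array {
  const amplitudes = [0.2, 1.0, 0.5];
  const out = new Float64Array(length);
  for (let n = 0; n < length; n++) {
    for (let k = 0; k < amplitudes.length; k++) {
      out[n] += amplitudes[k] * Math.sin((2 * Math.PI * (k + 1) * f0Hz * n) / sampleRateHz);
    }
  }
  return out;
}

// The "tallest peak" rule, restricted to the first three partials.
function tallestPartialHz(signal: Float64Array, sampleRateHz: number, f0Hz: number): number {
  let bestHz = f0Hz;
  let bestMagnitude = -Infinity;
  for (let k = 1; k <= 3; k++) {
    const magnitude = magnitudeAt(signal, sampleRateHz, k * f0Hz);
    if (magnitude > bestMagnitude) {
      bestMagnitude = magnitude;
      bestHz = k * f0Hz;
    }
  }
  return bestHz;
}

const pluckedA2 = synthString(110, 48000, 4800); // 110 Hz fundamental, 100 ms of audio
console.log(tallestPartialHz(pluckedA2, 48000, 110)); // picks the second harmonic, not 110
```

The note played is the A at 110 Hz, but the strongest component sits at 220 Hz, so a tallest-peak rule names the A an octave up.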

Autocorrelation, and why the naive version lies

The method that fits the problem works in the time domain. A pitched sound is roughly periodic, so slide the signal against a delayed copy of itself and ask: at what delay does it line up best with itself? That delay is the period; invert it and you have the frequency. There is no frequency-bin resolution ceiling: the lag is counted in whole samples, but interpolating around the best lag recovers the fraction in between. And the method keys directly off periodicity rather than off which harmonic happens to be loudest.
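
The idea fits in a few lines. This is a deliberately naive sketch, not a production detector: `autocorrelationPitchHz` is a hypothetical name, the 80 to 1000 Hz search range and the 48 kHz rate are assumptions, and there is no normalisation or interpolation.

```typescript
// Raw autocorrelation: try every lag in the plausible range and keep the one
// where the frame lines up best with a delayed copy of itself.
function autocorrelationPitchHz(
  frame: Float32Array,
  sampleRateHz: number,
  minHz: number,
  maxHz: number,
): number {
  const minLag = Math.max(1, Math.floor(sampleRateHz / maxHz));
  const maxLag = Math.min(Math.ceil(sampleRateHz / minHz), frame.length - 1);
  let bestLag = -1;
  let bestScore = -Infinity;
  for (let lag = minLag; lag <= maxLag; lag++) {
    let score = 0;
    for (let i = 0; i + lag < frame.length; i++) {
      score += frame[i] * frame[i + lag];
    }
    if (score > bestScore) {
      bestScore = score;
      bestLag = lag;
    }
  }
  // The best lag is the period in samples; invert it to get the frequency.
  return bestLag > 0 ? sampleRateHz / bestLag : NaN;
}

// Try it on a clean 240 Hz sine (synthetic input, for illustration only).
const sineFrame = new Float32Array(2048);
for (let i = 0; i < sineFrame.length; i++) {
  sineFrame[i] = Math.sin((2 * Math.PI * 240 * i) / 48000);
}
const naiveEstimateHz = autocorrelationPitchHz(sineFrame, 48000, 80, 1000);
console.log(naiveEstimateHz.toFixed(1));
```

On a clean tone the estimate lands close to 240 Hz, though not necessarily exactly on it: the lag is a whole number of samples, and because the overlap shrinks as the lag grows, the raw sum slightly favours shorter lags.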

Naive autocorrelation has its own failure, and it is again an octave. The signal also correlates well at twice the true period, and sometimes the algorithm grabs that longer lag — reporting the note an octave down. The fixes that made this practical are worth knowing by name:

  • YIN replaces raw correlation with a difference function and a cumulative mean normalisation, then takes the first dip below an absolute threshold rather than the global best. That "first good enough dip, not the biggest peak" rule is precisely what kills most octave-down errors.
  • The McLeod Pitch Method (MPM) uses a normalised square difference function and picks among peaks with a threshold relative to the strongest, and it hands you a clarity value for free — a number saying how periodic the frame actually was.
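
A minimal sketch of the YIN steps just described, under stated assumptions: `yinPitch` is a hypothetical name, the 0.15 threshold is an illustrative default, and reporting clarity as one minus the normalised difference is a convention borrowed to mirror MPM's clarity, not something YIN itself defines. The sub-sample interpolation that real implementations add is left out to keep it short.

```typescript
type YinResult = { frequencyHz: number; clarity: number };

function yinPitch(frame: Float32Array, sampleRateHz: number, threshold = 0.15): YinResult | null {
  const half = Math.floor(frame.length / 2);
  // Cumulative mean normalised difference, one value per candidate lag.
  const cmnd = new Float64Array(half);
  if (half > 0) cmnd[0] = 1;
  let runningSum = 0;
  for (let tau = 1; tau < half; tau++) {
    // Difference function: how badly the frame matches itself at this lag.
    let diff = 0;
    for (let i = 0; i < half; i++) {
      const d = frame[i] - frame[i + tau];
      diff += d * d;
    }
    runningSum += diff;
    // Normalise by the mean of the differences seen so far.
    cmnd[tau] = runningSum > 0 ? (diff * tau) / runningSum : 1;
  }
  // Take the FIRST dip below the absolute threshold, not the global minimum.
  for (let tau = 2; tau < half - 1; tau++) {
    if (cmnd[tau] < threshold) {
      // Walk down to the bottom of this dip.
      while (tau + 1 < half && cmnd[tau + 1] < cmnd[tau]) tau++;
      return { frequencyHz: sampleRateHz / tau, clarity: 1 - cmnd[tau] };
    }
  }
  return null; // no convincing periodicity in this frame
}

// Synthetic 240 Hz sine at an assumed 48 kHz, for illustration only.
const yinTestFrame = new Float32Array(2048);
for (let i = 0; i < yinTestFrame.length; i++) {
  yinTestFrame[i] = Math.sin((2 * Math.PI * 240 * i) / 48000);
}
console.log(yinPitch(yinTestFrame, 48000));
console.log(yinPitch(new Float32Array(2048), 48000)); // silence: no estimate
```

The normalisation keeps the function near or above one at short lags, so the threshold cannot trigger there by accident, and stopping at the first dip means the doubled period is never considered once the true one has qualified.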

That clarity number is the one to hold onto. It is the difference between a detector that guesses and one that knows when to stay quiet.

Latency is a window-size problem

You cannot estimate a period from less than a period of signal, and in practice you want a couple of them to be confident. Low notes have long periods — the low E on a guitar sits around 82 Hz, a period near twelve milliseconds — so honestly resolving the bottom of the range means an analysis window tens of milliseconds long before you have run a single instruction. Add input latency from the OS and mic, your processing block size, and the time to paint a result, and there is a floor on responsiveness that no clever code removes. You budget for it; you do not optimise it away.
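
The window arithmetic can be sketched as follows. The 48 kHz rate and the choice of two periods are assumptions for illustration, and `analysisWindow` is a hypothetical helper; a particular algorithm may need more than this minimum.

```typescript
// Smallest window that holds `periods` full cycles of the lowest expected note.
function analysisWindow(lowestHz: number, sampleRateHz: number, periods = 2) {
  const periodMs = 1000 / lowestHz;
  const samples = Math.ceil((periods * sampleRateHz) / lowestHz);
  const windowMs = (samples / sampleRateHz) * 1000;
  return { periodMs, samples, windowMs };
}

const guitarWindow = analysisWindow(82.41, 48000); // low E, assumed 48 kHz capture
console.log(
  `period ${guitarWindow.periodMs.toFixed(2)} ms, ` +
    `window ${guitarWindow.samples} samples (${guitarWindow.windowMs.toFixed(1)} ms)`,
);
```

For the low E that is 1165 samples, a little over 24 ms, and that is only the part of the latency budget spent waiting for audio to arrive.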

The one lever that matters here is where the work runs. In the browser, do the DSP in an AudioWorklet, which executes on the audio render thread. The old ScriptProcessorNode ran on the main thread and glitched the moment the UI got busy — exactly when a musician is interacting with your tool. Capture the mic with getUserMedia, analyse inside the worklet, and pass only the small result (frequency, clarity) back to the UI. Keep the hot loop off the main thread and the tool stays live while the rest of the page does its job.
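
One practical detail inside the worklet: audio arrives in small render blocks (128 frames each in current browsers), far shorter than an analysis window, so something has to stitch them together. Below is a sketch of that piece only. `FrameAccumulator` is a hypothetical class and the window and hop sizes are assumptions; in a real processor you would call `push` from the `process()` callback with the first input channel, run the detector on each returned window, and post the small result through the processor's `port`.

```typescript
// Collects small render blocks into overlapping analysis windows.
class FrameAccumulator {
  private readonly buffer: Float32Array;
  private readonly hopSize: number;
  private filled = 0;

  constructor(windowSize: number, hopSize: number) {
    if (hopSize < 1 || hopSize > windowSize) {
      throw new RangeError("hopSize must be between 1 and windowSize");
    }
    this.buffer = new Float32Array(windowSize);
    this.hopSize = hopSize;
  }

  // Returns a copy of the most recent full window, or null if none is ready yet.
  push(block: Float32Array): Float32Array | null {
    let ready: Float32Array | null = null;
    for (let i = 0; i < block.length; i++) {
      this.buffer[this.filled++] = block[i];
      if (this.filled === this.buffer.length) {
        ready = this.buffer.slice(); // copy out (allocates; fine for a sketch)
        this.buffer.copyWithin(0, this.hopSize); // slide the window forward by one hop
        this.filled = this.buffer.length - this.hopSize;
      }
    }
    return ready;
  }
}

// Example: 2048-sample windows, a new one every 512 samples, fed 128 at a time.
const accumulator = new FrameAccumulator(2048, 512);
let windowsSeen = 0;
for (let b = 0; b < 32; b++) {
  if (accumulator.push(new Float32Array(128)) !== null) windowsSeen++;
}
console.log(windowsSeen);
```

The hop size sets how often the UI gets a fresh estimate; the window size still sets how much audio each estimate is based on. A production version would reuse a preallocated output buffer rather than allocating on the audio thread.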

The laptop mic fights you

A built-in laptop mic is a far-field, mid-tuned capsule sitting next to a fan and a clicky keyboard. It attenuates the low fundamentals you most need, and it happily records the room. So the detector needs two defences. First an energy gate: below an RMS threshold, report nothing rather than trying to find pitch in noise. Second the clarity threshold from earlier: even above the noise floor, if the frame is not convincingly periodic — a transient, a chair scrape, a chord instead of a note — decline to answer.
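
The two gates stack naturally. In this sketch the threshold values (0.01 RMS, 0.9 clarity) are placeholders to be tuned per mic and room, and `PitchEstimate` and `gatedFrequency` are hypothetical names standing in for whatever your detector returns.

```typescript
type PitchEstimate = { frequencyHz: number; clarity: number };

// Root-mean-square level of a frame; samples are assumed to lie in [-1, 1].
function rms(frame: Float32Array): number {
  let sumSquares = 0;
  for (let i = 0; i < frame.length; i++) sumSquares += frame[i] * frame[i];
  return frame.length > 0 ? Math.sqrt(sumSquares / frame.length) : 0;
}

// Returns a frequency only when the frame is both loud enough and periodic enough.
function gatedFrequency(
  frame: Float32Array,
  estimate: PitchEstimate | null,
  minRms = 0.01, // assumed energy gate
  minClarity = 0.9, // assumed clarity gate
): number | null {
  if (rms(frame) < minRms) return null; // too quiet: do not look for pitch in noise
  if (estimate === null || estimate.clarity < minClarity) return null; // not convincingly periodic
  return estimate.frequencyHz;
}
```

Checking energy first is also the cheap path: when the room is quiet you can skip the detector entirely and pass `null` as the estimate.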

Octave errors are the signature failure

If you only remember one failure mode, make it this: pitch detectors get the pitch class right and the octave wrong. Three cheap defences, stacked, handle most of it. Use a clarity/threshold rule (YIN or MPM) so you are not choosing the tallest harmonic. Constrain the search to the instrument's plausible range so an out-of-band octave is discarded before it is ever reported. And smooth over time — a short median filter across recent frames, or a small hysteresis that only commits to a new note once it has held for a few frames — so one bad frame cannot flip the displayed note. Each is a few lines; together they turn a jittery, octave-hopping readout into something a musician trusts.
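
The second and third defences might look like this. It is a sketch with assumed names (`withinInstrumentRange`, `NoteHysteresis`) and an assumed hold of three frames; it works on note numbers rather than raw frequencies so that small tuning wobble does not reset the count.

```typescript
// Discard estimates outside the instrument's plausible range before they are reported.
function withinInstrumentRange(frequencyHz: number, minHz: number, maxHz: number): boolean {
  return frequencyHz >= minHz && frequencyHz <= maxHz;
}

// Only commit to a new note (or to silence, passed as null) once it has held
// for `holdFrames` consecutive frames.
class NoteHysteresis {
  private readonly holdFrames: number;
  private committed: number | null = null;
  private candidate: number | null = null;
  private streak = 0;

  constructor(holdFrames = 3) {
    this.holdFrames = holdFrames;
  }

  update(note: number | null): number | null {
    if (note === this.candidate) {
      this.streak++;
    } else {
      this.candidate = note;
      this.streak = 1;
    }
    if (this.streak >= this.holdFrames) this.committed = this.candidate;
    return this.committed;
  }
}

// One bad frame (57, an octave above 45) in the middle of a held note.
const stabiliser = new NoteHysteresis(3);
const displayed = [45, 45, 45, 57, 45, 45, 45].map((n) => stabiliser.update(n));
console.log(displayed);
```

The readout stays empty until the note has held for three frames, and the single octave-up frame never reaches the display. The cost is that real note changes also appear a few frames late, which is the trade hysteresis always makes.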

Good enough for a practice tool

Here is the reframing that makes all of this tractable. A practice tool does not need to transcribe a performance — it needs to answer "did you play the note the exercise asked for, roughly in time?" That question tolerates tens of milliseconds of latency, tolerates the occasional honest "I did not catch that", and — crucially — comes with context. When the tool already knows the key, the range, and often the specific note it is expecting next, it can bias the detector toward that answer and reject the physically-implausible one. Detection stops being open-ended search and becomes verification, which is a far kinder problem.
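
Verification can be sketched in a handful of lines. The function names, the 50-cent tolerance, and the three-way verdict are all assumptions for illustration; the only fixed fact used is the standard convention that A at 440 Hz is MIDI note 69.

```typescript
type Verdict = "match" | "octave-slip" | "miss";

// Fractional MIDI note number for a frequency (A4 = 440 Hz = note 69).
function frequencyToMidi(frequencyHz: number): number {
  return 69 + 12 * Math.log2(frequencyHz / 440);
}

// Signed distance from the expected note, in cents (100 cents = one semitone).
function centsFromExpected(frequencyHz: number, expectedMidi: number): number {
  return 100 * (frequencyToMidi(frequencyHz) - expectedMidi);
}

// Compare a detected frequency with the note the exercise asked for.
function verifyNote(frequencyHz: number, expectedMidi: number, toleranceCents = 50): Verdict {
  const cents = centsFromExpected(frequencyHz, expectedMidi);
  if (Math.abs(cents) <= toleranceCents) return "match";
  // Right pitch class, wrong octave: more likely a detector slip than a wrong note.
  if (Math.abs(Math.abs(cents) - 1200) <= toleranceCents) return "octave-slip";
  return "miss";
}

const expectedLowE = 40; // MIDI note number for the guitar's low E
console.log(verifyNote(82.9, expectedLowE)); // slightly sharp low E
console.log(verifyNote(164.81, expectedLowE)); // the E an octave up
console.log(verifyNote(87.31, expectedLowE)); // F, a semitone off
```

What to do with an octave slip is a product decision: a tool might ignore that frame, or accept it when the instrument cannot physically reach the reported octave.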

That is exactly the job in the Sight Reading Gym, Webeleon's daily sight-reading trainer: it shows you an étude and listens on your mic while you play it. To turn a page into feedback it has to name the note you just produced, on whatever mic you happen to have, fast enough to feel like it is reading over your shoulder. Every constraint above — time-domain over FFT, a clarity gate, range-limited search, temporal smoothing, honest silence over confident nonsense — is there because that is what it takes for "your mic listens while you play" to feel like listening rather than guessing.