What makes for good conversations? Well, we know what makes for bad ones.
We’ve probably all had trouble “getting a word in” with someone unaware of the muffling effects of their monologue. We’ve likely also found ourselves in other instances, polite but still pressurized, when discussion dies from being riddled with gaps of silence.
Thankfully our social world isn’t one in which long monologues are punctuated by polite silence. Neither do we live in a silent world. Everyone gets a turn to speak—usually a quick and short one—and everyone intuits they hold the stage only as long as their listeners let them. Of course, generally speaking, we’re pretty considerate—we bid for longer-than-usual turns by saying things like “do I have a story for you” or “guess what happened today”, all ways to get a sense of our audience’s appetite for diatribes well before they’re despotically delivered.
Though instances when communication breaks down are rarely as dramatic as I depict them, they still serve to show how a familiar ability can go awry. This ability is timing in conversation—knowing when to alternate between speech and silence—and it underlies many of our social interactions.
We usually speak when our interlocutor is silent and quiet ourselves when it’s their turn to speak, a habit so engrained that its violations convey social information. Whether we speak over each other or not at all, deviations of speech-timing can reflect rudeness or reservation; someone who frequently interrupts you is construed as disrespectful (as when they leave no other option but to interrupt), while someone who takes too long to reply must dislike you for some reason (or is perhaps struck dumb at the profundity of your words).
An excellent conversationalist, by contrast, responds quickly, but not so quickly that they interrupt. Talking with such a one confers a sort of electric ease on us—those conversations enliven us, our words come out less effortfully, and our trains of thought are tugged along as if pulled from their destination. These kinds of conversations seem so simple, so free of knots, that they can make the memories of bad ones recede into the background…
This “electric ease” may have a number of causes. Perhaps this ideal interlocutor prompts you during pauses when your thoughts meander, or maybe they carefully restate your words to uncover unacknowledged meanings. Both of these might very well make conversation proceed with fewer hitches. But while these certainly give us the impression that our partner is listening, they don’t guarantee understanding. And if a natural product of good conversation is feeling understood, this feeling often depends on more than the simple reception and recitation of your partner’s speech; it involves anticipating, even predicting it.
The pinnacle of this form, as you might imagine, is those fabled couples who finish each other’s sentences, who not only know when their partner will say something but also what they will say.1 You might imagine, further, that what goes on when we feel understood by another is that for certain moments, we almost inhabit each other’s minds, or rather, we internalize and model our cherished partner’s speech to such an extent that our imaginations utter their words before we even hear them.
Here I’ll leave the latter, somewhat magical, ability aside and instead discuss the former—how we predict response time. This is because in many cases, unless a given speaker is extremely familiar to us, we have a hard time predicting exactly what they’ll say. But not so when they will say it. This, most of us are pretty good at already.
It may even be the case, some think, that predicting when our partners will respond is an ability which evolutionarily precedes, and is necessary for, the prediction of each other’s words.
So by looking past the content of conversation—you know, language, the crown jewel of human evolution—to invisible, soundless timing, we can study a feature of communication that many creatures are sensitive to, even if we’re the only ones graced with grammar. We’ll start with timing in human turn-taking, view evidence of the same behavior in other branches of the tree of life, and eventually circle back to hoarse-throated Homo sapiens.
I. The Measure of Man
If your school was anything like mine, you would have only learned grammar there, not timing. Not that this is a skill school needed to teach us. Turn-taking emerges well before we learn to structure our babble with the rules of syntax, and seems to act as a sort of pre-linguistic order for communication.
In addition to being expressed early in human development, turn-taking also appears to be culturally universal. Irrespective of any specific language, humans the world over take speech turns at roughly the same speed, responding to one another with a latency of about 200-300 milliseconds, meaning that a well-running verbal exchange typically involves much less than a second of dead air time, often barely noticeable to speakers.
But even though turn-taking seems easy, so easy everyone does it almost before they can walk, its apparent ease belies the cognitive deftness required to do it well. Think about how despite the uniformity of these inter-speech gaps, we often speak for varying lengths of time, with speech of a wide range of complexity. Many people—some of whom we never want to get going—can respond to simple queries with planned-seeming paragraphs, and possess a surprising degree of lexical forethought. But all the required speech planning has to be squeezed into the time it takes for one person to speak, plus the milliseconds-long gap when they’re done.
In short, what makes turn-taking a cognitive challenge, particularly for humans, is that speech production and comprehension must overlap.
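To make the arithmetic behind this overlap concrete, here is a minimal sketch. The 200 ms gap comes from the range quoted above; the 600 ms planning time is purely an illustrative assumption, and both function names are hypothetical.

```python
def planning_lead_ms(planning_ms, gap_ms):
    """How long before the partner stops talking planning must begin.

    A positive result means planning has to start while we are still
    listening, i.e. production and comprehension overlap.
    """
    return planning_ms - gap_ms


def must_overlap(planning_ms, gap_ms):
    """True when the reply cannot be planned entirely inside the silent gap."""
    return planning_lead_ms(planning_ms, gap_ms) > 0


if __name__ == "__main__":
    gap_ms = 200        # lower end of the 200-300 ms gap mentioned in the text
    planning_ms = 600   # assumed planning time, for illustration only
    print(planning_lead_ms(planning_ms, gap_ms), must_overlap(planning_ms, gap_ms))
```

Under these assumed numbers, any reply that takes longer to plan than the gap itself forces the listener to begin planning before the speaker has finished.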
Multi-tasking isn’t easy, even when the tasks we’re engaged in are very different, like driving and speaking on the phone (what really happens is rapid task-switching). But for turn-taking in conversation, we have to make use of the same modality, that is, largely overlapping circuits in the brain, to both understand and produce speech. Needless to say, this adds some processing overhead. So how do we do it?

As the graphic shows, both planning and comprehension do, in fact, tend to overlap. And they do so without complication because the stages of comprehension actually aid planning—we can get a sense of whether a person is making a request or proclamation early in their turn, wait for cues indicating whether they’re winding down or not, then use these details to guess how and when to launch our reply.
Of course, our guesses aren’t always accurate. We might first plan a negative reply when hearing a request from a person who’s slighted us, only for this judgment to give way to grudging acceptance once they’ve explained themselves. Putting too much stock in these predictive cues can make us prone to early judgments, which can cause early interruptions, which make for poor conversation.
But we can still recognize the usefulness of having features of communication which ease cognitive demands by allowing the listener to infer meaning more quickly. Some of these features are straightforwardly aspects of word order (languages might have different branching structures2, for example), but other aspects are decidedly non-verbal.
This is to say that we’re good at figuring out a speaker’s intention somewhat independently of the precise words they use; so much meaning is bound up in things like tone, posture, and facial expressions. And as I mentioned earlier, even babies learn to take turns before knowing the complexities of language.
Which leads us, quite naturally, to the question of animals which lack the many prediction-enabling cues provided by language, and yet still take communicative turns. It turns out there are plenty of them. Turn-taking is practiced by several of our nearest relatives, ranging from those as distant as lemurs to those as close as chimpanzees (which, oddly enough, perform gestural turn-taking). Nor is vocal turn-taking limited to mammals. In fact, mammalian practitioners of turn-taking are far outnumbered by avian ones, even though our lineages diverged 300 million years ago.
II. Bird is the Word
Why are birds useful for understanding communication? For one, they’re highly social, and mainly rely on visual and auditory senses, much as we do (unlike most mammals, which heavily rely on smell). Songbirds in particular are also famed for the complexity of their vocalizations, much as we are. Not for nothing did Lucretius, the Roman philosopher-poet, and none other than Aristotle, believe human melody had its origins in the imitation of birdsong. If you were to hear some birdsong slowed-down, I think you’d kind of see where they were coming from.
There are also aspects of songbirds that make them exceptionally amenable to study. First, their vocalizations, the behavioral output we happen to be interested in, are readily quantifiable in the form of a spectrogram. Second, the organ which generates such complex behavior, the brain, happens to be helpfully arranged for the scientist to label and classify; in birds, brain areas with distinct functions often have significant differences in neuron density, making them appear nice and nucleated with minimal staining.3
Enter zebra finches. Males of this songbird species are closed-ended learners, meaning they come to know one courtship song really well throughout their lives, which they learn as juveniles from their fathers and perform for potential mates later on. This is not to say their mates are passive listeners, however—females are capable of a wide range of calls, and will often intersperse a male’s song performance with calls of their own, at certain selected syllables. (Although the questions of how birdsong is learned and how females evaluate it are both utterly fascinating, I’ll try to address them in future writing. For now, I’ll mostly stick to calls.)
The first thing one experiences when walking into a zebra finch colony—that is, after the momentary hush when they realize a mammal has lumbered into their space—is the noise. A cacophony of nasal squawks and chirps surrounds your ears. But even though finch vocalizations may not be very melodious, they’re still quite meaningful.
What forms the bulk of the cacophony isn’t actually song, but calls, and some of the most common ones are given the somewhat onomatopoeic, somewhat technical terms of “stack” and “tet” calls. These calls are produced an awful lot—they take less than a second to utter, and there are many empty seconds in a day…—and are thought to be useful for group cohesion and pair bonding. The meaning of these calls is pretty context-dependent; we might well think of these calls as basically a bunch of “HI!”s, “HOW ARE YOU!?”s and “CHECK ME OUT!”s. Males may use these calls towards other males, females to females, females to males…you get the idea.
So when one bird calls out “tet-tet” in the colony, rarely are they left to hear their own echo. Others will often respond, causing a pair of birds, or even a triad, to begin an energetic flurry of calls. Zebra finches are remarkably adroit at avoiding vocal overlap, and will naturally adjust their call timing to include more birds in their acoustic jamboree. It’s hard to overstate how intelligent this flexible back-and-forth sounds, despite, perhaps, the vocalizations’ lack of complexity.
We can simulate these call-response conditions pretty well in the lab. This will often involve the use of an ethorobot (a device used to study animal behavior, typically by mimicking said behavior) to act as a conversation partner for a lone finch. The ethorobot might just be a speaker, a speaker paired with a 3-D printed bird, or, if we want to get advanced, an actual beak-moving, wing-fluttering, bird-shaped robot.
The basic form of these call-response experiments involves this robot producing calls at a fixed rate, to which a bird, quickly sensing the pattern, times its own calls in response. Once a bird is comfortable with this consistently timed back-and-forth, the robot will sound a jamming signal exactly when the bird begins its reply, interrupting it. After some consternation (perhaps wondering about the sanity of its partner) the bird realizes the robot won’t stop its interruptions, and will eventually shift its call a bit forward in time to avoid interference.
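For readers who like to see a paradigm as a loop, here is a toy sketch of the jamming-avoidance logic just described. It is not the researchers' code or a model of what the finch actually computes: the window times, call duration, step size, and the rule "shift a little after every jammed reply" are all assumptions chosen for illustration, as are the names `JamWindow` and `adapt_latency`. The sign of `step_ms` sets the direction of the shift; a negative value (earlier replies) is used here only as an example.

```python
from dataclasses import dataclass


@dataclass
class JamWindow:
    """Interval (ms after the robot's call) during which the jamming sound plays."""
    start_ms: float
    end_ms: float

    def overlaps(self, onset_ms, duration_ms):
        # Two intervals overlap when each one starts before the other ends.
        return onset_ms < self.end_ms and onset_ms + duration_ms > self.start_ms


def adapt_latency(initial_latency_ms, window, call_duration_ms=100.0,
                  step_ms=-20.0, max_trials=50):
    """Return the bird's reply latency on each trial until it escapes the jam.

    Toy rule (an assumption): after every jammed reply, shift the next
    reply by step_ms, where a negative step means replying earlier.
    """
    latency = initial_latency_ms
    history = [latency]
    for _ in range(max_trials):
        if not window.overlaps(latency, call_duration_ms):
            break  # the reply no longer collides with the jamming sound
        if latency + step_ms < 0:
            break  # the bird cannot reply before the robot has called
        latency += step_ms
        history.append(latency)
    return history


if __name__ == "__main__":
    window = JamWindow(start_ms=250.0, end_ms=350.0)  # hypothetical timing
    print(adapt_latency(250.0, window))
```

The returned list traces the reply latency trial by trial, ending at the first latency whose call fits entirely outside the jamming window.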
Although this behavioral test is cool on its own (just how many birds can a single one call with, exactly?), a deeper question remains: how is it that their brain is capable of this sort of flexible auditory maneuvering?
Remember those distinct brain nuclei I told you about? There’s one perched practically at the surface of the brain, a little towards the rear, and this region—once called the “high vocal center” but now, for certain anatomical reasons, just called “HVC”—controls basically everything to do with vocal timing in songbirds.

Back in the early 2000s, experimenters in Zurich showed that specific HVC neurons fire at specific timepoints in a finch’s song, creating a “sparse code” representing which syllable of the song to sing when. A different set of experiments at MIT in 2008 found that lesioning HVC completely degrades song structure, making it resemble the “babbling” of younger birds, a form of gibberish akin to that of a human infant. And in another paper from the latter lab, researchers managed to physically cool HVC (slowing down its cellular processes) in a live, singing bird, and found its song lengthened in proportion to the cooling.
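The cooling result amounts to a simple proportional relationship, which can be sketched as follows. The function name and the default stretch rate are placeholders I've picked for illustration, not values taken from the paper.

```python
def stretched_duration_ms(duration_ms, cooling_deg_c, stretch_per_deg=0.03):
    """Song duration after cooling HVC, under a purely linear toy model.

    stretch_per_deg is the fractional lengthening per degree of cooling;
    the default is an assumed placeholder, not a measured value.
    """
    return duration_ms * (1.0 + stretch_per_deg * cooling_deg_c)


if __name__ == "__main__":
    for cooling in (0.0, 2.0, 4.0):
        print(cooling, stretched_duration_ms(1000.0, cooling))
```

In this toy model, doubling the amount of cooling doubles the extra time added to the song, which is all that "in proportion" means here.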
Given all this, it was only natural to look at HVC to try to understand the timing of turn-taking behavior. And this is exactly what some researchers in Germany did in 2016. They cut neural projections from HVC to a region (shown above) called RA—the “robust arcopallium”, a songbird’s equivalent of a motor cortex4—to see what happened to birds interacting with a robot producing isochronous calls. The result of this transection was what you might expect: the finches completely lost all sense of rhythm, and could neither entrain with the robot nor avoid its jamming.5
How do these avian examples connect to human turn-taking? Perhaps the “extreme” form of the behavior might again be illuminating. Certain songbird species are known to duet so precisely it’s hard to distinguish whether it’s one or two birds singing. And to achieve such startling precision, these duetting couples, much like the human ones who finish each other’s sentences, don’t wait to hear what their partner says. Instead of reacting to their partner’s vocalizations, they predict them.
And those robot-entrained zebra finches in the above experiments are indeed being predictive when they take turns. I neglected to mention one aspect of the call-response paradigm—the researchers incorporated “catch” trials to determine whether the finches were simply reacting to the jamming signal or actually predicting it. These catch trials consist of a sudden absence of jamming—1 out of 10 jamming trials will be missing a jamming signal, yet the bird will still behave as if it expected to hear it, and continue with its slightly shifted response.6
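The logic of a catch trial can be sketched in a few lines. This is only an illustration of the reasoning, not the researchers' analysis: the function names, the latencies, and the "closer to which mean" decision rule are assumptions made up for the example. The one detail taken from the text is that one jamming trial in ten omits the jamming signal.

```python
from statistics import mean


def is_catch_trial(trial_index, every=10):
    """Mark one trial in every block of ten (index 9, 19, ...) as a catch trial."""
    return trial_index % every == every - 1


def classify_strategy(baseline_ms, jammed_ms, catch_ms):
    """Label reply timing on catch trials as 'predictive' or 'reactive'.

    If replies on catch trials (no jamming played) stay near the shifted,
    jam-avoiding latency, the bird behaved as if it expected the jam.
    If they snap back toward the pre-jamming baseline, it was only reacting.
    """
    base, jam, catch = mean(baseline_ms), mean(jammed_ms), mean(catch_ms)
    return "predictive" if abs(catch - jam) < abs(catch - base) else "reactive"


if __name__ == "__main__":
    # Hypothetical reply latencies in milliseconds, for illustration only.
    baseline = [250, 255, 245]
    jammed = [150, 160, 155]
    print(classify_strategy(baseline, jammed, catch_ms=[152, 158]))
    print(classify_strategy(baseline, jammed, catch_ms=[248, 251]))
```

With these made-up numbers, catch-trial replies that sit near the shifted latency are labeled predictive, while replies that return to the baseline latency are labeled reactive.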
Lest I exaggerate the importance of prediction, I should say that listening is certainly important—it’s how we, and birds, build familiarity, and the familiarity that results from continual entrainment is what makes our predictions effective and lets them improve over time.
One obvious question before we leave our avian turn-takers: is there an HVC equivalent for humans? Maybe. Exact comparisons between particular human and songbird brain regions are still being worked out, but there are some good ideas. On the large scale, at least, it seems both kinds of brains have evolved a similar network architecture for vocal communication, an architecture which actually seems crucial for other kinds of motor learning as well.
But despite the current lack of definitively known anatomical homologies between us and birds, we’re justified in thinking there are enough similarities that we might still be allowed some behavioral speculations. This is what we’ll turn to in the last section.
III. Social Rhythms and the Space of Conversation
So far we’ve covered some of the mechanics of conversation: the early emergence and universality of human turn-taking, the cognitive challenges involved, and how birds take turns by predicting when their partners will vocalize. As zebra finches showed us, it seems that deeper than language, deeper than grammar, what’s at the heart of turn-taking behavior is a sense of rhythm, an attunement to the behavioral patterns of another which enables us to know how and when to respond to them.
Does the naturalness of conversation, its sometimes vibrant nature, rely on speed of response? I think it makes sense to think so. A lack of a speedy response—the honest signal of “getting it”—may indicate a lack of attunement, a lack of perception of another person’s rhythm.
Dancing is a particularly good illustration of the social benefits of good timing. After all, when dancing, moving after a beat drop is moving too late—a sign of being well-timed is knowing beforehand when a beat will occur and initiating movement sooner, accordingly. And while dancing alone may still be fun, doing it in a group is famously one of humanity’s favorite activities.
Consider the haka dance. Although its function is now mainly ritualistic, it may not have started out that way. Joseph Jordania is known for hypothesizing that music may have emerged out of an ancient recognition of the power of such synchronized displays to intimidate predators. When people move together in such a coordinated fashion, their collective singing and movement may appear to predators as that of a much larger organism, freaking them out and causing them to abandon their kill, gaining our dancing ancestors some extra calories. Puts new meaning to the phrase “sing for your supper”, doesn’t it?
So if coordinated movement might bring social benefits, might deficits in timing underlie social disorders, like autism? A growing line of thinking explores this. Not having a good sense of timing might present obstacles to participating in activities that require it, whether dancing or conversation, as we’ve seen. And so much of the experience of personhood surely involves taking part in these sorts of quintessentially human activities.
I’ll close with one last bit of speculation. I’ve often found myself thinking of AI—in the form of large language models—as a kind of ethorobot for humans. Can the “electric ease” of conversation be produced by beings that are…well, electrical? Why not? A finch can become entrained to the consistent calls of a finch-shaped robot; a fish of flesh may undulate next to a mechanical compadre; so too, perhaps, can a human become linguistically entrained with a large language model, and thereby become convinced of its personhood. It’s not that far-fetched, actually, because it’s already happening.
If birds can be enamored of and call back to mechanical mates, is the equivalent for humans, with AI, relegated to mere chit-chat, or will it include deeper, more serious content, as takes place during a date or confession? I’m inclined to the latter. And despite the innocuous name of something like “ChatGPT”, I think its main use will likely be anything but chit-chat.
So if one way we judge “natural” connections is through the quality of certain rhythmic interactions we have with one another—as I hope to have indicated—then the participation in those interactions of a kind of being that doesn’t resemble us ought to make us wary of our perceptions. That is, it may be unwise to rely on pro-social, entrainment-based senses to decide whether an AI counts as a person or not, just because we can take turns talking with it. We’ll need different indicators of personhood, though it’s beyond the scope of this piece to say what those are.
As I said in the beginning, turn-taking really is just the starting point for understanding language more broadly, bare scaffolding for the towers and arches it’s possible to build once we’ve outlined their basic structure. I hope to have made this outer structure at least a smidge more visible, and to have widened our conceptions of what sorts of creatures have some semblance of it, whether hairy, feathered, or mechanical. It really is true, as the song goes: “Everybody’s talkin’.”
In other words, the only “good” sort of interruption—as the jazz adage says—consists of knowing the rules well enough to break them, and in this case, knowing the person well enough to interrupt them with their own words. But its valence depends on context: finishing each other’s sentences can be seen as annoying or charming depending on who does it, who it’s done to, or who watches. Onlookers may tend to find this sort of thing charming, but not someone who’s getting interrupted; the latter might read it as if their thoughts are overly predictable—boring, in other words. Perhaps it’s the trickster in me, but I suspect that those interrupted in this way secretly love that someone knows them well enough to do it.
A mammalian brain, though it also has regions with distinct functional roles, is one in which many of those regions are much harder to visually distinguish, unlike the pockets of grey matter which simplify things for bird neuroscientists.
That is, the portion of avian motor cortex that chiefly has to do with song; nXII is a cranial nerve that goes to the bird’s syrinx, which actually produces the sound. Other parts of a bird’s “motor cortex”, of course, control other parts of the bird body. For its connections to RA, HVC is considered a premotor region.
Further research from this lab showed birds can entrain in triads consisting of two birds and a robot, and investigated HVC’s role even deeper than in their 2016 paper. There seem to be inhibitory neurons within HVC that “restrain” output when it’s not the bird’s turn to vocalize.
And interestingly enough, although female zebra finches don’t sing, they seem to slightly outperform males in this call-response assay.
The discussion of conversation timing from all these different angles is fascinating. Good stuff!
The part about entrainment reminds me of this ted talk:
- https://www.youtube.com/watch?v=1NG7FoC5XRo.
I find it really comforting to think about conversation as a sort of dance that you're having with the other person. It's like a dance where you express yourself, but also team up with your interlocutor(s) to match or complement their cadence, mood, tone, etc. Imagining conversation like a dance is a thrilling abstraction. It was nice to read your scientific breakdown of things!