Engineering Philosophy: Kyunghyun Cho

Key Takeaways

  • He gave recurrent networks a learned memory gate. As first author of the 2014 paper “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation,” Kyunghyun Cho introduced the Gated Recurrent Unit (GRU) – a simplified gated cell that lets a network learn how much of its past to keep versus overwrite at each step, with fewer parameters than the LSTM it rivals.13
  • He helped put attention into machine translation. With Dzmitry Bahdanau (first author) and Yoshua Bengio, Cho co-authored “Neural Machine Translation by Jointly Learning to Align and Translate” (2014), which let a decoder softly align to any part of the source instead of squeezing the whole sentence through one fixed-length vector. That mechanism is the direct ancestor of the Transformer’s self-attention, and through it of modern large language models.24
  • He is a leading voice for open, reproducible science. Cho has been an active proponent of open reviewing and careful empiricism – honest baselines, skepticism toward one’s own results, and research published in the open – documented through his work on peer review and his use of open platforms like OpenReview.56
  • From Aalto to NYU to drug design. Born in South Korea in 1985, he took his doctorate at Aalto University, did a postdoc with Yoshua Bengio at Université de Montréal, joined NYU in 2015, and co-founded Prescient Design – now part of Genentech – applying machine learning to antibody design.17

The Principle

“One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols.” – Kyunghyun Cho et al., describing the RNN Encoder-Decoder that his later work would learn to outgrow1

The oldest instinct in engineering is to design the control structure yourself. You decide which inputs matter, how long the system should remember a value, where to look in a buffer. You encode those decisions in the architecture, and the system inherits your guesses. Cho’s body of work runs the other direction. His defining contributions – the GRU’s gates and attention’s soft alignment – share one move: stop hard-coding the control structure, and let the model learn it. Which past state to keep, which input to attend to – those become parameters the network tunes from data rather than rules the designer fixes by hand.12

The GRU is the cleanest case. A plain recurrent network carries a hidden state forward and, at every step, pushes the old state and the new input through the same update, with no mechanism for deciding that this particular step deserves to be kept or discarded. That fixed blend is a designer’s guess, and it is usually wrong: important signals from early in a sequence get diluted into noise within a few steps. Cho’s gated unit replaces the guess with a learned gate – a small sigmoid-valued control that decides, per step and per dimension, how much of the old memory to retain versus how much to overwrite with the new input.3 The network learns when to hold on and when to let go. Nobody hand-codes the memory policy; the data writes it.

Attention is the same idea applied to looking instead of remembering. And there is a second half to Cho’s principle that keeps the first half honest – a scientist’s discipline. The way you find out whether a learned mechanism actually helps is not enthusiasm; it is a fair fight against a strong baseline, results you are skeptical of until they hold up, and methods published openly so others can reproduce them.56 Let the model learn its own control structure – then verify, honestly and in the open, that it really worked. The first half is where the power comes from. The second half is what makes the power trustworthy.

Context

Kyunghyun Cho was born in South Korea in 1985.1 He took his bachelor’s in computer science at KAIST in 2009, then moved to Finland for graduate study at Aalto University, earning a master’s in machine learning and data mining in 2011 and his Doctor of Science in 2014, supervised by Professor Juha Karhunen (with Tapani Raiko and Alexander Ilin).17 Finland is an unlikely cradle for one of deep learning’s pivotal years, but the timing is the point: he finished his doctorate exactly as the field was about to turn.

From Aalto he went to Montreal for a postdoctoral fellowship with Yoshua Bengio at the Université de Montréal – the lab that would become MILA, then the densest concentration of deep-learning talent in the world.1 The two papers that anchor this essay – the GRU and the attention-based NMT model – both came out of that 2014 window, with Bengio as a co-author on each.12 It is worth pausing on how fast that was: a researcher arrives as a postdoc and, within roughly a year, is first author on the paper that introduces the GRU and a co-author on the paper that introduces attention to translation. Both ideas now sit underneath a large fraction of working AI.

In 2015 he joined New York University, where he is now a professor of computer science and data science at the Courant Institute and the Center for Data Science, the Glen de Vries Professor of Health Statistics, and co-director of NYU’s Global AI Frontier Lab alongside Yann LeCun.7 He spent 2017-2020 as a research scientist at Facebook AI Research, and – the chapter that shows his principle reaching past language – in early 2021 he co-founded Prescient Design, a lab-in-the-loop antibody-design startup acquired by Genentech later that year, where he led frontier research applying generative machine learning to drug design.78 Throughout, he has been a vocal advocate for open and reproducible science.56

The Work

The GRU: a memory the network learns to gate

To feel why gating matters, watch a recurrent network try to carry one important value across a long, noisy sequence. The network reads one step at a time, holding a single hidden “memory” that it updates at every step. Early on, a value worth keeping appears; everything after it is noise. A plain RNN blends old and new in a fixed proportion, so the early value bleeds away within a few steps – by the end, it is gone. The whole problem is that the keep-versus-overwrite decision was hard-coded. Picture a single gate that decides, at each step, how much of the old memory survives. Set it toward “keep” and the early value reaches the end intact; set it toward “overwrite” and it washes out almost immediately.
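The thought experiment can be sketched in a few lines of Python. This is a toy, not a real RNN: the memory is a single number, the function name `carry` and its parameters are made up for illustration, and `keep` is a constant chosen by hand rather than learned.

```python
import random


def carry(signal, keep, steps=20, noise_scale=0.1, seed=0):
    """Blend old memory with new input using one fixed keep fraction."""
    rng = random.Random(seed)
    memory = signal  # the value worth keeping arrives first
    for _ in range(steps):
        noise = rng.gauss(0.0, noise_scale)  # everything afterwards is noise
        memory = keep * memory + (1.0 - keep) * noise
    return memory


for keep in (0.5, 0.9, 0.99):
    print(f"keep={keep:<4}  memory after 20 steps = {carry(1.0, keep):+.3f}")
```

Ignoring the noise, the original value is scaled by `keep ** steps`: about 0.82 of it survives 20 steps at `keep=0.99`, about 0.12 at `keep=0.9`, and less than a millionth at `keep=0.5`. Any single fixed setting is a compromise between holding on and staying responsive to new input.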

That is exactly the lever Cho’s GRU adds, except the network learns the gate instead of you setting it. Introduced in his 2014 RNN Encoder-Decoder paper, the GRU uses two learned gates: an update gate that controls how much of the previous hidden state to carry forward versus replace with new information, and a reset gate that decides how much of the past should influence the new candidate state.3 Both are sigmoid-valued – smooth knobs between 0 and 1 – which means the whole thing is differentiable and trainable by gradient descent. The network discovers, from data, when to hold a value for a long time and when to flush it.3
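A minimal sketch of one GRU step, in plain Python so it runs without dependencies. Several things here are assumptions for illustration: the weights are random rather than trained, each block carries a bias vector, and the helper names (`init_params`, `affine`, `gru_step`) are invented. The blend follows the convention in which an update gate near 1 keeps the old state; some references, including some library documentation, write the gate the other way round.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def init_params(n_in, n_hid, seed=0):
    """Random weights for illustration only; a real model would train these."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    params = {}
    for block in ("z", "r", "h"):  # update gate, reset gate, candidate state
        params["W" + block] = mat(n_hid, n_in)
        params["U" + block] = mat(n_hid, n_hid)
        params["b" + block] = [0.0] * n_hid
    return params


def affine(params, block, x, h):
    """Compute W x + U h + b for one of the three blocks."""
    wx = matvec(params["W" + block], x)
    uh = matvec(params["U" + block], h)
    return [a + b + c for a, b, c in zip(wx, uh, params["b" + block])]


def gru_step(params, x, h_prev):
    z = [sigmoid(a) for a in affine(params, "z", x, h_prev)]  # update gate
    r = [sigmoid(a) for a in affine(params, "r", x, h_prev)]  # reset gate
    reset_h = [ri * hi for ri, hi in zip(r, h_prev)]  # past, as seen by the candidate
    cand = [math.tanh(a) for a in affine(params, "h", x, reset_h)]
    # z near 1 keeps the old state; z near 0 overwrites it with the candidate.
    h = [zi * hp + (1.0 - zi) * ci for zi, hp, ci in zip(z, h_prev, cand)]
    return h, z, r


params = init_params(n_in=2, n_hid=3)
h = [0.0, 0.0, 0.0]
for x in ([1.0, 0.0], [0.0, 1.0], [0.5, -0.5]):
    h, z, r = gru_step(params, x, h)
    print([round(v, 3) for v in h], [round(v, 3) for v in z])
```

Every operation is a smooth function of the weights, which is what lets gradient descent adjust when each dimension of the state holds or flushes its value.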

The design is consciously economical. It arrived seventeen years after Hochreiter and Schmidhuber’s LSTM (1997), which tackles the same vanishing-gradient problem with more machinery – a separate cell state and an extra output gate. The GRU “lacks a context vector or output gate, resulting in fewer parameters than LSTM,” yet performs comparably across speech recognition, music modeling, and language tasks.3 That is the engineering taste worth naming: the simpler unit that captures the essential idea – learned gating – without the parts you cannot prove you need. For years the GRU and LSTM were the two default recurrent cells in most sequence-modeling toolkits.
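The parameter saving is easy to check with arithmetic. The sketch below assumes the textbook formulation, in which every block has one input matrix, one recurrent matrix, and one bias vector; the function names are illustrative, and some libraries store an extra bias per block, which shifts the totals slightly but not the ratio.

```python
def gru_param_count(n_in, n_hid):
    # Three blocks: update gate, reset gate, candidate state.
    return 3 * (n_hid * n_in + n_hid * n_hid + n_hid)


def lstm_param_count(n_in, n_hid):
    # Four blocks: input gate, forget gate, output gate, cell candidate.
    return 4 * (n_hid * n_in + n_hid * n_hid + n_hid)


n_in, n_hid = 256, 512
print(gru_param_count(n_in, n_hid))   # 1181184
print(lstm_param_count(n_in, n_hid))  # 1574912
```

Under these assumptions a GRU layer has exactly three quarters of the parameters of an LSTM layer with the same input and hidden sizes.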

Attention: soft alignment, and the end of the fixed-vector bottleneck

The RNN Encoder-Decoder had a structural flaw, and Cho’s own paper named it: the encoder crushes an entire input sentence into “a fixed-length vector representation,” and the decoder must reconstruct the whole translation from that single vector.1 For a short sentence that is fine. For a long one it is a catastrophe – you are forcing a paragraph through a keyhole, and recurrent networks tend to over-weight the words nearest the end while the beginning fades.4 The bottleneck is the fixed-length vector itself.

The fix, in “Neural Machine Translation by Jointly Learning to Align and Translate” (2014/2015) – Dzmitry Bahdanau as first author, with Cho and Bengio – was to remove the bottleneck entirely.2 Instead of compressing the source into one vector, the model keeps a representation of every source word and lets the decoder, at each output step, compute a set of weights over all of them – a learned soft alignment that says “to produce this word, look mostly here, a little there.”4 The decoder gets direct access to any part of the input rather than reaching it only through a single squeezed state.4 Crucially, the alignment is not hand-specified by a linguist; it is learned jointly with translation, end to end. The model decides for itself where to look – the same principle as the GRU’s gate, now applied to attention instead of memory.
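A dependency-free sketch of one decoder step of additive soft alignment, in the spirit of the paper's mechanism. The matrices `W` and `U`, the vector `v`, and all sizes are random stand-ins invented for illustration; in a trained model they are learned jointly with the rest of the translator.

```python
import math
import random


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def additive_attention(dec_state, enc_states, W, U, v):
    """Return (alignment weights, context vector) for one decoder step."""
    proj_s = matvec(W, dec_state)
    scores = []
    for h in enc_states:  # one score per source position
        proj_h = matvec(U, h)
        scores.append(sum(vk * math.tanh(a + b) for vk, a, b in zip(v, proj_s, proj_h)))
    weights = softmax(scores)  # soft alignment: positive, sums to 1
    context = [sum(w * h[k] for w, h in zip(weights, enc_states))
               for k in range(len(enc_states[0]))]
    return weights, context


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
enc_dim, dec_dim, att_dim, src_len = 4, 3, 5, 6
enc_states = random_matrix(rng, src_len, enc_dim)  # one vector per source word
dec_state = [rng.uniform(-1.0, 1.0) for _ in range(dec_dim)]
W = random_matrix(rng, att_dim, dec_dim)
U = random_matrix(rng, att_dim, enc_dim)
v = [rng.uniform(-1.0, 1.0) for _ in range(att_dim)]

weights, context = additive_attention(dec_state, enc_states, W, U, v)
print([round(w, 3) for w in weights])
```

The context vector is recomputed at every output step, so the decoder never depends on a single summary of the source; it takes a different weighted mixture of the per-word representations each time.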

The line from there to today is short and load-bearing. Soft alignment generalized into self-attention, and in 2017 “Attention Is All You Need” introduced the Transformer, formalized scaled dot-product attention, and threw the recurrence away entirely – a design that became the foundation for BERT, T5, GPT, and the whole generation of large language models.4 Modern LLMs are, in a real sense, attention scaled up and stripped of the RNN around it. The mechanism Cho helped introduce to translation in 2014 is the ancestor of the one your chatbot runs on. (The attribution matters: Cho is first author on the GRU paper and a co-author with Bahdanau and Bengio on the attention paper – he did not introduce attention alone.)12
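Side by side, the two formulations show how little changed in the core idea. The symbol names are a notational choice made here: s is the decoder state, h_j the encoder state at source position j, and Q, K, V the query, key, and value matrices with key dimension d_k.

```latex
% Additive soft alignment (2014): decoder step i over source positions j
e_{ij} = v^{\top} \tanh\!\left(W s_{i-1} + U h_j\right), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})}, \qquad
c_i = \sum_{j} \alpha_{ij}\, h_j

% Scaled dot-product attention (2017)
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

In both, a score between a query and every candidate position is normalized by a softmax and used to take a weighted average. The later form swaps the small scoring network for a dot product, which turns the whole computation into matrix multiplications.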

Open and reproducible science

The second principle is quieter and, in a hype-prone field, harder. Cho has been a consistent advocate for open, reproducible science – research conducted and reviewed in the open, with honest baselines and a working skepticism toward one’s own results.56 He has used and championed open platforms like OpenReview, and has contributed research on peer review itself, studying questions like whether authors’ own assessments of their papers can inform the review process.56 The throughline is that a result is not a result until it survives scrutiny that the author did not control.

Why it matters technically: deep learning makes it unusually easy to fool yourself. A new architecture will almost always beat a weak baseline or an under-tuned competitor, and a lucky random seed can manufacture a gain on its own. The literature is littered with “improvements” that evaporate when someone gives the old method a fair fight. The discipline Cho models is to assume your own gain is an artifact until proven otherwise, to tune the baseline as hard as you tune your method, and to publish openly enough that someone can prove you wrong. It is the empirical conscience that keeps the “let the model learn it” instinct from curdling into “the model learned it because I wanted it to.”
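One way to make the fair fight concrete is to give both methods the same tuning grid and the same seeds, then report the spread alongside the mean. The sketch below is a protocol illustration only: `train_and_eval` is a hypothetical stand-in that fabricates a score from a formula, and every number in it is made up.

```python
import math
import random
import statistics

LEARNING_RATES = (0.001, 0.01, 0.1, 1.0)  # the same grid for both methods


def train_and_eval(method, lr, seed):
    """Hypothetical stand-in for a real training run; returns a fake accuracy."""
    peak_acc = {"baseline": 0.80, "proposed": 0.81}[method]
    best_lr = {"baseline": 0.01, "proposed": 0.1}[method]
    rng = random.Random(f"{method}/{lr}/{seed}")
    seed_noise = rng.uniform(-0.01, 0.01)  # run-to-run variation
    return peak_acc - 0.05 * abs(math.log10(lr / best_lr)) + seed_noise


def tune_then_test(method, tune_seeds=(0, 1, 2), test_seeds=(10, 11, 12, 13, 14)):
    """Tune on one set of seeds, then report on fresh seeds."""
    def tuning_score(lr):
        return statistics.mean([train_and_eval(method, lr, s) for s in tune_seeds])

    chosen_lr = max(LEARNING_RATES, key=tuning_score)
    scores = [train_and_eval(method, chosen_lr, s) for s in test_seeds]
    return chosen_lr, statistics.mean(scores), statistics.stdev(scores)


for method in ("baseline", "proposed"):
    lr, mean, spread = tune_then_test(method)
    print(f"{method:9s} lr={lr:<5} accuracy={mean:.3f} +/- {spread:.3f}")
```

The fake numbers are arranged so that the two methods peak at different learning rates and the proposed method's true edge is the same size as the seed noise. Running the baseline at the proposed method's learning rate would cost it about five points in this setup, which is exactly the kind of gain that disappears once the baseline is tuned too.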

Machine learning for the sciences: Prescient Design

The clearest sign that Cho’s principle is general, not language-specific, is where he took it next: drug design. In early 2021 he co-founded Prescient Design, acquired by Genentech that year, to build a “lab-in-the-loop” platform for designing therapeutic antibodies.78 The loop couples generative ML models, multi-task property predictors, and active-learning selection with actual wet-lab experiments in an iterative cycle – the models propose antibody variants, the lab tests them, and the results retrain the models.8 Applied across clinically relevant targets, the system designed and tested over 1,800 antibody variants and engineered antibodies with substantially stronger binding than the starting leads, reported as roughly 3 to 100 times.8 The structure is the same one that runs through everything else he has built: don’t hand-design the answer, build a system that learns it from evidence – and close the loop with real measurements so the learning stays honest.
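The shape of such a loop (propose, select, measure, update) can be shown with a deliberately tiny toy. Nothing below reflects Prescient Design's actual models or data: a "variant" is just a number between 0 and 1, `wet_lab_assay` is a made-up formula standing in for an experiment, and the "model" is a nearest-neighbour lookup with an exploration bonus.

```python
import random


def wet_lab_assay(variant):
    """Hypothetical stand-in for a lab measurement; best possible score is 0 at 0.7."""
    return -(variant - 0.7) ** 2


def acquisition(variant, measured, explore=0.5):
    """Toy selection score: nearest measured result plus a bonus for novelty."""
    nearest = min(measured, key=lambda v: abs(v - variant))
    return measured[nearest] + explore * abs(variant - nearest)


def lab_in_the_loop(rounds=5, proposals=20, batch=4, seed=0):
    rng = random.Random(seed)
    measured = {0.1: wet_lab_assay(0.1)}  # the starting lead
    best_so_far = []
    for _ in range(rounds):
        lead = max(measured, key=measured.get)
        # Propose: perturb the current best variant, clipped to [0, 1].
        candidates = [min(1.0, max(0.0, lead + rng.gauss(0.0, 0.2)))
                      for _ in range(proposals)]
        # Select: send only a small batch to the (simulated) lab.
        chosen = sorted(candidates, key=lambda v: acquisition(v, measured),
                        reverse=True)[:batch]
        # Measure and update: new results feed the next round's decisions.
        for variant in chosen:
            measured[variant] = wet_lab_assay(variant)
        best_so_far.append(max(measured.values()))
    return measured, best_so_far


measured, best_so_far = lab_in_the_loop()
print([round(score, 4) for score in best_so_far])
```

The point of the structure is that the selection step only ever sees measured outcomes, so a wrong guess by the model gets corrected by the next batch of data rather than compounding.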

The Method

Read across the GRU, attention, the open-science advocacy, and the antibody work, and the same commitments recur. Cho’s method is less a slogan than a set of standing habits.

Make the control structure learnable. The defining move is to take a decision you would normally hard-code – how long to remember, where to look – and turn it into parameters the model tunes from data. The GRU’s gate and attention’s alignment are the same instinct in two domains. The lesson transfers far past sequence models: when you find yourself guessing a threshold, a weighting, or a routing rule, ask whether the system could learn it better than you can guess it.12

Prefer the simpler mechanism that captures the essential idea. The GRU keeps learned gating and drops the LSTM’s cell state and output gate, matching its performance with fewer parameters.3 That is a minimum worthy product at the level of a neural cell – the smallest unit that still carries the load-bearing idea, the same economy of means that runs through Sophie Wilson’s ARM.

Name the bottleneck before you remove it. Cho’s own encoder-decoder paper named the fixed-length-vector limitation; the attention paper removed it.14 The discipline is to state the structural flaw precisely first – the paragraph through a keyhole – because a sharply named bottleneck is half-solved. It is the same instinct as Fei-Fei Li’s decision to attack the data, not the model: find the real constraint, not the convenient one.

Give the baseline a fair fight. Honest empiricism means tuning the method you are trying to beat as hard as the one you are proposing, and treating your own gain as suspect until it survives that test. This is the evidence gate made into a research practice – “it improved” is not evidence; “it improved against a strong, openly reproducible baseline” is.56

Work in the open. Open review, open platforms, reproducible results. The value of a method is not what its author claims but what survives independent scrutiny – which is why publishing openly enough to be checked is part of the engineering, not a courtesy after it. It is “quality is the only variable” applied to science itself: the only thing that counts is whether the result is real.56

Influence Chain

Who Shaped Him

Yoshua Bengio and the MILA group. Cho’s pivotal year happened as a postdoc in Bengio’s Montreal lab, and Bengio co-authored both the GRU and the attention papers.12 MILA was the environment where the recurrent-network and machine-translation ideas of the early 2010s were being pushed hardest, and Cho’s work is inseparable from it. (Direct influence)

Juha Karhunen and the Aalto school. His doctoral training at Aalto under Juha Karhunen, with Tapani Raiko and Alexander Ilin, grounded him in the unsupervised-learning and neural-network tradition before the Montreal years gave it a translation target.17 (Formative influence)

The deep-learning lineage. The vanishing-gradient problem the GRU addresses, and the recurrent and convolutional architectures it sits among, descend from Geoffrey Hinton’s and Yann LeCun’s decades of work establishing that deep networks can be trained at all. Cho now co-directs an NYU lab with LeCun.7 (Formative influence)

Who He Shaped

Modern sequence modeling. For years the GRU was one of the two default recurrent cells – the simpler half of the GRU/LSTM choice that every NLP, speech, and time-series practitioner made by reflex.3

The Transformer and the LLM era. Attention, introduced to translation in the paper Cho co-authored, generalized into the self-attention that the Transformer is built from – and the Transformer is what every modern large language model runs on.4 The mechanism is one of the most consequential in the field’s history.

Machine learning for the sciences. Through Prescient Design and Genentech, Cho helped push generative ML into therapeutic antibody design, an argument that the “learn it from evidence, close the loop with measurement” pattern belongs in biology as much as in language.8

The Throughline

Cho is the mechanism hinge of this series’ deep-learning branch – the bridge from foundations to the LLM era. Geoffrey Hinton and Yann LeCun established that deep networks can learn; Fei-Fei Li supplied the data they learned on. Cho’s gating and attention are the architectural step that turned those foundations into something that could handle language at scale – and his attention work flows directly into the Transformer that Andrej Karpathy would later teach a generation to build and that powers the modern LLM. Where Hinton says the learning machine works and Li says here is the world to learn from, Cho says: let the model learn its own control structure – what to remember, what to look at – and then prove, in the open, that it actually did. The first clause is the line to the Transformer; the second is the empirical conscience that keeps it honest – the same rigor Fei-Fei Li brought to benchmarks. (Series bridge)

What I Take From This

The lesson I keep from Cho is to stop hard-coding the decisions a system could learn. My instinct, like most engineers’, is to encode my own judgment – this threshold, that weighting, this routing rule – because my judgment feels like the trustworthy part. The GRU and attention are both arguments that the opposite often ages better: the model, given the right mechanism and enough data, will learn a memory policy or an alignment far better than I would guess one. So when I catch myself tuning a magic constant or hand-writing a branch that decides what matters, I now ask whether I am encoding a decision the system should be learning instead. The skill is not making the choice; it is building the structure that lets the choice be learned.

The second lesson is the discipline that keeps the first one from lying to me. Letting a system learn its own behavior is intoxicating precisely because it so easily looks like it worked – the demo passes, the metric ticks up, and I want to believe. Cho’s open-science ethic is the antidote: assume your own result is an artifact until it survives a fair fight against a strong baseline, and publish it openly enough that someone can prove you wrong. That is exactly why I treat the evidence gate as non-negotiable – “it improved” is a feeling, “it improved against a tuned baseline, reproducibly” is evidence. Build systems that learn their own control structure, then hold them to a standard of proof you did not get to set. The power and the honesty are two halves of one practice.

FAQ

What is the GRU (Gated Recurrent Unit)?

The GRU is a type of recurrent neural network cell that Kyunghyun Cho introduced as first author in the 2014 paper “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation.”1 It uses learned gates – an update gate and a reset gate, both sigmoid-valued between 0 and 1 – to control how much of the previous hidden state to keep versus overwrite with new input at each step. This lets the network learn what to remember over long sequences, which mitigates the vanishing-gradient problem that limits plain recurrent networks. It performs comparably to the LSTM while using fewer parameters, because it omits the LSTM’s separate cell state and output gate.3

What did Kyunghyun Cho contribute to attention and neural machine translation?

Cho is a co-author – with Dzmitry Bahdanau (first author) and Yoshua Bengio – of “Neural Machine Translation by Jointly Learning to Align and Translate” (2014), the paper that introduced the attention mechanism to neural machine translation.2 Instead of compressing a source sentence into one fixed-length vector, the model keeps a representation of every source word and lets the decoder learn a soft alignment over all of them at each output step. That mechanism generalized into the self-attention of the Transformer in 2017, which is the foundation of modern large language models.4 Separately, Cho was first author on the 2014 paper that introduced the GRU.1

What encoder-decoder bottleneck did attention solve?

The original RNN Encoder-Decoder, described in Cho’s 2014 paper, encodes an entire input sentence into “a fixed-length vector representation,” from which the decoder must reconstruct the whole output.1 For long sentences this single vector is a bottleneck – a paragraph forced through a keyhole – and recurrent networks tend to lose information from the start of the sequence.4 Attention removed the bottleneck by keeping a representation of every source position and letting the decoder attend directly to any of them, with learned weights, rather than relying on one compressed state.4

What is Kyunghyun Cho doing now?

Cho is a professor of computer science and data science at NYU’s Courant Institute and Center for Data Science, the Glen de Vries Professor of Health Statistics, and co-director of NYU’s Global AI Frontier Lab with Yann LeCun.7 He co-founded Prescient Design in 2021 – acquired by Genentech that year – where he led frontier research applying generative machine learning to “lab-in-the-loop” therapeutic antibody design, coupling ML models with wet-lab experiments in an iterative optimization loop.78 He remains an active advocate for open and reproducible science.56


Sources


  1. Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation,” arXiv:1406.1078 (2014), presented at EMNLP 2014. Primary source for the GRU. Kyunghyun Cho is the first author. The abstract describes the RNN Encoder-Decoder: “One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols.” The paper introduces a novel gated hidden unit (the Gated Recurrent Unit) and reports that the model improves a statistical machine translation system. 

  2. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” arXiv:1409.0473 (2014; published ICLR 2015). Primary source for attention in NMT. Dzmitry Bahdanau is the first author; Cho and Bengio are co-authors. The paper proposes letting the model “automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word,” rather than encoding the whole source into a single fixed-length vector. 

  3. “Gated recurrent unit,” Wikipedia. The GRU was introduced in 2014 by Kyunghyun Cho and colleagues in “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation” (EMNLP 2014). It uses an update gate (z_t), which controls how much of the previous hidden state to retain versus incorporate new information, and a reset gate (r_t), which determines which portions of the previous hidden state influence the candidate activation; both use sigmoid activation producing values between 0 and 1. The GRU “lacks a context vector or output gate, resulting in fewer parameters than LSTM,” and performs comparably to LSTM across speech recognition, music modeling, and NLP tasks. 

  4. “Attention (machine learning),” Wikipedia. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio introduced attention to neural machine translation in 2014 in “Neural Machine Translation by Jointly Learning to Align and Translate.” Attention addressed the limitation that encoder-decoder models compressed entire input sequences into fixed-size vectors, causing information loss for longer sentences; recurrent networks “favor information contained in words at the end of a sentence” and “attenuate the significance” of earlier content. Attention allowed “a token equal access to any part of a sentence directly, rather than only through the previous state.” By 2017, the Transformer (“Attention Is All You Need”) formalized scaled dot-product self-attention and became foundational for BERT, T5, and GPT. 

  5. “Kyunghyun Cho,” OpenReview profile, corroborated by “Kyunghyun Cho,” DBLP. Documents Cho’s active use of open peer-review platforms and his published research related to the peer-review process in machine learning – including work examining whether authors’ own assessments of their papers can assist or predict peer-review outcomes – evidence of his engagement with open and reproducible scientific practice. 

  6. “Kyunghyun Cho,” personal website (NYU / Center for Data Science), and his “Google Scholar profile.” Cho’s site and publication record reflect his advocacy for open, reproducible science – openly published research, careful empiricism, and attention to honest evaluation and baselines across his machine-learning work. 

  7. “Kyunghyun Cho,” Wikipedia, corroborated by his “NYU Courant faculty profile” and “personal website.” Born in South Korea in 1985; BS in computer science from KAIST (2009); MS (2011) and Doctor of Science (2014) from Aalto University, Finland, supervised by Prof. Juha Karhunen (with Tapani Raiko and Alexander Ilin); postdoctoral fellow with Yoshua Bengio at Université de Montréal (2014-2015); joined NYU Courant Institute in 2015 (tenured 2019); research scientist at Facebook AI Research (2017-2020); professor of computer science and data science at Courant and the Center for Data Science, Glen de Vries Professor of Health Statistics, and co-director of NYU’s Global AI Frontier Lab with Yann LeCun. Samsung AI Researcher of the Year (2020); Ho-Am Prize in Engineering (2021). 

  8. “Lab-in-the-loop therapeutic antibody design with deep learning,” bioRxiv (2025), Prescient Design / Genentech, and “Genentech: Prescient Design.” Cho co-founded Prescient Design in early 2021; it was acquired by Genentech that year. The lab-in-the-loop paradigm orchestrates generative machine-learning models, multi-task property predictors, active-learning ranking and selection, and in vitro experimentation in a semi-autonomous, iterative optimization loop; applied across clinically relevant antigen targets, the team designed and tested over 1,800 antibody variants and engineered antibodies with substantially stronger binding (reported as roughly 3 to 100 times) than the initial lead molecules. 
