By Allie, a self-supervised MLE in Texas

I remember being affectionately called “the deep learning person” by an AI PhD I knew in undergrad, to which I promptly replied “I am not deserving of that title”. Oh how undeserving I was! After graduation, I spent a year chasing divergent goals and battling personal crises, which precipitated a theory phase (think An Information-Theoretical Framework for Supervised Learning). I had an existential epiphany over this section in Applying Statistical Learning Theory to Deep Learning, exclaiming, “I should’ve thought about life in parameter space!”

The architecture is important because it determines the mapping from parameter space to function space and is the biggest contributor to the geometry by which we are searching in function space. The next most significant thing that affects the inductive bias is the geometry of the local search in parameter space (e.g. l2 vs l1 in parameter space).

(Insert 🤦🏻‍♀️ from everyone who’s heard my musings about life being PCA)

I moved to Texas a year ago. It’s been really lonely, emotionally and intellectually. But I’m no stranger to fighting the odds with no guidance but my own — that was all of high school and the pandemic. So after the initial months of shock, I forced my brain back into problem-solving mode. If I must switch to a self-supervised paradigm, what are the auxiliary tasks?

Task A: seek external mentorship. I reached out to complete strangers on LinkedIn, and I’m grateful to the dozen or so people who took the time to share their journeys and tips for getting better as an MLE.

Task B: rebuild community. I kept in touch with friends in the space, which helps keep my language updated. I joined professional groups like AI Circle. On Christmas Eve, I hosted an ML study session with some fellow nerds in NYC, which rekindled the joy of discourse.

Task C: try lots of things at work. There isn’t scale, but there is the flexibility to apply a lot of the techniques from school to very small datasets and observe immediate feedback.

Task D: read. Despite having about a year of robot learning research experience and 6 months of retrieval augmented generation (RAG) and LLM-related work experience, I hadn’t yet formed the habit of reading papers beyond those immediately related to my research. I can’t believe it took so long! Sharing a couple of highlights here.

Computer Vision

The entry point was autoencoders (AEs). When I first came across the DeepFont paper, with its AE trained on synthetic data and its classification head trained on real-world data, it took me a while to make sense of the two components. But as I surveyed more of the reconstruction-based anomaly detection literature, I built intuition about the kind of distribution a conventional AE suits and the ways to expand its capabilities. Some works learn to transform the distribution of good items to a Gaussian, others use contrastive learning to minimize the differences between good items; some make AEs work better for features at different scales and leverage SSL tasks like masking, and others tweak the encoder-decoder architecture to prevent information leakage. One of the hardest generalization problems I’ve ever seen is getting these encoders to work for more than the dozen or so object classes in the current small benchmarks, and I’ll definitely be following progress there.
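
To make the reconstruction idea concrete, here is a minimal sketch of how I think about it: a vanilla convolutional AE trained only on “good” images, with per-image reconstruction error used as the anomaly score. This is my own toy version, not the recipe from any specific paper, and the architecture and threshold choice are placeholders.

```python
import torch
import torch.nn as nn

class ConvAE(nn.Module):
    """Toy convolutional autoencoder, trained only on "good" samples."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

@torch.no_grad()
def anomaly_score(model, images):
    # Per-image mean squared reconstruction error; anomalies (which the AE never
    # learned to reconstruct) should score high. The decision threshold would be
    # tuned on a held-out validation set.
    recon = model(images)
    return ((images - recon) ** 2).mean(dim=(1, 2, 3))
```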

The other major line of anomaly detection work is based on embedding retrieval, which aligned nicely with my text retrieval experience and helped me get over the insecurity of never having worked in vision. Modality is secondary to the recipe of representation learning, after all.
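
The retrieval flavor is even easier to sketch. Below is a toy version under my own assumptions: some encoder (pretrained or self-supervised) exposes an `embed` function, we store normalized embeddings of known-good items in a memory bank, and the anomaly score is the mean distance to the k nearest good neighbors. The names and defaults are hypothetical, not from any particular paper.

```python
import torch
import torch.nn.functional as F

def build_memory_bank(embed, good_batches):
    # Stack embeddings of known-good items into a single (n_bank, d) tensor.
    with torch.no_grad():
        bank = torch.cat([embed(x) for x in good_batches], dim=0)
    return F.normalize(bank, dim=1)

def retrieval_anomaly_score(embed, query_batch, bank, k=5):
    # Score = mean distance to the k closest "good" embeddings:
    # being far from everything normal usually means anomalous.
    with torch.no_grad():
        q = F.normalize(embed(query_batch), dim=1)
    dists = torch.cdist(q, bank)                # (n_query, n_bank)
    knn = dists.topk(k, largest=False).values   # k smallest distances per query
    return knn.mean(dim=1)
```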

Starting in October, I did a deep dive into essential computer vision works. The highlights were:

  1. Reading classical feature extraction papers like SIFT made me appreciate what we often abstract away: the meaningfulness of shallow, explainable features.
  2. I finally understood how big of a deal it was for vision transformers to learn the 2D inductive bias of image structure from scratch. And DeiT made the point that we can accelerate that learning by distilling from CNN teachers that already have said inductive bias.
  3. YOLO was my first time making sense of a multi-part loss function and how a good one can remove the need for extra subnetworks.
  4. At first, DINO’s SSL objective felt counterintuitive: why could we just set one network to be the exponential moving average of another? (There’s a toy sketch of that update right after this list.) But of course, like any other good objective, it rules out “the easy way”, which in DINO’s case would be the student and teacher collapsing to the same trivial output. It’s almost like how I’m training myself nowadays: the student network is the fast brain that solves specific things, and the teacher network is the slow brain that situates the fast brain in a global outlook, which then improves the fast brain’s judgment. (a very scientific analogy 😛)
  5. The latent diffusion models paper changed my attitude toward architecture design. I had understood cross-attention as “the thing you use in the transformer decoder to align it with the input”, but this reminded me that cross-attention is just another principled choice for a general-purpose conditioner (see the second sketch after this list). Since then, networks have seemed more like living organisms, as I’ve begun to see submodules as useful but not irreplaceable tools.
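
To answer my own EMA question from item 4, here is a toy sketch of a DINO-style teacher update, with a made-up student/teacher pair of identical architecture. The key point is that the teacher is never trained by gradients; it is only nudged toward the student after each step (DINO’s other safeguards, like centering and sharpening of the teacher outputs, are omitted here).

```python
import copy
import torch

@torch.no_grad()
def ema_update(student: torch.nn.Module, teacher: torch.nn.Module, momentum: float = 0.996):
    # Teacher weights drift slowly toward the student's; no gradients flow here.
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

# Hypothetical networks with identical architecture.
student = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(), torch.nn.Linear(16, 4))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)  # the teacher is never optimized directly

# ... inside the training loop, after each student optimizer step:
ema_update(student, teacher, momentum=0.996)
```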
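
And for item 5, a minimal sketch of cross-attention as a general-purpose conditioner (the dimensions and names are mine, not the latent diffusion implementation): a set of latent tokens queries an arbitrary conditioning sequence, which could just as well be text embeddings, class embeddings, or anything else with a sequence shape.

```python
import torch
import torch.nn as nn

class CrossAttentionConditioner(nn.Module):
    # Latent tokens (queries) attend over a conditioning sequence (keys/values).
    def __init__(self, latent_dim=64, cond_dim=32, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=latent_dim, num_heads=n_heads,
            kdim=cond_dim, vdim=cond_dim, batch_first=True,
        )
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, latents, cond):
        # latents: (batch, n_latent_tokens, latent_dim)
        # cond:    (batch, n_cond_tokens, cond_dim)
        out, _ = self.attn(query=latents, key=cond, value=cond)
        return self.norm(latents + out)  # residual: condition without overwriting

# Toy usage: 16 latent tokens conditioned on a 10-token "prompt" embedding.
layer = CrossAttentionConditioner()
latents, cond = torch.randn(2, 16, 64), torch.randn(2, 10, 32)
print(layer(latents, cond).shape)  # torch.Size([2, 16, 64])
```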

Retrieval, Search, and Recommendation

In the meantime, I got curious about how retrieval, search, and recommendation are connected, which is a full-circle moment because my first tech internship before senior year involved designing a classical recommendation algorithm. Even then, with a small no-training solution, I had the intuition to design a two-stage system: filter candidates with content-based hard constraints, then rank the rest with similarity weights. It’s endearing to see the learned variants of these ideas.
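
For fun, here is a toy version of that two-stage idea, with every name and weight made up rather than taken from the internship project: a hard-constraint filter prunes the candidate pool, and a similarity score ranks whatever survives.

```python
from dataclasses import dataclass

@dataclass
class Item:
    item_id: str
    category: str
    in_stock: bool
    features: list[float]  # e.g., a precomputed content embedding

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def recommend(items, user_vector, allowed_categories, top_k=5):
    # Stage 1: hard constraints prune the pool (no learning required).
    candidates = [i for i in items if i.in_stock and i.category in allowed_categories]
    # Stage 2: rank the survivors by similarity to the user's preference vector.
    ranked = sorted(candidates, key=lambda i: cosine(i.features, user_vector), reverse=True)
    return ranked[:top_k]
```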