Learn Similarity Search by Writing a Vector Database
I want to be technically correct. I don’t want “You’re absolutely right…” or “That’s a great point”.
The idea I’ve stumbled on is that Claude Code can be an excellent teacher for building a vector db when given the right set of rules. Specifically, the key for me has been adding this as a ground rule:
Do not just write the code. I want you to ask questions to ensure I am learning. Do not be my friend, be tough and critical in evaluating my understanding.
Even though Claude Code can write up similarity search faster than I can write this blog post, I believe there is still value in understanding how things work. The best way I know to understand how things work is to deal with the friction and details of building it.
I decided to create a personal learning course on similarity search and writing a vector db. It is free and open on my GitHub.
Indexes are the magic in databases. That is where it is easiest to gloss over the ugliness of implementation. Why is cosine similarity better, worse, or just different from Euclidean distance? Why do tree structures fall apart on higher-dimensional data? What about Locality-Sensitive Hashing and Product Quantization? They are both really cool and seem promising, right up until you benchmark them. There is a reason everyone is using HNSW.
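To make the cosine-vs-Euclidean question concrete, here is a small NumPy sketch of my own (toy random vectors, not from the course). It shows that the two metrics can rank neighbors differently in general, but agree once every vector is L2-normalized, since then squared Euclidean distance is just 2 minus 2 times cosine similarity:

```python
import numpy as np

rng = np.random.default_rng(42)
docs = rng.normal(size=(5, 8))   # 5 toy "document" vectors, 8 dims
query = rng.normal(size=8)

# Euclidean distance: smaller is closer; sensitive to vector magnitude.
euclid = np.linalg.norm(docs - query, axis=1)

# Cosine similarity: larger is closer; ignores vector magnitude.
cosine = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

print("Euclidean ranking:", np.argsort(euclid))
print("Cosine ranking:   ", np.argsort(-cosine))

# After L2-normalizing everything, the rankings must agree, because
# ||a - b||^2 = 2 - 2 * (a . b) when ||a|| = ||b|| = 1.
docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query)
euclid_n = np.linalg.norm(docs_n - query_n, axis=1)
cosine_n = docs_n @ query_n
assert np.array_equal(np.argsort(euclid_n), np.argsort(-cosine_n))
```

This is the kind of five-minute experiment Claude will happily push you to run instead of just telling you the answer.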
HNSW (Hierarchical Navigable Small Worlds) is a very clever algorithm, but the smart decisions get lost when an AI assistant chooses it for you. It is both more and less than a skip list, and that is hard to get from a two-line summary.
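The skip-list resemblance is easiest to see in the search path. This is a deliberately stripped-down sketch of my own, not real HNSW: the `greedy_search` helper and the toy two-layer graph are invented for illustration, and a real implementation keeps a beam of candidates (the `ef` parameter) rather than a single current node:

```python
import numpy as np

def greedy_search(graph, vectors, entry, query):
    """Greedy descent on one layer: hop to whichever neighbor is
    closest to the query, stopping when no neighbor improves."""
    current = entry
    current_dist = np.linalg.norm(vectors[current] - query)
    improved = True
    while improved:
        improved = False
        for neighbor in graph[current]:
            d = np.linalg.norm(vectors[neighbor] - query)
            if d < current_dist:
                current, current_dist = neighbor, d
                improved = True
    return current

# Toy data: five points on a line, a dense base layer, and a sparse
# upper layer that lets the search take long hops first.
vectors = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
base = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
upper = {0: [4], 4: [0]}

query = np.array([3.2])
entry = greedy_search(upper, vectors, 0, query)      # coarse hop on the sparse layer
result = greedy_search(base, vectors, entry, query)  # refine on the dense layer
print(result)  # → 3, the nearest point to 3.2
```

Each layer's result becomes the entry point for the layer below, exactly like dropping down levels in a skip list; the part that is "more" than a skip list is that each layer is a navigable graph, not a sorted linked list.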
With a fixed course or a traditional textbook, you can’t push back on the instructor when you want a visualization to confirm your intuition. Claude, on the other hand, is more than happy to go down that rabbit hole.
If you have to stop working on it, then stop. It won’t judge you. When you come back and need a reminder of what you were trying to do, Claude doesn’t mind that either. Try to implement it, try something new, throw it away, and ask Claude to start the module over.
It is frustrating and annoying to have an AI assistant that pushes you to understand and show your work. It is tedious to write a benchmark script to confirm the code actually performs the way you think it does. It is a pain to struggle with the syntax (looking at you, NumPy) when you think you are close enough. But these are exactly the points where you learn something new that will hopefully stick with you.
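For what it's worth, the tedious benchmark doesn't have to be elaborate. Here is a minimal sketch of the shape mine take, assuming toy random data and the usual recipe: time an exact brute-force pass, keep its results as ground truth, and measure any approximate index against it with recall@k. The sizes and names here are my own, not from any library:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 64)).astype(np.float32)
queries = rng.normal(size=(100, 64)).astype(np.float32)
k = 10

def brute_force_topk(data, q, k):
    # Exact k nearest neighbors by squared Euclidean distance.
    dists = ((data - q) ** 2).sum(axis=1)
    return np.argpartition(dists, k)[:k]

start = time.perf_counter()
truth = [set(brute_force_topk(data, q, k)) for q in queries]
elapsed = time.perf_counter() - start
print(f"exact search: {elapsed * 1000 / len(queries):.2f} ms/query")

# For any approximate index you build, compare its answers against
# the exact ones: recall@k = mean over queries of |approx ∩ truth| / k.
```

It is boring to write, and it is also the moment you find out whether your LSH or PQ index is actually returning the right neighbors or just returning quickly.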
It is working for me. Try it out, or use this idea for another topic. My next course is on recommendation engines.