Dewel Insights: Exploring the Future of AI and Technology

10 January 2025

Latent Space Reading Compilation: Top 50 Must-Reads for AI Engineers


The 2025 AI Engineer Reading List

Thank you to Latent Space, who put together this interesting reading compilation.

1: Frontier LLMs

2: Benchmarks and Evals

3: Prompting, ICL & Chain of Thought

4: Retrieval Augmented Generation

5: Agents

6: Code Generation

7: Vision

8: Voice

9: Image/Video Diffusion

10: Finetuning

 

1: Frontier LLMs

 

GPT1, GPT2, GPT3, Codex, InstructGPT, and GPT4 papers. Self-explanatory. GPT3.5, 4o, o1, and o3 tended to have launch events and system cards instead.

Claude 3 and Gemini 1 papers to understand the competition. Latest iterations are Claude 3.5 Sonnet and Gemini 2.0 Flash/Flash Thinking. Also Gemma 2.

LLaMA 1, Llama 2, and Llama 3 papers to understand the leading open models. You can also view Mistral 7B, Mixtral, and Pixtral as a branch on the Llama family tree.

DeepSeek V1, Coder, MoE, V2, and V3 papers. Leading (relatively) open model lab.

Apple Intelligence paper. It’s on every Mac and iPhone.

 

2: Benchmarks and Evals

 

MMLU paper - the main knowledge benchmark, next to GPQA and BIG-Bench. In 2025 frontier labs use MMLU Pro, GPQA Diamond, and BIG-Bench Hard.

MuSR paper - evaluating long context, next to LongBench, BABILong, and RULER. Solving Lost in The Middle and other issues with Needle in a Haystack.
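To make the long-context evals concrete, here is a toy sketch of the basic Needle in a Haystack setup; the filler text, needle, and depth are invented for illustration, and suites like RULER and BABILong vary length, depth, and distractors far more systematically.

```python
# Toy needle-in-a-haystack probe: bury one fact at a chosen depth in filler
# text, then ask the model to retrieve it.
filler = "The sky was grey and nothing much happened that day. " * 2000
needle = "The secret passphrase is 'blue pelican'."
depth = 0.5  # fraction of the way into the context where the needle is placed

cut = int(len(filler) * depth)
haystack = filler[:cut] + needle + " " + filler[cut:]
prompt = haystack + "\n\nWhat is the secret passphrase? Answer with the phrase only."
print(len(prompt), "characters of context, needle placed at depth", depth)
```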

MATH paper - a compilation of math competition problems. Frontier labs focus on FrontierMath and hard subsets of MATH: MATH Level 5, AIME, AMC10/AMC12.

IFEval paper - the leading instruction following eval and only external benchmark adopted by Apple. You could also view MT-Bench as a form of IF.

ARC AGI challenge - a famous abstract reasoning “IQ test” benchmark that has lasted far longer than many quickly saturated benchmarks.

 

 

3: Prompting, ICL & Chain of Thought

 

Note: The GPT3 paper ("Language Models are Few-Shot Learners") should already have introduced In-Context Learning (ICL) - a close cousin of prompting. We also consider prompt injections required knowledge (Lilian Weng, Simon W.).
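As a quick, non-authoritative illustration of ICL: a handful of labelled examples in the prompt steer the model with no weight updates. The reviews below are made up, and the completion call is left abstract.

```python
# Minimal in-context learning sketch: the "training data" lives entirely in
# the prompt. Any chat/completions client could consume this string.
few_shot_prompt = """Classify each review as positive or negative.

Review: "The battery lasts all day." -> positive
Review: "The screen cracked within a week." -> negative
Review: "Setup took five minutes and everything just worked." ->"""

print(few_shot_prompt)  # send to your LLM of choice; the examples do the teaching
```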

The Prompt Report paper - a survey of prompting papers (podcast).

Chain-of-Thought paper - one of multiple claimants to popularizing Chain of Thought, along with Scratchpads and Let's Think Step By Step.
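A minimal sketch of the zero-shot variant ("Let's think step by step"), assuming any generic completion endpoint; the question is just an example.

```python
# Zero-shot Chain of Thought: append a reasoning trigger to the question
# instead of providing worked examples.
question = "A farmer has 17 sheep. All but 9 run away. How many are left?"

direct_prompt = question                                # baseline prompt
cot_prompt = f"{question}\nLet's think step by step."   # CoT-triggered version

print(cot_prompt)
```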

Tree of Thought paper - introducing lookaheads and backtracking (podcast).

Prompt Tuning paper - you may not need prompts - if you can do Prefix-Tuning, adjust decoding (say via entropy), or representation engineering.

Automatic Prompt Engineering paper - it is increasingly obvious that humans are terrible zero-shot prompters and prompting itself can be enhanced by LLMs. The most notable implementation of this is in the DSPy paper/framework.
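A rough sketch of the automatic-prompting loop these papers share, with `llm` standing in for any completion function (hypothetical, not DSPy's actual API): have the model propose candidate instructions from a few examples, score them on a small dev set, and keep the best.

```python
def ape_search(llm, dev_set, n_candidates=8):
    # Ask the model to propose instructions from a few input/output pairs.
    seed = "Write an instruction that makes a model solve tasks like these:\n"
    seed += "\n".join(f"Input: {x} -> Output: {y}" for x, y in dev_set[:3])
    candidates = [llm(seed) for _ in range(n_candidates)]

    # Score each candidate instruction by exact-match accuracy on the dev set.
    def score(instruction):
        hits = sum(llm(f"{instruction}\nInput: {x}\nOutput:").strip() == y
                   for x, y in dev_set)
        return hits / len(dev_set)

    return max(candidates, key=score)
```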

Section 3 is one area where reading disparate papers may not be as useful as having more practical guides - we recommend Lilian Weng, Eugene Yan, and Anthropic's Prompt Engineering Tutorial and AI Engineer Workshop.

 

4: Retrieval Augmented Generation

 

Introduction to Information Retrieval - a bit unfair to recommend a book, but we are trying to make the point that RAG is an IR problem, and IR has a 60-year history that includes TF-IDF, BM25, FAISS, HNSW and other "boring" techniques.
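A minimal sketch of that "boring" baseline using scikit-learn (assumed installed); the documents and query are invented for illustration.

```python
# TF-IDF retrieval: vectorize documents, vectorize the query, rank by cosine
# similarity. BM25 and dense indexes (FAISS/HNSW) slot into the same pattern.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "BM25 is a bag-of-words ranking function used by search engines.",
    "HNSW builds a graph index for approximate nearest neighbour search.",
    "RAG feeds retrieved passages into the context window of an LLM.",
]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)

query_vector = vectorizer.transform(["how does ranking with BM25 work"])
scores = cosine_similarity(query_vector, doc_vectors)[0]
print(sorted(zip(scores, docs), reverse=True)[0])  # best-matching document first
```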

2020 Meta RAG paper - which coined the term. The original authors have started Contextual and have coined RAG 2.0. Modern "table stakes" for RAG (HyDE, chunking, rerankers, multimodal data) are better presented elsewhere.
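The core mechanic the paper introduced is simple enough to sketch: retrieve some chunks with any retriever (e.g. the TF-IDF example above) and stuff them into the prompt. Everything below is illustrative, not a specific framework's API.

```python
# Minimal RAG prompt assembly: retrieved chunks become numbered context, and
# the model is asked to answer only from that context. The generation call
# itself is left abstract.
def build_rag_prompt(question, retrieved_chunks):
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved_chunks))
    return (
        "Answer the question using only the context below. Cite sources as [n].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

print(build_rag_prompt(
    "What is HNSW used for?",
    ["HNSW builds a graph index for approximate nearest neighbour search."],
))
```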

MTEB: Massive Text Embedding Benchmark paper - the de-facto leader, with known issues. Many embeddings have papers - pick your poison - SentenceTransformers, OpenAI, Nomic Embed, Jina v3, cde-small-v1, ModernBERT Embed - with Matryoshka embeddings increasingly standard.
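A sketch of dense retrieval with sentence-transformers; the model name is just a common default rather than a recommendation, and the truncation step only preserves quality for models actually trained with a Matryoshka objective. It is shown here purely for the mechanics.

```python
# Embed documents and a query, optionally truncate Matryoshka-style to the
# first k dimensions (re-normalising afterwards), then rank by dot product.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = [
    "MTEB benchmarks embedding models across many retrieval and STS tasks.",
    "Rerankers reorder retrieved candidates with a cross-encoder.",
]
emb = model.encode(docs, normalize_embeddings=True)

k = 128                                                  # Matryoshka-style truncation
emb_small = emb[:, :k]
emb_small /= np.linalg.norm(emb_small, axis=1, keepdims=True)

query = model.encode(["what does MTEB measure"], normalize_embeddings=True)[:, :k]
query /= np.linalg.norm(query, axis=1, keepdims=True)
print(docs[int(np.argmax(query @ emb_small.T))])         # best match
```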

GraphRAG paper - Microsoft's take on adding knowledge graphs to RAG, now open sourced. One of the most popular trends in RAG in 2024, alongside ColBERT/ColPali/ColQwen (more in the Vision section).

RAGAS paper - the simple RAG eval recommended by OpenAI. See also Nvidia FACTS framework and Extrinsic Hallucinations in LLMs - Lilian Weng’s survey of causes/evals for hallucinations (see also Jason Wei on recall vs precision).

RAG is the bread and butter of AI Engineering at work in 2024, so there are a LOT of industry resources and practical experience you will be expected to have. LlamaIndex (course) and LangChain (video) have perhaps invested the most in educational resources. You should also be familiar with the perennial RAG vs Long Context debate.

 

5: Agents

 

SWE-Bench paper (our podcast) - after adoption by Anthropic, Devin and OpenAI, probably the highest profile agent benchmark today (vs WebArena or SWE-Gym). Technically a coding benchmark, but more a test of agents than raw LLMs. See also SWE-Agent, SWE-Bench Multimodal and the Konwinski Prize.

ReAct paper (our podcast) - ReAct started a long line of research on tool-using and function-calling LLMs, including Gorilla and the BFCL Leaderboard. Of historical interest - Toolformer and HuggingGPT.
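A bare-bones ReAct-style loop, sketched with a hypothetical `llm` completion function and a tiny tool registry; real implementations layer schemas, parsing, and safety on top of the same idea.

```python
# ReAct loop: the model alternates Thought/Action, the harness runs the
# action and feeds the Observation back until a final answer appears.
import re

TOOLS = {"calculator": lambda expr: str(eval(expr))}  # toy tool; eval is unsafe for untrusted input

def react(llm, question, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")            # model emits Thought and maybe an Action
        transcript += f"Thought:{step}\n"
        match = re.search(r"Action: (\w+)\[(.*)\]", step)
        if not match:                                  # no action requested -> treat as final answer
            return step
        tool, arg = match.groups()
        observation = TOOLS[tool](arg)                 # execute the tool call
        transcript += f"Observation: {observation}\n"  # feed the result back into context
    return transcript
```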

MemGPT paper - one of many notable approaches to emulating long running agent memory, adopted by ChatGPT and LangGraph. Versions of these are reinvented in every agent system from MetaGPT to AutoGen to Smallville.
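A sketch of the memory pattern these systems share: a bounded working context plus an archival store the agent pages things in and out of. The class below is illustrative only, not the MemGPT API.

```python
# Two-tier agent memory: "working" items go into the prompt, overflow is
# evicted to an archive that can be searched back in on demand.
class AgentMemory:
    def __init__(self, max_working_items=8):
        self.working = []                 # what actually fits in the prompt
        self.archive = []                 # overflow, searchable later
        self.max_working_items = max_working_items

    def remember(self, note):
        self.working.append(note)
        if len(self.working) > self.max_working_items:
            self.archive.append(self.working.pop(0))   # evict oldest to archive

    def recall(self, keyword):
        return [n for n in self.archive if keyword.lower() in n.lower()]
```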

Voyager paper - Nvidia's take on 3 cognitive architecture components (curriculum, skill library, sandbox) to improve performance. More abstractly, the skill library and curriculum can be seen as a form of Agent Workflow Memory.

Anthropic on Building Effective Agents - just a great state-of-2024 recap that focuses on the importance of chaining, routing, parallelization, orchestration, evaluation, and optimization. See also Lilian Weng's Agents (ex OpenAI), Shunyu Yao on LLM Agents (now at OpenAI) and Chip Huyen's Agents.
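As one example of those workflow patterns, here is a hedged sketch of routing: a cheap classification call picks which specialised prompt handles the request. `llm` is again a placeholder completion function, and the route labels are invented.

```python
# Routing workflow: classify first, then dispatch to a specialised prompt.
ROUTES = {
    "billing": "You are a billing specialist. Resolve the invoice issue:\n",
    "technical": "You are a support engineer. Debug the problem:\n",
    "other": "You are a general assistant. Help with:\n",
}

def route_and_answer(llm, user_message):
    label = llm(
        "Classify this message as billing, technical, or other. "
        f"Reply with one word.\nMessage: {user_message}"
    ).strip().lower()
    prompt = ROUTES.get(label, ROUTES["other"]) + user_message
    return llm(prompt)
```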

 

6: Code Generation

 

The Stack paper - the original open dataset twin of The Pile focused on code, starting a great lineage of open codegen work from The Stack v2 to StarCoder.

Open Code Model papers - choose from DeepSeek-Coder, Qwen2.5-Coder, or CodeLlama. Many regard 3.5 Sonnet as the best code model but it has no paper.

HumanEval/Codex paper - This is a saturated benchmark, but is required knowledge for the code domain. SWE-Bench is more famous for coding now, but is expensive/evals agents rather than models. Modern replacements include Aider, Codeforces, BigCodeBench, LiveCodeBench and SciCode.
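The metric the paper popularised is worth knowing cold: pass@k, estimated from n samples per problem of which c pass the unit tests. Below is the unbiased estimator as described in the paper; the numbers in the example call are made up.

```python
# pass@k: probability that at least one of k sampled completions passes,
# estimated without bias from n samples with c passes.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # every size-k subset must contain a passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=3, k=1))   # 0.15 for this made-up example
```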

AlphaCodeium paper - Google published AlphaCode and AlphaCode2 which did very well on programming problems, but here is one way Flow Engineering can add a lot more performance to any given base model.

CriticGPT paper - LLMs are known to generate code that can have security issues. OpenAI trained CriticGPT to spot them, and Anthropic uses SAEs to identify LLM features that cause this, but it is a problem you should be aware of.

CodeGen is another field where much of the frontier has moved from research to industry, and practical engineering advice on codegen and code agents like Devin is only found in industry blogposts and talks rather than research papers.

 

7: Vision

 

 

8: Voice

Whisper paper - the successful ASR model from Alec Radford. Whisper v2, v3 and distil-whisper and v3 Turbo are open weights but have no paper.
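Since the weights are open, trying it takes a few lines with the openai-whisper package (assumed installed via pip); the model size and file name below are placeholders.

```python
# Transcribe a local audio file with the open-source Whisper checkpoint.
import whisper

model = whisper.load_model("base")          # "small", "medium", "large" trade speed for accuracy
result = model.transcribe("meeting.mp3")    # replace with your own audio file
print(result["text"])
```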

9: Image/Video Diffusion

 

10: Finetuning

 

 

 
