(query, document) pair directly rather than comparing independent embeddings, which produces more accurate ordering at the cost of one extra inference per document. Applying a reranker on top of vector search (retrieve top-20 via embeddings, rerank down to top-5) is one of the highest-impact quality improvements for a RAG pipeline, and it runs locally on CPU for free when you use a small cross-encoder from Hugging Face.
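The two-stage flow described above can be sketched in plain Python. This is a toy illustration, not the LangChain implementation: `word_overlap` stands in for embedding similarity, `phrase_bonus` stands in for a cross-encoder, and both function names are hypothetical.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def word_overlap(query: str, doc: str) -> float:
    """Toy stand-in for embedding similarity: number of shared words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_bonus(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: looks at the query and document together."""
    bonus = 2.0 if query.lower() in doc.lower() else 0.0
    return word_overlap(query, doc) + bonus


def retrieve_then_rerank(
    query: str,
    documents: List[str],
    cheap_score: Scorer,
    pair_score: Scorer,
    fetch_k: int = 20,
    top_n: int = 5,
) -> List[str]:
    # Stage 1: cheap scoring over the whole corpus, keep the best fetch_k.
    candidates = sorted(documents, key=lambda d: cheap_score(query, d), reverse=True)[:fetch_k]
    # Stage 2: one pairwise scoring call per surviving candidate, keep the best top_n.
    scored = [(pair_score(query, d), d) for d in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]
```

The expensive scorer only ever sees `fetch_k` documents, which is why the extra cost stays bounded no matter how large the corpus is.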
This guide shows how to combine HuggingFaceCrossEncoder with LangChain’s CrossEncoderReranker and ContextualCompressionRetriever. The pattern works with any cross-encoder model on Hugging Face, including BAAI/bge-reranker-*, mixedbread-ai/mxbai-rerank-*, Alibaba-NLP/gte-multilingual-reranker-*, Qwen/Qwen3-Reranker-*, and the classic cross-encoder/ms-marco-* family.
Setup
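As a quick sanity check before running the rest of the guide, the sketch below reports which packages still need installing. The package list is an assumption (`langchain-community` provides `HuggingFaceCrossEncoder`, and `sentence-transformers` loads the model); extend it to match your own vector store and embedding choices. `missing_packages` is a hypothetical helper name.

```python
import importlib.util
from typing import Dict, List

# Import name -> pip package name. Assumed list; adjust for your stack.
REQUIRED: Dict[str, str] = {
    "langchain_community": "langchain-community",
    "sentence_transformers": "sentence-transformers",
}


def missing_packages(required: Dict[str, str] = REQUIRED) -> List[str]:
    """Return the pip names of required packages that cannot be imported."""
    return [
        pip_name
        for module_name, pip_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Install with: pip install -U " + " ".join(missing))
    else:
        print("All required packages are installed.")
```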
Build a base retriever
Start with a standard vector store retriever. Set a relatively large k (for example, 20) so the reranker has enough candidates to narrow down.
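One possible base retriever is sketched below. The vector store (`InMemoryVectorStore`) and the embedding model (`sentence-transformers/all-MiniLM-L6-v2` via the `langchain-huggingface` package) are assumptions chosen for illustration, and `build_base_retriever` is a hypothetical helper name; any vector store that exposes `as_retriever` should fit the same pattern.

```python
from typing import Any, List


def build_base_retriever(texts: List[str], k: int = 20) -> Any:
    """Index texts in memory and return a retriever that fetches the top k."""
    # Imports are local so the heavy optional dependencies load only on use.
    from langchain_core.vectorstores import InMemoryVectorStore
    from langchain_huggingface import HuggingFaceEmbeddings

    # Assumed embedding model; swap in whichever model you already use.
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    vector_store = InMemoryVectorStore.from_texts(texts, embedding=embeddings)
    # A large k gives the reranker a wider pool of candidates to reorder.
    return vector_store.as_retriever(search_kwargs={"k": k})
```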
Rerank with a cross-encoder
CrossEncoderReranker wraps any cross-encoder and plugs into ContextualCompressionRetriever.
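A minimal sketch of that wiring follows. Import paths differ between LangChain releases, so the fallback below is an assumption that covers both the `langchain-classic` package and older `langchain` versions; `build_reranking_retriever` is a hypothetical helper name, and the default model is simply one of those listed in this guide.

```python
from typing import Any


def build_reranking_retriever(
    base_retriever: Any,
    model_name: str = "BAAI/bge-reranker-v2-m3",
    top_n: int = 5,
) -> Any:
    """Wrap base_retriever so its results are reranked by a cross-encoder."""
    # Imports are local so the heavy optional dependencies load only on use.
    try:
        # Newer releases keep these classes in the langchain-classic package.
        from langchain_classic.retrievers import ContextualCompressionRetriever
        from langchain_classic.retrievers.document_compressors import (
            CrossEncoderReranker,
        )
    except ImportError:
        # Older releases expose them from the langchain package itself.
        from langchain.retrievers import ContextualCompressionRetriever
        from langchain.retrievers.document_compressors import CrossEncoderReranker
    from langchain_community.cross_encoders import HuggingFaceCrossEncoder

    cross_encoder = HuggingFaceCrossEncoder(model_name=model_name)
    # top_n is how many documents survive the reranking step.
    compressor = CrossEncoderReranker(model=cross_encoder, top_n=top_n)
    return ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=base_retriever,
    )
```

Calling `invoke("your question")` on the returned retriever should run the base retriever first and then return only the `top_n` highest-scoring documents.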
Picking a cross-encoder
| Model | Size (parameters) | Notes |
|---|---|---|
| cross-encoder/ms-marco-MiniLM-L6-v2 | 22M | Fastest; English only, 2022-era baseline |
| BAAI/bge-reranker-v2-m3 | 568M | Multilingual, strong default for most workloads |
| mixedbread-ai/mxbai-rerank-large-v2 | 1.5B | Top-tier English quality, GPU recommended |
| Alibaba-NLP/gte-multilingual-reranker-base | 306M | Multilingual, 8192-token context |
| Qwen/Qwen3-Reranker-0.6B | 595M | Instruction-aware, multilingual |
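If you want to narrow the table down programmatically, a small filter like the one below may help. It is only a sketch: the sizes are copied from the table, the `multilingual` flag is set only where the Notes column says so, and `shortlist` is a hypothetical helper that filters by size and language without ranking by quality.

```python
from typing import Dict, List, Union

# Sizes are in millions of parameters, copied from the table above.
RERANKERS: List[Dict[str, Union[str, int, bool]]] = [
    {"name": "cross-encoder/ms-marco-MiniLM-L6-v2", "params_m": 22, "multilingual": False},
    {"name": "BAAI/bge-reranker-v2-m3", "params_m": 568, "multilingual": True},
    {"name": "mixedbread-ai/mxbai-rerank-large-v2", "params_m": 1500, "multilingual": False},
    {"name": "Alibaba-NLP/gte-multilingual-reranker-base", "params_m": 306, "multilingual": True},
    {"name": "Qwen/Qwen3-Reranker-0.6B", "params_m": 595, "multilingual": True},
]


def shortlist(max_params_m: int, need_multilingual: bool = False) -> List[str]:
    """Return model names that fit the size budget, smallest first."""
    fits = [
        m for m in RERANKERS
        if m["params_m"] <= max_params_m
        and (m["multilingual"] or not need_multilingual)
    ]
    return [str(m["name"]) for m in sorted(fits, key=lambda m: m["params_m"])]
```

Benchmark the shortlisted models on your own queries before committing to one, since parameter count alone does not determine ranking quality.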
HuggingFaceCrossEncoder auto-selects the best available device (CUDA > MPS > CPU). To pin to a specific device, pass model_kwargs={"device": "cpu"} or similar.
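Device pinning could look like the sketch below. `cross_encoder_kwargs` and `load_cross_encoder` are hypothetical helper names; the only library-specific parts are the `model_name` and `model_kwargs` arguments described above.

```python
from typing import Any, Dict, Optional


def cross_encoder_kwargs(model_name: str, device: Optional[str] = None) -> Dict[str, Any]:
    """Build constructor arguments, pinning the device only when one is given."""
    kwargs: Dict[str, Any] = {"model_name": model_name}
    if device is not None:
        # Leaving model_kwargs out keeps the automatic device selection.
        kwargs["model_kwargs"] = {"device": device}
    return kwargs


def load_cross_encoder(model_name: str, device: Optional[str] = None) -> Any:
    """Load a cross-encoder, optionally pinned to a device such as "cpu"."""
    # Local import so the heavy dependency loads only when a model is needed.
    from langchain_community.cross_encoders import HuggingFaceCrossEncoder

    return HuggingFaceCrossEncoder(**cross_encoder_kwargs(model_name, device))
```

For example, `load_cross_encoder("BAAI/bge-reranker-v2-m3", device="cpu")` would keep inference on the CPU even when a GPU is present.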
Deploying to SageMaker
You can also host a cross-encoder on a SageMaker endpoint and use SagemakerEndpointCrossEncoder. Here is a sample inference.py that loads the model on the fly (no model.tar.gz artifacts required). See this walkthrough for step-by-step guidance.
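A rough sketch of such an inference.py follows, written under several assumptions: the endpoint uses the `model_fn` / `transform_fn` handler convention, requests arrive as JSON with a `text_pairs` key, and responses carry a `scores` key. The `CROSS_ENCODER_MODEL_ID` environment variable and the default model are hypothetical. Confirm the payload shape against the content handler in your installed `SagemakerEndpointCrossEncoder` before relying on it.

```python
import json
import os
from typing import Any

# Hypothetical default; override via the (assumed) environment variable below.
DEFAULT_MODEL_ID = "BAAI/bge-reranker-v2-m3"


def model_fn(model_dir: str) -> Any:
    """Load the cross-encoder from the Hugging Face Hub, ignoring model_dir."""
    # Deferred import: sentence-transformers is only needed inside the endpoint.
    from sentence_transformers import CrossEncoder

    model_id = os.environ.get("CROSS_ENCODER_MODEL_ID", DEFAULT_MODEL_ID)
    return CrossEncoder(model_id)


def transform_fn(model: Any, input_data: Any, content_type: str, accept: str) -> str:
    """Score each (query, document) pair and return the scores as JSON."""
    payload = json.loads(input_data)
    # Assumed request shape: {"text_pairs": [[query, document], ...]}
    pairs = [tuple(pair) for pair in payload["text_pairs"]]
    scores = model.predict(pairs)
    # Convert to plain floats so array-like results serialize cleanly.
    return json.dumps({"scores": [float(score) for score in scores]})
```

Because `model_fn` downloads the weights by ID at startup, the endpoint needs outbound access to the Hugging Face Hub, and cold starts will take longer than with a packaged model.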