Recursive text splitting: Porting LangChain concepts to Go
When building a RAG pipeline, how you split your text is just as important as how you retrieve it. After extracting text from my PDFs, the next big challenge was chunking. If your chunks are too small, you lose context. If they’re too big, you hit the LLM’s token limit.
I’ve used LangChain in Python before, and I really liked their RecursiveCharacterTextSplitter. Since I'm doing most of my heavy lifting in Go, I decided to port that logic over. It’s a fun recursive algorithm that tries to split text at the most "semantically meaningful" points first (like paragraphs), falling back to smaller separators (like spaces or individual characters) only when necessary.
The Strategy
The idea is to give the splitter a list of separators in order of importance: "\n\n" (paragraphs), "\n" (lines), " " (words), and "" (individual characters, as a last resort).
- It tries to split by the first separator.
- If a chunk is still larger than the ByteLimit, it recursively calls itself on that chunk using the next separator in the list.
- It also handles overlap, so the end of Chunk A is the start of Chunk B, preserving context across the boundaries.
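The splitter takes its configuration through an args struct. The actual definition isn't shown in this post, but based on how the fields are used below, it looks roughly like this (the field types are implied by the code; the default values in SetDefaults are my own illustrative guesses):

```go
package main

import "fmt"

// RecursiveSplitterArgs configures the splitter. Sketch inferred from usage;
// the real definition may differ.
type RecursiveSplitterArgs struct {
	Text        string
	ByteLimit   *int     // max chunk size in bytes
	ByteOverlap *int     // bytes of trailing context carried into the next chunk
	Separators  []string // tried in order; "" means split per character
}

// SetDefaults fills in any unset fields. The numbers here are assumptions,
// not the author's actual defaults.
func (a *RecursiveSplitterArgs) SetDefaults() {
	if a.ByteLimit == nil {
		limit := 1000
		a.ByteLimit = &limit
	}
	if a.ByteOverlap == nil {
		overlap := 100
		a.ByteOverlap = &overlap
	}
	if len(a.Separators) == 0 {
		a.Separators = []string{"\n\n", "\n", " ", ""}
	}
}

func main() {
	args := RecursiveSplitterArgs{Text: "hello world"}
	args.SetDefaults()
	fmt.Println(*args.ByteLimit, *args.ByteOverlap, args.Separators[0] == "\n\n")
	// → 1000 100 true
}
```

Using pointers for ByteLimit and ByteOverlap lets SetDefaults distinguish "not set" from "explicitly set to zero".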
Here is the implementation I came up with:
// RecursiveSplitter attempts to divide text semantically using a hierarchy of separators.
func RecursiveSplitter(args RecursiveSplitterArgs) []string {
	args.SetDefaults()

	// Base case: the whole text already fits in one chunk.
	if len(args.Text) <= *args.ByteLimit {
		return []string{args.Text}
	}

	// Pick the first separator that actually appears in the text.
	var separator string
	var remainingSeps []string
	found := false
	for i, sep := range args.Separators {
		if strings.Contains(args.Text, sep) {
			separator = sep
			remainingSeps = args.Separators[i+1:]
			found = true
			break
		}
	}
	if !found {
		// Fallback to a simpler chunker if no separators match
		return ByteOverlapChunker(args.Text, *args.ByteLimit, *args.ByteOverlap)
	}

	parts := strings.Split(args.Text, separator)
	var finalChunks []string
	var currentDoc strings.Builder

	for _, part := range parts {
		// Flush the current chunk if adding this part would overflow it.
		if currentDoc.Len()+len(part)+len(separator) > *args.ByteLimit {
			if currentDoc.Len() > 0 {
				finalChunks = append(finalChunks, currentDoc.String())
				// Handle overlap: carry the tail of this chunk into the next one.
				// (Byte-based slicing; this can split a multi-byte UTF-8 rune.)
				byteOverlapText := currentDoc.String()
				if len(byteOverlapText) > *args.ByteOverlap {
					byteOverlapText = byteOverlapText[len(byteOverlapText)-*args.ByteOverlap:]
				}
				currentDoc.Reset()
				currentDoc.WriteString(byteOverlapText)
			}
		}
		if len(part) > *args.ByteLimit {
			// Recursively split the oversized part using the next separators.
			subChunks := RecursiveSplitter(RecursiveSplitterArgs{
				Text:        part,
				ByteLimit:   args.ByteLimit,
				ByteOverlap: args.ByteOverlap,
				Separators:  remainingSeps,
			})
			finalChunks = append(finalChunks, subChunks...)
		} else {
			if currentDoc.Len() > 0 && !strings.HasSuffix(currentDoc.String(), separator) {
				currentDoc.WriteString(separator)
			}
			currentDoc.WriteString(part)
		}
	}
	if currentDoc.Len() > 0 {
		finalChunks = append(finalChunks, currentDoc.String())
	}
	return finalChunks
}
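The ByteOverlapChunker that the code falls back on isn't shown above. A minimal version is just a sliding byte window: step forward by limit minus overlap each time, so adjacent chunks share a tail. This is my sketch of what it could look like, not necessarily the implementation from my codebase:

```go
package main

import "fmt"

// ByteOverlapChunker is a brute-force fallback: it slices the text into
// byteLimit-sized windows, stepping back byteOverlap bytes each time so
// adjacent chunks share context. Like the recursive splitter's overlap
// logic, it works on bytes and can cut a multi-byte UTF-8 rune in half;
// a production version should back up to a rune boundary.
func ByteOverlapChunker(text string, byteLimit, byteOverlap int) []string {
	if len(text) <= byteLimit {
		return []string{text}
	}
	step := byteLimit - byteOverlap
	if step <= 0 {
		step = byteLimit // guard against an infinite loop if overlap >= limit
	}
	var chunks []string
	for start := 0; start < len(text); start += step {
		end := start + byteLimit
		if end >= len(text) {
			chunks = append(chunks, text[start:])
			break
		}
		chunks = append(chunks, text[start:end])
	}
	return chunks
}

func main() {
	fmt.Println(ByteOverlapChunker("abcdefghij", 4, 2))
	// → [abcd cdef efgh ghij]
}
```

Each 4-byte chunk repeats the last 2 bytes of its predecessor, which is exactly the context-preserving behavior the recursive path relies on when it gives up on separators.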
Why Go?
Porting this to Go made a noticeable difference in my pipeline's performance. Go’s strings.Builder and slice management are extremely efficient for this kind of repetitive string manipulation. Plus, having it native to my Go backend means one less dependency on a Python process for the "non-AI" parts of my project.
It's not as "mature" as the LangChain version yet, but it gives me great splitting results and fits perfectly into my existing architecture. It reminds me that even if a library exists in another language, sometimes building it yourself in your primary language is worth the effort for the control and speed you gain.