In 2025, there is perhaps no organization under the sun that isn't making data-driven decisions, or at least claiming to. At Omnisend, data isn't just a claim: it's the foundation upon which the rest of the house is built.
DataOps & Insights is the source of truth for our decisions: operational reporting, predictive analytics, and self-serve data. Our CI/CD pipeline enforces quality so product teams can move fast without breaking trust.
However, after recently joining the team as a Software Engineer, I noticed a friction common to many data teams: the mechanics of the work were slowing down the logic. We needed to optimize not just our code, but our workflows. After extensive trials, we identified high-ROI applications for Large Language Models (LLMs) that compress our time-to-insight from days to minutes.
Here's how we did it.
1. The analytics gap: Speed vs. quality
The real challenge in modern data engineering isn't writing SQL; it's closing the gap between a sharp business question and a reliable answer before the opportunity window closes.
At Omnisend, we realized that while our logic was sound, the "chores" of data modeling were creating a bottleneck.
The friction: Context switching and boilerplate
Building a robust data model requires constant context switching: jumping between dbt conventions, YAML configurations, testing suites, and documentation standards. We faced repetitive scaffolding tasks across every layer of our transformation pipeline (staging → dims → facts → marts).
Each context switch introduced a chance for error, and maintaining consistency across environments became increasingly fragile. We needed a way to automate the rigorous, repetitive parts of the job so our analysts could focus on architecture rather than typing.
The solution: Context-aware modeling with Cursor
We turned to Cursor, an AI-powered code editor. Unlike standard autocomplete tools, Cursor indexes our entire repository, allowing it to understand the specific context of our project structure, data lineage, and naming conventions.
We set up the environment to support the AI:
- Repo indexing: Cursor indexed our data models and documentation, giving it a "map" of our data warehouse
- Guardrails and prompts: We established well-scoped prompts aligned with our SQL style
- Inline reviews: The AI flags anti-patterns, like CTEs that break incremental models or fan-out joins, before a Pull Request (PR) is even raised
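To make the guardrails concrete, here is a minimal sketch of what a Cursor project rules file can look like. The filename and every convention listed are illustrative examples of this kind of rule, not our actual configuration:

```markdown
---
description: dbt SQL conventions for the warehouse repo (illustrative)
alwaysApply: true
---
- Follow the layer hierarchy strictly: staging → dims → facts → marts.
- Staging models are named stg_<source>__<entity> and only rename and cast columns.
- Never reference source() outside the staging layer.
- Flag CTEs that would break incremental models and joins that can fan out rows.
- Every new model needs a schema.yml entry with column descriptions and tests.
```

Rules like these travel with the repository, so every analyst's editor enforces the same conventions without anyone re-typing them into prompts.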
Implementation: From hours to minutes
With these guardrails in place, the workflow shifted dramatically. When an analyst defines a business requirement, Cursor generates the initial data models in seconds. It selects appropriate source tables, generates files in the correct project paths (staging/dims/facts), and even pulls column descriptions to auto-populate documentation.
What used to take hours of manual file creation is now completed in minutes. The analyst's role shifts from "writer" to "reviewer."
A note on hallucinations: It's important to be realistic: the tool isn't perfect. It makes mistakes when the token probability sequence gets "confused." However, getting 90% of the work done instantly allows us to spend our energy on the final, critical 10% of validation.
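As an illustration of the scaffolding being generated, a staging model and its documentation stub might look like this. The source, model, and column names here are hypothetical, not our actual schema:

```sql
-- models/staging/stg_shopify__orders.sql (hypothetical example)
with source as (
    select * from {{ source('shopify', 'orders') }}
),

renamed as (
    select
        id as order_id,
        customer_id,
        total_price as order_total,
        created_at
    from source
)

select * from renamed
```

```yaml
# models/staging/stg_shopify__orders.yml (hypothetical example)
models:
  - name: stg_shopify__orders
    description: One row per Shopify order, renamed and typed.
    columns:
      - name: order_id
        tests: [unique, not_null]
```

None of this is hard to write by hand; the point is that there is a lot of it, and the AI produces it consistently from the conventions it has indexed.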
The impact
- Development velocity: A 2-5x increase in model delivery speed by templating YAML and test creation
- Improved consistency: SQL and YAMLs now follow a strict standard, reducing data incidents
- Better traversability: The AI enforces a consistent hierarchy (staging > dims > facts > marts), making the codebase easier to navigate, understand, and use when creating a new data model
2. Fewer review cycles, fewer incidents
Once modeling is done locally in Cursor and a PR is raised, the workflow shifts from creation to validation. This is where we hand the baton to Gemini Code Assist.
The challenge: The peer review bottleneck
Peer reviews are crucial for quality, but they can become a bottleneck. A human reviewer, especially one from a different product team, might miss subtle deviations from our dbt style guide or overlook non-optimal BigQuery functions.
We faced common pain points:
- Context blindness: Struggling to understand cross-file context in large diffs
- Style drift: Inconsistent formatting making diffs harder to read
- Logic gaps: Missing subtle business logic breaks (e.g., attribution order changes) that look syntactically correct but are functionally wrong
The solution: Gemini Code Assist (with strict tuning)
We deployed Gemini Code Assist as our first line of defense. It summarizes diffs by intent, checks against a repo-specific style guide, and proposes concrete fixes.
However, out of the box, the AI was noisy. To make it useful, we had to set up the reviewer just like we set up the writer:
- Noise reduction: We tightened the .gemini/config.yaml to prioritize critical findings over nitpicks
- Context injection: We added a .gemini/styleguide.md file containing our specific dbt conventions and governance checks
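The tuning amounts to a small YAML file in the repository. This is a sketch along the lines of the Gemini Code Assist configuration format; treat the exact values as illustrative rather than our production settings:

```yaml
# .gemini/config.yaml -- illustrative tuning, not our exact settings
have_fun: false                      # no jokes in review comments
code_review:
  comment_severity_threshold: HIGH   # surface critical findings, drop nitpicks
  max_review_comments: 10            # cap noise per pull request
  pull_request_opened:
    summary: true                    # summarize the diff by intent
    code_review: true                # run a full review on open
ignore_patterns:
  - "docs/**"                        # skip prose-only changes
```

The styleguide.md sits next to it and is plain prose: the same dbt conventions the writer follows, restated as review criteria.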
Real-world optimization: The story of three CTEs
The value of a second AI opinion became clear during a recent refactor. We had a model with three duplicated Common Table Expressions (CTEs).
Cursor (the writer) flagged them but suggested an "if it ain't broke, don't fix it" approach, warning that unioning might be slower.
Gemini (the reviewer) flagged the same duplication, but recommended a concrete optimization: consolidating them into one union with a single unnest/join.
We tested the Gemini-suggested refactor. The result was a ~50% reduction in runtime. This interplay is key: the drafting AI prioritized speed, while the reviewing AI prioritized architecture.
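The shape of that refactor, sketched on an invented model (table, column, and event names are all hypothetical), was roughly: three near-identical CTEs, each scanning and unnesting the same table, replaced by one pass with a single unnest:

```sql
-- Before (simplified): three duplicated CTEs, each with its own scan + unnest
with email_opens as (
    select e.merchant_id, ev.ts
    from events e, unnest(e.payload) as ev
    where e.type = 'email_open'
),
sms_opens as (
    select e.merchant_id, ev.ts
    from events e, unnest(e.payload) as ev
    where e.type = 'sms_open'
),
push_opens as (
    select e.merchant_id, ev.ts
    from events e, unnest(e.payload) as ev
    where e.type = 'push_open'
)
select * from email_opens
union all select * from sms_opens
union all select * from push_opens

-- After: the union is pushed into one filter, so the table is scanned
-- and the array unnested exactly once
with opens as (
    select e.type, e.merchant_id, ev.ts
    from events e, unnest(e.payload) as ev
    where e.type in ('email_open', 'sms_open', 'push_open')
)
select * from opens
```

On BigQuery, collapsing duplicated scans like this is usually where the runtime win comes from: the engine reads and unnests the data once instead of three times.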
The impact
- 30–40% fewer review cycles: Gemini catches syntax and style issues before a human sees them
- 15–25% reduction in logical errors: Fewer post-merge defects tied to inconsistent logic
- Automated governance: The assistant flags PII issues and validates source-of-truth tables automatically
3. Fixing data discovery: The "where is X?" problem
As our Superset environment scaled to thousands of assets, it became a victim of its own success. A simple question like "Where can I find our monthly recurring revenue chart for M-segment clients?" required deep platform knowledge or a ping to the data team.
The solution: Embed, index, retrieve
We embedded a Chainlit chatbot directly into the Superset UI.
- Ingestion: A daily automated pipeline (via Dagster) extracts metadata from every dashboard and chart
- Indexing: Metadata is synced to a vector knowledge base on OpenAI
- Retrieval: Chainlit responds via the OpenAI Assistants API, returning ranked assets with direct links when available, or suggesting where results may be found
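The shape of the ingest-then-rank flow can be shown in a few lines. The real pipeline runs as a Dagster job and ranks with OpenAI embeddings; in this self-contained sketch, a naive keyword-overlap scorer stands in for the vector store, and all asset names, fields, and URLs are invented:

```python
# Sketch of the ingestion/retrieval shape. A keyword-overlap ranker stands in
# for the real embedding-based vector search; asset metadata is hypothetical.

def asset_to_document(asset: dict) -> str:
    """Flatten Superset chart/dashboard metadata into one searchable text blob."""
    parts = [
        asset.get("title", ""),
        asset.get("description", ""),
        " ".join(asset.get("owners", [])),
        " ".join(asset.get("columns", [])),
    ]
    return " ".join(p for p in parts if p).lower()

def rank_assets(question: str, assets: list[dict], top_k: int = 3) -> list[dict]:
    """Return the assets whose metadata overlaps most with the question."""
    q_tokens = set(question.lower().split())
    scored = [
        (len(q_tokens & set(asset_to_document(a).split())), a) for a in assets
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [a for score, a in scored[:top_k] if score > 0]

assets = [
    {"title": "Monthly recurring revenue by segment",
     "description": "MRR split by merchant segment",
     "columns": ["mrr", "segment"], "owners": ["dataops"],
     "url": "/superset/dashboard/42/"},
    {"title": "Signup funnel", "description": "Daily signups",
     "columns": ["signups"], "owners": ["growth"],
     "url": "/superset/dashboard/7/"},
]

hits = rank_assets("where is the monthly recurring revenue chart", assets)
print(hits[0]["url"])  # the MRR dashboard ranks first
```

Notice the dependency this creates: an asset with no title, description, or column docs produces an empty document and can never be retrieved, which is exactly the hygiene pressure described below.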
It all comes down to context
The power of this approach is understanding data relationships. A marketer recently asked: "How long, on average, does it take merchants to activate forms from the time of creation?"
No pre-built dashboard answered this. However, the Assistant analyzed the intent and correctly identified the relevant dataset and columns needed to calculate the answer. It transformed a "no results" dead end into a self-serve win.
The impact
- Silence is golden: A 25–40% drop in "Where is X?" pings to the DataOps team
- Forced hygiene: Because the bot relies on metadata, "undocumented" became "invisible," incentivizing the team to adopt better documentation standards
4. Scaling EDA: 76 hours of video in minutes
Some of our most valuable data isn't in a database; it's in unstructured text, such as customer conversations. We recently had 76 hours of Quarterly Business Review (QBR) recordings, a goldmine of client feedback that was practically impossible to analyze manually.
The approach: Bypassing the context window
We used Cursor with Claude-4-Sonnet to build an iterative ETL pipeline for text.
- Context definition: We defined a prompt targeting specific topics (metrics, benchmarks, feedback)
- Tool generation: Cursor generated a Python script to process 116 transcript files
- Iterative extraction: The script iterated through files, extracting relevant sentences into structured CSVs, which were then summarized
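A minimal stand-in for that generated script looks like this. The real version was produced by Cursor and used an LLM to judge relevance; this sketch substitutes a fixed keyword list so it runs anywhere, and the topic terms and transcript text are invented:

```python
# Sketch of the per-file extraction step: split a transcript into sentences,
# tag the ones that mention a tracked topic, and write them out as CSV rows.
# Topic terms and transcript contents are illustrative, not our real prompt.
import csv
import io
import re

TOPIC_TERMS = {
    "metrics": ("open rate", "conversion", "revenue"),
    "benchmarks": ("industry average", "benchmark"),
    "feedback": ("wish", "frustrating", "love"),
}

def extract_sentences(transcript: str) -> list[tuple[str, str]]:
    """Return (topic, sentence) pairs for sentences that mention a tracked topic."""
    rows = []
    for sentence in re.split(r"(?<=[.!?])\s+", transcript):
        lowered = sentence.lower()
        for topic, terms in TOPIC_TERMS.items():
            if any(term in lowered for term in terms):
                rows.append((topic, sentence.strip()))
    return rows

transcript = (
    "Our open rate jumped after the redesign. "
    "The onboarding was frustrating at first. "
    "We talked about lunch plans."
)

buffer = io.StringIO()  # the real script writes one CSV per transcript file
writer = csv.writer(buffer)
writer.writerow(["topic", "sentence"])
writer.writerows(extract_sentences(transcript))
print(buffer.getvalue())
```

Because each file is reduced to a small CSV of tagged sentences before any summarization happens, no single LLM call ever needs the full 76 hours of transcripts in its context window.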
The impact
This approach gave us a blended view of qualitative and quantitative insights: frequency counts of topics alongside exemplar quotes.
More importantly, it democratized the workflow. Vytautas Jakštys, our Product Director, a non-technical leader, now uses this same method. He generates SQL from our dbt docs using Claude, then uses Cursor to analyze customer chats to understand the "why" behind the numbers.
Final thoughts on data as a conversation
We aren't stapling AI onto our stack for show; we're baking it into how Omnisend asks, answers, and acts.
The result is a department that ships models faster, reviews code smarter, and lets everyone find trustworthy data without a guided tour. AI handles the mundane work (building new data models from business requirements, writing YAML documentation and tests, checking syntax and correct model use, validating and reviewing, and finding charts), clearing the runway for us to focus on the real question:
What's the next step that moves us forward?
The next step is to continue codifying our judgment into the markdown files: rules, guidelines, styles, and more. It's an ever-evolving process. As new LLM models emerge, so do new prompting techniques and approaches.
Most importantly, such workflows run entirely on well-curated metadata. Your AI is only as good as your documentation.
If you own a dataset, adopt the style guide and certify your assets. You aren't just helping a human reader today; you're making the assistant smarter for tomorrow.


