For a long time, Stacker sat on a goldmine of story performance data. We just needed a way to tap into it.
Stacker has distributed thousands of branded stories across our publisher network over the past few years, generating 1.2 million pickups and 77 billion estimated pageviews.
We've tracked every pickup. Every pattern and performance signal.
We learned which headlines drive authority site placements, which categories punch above their weight, and which content formats generate the most reach by industry.
That kind of performance data is usually gatekept or locked in the minds of editors at media publications, yet, when analyzed, it becomes an indispensable asset for content teams. Historically, we pooled and analyzed the data we had to power hands-on content insights, helping brands identify what kinds of stories resonate with publishers and their readers.
Recently, however, we’ve figured out how to scale these insights for our hundreds of clients.
A gap emerged as our business model evolved from Stacker Studio (where our in-house team wrote brand journalism on behalf of clients) to Stacker Connect (where our clients write, and Stacker handles vetting, edits, and distribution). Our clients owned their content strategy but lacked visibility into what works on our network. We were frequently asked “What should I write next?”, and while our data had answers, we couldn't scale personalized guidance to 100+ brands. At least, not yet.
Thus, Sparks was born: an AI-powered system that analyzes our network’s intelligence to deliver customized recommendations weekly. I handled the technical implementation as the lead data engineer, working with our head of product, Ken Romano, and lead content strategist, Tamara Sykes, to get Sparks ready to ship in just four jam-packed months.
We built Sparks to analyze story performance across our publisher network and generate personalized content recommendations for each client. Every week, the system examines benchmarks for different story categories, trending headlines, authority site preferences, and seasonal opportunities to deliver four actionable story ideas per client, with supporting data on why they'll perform.
These insights draw on data showing both the trends the whole network was seeing and what was working for specific clients. We also knew it was important to make the recommendations actionable, pointing to where each Spark story had the best chance of pickup.
Here's an inside look at the engineering behind the build.
Client view of Sparks
We built Sparks’ foundation on a set of SQL queries that extract meaningful patterns from our analytics warehouse and its billions of data points.
We built the system modularly, so each query lives in its own SQL file, which makes it simpler to add new data sources without touching the core pipeline. For instance, when we discovered publishers favor headlines with numbers, we added a query. When clients wanted to know optimal headline length, another query. Sparks’ architecture grows along with our understanding of what drives pickup success.
Each SQL query targets a specific angle for insights: category-level pickup benchmarks, trending headlines and headline patterns, authority site placement preferences, and seasonal story opportunities.
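As a rough sketch of how that modularity can look in practice, the pipeline simply reads whatever .sql files exist and runs them. The directory layout, file names, and helper functions below are illustrative, not our exact implementation:

```python
from pathlib import Path

# Hypothetical layout: one .sql file per insight angle, e.g.
# category_benchmarks.sql, trending_headlines.sql,
# authority_site_preferences.sql, seasonal_opportunities.sql.
QUERY_DIR = Path("queries")

def load_queries(query_dir: Path = QUERY_DIR) -> dict[str, str]:
    """Read every insight query from disk, so a new angle is just a new file."""
    return {path.stem: path.read_text() for path in sorted(query_dir.glob("*.sql"))}

def run_insight_queries(connection) -> dict[str, list]:
    """Run each query against the warehouse and collect rows per insight."""
    results = {}
    for name, sql in load_queries().items():
        cursor = connection.cursor()
        cursor.execute(sql)
        results[name] = cursor.fetchall()
    return results
```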
We decided to structure our prompts in separate markdown files rather than hardcoding them in Python. Four separate files each handle a different concern.
Why?
This file separation lets developers quickly update prompts based on insights the content strategy team provides in natural language. When stakeholders want different phrasing or new guidelines over time, we can modify the relevant markdown file without touching the application logic.
The approach also opens the door to A/B testing different prompt strategies, and it keeps version history clean so we can track which changes drove improvements.
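To show the pattern, here is a minimal sketch of how an application can assemble its system prompt from whatever markdown lives on disk. The directory name and the idea that fragments are joined alphabetically are assumptions for illustration:

```python
from pathlib import Path

# Hypothetical prompt directory; each markdown file owns one concern
# (e.g. tone guidelines, output format, grounding rules, examples).
PROMPT_DIR = Path("prompts")

def build_system_prompt(prompt_dir: Path = PROMPT_DIR) -> str:
    """Concatenate prompt fragments in a stable (alphabetical) order.

    Editing any .md file changes the next generation run without touching
    application code, which is what keeps prompt updates cheap to ship.
    """
    fragments = [
        f"## {path.stem}\n{path.read_text().strip()}"
        for path in sorted(prompt_dir.glob("*.md"))
    ]
    return "\n\n".join(fragments)
```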
To ensure accuracy and prevent hallucinations, we implemented a three-stage reasoning pipeline that forces the LLM to show its work.
At the end of the pipeline, the model produces four polished Sparks, each one a concise, actionable content brief backed by verified data.
Since the model must include its reasoning in the response, we can audit exactly how it arrived at "Entertainment content gets 1.7x more pickups than the network average" by checking the calculation against the benchmark tables.
This approach worked wonders, turning vague AI suggestions into defensible, data-driven recommendations.
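To give a flavor of what "show its work" means in practice, here's a simplified sketch of the verification idea: the model is asked to return the numbers behind each claim, and the pipeline recomputes them against benchmark data before the Spark ships. The response schema, field names, and benchmark values below are illustrative, not our production contract:

```python
import json

# Illustrative benchmark lookup: category -> average pickups per story.
# In production these numbers would come from the warehouse, not a dict.
BENCHMARKS = {"network_average": 100.0, "entertainment": 170.0}

def verify_claimed_multiplier(model_response: str, tolerance: float = 0.05) -> bool:
    """Recompute a claimed multiplier and reject the Spark if it doesn't hold.

    The prompt requires the model to return JSON such as:
    {"category": "entertainment", "claimed_multiplier": 1.7}
    """
    claim = json.loads(model_response)
    actual = BENCHMARKS[claim["category"]] / BENCHMARKS["network_average"]
    return abs(actual - claim["claimed_multiplier"]) <= tolerance * actual

# The "Entertainment content gets 1.7x more pickups" claim passes this check.
assert verify_claimed_multiplier('{"category": "entertainment", "claimed_multiplier": 1.7}')
```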
Once we had our model completing these checks successfully, we were ready to move into production.
Chain-of-Thought Flowchart
What started in Jupyter notebooks was, four months later, one of the biggest Stacker updates to date, running in production.
We iterated on three parallel tracks:
Some of these optimizations became features themselves:
We converted expensive queries for leaderboards and benchmarks into database views, dropping response times from 15+ seconds to under half a second.
Now, these views are productized, enabling clients to see them directly in their dashboards.
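For readers curious what that conversion looks like, here is a minimal, hypothetical version of the idea: precompute the expensive aggregation once as a view so the dashboard reads a cheap SELECT instead of rescanning raw pickups. Table and column names are invented for illustration:

```python
# Hypothetical example: materialize the expensive benchmark aggregation as a
# database view so dashboard reads become a cheap SELECT instead of a rescan.
CREATE_BENCHMARK_VIEW = """
CREATE OR REPLACE VIEW category_benchmarks AS
SELECT
    category,
    COUNT(*)     AS stories,
    SUM(pickups) AS total_pickups,
    AVG(pickups) AS avg_pickups_per_story
FROM story_performance   -- invented table name, for illustration only
GROUP BY category;
"""

def refresh_benchmark_view(connection) -> None:
    """Create or replace the view; downstream queries simply SELECT from it."""
    cursor = connection.cursor()
    cursor.execute(CREATE_BENCHMARK_VIEW)
    connection.commit()
```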
Our content strategist wrote dozens of "Inside Scoops," editorial tips about what makes stories perform. These now serve as context for the model, and they also became a productized rotating section that changes weekly for clients; we'll continue to update them as we learn more about what makes resonant content. We also added LinkedIn bio imports to give the model context on each brand, ensuring it understands every client's market positioning regardless of company size.
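A simplified sketch of how that per-client context might be stitched into the weekly prompt is below; the fields and the rotation logic are illustrative assumptions, not our actual schema:

```python
from dataclasses import dataclass

@dataclass
class ClientProfile:
    name: str
    vertical: str
    linkedin_bio: str  # imported bio text that gives the model market positioning

def build_client_context(profile: ClientProfile, inside_scoops: list[str], week_number: int) -> str:
    """Combine brand positioning with the Inside Scoop rotating for this week."""
    scoop = inside_scoops[week_number % len(inside_scoops)]  # simple weekly rotation
    return (
        f"Brand: {profile.name} ({profile.vertical})\n"
        f"Positioning: {profile.linkedin_bio}\n"
        f"This week's Inside Scoop: {scoop}"
    )
```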
Next, we took on the challenge of maintaining quality assurance on the model's recommendations.
Our biggest challenge was ensuring that customers received the same high-quality recommendations they were accustomed to from our human team, even when we were generating 400+ Sparks weekly for hundreds of clients. During development, manual review worked, but AI outputs need subjective assessment of tone, relevance, and strategic value, and traditional QA methods couldn't scale to that volume.
So, we built a quality flywheel with three reinforcing components:
1: Weekly Evaluation & Prompt Refinement
Every week, we sample generated Sparks across different client verticals and review them with our content strategy and insights team. We look for patterns: Are healthcare recommendations too generic? Are finance suggestions missing data-driven angles? When we spot issues, like the overly broad “Write about money” Spark we received early in the process, we update the modular prompt files to demand specificity, like “5 cities where starter homes cost less than the national average.”
This human review process ensures the system learns from real outputs, feeding improvements back into next week's generation.
2: Reasoning-Based Self-Evaluation
Our most significant breakthrough was embedding evaluation directly into the model's reasoning chain, using what's known as LLM-as-a-Judge, where the model critiques and validates its own outputs. It self-evaluates against concrete checkpoints we know matter, like authority of sources and formatting consistency, so every recommendation stays grounded in our performance data.
Every generation includes this built-in quality control. We turned the subjective question ‘Is this a good recommendation?’ into systematic verification steps that the model executes before producing output.
This reasoning-based engineering lets quality control scale with volume. Whether we're generating 4 Sparks or 4,000, each one passes through the same rigorous self-evaluation.
The model shows its work, catches its own errors, and ensures consistency across all clients.
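To make the judge step concrete, here is a stripped-down sketch of what a self-evaluation pass can look like. The checklist is invented (distilled from the kinds of checkpoints described above), and `call_model` is a placeholder for whatever LLM client you use:

```python
# Invented checklist distilled from the kinds of checkpoints described above.
JUDGE_PROMPT = """You are reviewing a draft Spark before it ships.
For each checkpoint, answer PASS or FAIL with one sentence of reasoning:
1. Every statistic is supported by the provided benchmark data.
2. The headline is specific (named places, numbers, or timeframes).
3. Formatting matches the Spark template.
End with a single line: VERDICT: PASS or VERDICT: FAIL."""

def judge_spark(draft: str, call_model) -> bool:
    """Ask the model to critique its own draft; only PASS verdicts ship.

    `call_model` is a placeholder for the LLM client: a function that
    takes a prompt string and returns the model's text response.
    """
    review = call_model(f"{JUDGE_PROMPT}\n\nDraft Spark:\n{draft}")
    return review.strip().upper().endswith("VERDICT: PASS")
```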
3: Client Feedback Loop
Post-launch, clients can thumbs-up or thumbs-down each Spark directly in their dashboard. We aggregate this feedback weekly, looking for patterns so we can update vertical priorities and double down on what's working.
The feedback helps improve the individual client experiences while making the entire system smarter for everyone.
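As a small, hypothetical illustration of the weekly signal we look at, the aggregation can be as simple as a thumbs-up rate per vertical (row shape and field names are invented):

```python
from collections import Counter

def weekly_feedback_summary(feedback_rows: list[dict]) -> dict[str, float]:
    """Compute a thumbs-up rate per vertical from one week of votes.

    Each row is assumed to look like {"vertical": "finance", "vote": "up"}.
    """
    ups, totals = Counter(), Counter()
    for row in feedback_rows:
        totals[row["vertical"]] += 1
        if row["vote"] == "up":
            ups[row["vertical"]] += 1
    return {vertical: ups[vertical] / totals[vertical] for vertical in totals}
```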
Each component reinforces the others: client feedback validates our evaluation sessions, reasoning requirements catch issues before clients see them, and weekly reviews ensure both systems stay aligned with our quality standards.
Clients who never had access to the content strategy team now get fresh recommendations every Monday morning. The insights are available instantly, backed by the latest network data, and consistent in quality across all accounts.
I'm proud that launching Sparks transformed Stacker from a service that could handle maybe a dozen clients manually into one that scales to hundreds automatically.
The architecture also enables rapid iteration; that's exactly why we designed the system modularly. When our value team identifies new ways to improve model intelligence, we can ship updates in days, with no architectural changes or lengthy dev cycles.
This speed lets us:
We built a system that turns millions of network data points into personalized content strategy for every client, every week. The recommendations improve as the network evolves, the insights sharpen as we refine the prompts, and the marginal cost to serve each additional client is minimal.
The launch of Sparks met a need for customers who were thrilled with the content they were producing and distributing with Stacker and wanted to ship more, but needed clearer recommendations on what content was in demand. With AI, we can answer those questions at scale while keeping our services cost-effective. What once cost a premium as custom content strategy is now baseline platform infrastructure that every brand gets access to automatically.
Our data intelligence will only get better with time, delivering more impact for our clients. So get ready: Sparks' impact is just getting started.
Ready to turn network intelligence into weekly, data-backed story ideas?
Cole Carter is a Data Engineer at Stacker, where he helps power the data systems behind Stacker’s newsroom and brand content operations. A Houston native and graduate of the University of Texas at Austin, Cole joined Stacker in May 2024 and brings expertise across Python, Machine Learning, AI, SQL, and R. He’s passionate about using data to uncover insights that inform smarter storytelling and make complex information accessible to everyone.