A practical scorecard for evaluating UX/UI design partners — covering process, team continuity, outcomes, and the ten questions that separate partners from vendors.

Evaluating a UX/UI design partner well means looking past portfolio polish. The strongest partners earn high marks on how they think, how they staff the work, how they communicate, and whether they can show outcomes (not just deliverables). Score candidates against an eight-criterion scorecard, weight the criteria by what your business actually needs, and test the team in a working session before you sign anything.
A polished portfolio is the floor for entry, not the criterion for selection. Two studios can ship visually similar work with very different effects on your team, hiring needs, and conversion numbers.
Score finalists against eight criteria covering strategy, process, team continuity, cross-functional fluency, communication, outcomes, pricing, and cultural fit.
Team continuity is the single biggest predictor of a strong engagement. Get the named senior staffing in writing before you sign.
Ten first-call questions surface a partner's instincts faster than any case study, especially questions about disagreement, measurement, and who shouldn't hire them.
If two finalists score within three points, run a paid two-week pilot. The pilot tells you more than another reference call.
A portfolio shows the finished surface. It rarely shows how the team got there, who actually did the work, how they handled disagreement, or whether the work performed once it shipped.
Two studios can ship visually similar sites and have very different effects on your team's ability to iterate, your hiring needs, and your conversion numbers. The portfolio qualifies a partner for the conversation. It does not pick one.
The pattern shows up across agency selection research. Forrester's work on marketing services partnerships finds that fit and ways of working predict satisfaction more reliably than creative quality alone (Forrester). The aesthetic case is necessary. It is not sufficient.
Score each criterion 1 to 5. Weight the rows that matter most for your situation, add the weighted scores, and normalize back to a 40-point scale so candidates stay comparable. We recommend a minimum threshold of 28 out of 40 for a serious finalist.
1. Strategic thinking
What good looks like: Asks business questions before design ones, brings frameworks, challenges your brief.
Red flags: Jumps to deliverables, treats the brief as fixed, has no point of view.

2. Process clarity
What good looks like: Documented phases, defined milestones, clear discovery before execution.
Red flags: "We'll figure it out as we go," vague estimates, no kickoff plan.

3. Team continuity
What good looks like: The senior people on the pitch are the senior people on the project.
Red flags: Pitch team vanishes after signing, junior staff parachuted in mid-project.

4. Cross-functional fluency
What good looks like: Designers conversant in product, brand, and engineering constraints.
Red flags: Designers siloed, can't critique feasibility, hand off polished files with no context.

5. Communication cadence
What good looks like: Weekly working sessions, async updates, named escalation path.
Red flags: Status-only meetings, surprise deadlines, slow turnaround on questions.

6. Outcomes orientation
What good looks like: Tracks success metrics, ships and measures, iterates after launch.
Red flags: Files delivered then silence, no measurement plan, no follow-up cadence.

7. Pricing transparency
What good looks like: Clear fee structure, scope guardrails, written change-order process.
Red flags: Vague ranges, "trust us," scope creep absorbed quietly until invoiced.

8. Cultural fit
What good looks like: Comfortable in disagreement, gives honest critique, respects internal expertise.
Red flags: Agrees with everything, defensive when challenged, condescends to in-house team.
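The weighted arithmetic behind the scorecard can be sketched in a few lines of Python. The weights below are illustrative placeholders, not a recommendation, and normalizing the weighted sum back to a 40-point scale is one way to keep the 28/40 threshold meaningful once weights are in play.

```python
# Minimal sketch of the weighted scorecard. Weights are illustrative
# placeholders; raise the rows that matter most for your situation.

CRITERIA_WEIGHTS = {
    "strategic thinking": 1.5,
    "process clarity": 1.0,
    "team continuity": 1.5,
    "cross-functional fluency": 1.0,
    "communication cadence": 1.0,
    "outcomes orientation": 1.5,
    "pricing transparency": 1.0,
    "cultural fit": 1.0,
}

def weighted_total(scores):
    """scores maps criterion -> 1..5. Returns the weighted sum,
    normalized back to the unweighted 40-point scale so the 28/40
    threshold still applies after weighting."""
    raw = sum(CRITERIA_WEIGHTS[c] * s for c, s in scores.items())
    max_raw = 5 * sum(CRITERIA_WEIGHTS.values())
    return round(40 * raw / max_raw, 1)

finalist = {c: 4 for c in CRITERIA_WEIGHTS}   # straight 4s across the board
print(weighted_total(finalist))               # 32.0, clears the 28/40 bar
```

A spreadsheet does the same job; the point is that the weights and the threshold are decided before the reference calls, not after.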
The right ten questions surface a partner's instincts faster than any case study.
1. Walk me through a typical engagement from kickoff to handoff.
2. What's your discovery process when you don't yet know our business?
3. Who from your team will I work with day to day, and will those people change as the project progresses?
4. How do you handle disagreement when your team and ours land in different places on a recommendation?
5. What does success look like for you in this engagement, separate from delivering the deliverables?
6. How do you measure whether the work actually performed once it shipped?
7. Tell me about a piece of work you'd revisit and change today. What would you do differently?
8. How do you collaborate with our internal product, brand, or engineering teams?
9. How do you handle scope changes mid-project?
10. What kinds of clients shouldn't hire you?
If a candidate stumbles on questions 7, 8, and 10, take that seriously. Those are the questions that distinguish partners from vendors.
Even experienced buyers fall into a handful of repeating traps. Watch for these.
Beautiful work is table stakes at the studio tier you're shopping. Use it to qualify, not to decide.
The single biggest predictor of a good engagement is whether the people you met in the pitch are the people doing the work. Ask explicitly. Get it in writing.
Studios that "just start" rarely produce strategic work. Discovery isn't slow. It's where the leverage lives. McKinsey's product development research finds that teams investing in upfront problem definition ship better outcomes faster than teams that compress discovery to start design earlier (McKinsey).
Every reference says the work was great. Ask "what didn't go well, and how did they handle it?" That's where you learn how a partner behaves under pressure.
A long services list is not a track record. Ask for measurable results from comparable engagements.
The best partners help you measure and iterate after launch. If post-launch support isn't part of the conversation, you're buying a one-time deliverable, not a partnership.
After you've scored each finalist, look at the spread, not just the totals.
A candidate who scores 4s and 5s evenly across the board is usually safer than one who scores 5s in three categories and 2s in two. Weight team continuity, strategic thinking, and outcomes orientation highest if you're hiring for a multi-quarter program. Weight process clarity and communication cadence highest if you've had bad agency experiences before. Weight cultural fit highest if your in-house team is going to live with this partner daily.
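The spread check is easy to operationalize: flag any criterion scored below a floor, regardless of the total. The finalists and scores below are hypothetical.

```python
# Hypothetical finalists: similar totals, very different spreads.

def low_spots(scores, floor=3):
    """Return the criteria scored below `floor`; each one is a risk
    worth probing before signing, regardless of the total."""
    return [c for c, s in scores.items() if s < floor]

even  = {"strategy": 4, "process": 4, "continuity": 4, "outcomes": 4}
spiky = {"strategy": 5, "process": 2, "continuity": 5, "outcomes": 5}

print(sum(even.values()),  low_spots(even))    # 16 []
print(sum(spiky.values()), low_spots(spiky))   # 17 ['process']
```

The spiky candidate wins on total but carries a weak spot the even candidate doesn't; that weak spot is what the follow-up reference calls should probe.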
If two finalists score within three points of each other, run a paid two-week pilot. The pilot will tell you more than another reference call.
Before scoping a long engagement, ANML runs what we call the 3-Lens Pilot. It's a paid, two-week working sprint designed to expose the same dynamics a six-month engagement will surface, only earlier and at a fraction of the cost.
Lens 1
What we test: Can the partner reframe the brief and surface the real business question?
Signal we're listening for: They challenge what's asked, not just answer it.

Lens 2
What we test: Can the team move from question to artifact (flow, prototype, system) inside two weeks?
Signal we're listening for: Velocity without sacrificing rigor.

Lens 3
What we test: How do the working sessions feel? Cadence, decisions, conflict, follow-through.
Signal we're listening for: The team you'd want with you on a hard week.
We use the 3-Lens Pilot in both directions. It's how we evaluate prospective collaborators, and it's how prospective clients can evaluate us. If the pilot doesn't end with both sides energized about the next phase, that's the most useful signal you can get.
A growth-stage fintech we worked with had completed a competitive pitch process and signed with a high-profile studio. Six weeks in, the senior designer they'd met during the pitch had been moved to a different account. The work that came back was technically polished but disconnected from the regulatory constraints and onboarding patterns specific to financial services. The team scrapped most of it.
When we re-scoped the engagement, we built two requirements directly into the contract: a named senior team for the duration, and a discovery phase that paired our designers with their compliance and engineering leads in the first two weeks. The redesigned onboarding flow shipped four months later and lifted activation by 22%.
The lesson wasn't that the first studio lacked talent. It was that the buyer hadn't tested for team continuity or cross-functional fluency before signing. The scorecard exists to prevent that exact outcome.
In consumer and luxury, the surface bar is high and the strategic gap is wider. Brands in these categories often hire for visual sophistication and end up with experiences that look the part but don't sell, retain, or convert. The portfolios all look great. The outcomes diverge.
The fix is to weight outcomes orientation and cross-functional fluency higher in the scorecard. Ask for retention, conversion, and AOV impact alongside the imagery. A partner who can talk about both is rare. A partner who can only talk about the imagery is common.
If you're evaluating UX/UI partners and want a second pair of eyes on your scorecard, your shortlist, or your discovery brief, we're here for that conversation. Follow ANML on LinkedIn for more practical guidance on brand and product experience.
How long should the selection process take?
Plan on three to six weeks from first call to signed contract. That covers initial conversations, scorecard reviews, reference calls, working sessions, and final negotiation. Compressing this timeline tends to surface problems later in the engagement.

What should we cover in the first call?
Focus on process, team staffing, and outcomes. Walk through how a typical engagement runs, who you'll work with day to day, how disagreements get handled, and how the team measures whether the work performed after launch. Avoid pricing in the first call. It forces vague answers and skews the conversation toward scope rather than fit.

How much does industry-specific experience matter?
Useful but secondary. Industry fluency speeds up discovery, but a partner with strong fundamentals and a willingness to learn often outperforms a category specialist who has stopped questioning their own playbook. Test for curiosity over credentials.

Should we hire a full-service partner or a specialist?
Pick based on the messiness of the work. If the project crosses brand, product, and web, a full-service partner reduces handoff cost. If the scope is narrow and well-defined, specialists are typically faster and more affordable. Avoid full-service partners who are really three specialists in a trench coat.

What's the most common red flag in proposals?
The biggest one is over-promising on timeline and outcomes without asking enough questions about your business. A close second is staffing changes between pitch and project. If the senior people you met in the pitch are not on the proposed team, that's a structural risk worth raising before signing.

How do we tell whether a studio's strategy claims are real?
Ask candidates to talk through one of their case studies without showing the visual work. If they can describe the problem, the user, the constraints, the choices, and the result without leaning on the screens, the strategy is real. If the story falls apart without the visuals, the strategy probably wasn't there to begin with.

How should we evaluate a newer studio with few case studies?
Lean harder on process and reasoning. Ask them to walk through how they would approach your problem in real time. A strong partner will run a live discovery exercise on the spot and ask the questions that reveal where they would start.

What's the difference between a vendor and a partner?
A vendor delivers what's asked. A partner challenges what's asked, brings a point of view, and stays involved in measuring whether it worked. The scorecard in this post is built to surface that distinction.