Hey everyone,
I wanted to share a few lessons I’ve learned running A/B tests using Amplitude Experiment and open up a conversation around how others are approaching it in real-world settings.
We’ve been using Amplitude for product analytics for a while, but recently moved some of our experimentation flows over to Amplitude Experiment—mainly to keep everything in one place and tie our test results more closely to user behavior.
Here’s a quick breakdown of what worked, what didn’t, and where I’d love to hear advice from others.
✅ 1. Getting Targeting Right Is Everything
We ran into issues early when our target segments weren’t clearly defined. One test aimed at new users actually hit a mix of new and returning users due to how our user property was being updated (or not updated fast enough).
Tips:
- Use real-time flags sparingly, and be cautious about when properties get assigned (especially in onboarding); see the sketch after this list.
- We now double-check everything in a "pre-launch debug dashboard" to verify who would actually qualify for the test.
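To make that concrete, here’s a rough sketch of the client-side guard we ended up with. It assumes the Amplitude Experiment JS client (`@amplitude/experiment-js-client`) with its initialize/fetch/variant calls; the `onboarding-test` flag key, the `is_new_user` property, and the 7-day cutoff are made-up placeholders for illustration.

```typescript
import { Experiment } from '@amplitude/experiment-js-client';

// Placeholder deployment key.
const experiment = Experiment.initialize('client-deployment-key');

async function getOnboardingVariant(userId: string, signupDate: Date) {
  // Compute the targeting property locally instead of relying on an
  // asynchronous user-property update that may not have landed yet.
  const isNewUser = Date.now() - signupDate.getTime() < 7 * 24 * 60 * 60 * 1000;

  if (!isNewUser) {
    // Returning users never even fetch, so they can't be bucketed here.
    return null;
  }

  // Send the property with the fetch so targeting evaluates against the
  // value we just computed, not a stale server-side copy.
  await experiment.fetch({
    user_id: userId,
    user_properties: { is_new_user: true },
  });

  return experiment.variant('onboarding-test');
}
```

The point is less the specific SDK calls and more that eligibility gets decided from data we control at render time.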
🔁 2. Split by User ID — Not Device ID (Most of the Time)
If your users log in across multiple devices, splitting by Device ID will backfire. We had a test where someone saw Variant A on mobile and Variant B on desktop. Ouch.
Lesson learned: Stick to User ID split for authenticated flows, and only use Device ID if you’re absolutely sure it’s a one-device experience.
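A sketch of what that looks like in practice, again assuming the Experiment JS client and a made-up flag key: the split follows whichever identifier you pass when you fetch variants, so authenticated flows should fetch with `user_id` and only anonymous flows should fall back to `device_id`.

```typescript
import { Experiment } from '@amplitude/experiment-js-client';

const experiment = Experiment.initialize('client-deployment-key');

async function fetchVariant(user: { userId?: string; deviceId: string }) {
  if (user.userId) {
    // Authenticated flow: bucket by user_id so the same person sees the
    // same variant on mobile and desktop.
    await experiment.fetch({ user_id: user.userId });
  } else {
    // Anonymous flow: device_id is all we have, so keep such experiments
    // limited to genuinely single-device, pre-login experiences.
    await experiment.fetch({ device_id: user.deviceId });
  }
  return experiment.variant('checkout-redesign'); // hypothetical flag key
}
```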
📈 3. Don’t Stop Tests Too Early
Guilty of this one: our team stopped a test after 4 days because it looked like one variant was “winning.” But after checking the confidence intervals in Amplitude’s significance testing tools, it turned out the results weren’t statistically reliable.
What we do now:
- Minimum of 7 days per test, even with high traffic
- Set thresholds before launching (e.g., p < 0.05, minimum detectable effect, etc.); see the quick sample-size sketch below
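On setting thresholds before launch: this is the kind of back-of-the-envelope sample-size check we run now (plain TypeScript, standard two-proportion normal approximation, nothing Amplitude-specific; the baseline rate and lift are made-up numbers) to sanity-check whether 7 days of traffic can even detect the effect we care about.

```typescript
// Required sample size per variant for a two-sided two-proportion z-test,
// using the standard normal approximation.
function sampleSizePerVariant(
  baselineRate: number,        // current conversion rate, 0..1
  minDetectableEffect: number, // absolute lift we care about, e.g. 0.02
  zAlpha = 1.96,               // two-sided alpha = 0.05
  zBeta = 0.84                 // power = 0.80
): number {
  const p1 = baselineRate;
  const p2 = baselineRate + minDetectableEffect;
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / minDetectableEffect ** 2);
}

// Example: 10% baseline conversion, want to detect an absolute +2% lift.
const needed = sampleSizePerVariant(0.10, 0.02);
console.log(`~${needed} users per variant`); // roughly 3,800 per variant

// If daily eligible traffic is ~400 users per variant, that's ~10 days,
// so a 7-day stop would be underpowered no matter how it "looks".
```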
🎯 4. Tie Experiments to Core Metrics, Not Vanity Metrics
Our first few tests focused on click-throughs and impressions—easy wins, right? But over time, we learned it’s way more valuable to track downstream outcomes (e.g., completed signups, conversions, retention after 1 week).
Now we ask:
“What user behavior do we ultimately care about, and is this test really influencing it?”
Amplitude makes this easier than most platforms by letting you hook experiments directly into funnel and retention views. That was a game changer for us.
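On the instrumentation side, a minimal sketch assuming the Amplitude Browser analytics SDK (`@amplitude/analytics-browser`); the event names are hypothetical. The idea is just that the downstream events your funnel and retention charts are built on actually fire at the right moments, so the experiment analysis has something real to join against.

```typescript
import * as amplitude from '@amplitude/analytics-browser';

amplitude.init('ANALYTICS_API_KEY'); // placeholder key

// Track the downstream outcomes you actually care about, not just clicks.
function onSignupCompleted(plan: string) {
  amplitude.track('Signup Completed', { plan });
}

function onFirstProjectCreated() {
  amplitude.track('First Project Created');
}

// Click-throughs are still worth logging, but as a secondary metric.
function onCtaClicked(location: string) {
  amplitude.track('CTA Clicked', { location });
}
```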
🧪 5. Use Holdouts When Possible
This was new for us, but using holdouts (groups kept out of the experiment entirely, who just see the default experience) helped reveal whether any of the test variants were really moving the needle or whether we were just seeing noise.
Surprising result: In one case, both Variant A and B performed worse than the control group 🤯. Without a holdout, we would’ve never known.
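Code-wise, the holdout mostly falls out of defensive variant handling. A sketch, with the same assumed Experiment client and a hypothetical flag key: anyone who doesn’t come back with an explicit treatment value, including holdout users who get no assignment at all, lands on the untouched control experience.

```typescript
import { Experiment } from '@amplitude/experiment-js-client';

const experiment = Experiment.initialize('client-deployment-key');

async function renderPricingPage(userId: string) {
  await experiment.fetch({ user_id: userId });

  // If a user is in the holdout (or otherwise not targeted), we expect no
  // treatment value back, so the default branch is the control experience.
  const variant = experiment.variant('pricing-page-test'); // hypothetical flag

  switch (variant.value) {
    case 'variant-a':
      return renderVariantA();
    case 'variant-b':
      return renderVariantB();
    default:
      return renderControl(); // control, holdout, or not targeted
  }
}

// Hypothetical render helpers, stubbed so the sketch is self-contained.
function renderVariantA() { /* ... */ }
function renderVariantB() { /* ... */ }
function renderControl() { /* ... */ }
```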
💬 Let’s Share Tips
I’m still learning the ropes with Amplitude Experiment, but it’s already helping us move faster and stay aligned across data, product, and marketing.
If you’re also using it, I’d love to hear:
- How do you pick your success metrics?
- Do you use remote configs to roll out winning variants?
- Any automation tricks (e.g., triggering feature flags via LaunchDarkly, Segment, etc.)?
Let’s trade war stories!