How are you handling LLM evaluation in CI/CD?

We want to add LLM output evaluation to our CI pipeline but are struggling with non-deterministic outputs. We currently use a combination of embedding similarity and LLM-as-judge. What frameworks are people using?
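For concreteness, here is a minimal sketch of one way an embedding-similarity gate could tolerate non-determinism: sample the model several times and require a minimum pass rate instead of an exact match. Everything named here (`toy_embed`, `pass_rate`, `VOCAB`, the thresholds) is a hypothetical placeholder and not part of any framework. The bag-of-words embedding only stands in for a real embedding model so the example runs on its own.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical fixed vocabulary for the toy embedding below.
VOCAB = ["paris", "is", "the", "capital", "of", "france"]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_embed(text: str) -> List[float]:
    """Placeholder embedding: word counts over VOCAB (assumption, not a real model)."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]


def pass_rate(
    outputs: Sequence[str],
    reference: str,
    embed: Callable[[str], Sequence[float]],
    threshold: float,
) -> float:
    """Fraction of sampled outputs whose similarity to the reference meets the threshold."""
    if not outputs:
        raise ValueError("need at least one sampled output")
    ref_vec = embed(reference)
    passed = sum(
        1 for out in outputs if cosine_similarity(embed(out), ref_vec) >= threshold
    )
    return passed / len(outputs)


if __name__ == "__main__":
    # In CI these would be N samples from the model for the same prompt.
    samples = [
        "Paris is the capital of France",
        "The capital of France is Paris",
        "I am not sure",
    ]
    rate = pass_rate(samples, "Paris is the capital of France", toy_embed, threshold=0.8)
    # Gate on the pass rate (0.6 is an arbitrary illustrative cutoff).
    assert rate >= 0.6, f"pass rate too low: {rate:.2f}"
```

Gating on a pass rate over several samples makes the check less flaky than asserting on a single output, at the cost of more model calls per CI run. The same structure would work with an LLM-as-judge score in place of the cosine similarity.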

tech trends

mlops