We want to add LLM output evaluation to our CI pipeline but are struggling with non-deterministic outputs. We currently use a combination of embedding similarity and LLM-as-judge. What frameworks are people using?
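
For concreteness, here is a minimal sketch of one way an embedding-similarity check might be made tolerant of non-deterministic outputs: sample the model several times and gate on the fraction of samples that land close enough to a reference answer, rather than on any single output. Everything here is an assumption for illustration, not any framework's API: the names `toy_embed`, `pass_rate`, and `ci_gate` are hypothetical, the bag-of-words "embedding" is a stand-in for a real embedding model, and both thresholds are arbitrary.

```python
import math
from collections import Counter
from typing import Callable, Sequence

Vector = dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in embedding (assumption): lowercase bag-of-words counts.

    Swap in a real embedding model here; only the cosine logic below
    depends on the vector format.
    """
    return dict(Counter(text.lower().split()))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pass_rate(
    samples: Sequence[str],
    reference: str,
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = 0.8,  # arbitrary similarity cutoff (assumption)
) -> float:
    """Fraction of sampled outputs whose similarity to the reference meets the threshold."""
    if not samples:
        raise ValueError("need at least one sampled output")
    ref = embed(reference)
    hits = sum(cosine(embed(s), ref) >= threshold for s in samples)
    return hits / len(samples)


def ci_gate(
    samples: Sequence[str],
    reference: str,
    min_pass_rate: float = 0.8,  # arbitrary tolerance for flaky samples (assumption)
    threshold: float = 0.8,
) -> bool:
    """True if enough samples are similar to the reference for the CI check to pass."""
    return pass_rate(samples, reference, threshold=threshold) >= min_pass_rate
```

In a CI job this could be called with N outputs sampled from the same prompt, for example `assert ci_gate(outputs, expected_answer)`. Gating on a pass rate instead of a single comparison trades a little strictness for far fewer flaky failures; an LLM-as-judge score could be plugged in the same way by replacing the similarity comparison with a judge verdict per sample.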