
Milvus 2.5 in Practice: Semantic Search vs. Full-Text Search vs. Hybrid Search

Author: Zilliz
  • 2024-12-31
    Shanghai


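As a vector database, Milvus has long focused on embedding-based vector search, providing accurate, high-performance, and scalable semantic search for applications such as RAG. As the era of large models brings a wave of new applications, the community has rediscovered the gains from combining traditional exact retrieval based on text matching with hybrid search, and this need is especially pressing in scenarios that depend heavily on keyword matching. To meet it, Milvus 2.5 introduces Full-Text Search (FTS) and combines it with the sparse vector search and hybrid search capabilities available since version 2.4, so that the three reinforce one another.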


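Hybrid search is a search method that fuses the results of multiple retrieval routes. Users can search different fields of their data in different ways and then use hybrid search to merge and re-rank the results into a single combined ranking. In today's popular RAG scenarios, the typical form of hybrid search combines semantic search with lexical search: the candidates recalled by embeddings and those returned by the lexical-matching BM25 algorithm are fused with RRF (Reciprocal Rank Fusion) into a better overall ranking.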
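As a rough illustration of the fusion step, here is a minimal Python sketch of RRF. It is not the Milvus implementation. The constant k = 60 and the chunk ids are assumptions made for this example. That said, the fused scores in the listings later in this article (about 0.016393, 0.016129, and 0.015873) are consistent with 1/61, 1/62, and 1/63, which is what this formula yields with k = 60 for a chunk that appears in only one route at rank 1, 2, or 3.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) to the chunks it contains, with
    ranks starting at 1. k = 60 is an assumed, commonly used constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical chunk ids standing in for the top results of each route.
dense_ranking = ["apache_header", "simple_test", "fmt_layout"]
sparse_ranking = ["rust_logger", "long_strings", "make_strings"]

fused = rrf_fuse([dense_ranking, sparse_ranking])
for chunk_id, score in fused:
    print(f"{chunk_id}: {score:.6f}")
```

Because RRF only looks at ranks, the raw dense similarity scores and BM25 scores never need to be put on a common scale, which is one reason it is a convenient default for combining the two routes.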


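In this article we use a RAG dataset provided by Anthropic for the demonstration. It is a text-to-code search dataset made up of snippets from 9 codebases, similar to the now-popular AI-assisted programming scenario. Because code contains a large number of definitions, keywords, and similar information, text-based search can deliver larger gains here. At the same time, dense embedding models trained on large amounts of code can capture some higher-level semantics. We want to observe experimentally what happens when the two are combined.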


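To build a more concrete understanding of hybrid search, we sample some specific cases for analysis. Using an advanced dense embedding model trained on large amounts of code (voyage-2) as the baseline, we picked individual cases (top 5) where hybrid search beats the dense results and the sparse results respectively, to see which underlying characteristics they reflect.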



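Beyond this case-by-case qualitative analysis, we also ran an overall evaluation to obtain quantitative results, computing the Pass@5 metric over the dataset. For each query, this metric measures the proportion of all relevant results that are successfully retrieved within the top 5. The results show that an advanced embedding model alone reaches a good baseline, that combining it with full-text search still brings an improvement, and that inspecting the BM25 results and tuning parameters for the specific scenario brings a larger one.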
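The metric as described above can be sketched in a few lines of Python. This is only an illustration of the definition given in the text, not the evaluation script used for the article. The function names and the chunk ids are hypothetical.

```python
def pass_at_k(retrieved, relevant, k=5):
    """Fraction of the relevant chunk ids that appear in the top-k retrieved ids."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant_set) / len(relevant_set)


def mean_pass_at_k(runs, k=5):
    """Average pass_at_k over (retrieved, relevant) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(pass_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


# Hypothetical results for two queries.
runs = [
    # Both relevant chunks are inside the top 5.
    (["c1", "c7", "c3", "c9", "c4"], ["c1", "c3"]),
    # c11 is ranked 6th, so only one of the two relevant chunks counts.
    (["c2", "c5", "c6", "c8", "c0", "c11"], ["c11", "c5"]),
]
print(mean_pass_at_k(runs))
```

Running the same function over the dense, sparse, and hybrid rankings of every query gives one number per strategy, which is what makes the three directly comparable.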

Case 1: Hybrid Search Outperforms Semantic Search

Question: How is the log file created?


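This question asks how the log file is created, and the correct answer is a piece of Rust code that creates a log file. In the semantic search results we see header files that include logging libraries and C++ code that obtains a logger, but the key to this question is really the variable "logfile". We find the correct result at #hybrid 0 in the hybrid search results. Since hybrid search fuses the semantic and full-text results, and this chunk does not appear in the semantic results, it must have come from full-text search. Besides this result, #hybrid 2 contains test mock code that looks completely unrelated, in particular the sentence "long string to test how those are handled.", repeated over and over. Explaining this requires understanding the principle behind BM25, the full-text search algorithm: it rewards matches on low-frequency terms, because high-frequency terms are too common to help distinguish one document from another. In statistics gathered over a large body of natural-language text, "how" is a very common word and therefore contributes very little to the relevance score. The data here is code, however, and little of the text in code contains "how", so the term receives a high weight and the chunks that contain it are retrieved in large numbers.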
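The effect can be reproduced on a toy corpus. The sketch below computes only the IDF part of BM25, using a common smoothed variant of the formula; the article does not say which variant Milvus uses, so treat the formula, the tokenizer, and the five made-up code-like documents as assumptions.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into word-like tokens (a simplification)."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def bm25_idf(term, docs):
    """Smoothed BM25 inverse document frequency of a term over a list of documents."""
    n_docs = len(docs)
    doc_freq = sum(1 for doc in docs if term in set(tokenize(doc)))
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)


# Hypothetical code-like corpus: "logger" is everywhere, "how" is rare.
corpus = [
    "let logger = Logger::init(args)",
    "LoggerPtr logger = Logger::getLogger(name)",
    "logger->setLevel(Level::getInfo())",
    "let logfile = File::create(args.log_file())",
    '"long string to test how those are handled"',
]

for term in ("logger", "how", "logfile"):
    print(term, round(bm25_idf(term, corpus), 3))
```

In this corpus "how" appears in a single document and "logger" in three, so "how" gets the larger IDF, which mirrors the situation described above: a word that is nearly worthless in natural-language text becomes a strong signal in code.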


GroundTruth


use {
    crate::args::LogArgs,
    anyhow::{anyhow, Result},
    simplelog::{Config, LevelFilter, WriteLogger},
    std::fs::File,
};

pub struct Logger;

impl Logger {
    pub fn init(args: &impl LogArgs) -> Result<()> {
        let filter: LevelFilter = args.log_level().into();
        if filter != LevelFilter::Off {
            let logfile = File::create(args.log_file())
                .map_err(|e| anyhow!("Failed to open log file: {e:}"))?;
            WriteLogger::init(filter, Config::default(), logfile)
                .map_err(|e| anyhow!("Failed to initalize logger: {e:}"))?;
        }
        Ok(())
    }
}


Semantic search results:


##dense 0 0.7745316028594971 /* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements.  See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License.  You may obtain a copy of the License at * *      http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */#include "logunit.h"#include <log4cxx/logger.h>#include <log4cxx/simplelayout.h>#include <log4cxx/fileappender.h>#include <log4cxx/helpers/absolutetimedateformat.h>


##dense 1 0.769859254360199 void simple() { LayoutPtr layout = LayoutPtr(new SimpleLayout()); AppenderPtr appender = FileAppenderPtr(new FileAppender(layout, LOG4CXX_STR("output/simple"), false)); root->addAppender(appender); common();
LOGUNIT_ASSERT(Compare::compare(LOG4CXX_FILE("output/simple"), LOG4CXX_FILE("witness/simple"))); }
std::string createMessage(int i, Pool & pool) { std::string msg("Message "); msg.append(pool.itoa(i)); return msg; }
void common() { int i = 0;
// In the lines below, the logger names are chosen as an aid in // remembering their level values. In general, the logger names // have no bearing to level values. LoggerPtr ERRlogger = Logger::getLogger(LOG4CXX_TEST_STR("ERR")); ERRlogger->setLevel(Level::getError());


##dense 2 0.7591114044189453 log4cxx::spi::LoggingEventPtr logEvt = std::make_shared<log4cxx::spi::LoggingEvent>(LOG4CXX_STR("foo"), Level::getInfo(), LOG4CXX_STR("A Message"), log4cxx::spi::LocationInfo::getLocationUnavailable()); FMTLayout layout(LOG4CXX_STR("{d:%Y-%m-%d %H:%M:%S} {message}")); LogString output; log4cxx::helpers::Pool pool; layout.format( output, logEvt, pool);


##dense 3 0.7562235593795776 #include "util/compare.h"#include "util/transformer.h"#include "util/absolutedateandtimefilter.h"#include "util/iso8601filter.h"#include "util/absolutetimefilter.h"#include "util/relativetimefilter.h"#include "util/controlfilter.h"#include "util/threadfilter.h"#include "util/linenumberfilter.h"#include "util/filenamefilter.h"#include "vectorappender.h"#include <log4cxx/fmtlayout.h>#include <log4cxx/propertyconfigurator.h>#include <log4cxx/helpers/date.h>#include <log4cxx/spi/loggingevent.h>#include <iostream>#include <iomanip>
#define REGEX_STR(x) x#define PAT0 REGEX_STR("\\[[0-9A-FXx]*]\\ (DEBUG|INFO|WARN|ERROR|FATAL) .* - Message [0-9]\\{1,2\\}")#define PAT1 ISO8601_PAT REGEX_STR(" ") PAT0#define PAT2 ABSOLUTE_DATE_AND_TIME_PAT REGEX_STR(" ") PAT0#define PAT3 ABSOLUTE_TIME_PAT REGEX_STR(" ") PAT0#define PAT4 RELATIVE_TIME_PAT REGEX_STR(" ") PAT0#define PAT5 REGEX_STR("\\[[0-9A-FXx]*]\\ (DEBUG|INFO|WARN|ERROR|FATAL) .* : Message [0-9]\\{1,2\\}")

##dense 4 0.7557586431503296 std::string msg("Message ");
Pool pool;
// These should all log.---------------------------- LOG4CXX_FATAL(ERRlogger, createMessage(i, pool)); i++; //0 LOG4CXX_ERROR(ERRlogger, createMessage(i, pool)); i++;
LOG4CXX_FATAL(INF, createMessage(i, pool)); i++; // 2 LOG4CXX_ERROR(INF, createMessage(i, pool)); i++; LOG4CXX_WARN(INF, createMessage(i, pool)); i++; LOG4CXX_INFO(INF, createMessage(i, pool)); i++;
LOG4CXX_FATAL(INF_UNDEF, createMessage(i, pool)); i++; //6 LOG4CXX_ERROR(INF_UNDEF, createMessage(i, pool)); i++; LOG4CXX_WARN(INF_UNDEF, createMessage(i, pool)); i++; LOG4CXX_INFO(INF_UNDEF, createMessage(i, pool)); i++;
LOG4CXX_FATAL(INF_ERR, createMessage(i, pool)); i++; // 10 LOG4CXX_ERROR(INF_ERR, createMessage(i, pool)); i++;
LOG4CXX_FATAL(INF_ERR_UNDEF, createMessage(i, pool)); i++; LOG4CXX_ERROR(INF_ERR_UNDEF, createMessage(i, pool)); i++;


Hybrid search results:


##hybrid 0 0.016393441706895828 use {    crate::args::LogArgs,    anyhow::{anyhow, Result},    simplelog::{Config, LevelFilter, WriteLogger},    std::fs::File,};
pub struct Logger;
impl Logger { pub fn init(args: &impl LogArgs) -> Result<()> { let filter: LevelFilter = args.log_level().into(); if filter != LevelFilter::Off { let logfile = File::create(args.log_file()) .map_err(|e| anyhow!("Failed to open log file: {e:}"))?; WriteLogger::init(filter, Config::default(), logfile) .map_err(|e| anyhow!("Failed to initalize logger: {e:}"))?; } Ok(()) }}
##hybrid 1 0.016393441706895828 /* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */#include "logunit.h"#include <log4cxx/logger.h>#include <log4cxx/simplelayout.h>#include <log4cxx/fileappender.h>#include <log4cxx/helpers/absolutetimedateformat.h>

##hybrid 2 0.016129031777381897 "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " };}

##hybrid 3 0.016129031777381897 void simple() { LayoutPtr layout = LayoutPtr(new SimpleLayout()); AppenderPtr appender = FileAppenderPtr(new FileAppender(layout, LOG4CXX_STR("output/simple"), false)); root->addAppender(appender); common();
LOGUNIT_ASSERT(Compare::compare(LOG4CXX_FILE("output/simple"), LOG4CXX_FILE("witness/simple"))); }
std::string createMessage(int i, Pool & pool) { std::string msg("Message "); msg.append(pool.itoa(i)); return msg; }
void common() { int i = 0;
// In the lines below, the logger names are chosen as an aid in // remembering their level values. In general, the logger names // have no bearing to level values. LoggerPtr ERRlogger = Logger::getLogger(LOG4CXX_TEST_STR("ERR")); ERRlogger->setLevel(Level::getError());

##hybrid 4 0.01587301678955555 std::vector<std::string> MakeStrings() { return { "a", "ab", "abc", "abcd", "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. "

Case 2: Hybrid Search Outperforms Full-Text Search

Question: How do you initialize the logger


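This question is very similar to the previous one and has the same answer, but this time hybrid search finds it (through semantic search) while it is absent from the full-text search results. The reason is that the term weights derived from corpus statistics do not match our mental model: the algorithm does not understand that "how" is unimportant for matching, and because "logger" appears more often than "how" in the code, "how" may even end up weighted as the more important term.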


GroundTruth


use {
    crate::args::LogArgs,
    anyhow::{anyhow, Result},
    simplelog::{Config, LevelFilter, WriteLogger},
    std::fs::File,
};

pub struct Logger;

impl Logger {
    pub fn init(args: &impl LogArgs) -> Result<()> {
        let filter: LevelFilter = args.log_level().into();
        if filter != LevelFilter::Off {
            let logfile = File::create(args.log_file())
                .map_err(|e| anyhow!("Failed to open log file: {e:}"))?;
            WriteLogger::init(filter, Config::default(), logfile)
                .map_err(|e| anyhow!("Failed to initalize logger: {e:}"))?;
        }
        Ok(())
    }
}


Full-text search results:


##sparse 0 10.17311954498291         "long string to test how those are handled. Here goes more text. "        "long string to test how those are handled. Here goes more text. "        "long string to test how those are handled. Here goes more text. "        "long string to test how those are handled. Here goes more text. "        "long string to test how those are handled. Here goes more text. "        "long string to test how those are handled. Here goes more text. "        "long string to test how those are handled. Here goes more text. "        "long string to test how those are handled. Here goes more text. "        "long string to test how those are handled. Here goes more text. "        "long string to test how those are handled. Here goes more text. "        "long string to test how those are handled. Here goes more text. "        "long string to test how those are handled. Here goes more text. "        "long string to test how those are handled. Here goes more text. "    };}


##sparse 1 9.775702476501465 std::vector<std::string> MakeStrings() { return { "a", "ab", "abc", "abcd", "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. "

##sparse 2 7.638711452484131 // union ("x|y"), grouping ("(xy)"), brackets ("[xy]"), and// repetition count ("x{5,7}"), among others.//// Below is the syntax that we do support. We chose it to be a// subset of both PCRE and POSIX extended regex, so it's easy to// learn wherever you come from. In the following: 'A' denotes a// literal character, period (.), or a single \\ escape sequence;// 'x' and 'y' denote regular expressions; 'm' and 'n' are for

##sparse 3 7.1208391189575195 /* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */#include "logunit.h"#include <log4cxx/logger.h>#include <log4cxx/simplelayout.h>#include <log4cxx/fileappender.h>#include <log4cxx/helpers/absolutetimedateformat.h>


##sparse 4 7.066349029541016 /* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */#include <log4cxx/filter/denyallfilter.h>#include <log4cxx/logger.h>#include <log4cxx/spi/filter.h>#include <log4cxx/spi/loggingevent.h>#include "../logunit.h"


Hybrid search results:




##hybrid 0 0.016393441706895828 use { crate::args::LogArgs, anyhow::{anyhow, Result}, simplelog::{Config, LevelFilter, WriteLogger}, std::fs::File,};
pub struct Logger;
impl Logger { pub fn init(args: &impl LogArgs) -> Result<()> { let filter: LevelFilter = args.log_level().into(); if filter != LevelFilter::Off { let logfile = File::create(args.log_file()) .map_err(|e| anyhow!("Failed to open log file: {e:}"))?; WriteLogger::init(filter, Config::default(), logfile) .map_err(|e| anyhow!("Failed to initalize logger: {e:}"))?; } Ok(()) }}
##hybrid 1 0.016393441706895828 "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " };}

##hybrid 2 0.016129031777381897 std::vector<std::string> MakeStrings() { return { "a", "ab", "abc", "abcd", "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. " "long string to test how those are handled. Here goes more text. "
##hybrid 3 0.016129031777381897 LoggerPtr INF = Logger::getLogger(LOG4CXX_TEST_STR("INF")); INF->setLevel(Level::getInfo());
LoggerPtr INF_ERR = Logger::getLogger(LOG4CXX_TEST_STR("INF.ERR")); INF_ERR->setLevel(Level::getError());
LoggerPtr DEB = Logger::getLogger(LOG4CXX_TEST_STR("DEB")); DEB->setLevel(Level::getDebug());
// Note: categories with undefined level LoggerPtr INF_UNDEF = Logger::getLogger(LOG4CXX_TEST_STR("INF.UNDEF")); LoggerPtr INF_ERR_UNDEF = Logger::getLogger(LOG4CXX_TEST_STR("INF.ERR.UNDEF")); LoggerPtr UNDEF = Logger::getLogger(LOG4CXX_TEST_STR("UNDEF"));

##hybrid 4 0.01587301678955555 // union ("x|y"), grouping ("(xy)"), brackets ("[xy]"), and// repetition count ("x{5,7}"), among others.//// Below is the syntax that we do support. We chose it to be a// subset of both PCRE and POSIX extended regex, so it's easy to// learn wherever you come from. In the following: 'A' denotes a// literal character, period (.), or a single \\ escape sequence;// 'x' and 'y' denote regular expressions; 'm' and 'n' are for


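We find that the sparse results contain quite a few low-quality hits caused by matches on low-information words such as "How" and "What". Having observed that matches on "How" and "What" introduce noise in this dataset, we can suppress them by adding these words to the stopword list so that their matches are ignored.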
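To make the idea concrete, here is a small Python sketch of what stopword filtering does to a query before lexical matching. The article does not show how the stopwords were configured in Milvus, so this uses no Milvus API; the tokenizer and the helper name `query_terms` are hypothetical, and the stopword set contains only the two words named above.

```python
import re

# Only the two words the article identifies as noisy for this dataset.
STOPWORDS = {"how", "what"}


def query_terms(query, stopwords=STOPWORDS):
    """Split a query into word-like tokens and drop the stopwords (case-insensitive)."""
    tokens = re.findall(r"[A-Za-z0-9_]+", query)
    return [token for token in tokens if token.lower() not in stopwords]


print(query_terms("How do you initialize the logger"))
# prints ['do', 'you', 'initialize', 'the', 'logger']
```

With "How" removed, a chunk can no longer score on that word alone, so the repeated "long string to test how those are handled." text loses the match that pushed it to the top of the sparse results.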

Case 3: Hybrid Search (with Stopwords) Outperforms Semantic Search

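After this step, we again analyze a case where the tuned hybrid search does better than semantic search. This time, matching the term "RegistryClient" from the query surfaces a result that the semantic search model alone did not recall. We can also see that the hybrid results now contain fewer low-quality matches.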


Question: How is the RegistryClient instance created in the test methods?


/** Integration tests for {@link BlobPuller}. */public class BlobPullerIntegrationTest {
private final FailoverHttpClient httpClient = new FailoverHttpClient(true, false, ignored -> {});
@Test public void testPull() throws IOException, RegistryException { RegistryClient registryClient = RegistryClient.factory(EventHandlers.NONE, "gcr.io", "distroless/base", httpClient) .newRegistryClient(); V22ManifestTemplate manifestTemplate = registryClient .pullManifest( ManifestPullerIntegrationTest.KNOWN_MANIFEST_V22_SHA, V22ManifestTemplate.class) .getManifest();
DescriptorDigest realDigest = manifestTemplate.getLayers().get(0).getDigest();


Semantic search results:




##dense 0 0.7411458492279053 Mockito.doThrow(mockRegistryUnauthorizedException) .when(mockJibContainerBuilder) .containerize(mockContainerizer);
try { testJibBuildRunner.runBuild(); Assert.fail();
} catch (BuildStepsExecutionException ex) { Assert.assertEquals( TEST_HELPFUL_SUGGESTIONS.forHttpStatusCodeForbidden("someregistry/somerepository"), ex.getMessage()); } }


##dense 1 0.7346029877662659 verify(mockCredentialRetrieverFactory).known(knownCredential, "credentialSource"); verify(mockCredentialRetrieverFactory).known(inferredCredential, "inferredCredentialSource"); verify(mockCredentialRetrieverFactory) .dockerCredentialHelper("docker-credential-credentialHelperSuffix"); }


##dense 2 0.7285804748535156 when(mockCredentialRetrieverFactory.dockerCredentialHelper(anyString())) .thenReturn(mockDockerCredentialHelperCredentialRetriever); when(mockCredentialRetrieverFactory.known(knownCredential, "credentialSource")) .thenReturn(mockKnownCredentialRetriever); when(mockCredentialRetrieverFactory.known(inferredCredential, "inferredCredentialSource")) .thenReturn(mockInferredCredentialRetriever); when(mockCredentialRetrieverFactory.wellKnownCredentialHelpers()) .thenReturn(mockWellKnownCredentialHelpersCredentialRetriever);


##dense 3 0.7279614210128784 @Test public void testBuildImage_insecureRegistryException() throws InterruptedException, IOException, CacheDirectoryCreationException, RegistryException, ExecutionException { InsecureRegistryException mockInsecureRegistryException = Mockito.mock(InsecureRegistryException.class); Mockito.doThrow(mockInsecureRegistryException) .when(mockJibContainerBuilder) .containerize(mockContainerizer);
try { testJibBuildRunner.runBuild(); Assert.fail();
} catch (BuildStepsExecutionException ex) { Assert.assertEquals(TEST_HELPFUL_SUGGESTIONS.forInsecureRegistry(), ex.getMessage()); } }


##dense 4 0.724872350692749 @Test public void testBuildImage_registryCredentialsNotSentException() throws InterruptedException, IOException, CacheDirectoryCreationException, RegistryException, ExecutionException { Mockito.doThrow(mockRegistryCredentialsNotSentException) .when(mockJibContainerBuilder) .containerize(mockContainerizer);
try { testJibBuildRunner.runBuild(); Assert.fail();
} catch (BuildStepsExecutionException ex) { Assert.assertEquals(TEST_HELPFUL_SUGGESTIONS.forCredentialsNotSent(), ex.getMessage()); } }


Hybrid search results:



##hybrid 0 0.016393441706895828 /** Integration tests for {@link BlobPuller}. */public class BlobPullerIntegrationTest {
private final FailoverHttpClient httpClient = new FailoverHttpClient(true, false, ignored -> {});
@Test public void testPull() throws IOException, RegistryException { RegistryClient registryClient = RegistryClient.factory(EventHandlers.NONE, "gcr.io", "distroless/base", httpClient) .newRegistryClient(); V22ManifestTemplate manifestTemplate = registryClient .pullManifest( ManifestPullerIntegrationTest.KNOWN_MANIFEST_V22_SHA, V22ManifestTemplate.class) .getManifest();
DescriptorDigest realDigest = manifestTemplate.getLayers().get(0).getDigest();

##hybrid 1 0.016393441706895828 Mockito.doThrow(mockRegistryUnauthorizedException) .when(mockJibContainerBuilder) .containerize(mockContainerizer);
try { testJibBuildRunner.runBuild(); Assert.fail();
} catch (BuildStepsExecutionException ex) { Assert.assertEquals( TEST_HELPFUL_SUGGESTIONS.forHttpStatusCodeForbidden("someregistry/somerepository"), ex.getMessage()); } }

##hybrid 2 0.016129031777381897 verify(mockCredentialRetrieverFactory).known(knownCredential, "credentialSource"); verify(mockCredentialRetrieverFactory).known(inferredCredential, "inferredCredentialSource"); verify(mockCredentialRetrieverFactory) .dockerCredentialHelper("docker-credential-credentialHelperSuffix"); }

##hybrid 3 0.016129031777381897 @Test public void testPull_unknownBlob() throws IOException, DigestException { DescriptorDigest nonexistentDigest = DescriptorDigest.fromHash( "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");
RegistryClient registryClient = RegistryClient.factory(EventHandlers.NONE, "gcr.io", "distroless/base", httpClient) .newRegistryClient();
try { registryClient .pullBlob(nonexistentDigest, ignored -> {}, ignored -> {}) .writeTo(ByteStreams.nullOutputStream()); Assert.fail("Trying to pull nonexistent blob should have errored");
} catch (IOException ex) { if (!(ex.getCause() instanceof RegistryErrorException)) { throw ex; } MatcherAssert.assertThat( ex.getMessage(), CoreMatchers.containsString( "pull BLOB for gcr.io/distroless/base with digest " + nonexistentDigest)); } }}
##hybrid 4 0.01587301678955555 when(mockCredentialRetrieverFactory.dockerCredentialHelper(anyString())) .thenReturn(mockDockerCredentialHelperCredentialRetriever); when(mockCredentialRetrieverFactory.known(knownCredential, "credentialSource")) .thenReturn(mockKnownCredentialRetriever); when(mockCredentialRetrieverFactory.known(inferredCredential, "inferredCredentialSource")) .thenReturn(mockInferredCredentialRetriever); when(mockCredentialRetrieverFactory.wellKnownCredentialHelpers()) .thenReturn(mockWellKnownCredentialHelpersCredentialRetriever);


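We can draw a few conclusions. A semantic search model gives good results out of the box, but when the query contains keywords that should be matched, it has no explicit way to express that requirement, whereas full-text search does. The drawback is that trivial matches can degrade overall quality, so we need to identify these negative cases from concrete results and address them from a business perspective to improve retrieval quality. We hope that the full-text search feature released in Milvus 2.5 gives community users more flexibility when building RAG systems, lets them explore combinations of retrieval strategies freely, and helps them meet the more complex and diverse retrieval needs of the GenAI era. To see the concrete code for using full-text search in Milvus, read on in "Full-Text Search with Milvus".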


Code:


https://github.com/wxywb/milvus_fts_exps


Dataset:


https://github.com/anthropics/anthropic-cookbook/tree/main/skills/contextual-embeddings/data

About the Authors

王翔宇 (Wang Xiangyu)

Algorithm Engineer, Zilliz

陈将 (Chen Jiang)

Head of Ecosystem and AI Platform, Zilliz



