SOLR-18187: Document enrichment with LLMs#4259
nicolo-rinaldi wants to merge 14 commits into apache:main
…tUpdateProcessorFactory
- multivalued outputField - outputField different from Str/Text, with numeric, boolean and date
…t with LLMs' module
    restTestHarness.delete(ManagedChatModelStore.REST_END_POINT + "/model1");
  }

  private UpdateRequestProcessor createUpdateProcessor(
Can't this be generalised and used for all the tests? In some of them you are now repeating this code with small changes.

This is the same as createUpdateProcessor apart from the creation of the request and the call to getInstance().
Maybe we can leave out the Solr request and getInstance() and use that method here as well, calling it something like "initializeUpdateProcessorFactory"?
What do you think?

I created a function, initializeUpdateProcessorFactory, that is used inside createUpdateProcessor. This way, the code inside the former can be reused.

  @Test
  public void init_promptFileWithMissingPlaceholder_shouldThrowExceptionInInform() {
    NamedList<String> args = new NamedList<>();
This is the same as createUpdateProcessor apart from the creation of the request and the call to getInstance().
Maybe we can leave out the Solr request and getInstance() and use that method here as well, calling it something like "initializeUpdateProcessorFactory"?
What do you think?

Changed, and fixed the tests.
https://issues.apache.org/jira/browse/SOLR-18187
Description
The goal of this PR is to integrate LLMs directly into Solr at index time, so that they can fill fields that might be useful (e.g., categories or tags).
Solution
This PR adds LLM-based document enrichment to Solr's indexing pipeline via a new DocumentEnrichmentUpdateProcessorFactory in the language-models module. The processor lets users enrich documents at index time by calling an LLM (via https://github.com/langchain4j/langchain4j) with a configurable prompt built from one or more existing document fields (inputFields) and storing the model's response in an output field. The output field can be of type string, text, int, long, float, double, boolean, or date, and can be single-valued or multi-valued. Structured output is used so that the model's response matches the output field type.
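To make the prompt-building step concrete, here is a minimal plain-Java sketch of filling a prompt template from a document's input fields. It is not the PR's code: the `{fieldName}` placeholder syntax, the class name `PromptSketch`, and the methods `validateTemplate` and `buildPrompt` are all assumptions made for illustration, and the document is modelled as a simple `Map` rather than a Solr document.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; names and placeholder syntax are assumptions.
public class PromptSketch {

  // Assumed placeholder syntax: {fieldName}
  static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

  /** Fails fast if the template lacks a placeholder for any configured input field. */
  static void validateTemplate(String template, List<String> inputFields) {
    for (String field : inputFields) {
      if (!template.contains("{" + field + "}")) {
        throw new IllegalArgumentException("prompt is missing placeholder for field: " + field);
      }
    }
  }

  /** Replaces each known {field} placeholder with that field's value in a single pass. */
  static String buildPrompt(String template, Map<String, Object> doc, List<String> inputFields) {
    validateTemplate(template, inputFields);
    return PLACEHOLDER.matcher(template).replaceAll(match -> {
      String field = match.group(1);
      if (!inputFields.contains(field)) {
        // Not a configured input field: leave the text untouched.
        return Matcher.quoteReplacement(match.group());
      }
      Object value = doc.get(field);
      if (value == null) {
        throw new IllegalArgumentException("missing value for field: " + field);
      }
      // quoteReplacement keeps '$' and '\' in field values from being treated as group references.
      return Matcher.quoteReplacement(String.valueOf(value));
    });
  }

  public static void main(String[] args) {
    Map<String, Object> doc = new LinkedHashMap<>();
    doc.put("title", "Solr in Action");
    doc.put("body", "A guide to search.");
    String prompt =
        buildPrompt("Suggest tags for {title}: {body}", doc, List.of("title", "body"));
    System.out.println(prompt); // prints "Suggest tags for Solr in Action: A guide to search."
  }
}
```

The single-pass replacement means that placeholder-like text inside a field value is never substituted a second time. How the actual processor treats null or missing values is governed by its skipNullOrMissingFieldValues behaviour; the sketch simply throws.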
The implementation is modelled on the text-to-vector feature in the same module, to stay consistent with the conventions already used in the language-models module.
Note: this PR was developed with assistance from Claude Code (Anthropic).
Tests
Tests covering configuration validation (missing required params, conflicting params, invalid field types, placeholder mismatches) and processor initialization.
Tests covering single-valued and multi-valued output fields of all supported types, multi-input-field prompts, prompt file loading, error handling (model exceptions, ambiguous/malformed JSON responses, unsupported model types), and skipNullOrMissingFieldValues behaviour. All the supported models have been tested.
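As an illustration of what "output fields of all supported types" implies, the sketch below converts raw string values into typed values and rejects malformed ones. It is a hypothetical plain-Java example, not the PR's implementation: the PR relies on structured output through langchain4j, whereas the class `OutputCoercionSketch`, the `OutputType` enum, and the `coerce` and `coerceAll` methods are invented here, and the ISO-8601 date format is an assumption.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the PR's implementation.
public class OutputCoercionSketch {

  enum OutputType { STRING, INT, LONG, FLOAT, DOUBLE, BOOLEAN, DATE }

  /** Converts one raw value to the target type, throwing if it is malformed. */
  static Object coerce(String raw, OutputType type) {
    String s = raw.trim();
    switch (type) {
      case INT:
        return Integer.parseInt(s);
      case LONG:
        return Long.parseLong(s);
      case FLOAT:
        return Float.parseFloat(s);
      case DOUBLE:
        return Double.parseDouble(s);
      case BOOLEAN:
        // Boolean.parseBoolean would silently map anything else to false, so check explicitly.
        if (s.equalsIgnoreCase("true")) return Boolean.TRUE;
        if (s.equalsIgnoreCase("false")) return Boolean.FALSE;
        throw new IllegalArgumentException("not a boolean: " + raw);
      case DATE:
        // Assumes ISO-8601 instants such as 2024-01-15T10:00:00Z.
        return Instant.parse(s);
      default:
        return s;
    }
  }

  /** Multi-valued case: converts every value, failing on the first malformed one. */
  static List<Object> coerceAll(List<String> rawValues, OutputType type) {
    List<Object> out = new ArrayList<>();
    for (String raw : rawValues) {
      out.add(coerce(raw, type));
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(coerce("42", OutputType.INT));
    System.out.println(coerceAll(List.of("1.5", "2"), OutputType.DOUBLE));
  }
}
```

Failing on a malformed value, rather than defaulting it, mirrors the error-handling cases listed above, where ambiguous or malformed responses are treated as errors.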
Checklist
Please review the following and check all that apply:
main branch.
./gradlew check.