Question
· March 25, 2025

How to tokenize a text using SentenceTransformer?

Hi all.

I'm trying to create an indexed table with a vector field so that I can search by its value. I've been investigating and found that, to get the vector value from the text (token), you need to use a Python method like the following:

ClassMethod TokenizeData(desc As %String) As %String [ Language = python ]
{
    import iris
    # Step 2: Generate Document Embeddings
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('/opt/irisbuild/all-MiniLM-L6-v2')

    # Generate embeddings for each document
    document_embeddings = model.encode(desc)

    return document_embeddings.tolist()
}

The all-MiniLM-L6-v2 model was downloaded from https://ollama.com/library/all-minilm and installed in my Docker instance.

When I tried to test this method (from Visual Studio), it raised the following error:

<THROW>DebugStub+40^%Debugger.System.1 *%Exception.PythonException <PYTHON EXCEPTION> 246 <class 'OSError'>: It looks like the config file at '/opt/irisbuild/all-MiniLM-L6-v2/config.json' is not a valid JSON file.  

Then I changed the config.json file to make it a valid JSON file (I only wrote the curly braces) and repeated the test, but there is a new error.

<THROW>DebugStub+40^%Debugger.System.1 *%Exception.PythonException <PYTHON EXCEPTION> 246 <class 'safetensors_rust.SafetensorError'>: Error while deserializing header: HeaderTooSmall  

Does anyone know how to solve this problem?

Is there any other way to create the vector value so that I can index it?

Kind regards.

1 Comment
Question
· March 25, 2025

How to tokenize a text using SentenceTransformer?

Hi all.

I'm trying to create an indexed table with a vector field so I can search by the vector value.
I've been investigating and found that, to get the vector value from the text (token), I should use a Python method like the following:

ClassMethod TokenizeData(desc As %String) As %String [ Language = python ]
{
    import iris
    # Step 2: Generate Document Embeddings
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('/opt/irisbuild/all-MiniLM-L6-v2')

    # Generate embeddings for each document
    document_embeddings = model.encode(desc)

    return document_embeddings.tolist()
}

The all-MiniLM-L6-v2 model was downloaded from https://ollama.com/library/all-minilm and installed in my Docker instance.

When I tried to test this method (from Visual Studio), it threw the following error:

<THROW>DebugStub+40^%Debugger.System.1 *%Exception.PythonException <PYTHON EXCEPTION> 246 <class 'OSError'>: It looks like the config file at '/opt/irisbuild/all-MiniLM-L6-v2/config.json' is not a valid JSON file.  

Then I changed the config.json file to make it a valid JSON file (I only wrote the curly braces) and repeated the test, but got a new error:

<THROW>DebugStub+40^%Debugger.System.1 *%Exception.PythonException <PYTHON EXCEPTION> 246 <class 'safetensors_rust.SafetensorError'>: Error while deserializing header: HeaderTooSmall  
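
A note on the two errors above: SentenceTransformer, when given a local path, expects a directory in the Hugging Face layout (a complete config.json, tokenizer files, and weights such as model.safetensors). Ollama generally ships models in its own format, so the files under /opt/irisbuild/all-MiniLM-L6-v2 may not be what the library expects. Replacing config.json with a pair of braces would then only move the failure to the next file. The sketch below is a rough diagnostic under those assumptions, not part of any library: check_model_dir is a hypothetical helper that checks whether config.json parses and whether each .safetensors file is long enough to hold its 8-byte header length.

```python
import json
import struct
from pathlib import Path


def check_model_dir(model_dir):
    """Return a list of human-readable problems found in a local model directory."""
    problems = []
    root = Path(model_dir)

    # config.json must exist, parse as JSON, and actually contain settings.
    config_path = root / "config.json"
    try:
        config = json.loads(config_path.read_text(encoding="utf-8"))
        if not isinstance(config, dict) or not config:
            problems.append("config.json parses but is empty")
    except FileNotFoundError:
        problems.append("config.json is missing")
    except (json.JSONDecodeError, UnicodeDecodeError):
        problems.append("config.json is not valid JSON")

    # A safetensors file starts with an 8-byte little-endian integer giving
    # the length of the JSON header that follows it.
    for weights in sorted(root.glob("*.safetensors")):
        size = weights.stat().st_size
        with weights.open("rb") as fh:
            prefix = fh.read(8)
        if len(prefix) < 8:
            problems.append(f"{weights.name} is only {size} bytes (no header)")
            continue
        (header_len,) = struct.unpack("<Q", prefix)
        if header_len == 0 or header_len > size - 8:
            problems.append(
                f"{weights.name} declares a {header_len}-byte header "
                f"but holds {size} bytes"
            )
    return problems
```

Calling check_model_dir("/opt/irisbuild/all-MiniLM-L6-v2") inside the container returns an empty list only if both checks pass. A non-empty list suggests fetching the model again in Hugging Face format rather than patching individual files.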

Does anyone know how to fix this problem?
Is there any other way to create the vector value so I can index it?

Best regards.
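
On the second question, one possible approach (assuming your IRIS version includes the VECTOR datatype and the TO_VECTOR SQL function) is to have the method return the embedding as a comma-separated string instead of a Python list. That also matches the declared %String return type. The helper below is a hypothetical sketch of that formatting step only. It uses fixed-point notation as a conservative choice so that very small values are not written with an exponent.

```python
def to_vector_literal(values, decimals=8):
    """Format numbers as a comma-separated string such as '0.10000000,2.00000000'.

    `values` can be any iterable of numbers, for example the result of
    model.encode(desc) from the method in the post.
    """
    return ",".join(format(float(v), f".{decimals}f") for v in values)


# Inside the embedded Python method, the last line could then become
# (assumption, not tested against IRIS here):
#     return to_vector_literal(document_embeddings)
```

The returned string could then be passed as a parameter to TO_VECTOR in an INSERT or in a search query. The exact SQL depends on your table definition.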

2 Comments
Question
· March 25, 2025

Walking a virtual document's structure

Is there a generic process for "walking" the structure of a virtual document, e.g. an HL7 message (EnsLib.HL7.Message) or an XML document (EnsLib.EDI.XML.Document)?

At a minimum, we'd want to be able to visit all "nodes" (HL7 fields or sub-fields, XML nodes) in the virtual document and work out or generate the Property Path for each one (so we could call "GetValueAt").

We can just about come up with something generic for HL7, since it only nests down to 4 levels within each segment, though we're using numeric Property Paths at that point rather than symbolic ones (MSH:1.3 etc).
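
To illustrate the four-level idea (field, repetition, component, subcomponent), here is a minimal sketch in plain Python. It works on the raw HL7 v2 text and does not use the EnsLib classes. walk_hl7 is a hypothetical helper, it assumes the first segment is MSH, and the path strings it builds (segment number, then field, repetition, component, subcomponent) are only an approximation of numeric Property Paths that would need checking against what GetValueAt accepts.

```python
def walk_hl7(message):
    """Yield (path, value) for every non-empty leaf in an HL7 v2 message string."""
    segments = [s for s in message.replace("\n", "\r").split("\r") if s]
    # MSH-1 is the field separator itself; MSH-2 lists the other delimiters.
    field_sep = segments[0][3]
    comp_sep, rep_sep, _escape, sub_sep = segments[0][4:8]

    for seg_no, segment in enumerate(segments, start=1):
        fields = segment.split(field_sep)
        if fields[0] == "MSH":
            # Emit MSH-1 and MSH-2 whole; the remaining fields start at 3.
            yield f"{seg_no}:1", field_sep
            yield f"{seg_no}:2", fields[1]
            numbered = enumerate(fields[2:], start=3)
        else:
            numbered = enumerate(fields[1:], start=1)

        for f_no, field in numbered:
            reps = field.split(rep_sep)
            for r_no, rep in enumerate(reps, start=1):
                rep_part = f"({r_no})" if len(reps) > 1 else ""
                comps = rep.split(comp_sep)
                for c_no, comp in enumerate(comps, start=1):
                    subs = comp.split(sub_sep)
                    for s_no, sub in enumerate(subs, start=1):
                        if not sub:
                            continue  # skip empty leaves
                        path = f"{seg_no}:{f_no}{rep_part}"
                        if len(comps) > 1 or len(subs) > 1:
                            path += f".{c_no}"
                        if len(subs) > 1:
                            path += f".{s_no}"
                        yield path, sub
```

Using the segment's position rather than its name keeps paths unique when a segment type repeats (several OBX segments, for instance).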

Reading the documentation has not shed any light so far! Any pointers welcome. Thanks.
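
For the XML side, the same kind of walk can be sketched with Python's standard library. This is again outside EnsLib and only an illustration of generating one path per node. walk_xml is a hypothetical helper, and the XPath-like paths it builds (1-based index per sibling tag) are an assumption that may not match the syntax EnsLib.EDI.XML.Document expects.

```python
import xml.etree.ElementTree as ET


def walk_xml(xml_text):
    """Yield (path, text) for every element, depth first."""
    root = ET.fromstring(xml_text)

    def visit(elem, path):
        # Report this element with its own leading text, stripped.
        yield path, (elem.text or "").strip()
        seen = {}  # how many children with each tag have been visited so far
        for child in elem:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            yield from visit(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    yield from visit(root, f"/{root.tag}")
```

For namespaced documents, ElementTree reports tags as {namespace}name, so the paths would need mapping back to prefixes.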

7 Comments
Question
· March 25, 2025

Want to get text from the next line

I have text coming in a CCDA file as below:

<td>
<content ID="1234">
    virus vaccine
    <sup>1</sup>
</content>
</td>

I want to get the entire value inside the content tag, including the text on the following lines. Can anybody please help me work out how to do this?
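
As a language-neutral illustration of the idea, collecting the element's own text plus the text of every nested tag can be sketched in Python as follows. content_text is a hypothetical helper, not an IRIS API. It matches on the local tag name so it also works when the document declares a namespace, as CCDA files usually do.

```python
import xml.etree.ElementTree as ET


def content_text(xml_text, content_id):
    """Return all text inside <content ID=...>, including text of nested tags."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        local_name = elem.tag.rsplit("}", 1)[-1]  # drop any {namespace} prefix
        if local_name == "content" and elem.get("ID") == content_id:
            # itertext() yields the element's text plus the text and tail of
            # every descendant in document order; split/join collapses the
            # line breaks and indentation into single spaces.
            return " ".join("".join(elem.itertext()).split())
    return None


snippet = """<td>
<content ID="1234">
    virus vaccine
    <sup>1</sup>
</content>
</td>"""
print(content_text(snippet, "1234"))  # prints: virus vaccine 1
```

If the footnote marker inside sup should be excluded, the function would need to skip that child instead of using itertext() on the whole element.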

1 Comment
Question
· March 25, 2025

Distinct strings in MDX

Hi!

I have a question about MDX functionality in the context of IRIS Analytics.

How does distinct selection work in IRIS MDX? Are there any restrictions when analyzing strings, such as special symbols or length?

Here is an example

I have this data:

6 rows, 2 of them unique.

Then we create a data cube based on this model and examine it with Analyzer.

Detailed listing:

6 rows and ONE unique string, which is obviously not true.
This happens only with strings that contain symbols.

My task is to get the correct number of unique strings.

2 Comments