How to tokenize a text using SentenceTransformer?

Question 1

Question

Kurro Lopez · March 25, 2025

#Docker #JSON #Python #Vector Search #InterSystems IRIS

Hi all.

I'm trying to create an indexed table with a vector field so that I can search by its value. I've been investigating and found that, to get the vector value from the text (token), you need a Python method like the following:

ClassMethod TokenizeData(desc As %String) As %String [ Language = python ]
{
    import iris
    # Step 2: Generate Document Embeddings
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('/opt/irisbuild/all-MiniLM-L6-v2')

    # Generate embeddings for each document
    document_embeddings = model.encode(desc)

    return document_embeddings.tolist()
}

The all-MiniLM-L6-v2 model was downloaded from https://ollama.com/library/all-minilm and installed in my Docker instance.

When I tried to test this method (from Visual Studio), it threw the following error:

<THROW>DebugStub+40^%Debugger.System.1 *%Exception.PythonException <PYTHON EXCEPTION> 246 <class 'OSError'>: It looks like the config file at '/opt/irisbuild/all-MiniLM-L6-v2/config.json' is not a valid JSON file.

I then changed the config.json file to make it valid JSON (I wrote only the curly braces) and repeated the test, but got a new error:

<THROW>DebugStub+40^%Debugger.System.1 *%Exception.PythonException <PYTHON EXCEPTION> 246 <class 'safetensors_rust.SafetensorError'>: Error while deserializing header: HeaderTooSmall

Does anyone know how to fix this problem?

Is there any other way to create the vector value so I can index it?

Best regards.

1 Comment


Question 2

Question

Kurro Lopez · March 25, 2025

#Docker #JSON #Python #Vector Search #InterSystems IRIS

Hi all.

I'm trying to create an indexed table with a vector field so I can search by the vector value.
I've been investigating and found that, to get the vector value from the text (token), you need a Python method like the following:

ClassMethod TokenizeData(desc As %String) As %String [ Language = python ]
{
    import iris
    # Step 2: Generate Document Embeddings
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('/opt/irisbuild/all-MiniLM-L6-v2')

    # Generate embeddings for each document
    document_embeddings = model.encode(desc)

    return document_embeddings.tolist()
}

The all-MiniLM-L6-v2 model was downloaded from https://ollama.com/library/all-minilm and installed in my Docker instance.

When I tried to test this method (from Visual Studio), it threw the following error:

<THROW>DebugStub+40^%Debugger.System.1 *%Exception.PythonException <PYTHON EXCEPTION> 246 <class 'OSError'>: It looks like the config file at '/opt/irisbuild/all-MiniLM-L6-v2/config.json' is not a valid JSON file.

Then I changed the config.json file to make it valid JSON (I wrote only the curly braces) and repeated the test, but got a new error:

<THROW>DebugStub+40^%Debugger.System.1 *%Exception.PythonException <PYTHON EXCEPTION> 246 <class 'safetensors_rust.SafetensorError'>: Error while deserializing header: HeaderTooSmall

Does anyone know how to fix this problem?
Is there any other way to create the vector value so I can index it?

Best regards.
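
Both errors refer to files in the model directory rather than to the ClassMethod itself, so it may help to check those files before loading the model. Below is a minimal sketch that uses only the Python standard library. The function name check_model_dir and the weights file name model.safetensors are assumptions for illustration (neither appears in the post). The check relies on the safetensors layout, where a file starts with an 8-byte little-endian header length followed by a JSON header of that length.

```python
import json
import struct
from pathlib import Path


def check_model_dir(model_dir, weights_name="model.safetensors"):
    """Return a list of problems found in a local model directory.

    Hypothetical helper: the file names checked here are assumptions,
    not something the original post confirms.
    """
    problems = []
    root = Path(model_dir)

    # config.json must exist and parse as a non-empty JSON object.
    config_path = root / "config.json"
    if not config_path.is_file():
        problems.append("config.json is missing")
    else:
        try:
            config = json.loads(config_path.read_text(encoding="utf-8"))
        except ValueError:  # covers JSON and UTF-8 decoding errors
            problems.append("config.json is not valid JSON")
        else:
            if not isinstance(config, dict) or not config:
                problems.append("config.json is empty or not a JSON object")

    # A safetensors file starts with an 8-byte little-endian length N,
    # followed by N bytes of JSON describing the tensors.
    weights_path = root / weights_name
    if not weights_path.is_file():
        problems.append(weights_name + " is missing")
    else:
        size = weights_path.stat().st_size
        with weights_path.open("rb") as fh:
            prefix = fh.read(8)
            if len(prefix) < 8:
                problems.append(weights_name + " is too small to hold a header")
            else:
                (header_len,) = struct.unpack("<Q", prefix)
                if header_len > size - 8:
                    problems.append(weights_name + " declares a header longer than the file")
                else:
                    try:
                        json.loads(fh.read(header_len).decode("utf-8"))
                    except ValueError:
                        problems.append(weights_name + " header is not valid JSON")
    return problems
```

Calling check_model_dir('/opt/irisbuild/all-MiniLM-L6-v2') inside the container returns an empty list when both files pass these checks. A config.json that contains only curly braces parses, but this sketch still reports it as empty because it carries no model settings.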

2 Comments


Question 3

Question

Colin Brough · March 25, 2025

#HL7 #ObjectScript #Tips & Tricks #XML #Ensemble #Caché

Is there a generic process for "walking" the structure of a virtual document, e.g. an HL7 message (EnsLib.HL7.Message) or an XML document (EnsLib.EDI.XML.Document)?

At a minimum, we'd want to be able to visit all "nodes" (HL7 fields or sub-fields, XML nodes) in the virtual document and work out or generate the Property Path for each (so we could call "GetValueAt").

We can just about come up with something generic for HL7, since it only nests down to 4 levels within each segment, though we're using numeric Property Paths at that point rather than symbolic ones (MSH:1.3 etc).

Reading the documentation has so far not shed any light. Any pointers welcome. Thanks.
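
For the XML side, the general idea of visiting every node and generating a path for it can be sketched in plain Python. This is not the Ensemble virtual document API: the function name walk and the sample document are hypothetical, and the XPath-like paths it produces are only an illustration, not necessarily valid Property Paths for "GetValueAt".

```python
import xml.etree.ElementTree as ET
from collections import Counter


def walk(element, path=None):
    """Yield (path, text) for every element, depth first.

    Illustrative helper only; the path syntax is XPath-like and is an
    assumption, not the virtual document Property Path syntax.
    """
    if path is None:
        path = "/" + element.tag
    yield path, (element.text or "").strip()

    # Count siblings per tag so repeated tags get a 1-based index.
    totals = Counter(child.tag for child in element)
    seen = Counter()
    for child in element:
        seen[child.tag] += 1
        step = child.tag
        if totals[child.tag] > 1:
            step += "[%d]" % seen[child.tag]
        yield from walk(child, path + "/" + step)


doc = ET.fromstring("<order><item>a</item><item>b</item><note>n</note></order>")
for node_path, text in walk(doc):
    print(node_path, repr(text))
```

The same recursive shape (visit a node, emit its path, recurse into indexed children) would apply to HL7 segments, fields, and sub-fields, with the path separator and indexing rules swapped for the ones the target API expects.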

7 Comments


Question 4

Question

Fahima Ansari · March 25, 2025

#HL7 #Interoperability #InterSystems IRIS for Health #Caché #Ensemble

I have text coming in a CCDA file, as shown below:

<td>

virus vaccine

</content>

</td>

I want to get the entire value inside the content tag. Can anybody please help me work out how to get this?

1 Comment


Question 5

Question

Dmitrij Vladimirov · March 25, 2025

#Analytics #Analyzer #MDX #InterSystems IRIS #InterSystems IRIS BI (DeepSee)

Hi!

I have a question about MDX functionality in the context of IRIS Analytics.

How does distinct selection work in IRIS MDX? Are there any restrictions when analyzing strings, such as special symbols or length?

Here is an example

I have this data:

6 rows, 2 of them unique.

Then we create a data cube based on this model and examine it with Analyzer.

Detailed listing

6 rows and ONE unique string, which is obviously not true.
This happens only with strings that contain symbols.

My task is to get the correct number of unique strings.

2 Comments


Search

How to tokenize a text using SentenceTransformer?

How to tokenize a text using SentenceTransformer?

Walking a virtual document's structure

Want to get text from the next line

Distinct strings in MDX