Article
· April 30, 2025 8m read

Using vector search for duplicate patient detection

I recently had to refresh my knowledge of the HealthShare EMPI module, and since I've been tinkering with IRIS's vector storage and search functionality for a while, I just had to put two and two together.

For those of you who are not familiar with EMPI functionality, here's a short introduction:

Enterprise Master Patient Index

In general, all EMPIs work in a very similar way: they ingest information, normalize it and compare it with the data already present in the system. In HealthShare's EMPI, this process is known as NICE:

  • Normalization: all text ingested from the interoperability production is normalized by removing special characters.
  • Indexing: indexes are generated from a selection of demographic data to speed up the search for matches.
  • Comparison: the candidate matches found through the indexes are compared on their demographic data, and weights are assigned according to how closely each item agrees.
  • Evaluation: the possibility of linking the patients is evaluated from the sum of the weights obtained.
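
To make the four NICE steps more concrete, here is a minimal Python sketch of the general idea. It is not HealthShare's implementation: the field names, the weights, the blocking key and the linking threshold are all hypothetical values chosen for illustration.

```python
import re
import unicodedata

# Hypothetical weights per demographic field; a real EMPI uses configurable criteria.
WEIGHTS = {"name": 5.0, "birth_date": 4.0, "address": 2.0}
LINK_THRESHOLD = 8.0  # hypothetical cut-off for proposing a link


def normalize(text):
    """Normalization: upper-case, strip accents and drop special characters."""
    decomposed = unicodedata.normalize("NFKD", text.upper())
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^A-Z0-9 ]", "", without_accents).strip()


def index_key(patient):
    """Indexing: a simple key (birth date + first letter of the name) to narrow the search."""
    return patient["birth_date"] + normalize(patient["name"])[:1]


def compare(a, b):
    """Comparison: sum the weights of the fields that agree after normalization."""
    return sum(w for field, w in WEIGHTS.items()
               if normalize(a[field]) == normalize(b[field]))


def evaluate(score):
    """Evaluation: propose a link when the total weight reaches the threshold."""
    return score >= LINK_THRESHOLD


a = {"name": "Fernández López, José", "birth_date": "19700611",
     "address": "Paseo Juan Fernández 183"}
b = {"name": "FERNANDEZ LOPEZ JOSE", "birth_date": "19700611",
     "address": "Calle Mayor 1"}

if index_key(a) == index_key(b):       # only compare records sharing an index key
    score = compare(a, b)              # name and birth date agree: 5.0 + 4.0
    print(score, evaluate(score))      # 9.0 True
```

Choosing and adjusting those weights is exactly the tuning work the vector-based approach tries to avoid.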

If you want to know more about HealthShare Patient Index, you can review a series of articles that I wrote some time ago here.

What is the challenge?

While the setup process to obtain the possible links is not extremely complicated, I wondered: would it be possible to obtain similar results by using the vector storage and search functionality and skipping the weight-adjustment steps? Let's get to work!

What do I need to implement my idea?

I will need the following ingredients:

  • IRIS for Health to implement the functionality and use the interoperability engine to ingest HL7 messages.
  • A Python library to generate the vectors from the demographic data; in this case, sentence-transformers.
  • A model to generate the embeddings; after a quick search on Hugging Face I chose all-MiniLM-L6-v2.

Interoperability Production

The first step is to configure the production in charge of ingesting the HL7 messages, transforming them into messages with the patient's most relevant demographic data, generating the embeddings from that data and finally building the response message with the possible matches.

Let's take a look at our production:

As you can see, it couldn't be simpler. A Business Service, HL7FileService, retrieves HL7 messages from a directory (/shared/in) and sends each message to the Business Process EMPI.BP.FromHL7ToPatientRequestBPL, which creates the message with the patient's demographic data. That message is then sent to another BP, EMPI.BP.VectorizationBP, which vectorizes the demographic information and runs the vector search, returning a message with all the possible duplicate patients.

As you can see, the BP FromHL7ToPatientRequestBPL is very simple:

We transform the HL7 message into a message type that we created to hold the demographic data we consider most relevant.
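
As a rough illustration of that transformation, the sketch below pulls the four demographic strings out of a PID segment in plain Python. The production does this inside IRIS; the function name pid_to_patient and the way the components are joined with spaces are assumptions made for the example, and field repetitions (~) are not handled.

```python
# PID segment from the first sample message in this article, trimmed after PID-13.
SAMPLE_PID = ("PID|||1220395631^^^SERMAS^SN~402413^^^HULP^PI||FERNÁNDEZ LÓPEZ^JOSÉ MARÍA^^^||"
              "19700611|M|||PASEO JUAN FERNÁNDEZ^183 2 A^LEGANÉS^CÁDIZ^28566^SPAIN||"
              "555749170^PRN^^JOSE-MARIA.FERNANDEZ@GMAIL.COM|")


def pid_to_patient(pid_segment):
    """Build the four demographic strings from a PID segment (hypothetical helper)."""
    fields = pid_segment.split("|")  # fields[n] corresponds to PID-n

    def components(index):
        value = fields[index] if index < len(fields) else ""
        return [part for part in value.split("^") if part]  # skip empty components

    return {
        "Name": " ".join(components(5)),                             # PID-5 patient name
        "Address": " ".join(components(11)),                         # PID-11 address
        "Contact": " ".join(components(13)),                         # PID-13 phone and e-mail
        "BirthDateAndSex": " ".join(components(7) + components(8)),  # PID-7 and PID-8
    }


patient = pid_to_patient(SAMPLE_PID)
print(patient["Name"])             # FERNÁNDEZ LÓPEZ JOSÉ MARÍA
print(patient["BirthDateAndSex"])  # 19700611 M
```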

Messages between components

We have created two specific types of messages:

EMPI.Message.PatientRequest

Class EMPI.Message.PatientRequest Extends (Ens.Request, %XML.Adaptor)
{

Property Patient As EMPI.Object.Patient;
}

EMPI.Message.PatientResponse

Class EMPI.Message.PatientResponse Extends (Ens.Response, %XML.Adaptor)
{

Property Patients As list Of EMPI.Object.Patient;
}

This message type will contain a list of "possible" duplicate patients.

Let's see the definition of EMPI.Object.Patient class:

Class EMPI.Object.Patient Extends (%SerialObject, %XML.Adaptor)
{

Property Name As %String(MAXLEN = 1000);
Property Address As %String(MAXLEN = 1000);
Property Contact As %String(MAXLEN = 1000);
Property BirthDateAndSex As %String(MAXLEN = 100);
Property SimilarityName As %Double;
Property SimilarityAddress As %Double;
Property SimilarityContact As %Double;
Property SimilarityBirthDateAndSex As %Double;
}

Name, Address, Contact and BirthDateAndSex are the properties in which we will store the most relevant patient demographic data.

Now let's see the magic: the embedding generation and the vector search in the production.

EMPI.BP.VectorizationBP

Embedding and vector search

With the PatientRequest received, we generate the embeddings using a method written in Python:

Method VectorizePatient(name As %String, address As %String, contact As %String, birthDateAndSex As %String) As %String [ Language = python ]
{
    import iris
    import os
    import sentence_transformers

    try :
        # Download the model only once and cache it on disk, so later calls work offline
        if not os.path.isdir("/iris-shared/model/"):
            model = sentence_transformers.SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
            model.save('/iris-shared/model/')
        model = sentence_transformers.SentenceTransformer("/iris-shared/model/")
        # One unit-length embedding (normalize_embeddings=True) per demographic field
        embeddingName = model.encode(name, normalize_embeddings=True).tolist()
        embeddingAddress = model.encode(address, normalize_embeddings=True).tolist()
        embeddingContact = model.encode(contact, normalize_embeddings=True).tolist()
        embeddingBirthDateAndSex = model.encode(birthDateAndSex, normalize_embeddings=True).tolist()

        stmt = iris.sql.prepare("INSERT INTO EMPI_Object.PatientInfo (Name, Address, Contact, BirthDateAndSex, VectorizedName, VectorizedAddress, VectorizedContact, VectorizedBirthDateAndSex) VALUES (?,?,?,?, TO_VECTOR(?,DECIMAL), TO_VECTOR(?,DECIMAL), TO_VECTOR(?,DECIMAL), TO_VECTOR(?,DECIMAL))")
        rs = stmt.execute(name, address, contact, birthDateAndSex, str(embeddingName), str(embeddingAddress), str(embeddingContact), str(embeddingBirthDateAndSex))
        return "1"
    except Exception as err:
        iris.cls("Ens.Util.Log").LogInfo("EMPI.BP.VectorizationBP", "VectorizePatient", repr(err))
        return "0"
}

Let's analyze the code:

  • With the sentence-transformers library we fetch the all-MiniLM-L6-v2 model and save it to the local machine (to avoid further downloads over the Internet).
  • With the model loaded, we generate the embeddings for the demographic fields using the encode method.
  • Using the iris library we execute the INSERT statement that persists the patient's embeddings.
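
One detail worth spelling out: the encode calls pass normalize_embeddings=True, which returns unit-length vectors. For unit-length vectors the dot product equals the cosine similarity, which is why a plain dot product can be used as the similarity score. The sketch below checks this with small hand-made vectors in pure Python (no model involved); the helper names are illustrative only.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for embeddings.
a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]

similarity_raw = cosine(a, b)                             # cosine on the raw vectors
similarity_unit = dot(l2_normalize(a), l2_normalize(b))   # dot product on unit vectors

print(math.isclose(similarity_raw, similarity_unit))      # True
```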

Searching duplicated patients

Now that the patients are stored with the embeddings generated from their demographic data, let's query them!

Method OnRequest(pInput As EMPI.Message.PatientRequest, Output pOutput As EMPI.Message.PatientResponse) As %Status
{
    try{
        set result = ..VectorizePatient(pInput.Patient.Name, pInput.Patient.Address, pInput.Patient.Contact, pInput.Patient.BirthDateAndSex)
        set pOutput = ##class(EMPI.Message.PatientResponse).%New()
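        if (result = 1)
        {
            // p2 selects the stored row(s) matching the incoming patient; p1 ranges over every stored patient, so the patient always matches itself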
            set sql = "SELECT * FROM (SELECT p1.Name, p1.Address, p1.Contact, p1.BirthDateAndSex, VECTOR_DOT_PRODUCT(p1.VectorizedName, p2.VectorizedName) as SimilarityName, VECTOR_DOT_PRODUCT(p1.VectorizedAddress, p2.VectorizedAddress) as SimilarityAddress, "_
                    "VECTOR_DOT_PRODUCT(p1.VectorizedContact, p2.VectorizedContact) as SimilarityContact, VECTOR_DOT_PRODUCT(p1.VectorizedBirthDateAndSex, p2.VectorizedBirthDateAndSex) as SimilarityBirthDateAndSex "_
                    "FROM EMPI_Object.PatientInfo p1, EMPI_Object.PatientInfo p2 WHERE p2.Name = ? AND p2.Address = ?  AND p2.Contact = ? AND p2.BirthDateAndSex = ?) "_
                    "WHERE SimilarityName > 0.8 AND SimilarityAddress > 0.8 AND SimilarityContact > 0.8 AND SimilarityBirthDateAndSex > 0.8"
            set statement = ##class(%SQL.Statement).%New(), statement.%ObjectSelectMode = 1
            set status = statement.%Prepare(sql)
            if ($$$ISOK(status)) {
                set resultSet = statement.%Execute(pInput.Patient.Name, pInput.Patient.Address, pInput.Patient.Contact, pInput.Patient.BirthDateAndSex)
                if (resultSet.%SQLCODE = 0) {
                    // Copy each candidate and its four similarity scores into the response
                    while (resultSet.%Next() '= 0) {
                        set patient = ##class(EMPI.Object.Patient).%New()
                        set patient.Name = resultSet.%GetData(1)
                        set patient.Address = resultSet.%GetData(2)
                        set patient.Contact = resultSet.%GetData(3)
                        set patient.BirthDateAndSex = resultSet.%GetData(4)
                        set patient.SimilarityName = resultSet.%GetData(5)
                        set patient.SimilarityAddress = resultSet.%GetData(6)
                        set patient.SimilarityContact = resultSet.%GetData(7)
                        set patient.SimilarityBirthDateAndSex = resultSet.%GetData(8)
                        do pOutput.Patients.Insert(patient)
                    }
                }
            }
        }
    }
    catch ex {
        do ex.Log()
    }
    return $$$OK
}

Here is the query. For our example we have restricted the results to patients with a similarity greater than 0.8 on every demographic field, but the thresholds could be adjusted to tune the query.
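
As a sketch of what such tuning might look like, the filter can be written with one threshold per field instead of a single 0.8 for all of them. The Python below mirrors the WHERE clause outside SQL; the function name, the tuned values and the candidate scores are made up for illustration.

```python
# Same 0.8 cut-off as the query, expressed per field so each can be tuned separately.
DEFAULT_THRESHOLDS = {
    "SimilarityName": 0.8,
    "SimilarityAddress": 0.8,
    "SimilarityContact": 0.8,
    "SimilarityBirthDateAndSex": 0.8,
}


def is_possible_duplicate(similarities, thresholds=DEFAULT_THRESHOLDS):
    """True when every field's similarity is strictly above its threshold."""
    return all(similarities[field] > minimum for field, minimum in thresholds.items())


# Made-up scores for illustration only.
candidate = {
    "SimilarityName": 0.75,
    "SimilarityAddress": 0.97,
    "SimilarityContact": 0.99,
    "SimilarityBirthDateAndSex": 1.0,
}

print(is_possible_duplicate(candidate))         # False: the name score is not above 0.8

# Hypothetical tuning: tolerate more variation in names, less in birth date and sex.
tuned = {**DEFAULT_THRESHOLDS, "SimilarityName": 0.7, "SimilarityBirthDateAndSex": 0.95}
print(is_possible_duplicate(candidate, tuned))  # True
```

Loosening the name threshold while tightening the birth date one is only one possible trade-off; suitable values would have to be validated against real data.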

Let's try the example by feeding in the file messagesa28.hl7, which contains HL7 messages like these:

MSH|^~\&|HIS|HULP|EMPI||20241120103314||ADT^A28|269304|P|2.5.1
EVN|A28|20241120103314|20241120103314|1
PID|||1220395631^^^SERMAS^SN~402413^^^HULP^PI||FERNÁNDEZ LÓPEZ^JOSÉ MARÍA^^^||19700611|M|||PASEO JUAN FERNÁNDEZ^183 2 A^LEGANÉS^CÁDIZ^28566^SPAIN||555749170^PRN^^JOSE-MARIA.FERNANDEZ@GMAIL.COM|||||||||||||||||N|
PV1||N

MSH|^~\&|HIS|HULP|EMPI||20241120103314||ADT^A28|570814|P|2.5.1
EVN|A28|20241120103314|20241120103314|1
PID|||1122730333^^^SERMAS^SN~018565^^^HULP^PI||GONZÁLEZ GARCÍA^MARÍA^^^||19660812|F|||CALLE JOSÉ MARÍA FERNÁNDEZ^281 8 IZQUIERDA^MADRID^BARCELONA^28057^SPAIN||555386663^PRN^^MARIA.GONZALEZ@GMAIL.COM|||||||||||||||||N|
PV1||N
DG1|1||T001^TRAUMATISMOS SUPERF AFECTAN TORAX CON ABDOMEN, REG LUMBOSACRA Y PELVIS^||20241120|||||||||||^CONTRERAS ÁLVAREZ^ENRIQUETA^^^Dr|

MSH|^~\&|HIS|HULP|EMPI||20241120103314||ADT^A28|40613|P|2.5.1
EVN|A28|20241120103314|20241120103314|1
PID|||1007179467^^^SERMAS^SN~122688^^^HULP^PI||OLIVA OLIVA^JIMENA^^^||19620222|F|||CALLE ANTONIO ÁLVAREZ^51 3 D^MÉRIDA^MADRID^28253^SPAIN||555638305^PRN^^JIMENA.OLIVA@VODAFONE.COM|||||||||||||||||N|
PV1||N
DG1|1||Q059^ESPINA BIFIDA, NO ESPECIFICADA^||20241120|||||||||||^SANZ LÓPEZ^MARIO^^^Dr|

MSH|^~\&|HIS|HULP|EMPI||20241120103314||ADT^A28|61768|P|2.5.1
EVN|A28|20241120103314|20241120103314|1
PID|||1498973060^^^SERMAS^SN~719939^^^HULP^PI||PÉREZ CABEZUELA^DIANA^^^||19820309|F|||AVENIDA JULIA ÁLVAREZ^253 1 A^PERELLONET^BADAJOZ^28872^SPAIN||555705148^PRN^^DIANA.PEREZ@YAHOO.COM|||||||||||||||||N|
PV1||N
AL1|1|MA|^Polen de gramineas^|SV^^||20340919051523
MSH|^~\&|HIS|HULP|EMPI||20241120103314||ADT^A28|128316|P|2.5.1
EVN|A28|20241120103314|20241120103314|1
PID|||1632386689^^^SERMAS^SN~601379^^^HULP^PI||GARCÍA GARCÍA^MARIO^^^||19550603|M|||PASEO JOSÉ MARÍA TREVIÑO^153 3 D^JEREZ DE LA FRONTERA^MADRID^28533^SPAIN||555231628^PRN^^MARIO.GARCIA@GMAIL.COM|||||||||||||||||N|
PV1||N

In this file all the patients are different, so the result of the operation looks like this:

The only match is the patient itself. Now let's feed in the HL7 messages from messagesa28Duplicated.hl7, which contains duplicated patients:

As you can see, the code has detected the duplicated patient despite minor differences in the name (Maruchi is an affectionate nickname for María, and Mª is its abbreviation). This case is oversimplified, but it gives you an idea of what vector search can do to find duplicated data, not only for patients but for any other type of information.

Next steps...

For this example I used a general-purpose model to generate the embeddings, but the results could likely be improved by fine-tuning it with nicknames, monikers, etc.

Thank you for your attention!

Question
· April 30, 2025

Serialised classes, exports and development environments

We have classes in a Production environment that are causing us some issues - example attached.

When we export them from the production environment the XML contains a snippet like the following:

<UDLText name="T">
<Content><![CDATA[
//Property any As list Of %XML.String(XMLNAME = "any", XMLPROJECTION = "ANY") [ SqlFieldName = _any ];

]]></Content>
</UDLText>

When imported and compiled into an Ensemble instance this class works as expected.

When viewed/edited in a development environment we run into issues - the presentation is similar with both Studio and VS Code.

On first viewing in Studio the class source code displays like this:

If you use the extension in VS Code to view the class, or export it from there, the code again displays as above (and is saved in this form to the filesystem if exported).

However, if the file is modified and saved in Studio, or modified and saved or even simply compiled (import and compile) in VS Code then the source is serialised and returned to Studio/VS Code by Ensemble as:

From part of a previous conversation with @Brett Saviano  (Request textDocument/documentSymbol failed. Error: name must not be falsy · intersystems-community/vscode-objectscript · Discussion #1530) we understand this serialisation is the correct behaviour.

However, we are left with a few questions:

  1. Serialisation does not appear to happen when the Ensemble -> System Explorer -> Classes page exports the class - is this correct? (ie present in Ensemble with the comment as '//Property', then the exported class has '//Property' rather than '// Property')
  2. Is this likely to be the same reason that the "InterSystems" tab in VS Code exports the code without the space (they are using the same underlying mechanism)?
  3. Can anyone come up with any ways in which classes such as the one attached came to be in our Production Ensemble in the first place?
  4. We tried importing and re-exporting the class using $SYSTEM.OBJ - Load(file, "cuk") to import, and Export(class, file) to export. It did not modify/serialise the class - the '//Property' was left intact.
  5. And finally, is there any way of programmatically forcing this serialisation change and exporting the changed classes - our repo was created with the non-serialised versions of these classes, and now the classes with this issue sporadically show up as "phantom changes" - not relating to the development we are actually attempting to do!  
8 Comments
Question
· April 30, 2025

Clearing Web Sessions from Terminal

I ran into a situation where VS Code consumed all available web sessions and I was unable to get to the Management Console to clear them. I was able to establish a terminal session, though.

Is there a method or routine available through the IRIS terminal that allows one to clear web sessions? I've searched the administration and class documentation and haven't found a solution.

3 Comments
Article
· April 30, 2025 5m read

SQLAlchemy-iris with the latest version of the Python driver

After so many years of waiting, we finally have an official driver available on PyPI.

I also discovered that the JDBC driver has finally been available on Maven for 3 months already, and the .NET driver on NuGet for more than a month.

The driver implements DB-API, so its functions should at least be the ones defined by that standard. The only difference should be in the SQL.

The interesting part of using existing libraries is that they have already implemented support for other databases on top of the DB-API standard, so these libraries already expect a driver to work that way.

I decided to test the official InterSystems driver by implementing support for it in the SQLAlchemy-iris library.
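
To illustrate what a library can rely on when a driver follows DB-API (PEP 249), here is a minimal sketch that uses only calls defined by that standard: connect, cursor, execute, commit, fetchall and close. It runs against the standard library's sqlite3 module purely as a stand-in; it is not the InterSystems driver, and the helper name and table are hypothetical. Note that the placeholder style (the "?" below) is declared by each driver through its paramstyle attribute and can differ between drivers.

```python
import sqlite3  # stand-in DB-API driver from the standard library


def run_dbapi_roundtrip(connect, *connect_args):
    """Create, fill and read a table using only PEP 249 calls (hypothetical helper)."""
    connection = connect(*connect_args)
    try:
        cursor = connection.cursor()
        cursor.execute("CREATE TABLE patient (name VARCHAR(50))")
        cursor.execute("INSERT INTO patient (name) VALUES (?)", ("MARIA",))
        connection.commit()
        cursor.execute("SELECT name FROM patient")
        return cursor.fetchall()
    finally:
        connection.close()


rows = run_dbapi_roundtrip(sqlite3.connect, ":memory:")
print(rows)  # [('MARIA',)]
```

Because the helper only touches the standard interface, swapping in another compliant driver would, in principle, only require changing the connect callable, its arguments and possibly the SQL dialect.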

Question
· April 29, 2025

What is the REAL content of $THIS (because it seems, $THIS is not always the expected $THIS)?

According to documentation, quotation: "$THIS contains the current class context.
The class context for an instance method is the current object reference (OREF).
The class context for a class method is the current classname as a string value."
 
As my example below shows, either the documentation or the implementation (or both) is wrong: I always call a class method (Value) and expect the class name as the return value, but I get either the class name or an OREF. Moreover, if I repeat the call, I get different values. But why?
Does anyone have a clear explanation (for an aging brain) or have I misunderstood something?

Class DC.ValueOfThis Extends %RegisteredObject
{

ClassMethod Test()
{
	write $zv,!!
	set obj=..%New()
	do obj.Work()
	write "From classmethod: ",$this," ",$this," ",..Value()," ",..Value()," ",..Value()," ",$this,!
	do obj.Work()
}

Method Work()
{
	write "From inst.method: ",$this," ",$this," ",..Value()," ",..Value()," ",..Value()," ",$this,!
}

ClassMethod Value()
{
	quit $this
}

}

And the test output is:

USER>

USER>d ##class(DC.ValueOfThis).Test()
IRIS for UNIX (Ubuntu Server LTS for x86-64) 2021.2 (Build 649U) Thu Jan 20 2022 08:49:51 EST

From inst.method: 1@DC.ValueOfThis 1@DC.ValueOfThis DC.ValueOfThis 1@DC.ValueOfThis 1@DC.ValueOfThis 1@DC.ValueOfThis
From classmethod: DC.ValueOfThis DC.ValueOfThis DC.ValueOfThis DC.ValueOfThis DC.ValueOfThis DC.ValueOfThis
From inst.method: 1@DC.ValueOfThis 1@DC.ValueOfThis 1@DC.ValueOfThis 1@DC.ValueOfThis 1@DC.ValueOfThis 1@DC.ValueOfThis

USER>
4 Comments