JedAI | Perform any data integration task with a highly scalable toolkit for record linkage and entity resolution

The Force behind Entity Resolution

Open Source | Free | Modular Architecture

LEARN MORE

Perform any data integration task with the generic Java data integration toolkit

Open source library

JedAI implements numerous state-of-the-art methods for all steps of an established end-to-end ER workflow.

View source on GitHub

Desktop application for Entity Resolution

JedAI offers an intuitive Graphical User Interface that is suitable for both expert and lay users.

View source on GitHub

Workbench tool

JedAI can be used as a workbench for comparing all performance aspects of various (configurations of) end-to-end ER workflows.

How does the JedAI Toolkit work?

JedAI comprises a set of domain-independent, state-of-the-art techniques. At their core lies an approximate, schema-agnostic functionality based on blocking, which ensures high scalability. In more detail, JedAI supports the following functionalities, grouped into 7 modules:

Step 1 - Data Reading

The Data Reading module contains Java classes that transform input data into a list of entity profiles. Input data can be read from various sources: CSV and XML files, RDF data and SQL databases. A set of input datasets to play with is available in the datasets folder on GitHub.

 

Example: Reading data from a CSV file using EntityCSVReader:

// Path to the input CSV file
String filePath = "C:\\Users\\G.A.P. II\\Downloads\\cd.csv";

EntityCSVReader csvReader = new EntityCSVReader(filePath);
csvReader.setAttributeNamesInFirstRow(true);    // the first row contains the attribute names
csvReader.setSeparator(';');                    // values are separated by semicolons
csvReader.setAttributesToExclude(new int[]{1}); // ignore the second column
csvReader.setIdIndex(0);                        // the first column contains the entity ids

List<EntityProfile> profiles = csvReader.getEntityProfiles();

// Serialize the extracted profiles so that the next steps can reuse them
csvReader.storeSerializedObject(profiles, "C:\\Users\\G.A.P. II\\Downloads\\cddbProfiles");

In this example, input data are read from a comma-separated values (CSV) file, here using ';' as the separator. The extracted entity profiles are serialized to a file named cddbProfiles.

Step 2 - Block Building

During this step, overlapping blocks are created. The Block Building module clusters entities into blocks in a lazy manner that relies on unsupervised blocking keys: every token in an attribute value forms a key. Blocks are then extracted, possibly after a transformation, based on the equality of each key or on its similarity with other keys.

You can choose 1 out of 8 methods:

  • Block building methods

    • Token Blocking
    • Sorted Neighborhood
    • Extended Sorted Neighborhood
    • Attribute Clustering
    • Q-Grams Blocking
    • Extended Q-Grams Blocking
    • Suffix Arrays
    • Extended Suffix Arrays

Example: Demonstration of block building using all available algorithms

String entitiesFilePath = "C:\\Users\\G.A.P. II\\Downloads\\cddbProfiles";
String groundTruthFilePath = "C:\\Users\\G.A.P. II\\Downloads\\cddbDuplicates";

IEntityReader eReader = new EntitySerializationReader(entitiesFilePath);
List<EntityProfile> profiles = eReader.getEntityProfiles();
System.out.println("Input Entity Profiles\t:\t" + profiles.size());

IGroundTruthReader gtReader = new GtSerializationReader(groundTruthFilePath);
final AbstractDuplicatePropagation duplicatePropagation = new UnilateralDuplicatePropagation(gtReader.getDuplicatePairs(eReader.getEntityProfiles()));
System.out.println("Existing Duplicates\t:\t" + duplicatePropagation.getDuplicates().size());

for (BlockBuildingMethod blbuMethod : BlockBuildingMethod.values()) {
    double time1 = System.currentTimeMillis();

    System.out.println("\n\nCurrent blocking method\t:\t" + blbuMethod);
    IBlockBuilding blockBuildingMethod = BlockBuildingMethod.getDefaultConfiguration(blbuMethod);

    System.out.println("Block Building...");
    List<AbstractBlock> blocks = blockBuildingMethod.getBlocks(profiles, null);
    double time2 = System.currentTimeMillis();

    // Measure the quality of the generated blocks against the ground truth
    BlocksPerformance blStats = new BlocksPerformance(blocks, duplicatePropagation);
    blStats.setStatistics();
    blStats.printStatistics(time2 - time1, blockBuildingMethod.getMethodConfiguration(), blockBuildingMethod.getMethodName());
}

In this example, the entity profiles produced in the previous step (file cddbProfiles) are read back, and a set of blocks is generated with every available block building method.

Step 3 - Block Cleaning

This is an optional step that cleans the blocks from useless comparisons (repeated or superfluous ones). You can specify any combination of the 3 complementary methods for Dirty ER, or of all 4 for Clean-Clean ER:

  • Block cleaning methods

    • Block Filtering
    • Size-based Block Purging
    • Cardinality-based Block Purging
    • Block Scheduling

Example: Demonstration of block cleaning:

String entitiesFilePath = "C:\\Users\\G.A.P. II\\Downloads\\cddbProfiles";
String groundTruthFilePath = "C:\\Users\\G.A.P. II\\Downloads\\cddbDuplicates";

IEntityReader eReader = new EntitySerializationReader(entitiesFilePath);
List<EntityProfile> profiles = eReader.getEntityProfiles();
System.out.println("Input Entity Profiles\t:\t" + profiles.size());

IGroundTruthReader gtReader = new GtSerializationReader(groundTruthFilePath);
final AbstractDuplicatePropagation duplicatePropagation = new UnilateralDuplicatePropagation(gtReader.getDuplicatePairs(eReader.getEntityProfiles()));
System.out.println("Existing Duplicates\t:\t" + duplicatePropagation.getDuplicates().size());

for (BlockBuildingMethod blbuMethod : BlockBuildingMethod.values()) {
    double time1 = System.currentTimeMillis();

    StringBuilder workflowConf = new StringBuilder();
    StringBuilder workflowName = new StringBuilder();

    System.out.println("\n\nCurrent blocking method\t:\t" + blbuMethod);

    IBlockBuilding blockBuildingMethod = BlockBuildingMethod.getDefaultConfiguration(blbuMethod);
    List<AbstractBlock> blocks = blockBuildingMethod.getBlocks(profiles, null);

    workflowConf.append(blockBuildingMethod.getMethodConfiguration());
    workflowName.append(blockBuildingMethod.getMethodName());
    System.out.println("Original blocks\t:\t" + blocks.size());

    // Apply the default block cleaning method that complements this block building method
    IBlockProcessing blockCleaningMethod = BlockBuildingMethod.getDefaultBlockCleaning(blbuMethod);
    if (blockCleaningMethod != null) {
        blocks = blockCleaningMethod.refineBlocks(blocks);
        workflowConf.append("\n").append(blockCleaningMethod.getMethodConfiguration());
        workflowName.append("->").append(blockCleaningMethod.getMethodName());
    }

    double time2 = System.currentTimeMillis();

    BlocksPerformance blStats = new BlocksPerformance(blocks, duplicatePropagation);
    blStats.setStatistics();
    blStats.printStatistics(time2 - time1, workflowConf.toString(), workflowName.toString());
}

Step 4 - Comparison Cleaning

This is an optional step that operates on the level of individual comparisons to remove the useless ones. You can choose 1 out of 7 methods (including Meta-blocking):

  • Comparison Cleaning Methods

    • Comparison Propagation
    • Cardinality Edge Pruning (CEP)
    • Cardinality Node Pruning (CNP)
    • Weighted Edge Pruning (WEP)
    • Weighted Node Pruning (WNP)
    • Reciprocal CNP
    • Reciprocal WNP

Example: Demonstration of comparison cleaning

String entitiesFilePath = "C:\\Users\\G.A.P. II\\Downloads\\cddbProfiles";
String groundTruthFilePath = "C:\\Users\\G.A.P. II\\Downloads\\cddbDuplicates";

IEntityReader eReader = new EntitySerializationReader(entitiesFilePath);
List<EntityProfile> profiles = eReader.getEntityProfiles();
System.out.println("Input Entity Profiles\t:\t" + profiles.size());

IGroundTruthReader gtReader = new GtSerializationReader(groundTruthFilePath);
final AbstractDuplicatePropagation duplicatePropagation = new UnilateralDuplicatePropagation(gtReader.getDuplicatePairs(eReader.getEntityProfiles()));
System.out.println("Existing Duplicates\t:\t" + duplicatePropagation.getDuplicates().size());

for (BlockBuildingMethod blbuMethod : BlockBuildingMethod.values()) {
    double time1 = System.currentTimeMillis();

    StringBuilder workflowConf = new StringBuilder();
    StringBuilder workflowName = new StringBuilder();

    System.out.println("\n\nCurrent blocking method\t:\t" + blbuMethod);

    IBlockBuilding blockBuildingMethod = BlockBuildingMethod.getDefaultConfiguration(blbuMethod);
    List<AbstractBlock> blocks = blockBuildingMethod.getBlocks(profiles, null);

    workflowConf.append(blockBuildingMethod.getMethodConfiguration());
    workflowName.append(blockBuildingMethod.getMethodName());
    System.out.println("Original blocks\t:\t" + blocks.size());

    IBlockProcessing blockCleaningMethod = BlockBuildingMethod.getDefaultBlockCleaning(blbuMethod);
    if (blockCleaningMethod != null) {
        blocks = blockCleaningMethod.refineBlocks(blocks);
        workflowConf.append("\n").append(blockCleaningMethod.getMethodConfiguration());
        workflowName.append("->").append(blockCleaningMethod.getMethodName());
    }

    // Remove useless comparisons from the remaining blocks
    IBlockProcessing comparisonCleaningMethod = BlockBuildingMethod.getDefaultComparisonCleaning(blbuMethod);
    if (comparisonCleaningMethod != null) {
        blocks = comparisonCleaningMethod.refineBlocks(blocks);
        workflowConf.append("\n").append(comparisonCleaningMethod.getMethodConfiguration());
        workflowName.append("->").append(comparisonCleaningMethod.getMethodName());
    }

    double time2 = System.currentTimeMillis();

    BlocksPerformance blStats = new BlocksPerformance(blocks, duplicatePropagation);
    blStats.setStatistics();
    blStats.printStatistics(time2 - time1, workflowConf.toString(), workflowName.toString());
}

Step 5 - Entity Matching

The Entity Matching module compares pairs of entity profiles, associating every pair with a similarity in [0,1]. Two schema-agnostic methods are implemented: Group Linkage and Profile Matcher.

 

Example: Group Linkage for entity matching

String entitiesFilePath1 = "/home/ethanos/Downloads/JEDAIfiles/im-identity/oaei2014_identity_aPROFILES";
String groundTruthFilePath = "/home/ethanos/Downloads/JEDAIfiles/cddbTestDuplicates";

IEntityReader eReader1 = new EntitySerializationReader(entitiesFilePath1);
List<EntityProfile> profiles1 = eReader1.getEntityProfiles();
System.out.println("Input Entity Profiles\t:\t" + profiles1.size());

IGroundTruthReader gtReader = new GtSerializationReader(groundTruthFilePath);
final AbstractDuplicatePropagation duplicatePropagation = new UnilateralDuplicatePropagation(gtReader.getDuplicatePairs(eReader1.getEntityProfiles()));
System.out.println("Existing Duplicates\t:\t" + duplicatePropagation.getDuplicates().size());

for (BlockBuildingMethod blbuMethod : BlockBuildingMethod.values()) {
    // Restrict this demonstration to Attribute Clustering
    if (blbuMethod.equals(BlockBuildingMethod.ATTRIBUTE_CLUSTERING)) {
        System.out.println("\n\nCurrent blocking method\t:\t" + blbuMethod);
        IBlockBuilding blockBuildingMethod = BlockBuildingMethod.getDefaultConfiguration(blbuMethod);
        List<AbstractBlock> blocks = blockBuildingMethod.getBlocks(profiles1);
        System.out.println("Original blocks\t:\t" + blocks.size());

        IBlockProcessing blockCleaningMethod = BlockBuildingMethod.getDefaultBlockCleaning(blbuMethod);
        if (blockCleaningMethod != null) {
            blocks = blockCleaningMethod.refineBlocks(blocks);
        }

        IBlockProcessing comparisonCleaningMethod = BlockBuildingMethod.getDefaultComparisonCleaning(blbuMethod);
        if (comparisonCleaningMethod != null) {
            blocks = comparisonCleaningMethod.refineBlocks(blocks);
        }

        long start = System.nanoTime();
        for (RepresentationModel model : RepresentationModel.values()) {
            // Restrict this demonstration to the character bigrams model
            if (model.equals(RepresentationModel.CHARACTER_BIGRAMS)) {
                GroupLinkage gp = new GroupLinkage(model, SimilarityMetric.getModelDefaultSimMetric(model));
                gp.setSimilarityThreshold(0.1);
                SimilarityPairs simPairs = gp.executeComparisons(blocks, profiles1);

                for (int i = 0; i < simPairs.getNoOfComparisons(); i++) {
                    // process the i-th comparison here
                }
            }
        }
        long elapsedTime = System.nanoTime() - start;
        System.out.println("time=" + elapsedTime / 1000000000.0);
    }
}

 

Example: Profile Matcher for entity matching

String entitiesFilePath = "C:\\Users\\G.A.P. II\\Downloads\\cddbProfiles";
String groundTruthFilePath = "C:\\Users\\G.A.P. II\\Downloads\\cddbDuplicates";

IEntityReader eReader = new EntitySerializationReader(entitiesFilePath);
List<EntityProfile> profiles = eReader.getEntityProfiles();
System.out.println("Input Entity Profiles\t:\t" + profiles.size());

IGroundTruthReader gtReader = new GtSerializationReader(groundTruthFilePath);
final AbstractDuplicatePropagation duplicatePropagation = new UnilateralDuplicatePropagation(gtReader.getDuplicatePairs(eReader.getEntityProfiles()));
System.out.println("Existing Duplicates\t:\t" + duplicatePropagation.getDuplicates().size());

for (BlockBuildingMethod blbuMethod : BlockBuildingMethod.values()) {
    System.out.println("\n\nCurrent blocking method\t:\t" + blbuMethod);
    IBlockBuilding blockBuildingMethod = BlockBuildingMethod.getDefaultConfiguration(blbuMethod);
    List<AbstractBlock> blocks = blockBuildingMethod.getBlocks(profiles, null);
    System.out.println("Original blocks\t:\t" + blocks.size());

    IBlockProcessing blockCleaningMethod = BlockBuildingMethod.getDefaultBlockCleaning(blbuMethod);
    if (blockCleaningMethod != null) {
        blocks = blockCleaningMethod.refineBlocks(blocks);
    }

    IBlockProcessing comparisonCleaningMethod = BlockBuildingMethod.getDefaultComparisonCleaning(blbuMethod);
    if (comparisonCleaningMethod != null) {
        blocks = comparisonCleaningMethod.refineBlocks(blocks);
    }

    for (RepresentationModel model : RepresentationModel.values()) {
        IEntityMatching pm = new ProfileMatcher(model, SimilarityMetric.getModelDefaultSimMetric(model));
        SimilarityPairs simPairs = pm.executeComparisons(blocks, profiles);
        // Print the first 10 comparisons (or fewer, if less than 10 were executed)
        for (int i = 0; i < Math.min(10, simPairs.getNoOfComparisons()); i++) {
            System.out.println(simPairs.getEntityIds1()[i] + "\t\t" + simPairs.getEntityIds2()[i] + "\t\t" + simPairs.getSimilarities()[i]);
        }
    }
}

Both entity matching methods aggregate all attribute values of an individual entity into a textual representation, based on one of the following bag and graph models (a minimal pairing sketch follows the lists below):

      • character n-grams (n=2, 3 or 4)
      • character n-gram graphs (n=2, 3 or 4)
      • token n-grams (n=1, 2 or 3)
      • token n-gram graphs (n=1, 2 or 3)

 

The bag models can be combined with the following similarity measures, using term-frequency weights:

      • Cosine similarity
      • Jaccard similarity
      • Generalized Jaccard similarity
      • Enhanced Jaccard similarity

 

The graph models can be combined with the following graph similarity measures:

      • Containment similarity
      • Normalized Value similarity
      • Value similarity
      • Overall Graph similarity
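
As a quick illustration of pairing a model with a measure, the sketch below reuses only names that appear in the examples above (ProfileMatcher, CHARACTER_BIGRAMS, JACCARD_SIMILARITY, getModelDefaultSimMetric); the blocks and profiles variables are assumed to come from the preceding steps:

// Minimal sketch: pairing a bag model with a similarity measure.
RepresentationModel bagModel = RepresentationModel.CHARACTER_BIGRAMS;

// Explicitly pair the bag model with Jaccard similarity...
IEntityMatching explicitMatcher = new ProfileMatcher(bagModel, SimilarityMetric.JACCARD_SIMILARITY);

// ...or let JedAI select the default measure that is compatible with the model
IEntityMatching defaultMatcher = new ProfileMatcher(bagModel, SimilarityMetric.getModelDefaultSimMetric(bagModel));

SimilarityPairs simPairs = explicitMatcher.executeComparisons(blocks, profiles);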

Step 6 - Entity Clustering

The Entity Clustering module uses the similarities produced by Entity Matching to create the similarity graph, i.e., an undirected, weighted graph where the nodes correspond to entities and the edges connect pairs of compared entities. The similarity graph is then partitioned into a set of equivalence clusters, with every cluster corresponding to a distinct real-world object.

 

Example: Entity Clustering module

BlockBuildingMethod blockingWorkflow = BlockBuildingMethod.STANDARD_BLOCKING;

String[] datasetProfiles = {
    "/home/ethanos/workspace/JedAIToolkitNew/datasets/restaurantProfiles", // "E:\\Data\\csvProfiles\\censusProfiles",
    // "E:\\Data\\csvProfiles\\coraProfiles",
    // "E:\\Data\\csvProfiles\\cddbProfiles",
    // "E:\\Data\\csvProfiles\\abt-buy\\dataset",
    // "E:\\Data\\csvProfiles\\amazon-gp\\dataset",
    // "E:\\Data\\csvProfiles\\dblp-acm\\dataset",
    // "E:\\Data\\csvProfiles\\dblp-scholar\\dataset",
    // "E:\\Data\\csvProfiles\\movies\\dataset"
};
String[] datasetGroundtruth = {
    "/home/ethanos/workspace/JedAIToolkitNew/datasets/restaurantIdDuplicates", // "E:\\Data\\csvProfiles\\censusIdDuplicates",
    // "E:\\Data\\csvProfiles\\coraIdDuplicates",
    // "E:\\Data\\csvProfiles\\cddbIdDuplicates",
    // "E:\\Data\\csvProfiles\\abt-buy\\groundtruth",
    // "E:\\Data\\csvProfiles\\amazon-gp\\groundtruth",
    // "E:\\Data\\csvProfiles\\dblp-acm\\groundtruth",
    // "E:\\Data\\csvProfiles\\dblp-scholar\\groundtruth",
    // "E:\\Data\\csvProfiles\\movies\\groundtruth"
};

for (int datasetId = 0; datasetId < datasetProfiles.length; datasetId++) {
    System.out.println("\n\n\n\n\nCurrent dataset id\t:\t" + datasetId);

    StringBuilder blockingWorkflowConf = new StringBuilder();
    StringBuilder blockingWorkflowName = new StringBuilder();
    StringBuilder matchingWorkflowConf = new StringBuilder();
    StringBuilder matchingWorkflowName = new StringBuilder();

    IEntityReader eReader = new EntitySerializationReader(datasetProfiles[datasetId]);
    List<EntityProfile> profiles = eReader.getEntityProfiles();
    System.out.println("Input Entity Profiles\t:\t" + profiles.size());

    IGroundTruthReader gtReader = new GtSerializationReader(datasetGroundtruth[datasetId]);
    final AbstractDuplicatePropagation duplicatePropagation = new UnilateralDuplicatePropagation(gtReader.getDuplicatePairs(eReader.getEntityProfiles()));
    System.out.println("Existing Duplicates\t:\t" + duplicatePropagation.getDuplicates().size());

    double time1 = System.currentTimeMillis();

    // Steps 2-4: block building, block cleaning and comparison cleaning
    IBlockBuilding blockBuildingMethod = BlockBuildingMethod.getDefaultConfiguration(blockingWorkflow);
    List<AbstractBlock> blocks = blockBuildingMethod.getBlocks(profiles, null);
    System.out.println("Original blocks\t:\t" + blocks.size());

    blockingWorkflowConf.append(blockBuildingMethod.getMethodConfiguration());
    blockingWorkflowName.append(blockBuildingMethod.getMethodName());

    IBlockProcessing blockCleaningMethod = BlockBuildingMethod.getDefaultBlockCleaning(blockingWorkflow);
    if (blockCleaningMethod != null) {
        blocks = blockCleaningMethod.refineBlocks(blocks);
        blockingWorkflowConf.append("\n").append(blockCleaningMethod.getMethodConfiguration());
        blockingWorkflowName.append("->").append(blockCleaningMethod.getMethodName());
    }

    IBlockProcessing comparisonCleaningMethod = BlockBuildingMethod.getDefaultComparisonCleaning(blockingWorkflow);
    if (comparisonCleaningMethod != null) {
        blocks = comparisonCleaningMethod.refineBlocks(blocks);
        blockingWorkflowConf.append("\n").append(comparisonCleaningMethod.getMethodConfiguration());
        blockingWorkflowName.append("->").append(comparisonCleaningMethod.getMethodName());
    }

    double time2 = System.currentTimeMillis();

    BlocksPerformance blp = new BlocksPerformance(blocks, duplicatePropagation);
    blp.setStatistics();
    blp.printStatistics(time2 - time1, blockingWorkflowConf.toString(), blockingWorkflowName.toString());

    // Step 5: entity matching (alternatively, iterate over all RepresentationModel values)
    RepresentationModel repModel = RepresentationModel.CHARACTER_BIGRAMS;
    System.out.println("\n\nCurrent model\t:\t" + repModel.toString() + "\t\t" + SimilarityMetric.getModelDefaultSimMetric(repModel));
    IEntityMatching em = new ProfileMatcher(repModel, SimilarityMetric.JACCARD_SIMILARITY);
    SimilarityPairs simPairs = em.executeComparisons(blocks, profiles);

    matchingWorkflowConf.append(em.getMethodConfiguration());
    matchingWorkflowName.append(em.getMethodName());

    // Step 6: entity clustering
    IEntityClustering ec = new RicochetSRClustering();
    ec.setSimilarityThreshold(0.1);
    List<EquivalenceCluster> entityClusters = ec.getDuplicates(simPairs);

    matchingWorkflowConf.append("\n").append(ec.getMethodConfiguration());
    matchingWorkflowName.append("->").append(ec.getMethodName());

    double time3 = System.currentTimeMillis();

    // Store the resulting clusters as a CSV file
    try {
        PrintToFile.toCSV(entityClusters, "/home/ethanos/workspace/JedAIToolkitNew/rest.csv");
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }

    ClustersPerformance clp = new ClustersPerformance(entityClusters, duplicatePropagation);
    clp.setStatistics();
    clp.printStatistics(time3 - time2, matchingWorkflowConf.toString(), matchingWorkflowName.toString());
}

The following domain-independent methods are currently supported (a short sketch of swapping one of them into the workflow follows the list):

      • Center Clustering
      • Connected Components Clustering
      • Cut Clustering
      • Markov Clustering
      • Merge-Center Clustering
      • Ricochet SR Clustering
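
Any of these algorithms can replace RicochetSRClustering in the example above, since they all implement the IEntityClustering interface. A minimal sketch follows; the class name ConnectedComponentsClustering is assumed to match the naming convention of the other clustering classes, so verify it against the repository. The simPairs variable comes from Step 5.

// Minimal sketch: swapping in a different clustering algorithm.
// ConnectedComponentsClustering is an assumed class name; check the source code.
IEntityClustering ec = new ConnectedComponentsClustering();
ec.setSimilarityThreshold(0.5); // graph edges below this weight are ignored

List<EquivalenceCluster> entityClusters = ec.getDuplicates(simPairs);
System.out.println("Equivalence clusters\t:\t" + entityClusters.size());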

 

Step 7 - Evaluation & Scoring

During this step, performance results w.r.t. numerous measures can be computed, presented and stored (as a CSV file).
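
The evaluation classes already appear in the examples above; the sketch below merely gathers them in one place. It assumes the blocks, entityClusters and duplicatePropagation variables from the previous steps; the timing variables (blockingTime, matchingTime) and the output path are hypothetical placeholders.

// Block-level evaluation (after Steps 2-4); blockingTime is a placeholder for the measured overhead
BlocksPerformance blp = new BlocksPerformance(blocks, duplicatePropagation);
blp.setStatistics();
blp.printStatistics(blockingTime, blockingWorkflowConf.toString(), blockingWorkflowName.toString());

// Cluster-level evaluation (after Step 6); matchingTime is likewise a placeholder
ClustersPerformance clp = new ClustersPerformance(entityClusters, duplicatePropagation);
clp.setStatistics();
clp.printStatistics(matchingTime, matchingWorkflowConf.toString(), matchingWorkflowName.toString());

// Store the resulting equivalence clusters as a CSV file (placeholder path)
try {
    PrintToFile.toCSV(entityClusters, "/tmp/entityClusters.csv");
} catch (FileNotFoundException e) {
    e.printStackTrace();
}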

Roadmap

Version 2.0 | Available at the end of September, 2017.
Includes support for SPARQL endpoints, multicore functionality and configuration optimization.

 

Version 3.0 | Available at the end of December, 2017.
Includes support for ontology matching, progressive ER as well as a workflow builder.

 

Version 4.0 | Available at the end of December, 2018.
Ports all functionality to Apache Spark.