Why does OCR separate Text into Words?
Hi all,
I am trying to retrieve specific text from scanned documents containing tables of numbers. Since the number of columns in the table can vary, I use the following approach:
1 – detect the units of measure with the ocr function,
2 – from the units I need (for example, kg/kW.h), compute a suitable region of interest and run the ocr function on it to retrieve the needed numbers.
This works reasonably well, but the ocr function does not behave consistently. In some cases all the units are separated into individual words, while in others they are grouped together in a single word. The code below, which works on the attached data sample, shows the issue: the 16th element of txt1.Words reports the units '(kg/kW.h)(kW.h/)' rather than two Words (one for '(kg/kW.h)' and one for '(kW.h/)'), each with its own row in WordBoundingBoxes. I do not understand why in some cases the units are split into separate Words and in others they are merged into a single Word. Is it possible to control how the ocr function generates Words?
clear all
load('test.mat')
figure
imshow(I)
roi=[250.5 526 1300 142];
Iocr=insertShape(I,'rectangle',roi,'ShapeColor','blue');
hold on
imshow(Iocr)
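txt1=ocr(I,roi,CharacterSet=".()kWrpmlhgh/");%,LayoutAnalysis='word');
% Extract the text between parentheses from each recognized word
UnitString=regexp(txt1.Words,'(?<=\()[\w./]*(?=\))','match');
% Build the mask before filtering so it still indexes every word
hasUnit=not(cellfun(@isempty,UnitString));
UnitBox=txt1.WordBoundingBoxes(hasUnit,:);
UnitString=UnitString(hasUnit);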
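To make the problem concrete, here is a minimal sketch that assumes the merged string is exactly the one reported above for txt1.Words{16}. The variable name mergedWord is only for illustration. The unit strings can be separated after the fact, but this yields only character positions inside the Word, not a separate bounding box for each unit, which is what I need for the region of interest.

```matlab
% Hypothetical merged word, as reported for txt1.Words{16}
mergedWord = '(kg/kW.h)(kW.h/)';

% Split at the string level: each parenthesized group becomes one unit
[units, startIdx, endIdx] = regexp(mergedWord, '\([\w./]*\)', 'match', 'start', 'end');

disp(units)      % the separated unit strings
disp(startIdx)   % character position where each unit starts
disp(endIdx)     % character position where each unit ends
```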