ISSN: 2229-371X
Ashu Kumar*1, Simpel Rani Jindal2, Galaxy Singla3
Corresponding Author: Ashu Kumar, E-mail: ashu.software.engineer@gmail.com
Visit for more related articles at Journal of Global Research in Computer Sciences
Text line segmentation is an important step because inaccurately segmented text lines cause errors in the recognition stage. Text line segmentation of handwritten documents is still one of the most complicated problems in developing a reliable OCR. The nature of handwriting makes text line segmentation very challenging. Text characteristics can vary in font, size, orientation, alignment, color, contrast, and background information, and handwritten text can vary greatly depending on the writer's skill, disposition and cultural background. These variations make word detection complex and difficult. This paper presents a technique that combines piece-wise projection with contour tracing to segment a handwritten document into distinct lines of text. The proposed method is robust to line fluctuation.
Keywords
OCR, Line Segmentation, Histograms, chunks, Piece-wise separating lines, Potential PSLs.
INTRODUCTION
A lot of research work has been carried out on character recognition of Gurmukhi script. For an optical character recognition (OCR) system, segmentation is an important phase, and the accuracy of any OCR depends heavily on it. Incorrect segmentation leads to incorrect recognition. The segmentation phase includes line, word and character segmentation. Before word and character segmentation, line segmentation is performed to find the number of lines and the boundaries of each line in the input document image. Incorrect line segmentation may decrease recognition accuracy.
Survey papers are available on the segmentation of lines from handwritten text [1,2]. A considerable amount of work has been carried out to segment lines of handwritten Roman script, and there are varied and in some cases well-developed techniques [3-7]. But very little work has been carried out for Indic scripts like Devnagri, Bengali, Gurmukhi etc. Only a few papers are available on the segmentation of handwritten Indic scripts [8-11].
The simplest and most widely used method to segment lines is to use the inter-line gaps in the horizontal projection as line boundaries. This method does not work well on skewed, fluctuating or closely spaced lines. Here, we modify the method to segment text lines based on histogram projection. Figures 1, 2 and 3 show three kinds of sample documents on which the line segmentation is performed. The rest of the paper is organized as follows. Section 2 describes the problems associated with line segmentation. Section 3 describes the proposed method. Experiments and results are discussed in Section 4, which is followed by the conclusion in Section 5.
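As an illustration of the baseline described above, the following is a minimal sketch of gap-based segmentation on a global horizontal projection. It assumes the page is a binary image stored as a list of rows with 1 for black (ink) pixels; the function names and the toy page are hypothetical and are not taken from the paper.

```python
def horizontal_projection(image):
    """Count black pixels (value 1) in every row of a binary image."""
    return [sum(row) for row in image]


def segment_by_gaps(image):
    """Return (top, bottom) row ranges of text lines, bottom exclusive.

    A text line is a maximal run of rows whose projection is non-zero;
    the zero-valued rows between two runs form the inter-line gap.
    """
    lines = []
    start = None
    for y, count in enumerate(horizontal_projection(image)):
        if count > 0 and start is None:
            start = y                    # first inked row of a new line
        elif count == 0 and start is not None:
            lines.append((start, y))     # a blank row closes the line
            start = None
    if start is not None:                # line running to the page bottom
        lines.append((start, len(image)))
    return lines


# Toy page (assumed data): two well-separated straight lines.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
]
print(segment_by_gaps(page))
```

On a skewed page the left part of one line and the right part of the next can share rows, so no row of the global projection is zero and the two lines are merged into one range.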
SEGMENTATION CHALLENGES
When dealing with handwritten text, line segmentation has to overcome some obstacles that are uncommon in modern printed text. The most prominent are:
Skewed lines: Lines of text are in general not straight.
Fluctuating lines: |
Line proximity: Small gaps between neighbouring text lines cause components, usually words or letters, of adjacent lines to touch and overlap, and cause irregularity in the geometrical properties of the line, such as line width, height, distance between words and lines, leftmost position etc.
PROPOSED METHOD |
There exist several methods for text line segmentation, which are roughly categorized as follows: smearing methods, horizontal projections, Hough transform, repulsive attractive networks, stochastic methods and text line structure enhancing [1,2]. Although many methods have been proposed, the many challenges of handwritten text line segmentation mean that the problem still remains open.
Horizontal projections cannot deal well with skewed, curved and fluctuating lines. Horizontal projection of the whole text is suitable only for text with straight lines and large gaps between lines. For example, see Figs 1, 2 and 3.
To segment this type of text, we modify the method to segment text lines based on histogram projection; this technique is called piece-wise projection [12]. For intersecting components, we use contour tracing.
At first, we divide the text into vertical chunks of width W. The width of the last chunk may differ from W. Computation of W is discussed later. Next, we compute piece-wise separating lines (PSLs) from each of these chunks [12]. We compute the horizontal projection (HP) of each chunk. The projection profiles of the chunks of the image are shown in Fig 4.
A row where this HP is zero is a PSL. We may get several consecutive rows whose HP is zero; in that case the first of these rows is the PSL. The PSLs of the different chunks of a text are shown in Fig 5 by horizontal lines.
Not all of these PSLs are useful for line segmentation, so we choose potential PSLs as follows. We compute an estimate of the word height. If the distance between any two consecutive PSLs of a chunk is less than the word height, we remove the upper of the two. The PSLs that remain after this removal are the potential PSLs. The potential PSLs obtained from the PSLs of Fig 5 are shown in Fig 6.
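The chunking, PSL detection and potential-PSL selection can be sketched as follows. This is only an illustration under stated assumptions: the image is a list of rows with 1 for black pixels, the helper names are hypothetical, and the distance test compares each PSL with its successor in the original list, which is one possible reading of the removal rule.

```python
def chunk_bounds(width, chunk_width):
    """Split columns [0, width) into vertical chunks.

    The last chunk may be narrower than chunk_width.
    """
    return [(x, min(x + chunk_width, width))
            for x in range(0, width, chunk_width)]


def chunk_psls(image, x0, x1):
    """Return the PSL rows of the chunk covering columns x0..x1-1.

    A PSL is the first row of each run of consecutive rows whose
    horizontal projection inside the chunk is zero.
    """
    psls = []
    previous_blank = False
    for y, row in enumerate(image):
        blank = sum(row[x0:x1]) == 0
        if blank and not previous_blank:
            psls.append(y)
        previous_blank = blank
    return psls


def potential_psls(psls, word_height):
    """Drop the upper PSL of any consecutive pair closer than word_height."""
    kept = []
    for i, y in enumerate(psls):
        is_last = i == len(psls) - 1
        if is_last or psls[i + 1] - y >= word_height:
            kept.append(y)
    return kept


# Toy page (assumed data), 6 columns wide, split into chunks of width 3.
page = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
]
for x0, x1 in chunk_bounds(len(page[0]), 3):
    psls = chunk_psls(page, x0, x1)
    print((x0, x1), psls, potential_psls(psls, word_height=2))
```

The toy values (chunk width 3, word height 2) are chosen only to keep the example small; they are not the values used in the experiments.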
First, we store the y-coordinates of each potential PSL in an array for future use. By properly joining these potential PSLs, we get the individual text lines. It may be noted that sometimes, because a component of the upper line overlaps or touches a component of the lower line, we may not get PSLs in some regions. Also, because of some modifier characters of Gurmukhi (e.g. adhak, chandrabindu) we find some extra PSLs in a chunk. We take care of these cases during PSL joining, as explained next. To join a PSL of the ith chunk to a PSL of the (i+1)th chunk, starting with the first PSL of both chunks, we check whether the distance between them is less than 72% of word_height. The word_height is typically about 40-50, so the distance can be up to roughly three-quarters of the word height. The value of 72% was determined experimentally.
If such a PSL exists, we check whether the PSL in the (i+1)th chunk is on the upper or the lower side. Then we join the right coordinate of the PSL in the ith chunk (Ki) with the left coordinate of the PSL in the (i+1)th chunk, and the pointers to both PSLs are incremented by 1. If it does not exist and the PSL is on the lower side, we extend PSLi horizontally to the right until it reaches the right boundary of the (i+1)th chunk or intersects a black pixel of any component in the (i+1)th chunk. The value of the PSL is also inserted into the array of the (i+1)th chunk. Otherwise we just increment the pointer of the (i+1)th chunk by 1.
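The joining rule can be sketched as a two-pointer walk over the PSL arrays of two adjacent chunks. This is a simplified illustration rather than the authors' implementation: the function name and the `ratio` parameter are hypothetical, a `None` partner stands for "extend the PSL horizontally into the next chunk", and the insertion of the extended PSL into the array of the (i+1)th chunk, as well as the black-pixel test during extension, are left out.

```python
def join_psls(left, right, word_height, ratio=0.72):
    """Pair PSLs of chunk i (left) with PSLs of chunk i+1 (right).

    Both inputs are ascending lists of row indices. The result is a list
    of (left_y, right_y) pairs; right_y is None when no PSL of the right
    chunk lies within ratio * word_height of the left PSL.
    """
    limit = ratio * word_height
    pairs = []
    i = j = 0
    while i < len(left):
        if j < len(right) and abs(left[i] - right[j]) < limit:
            pairs.append((left[i], right[j]))  # close enough: join them
            i += 1
            j += 1
        elif j < len(right) and right[j] < left[i]:
            j += 1                             # PSL on the upper side: skip it
        else:
            pairs.append((left[i], None))      # lower side or none left: extend
            i += 1
    return pairs


# Assumed example values; word_height = 40 follows the estimate in the text.
print(join_psls([10, 60, 110], [12, 30, 58, 160], word_height=40))
```

Here the PSL at row 30 in the right chunk has no counterpart on the left and is skipped, which is how an extra PSL caused by a modifier character would be passed over in this sketch.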
If the extended part intersects a black pixel of a component of the (i+1)th chunk, we decide whether the component belongs to the upper line or the lower line. Based on this, we extend the line in such a way that the component falls in its actual line [12]. The line to which a component belongs is determined as follows.
We compute the distances from the intersecting point to the topmost and bottommost points of the component. Let d1 be the top distance and d2 the bottom distance; word_height is estimated to be 40 for A4-size paper with 18-20 written text lines. If d1 < d2 and d1 < (word_height/2), the component belongs to the lower line. If d2 ≤ d1 and d2 < (word_height/2), the component belongs to the upper line. If d1 > (word_height/2) and d2 > (word_height/2), we assume the component touches another component of the lower line [12]. If the component belongs to the upper line (lower line), the line is extended following the contour of the lower part (upper part) of the component so that the component is included in the upper line (lower line), as shown in Fig 8.
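The three rules above translate directly into a small decision function. The sketch below is illustrative: the function name and the returned labels are hypothetical, and the "undecided" result covers boundary cases (for example d1 equal to word_height/2) that the stated rules do not address.

```python
def component_side(d1, d2, word_height=40):
    """Classify a component hit by an extended PSL.

    d1: distance from the intersection point to the component's topmost point.
    d2: distance from the intersection point to its bottommost point.
    """
    half = word_height / 2
    if d1 < d2 and d1 < half:
        return "lower"       # little of the component lies above the line
    if d2 <= d1 and d2 < half:
        return "upper"       # little of the component lies below the line
    if d1 > half and d2 > half:
        return "touching"    # assumed to touch a component of the lower line
    return "undecided"       # case not covered by the stated rules


for d1, d2 in [(5, 30), (30, 5), (25, 30), (20, 20)]:
    print(d1, d2, component_side(d1, d2))
```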
To follow the contour, we test 8-connectivity (the 8 neighbouring points). To test the connectivity, we number the current pixel 0 and its neighbouring pixels 1 to 8, depending on the type of contour, whether it is an upper-line or a lower-line contour. For an upper-line (lower-line) contour, we number the pixels in the clockwise (anti-clockwise) direction, as shown in Fig 9.
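The two scan orders can be written as lists of neighbour offsets. The sketch below only shows how the 8 neighbours of a pixel are visited in clockwise or anti-clockwise order; it is not a complete contour tracer. The starting direction (west) is an assumption, since the numbering of Fig 9 is not reproduced here, and the names are hypothetical.

```python
# (dx, dy) offsets in image coordinates, where y grows downward.
# Assumed start: the west neighbour. On screen this order runs clockwise:
# W, NW, N, NE, E, SE, S, SW.
CLOCKWISE = [(-1, 0), (-1, -1), (0, -1), (1, -1),
             (1, 0), (1, 1), (0, 1), (-1, 1)]
# Same start, opposite direction: W, SW, S, SE, E, NE, N, NW.
ANTICLOCKWISE = [CLOCKWISE[0]] + CLOCKWISE[:0:-1]


def black_neighbours(image, x, y, order):
    """Return the black 8-neighbours of (x, y) in the given scan order."""
    height, width = len(image), len(image[0])
    found = []
    for dx, dy in order:
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height and image[ny][nx] == 1:
            found.append((nx, ny))
    return found


# Toy 3x3 patch (assumed data) centred on pixel (1, 1).
patch = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
print(black_neighbours(patch, 1, 1, CLOCKWISE))
print(black_neighbours(patch, 1, 1, ANTICLOCKWISE))
```

A tracer built on this would repeatedly move to the first black neighbour found in the chosen order, which is why the direction of numbering decides whether the upper or the lower part of the component is followed.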
We have estimated the chunk width W as 70. With W = 50, the chunks become very small and the chance of a separating line intersecting a component increases. With W = 100, there are fewer chunks, which can prevent some lines from being segmented in documents whose text lines are very close to each other.
EXPERIMENTS AND RESULTS |
The experiments were performed on various handwritten text images in Gurmukhi script. Images with high skew, small line gaps, large gaps between words etc. were considered. For the experiments, we considered only single-column document pages. By viewing the results on the computer's display, we calculated line segmentation accuracy manually by checking for correctly segmented components. We have shown a segmented image with lines very close to each other; the words of its lines overlap and touch heavily. We used contour tracing to segment the intersecting components accurately, as shown in Fig 8.
The results of line segmentation can be seen in Table 1. This method can be applied to other Indian scripts too. Its limitation is that it is size dependent. In future we plan to use text of different sizes. Figures of some documents are shown below:
References |
|