Design


The 2 Application Domains

An OMR has to deal with two major application domains. One is the music sheet, a picture made of more and more refined graphical items. The other is an abstract layout that represents the score with all its musical entities.

To enforce this separation, we deliberately chose two separate sets of words:


The 13 Steps

OptiMusic processing is organized in a series of 13 mandatory steps that are run one after the other.

Note: from the user interface, it is always possible to go back to a previous step (although not all steps can serve as backward targets). See the 'Step' menu in the user interface.

The steps are detailed hereafter, together with information on how they work. When relevant, hints on possible improvements are indicated, using an italic font.

Bitmap loading

A bitmap file is read and loaded as the current picture. The file is expected to be in the 'standard' Windows format (.BMP). The bitmap is limited to black & white colors.

Here is a part of an example bitmap (fandango.bmp) right after loading:

Frame detection

The purpose of this step is to detect and logically erase the staff lines. They are 'logically erased' in the sense that the line pixels are colorized with a special color (the 'line' color, generally defined as light gray) rather than the plain background color. This way, the line information remains available and can still be used afterwards.

The following picture represents the same music example as above, once the frames have been detected and logically removed:

The processing requires 3 phases:

  1. The mean line thickness and the mean interline value are retrieved. A simple histogram of vertical white and black runs (i.e. vertical sequences of white, resp. black, pixels) does the job with good reliability and accuracy. The mean interline value is then used to normalize all picture variables, since, for example, the height of a quarter-note head is roughly the interline value.
  2. The frames are roughly retrieved, using horizontal projections on the first quarter and the last quarter of the bitmap width. Peaks in the histograms indicate candidate lines, which are further checked using the variance of their interline values: only regular series of 5 lines are considered real frames. The left and right positions give a rather good approximation of the bitmap skew.
  3. Each individual line is scanned from left to right and from right to left, so that pixels of the plain line are colorized using the 'line' color, whereas traversing objects are left as they are. We say that we encounter a traversing object if the pixel just above (resp. just below) the line is black, as well as at least one of the 3 pixels of the row above (resp. below). Line extrema are also recorded, to provide a more accurate measurement of the frame equation and of its left and right horizontal limits.
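The run-length histogram of phase 1 can be sketched as follows. This is a minimal Python illustration (not the actual OptiMusic code), assuming the bitmap is given as rows of 0 (white) / 1 (black) pixels; the most frequent black run length gives the line thickness, and the most frequent white run length gives the gap between lines, their sum being the interline value:

```python
from collections import Counter

def vertical_runs(column):
    """Return (pixel_value, run_length) pairs for a column of 0/1 pixels."""
    runs, prev, length = [], column[0], 1
    for p in column[1:]:
        if p == prev:
            length += 1
        else:
            runs.append((prev, length))
            prev, length = p, 1
    runs.append((prev, length))
    return runs

def staff_metrics(image):
    """image: list of rows of 0 (white) / 1 (black) pixels.
    Returns (line_thickness, interline) from run-length histograms."""
    black, white = Counter(), Counter()
    height, width = len(image), len(image[0])
    for x in range(width):
        col = [image[y][x] for y in range(height)]
        for value, length in vertical_runs(col):
            (black if value else white)[length] += 1
    thickness = black.most_common(1)[0][0]   # modal black run
    gap = white.most_common(1)[0][0]         # modal white run
    return thickness, thickness + gap        # interline = line + gap
```

On a clean staff, the two histogram peaks are sharp, which is what makes this phase so reliable.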
Remarks:

Phase 1 is remarkably fast and reliable, and provides good input for subsequent line removal (phase 3).

Phase 2 works well provided that the skew is limited to a couple of degrees. Otherwise, the horizontal projections don't exhibit sharp peaks. We should investigate a different approach, based on horizontal LAGs (Line Adjacency Graphs), so that chunks of frame lines are correctly detected and then assembled even in case of significant skew. For the time being, the best workaround is to be careful when scanning the music sheet!

Phase 3 suffers from the limited scope of the test used on traversing objects: we don't look further than 2 pixels above or below the line. We should use the horizontal LAG approach to get a better idea of significant objects located above, below or across the line, rather than relying on local line anomalies: the weight of LAG sections would indicate real objects.

Glyph segmentation

The purpose is to detect connected regions, now that the lines have been logically erased. We use 4-connectivity, that is, two pixels are considered connected only if they are horizontal or vertical neighbors. Pixels that touch each other on a diagonal are not considered connected. A flood-fill algorithm is run on each non-background / non-line pixel. The result of each flood filling is a 'glyph', and its main geometrical characteristics are retrieved on the fly.
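A minimal sketch of this 4-connected flood filling in Python (hypothetical pixel code: 1 for remaining ink, anything else being background or 'line' pixels), gathering each glyph's bounding box on the fly:

```python
from collections import deque

INK = 1   # hypothetical pixel code; background and 'line' pixels differ

def segment_glyphs(image):
    """Label 4-connected ink regions ('glyphs'); returns the label map
    and each glyph's bounding box, computed during the flood fill."""
    h, w = len(image), len(image[0])
    labels = [[0] * w for _ in range(h)]
    boxes = {}                      # label -> [min_y, min_x, max_y, max_x]
    label = 0
    for sy in range(h):
        for sx in range(w):
            if image[sy][sx] != INK or labels[sy][sx]:
                continue
            label += 1
            labels[sy][sx] = label
            boxes[label] = box = [sy, sx, sy, sx]
            queue = deque([(sy, sx)])
            while queue:
                y, x = queue.popleft()
                box[0] = min(box[0], y); box[1] = min(box[1], x)
                box[2] = max(box[2], y); box[3] = max(box[3], x)
                # 4-connectivity: horizontal/vertical neighbors only
                for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                    if 0 <= ny < h and 0 <= nx < w \
                            and image[ny][nx] == INK and not labels[ny][nx]:
                        labels[ny][nx] = label
                        queue.append((ny, nx))
    return labels, boxes
```

Note that two pixels touching only diagonally end up in two distinct glyphs, as stated above.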

The process is run frame by frame. Typically, the frames are separated vertically at the middle of the inter-frame space. A glyph is assigned to the frame that is closest to its mass center. Each frame is thus assigned a set of glyphs.

The result is not directly visible in the bitmap display. However, when you move the mouse over a glyph, its item ID, as well as its shape (currently unknown), appears in the Picture panel. When the snapshot was taken, the mouse was over the flat sign at the upper left corner of the example: the ID is 9g4, where '9' stands for frame number 9 from the beginning of the page, 'g' for glyph, and '4' for the fourth glyph in the frame (1 is the brace, 2 the bar line, 3 the treble clef). This glyph had been flooded with pixel color 'd'.

Remarks: This phase may be the weakest of all. This is due to the fact that connectivity is highly dependent on bitmap quality: objects may get separated because only one pixel is missing, and, on the contrary, other objects may get incorrectly merged because they are too close to each other. The consequences are significant, even though we try in later phases to remedy these incorrect connections (or non-connections). This is surely a design flaw: we should investigate an approach based on object proximity, rather than basing the whole subsequent processing on the lack or presence of just a few pixels.

Symbol evaluation

Each glyph is evaluated in an attempt to recognize symbol shapes, such as bar lines, treble clefs, sharp signs, dots, etc. Once a symbol has been recognized, it is no longer considered in the following recognition phases. We therefore require a rather high confidence value from the recognition test.

Here is the result of symbol evaluation, still on the same music example. We have recognized the brace, the treble clef, the dots near the bass clef, all the bar lines but one (which is touched by a slur), the flat and sharp signs, and the first slur:

The recognition program for symbols is the same one used for leaf evaluation later on. It works in several stages:

  1. A signature of the item is computed. It can be considered as a vector in an n-dimensional space, where 'n' is the number of signature variables. These variables are normalized according to the mean picture interline value. The variables are weight, height, width, height/width ratio, area, density, mean thickness, eccentricity in X, eccentricity in Y, line position, and leaf attachment (see below for leaf attachment).
  2. For each shape, the corresponding filter is applied to the signature to provide a go/no-go answer: either the signature passes through the shape filter or the shape is rejected. For a given dimension, and for a given shape, a filter may consist of up to 3 tests (less or equal, equal, greater or equal).
  3. According to the shape at hand, additional specific tests may have to be run. They are implemented as plug-ins, and called only if the 'dimension filter' has been successfully passed.
  4. Each of the filtered shapes is further tested using a simple pattern matching. We use a 3 x 3 grid, where each cell is flagged with the percentage of black pixels it contains. Then the quadratic distance is computed between the item grid and the shape grid. The best of all distances is kept, and the shape assigned, provided that the result is better than a given confidence threshold.
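The grid matching of stage 4 can be sketched in Python. This is a simplified illustration with hypothetical helper names, not the actual implementation; the item is a 0/1 pixel matrix, and shape grids would come from training:

```python
def grid_signature(item, n=3):
    """Percentage of black pixels in each cell of an n x n grid laid
    over the item's bounding box (item: list of rows of 0/1 pixels)."""
    h, w = len(item), len(item[0])
    cells = [[0.0] * n for _ in range(n)]
    counts = [[0] * n for _ in range(n)]
    for y in range(h):
        for x in range(w):
            gy, gx = y * n // h, x * n // w   # cell containing this pixel
            counts[gy][gx] += 1
            cells[gy][gx] += item[y][x]
    return [[100.0 * cells[gy][gx] / counts[gy][gx]
             for gx in range(n)] for gy in range(n)]

def best_shape(item_grid, shape_grids, threshold):
    """Quadratic (squared-Euclidean) distance between grids; the best
    shape wins, provided its distance stays under the threshold."""
    def dist(a, b):
        return sum((a[i][j] - b[i][j]) ** 2
                   for i in range(len(a)) for j in range(len(a[0])))
    name, d = min(((s, dist(item_grid, g))
                   for s, g in shape_grids.items()), key=lambda t: t[1])
    return name if d <= threshold else 'Unknown'
```

The training feature described below would simply average a newly assigned item's grid into the stored shape grid.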
Remarks:

This process is rather fast and reliable. A 3 x 3 grid may seem very coarse, but remember that notations differ a lot from one sheet to another; for example, the mere orientation of a treble clef may differ by some 10 degrees. Much finer tests were attempted for some time, but they led to no convincing result.

The way these tests are implemented allows for two actions. One is the ability to manually refine a shape filter, since the tests are defined in an ASCII file. The other is the training feature: when an item is manually assigned a shape, the related filter and grids are updated. The filter is expanded so that it 'contains' the item signature with an additional margin. The grid is modified so that it becomes the arithmetic mean of all equivalent grids. Doing so, OptiMusic can gradually improve its recognition rate.

Another important remark is that we definitely lack OCR processing here: Optical Character Recognition techniques should be applied beforehand, in order to recognize and remove the words that appear here and there in any music sheet. For the time being, there is a risk that a character, or a set of characters, is recognized as a musical symbol. The only workaround so far is to manually de-assign them, i.e. assign them to the 'Unknown' shape.

Glyph chopping

Now that the isolated symbols (or a large part of them) have been recognized, we can assume that the remaining glyphs are in fact built from sub-symbols, typically stems, beams, note heads, etc. The purpose of this step is thus to 'chop' the non-recognized glyphs into stems and attached leaves.

Here is the result of glyph chopping on the same example. The stems are displayed in dark blue, and the beams in lighter blue:

Stems are detected first, using the vertical LAG of every glyph. Vertical sections of a minimum height and a maximum width are considered as candidate stem chunks. Close chunks are aggregated into stem candidates, which are further checked (we need a minimum ratio of background pixels on the sides of the stem).

Once we have a set of stems in a given glyph, we look for attached leaves. This is done using a flood-fill algorithm. The leaves around a stem are flooded using a color which is unique to the current stem: doing so, we can easily detect beams, since they are attached to two stems and thus already flooded with a color which is not the glyph color. Actually, an additional test is needed, since a shared voice (a note head attached to both a stem up and a stem down) is also attached to two stems: we use the stem-relative position of the leaf to disambiguate such cases, since true beams have parallel stems rather than opposed stems.
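The stem detection described above can be approximated with a short Python sketch. The LAG machinery is reduced here to tall vertical black runs merged column by column; in practice, min_height and max_width would be derived from the interline value:

```python
def stem_candidates(image, min_height, max_width):
    """Very simplified stem detection on a 0/1 image: tall vertical
    black runs are collected per column, horizontally adjacent runs
    that overlap in Y are merged, and merged candidates wider than
    max_width are dropped."""
    h, w = len(image), len(image[0])
    tall_runs = []                          # (x, y_top, y_bottom)
    for x in range(w):
        y = 0
        while y < h:
            if image[y][x] == 1:
                top = y
                while y < h and image[y][x] == 1:
                    y += 1
                if y - top >= min_height:
                    tall_runs.append((x, top, y - 1))
            else:
                y += 1
    stems = []                  # each: [x_left, x_right, y_top, y_bottom]
    for x, top, bot in tall_runs:
        for s in stems:
            # merge into an adjacent candidate whose Y range overlaps
            if s[1] == x - 1 and top <= s[3] and bot >= s[2]:
                s[1] = x
                s[2], s[3] = min(s[2], top), max(s[3], bot)
                break
        else:
            stems.append([x, x, top, bot])
    return [s for s in stems if s[1] - s[0] + 1 <= max_width]
```

The real processing adds the check on background pixels along the sides of each candidate.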

Remarks:

The stem recognition is rather efficient and reliable; no improvement is needed here.

Ledger removal

The purpose, here again, is to remove item chunks that are not part of the item itself and would thus impede its shape recognition. Here, we deal with ledger chunks, portions of additional staff lines that are merged with note heads outside the 5-line staff: this is the typical case of a 'C5', which is located on an imaginary line just below the lowest staff line.

To illustrate the ledger removal, I've taken another portion of the example, which exhibits such ledger chunks.

Before:

After:

The processing is based on vertical LAGs. LAG sections are scanned for sub-sections which exhibit stability in height, around the typical ledger height, and stability in Y. Additional tests are run on the location of the chunk within the glyph itself.
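The stability test can be sketched as follows in Python. The glyph is heavily reduced here to one vertical black run per column, given as a (top, height) pair or None, and the minimum span length of 3 columns is an arbitrary assumption:

```python
def ledger_spans(runs, thickness, tol=1):
    """runs: per-column (top, height) of the glyph's black run, or None.
    Returns [x_start, x_end] spans whose height stays near the expected
    ledger thickness and whose Y position is stable column to column."""
    spans, start, prev_top = [], None, None
    for x, run in enumerate(runs + [None]):       # sentinel flushes the end
        ok = (run is not None
              and abs(run[1] - thickness) <= tol
              and (prev_top is None or abs(run[0] - prev_top) <= tol))
        if ok:
            if start is None:
                start = x
            prev_top = run[0]
        else:
            if start is not None and x - start >= 3:   # minimum length
                spans.append([start, x - 1])
            start, prev_top = None, None
    return spans
```

Both a wrong height and a drifting Y break the span, which mirrors the fragility noted in the remarks below.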

Recognized ledgers are finally colorized using the 'ledger' color (generally light gray), so that the ledger parts of any given item are not taken into account in further recognition tests.

We also use the relative position of ledgers to better compute the line position of a note head: far from the staff lines, the Y value of a leaf alone is not accurate enough, so its position with respect to ledger chunks provides the parity of the line position, and the final computation is thus less error-prone.

Remarks:

Ledger chunks are very small items, so their height or Y-stability is quickly impacted by the presence (or absence) of pixels. Ledger recognition could be improved by more complex tests on the chunk position within the glyph.

Leaf Evaluation

This is very similar to symbol evaluation, but performed on each and every leaf. Here is the result of the step:

Notice that all note heads have been recognized, and also the bass clef (see below).

Actually, several phases are used:

  1. A simple recognition is attempted on every non-recognized item (glyph or leaf), using a rather high confidence level.
  2. A second pass consists of aggregation attempts, where a (non-recognized) item is considered together with its (non-recognized) neighbors as a compound, which is then evaluated and kept if satisfactory. The purpose of this phase is to circumvent the anomalies generated by the staff line removal, where for example a bass clef is usually broken into 2, if not 3, items.
  3. Finally, a last recognition pass is attempted, using a lower confidence level.
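The three phases can be sketched schematically in Python. Here `recognize` and `neighbors` are hypothetical stand-ins for the evaluator and the proximity search, the threshold values are made up, and items are modeled as strings so that 'merging' two items is plain concatenation:

```python
def evaluate_items(items, recognize, neighbors, high=0.8, low=0.6):
    """Three-pass evaluation: strict pass, aggregation pass, relaxed pass.
    recognize(item) -> (shape, confidence);
    neighbors(item) -> nearby candidate items."""
    shapes = {}

    def attempt(threshold):
        for item in items:
            if item in shapes:
                continue
            shape, conf = recognize(item)
            if conf >= threshold:
                shapes[item] = shape

    attempt(high)                           # phase 1: strict recognition
    for item in list(items):                # phase 2: aggregation attempts
        if item in shapes:
            continue
        for other in neighbors(item):
            if other in shapes:
                continue
            shape, conf = recognize(item + other)   # merged compound
            if conf >= high:
                shapes[item] = shapes[other] = shape
                break
    attempt(low)                            # phase 3: relaxed recognition
    return shapes
```

The aggregation pass is what lets a bass clef, broken in two by line removal, be recognized as a whole.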
Remarks: None.

Leaf Separation into flag + head

When a flag is long enough to touch the related note head, the flag and the head are detected as a single leaf, whereas they are two different items. The purpose of this step is thus to detect such cases and force a separation.

Here is the result of such flag processing:

We use the Y-eccentricity between the volume center and the mass center, since the mass center is mostly impacted by the note head. Thus we know in which direction to find the note head. We consider horizontal runs, starting at the volume center and going in the direction indicated by the mass center: when we encounter a significant increase in the width of the horizontal run, we consider that we have reached the point where the note head and the flag are attached, and we split the item in two at this location. The flag item and the head item are then evaluated separately.
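This widening test can be sketched in Python. The item is reduced here to the width of its horizontal black run per row, and the widening factor is an arbitrary assumption:

```python
def split_row(widths, volume_center, mass_center, jump=2.0):
    """widths[y]: width of the horizontal black run at row y.
    Walk from the volume center toward the mass center; the first row
    whose run widens by more than `jump` times the previous row's width
    is where the head meets the flag (hypothetical factor)."""
    step = 1 if mass_center > volume_center else -1
    y = volume_center
    while 0 <= y + step < len(widths):
        if widths[y + step] > jump * widths[y]:
            return y + step        # first row of the wider part
        y += step
    return None                    # no clear widening found
```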

A closer look at the first flag/head separation of the same example shows where the cut was made:

Remarks: None.

Glyph Separation into bar + stuff

A vertical bar line may be merged with other items, such as a slur going from one measure to the next one. The purpose of this step is to separate the bar line from the other items.

Here is the result of such processing on the third bar line of this system: the bar line has been correctly recognized, and the remaining leaves gathered into what is recognized as a slur.

If a glyph exhibits a stem whose shape could be a bar line, then this former stem is now considered as a self-standing item (a compound, actually) and its former leaves as other self-standing items. Each former leaf is compared with the leaves on the other side of the former stem, to check for continuation: if their Y values are close enough, these 2 former leaves are merged into a single compound. The new self-standing items are then evaluated.
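The continuation check can be sketched like this in Python. Leaves are reduced here to an identifier and a mean Y, and max_dy is a closeness tolerance that would in practice depend on the interline:

```python
def merge_continuations(left_leaves, right_leaves, max_dy):
    """left_leaves / right_leaves: lists of (leaf_id, mean_y) on each
    side of the former stem.  Leaves whose Y values are close enough
    are paired into compounds; the rest stay self-standing."""
    compounds, used = [], set()
    for lid, ly in left_leaves:
        best = None
        for rid, ry in right_leaves:
            if rid in used:
                continue
            dy = abs(ly - ry)
            if dy <= max_dy and (best is None or dy < best[1]):
                best = (rid, dy)               # closest unused match
        if best:
            used.add(best[0])
            compounds.append((lid, best[0]))
    singles = [lid for lid, _ in left_leaves
               if all(lid != a for a, _ in compounds)]
    singles += [rid for rid, _ in right_leaves if rid not in used]
    return compounds, singles
```

A slur crossing a bar line, for instance, yields one compound (its two halves) plus any unmatched leaves.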

Remarks: None.

Bar handling

This is the first step dealing with score generation; the picture is no longer updated, since all the individual shapes must have been recognized in the previous steps.

Frame by frame, we scan the frame glyphs and compounds to come up with an X-ordered list of bars. They define the related score measures.

Note that a frame may not contain an ending bar line; in that case we 'invent' the missing bar line.

Note also that the processing is done only on the first stave of each system, and that system height in terms of staves is determined on the fly according to the bar bottom and the frame Y locations.

Remarks: The system determination assumes that at least one bar line is continuous between the first and the last stave. If not, then we miss it. A lack of continuity in the bar line may originate from musical symbols located there, or simply from missing pixels. In the latter case, we could try to merge the two bar line chunks.

Translation to score

Self-standing symbols are translated into musical score entities. This step is thus meant for clefs, time signatures, rests, whole notes and groups. A 'group' is a combination of stems, note heads and beams.

The only difficulty stems from leaves merged into compounds: if such a 'merged' leaf is encountered, then its related compound item is used from within the group processing, instead of the leaf itself.

Remarks: None.

Annotation

Once the self-standing score entities are in place, we try to process the annotations : alterations (sharps, naturals, flats), tenuto dashes, dots, mordents and slurs.

Dots can be: parts of a repeat bar line, parts of a bass clef, a duration modification, or a staccato indication. The location context helps disambiguate them.

Similarly, alteration signs can be local alterations linked to a given note head and limited to the current measure, or parts of a key signature. Location contexts, plus tests on possible key signatures, are used to differentiate between these two cases.

Remarks:

Only standard key signatures are recognized (those made of only sharp signs or only flat signs).

The slur processing is so far limited to the information recorded in the score (anchor points and radius), its related note linking is not actually handled.

Check/fix the score

This is the final step, before being able to play the music.

First, time signatures are checked so that partially recognized ones (no numerator or no denominator) are discarded if other fully recognized ones are found in the sheet.

Then, measure durations are checked against the current time signature, voice by voice if applicable. This helps indicate whether an item has been improperly recognized. Tuplets are processed on the fly, in an attempt to fix the measure duration.
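The duration check itself is simple arithmetic and can be sketched in Python, with durations expressed as exact fractions of a whole note (the tuplet fix-up is left out of this sketch):

```python
from fractions import Fraction

def check_measures(measures, numerator, denominator):
    """measures: list of measures, each a list of note durations given
    as Fractions of a whole note.  Returns the indices of measures
    whose total does not match the time signature; such measures flag
    a likely recognition error."""
    expected = Fraction(numerator, denominator)
    return [i for i, notes in enumerate(measures)
            if sum(notes, Fraction(0)) != expected]
```

Using exact fractions avoids the rounding issues a floating-point sum would introduce with dotted notes and tuplets.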

Finally, beams are further harmonized within a group, since not all stems are attached to the whole set of beams in a group: generally the first and last stems are attached to all beams, but a middle stem may extend only to the first beam encountered. This is done using the Y-location of the stem edge.

Remarks: None.

File updated on Friday 29-Dec-2000 16:56