revcondevalnew.nb

Reverse Conducting Evaluation
Craig Stuart Sapp <craig@ccrma.stanford.edu>
12 October 2005 -- 31 October 2005

This Mathematica notebook compares the individual reverse conducting performances with carefully corrected versions of the average taps for two performances of Chopin's Mazurka in F minor, Op. 7, No. 3: (1) Ignaz Friedman in 1930, available on CD from Philips in Great Pianists of the Twentieth Century, vol. 30, and (2) Charles Rosen in 1989, available on CD from Globe Records.

The Friedman recording represents historical recordings and the Rosen recording represents modern ones. The Friedman recording was reverse conducted first and manually corrected first, so there could be learning effects when comparing the two recordings. Note also that the Friedman performance is faster and somewhat more difficult to follow beat by beat.

The manually corrected values of the two recordings will be used to evaluate automatically identified beat timings for historic and modern recordings. For example, early recordings contain noise which may confuse automatic identification algorithms. More recent performances are also, in general, less idiosyncratic and truer to the score, which may also affect the accuracy of automatic beat identification.

Modern Recording Sample
Mazurka in F Minor, Op. 7, No. 3
Rosen 1989

Loading the raw data

First load the absolute beat positions for the average beat positions and the corrected beat positions:

Now measure the time difference between each average and corrected beat. A negative value means that the average beat occurs before the corrected beat time.

The average correction to each beat from the average tap times of all trials was 48 milliseconds. This means that, on the average, each average tap beat position must be moved 48 milliseconds to align with the correct beat time.
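As a minimal sketch of this measurement (written in Python rather than the notebook's Mathematica, with invented beat times and hypothetical function names), the signed differences and the average correction could be computed like this:

```python
def signed_differences(average_beats, corrected_beats):
    """Average minus corrected time for each beat, in milliseconds.

    A negative value means the average tap falls before the corrected beat.
    Input times are assumed to be absolute positions in seconds.
    """
    if len(average_beats) != len(corrected_beats):
        raise ValueError("beat lists must have the same length")
    return [(a - c) * 1000.0 for a, c in zip(average_beats, corrected_beats)]


def mean_absolute_correction(differences_ms):
    """Mean distance each beat must be moved, ignoring direction."""
    return sum(abs(d) for d in differences_ms) / len(differences_ms)


# Invented example data (seconds), not values from either recording.
average = [1.020, 1.580, 2.150, 2.690]
corrected = [1.000, 1.600, 2.100, 2.700]
diffs = signed_differences(average, corrected)
correction = mean_absolute_correction(diffs)
```

The average correction uses absolute values, so it measures how far beats must be moved regardless of direction. The plain mean of the signed differences measures something else: whether the taps are early or late overall.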

Manual correction of average reverse conducting beat positions

This section displays the difference between the average reverse conducted beat positions and the values derived from manually correcting these values by ear using a soundfile editor.

This corrected beat duration data was measured by listening to the audio file and manually locating the beat positions in the soundfile. The initial positions of the average reverse conducting performance were used as a baseline, and this baseline was then adjusted so that it sounded as if all beats were occurring with the attacks of notes in the performance. There are probably a few errors in the corrected data due to the tediousness of collecting it, and there are a few locations in the audio where the beat positions become vague. Overall, however, this data represents a highly accurate description of the pianist's actual beat-by-beat performance tempo, probably accurate to within 10 milliseconds for all beats.

Here are the timing differences between the average reverse-conducting absolute times and the absolute times measured in a sound editor:

[Graphics:Images/index_gr_14.gif]

The average beats are not quite centered on the corrected beats. The average beats are, on the average, 4.6 milliseconds after the corrected beats.

The histogram below shows the data from the plot above. The corrections are, in general, nicely distributed. The peak at 0 comes from beats whose corrections were too small to make: their theoretically correct positions lie up to about 10 milliseconds before their measured positions.

<< Graphics`Graphics`

[Graphics:Images/index_gr_18.gif]

1.5% of the average reverse conducting beats were close enough to the beat positions found by ear that they did not require alteration to align with the audio beat events. The distribution should be smooth (about half as many beats are expected to fall exactly on the correct beat), so there are some beats marked as 0 ms from the true location which are actually occurring slightly after the actual beat position.

Trial Quality Measurements

Define a logarithmic score for the amount of correction needed for a given beat. A score of 1 means a 20 ms correction, 2 means 40 ms, 3 means 80 ms, 4 means 160 ms, 5 means 320 ms, etc. A score of 0 means that less than 1 millisecond of correction was necessary, and a score of 0.5 means that 10 milliseconds were needed to correct the beat location.

Qualitatively, a score of 0-1 is an "A", 1-2 is a "B", 2-3 is a "C", 3-4 is a "D", 4-5 is an "E", and 5-6 is an "F". A score of 6 means the correction was 640 milliseconds, or about 2/3 of a second.
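The exact scoring formula is not shown in the text. One function that reproduces the quoted values is sketched below in Python, assuming a base-2 logarithm above 20 ms and a linear ramp below it (the ramp is a guess made so that 10 ms gives 0.5 and very small corrections give about 0). The function names are hypothetical.

```python
import math


def correction_score(correction_ms):
    """Score a beat correction: 20 ms -> 1, 40 ms -> 2, 80 ms -> 3, ...

    Assumption: below 20 ms the score rises linearly from 0 to 1,
    which yields 0.5 for a 10 ms correction.
    """
    c = abs(correction_ms)
    if c < 20.0:
        return c / 20.0
    return math.log2(c / 10.0)


def letter_grade(score):
    """Map a score to the letters A-F; scores of 5 and above are all 'F'."""
    letters = "ABCDEF"
    return letters[min(int(score), len(letters) - 1)]
```

For example, `letter_grade(correction_score(-30))` gives "B", since a 30 ms correction in either direction scores between 1 and 2.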

Mostly the corrections for the average of the tapping trials are evenly distributed between A, B, C, and D quality, although "A" corrections were the most common. The worst case correction was between a value of 4 and 5. This occurs on the first beat, where no previous event prepares the entrance, so it is most likely due to the reaction time of the reverse conductor. Other corrections in the range between 4 and 5 are also likely due to an unprepared change in tempo which was not expected beforehand.

<< Graphics`Graphics`

[Graphics:Images/index_gr_21.gif]

The range from 0-1 can be characterized as "no audible difference" (0-20 ms); 1-2 as "slight audible difference" (20-40 ms); 2-3 as "minor audible difference" (40-80 ms); 3-4 as "noticeable audible difference" (80-160 ms); 4-5 as "very noticeable audible difference" (160-320 ms); and anything higher as "extremely noticeable audible difference" (> 320 ms).
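The millisecond ranges above are enough to bin corrections without computing the score first. The following Python sketch does that with hypothetical names and invented data. How a correction lying exactly on a boundary is classified is an assumption, since the text does not say.

```python
from bisect import bisect_right
from collections import Counter

# Upper edges of the A-E ranges in milliseconds; anything above is "F".
BOUNDS_MS = [20, 40, 80, 160, 320]
LABELS = ["A", "B", "C", "D", "E", "F"]


def category_percentages(corrections_ms):
    """Percentage of beats falling in each category.

    Assumption: a correction exactly on a boundary (e.g. 20 ms)
    is counted in the higher category.
    """
    counts = Counter(
        LABELS[bisect_right(BOUNDS_MS, abs(c))] for c in corrections_ms
    )
    total = len(corrections_ms)
    return {label: 100.0 * counts[label] / total for label in LABELS}


# Invented corrections in milliseconds, for illustration only.
example = category_percentages([5, -15, 30, 100])
```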

30% of the average reverse conducting beats were at the level of no audible difference, with about 20% for each of the next three categories of "slight audible difference", "minor audible difference", and "noticeable audible difference". The overall score for the average reverse conducting of the performance is a "B-":

Here is a histogram display of the necessary corrections with positive and negative adjustments separated. Negative values mean that the correct value occurs before the average reverse conducting time. Notice that most of the larger corrections occur in the negative range.

[Graphics:Images/index_gr_25.gif]

Examine the learning curve

How does the quality of the reverse conducting improve over time? Does the conductor learn to follow the performance better after repeated listening? First, load the individual tapping trials which contain precalculated offset values into the audio file of the performance recording:

Now calculate the timing differences between the individual trials and the corrected beat times.

The first tapping trial gets a score of "C+", the worst trial gets a "C", and the best trial gets a "B-". The last trial is also the best trial.

Here are quality score plots for each trial. Starting at trial 7, the distribution improves, so that the lower categories contain a higher percentage of beats on the average.

[Graphics:Images/index_gr_31.gif]

[Graphics:Images/index_gr_32.gif]

[Graphics:Images/index_gr_33.gif]

[Graphics:Images/index_gr_34.gif]

[Graphics:Images/index_gr_35.gif]

[Graphics:Images/index_gr_36.gif]

[Graphics:Images/index_gr_37.gif]

[Graphics:Images/index_gr_38.gif]

[Graphics:Images/index_gr_39.gif]

[Graphics:Images/index_gr_40.gif]

[Graphics:Images/index_gr_41.gif]

[Graphics:Images/index_gr_42.gif]

[Graphics:Images/index_gr_43.gif]

[Graphics:Images/index_gr_44.gif]

[Graphics:Images/index_gr_45.gif]

[Graphics:Images/index_gr_46.gif]

[Graphics:Images/index_gr_47.gif]

[Graphics:Images/index_gr_48.gif]

[Graphics:Images/index_gr_49.gif]

[Graphics:Images/index_gr_50.gif]

How does the accuracy of the first half of the trials compare to the second half of the trials?

[Graphics:Images/index_gr_53.gif]

[Graphics:Images/index_gr_55.gif]

So averaging only the last 10 trials of the set of 20 gives an improvement in the average correction distance to the "true" beat. About 7% of the beats move to the "A" category, the "B" category stays about the same, the "C" category declines by about 3% of the total beats, and the "D" category declines by about 4% of the total beats.
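The half-by-half comparison can be sketched as follows. This is Python with hypothetical names and invented tap times. It assumes each trial is a list of absolute beat times in seconds, already aligned to the audio.

```python
def average_beats(trials):
    """Average the tap time of each beat across the given trials."""
    return [sum(taps) / len(taps) for taps in zip(*trials)]


def mean_abs_error_ms(beats, corrected):
    """Mean absolute distance to the corrected beats, in milliseconds."""
    total = sum(abs(b - c) for b, c in zip(beats, corrected))
    return 1000.0 * total / len(corrected)


def half_errors(trials, corrected):
    """Error of the averaged first half and of the averaged second half."""
    half = len(trials) // 2
    first = mean_abs_error_ms(average_beats(trials[:half]), corrected)
    second = mean_abs_error_ms(average_beats(trials[half:]), corrected)
    return first, second


# Invented data: four trials of two beats each (seconds).
trials = [[1.10, 2.10], [1.06, 2.06], [1.02, 2.02], [1.00, 2.00]]
corrected = [1.00, 2.00]
first_half, second_half = half_errors(trials, corrected)
```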

The scores of the two halves are at opposite ends of the "B-" range:

On the average, beats in the second half of the trials are 6 milliseconds closer to the "true" beat locations:

Also, averaging the performance of the second half of the trials improves the error displacement by 1.46 milliseconds on the average:

Plotting the average displacement error for each trial

This section plots the average displacement error for each trial which shows the gradual improvement in the reverse conducting over time.

Note that the average correction of the averaged data was 48 milliseconds, which is better than all of the individual trials. This should be double checked, but it seems to mean that you can get better absolute positioning of the beats by averaging all trials together than by finding the best particular performance.

[Graphics:Images/index_gr_73.gif]

Now fit an exponentially decreasing curve through the trial errors to model the "learning curve" for reverse conducting this performance.

[Graphics:Images/index_gr_77.gif]

[Graphics:Images/index_gr_79.gif]

Predictions of accuracy can be derived from the learning curve over time. Doubling the number of trials should result in an average error of 44 ms, or 17% greater accuracy than by stopping at 20 trials.

Doing 40 trials would improve the displacement error by about 2 milliseconds:

After 100 trials, the accuracy should be twice as good as after 20 trials:

Doing 100 trials would improve the average displacement error by about 20 milliseconds (assuming an exponentially decaying learning curve):

Of course, these estimates of accuracy after a certain number of trials do not take into account fatigue (trial 15 above is a possible example) or the effects of taking breaks between trials (all 20 trials were done consecutively).
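The fitting and extrapolation steps in this section can be sketched in Python as shown below. The model form is an assumption: a pure exponential decay, error = a * exp(-k * trial), fitted by least squares on the logarithm of the errors. The notebook's actual fit may use a different form, and the example errors are synthetic.

```python
import math


def fit_exponential(errors_ms):
    """Fit error = a * exp(-k * trial), with trials numbered from 1.

    Uses ordinary least squares on log(error), which is linear in the trial.
    """
    n = len(errors_ms)
    xs = list(range(1, n + 1))
    ys = [math.log(e) for e in errors_ms]
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    a = math.exp(y_mean - slope * x_mean)
    return a, -slope


def predict(a, k, trial):
    """Extrapolated average displacement error at a given trial number."""
    return a * math.exp(-k * trial)


# Synthetic errors that follow the assumed model exactly (not measured data).
example_errors = [80.0 * math.exp(-0.02 * t) for t in range(1, 21)]
a, k = fit_exponential(example_errors)
error_at_40 = predict(a, k, 40)
```

Under this model the error halves every ln(2)/k trials, so an extrapolation to 100 trials is very sensitive to the fitted decay rate.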

Effect of removing the "bad" trial

What happens if the "bad" trial (number 15 in this case) is removed from the average tapping time for each beat?

[Graphics:Images/index_gr_92.gif]

[Graphics:Images/index_gr_96.gif]

[Graphics:Images/index_gr_98.gif]

So removing the "bad" trial actually increased the overall average displacement error in this case, by about 0.5 milliseconds.

Trying to identify "bad" performances from the average

Instead of comparing the individual performances to the corrected data, compare them to the average of all trials.

[Graphics:Images/index_gr_107.gif]

Trial 15 is still identified as the "worst" performance when compared to the average of all trials rather than to the corrected times. So if it is necessary to throw out "bad" performances, then this method would work in identifying the bad performances.
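A minimal Python sketch of this check, with hypothetical names and invented tap times, ranks each trial by its mean absolute distance from the average of all trials, so no corrected data is required:

```python
def worst_trial_against_average(trials):
    """Return the 1-based number of the trial farthest from the trial average.

    Each trial is a list of beat times in seconds. The reference is the
    per-beat average over all trials, including the trial being judged.
    """
    reference = [sum(taps) / len(taps) for taps in zip(*trials)]
    errors = [
        sum(abs(t - r) for t, r in zip(trial, reference)) / len(reference)
        for trial in trials
    ]
    worst_index = max(range(len(trials)), key=errors.__getitem__)
    return worst_index + 1, errors


# Invented data: the third trial lags well behind the other two.
worst, errors = worst_trial_against_average(
    [[1.00, 2.00], [1.01, 2.01], [1.20, 2.20]]
)
```

Because the judged trial is part of the reference, a very bad trial pulls the average toward itself. Comparing against the average of the other trials instead would be a reasonable variation.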

Dropping any single trial

Dropping the worst trial does not improve the accuracy of the average. Does dropping any other trial improve the average of all trials?

[Graphics:Images/index_gr_113.gif]

If any of trials 1, 2, 3, or 5 is dropped, then the average of the remaining trials has a lower displacement error. The horizontal line represents the displacement error when averaging all trials. So it looks like dropping earlier trials will help improve the accuracy of the average trial.
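The leave-one-out comparison could be computed along these lines. This is a Python sketch with hypothetical names and invented data, assuming aligned beat times in seconds.

```python
def error_of_average_ms(trials, corrected):
    """Mean absolute error (ms) of the per-beat average of the given trials."""
    averaged = [sum(taps) / len(taps) for taps in zip(*trials)]
    total = sum(abs(a - c) for a, c in zip(averaged, corrected))
    return 1000.0 * total / len(corrected)


def leave_one_out_errors(trials, corrected):
    """Error of the average with each trial dropped in turn."""
    return [
        error_of_average_ms(trials[:skip] + trials[skip + 1:], corrected)
        for skip in range(len(trials))
    ]


# Invented data: three single-beat trials against one corrected beat.
trials = [[1.09], [1.00], [1.03]]
corrected = [1.00]
baseline = error_of_average_ms(trials, corrected)
dropped = leave_one_out_errors(trials, corrected)
```

Entries of `dropped` that fall below `baseline` mark trials whose removal improves the average, which corresponds to points under the horizontal line in the plot.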

Dropping a range of trials

Now examine how the displacement error changes as more and more of the earlier trials are removed from the average.

[Graphics:Images/index_gr_120.gif]

Dropping the first 15 trials gives the best error of about 46 ms, which is 2 ms better than averaging all trials together. However, dropping the first 14 trials gives the same error as averaging all trials. Even dropping the first 18 trials yields a 1 ms improvement in the error. Dropping more than 10 trials makes the result less certain, as can be seen in the jaggedness of the right side of the plot. It seems reasonable to assume that dropping the first 10 trials, on the grounds that the piece has been learned better by then, will lower the average displacement error of the average beat times.

In this plot, Trial 15 does seem to be a "bad" performance. It must, however, counteract other "bad" qualities of previous trials, and so must be included when earlier trials are also included.
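The curve in this section can be reproduced in outline by averaging only the trials from some starting trial onward. The Python sketch below uses hypothetical names and invented data.

```python
def errors_dropping_early(trials, corrected):
    """Error (ms) of the average after dropping the first k trials.

    Entry k of the result uses trials k+1 through the last one, so entry 0
    is the error of the average of all trials.
    """
    results = []
    for k in range(len(trials)):
        kept = trials[k:]
        averaged = [sum(taps) / len(taps) for taps in zip(*kept)]
        total = sum(abs(a - c) for a, c in zip(averaged, corrected))
        results.append(1000.0 * total / len(corrected))
    return results


# Invented data: the taps get closer to the corrected beat in later trials.
curve = errors_dropping_early([[1.10], [1.04], [1.01]], [1.00])
```

The last entries average very few trials, which is one reason the right side of such a plot tends to be jagged.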

Just for fun, here is a plot created by dropping the later trials more and more:

[Graphics:Images/index_gr_127.gif]

[Graphics:Images/index_gr_129.gif]

You definitely do not want to drop later trials from the average, since dropping any later trials and keeping the earlier trials will always significantly increase the average displacement error.

Also note that all individual trials are worse than the trial average, and worse than averages involving only the later trials:

[Graphics:Images/index_gr_135.gif]

[Graphics:Images/index_gr_137.gif]

The red curve in the plot above shows the average displacement errors from the corrected times for each of the 20 individual trials. The blue curve shows the effect of dropping more and more of the later trials. The black curve shows the effect of dropping more and more of the earlier trials. Only trials 14 and 20 are close to the error rate of the average of all trials.

Offset sensitivity

Individual trials were aligned to the Modern test case using a few points which seemed to be in a stable tempo region. How accurate is the offset calculated from these few points in the audio? Would the average of all trials improve if an offset is calculated for each trial based on the corrected data?

The average displacement error could be reduced by shifting the average trial beat times 4.6 milliseconds forward in time (total range -22 ms to +60 ms). This might be related to the 5 ms resolution of the computer keyboard. The offset of +60 ms may be related to poor alignment at the measured points in the audio, or perhaps even to operating system multitasking. This large offset error may also be a result of the "badness" of trial 15 (e.g., when displayed on the learning curve with other trials).

However, note that finding the individual offsets for the trials using more points does not yield a more accurate average displacement error (48.6 ms compared to 48.0 ms), so it seems that aligning only to beats in stable portions of the music is better than aligning to all beats in the music.
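One way to examine offset sensitivity is to shift the signed differences by a range of candidate offsets and see where the average displacement error is smallest. The Python sketch below does this with hypothetical names and invented data. It is not necessarily how the offsets in the notebook were calculated.

```python
def error_after_shift(differences_ms, shift_ms):
    """Mean absolute difference after subtracting a constant shift (ms)."""
    return sum(abs(d - shift_ms) for d in differences_ms) / len(differences_ms)


def best_shift(differences_ms, candidates_ms):
    """Candidate shift giving the smallest mean absolute difference."""
    return min(candidates_ms, key=lambda s: error_after_shift(differences_ms, s))


# Invented signed differences (ms) and a 1 ms grid of candidate shifts.
diffs = [-5, 10, 20, 30, 100]
shift = best_shift(diffs, range(-50, 101))
residual = error_after_shift(diffs, shift)
```

For the mean absolute error, the best shift is the median of the differences rather than their mean, so a few large outliers (such as the 100 ms value here) have little effect on it.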

Don't know what it means, but the following plot shows a wider minimum in the error rate when shifting the averaged trial differences if the offsets have been optimized for all beats in the piece rather than for the small set of beats used in the second plot:

[Graphics:Images/index_gr_154.gif]

Here is a plot using the initial offset values:

[Graphics:Images/index_gr_156.gif]

Comparing corrected beats to average reverse conducting beats

This section plots the average tempo range and compares it to the actual tempos measured from the audio file.

[Graphics:Images/index_gr_163.gif]

[Graphics:Images/index_gr_165.gif]

[Graphics:Images/index_gr_169.gif]

[Graphics:Images/index_gr_171.gif]

[Graphics:Images/index_gr_173.gif]

Here is a zoomed-in view of the tempo, eight measures at a time. The black line indicates the mean reverse-conducted duration for every beat. The dark gray lines surrounding the average duration line mark the 95% confidence range for the true mean, and the light gray lines indicate the maximum and minimum durations for each beat from all reverse conducting trials.

The red dots indicate the beat event durations from the audio file, which are more accurate than the average reverse conducting durations and can be assumed to be the "correct" answer. Notice that the average reverse conducting beats are often one beat behind the red plot when the tempo changes suddenly, which probably shows the delayed reaction time in listening to the performance.
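The quantities drawn for each beat (mean, 95% confidence range, minimum, and maximum) can be sketched in Python as follows. The names and data are hypothetical, and the confidence range uses the normal approximation (1.96 standard errors), which is an assumption. With 20 trials, a Student t multiplier of about 2.09 would give a slightly wider range.

```python
import statistics


def beat_summary(durations_s, multiplier=1.96):
    """Summarize one beat's durations (seconds) across all trials."""
    mean = statistics.mean(durations_s)
    # Standard error of the mean, from the sample standard deviation.
    sem = statistics.stdev(durations_s) / len(durations_s) ** 0.5
    return {
        "mean": mean,
        "ci_low": mean - multiplier * sem,
        "ci_high": mean + multiplier * sem,
        "min": min(durations_s),
        "max": max(durations_s),
    }


def tempo_bpm(duration_s):
    """Convert a beat duration in seconds to a tempo in beats per minute."""
    return 60.0 / duration_s


# Invented durations of a single beat in four trials.
summary = beat_summary([0.50, 0.52, 0.48, 0.50])
```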

[Graphics:Images/index_gr_176.gif]

[Graphics:Images/index_gr_177.gif]

[Graphics:Images/index_gr_178.gif]

[Graphics:Images/index_gr_179.gif]

[Graphics:Images/index_gr_180.gif]

[Graphics:Images/index_gr_181.gif]

[Graphics:Images/index_gr_182.gif]

[Graphics:Images/index_gr_183.gif]

[Graphics:Images/index_gr_184.gif]

[Graphics:Images/index_gr_185.gif]

[Graphics:Images/index_gr_186.gif]

[Graphics:Images/index_gr_187.gif]

[Graphics:Images/index_gr_188.gif]

[Graphics:Images/index_gr_189.gif]

Historical Recording Sample
Mazurka in F Minor, Op. 7, No. 3
Friedman 1930

Now do the same analysis on the historic recording sample, and compare any similarities/differences.

Loading the raw data

First load the absolute beat positions for the average beat positions and the corrected beat positions:

Now measure the time difference between each average and corrected beat. A negative value means that the average beat occurs before the corrected beat time.

The average correction to each beat from the average tap times of all trials was 46.2 milliseconds. This is almost 2 milliseconds better than the average for the modern recording sample.

Manual correction of average reverse conducting beat positions

This section displays the difference between the average reverse conducted beat positions and the values derived from manually correcting these values by ear using a soundfile editor.

Here are the timing differences between the average reverse-conducting absolute times and the absolute times measured in a sound editor:

[Graphics:Images/index_gr_209.gif]

The average beats are noticeably off-center from the corrected beats: they are, on the average, 27 milliseconds before the corrected beats. This is most likely due to error in identifying the correct time offsets for each trial. 28 time points were used to align the trials to the soundfile. The larger error in the offset value is probably due to the use of alignment beats from unstable tempo regions of the piece.

<< Graphics`Graphics`

[Graphics:Images/index_gr_213.gif]

Similar to the modern sample, 1.5% of the average reverse conducting beats were close enough to the beat positions found by ear that they did not require alteration to align with the audio beat events. The distribution should be smooth (about half as many beats are expected to fall exactly on the correct beat), so there are some beats marked as 0 ms from the true location which are actually occurring slightly after the actual beat position.

Trial Quality Measurements

Qualitatively, a score of 0-1 is an "A", 1-2 is a "B", 2-3 is a "C", 3-4 is a "D", 4-5 is an "E", and 5-6 is an "F". A score of 6 means the correction was 640 milliseconds, or about 2/3 of a second.

Mostly the corrections for the average of the tapping trials are better allocated between A, B, C, and D quality. Again, the "A" corrections were the most common. The worst case correction was between a value of 5 and 6 (for the last note in the piece). Other corrections in the range between 4 and 5 are also likely due to an unprepared change in tempo which was not expected beforehand.

<< Graphics`Graphics`

[Graphics:Images/index_gr_215.gif]

35% of the average reverse conducting beats were at the level of no audible difference, with about 20% for each of the next three categories of "slight audible difference", "minor audible difference", and "noticeable audible difference". The higher number of beats in the "A" region may be partially due to the fact that the tempo of this performance is faster. The overall score for the average reverse conducting of the performance is a "B" (compared to a "B-" for the modern sample):

Here is a histogram display of the necessary corrections with positive and negative adjustments separated. Since the offset value was not estimated very accurately, the histogram shows most average beats occurring after the corrected beat rather than before it (or is that vice-versa?):

[Graphics:Images/index_gr_219.gif]

Examine the learning curve

Now calculate the timing differences between the individual trials and the corrected beat times.

The first tapping trial gets a score of "B-", the worst trial gets a "C", and the best trial gets a "B-". The last trial is not the best trial.

Here are quality score plots for each trial.

[Graphics:Images/index_gr_225.gif]

[Graphics:Images/index_gr_226.gif]

[Graphics:Images/index_gr_227.gif]

[Graphics:Images/index_gr_228.gif]

[Graphics:Images/index_gr_229.gif]

[Graphics:Images/index_gr_230.gif]

[Graphics:Images/index_gr_231.gif]

[Graphics:Images/index_gr_232.gif]

[Graphics:Images/index_gr_233.gif]

[Graphics:Images/index_gr_234.gif]

[Graphics:Images/index_gr_235.gif]

[Graphics:Images/index_gr_236.gif]

[Graphics:Images/index_gr_237.gif]

[Graphics:Images/index_gr_238.gif]

[Graphics:Images/index_gr_239.gif]

[Graphics:Images/index_gr_240.gif]

[Graphics:Images/index_gr_241.gif]

[Graphics:Images/index_gr_242.gif]

[Graphics:Images/index_gr_243.gif]

[Graphics:Images/index_gr_244.gif]

How does the accuracy of the first half of the trials compare to the second half of the trials?

[Graphics:Images/index_gr_247.gif]

[Graphics:Images/index_gr_249.gif]

So averaging only the last 10 trials of the set of 20 gives a smaller improvement in the average correction distance to the "true" beat than it did for the modern sample.

The scores of the two halves are at opposite ends of the "B" range:

On the average, beats in the second half of the trials are 3.7 milliseconds closer to the "true" beat locations:

Also, averaging the performance of the second half of the trials does not improve the error displacement:

Plotting the average displacement error for each trial

This section plots the average displacement error for each trial which shows the gradual improvement in the reverse conducting over time.

Note that the average correction of the averaged data was 46 milliseconds, which is better than all of the individual trials except for #19.

[Graphics:Images/index_gr_267.gif]

Now fit an exponentially decreasing curve through the trial errors to model the "learning curve" for reverse conducting this performance.

[Graphics:Images/index_gr_271.gif]

[Graphics:Images/index_gr_273.gif]

Predictions of accuracy can be derived from the learning curve over time. Doubling the number of trials should result in an average error of 43 ms.

Doing 40 trials would improve the displacement error by about 2.6 milliseconds:

After 100 trials, the accuracy would be about twice as good as after 20 trials:

Doing 100 trials would improve the average displacement error by about 17.5 milliseconds (assuming an exponentially decaying learning curve):

Effect of removing the "bad" trial

What happens if the "bad" trial (number 15 in this case) is removed from the average tapping time for each beat?

[Graphics:Images/index_gr_286.gif]

[Graphics:Images/index_gr_290.gif]

[Graphics:Images/index_gr_292.gif]

Removing the "bad" trial decreases the overall average displacement error in this case by about 0.2 milliseconds, which is very small.

Trying to identify "bad" performances from the average

Instead of comparing the individual performances to the corrected data, compare them to the average of all trials.

[Graphics:Images/index_gr_301.gif]

Trial 13 is still identified as the "worst" performance when compared to the average of all trials rather than to the corrected times. So if it is necessary to throw out "bad" performances, then this method would work in identifying the bad performances.

Dropping any single trial

Dropping the worst trial does not improve the accuracy of the average. Does dropping any other trial improve the average of all trials?

[Graphics:Images/index_gr_309.gif]

If any one trial is excluded, then the average of the remaining trials has a higher displacement error. The horizontal line at the bottom of the plot represents the displacement error when averaging all trials.

Dropping a range of trials

Now examine how the displacement error changes as more and more of the earlier trials are removed from the average.

[Graphics:Images/index_gr_316.gif]

Dropping the first 15 trials gives the best error of under 46 ms, which is 0.5 ms better than averaging all trials together. Dropping the first 10 trials yields the same average error as including all 20 trials.

Just for fun, here is a plot created by dropping the later trials more and more:

[Graphics:Images/index_gr_323.gif]

[Graphics:Images/index_gr_325.gif]

You definitely do not want to drop later trials from the average, since dropping any later trials and keeping the earlier trials will always significantly increase the average displacement error.

Also note that all individual trials (except trial #19) are worse than the trial average, and worse than averages involving only the later trials:

[Graphics:Images/index_gr_331.gif]

[Graphics:Images/index_gr_333.gif]

The red curve in the plot above shows the average displacement errors from the corrected times for each of the 20 individual trials. Only trials 14 and 20 are close to the error rate of the average of all trials.