
Wednesday, December 5, 2012

An OSC / ChucK Windows 8 instrument

OSC


Up until now, I had avoided OSC because getting synths that understand it set up correctly was inconsistent, if not difficult.  At the time, I was wringing all I could out of MIDI, or rather unhappily building internal audio engines, knowing that the result would not be as good as a battle-hardened synth.  I have tinkered with Pd (hard to do non-trivial programming in brute-force visual spaghetti code), SuperCollider (a rather horrible language, but more complete programming capability), and ChucK (a little weird at first, but a great language, though performance is not necessarily great).  The other main issue was that before I found myself on an x86 tablet, the GPL licenses for SuperCollider and ChucK were problematic.  On iOS, you end up having to bake everything into a monolithic app.




But I really wanted to offload all signal processing into one of these environments somehow, and it turns out that ChucK does OSC really nicely.  It's pointless (and ridiculous) for me to spend my time implementing an entire windowing system because the native (or just common) toolkits have too much latency, and it's just as foolish for me to try to compete with the greatest synthesizers out there.  So, I offloaded absolutely everything that's not in my area of expertise or interest.  The synthesizer behind it?  The ChucK script below.

Here is a ChucK server that implements a 10-voice OSC synth with a timbre control on the y axis (implemented a few minutes after the video above).  The controller just sends tuples of (voice, amplitude, frequency, timbre):

//chucksrv.ck
//run like: chuck chucksrv.ck
"/rjf,ifff" => string oscAddress;
1973 => int oscPort;
10 => int vMax;
JCRev reverb;
SawOsc voices[vMax];
SawOsc voicesHi[vMax];
for( 0 => int vIdx; vIdx < vMax; vIdx++) {
  voices[vIdx] => reverb;
  voicesHi[vIdx] => reverb;
  0 => voices[vIdx].gain;
  440 => voices[vIdx].freq;
  0 => voicesHi[vIdx].gain;
  880 => voicesHi[vIdx].freq;
}
0.6 => reverb.mix;
reverb => dac;
OscRecv recv;
oscPort => recv.port;
recv.listen();
recv.event( oscAddress ) @=> OscEvent oe;
while( oe => now ) {
  if( oe.nextMsg() ) {
    oe.getInt() => int voice;
    oe.getFloat() => float gain;
    0.5 * oe.getFloat() => float freq;
    freq => voices[voice].freq;
    2 * freq => voicesHi[voice].freq;
    oe.getFloat() => float timbre;
    timbre * 0.025125 * gain => voices[voice].gain;
    (1 - timbre) * 0.0525 * gain => voicesHi[voice].gain;
   
    //<<< "voice=", voice, ",vol=", gain, ",freq=", freq, ",timbre", timbre >>>;
  }
  //me.yield();
}
while( true ) {
  1::second => now;
  me.yield();
}

The two main things you need to know about ChucK to decipher this are, first, that assignment is syntactically backwards: "rhs => type lhs" rather than the traditional C "type lhs = rhs", and the "@=>" operator is assignment by reference.  The other is the special variable "now".  Usually "now" would be a read-only value, but in ChucK you set up a graph of oscillators and then advance time forward explicitly (ie: move forward by 130ms, or move time forward until an event comes in).  So, in this server, I just made an array of oscillators such that incoming messages use one per finger.  When a message comes in, it selects the voice and sets volume, frequency, and timbre.  It really is that simple.  Here is a client that generates a random orchestra that sounds like microtonal violinists going kind of crazy (almost all of the code is orthogonal to the task of simply understanding what it does; the checks against random variables just create reasonable distributions for jumping around by fifths, octaves, and along scales):

//chuckcli.ck
//run like:  chuck chuckcli.ck
"/rjf,ifff" => string oscAddress;
1973 => int oscPort;
10 => int vMax;
"127.0.0.1" => string oscHost;
OscSend xmit;
float freq[vMax];
float vol[vMax];
for( 0 => int vIdx; vIdx < vMax; vIdx++ ) {
  220 => freq[vIdx];
  0.0 => vol[vIdx];
}

[1.0, 9.0/8, 6.0/5, 4.0/3, 3.0/2] @=> float baseNotes[];
float baseShift[vMax];
int noteIndex[vMax];
for( 0 => int vIdx; vIdx < vMax; vIdx++ ) {
  1.0 => baseShift[vIdx];
  0 => noteIndex[vIdx];
}
xmit.setHost(oscHost,oscPort);
while( true )
{
  Std.rand2(0,vMax-1) => int voice;
  //(((Std.rand2(0,255) / 256.0)*1.0-0.5)*0.1*freq[voice] + freq[voice]) => freq[voice];
  ((1.0+((Std.rand2(0,255) / 256.0)*1.0-0.5)*0.0025)*baseShift[voice]) => baseShift[voice];
  //Maybe follow leader
  if(Std.rand2(0,256) < 1) {
    0 => noteIndex[1];
    noteIndex[1] => noteIndex[voice];
    baseShift[1] => baseShift[voice];
  }
  if(Std.rand2(0,256) < 1) {
    0 => noteIndex[0];
    noteIndex[0] => noteIndex[voice];
    baseShift[0] => baseShift[voice];
  }
  //Stay in range
  if(vol[voice] < 0) {
    0 => vol[voice];
  }
  if(vol[voice] > 1) {
    1 => vol[voice];
  }
  if(baseShift[voice] < 1) {
    baseShift[voice] * 2.0 => baseShift[voice];
  }
  if(baseShift[voice] > 32) {
    baseShift[voice] * 0.5 => baseShift[voice];
  }
  //Maybe silent
  if(Std.rand2(0,64) < 1) {
    0 => vol[voice];
  }
  if(Std.rand2(0,3) < 2) {
    0.01 +=> vol[voice];
  }
  if(Std.rand2(0,1) < 1) {
    0.005 -=> vol[voice];
  }
  //Octave jumps
  if(Std.rand2(0,4) < 1) {
    baseShift[voice] * 2.0 => baseShift[voice];
  }
  if(Std.rand2(0,4) < 1) {
    baseShift[voice] * 0.5 => baseShift[voice];
  }
  //Fifth jumps
  if(Std.rand2(0,256) < 1) {
    baseShift[voice] * 3.0/2 => baseShift[voice];
  }
  if(Std.rand2(0,256) < 1) {
    baseShift[voice] * 2.0/3 => baseShift[voice];
  }
  //Walk scale
  if(Std.rand2(0,8) < 1) {
    0 => noteIndex[voice];
  }
  if(Std.rand2(0,16) < 1) {
    (noteIndex[voice] + 1) % 5 => noteIndex[voice];
  }
  if(Std.rand2(0,16) < 1) {
    (noteIndex[voice] - 1 + vMax) % 5 => noteIndex[voice];
  }
  //Make freq
  27.5 * baseShift[voice] * baseNotes[noteIndex[voice]] => freq[voice];
  xmit.startMsg(oscAddress);
  xmit.addInt(voice);
  xmit.addFloat(vol[voice]);
  xmit.addFloat(freq[voice]);
  xmit.addFloat(1.0);
  35::ms +=> now;
  <<< "voice=", voice, ",vol=", vol[voice], ",freq=", freq[voice] >>>;
}

What is important is the xmit code.  When I went to implement OSC manually in my Windows instrument, I had to work out a few nits in the spec to get things working.  The main thing is that OSC messages are just about as simple as can be imagined (although a bit inefficient compared to MIDI).  The first thing to know is that all elements of OSC messages must be padded to multiples of 4 bytes.  In combination with Writer stream APIs that don't null terminate for you, you just need to be careful to write at least one null terminator and up to 3 extra null bytes to pad each string out to a 4-byte boundary.  So, OSC is like a remote procedure call mechanism where the function name is a URL-like path in ASCII, followed by a function signature in ASCII, followed by the binary data (big-endian 32-bit ints and floats, etc).

"/foo"   //4 bytes
\0 \0 0 \0   //this means we need 4 null terminator bytes
",if"          //method signatures start with comma, followed by i for int, f for float (32bit bigendian)
\0              //there were 3 bytes in the string, so 1 null terminator makes it 4 byte boundary
[1234]    //literally, a 4 byte 32-bit big endian int, as the signature stated
[2.3434] //literally, a 4 byte 32-bit big endian float, as signature stated 

There is no other messaging ceremony required.  The set of methods defined is up to the client and server to agree on.

Note that the method signature and null terminators tell the parser exactly how many bytes to expect.  Note also that the major synths generally use UDP(!!!).  So, you have to write things as if messages are randomly dropped (they are; they will be).  For instance, you will get a stuck note if you only sent volume zero once to turn off a voice, or a leak if you expected the other end to reliably follow every message.  So, when you design your messages in OSC, you should make heartbeats double as mechanisms to explicitly zero out notes that are assumed to be dead (ie: infrequently send 'note off' to all voices to cover for packet losses).  If you think about it, this means that even though OSC is agnostic about the transport, in practice you need to design the protocol as if UDP is the transport.
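
To make the padding rules concrete, here is a minimal sketch in C that hand-packs one of the (voice, amplitude, frequency, timbre) tuples used above and fires it at the ChucK server over UDP.  None of it comes from an OSC library; the helper names and the hard-coded host/port are just assumptions for illustration.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

//Append a string plus 1 to 4 null bytes, so the total is a multiple of 4.
static int oscPadString(char *buf, int at, const char *s) {
  int len = (int)strlen(s);
  int padded = (len / 4 + 1) * 4;  //always leaves at least one null terminator
  memcpy(buf + at, s, len);
  memset(buf + at + len, 0, padded - len);
  return at + padded;
}

//Append a 32-bit big-endian int.
static int oscPutInt(char *buf, int at, int32_t v) {
  uint32_t be = htonl((uint32_t)v);
  memcpy(buf + at, &be, 4);
  return at + 4;
}

//Append a 32-bit big-endian float (same byte shuffling as the int case).
static int oscPutFloat(char *buf, int at, float f) {
  uint32_t bits;
  memcpy(&bits, &f, 4);
  return oscPutInt(buf, at, (int32_t)bits);
}

//Build and send one "/rjf" (voice, amplitude, frequency, timbre) message.
static void oscSendRjf(int sock, struct sockaddr_in *to,
                       int voice, float amp, float freq, float timbre) {
  char buf[64];
  int at = 0;
  at = oscPadString(buf, at, "/rjf");   //4 chars + 4 nulls = 8 bytes
  at = oscPadString(buf, at, ",ifff");  //5 chars + 3 nulls = 8 bytes
  at = oscPutInt(buf, at, voice);
  at = oscPutFloat(buf, at, amp);
  at = oscPutFloat(buf, at, freq);
  at = oscPutFloat(buf, at, timbre);
  sendto(sock, buf, at, 0, (struct sockaddr *)to, sizeof(*to));
}

int main(void) {
  int sock = socket(AF_INET, SOCK_DGRAM, 0);
  struct sockaddr_in to;
  memset(&to, 0, sizeof(to));
  to.sin_family = AF_INET;
  to.sin_port = htons(1973);                     //same port as the ChucK server
  inet_pton(AF_INET, "127.0.0.1", &to.sin_addr);
  oscSendRjf(sock, &to, 0, 0.5f, 440.0f, 1.0f);  //voice 0 on
  oscSendRjf(sock, &to, 0, 0.0f, 440.0f, 1.0f);  //the kind of 'note off' a heartbeat would repeat
  close(sock);
  return 0;
}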

OSC Ambiguity

So, the protocol defines little more than a verbose RPC syntax, where the names look like little file paths (the parents are the scope, and the bottom-most file in the directory is the method name to invoke).  You can make a dead simple instrument that only sends tuples to manipulate the basic voice splines (voiceid, volume, frequency, timbre).  It will work, and will be a million times easier than trying to do this in MIDI.  Everything, including units, is up to you (literal Hz frequencies? floating-point 'MIDI numbers', which are log frequencies? etc.).  That's where the madness comes in.

If you use OSC, you must be prepared to ship with a recommendation for a freeware synth (ie: ChucK or SuperCollider), instructions on exactly how to set it up and run it, and an actual protocol specification for your synth (because an arbitrary synth will not know your custom protocol).  This is the biggest stumbling block to shipping with OSC over MIDI.  But I have finally had enough of trying to make MIDI work for these kinds of instruments.  So, here is an OSC version.  It really is a custom protocol.  The OSC "specification" really just defines a syntax (like JSON, etc).  Implementing it means little for compatibility, as it's just one layer of it.  But if you plan on using SuperCollider or ChucK for the synth, it's a pretty good choice.  You can scale OSC implementations down to implement exactly and only what you need.

Monday, November 26, 2012

Thought Experiment: Forget MIDI and OSC

 


MIDI

Trying to wrangle MIDI into handling polyphonic bending is so difficult that almost no synths get it right.  If you disagree with this statement, then you are surely spending all of your time on discrete-key instruments, a small subset of the kinds of instruments you can make on a touch screen.  If you are using MIDI for something that is polyphonic and fretless, then you will notice that very few synths can do this correctly.  Most of the capability is there if the synthesizers simply behave like multi-timbral synths, even when one voice is assigned to all channels; at that point, the remaining issue is that setting up bend width is problematic from the point of view of ubiquity.  MIDI also thoroughly vexes end users when a single instrument has to span all MIDI channels.  I do nothing but deal with support email for the 90% of synths that don't handle much of anything outside of piano controllers correctly, even though I document in detail what is supposed to work and what you should not try to make work (ie: Arctic vs Animoog, etc.).  But MIDI's biggest problem is that the first thing any music protocol should solve is to simply express pitch and volume correctly.  Maddeningly, MIDI just can't do this, because it's very much note and discrete-key oriented.  MIDI is strangling progress on touch screens just as much as it helps progress.  Music notes do not belong in protocols; they are a premature rounding off of the data.  We must deal with frequencies directly, or everything turns into a mess when handling real-world scenarios from touchscreens.

http://rrr00bb.blogspot.com/2012/04/ideal-midi-compatible-protocol.html

Read that link if you want to know what an unholy mess MIDI can be when you try to do something as simple as getting correct pitch handling; the situation becomes untenable once you go to microtonal and fretless MIDI.

OSC

OSC, on the other hand, could readily solve all of my technical issues because it can easily represent what I am trying to do.  The problem is that almost nothing implements it.  And where it is implemented, only the syntax is standardized.  It's very much like opening a *.xml file and not having the faintest clue what its semantics are, or what program is supposed to consume and create that format.  Even worse, most "OSC" implementations transport over MIDI.  This is like tunneling floating point numbers over integers; it's conceptually backwards.  It's a lot of useless indirection that simply guarantees that nobody implements anything correctly.

The Simplest Possible Protocol That Can Possibly Work


So, what happens if I just forget about all standard protocols, and design the simplest thing that could possibly work?  I am convinced that it would be easier to get that standardized than it would be to subset the complex protocols we already have.  Observe the following:
  • The music instrument industry currently has almost nothing to do with these new instruments.  It is mostly just chugging along in the direction it has been going, using tablet devices for little more than weak signal-processing brains, or re-configurable knob/slider surfaces.  Everything they announce is just another piano controller with a bunch of knobs/sliders and a brain.  It isn't, say, guitars that can do what all synths and all guitars can do (correctly!  MIDI can't do the basic pitch handling).  It isn't, say, violins either.  It isn't microphones that can do the opposite of auto-tune and take a human voice and play instrument patches at the exact frequencies sung into the mic (even if none of the notes are close to the standard 12 notes).  MIDI is the root cause, because the protocol forces you into a discrete-note oriented mindset.  It's a mistake to think that the music instrument industry is really relevant here; we need to do what is best for tablet apps first.
  • Almost everybody using tablets is reluctant to deal with MIDI cables or wireless connections anyhow.  The reasons vary from latency concerns, to setup complexity, to a kludgy Camera Connection Kit way of getting connected.  We are standardizing on MIDI only because it was an easily available low-latency pipe.  It's weird that you need to use the MIDI protocol just to use the pipe.
  • Since the tablet is both the controller and the synthesizer, there is no reason to let obese hardware-oriented specifications slow us down.  Presuming that you needed to fix something, you would get a result an order of magnitude faster by simply getting things working, publishing the protocol, and waiting for the hardware vendors to follow the lead of a popular app that implements it, than by getting the MIDI or OSC groups to make a necessary change for you.
So the main thing I need (what kills me about MIDI) is stupidly simple.  I just need to control independent voice splines, with continuous updates.  There is no need for a complex protocol to do this.  I write my own so that it's easy enough that I can describe it to any developer.  So 90% of it looks like this:

//When controller sends messages to the synth, the most basic method just
//sets voice properties.
//
//64 bits, all voice properties are set separately
struct SetProperty {
  int16 voice;  //there are no notes, just voices with multiple properties
  int16 property; //common stuff: phase,amplitude,pitch,timbre[i],etc.
  float32 val;
}

//Define standard mappings that everything is going to understand.
#define PHASE_property 0 //val==0 means sample begin, 0..1 is the period corresponding to freq
#define LOGFREQ_property 1 //val==33.5 would be pitch of 'midi' note's pitch 33.5
#define AMPLITUDE_property 3 //val==1.0 would be full amplitude for wave
#define TIMBRE_0_property 16 //first timbre control - assume it was mapped
#define TIMBRE_1_property 17 //next timbre control - assume it was mapped

This would handle everything that Geo Synthesizer and Cantor can do.  It is enough to handle polyphonic instruments that may be samplers or ADSR synths, because we explicitly set the phase.  When a finger goes down, it maps to a voice (NOT a note!).  That voice will have its phase set to zero (first SetProperty message), then its frequency set (next message), then its timbres (next messages), then its amplitude (next message).  Then, as the note is held, the pitch and amplitude change as the finger moves around; just send new SetProperty values to do aftertouch effects.  This is totally unlike MIDI, which treats aftertouch as a collection of special cases.
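
Here is a sketch of that sequence, assuming the struct and #defines above are in scope, and that sendProperty() is a hypothetical transport that serializes one 64-bit SetProperty and ships it to the synth (printing stands in for the real pipe here):

#include <stdint.h>
#include <stdio.h>

//Hypothetical transport: a real controller would pack the 64-bit SetProperty
//and write it to the pipe shared with the synth; printing stands in for that.
static void sendProperty(int16_t voice, int16_t property, float val) {
  printf("SetProperty(%d, %d, %f)\n", voice, property, val);
}

//A finger lands: claim a voice (NOT a note) and describe it completely.
static void fingerDown(int16_t voice, float logFreq, float amp, float timbre0) {
  sendProperty(voice, PHASE_property, 0.0f);       //restart the waveform or sample
  sendProperty(voice, LOGFREQ_property, logFreq);  //ie: 33.5 is a quartertone above 'midi' 33
  sendProperty(voice, TIMBRE_0_property, timbre0);
  sendProperty(voice, AMPLITUDE_property, amp);    //amplitude last, so the voice starts coherent
}

//The finger moves: aftertouch is just more of the same messages.
static void fingerMove(int16_t voice, float logFreq, float amp) {
  sendProperty(voice, LOGFREQ_property, logFreq);
  sendProperty(voice, AMPLITUDE_property, amp);
}

//The finger lifts: silence the voice (repeat occasionally if the transport drops messages).
static void fingerUp(int16_t voice) {
  sendProperty(voice, AMPLITUDE_property, 0.0f);
}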

Note that there are no timestamps in the protocol.  That's because we presume to send a message at the time we want it interpreted.  Having timestamps in the protocol only adds latency, or helps when things are sent in big batches (a different and more complex protocol that we should stay away from for now).

Negotiation

Building in a simple negotiation from the beginning helps to ensure that synth and controller never send unintelligible data to each other.  MIDI handles this situation very badly, where you end up having to enumerate synths in an ever-growing list (an assumption of central control).  As an example, presume that the controller and synth simply exchange lists of the properties that they send and understand.  We re-use SetProperty, but use some well-known voice values to mark that we are simply exchanging agreement:

#define IUNDERSTAND_voice -1
#define ISEND_voice -2
//ex:  controller sends 64 bit SetProperty messages:
//  (-2,0,0.0),(-2,1,0.0),(-2,3,0.0),(-2,16,0.0),(-2,17,0.0)
//       which means "I send: phase,logfreq,amplitude,timbre0,timbre1"
//if we don't send any "I understand" messages, then the controller knows that this is
//a one-way conversation, and will not try to send messages to the controller.
//
//if we get "I understand" messages from the synth, then we must not send messages
//not in its list.
//

The whole point of this process is that rather than announcing a vendor or product id (which the other side may never have heard of), we announce what we expect to handle gracefully instead.
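
Here is a sketch of the controller's side of that exchange.  It re-uses the hypothetical sendProperty() transport from the sketch above, and onMessageFromSynth() is just an assumed callback for whatever pipe carries the synth's replies:

#include <stdint.h>

#define MAX_PROPERTIES 32

//Which properties the synth claimed to understand (indexed by property number).
static uint8_t synthUnderstands[MAX_PROPERTIES];
static int     gotAnyIUnderstand = 0;

//Called for every SetProperty arriving from the synth.
static void onMessageFromSynth(int16_t voice, int16_t property, float val) {
  (void)val;
  if (voice == IUNDERSTAND_voice && property >= 0 && property < MAX_PROPERTIES) {
    synthUnderstands[property] = 1;
    gotAnyIUnderstand = 1;
  }
}

//Announce what we produce, re-using SetProperty with the well-known ISEND voice.
static void announceWhatWeSend(void) {
  int16_t sent[] = { PHASE_property, LOGFREQ_property, AMPLITUDE_property,
                     TIMBRE_0_property, TIMBRE_1_property };
  for (unsigned i = 0; i < sizeof(sent) / sizeof(sent[0]); i++) {
    sendProperty(ISEND_voice, sent[i], 0.0f);
  }
}

//Before sending a property during play, check that it survived the negotiation.
static int canSend(int16_t property) {
  if (!gotAnyIUnderstand) return 1;  //one-way conversation: nothing to restrict against
  return property >= 0 && property < MAX_PROPERTIES && synthUnderstands[property];
}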

Proxying Controls

The other thing, besides controlling voices, that we would need to do in an instrument controller is to have some mechanism to proxy the knobs/sliders, etc. of the synths running in the background.  This is really important on iOS because we can't have the kind of setup mess that a typical hardware synthesizer user would deal with.  Because we have a negotiation mechanism, we can safely include any baroque sub-protocols that we need.  Presume that we have a message that announces a blob of bytes; we can use this to send a string.  The synth would like to name the controls (rather than moving them around: the controller uses fixed locations, and the synth may want to rename them):

#define BLOB_voice -3 // (-3,24,0.0) means to expect 24 bytes out of the stream (strings)
#define RENAME_voice -4 //(-4, 16, 0.0) means to expect a blob and use it to rename timbre_0

ex: Rename timbre_0 to 'Filter' (6 byte name):  (-4,16,0.0) (-3,6,0.0) 'Filter'

A synthesizer may want to expose its options to the controller.  So it would need to send a list of mappings from properties to strings, and a mechanism to remap standard properties to synth-specific ones.  Say that the synth sends messages like:

#define IHAVE_voice -5 //(-5, 100,0.0) (-3,24,0.0) 'Filter'
#define REMAPFROM_voice -6 //(-6,16,0.0)(-7,100,0.0)
#define REMAPTO_voice -7

ex Remap standard timbre_0 to synth's 'Filter' at 100:
  (-6,16,0.0)(-7,100,0.0)

We knew to use 100 because earlier we got an IHAVE that gave us the name of the control that we showed to the user.  Once remapped, this:

  (3,16,0.5)  //voice three's timbre_0 set to 0.5

Has the same effect on the synth as:

  (3,100,0.5) //which is the same as 'Filter' set to 0.5
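
On the synth side, the remap boils down to a small lookup table.  Here is a sketch, assuming the #defines above and a hypothetical setVoiceProperty() hook into the synth internals; it ignores the blob/rename messages to stay short:

#include <stdint.h>

#define PROPERTY_TABLE 128

//Hypothetical hook into the synth engine.
void setVoiceProperty(int16_t voice, int16_t property, float val);

//remap[] maps a standard property number onto a synth-internal one; identity by default.
static int16_t remap[PROPERTY_TABLE];
static int16_t pendingFrom = -1;

static void initRemap(void) {
  for (int16_t p = 0; p < PROPERTY_TABLE; p++) remap[p] = p;
}

//Called for every (voice, property, val) tuple the synth receives.
static void synthOnMessage(int16_t voice, int16_t property, float val) {
  if (voice == REMAPFROM_voice) { pendingFrom = property; return; }
  if (voice == REMAPTO_voice && pendingFrom >= 0 && pendingFrom < PROPERTY_TABLE) {
    remap[pendingFrom] = property;  //ie: standard timbre_0 (16) now lands on 'Filter' (100)
    pendingFrom = -1;
    return;
  }
  if (voice >= 0 && property >= 0 && property < PROPERTY_TABLE) {
    setVoiceProperty(voice, remap[property], val);
  }
}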





Monday, October 8, 2012

Carbculus


I am a newbie at doing any kind of detailed carb and insulin counting.  Previously I just tried to count from labels, make my meal hit specific targets like 60g, and keep the insulin constant.  I have no idea how much any of this lines up with reality, so I am experimenting to find out what the actual values are (inasmuch as the equations line up with reality).  Obviously, there are always confounding factors like exercise.  But the point of this is near the bottom, where I try to figure out how many carbs I actually ate by looking at the difference between what I tried to do and what happened.

U is the units for insulin
CIR = Carb To Insulin Ratio (carbgrams / U)
SIR = Sugar To Insulin Ratio ((mg/dL) / U)
T = Target sugar reading (mg/dL)

R = Sugar reading just before eating (mg/dL)
F = Food (carbgrams)
I = Insulin (U)

Example values that mostly stay the same:

CIR=4, SIR=15, T=120

Example values that you plug in every time you eat:

F=45, R=93

Basic Insulin Calculation, of how much you need to stay on target while eating:

InsulinTotal = InsulinForFood + InsulinForCorrection
I = F / CIR + (R-T) / SIR

So I always divide food carbs by 4 (CIR), divide how far off I am by 15 (SIR), and try to reach 120 (T).  Say I eat what I think is 45g of food and I am at 93.  This is simple enough that you can routinely do the calculation in your head:

I = 45/4 + (93-120)/15
  = 45/4 - 27/15

Roughly 10 insulin required.  But one thing I never hear anybody talking about is trying to actually measure how many carbs you actually ate; because all of the carb amounts are guesses unless you revise the estimates with actual measurements.  The problem is that all the variables that we plugged in were exact except for F which is a guess; one that could be very wrong.  If you actually did eat 45g of carbs, then you should be at 120 later; which is the whole point of taking the insulin.

So, I take another measurement when I expect it to be 120.  That gives two new variables:

L = Landing sugar reading (mg/dL)
G = Actual food (carbgrams)

So, you inject the insulin with a guess of F carbs at reading R, then measure your sugar later and call it L, and find the actual carbs G.  And I rewrite the basic insulin calculation to plug in landing sugar instead of target sugar and solve for food.  In essence, I write it as if my landing blood sugar was intentional and figure out how many carbs I ate:

I = G / CIR + (R-L)/SIR

Which you manipulate to get:

G = CarbsWeInjectedFor + CarbsPerSugar * HowFarOffWeWere
G = CIR * I + CIR/SIR * (L - R)

So say I had injected 13 units and landed at 130 rather than 120.  You do it in your head like this:

G = 4*13 + 4/15 * (130 - 120)
    = 4*13 + 4/15 * 10        #off by 10 sugar, injected 13
    = 4*13 + 8/30 * 10
    = 2*26 + 8/3
    = 52 + 8/3

Roughly 55 carbs.  You can memorize CIR/SIR (ie: 8/30), so this isn't any harder than the basic equation.  There doesn't seem to be a lot of point in getting highly precise, because carb counting involves guesswork, as does the way your body is going to absorb it.  But this tells me that I can go over all the numbers in my pump and look for systematic carb counting errors (ie: I assume that I am always trying to reach 120 and that I have before/after meal readings).  So, you get the number I from the pump, which takes in F and R as parameters each time you eat (with all of the others set to constants), and plug in L 2 hours later as a check.  A little python program that I haven't used yet (I just came up with the carb regression a little while ago, so I will see how it goes):


CIR = 4.0
SIR = 15.0
T = 120.0

def EatAt(records,n, F, R):
  records[n] = {}
  records[n]["F"] = F
  records[n]["R"] = R
  records[n]["I"] = F/CIR + (R-T)/SIR

def Regress(records,n,L):
  R = records[n]["R"]
  F = records[n]["F"]
  I = records[n]["I"]
  records[n]["L"] = L
  records[n]["G"] = CIR * I + CIR/SIR * (L-R)

records = {}
EatAt(records, 0, 60.0, 80.0)
Regress(records, 0, 130.0)

print(records[0])

Also, when my sugar is low, I would like to know how many carbs it takes to reach my target without injecting (ignoring basal for the moment):


I = F / CIR + (R-T) / SIR
0 = F / CIR + (R-T) / SIR
(T-R)/SIR = F/CIR
CIR/SIR * (T-R) = F

4/15 * (T-R) = F

So again, memorize CIR/SIR, which is 4/15 for me.  If I am 50 points under (ie: 120 is Target, and 70 is Reading), then I want 4/15 * 50 carbs.  4/15 is roughly one fourth, so I need 13 grams of carb.

4/15 * 50 is about 13

http://www.github.com/rfielding/BloodSugar

Tuesday, August 7, 2012

Co-Routine Refactoring


Say that you are given an arbitrary function in C code that typically runs in a blink of an eye and returns a value, and you need to be able to run it incrementally or partially for some reason.  And you want to do this without making large structural changes to the original function.  You could be implementing concurrency in a single thread, stepping through the function, implementing a generator, or making a generalization of a generator such as a symmetric co-routine.

int fun(int a, int b, int c)
{
  int d=0;
  int e=42;
  int f=2;
  int i=0;

A:

  foo(&d, &e, &f, &i);
B:
  for(i=0; i<10; i++)
  {
    int g;
C:
    bar(&d, &e, &f, &i);

    baz(&g);
D:
  }
E:
  buxx(&d, &e);
F:
}

So this is our original function.  We want to be able to stop and inspect it at the points that are labeled.  If we want to be able to exit at D and resume at D by just calling fun, then we would have to live with the restriction that foo can't depend on anything that gets modified in the loop.  Eventually, this restriction is likely to get violated if this is a complex function and the transformation will have to be done anyhow.

The first part of this transformation is to pack all local variables and parameters into a structure to completely save the state of where we are:

struct fun_task {
  int a,b,c,d,e,f,g,i;
};

The next step is to figure out a state machine that gives us the stopping points that we need.  We can inspect fun_task at any of those points, and it happens to contain every single piece of state in this function:

//A steppable version of our previous function, 
// where you can use the source of this function
// as documentation on how to drive the 
// lower level set of functions:
int fun(int a, int b, int c)
{
  int ret;
  int done=0;
  fun_task t;
  fun_init(a,b,c,&t);  //Up to A:
  do {
    //fun_loopBody decides what to 
    //do next based on its internal state.
    //it could be an FSM or more typical code, etc.
    done = fun_loopBody(&t); //C to D
  }while (!done);
  ret = fun_end(&t); //E to F
  return ret;
}

So, the single function fun is replaced with a bunch of functions that trace the life of the fun_task.  Where this gets interesting is when you add more parameters to the task functions.  If you add a parameter to loopBody called is_printable, and add debugging statements conditional on this parameter, then is_printable can vary per iteration rather than having to be the same after each iteration.  loopBody can *set* variables on each iteration as well.  So, this really is a co-routine with the ability to yield with parameters, resume with parameters, and to even let the caller inspect and manipulate state.  One tedious task that has to be dealt with because there is no language support for restoring context (ie: autos and parameters), is that these little functions reference the context like this:

  bool fun_loopBody(fun_task* t)
  {
    bar(&(t->d), &(t->e), &(t->f), &(t->i));
    baz(&(t->g));    
    int done = !((t->i) < 10);
    (t->i)++;
    return done;
  }
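
For completeness, here is a sketch of the other two pieces under the same t-> convention; returning t->e at the end is my own assumption, since the original listing never returned anything explicitly:

  //Copy the parameters in and run everything up to A/B (the call to foo).
  void fun_init(int a, int b, int c, fun_task* t)
  {
    t->a = a; t->b = b; t->c = c;
    t->d = 0;
    t->e = 42;
    t->f = 2;
    t->i = 0;
    foo(&(t->d), &(t->e), &(t->f), &(t->i));
  }

  //Run E to F (the call to buxx) and produce the final value.
  int fun_end(fun_task* t)
  {
    buxx(&(t->d), &(t->e));
    return t->e;  //assumption: fun's result is whatever e accumulated
  }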

There is no construct in C to pass in a parameter to bind a struct to the autos like this fictitious item:

bool fun_loopBody[fun_task]()
{
  bar(&d,&e,...);
  ...
}

So we end up doing it manually as above.  Also note that yet another approach to this would be to mark the function somehow to automatically generate the function prototypes for stepping through the task, so that the original code can continue to look like a simple function.  (That would mean that you would also have to mark it up with parameters for yielding and resuming as well at these points)

OOP Digression and State Machines

Notice that this is kind of similar to OOP, in the sense that all state for this task is created inside of a struct and used for the duration.  The fun_* methods would just be methods in that case.  But there is a huge difference in that the methods have a very definite legal grammar on their calling order, and this aspect of these methods is far more important than any of the other aspects of OOP.  In fact, the type system can be used to document this order by taking a different perspective on OOP.  The first thing would be to separate the idea of the type of the state object from the set of methods that are allowed.  Conceptually, the id of the object (its pointer) would stay the same, though the interface would change based on what state it's in.  Ie:

//FlushedFile can only read/write/close
FlushedFile f = File("/etc/passwd").open();

//We can't close until it's flushed 
//- notice that the type *changed*
UnflushedFile f = f.read(&buffer);
//f.close() isn't available until it becomes a flushed file

FlushedFile f = f.flush();
f.close();

So, this concept is called "Protocols" in languages that implement it, like Sing#.  In Erlang, message passing order is probably the most important thing to document, which is why they don't think too much about type safety beyond preserving the integrity of the run-time system.  Anyways, it either passes around state from object to object when implemented as OO, or changes available methods based on known state (or possible states known at compile-time) if that is supported.

A major problem with the OO way of thinking is that the type system generally only documents what the function signatures are.  It's even more messed up when the concept of 'messages' gets warped in a concurrent context (where the messages should behave like an incoming mailbox like Erlang where work happens inside the object in a way that looks single-threaded to the object, rather than executing in the caller's thread like C/Java do).  But in any case, the concept of messages is the most important thing to get right, and it's basically neglected.  Beyond constructors and destructors, it's quite a rare thing for the type system to provide good hints as to the intended calling (messaging) order.  But a well defined specification should at least do these things:
  • Document exactly what the object requires, generally through dependency injection where real dependencies are just on interfaces.
  • Document legal calling orders (messages), imposing enough constraints on the state machine that you don't end up having to support wrong usage that happened to work in earlier versions of the calling code, or writing too much documentation on how to do it right.
  • Try to have functions/methods per state so that the state machine is not so damn implicit.  When you look at a stack trace, the names of the functions on the stack should map obviously onto states.  The (nested) FSM that you would draw on the markerboard should bear some resemblance to the actual code.  This is a little more difficult to do in the absence of tail calls (where switches often emerge in their place), but it's important to do this where possible.
Back to C Functions and Refactoring In General

So, the concept of coroutines/tasks in C is basically just one of taking a function and breaking it into pieces by pulling all of its state (autos and parameters) into a struct.  It is more general than an object, since it is up to you to decide if you want to segregate the state into multiple pieces.  This is a more useful pattern than it may seem to be at first because of what happens when you try to refactor C code.

I judge the goodness of a language (and development environment) largely on how easy it is to turn crappy code into good code over a long period.  One major weakness when it comes to refactoring is that a lot of languages don't support multiple return values.  If most languages did, then function extraction would be trivial to do automatically in an IDE.  For example:

int foo(int x)
{
  int a=0;
  int b=0;
  int c=0;
A:
  a = 1 + b;
  b = 2 * c;
B:
}

Extract A to B as a function if you have multiple returns:

int foo(int x)
{
  int a;
  int b=0;
  int c=0;
  a, b = doAtoB(b,c);
}

With multiple return values, it is very clear that b, c are in-values, and a, b are out-values.  We didn't have to pass in a pointer to a and wonder if doAtoB uses the original value of a, which is uninitialized.  With only single-value returns, you would need to return structures (which are allocated in their own chunk of memory somewhere), or pass pointers, where it's not clear whether a value is read and written or just written.  Google's language Go actually uses this feature to lay down the law that the last value returned is an error code, and you are never to mix actual return values with error codes (ie: a comparator that returns a possibly negative integer, yet returns -1 when something went wrong, which can cause infinite loops in sorting algorithms by creating inconsistent orderings where a < b && b < a, etc.).

But the pattern of pulling all autos and parameters into a struct makes it very easy to break a giant function into a bunch of pieces without having to worry about changing how it behaves.  The alternative to this is to break into a bunch of smaller functions with a *lot* of parameters.  But the big downside to this is that these parameters can expose private types in the method signatures, or change the overall semantics of the function when it becomes unclear how to plug parameters back in when going from step to step.  In a big struct, some state may exist for a little longer than is strictly required, but pulling loop bodies into a single line function is now a very mechanical process.  Then perhaps with a little re-ordering of the code in the function, the parts can be given sensible names and put into functions that require no further commentary.


Sunday, July 1, 2012

Rates and Latency

This Google talk is about latency, and it is relevant to making music apps.


And here is something being done by John Carmack, where the issue is ensuring that a head-mounted display tracks the image closely.  The immersive feel is totally due to getting low latency so that when you move your head around, you don't detect lag in movement.



Carmack recently lamented in a tweet that typically, you can send a transatlantic ping faster than you can get a pixel to the screen.


Rate Hierarchies

So, there are some analogies here that are relevant.
  •  Render an Image Frame (width*height pixels) at 60fps.  60fps is the Screen Refresh Rate.  Frames are rendered on a VSync, which is a period at which you can draw to the screen without image tearing artifacts.  If you miss this deadline then you will get choppy video.  Color depth is generally going to be 8 bits per color, or 32-bits total for red,green,blue,and alpha.
  • Touches come in from the OS at some unknown rate that seems to be at around 200fps.  I will call this the Control Rate.  It is the rate at which we can detect that some control has changed.  It is important because people are very sensitive to audio latency when turning touch changes into audible sound changes.
  • Like the Image Frame rendered in a Screen Refresh, an Audio Frame is made of samples (an analog to pixels).  At 44.1kHz, a 5ms frame is about 220 samples (a 256-sample buffer is roughly 5.8ms).  5ms is a rate of about 200fps, if you think of sending a finished frame off to the audio API like an audio VSync.  If you miss this deadline, you will get audible popping noises; so you really cannot ever miss this deadline.  The bit depth is going to be either 16 or 32 bits, and the internal processing before final rendition is generally going to be in 32-bit floating point arithmetic.  And analogous to color channels, there will generally be a stereo rendition, which doubles the data output.
 With video, the framerate is about "smoothness" of the graphics.  Taking the analogy to audio, you can think of it in terms of frequency.  At 44.1khz, most audible frequencies can be represented.  Even though the Image Frame rate of 60fps is sufficient for "smoothness", it may not be sufficient to give a notion of instantaneous visual response.  It is unclear if there would be any benefit to ensuring that graphics run at the control rate.  Anyways, given the number of pixels rendered per-frame, a limit of 60fps does not sound unreasonable.

In the same way, audio frames could run near the screen refresh rate (a 1024-sample buffer is about 23ms), and the "smoothness" of the audio would not be an issue.  But taking 23ms to respond to control changes feels like an eternity for a music instrument.  The control rate really needs to be somewhere around 5ms.  Note that the amount of data emitted per Audio Frame is roughly the same as one line of the display frame in most cases, but we need to render them 4 to 8x as often.

CSound is a very old and well-known computer synthesizer package, and one of the first things you put into the file when you are working on it is the control rate versus the sample rate.  The sample rate (ie: 44100 Hz) is generally on the order of 100x to 200x the control rate (ie: 220 Hz), and the control rate is 4 to 8x the screen refresh rate (60 Hz).
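
The arithmetic behind that hierarchy is worth doing once explicitly.  A tiny sketch, using the numbers from this post (none of them are constants from any API):

#include <stdio.h>

int main(void) {
  const double sampleRate  = 44100.0; //audio samples per second
  const double controlRate = 200.0;   //roughly how often touch changes arrive
  const double screenRate  = 60.0;    //VSync

  //~220 samples have to be rendered between control changes,
  //which is also the ratio of sample rate to control rate.
  printf("samples per control period: %.1f\n", sampleRate / controlRate);
  printf("256-sample buffer: %.1f ms\n", 256.0 / sampleRate * 1000.0);          //~5.8 ms
  printf("1024-sample buffer: %.1f ms\n", 1024.0 / sampleRate * 1000.0);        //~23.2 ms
  printf("control rate vs screen refresh: %.1fx\n", controlRate / screenRate);  //~3.3x
  return 0;
}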

GPUs and Rates

So, with this hierarchy of rates in use, it is fortunate that the higher rates have less data to output.  A GPU is designed to do massive amounts of work during the period between VSync events.  Video textures are generally loaded into the card and used across many Image Frames, with the main thing being that at 60fps geometry must be fed in, and in response (width*height) pixels need to be extracted.

To use the GPU for audio, it's only 4 to 8x the rate (to match the control rate).  And because the sample rate is going to be on the order of 200x the control rate, then we expect to have to return merely hundreds of samples at the end of each audio frame.  This way of working makes sense if there is a lot of per-sample parallelism in the synthesis (as there is with Wavetable synthesis).

http://www.music.mcgill.ca/~gary/307/week4/wavetables.html

Similar to video textures, the wavetables could be loaded into the GPU early on during initialization.  At that point, making control changes and extracting new audio buffers is going to be generating all the traffic from the GPU to the rest of the system.  If there were an audio equivalent to a GPU, then the idea would be to set the sample rate for the Audio GPU, and simply let the card emit the samples to audio without a memory copy at the end of its operation.  Currently, I use vDSP, which provides some SIMD parallelism, but it certainly doesn't guarantee that it will run its vector operations across such a large sample buffer as 256 samples in one operation.  I believe that it will generally do 128 bits at a time and pipeline the instructions.

In fact, the CPU is optimized for single thread performance.  Latency is minimized for a single thread.  For GPUs, latency is kept low enough to guarantee that the frame will reliably be ready at 60fps, with as much processing as possible allowed at that time (ie: throughput on large batches).  An Audio optimized "GPU" may have a slightly different set of requirements that is somewhere in the middle.  It doesn't emit millions of pixels at the end of its completion.  But it could have large internal buffers for FFTs and echoes, and emit smaller buffers.  In fact, the faster it runs, the smaller the buffers it can afford to output (to increase the control rate).  In the ideal case, the buffer is size 1, and it does 44100 of these per second.  But realistically, the buffers will be from 64 to 1024 samples, depending on the application.  Music instruments really can't go above 256 samples without suffering.  At a small number of samples per buffer, the magic of this GPU would be in doing vector arithmetic on the buffers that it keeps for internal state so that it can quickly do a lot of arithmetic on the small amount of data coming in from control changes.  This would be for averaging together elements from wavetables.  The FFTs are not completely parallel, but they do benefit from vector arithmetic.  It's also the case that for wavetable synthesis, there is a lot of parallelism in the beginning of the audio processing chain, and non-parallel things come last in the chain; at the point at which the number of voices is irrelevant, and it's generally running effects on the final raw signal.  



Friday, June 22, 2012

Wavetable Synthesis and Parallelism

Cantor's sound engine is the first of my three apps where I am experimenting with parallelism and the complexities of handling audio aliasing (rather than taking a "don't do that" approach).  I never got into instrument making thinking I would get sucked into actual audio synthesis; but it's unavoidable because MIDI has far too many problems with respect to setup and consistency.  The fact that I insist upon fully working microtonality almost makes MIDI untenable.  I don't have the electrical engineering background to compete with the great sound engines in the store; I got into this because the playability of iOS instruments is totally laughable at the moment.    I think anybody that can write a simple iOS app can absolutely run circles around the current crop of piano skeuomorphisms that rule the tablet synthesizer world today; if you judge by the playability of the resulting instrument.  So on this basis, I'm not giving up because of the Moogs and Korgs of the world putting out apps of their own.  :-)  But for now... because MIDI still sucks for this use case, I still need an internal engine; most users will never get past the setup barrier that MIDI imposes.

So, one of the more surprising revelations to me when I started into this was that you can't avoid multi-sampling, even if you aren't making a sample-based instrument, and even if you only have one sound.  The idea that you can just speed up or slow down a digital sample to change its pitch is only half-true.  The same sample played back faster is effectively not the same wave shape as the original: to go up an octave you not only drop half the samples, you also need to cut out the frequency content that can no longer be accurately represented at the higher pitch, or it aliases.  Because this instrument doesn't have a discrete set of notes, and is inherently fretless, it also needs perfect continuity when bending from the lowest to the highest note.  So the function that defines a wave cycle not only takes phase into account; the maximum frequency at which you might play it back determines its shape.
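
To make that concrete, here is a sketch that builds one band-limited single-cycle table per octave additively, using a sawtooth as the example shape.  The table sizes, and indexing tables by the highest fundamental they serve, are illustrative assumptions, not Cantor's actual code:

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define TABLE_SIZE 1024
#define OCTAVES    8

//One band-limited single-cycle sawtooth per octave.  Table 'oct' serves
//fundamentals up to baseTopFreq * 2^oct, so it may only contain harmonics
//that stay under Nyquist at that worst-case pitch.
static float wavetable[OCTAVES][TABLE_SIZE];

static void buildTables(double sampleRate, double baseTopFreq) {
  for (int oct = 0; oct < OCTAVES; oct++) {
    double topFreq = baseTopFreq * pow(2.0, oct);
    int harmonics = (int)(sampleRate / 2.0 / topFreq); //highest partial below Nyquist
    if (harmonics < 1) harmonics = 1;
    for (int n = 0; n < TABLE_SIZE; n++) {
      double phase = 2.0 * M_PI * n / TABLE_SIZE;
      double s = 0.0;
      for (int h = 1; h <= harmonics; h++) {
        s += sin(h * phase) / h;  //sawtooth partials fall off as 1/h
      }
      wavetable[oct][n] = (float)(s * 2.0 / M_PI); //rough normalization
    }
  }
}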

And also, because I have a life outside of iOS instrument making, I am always looking for tie-ins to things that I need to know more generally to survive in the job market.  I am currently very interested in the various forms of parallel computing, specifically vector computing.  Ultimately, I am interested in OpenCL, SIMD, and GPU computing.  This matches up well, because with wavetable synthesis it is possible in theory to render every sample in the sample buffer in parallel for every voice (multiple voices per finger when chorusing is taken into account).

So with Cantor, I used Apple's Accelerate.Framework to make this part of the code parallel.  The results here have been great so far.  I have been watching to see if OpenCL will become public on iOS (no luck yet), but am preparing for that day.  The main question will be whether I will be able to use the GPU to render sound buffers at a reasonable rate (ie: 10ms?).  That would be like generating audio buffers at 100 fps.  The main thing to keep in mind is the control rate.  The control rate is the rate at which we see a change in an audio parameter like a voice turning on, and how long until that change is audible.  If a GPU were locked to the screen refresh rate and that rate was 60fps, then audio latency will be too high.  200fps would be good, but we are only going to output 256 samples at that rate.  The main thing is that we target a control rate, and ensure that the audio is fully rendered to cover the time in between control changes.  It might even be advantageous to render ahead (ie: 1024 samples) and invalidate samples in the case that a parameter changed, which would allow audio to be processed in larger batches most of the time without suffering latency.
Todo:

One thing I am not doing explicitly so far is handling interpolation between samples in these sample buffers.  I compensate by simply having large buffers for the single-cycle wave (1024 samples).  The single-cycle wave in the buffer is circular, so linear interpolation or a cubic spline between adjacent samples is plausible.  There is also the fact that as you go up an octave, the wave is oversampled by 2x because you will skip half the samples.  Because of this, a single sample represented in all the octaves could fit in a space that's about 2x the original sample size (n + n/2 + n/4 + n/8 + ...).
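
And here is a sketch of the read side for those same tables: circular linear interpolation at a fractional phase, plus picking the octave table that is safe for the voice's current pitch (again only an illustration, re-using names from the sketch above):

//Linearly interpolate the circular single-cycle table at a phase in [0,1).
static float readTable(const float *table, double phase01) {
  double pos = phase01 * TABLE_SIZE;
  int i0 = ((int)pos) & (TABLE_SIZE - 1);  //TABLE_SIZE is a power of two
  int i1 = (i0 + 1) & (TABLE_SIZE - 1);    //wrap: the cycle is circular
  double frac = pos - (int)pos;
  return (float)(table[i0] * (1.0 - frac) + table[i1] * frac);
}

//Accumulate one voice into out[], advancing a per-voice phase in cycles.
static void renderVoice(float *out, int frames, double *phase01,
                        double freq, double sampleRate, double baseTopFreq) {
  //Pick the most detailed table whose harmonic content is still safe at this pitch.
  int oct = 0;
  double top = baseTopFreq;
  while (oct < OCTAVES - 1 && freq > top) { top *= 2.0; oct++; }

  double inc = freq / sampleRate;  //phase increment per sample, in cycles
  for (int n = 0; n < frames; n++) {
    out[n] += readTable(wavetable[oct], *phase01);
    *phase01 += inc;
    if (*phase01 >= 1.0) *phase01 -= 1.0;
  }
}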

And finally, there is the issue of the control splines themselves.  Currently an ad-hoc ramp is applied to change the volume, pitch, and timbre for a voice.  It's probably better to think of it as a spline per voice.  The spline could be simple linear or (possibly) cubic, or the spline could limit changes to prevent impulsing (ie: more explicit volume ramping).
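
For the 'more explicit volume ramping' idea, the simplest version is a linear ramp across the buffer that covers one control period, so a control change never lands as a step.  A sketch:

//Scale an already-rendered buffer by amplitude, ramping linearly from the
//previous value to the new target so that a sudden change never clicks.
static void applyAmpRamp(float *buf, int frames, float *currentAmp, float targetAmp) {
  float step = (targetAmp - *currentAmp) / (float)frames;
  float a = *currentAmp;
  for (int n = 0; n < frames; n++) {
    buf[n] *= a;
    a += step;
  }
  *currentAmp = targetAmp;
}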

Post Processing:

The part that generates the voices is easily done in parallel, which is why I am thinking about GPU computation.  In an ideal world, I load up all of the samples into the card, and when fingers move (at roughly the control rate), the kernel is re-invoked with new parameters, and these kernels always complete by returning enough audio data to cover the time until the next possible control change (ie: 256 samples); or possibly returning more data than that (ie: to cover 2 or 3 chunks into the future that might get invalidated by future control changes)  if it doesn't increase the latency of running the kernel.

I am not sure if final steps like distortion and convolution reverb can be worked into this scheme, but it seems plausible.  If that can be done, then the entire audio engine can essentially be run in the GPU, and one of the good side effects of this would be that patches could be written as a bundle of shader code and audio samples, which would allow for third-party contributions or an app to generate the patches - because these won't need to be compiled into the app when it ships.  This is very much how Geo Synthesizer and SampleWiz work together right now, except all we can do is replace samples.  But the main question is one of whether I can get a satisfactory framerate when using the GPU.  I have been told that I won't beat what I am doing any time soon, but I will investigate it just because I have an interest in what is going on with CUDA, OpenCL anyhow.

OSC, MIDI, Control:

And of course, it would be a waste to make such an engine and only embed it into the synth.  It might be best to put an OSC interface on it to allow controller, synth, and patch editor to evolve separately.  But if OSC were used to eliminate the complexity of using MIDI (the protocol is the problem... the pipe is nice), then whether to use a mach port or a TCP port becomes a question, as does what kind of setup hassles and latency issues have now become unavoidable.

Thursday, May 17, 2012

"Cantor" for iOS

The app formerly known as "AlephOne" was submitted to Apple last night.  It may or may not get approved, as it is using the finger-area sensing that a few other apps are using; though it has been very difficult to get absolute clarity on whether finger-area sensing gotten through dynamic, but public, APIs is admissible.  It was once a very important part of the feel of Geo Synthesizer's playability, which suffered mightily when we pulled that feature out to be on the safe side (it wasn't my account that was at stake).  I now believe that it totally is admissible (as I have been informally told), and that most developers are just not trying it out of little more than fear.  I decided to just jump into the app mosh pit.  It is generally good advice to never release "Beta" software into the app store, but that assumes that you have unbounded time to work on apps and that you won't go work on something else in any case.  I never uttered the name "Cantor" and used AlephOne because when I had used the name "Pythagoras" for my last app, that name was taken by the time I tried to allocate a spot for it in the store.

The test builds are here:



It is being shipped somewhat early (if it was "done" then there would never be any updates!).  So, here is what is in Cantor:
  • Wavetable synthesis engine - but it demands that you span MIDI channels to use it internally.  Using wavetable synthesis allows me to address per-finger filtering and anti-aliasing issues.  There is not a lot of sound variety right now.  But that's why it's a simple $2 app for now.  I have some testers who think that the MIDI support alone is very valuable and will never use the internal engine, but there will always be people who will never try the MIDI.
  • Like Geo Synthesizer, it has octave auto, for very fast soloing.  This instrument is just as playable on the phone as it is on the iPad if you play it that way.  
  • MIDI with polyphonic bending (note ties, channel cycling, etc).  A lot of synths won't be able to deal with this.  But Kontakt4, Korg Karma, Nord (N2?), ThumbJam, SampleWiz, Arctic, etc all work with it to some degree because they at least do real multi-timbral pitch handling.  ThumbJam, SampleWiz, Arctic specifically recognize the note tie that I added custom to MIDI to work around legato and bend width issues in MIDI.  Generally, you create many adjacent channels (more than 4) with identical patches, and set their bend widths very high, to either 12 or 24 semitones.  It will channel hop around to get all the pitch bending right.  MIDI does not "just work".  It's horribly broken for almost every non-piano instrument scenario when it comes to pitch handling.  This works around that, but it's not pretty.
  • Looper - Audio copy doesn't work yet, but I have it half implemented.
  • Moveable frets to let you define your own scales, which includes exact pitch locations and the number of frets per octave.
  • Common microtonal scale shapes already set up:  Diatonic, Pentatonic, 12ET chromatic, Pseudo Maqam Bayati (ie: 12ET with a quartertone, plus that quartertone a fourth above and below - not quite the unwieldy full 24ET), Pythagorean, Just (Pythagorean with perfect maj/min thirds as a subset of 53ET), 19ET, 31ET, 53ET, or just fretless.  None of these are named; you simply pick a shape and use the circle of fifths to center it on one of the 12ET base notes.
As Geo Synthesizer was a rewrite of Mugician (that was Pythagoras for a while), Cantor is a rewrite of Geo Synthesizer (and was called AlephOne for a while).  The videos of it in use are here:



This app has nothing at all to do with beat making, sequencing, DAWs, or iPad-only workflows.  I'm a microtonal metalhead.  It's a pocketable instrument that's as playable as a guitar or a piano for a lot of uses.  It should be reliable and simple, for you to plug into an effects processor or a computer *like* a guitar or hardware synth.  I tried initially to ship this without any audio engine at all, as a pure MIDI controller.  But the current state of MIDI with respect to continuous pitch instruments is still bad.  I only implemented an internal engine to demonstrate a full exercise of the playability parameters.  For example: this is ThumbJam in Omni mode on the JR Zendrix patch (it's different from most Omni-modes on most synths... it's done correctly with respect to pitch handling):


If you write a MIDI synth and have issues getting Cantor to control it with full expression, then send me an email.

Please keep your money if you will only give it a low rating based on a missing feature.   See the videos and see what it does; and pass on buying it if you think it needs to do a lot more.  Generally, I cannot respond to feature requests via low ratings like "2 stars when X is added".  I wrote this for my own needs, in my spare time, to solve my own problem of needing a microtonal instrument, and don't sell enough to service any kind of hassles beyond that.  But please let me know if you use it for real and something is actually broken, or about things that are actually impossible to do correctly instead.  Specifically: I won't add in uneven tunings like guitar because they don't solve a problem that I have, and I cannot add in a mini multi-track on account of reliability and memory consumption (though I might get export of current loop to clipboard working in an update), and I won't spend a lot of time on the sound engine when MIDI is available.  If you need these things, then what you are saying is that MIDI itself is broken (ie: in all apps), which I can't spend more time on, (ie: other apps can already record MIDI output, and MIDI is recognized by background synths - but interpreting MIDI gives sonically unreliable results, etc.). 

Why "Cantor"?

Georg Cantor was a mathematician who essentially went insane contemplating infinity, specifically, contemplating the continuum.  Though history regards him well, he was regarded as a crank by a lot of his peers:


Coincidentally, a "Cantor" is kind of like the lead musician in a church's music setup; an interesting meaning when thinking of an instrument designed to deal with intonation issues that weighed heavily on the church organ builders in the time before they all gave up on correct sound (with respect to numerology and aligning music with physical laws) and went to tempered instruments.

Monday, April 23, 2012

An Ideal MIDI-Compatible Protocol

A Torture Test For Correct Pitch Handling

The MIDI protocol is designed as a keyboard-extender. The actual pitches emitted aren't strictly specified, as it is supposed to faithfully capture the gestures coming out of a hardware keyboard. Under this model, instruments are a set of discrete keys with an initial velocity, and one pitch wheel per instrument. This means that any instrument that wants to send MIDI should fundamentally *be* a set of discrete keys and 1 pitch wheel per instrument.

But any instrument that fits that description can't do correct pitch handling, where correct pitch handling means being able to track simultaneous timelines of (pitch, volume, phase). In other words, you can't model a fretless instrument with it. We can put together multiple channels to come up with the best approximation that is backwards compatible with MIDI, though. But fair warning: the client generating fretless MIDI is VERY complicated at the level of MIDI bytes. It should be done through a common API that is used by many people. Fortunately, it doesn't make the synth (the server) any more complicated. In fact, it simplifies a lot of things in the server by allowing things like explicitly enumerated modes and tuning tables to simply be gotten rid of.

If you think of the microtonal cases as something bizarre that won't be used in practice, then ignore these extreme cases. But realize that you can't properly model any string instrument unless you can model fretlessness correctly; if just for the reason that guitars can and do emit simultaneous pitches that are not exactly some number of semitones apart when bending notes and chording at the same time.

Touch Screens And Fretlessness

Background (pure MIDI engine vs ThumbJam with MIDI chorusing):



When using touch screens, fretlessness and per-finger bending are readily available to us. Everything is continuous in reality, and the notion of discrete keys being turned on and off is a major annoyance. In this space, we think more of a finger that goes down and can move to any volume, pitch, or expression value over its lifetime. Especially when trying to model any kind of string instrument, this sort of polyphonic fretlessness is a basic requirement for faithfully modeling realistic behavior. In any case, touchscreen instruments can trivially create this situation, and MIDI is quite terrible at modelling it.

  1. Every finger down can have its own completely arbitrary pitch
  2. Any pitch can bend to any other pitch over its lifetime, including from MIDI 0 to MIDI 127
  3. All pitch bending, no matter how wide happens at the fullest pitch resolution
  4. A note can attack at any pitch, and therefore needs its own pitch wheel
  5. A note can release at any pitch, and therefore needs its pitch wheel to not be disturbed until release completes
  6. Due to 5, a note's channel is still reserved for some time after the note has been turned off.
  7. Due to 1, every note is in its own channel, of which there are only 16.
  8. Due to 7, we are limited to 16 polyphony, including releasing notes that are actually turned off already.
  9. Because of 4, 5 and 8, there is a speed limit beyond which we cannot play without pitch problems. We will have to simply steal the channel of the note that has been dead the longest, and hope that the releasing tail of that note is inaudible as we alter its pitch wheel for the new note.
  10. Because of 2, we can bend beyond the maximum pitch wheel depth. If we do so, we have no choice but to stop the note (at its current arbitrary pitch - and therefore release at this pitch) and restart it (on a new channel), where we can place a note-tie for synths to understand that these are not two separate notes; that the note is being *migrated* to a new name and a new channel.
  11. The effect of a note tie is to move the state of the note to a new channel and note, skipping the attack phase of the note. This means that the note tie can be used to legato notes together. A straight-up note-tie in support of a wide bend moves a note from one note/channel to another with 0 cents difference in pitch, whereas legatoing two notes does the same thing with some larger number of cents difference in the transition.
  12. Because of 3, we can't simply support absurd pitch bend sizes like +/- 48 semitones. We need bending to be at full resolution.
* My item 12 is debatable.  Conversations with the maintainer of the MIDI spec suggest that very large bends are not a problem in themselves, though many existing synths have a maximum of 12 or 24 semitones, if they allow the bend width to change at all.  As an example, using MIDI note 64 as a starting point with a +/-60 semitone bend gives a span of 120 MIDI notes at better than the 1-cent minimum resolution that microtonalists talk about (8192 steps up and 8192 steps down).  You can set a non-octave bend width of +/-64 semitones to ensure that you can hit MIDI notes 0 and 127 as well.  So the idea of simply ensuring that very large bends are supported is a good one.  There is still the issue of sending one MIDI stream to multiple devices and having to live with the minimum supported bend width, or possibly setting +/-1 semitone bends (half the default) so that pianos simply render chromatic passages.  Note ties do more than handle the exception where bend width is exceeded; you may want them for other reasons, but the complexity may not be worth it if you have very wide bends.  (See the resolution arithmetic sketched below.)
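To make the resolution claim concrete, here is a minimal C sketch (not part of Fretless.c) that computes the pitch resolution, in cents per bend step, for a few bend widths; the 14-bit pitch wheel gives 8192 steps in each direction, so the resolution is just semis * 100 / 8192 cents per step:

#include <stdio.h>

/* Cents of pitch change per single 14-bit pitch wheel step,
   for a bend range of +/- semis semitones. */
static double centsPerBendStep(double semis)
{
    return semis * 100.0 / 8192.0;
}

int main(void)
{
    double widths[] = { 1, 2, 12, 24, 48, 60, 64 };
    for (int i = 0; i < 7; i++) {
        printf("+/-%2.0f semitones -> %.5f cents per step\n",
               widths[i], centsPerBendStep(widths[i]));
    }
    /* Even at +/-64 semitones, one step is about 0.78 cents,
       which is still finer than the 1-cent target. */
    return 0;
}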


Pitches To MIDI And Backwards Compatibility
Warning: Take the source code here as what I really mean, as it has been pretty well vetted to work in practice, and should have fewer mistakes than this document, which is being written off the top of my head:

https://github.com/rfielding/DSPCompiler/blob/master/Fretless.c



As long as there is no definite pitch assigned to MIDI notes, the setup will be too involved for this kind of instrument. Ultimately, the notion of setting bend width and how many channels to span over has to go away. Under iOS, every instrument can supply its own virtual MIDI device anyway (with 16 channels to itself).
  • c0 is the lowest note in MIDI, with note value zero.
  • MIDI note n is defined as: mNote(n:float):float = c0 * 2^(n/12)
  • If MIDI note 33 is 440Hz, then: 440Hz = c0 * 2^(33/12)
  • Integer values of n cause mNote(n) to return the pitches of chromatic notes
  • bend 8192 is the center value, representing zero cents of adjustment
  • semis is the number of semitones up or down; it is 2 by default.
static void Fretless_fnoteToNoteBendPair( 
    struct Fretless_context* ctxp, 
    float fnote,
    int* notep,
    int* bendp)
{
    //Find the closest 12ET note
    *notep = (int)(fnote+0.5);
    //Compute the bend in terms of -1.0 to 1.0 range
    float floatBend = (fnote - *notep);
    *bendp = (BENDCENTER + floatBend*BENDCENTER/ctxp->channelBendSemis);
}

This example code converts a floating-point "MIDI" note into an integer note value and an integer bend value.
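To make that concrete, here is a hypothetical standalone version with a worked example, assuming BENDCENTER is the 14-bit center value 8192 and a default bend width of 2 semitones (both assumptions, since neither value is shown above):

#include <stdio.h>

#define BENDCENTER 8192  /* assumed 14-bit pitch wheel center */

/* Standalone version of the conversion above, with the bend width
   passed in directly instead of read from a context struct. */
static void fnoteToNoteBendPair(float fnote, float channelBendSemis,
                                int* notep, int* bendp)
{
    *notep = (int)(fnote + 0.5f);                  /* closest 12ET note  */
    float floatBend = fnote - *notep;              /* -0.5 .. +0.5 semis */
    *bendp = (int)(BENDCENTER + floatBend * BENDCENTER / channelBendSemis);
}

int main(void)
{
    int note, bend;
    fnoteToNoteBendPair(39.5f, 2.0f, &note, &bend);
    /* Expect: note 40, bend 6144 (center 8192 minus a quarter of the wheel,
       ie: 50 cents flat of note 40 with a +/-2 semitone bend width). */
    printf("fnote 39.5 -> note %d, bend %d\n", note, bend);
    return 0;
}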

MultiTimbral synths that can set bend width

We include channelBendSemis as an input to gracefully handle synths that don't understand note ties. If note ties are not understood, then at least we can set the bend width high to minimize breaks in the note bending (because it's harder to exceed the bend width). To get this level of backwards compatibility, it is generally sufficient to set a number of channels to the exact same patch that matches maximum required polyphony (ie: channels 1,2,3,4) and set bend width to +/-12 semitones.
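If the synth lets the controller set the bend width, the standard way is the pitch bend sensitivity RPN (RPN 0,0). A minimal sketch, assuming the same kind of midiPutch byte sink used in Fretless.c:

/* Set pitch bend sensitivity (RPN 0,0) to +/- semis semitones on one channel.
   midiPutch is assumed to be a function that emits one MIDI byte. */
static void setBendWidth(void (*midiPutch)(int), int channel, int semis)
{
    midiPutch(0xB0 + channel); midiPutch(0x65); midiPutch(0x00);  /* RPN MSB   */
    midiPutch(0xB0 + channel); midiPutch(0x64); midiPutch(0x00);  /* RPN LSB   */
    midiPutch(0xB0 + channel); midiPutch(0x06); midiPutch(semis); /* semitones */
    midiPutch(0xB0 + channel); midiPutch(0x26); midiPutch(0x00);  /* cents     */
}

/* e.g. span channels 0..3 with the same patch and a +/-12 semitone bend:
   for (int ch = 0; ch < 4; ch++) setBendWidth(putch, ch, 12); */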

Mixing fretless synths with pianos

The other issue is that of playing a fretless voice with piano in the background. Either two completely different MIDI streams would need to be sent to each device (ie: fretless MIDI for the violin, and a chromatic rendition that inserts every chromatic note as notes bend for the piano), or the bend width should be set to +/-1 semitone and have only the violin respect bends and note ties (and channeling for that matter).

The Biggest Weakness Of MIDI

Converting actual pitches to MIDI

When dealing with a device that takes a polyphonic signal (or a set of mono signals) and converts it to MIDI, there is the distinct possibility that the original audio is to be mixed in with the MIDI. In my opinion, this is the biggest weakness of not taking on a frequency orientation for MIDI. The output of the speakers will resonate with a guitar body that's generating MIDI and feed back in...think about the consequences of that! If a guitarist tunes a quartertone flat to match some un-tuneable instrument in the ensemble and runs through a device to generate MIDI that matches the guitar pitches, then there is no reason for any of the MIDI pitches to ever sound out of tune with the real audio, under any circumstances. There is no good reason to make him do some kind of setup to make that happen either.

In the same way, if an a cappella singer is some number of cents off from standard tuning (perhaps due to a resonance in the environment, or because she is singing in Just Intonation), then MIDI should LEAVE IT ALONE! and render bytes that will create those exact pitches on any standard MIDI device - because the original pitch isn't wrong. Auto-tuning isn't up to the protocol, and it isn't up to the synth. You can write a controller that rounds off pitches or sends a re-tuned variant of the original audio if that's what you want.

And it can still be backwards compatible with existing MIDI if it can cycle over channels and tie notes together.

How Note Ties Work

Note ties are something that I added to Geo Synth and AlephOne in an attempt to fix this problem. Because we need to be backwards compatible without creating wrong pitches or stuck notes, we have to be valid MIDI for everything which synths already know, and allow synths that don't understand note ties to ignore them without major problems in the sound.

#Turn on a note with a 0% bend and turn it off
0xE1 0x40 0x00 #bend to center position
0x91 0x21 0x7F #turn note on
0xE1 0x7F 0x7F #bend +100%
0x91 0x21 0x00 #turn note off

If we want to continue the bend, we need a new note. Because any note needs to be able to attack and release at any pitch, it must be on its own channel when it does so, so there is a channel transition just as there would be in any other case:

#Turn on a note with a 0% bend and turn it off
0xE1 0x40 0x00 #bend to center position
0x91 0x21 0x7F #turn note on
0xE1 0x7F 0x7F #bend +100%
0x91 0x21 0x00 #turn note off
0xE2 0x40 0x00 #same pitch as note releasing on channel 1 (notice its different bend)
0x92 0x23 0x7F #continue note at new channel

So, any multi-timbral synth will recognize this bend in some way. If there is no audible attack phase for the patch, it seems that we are done; it already sounds correct. But if there is an audibly different attack phase, then we can hear the note break before it transitions to the new value (which is what usually happens). So we need to put an NRPN into the stream to warn the MIDI device to expect a note off that is actually a legato into the next note; on a different channel, even. It basically just sends the number "1223" as an NRPN to warn the MIDI engine about what is coming. It is this code here:

void Fretless_noteTie( 
    struct Fretless_context* ctxp,
    struct Fretless_fingerState* fsPtr)
{
    int lsb;
    int msb;
    Fretless_numTo7BitNums(1223,&lsb,&msb);
    int channel = fsPtr->channel;
    int note = fsPtr->note;
    //Coarse parm
    ctxp->midiPutch(0xB0 + channel);
    ctxp->midiPutch(0x63);
    ctxp->midiPutch(msb);
    //Fine parm
    ctxp->midiPutch(0xB0 + channel);
    ctxp->midiPutch(0x62);
    ctxp->midiPutch(lsb);
    //Val parm
    ctxp->midiPutch(0xB0 + channel);
    ctxp->midiPutch(0x06);
    ctxp->midiPutch(note);
    ///* I am told that the reset is bad for some synths
    /*
    ctxp->midiPutch(0xB0 + channel);
    ctxp->midiPutch(0x63);
    ctxp->midiPutch(0x7f);
    ctxp->midiPutch(0xB0 + channel);
    ctxp->midiPutch(0x62);
    ctxp->midiPutch(0x7f);
    */
    //*/
}




When this sequence is seen, the sound engine will simply *remember* which note is being turned off without actually doing it. Then, when the next note-on is given, it transfers the current phase, pitch, and volume to the new note; and the note will have to legato over to its new pitch and volume values as fast as it can.

#Turn on a note with a 0% bend and turn it off
# we have no idea what the future is once we turn note on...
0xE1 0x40 0x00 #bend to center position
0x91 0x21 0x7F #turn note on
0xE1 0x7F 0x7F #bend +100% #a surprise event when the finger bent up really high
0xB2 0x63 msb(1223) #Note tie warning (NRPN coarse parameter)
0xB2 0x62 lsb(1223) #Note tie warning (NRPN fine parameter)
0xB2 0x06 0x21 #NRPN value: the note being tied
0x91 0x21 0x00 #turn note off
0xE2 0x40 0x00 #same pitch as note releasing on channel 1 (notice its different bend)
0x92 0x23 0x7F #continue note at new channel
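On the receiving side, here is a rough sketch of the little bit of state a synth needs to honor a note tie. The names are hypothetical, and it uses the one-note-per-channel simplification I describe below:

/* Hypothetical sketch of synth-side note-tie handling. */
typedef struct {
    int   active;    /* between note-on and end of release         */
    float phase;     /* current oscillator phase                   */
    float pitch;     /* current sounding pitch, as float MIDI note */
    float volume;    /* current sounding volume                    */
} Voice;

static Voice voices[16];          /* one voice per channel (simplification) */
static int   tiePending = 0;      /* set when NRPN 1223 arrives             */
static int   tieSourceChannel = -1;

static void onNoteTieNRPN(void)   /* the NRPN value names the tied note; */
{                                 /* with one note per channel, the flag */
    tiePending = 1;               /* plus the channel is enough          */
}

static void onNoteOff(int channel)
{
    if (tiePending) {
        tieSourceChannel = channel;   /* remember it, do NOT release the voice */
    } else {
        voices[channel].active = 0;   /* normal release */
    }
}

static void onNoteOn(int channel, float pitch, float volume)
{
    if (tiePending && tieSourceChannel >= 0) {
        /* migrate phase/pitch/volume; skip the attack,
           then ramp toward the new pitch and volume */
        voices[channel] = voices[tieSourceChannel];
        voices[tieSourceChannel].active = 0;
        tiePending = 0;
        tieSourceChannel = -1;
    } else {
        voices[channel].active = 1;   /* fresh attack */
        voices[channel].pitch  = pitch;
        voices[channel].volume = volume;
        voices[channel].phase  = 0;
    }
}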

The Lifecycle Of A Finger

I hide the actual channels and notes of MIDI behind an API for many reasons, the most important of which is that a (note, channel) pair is not a good 'primary key' for a finger. If you bend a note past its maximum bend width, then you certainly have to rewrite the note number, and also the channel, because of my choice to hop channels even on note-off to cover release time. So, an abstraction over MIDI is more like this:

beginOn finger0
express finger0 11 127
express finger0 42 127
...
endOn finger0 pitch vol polyGroup0
...
move finger0 pitch vol
...
express finger0 43 127
...
move finger0 pitch vol
...
off finger0

On a touchscreen, the only stable identifiers over the course of a gesture are finger numbers. The actual note and channel are a MIDI concept that we hide.

When we intend to turn a note on, we just do enough work to allocate a MIDI channel for it. Once we have a MIDI channel and know the starting values for the various per-channel CC expression parameters, we can send them - BEFORE the note is turned on. Then, when we have the pitch and are about to turn the note on, we send the bend value that will be required to make the right pitch - again, BEFORE the note is turned on. The channel pressure's initial value is also sent before the note turns on (in my case it's the same as volume). Then the note is finally turned on. As the note moves around in pitch and volume, it can easily hop around channels many times because of exceeding max bend width. Then finally the note is turned off by finger. The finger knows which channel and note were used. This design makes it very hard to have mistakes that create stuck notes.
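As a concrete sketch of that ordering for a single finger (the channel index, note, and values here are hypothetical, and CC 11 is just the one expression CC I happen to use):

/* Hypothetical order of messages for one finger landing on channel index 3,
   assuming the same kind of midiPutch byte sink as in Fretless.c. */
static void exampleNoteOnOrdering(void (*midiPutch)(int))
{
    /* 1. per-channel expression CC, BEFORE the note on */
    midiPutch(0xB3); midiPutch(0x0B); midiPutch(0x60);  /* CC 11 = 96         */
    /* 2. pitch wheel, so the note attacks at the right pitch */
    midiPutch(0xE3); midiPutch(0x00); midiPutch(0x44);  /* slightly above center */
    /* 3. initial channel pressure (in my case the same as volume) */
    midiPutch(0xD3); midiPutch(0x60);
    /* 4. finally, the note on */
    midiPutch(0x93); midiPutch(0x28); midiPutch(0x60);  /* note 40, vel 96    */
}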

And of course, part of that lifecycle not shown is that a previous finger could be down that it could legato to. In which case, the note off part of the lifecycle simply transfers note state onto the next finger.

Synth Engine's State Machine

The synth engine is actually really simple under this scenario. The synth doesn't have to know anything about the various polyphony modes or whether to legato a note, or to track much. There is very little state to track.

  • Since we don't have the value for the finger (it's lost as a (note,channel) combination), we keep track of each (note,channel) that is somewhere between turning on and still releasing. If you want to simplify the engine further, you can simply keep track of channel and insist that there is *never* more than one note per channel. It isn't realistic with other controllers feeding input, but will work for your own controller; and it will dramatically simplify the engine.
  • For each finger (tracked as (note,channel)), keep track of the note phase, pitch, and volume. Note that these are *exactly* the main things that we want to have arbitrary control of over the life of a voice.
  • There is one pitch wheel per channel, and CC values are remembered per channel.
  • If we get a warning that a note tie is about to happen, then the next note off determines the source and next note on determines the destination. Once we have source and destination, we transfer all note state from source to destination. This includes aftertouch, CC values, current phase, current volume, and current pitch. There should be no audible change in voice characteristics, as we have basically only renamed the note. (Another approach could be that a voice independent of channel could have been used and the channel is simply re-assigned to that voice.) The new pitch and volume implied in the note on is something that we begin to ramp towards. What is most important of all is that the phase of the note matches as it is moved to the new channel. Other characteristics are kept the same to prevent impulsing.
  • We interpret pitch bending according to our current bend setting (a sketch of this interpretation follows after this list).
  • Note on and off all behave as normal.
  • Because polyphony was done in the controller, we don't need any rules for solo versus poly mode. There is a notion of polyphony groups in the C API that generates the MIDI messages, but it shows up as nothing more than notes turning on and off (possibly tied together) at the synthesizer end. We don't try to take advantage of note overlaps; the controller already figured this part out for us.
  • Because legato was done in the controller, we simply play notes with the attack unless it is a note on that's part of a note-tie. Similarly, we don't try to use note overlaps; the controller already told us exactly what to do. We only legato if the note has been tied to another.
  • Using CC per channel works fine, because if each finger is in its own channel then it's also per-voice by coincidence.
The internal engine is very simple. In my implementation it's just one function with a few parameters, and very little state being maintained in the engine. What state there is exists entirely in a per-voice way. (I simplified my own engine by only handling one note per channel, and requiring polyphony to span channels.) So, the complexity created by fixing these pitch problems lives only in the controller. It actually simplifies the synth.
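For example, interpreting the pitch wheel according to the current bend setting is just the inverse of the controller-side conversion shown earlier (a sketch, with the center value 8192 assumed):

/* Sketch: the synth-side inverse of Fretless_fnoteToNoteBendPair.
   Recovers the fractional MIDI note from (note, 14-bit bend, bend width). */
static float noteBendPairToFnote(int note, int bend, float channelBendSemis)
{
    return note + (bend - 8192) * channelBendSemis / 8192.0f;
}
/* e.g. noteBendPairToFnote(40, 6144, 2.0f) == 39.5 */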



This is my internal engine, which is driven by the MIDI messaging.  I would have greatly preferred to ship a controller with no internal engine at all though.  The whole point of MIDI is co-existence.  I could free up all of the memory, get rid of all of the extra code, not get rated on the quality of the sound engine (which should be the job of a dedicated synth), and never get requests for sound tweaks.  The engine can literally add months of development to a controller that was long finished from the standpoint of MIDI.  The fact that every iOS app has its own internal engine in spite of MIDI support suggests that MIDI is too hard to setup and the results of connecting a controller to a synth are too unpredictable to leave it up to whatever combination the user picked.

* Note to self * - Maybe we can allow 'legato' without any tied note, to let us start off in the sustain phase of a note ramping up from zero volume without an attack. This could happen to a string that starts to resonate without being plucked, as happens with sympathetic strings. Perhaps a note could be started with the lowest possible volume (unfortunately, it can't be exactly zero because the protocol treats vol 0 as note off!) or started with a later-than-normal phase. In the same way that I am using MIDI to do manual chorusing in places, it would be useful to do sympathetic strings at the controller as well, rather than some sitar post-processing effect. But note also how easily we could get far beyond 16-voice polyphony like that.

A Tetrachord Example

A backwards-compatible re-interpretation of existing MIDI messages can give us what we need. I will use a basic tetrachord (Ajam Bayati) as an example of how these problems arise, and how to model them as MIDI messages. This tetrachord consists of a fundamental, a fourth, a minor third, and a note that falls in the middle between the root and the minor third. In this notation, we have the notes D, E-quarterflat, F, G. But this notation is just an approximation of the real intonation. It is likely that this is the real intonation that will be played:

  • If D is taken to be the fundamental, then its pitch ratio is 1/1
  • G is a pitch ratio of 4/3 with respect to D
  • F may be the perfect minor third 6/5 with respect to D
  • E-quarterflat has a few plausible choices, 13/12 is a plausible one, with respect to D
When changing modes (ex: phrase moves up by 4/3 or 3/2), the whole tetrachord is played in these ratios relative to the root. So, the exact pitches move around based on root notes. Scales are a fiction that just don't exist in this system. Any attempt to remap the 12 notes to keys that turn on and off will result in some kind of failure to do what is required because notes continually move around to fit the context.

If we take the simplest approximation that D is midi note 38.0, E-quarterflat is 39.5, F is 41, and G is 43, then the note 39.5 must be on its own channel to work at all. But what's even worse is that it doesn't really matter what these notes start as. Their future lifetimes are unknown, and they can all bend off in independent directions. Every note must be on its own channel always. Thus, we have 16 polyphony in MIDI if we want fretlessness. This is a reasonable limitation for 1 instrument. We can have multiple MIDI devices for multiple instruments on iOS.
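Just to illustrate where such fractional notes come from, here is a small sketch that converts the just ratios listed above into fractional MIDI notes relative to D = 38.0 (the numbers are for illustration only, not from the Fretless code):

#include <math.h>
#include <stdio.h>

/* Convert a just-intonation ratio above a root (given as a float MIDI note)
   into a fractional MIDI note: root + 12*log2(ratio). */
static double justNote(double rootFnote, double num, double den)
{
    return rootFnote + 12.0 * log2(num / den);
}

int main(void)
{
    double d = 38.0;
    printf("D             = %.3f\n", justNote(d, 1, 1));    /* 38.000  */
    printf("E-quarterflat = %.3f\n", justNote(d, 13, 12));  /* ~39.386 */
    printf("F             = %.3f\n", justNote(d, 6, 5));    /* ~41.156 */
    printf("G             = %.3f\n", justNote(d, 4, 3));    /* ~42.980 */
    return 0;
}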

And, NO, tuning tables don't help here. Only fretlessness really does the job. It is easy to intermix straight 12ET playing with quartertones, and at some points to adjust pitches to Just Intonation (so that the quartertones don't cause horrible clashing, etc.). This always happens when there is an ensemble of fretless instruments in which the singer is doing Just Intonation but fretted instruments like pianos and bass guitars have been mixed in. The band often copes by relegating the 12ET guys to pentatonic lines, or adjusting their pitches to hit Just Intervals versus the 12ET notes that are technically out of tune. It is actually better to think of a series of approximations, where it takes about 53 notes per octave to accurately locate the various third and fifth related intervals, of which the 12 tone pitches are just approximations, etc. In the fretless world, people that know what they are doing will move these pitches around at will. You can't just remap keys because the number of possible keys will change. If you insist on frets, you still need a moveable fret system where the number of frets can be changed, and the whole fretboard can be re-fretted at run-time. As a result, MIDI should just stay out of the music theory mess and just play the required pitches. Scales and pitch snapping are the controller's job. MIDI interfering here is no different than trying to impose diatonic scales onto unknown musical input.

The pitch wheel positions

The image to the right is a graphical representation of the pitch wheels. Straight up is MIDI channel 1, marked by the gold triangle on the top. Going around the circle, there is a gold triangle tick on channel 16 to show where the channel span ends. The blue triangles going in from the middle radius represent how far down that channel's pitch wheel is bent. The red triangles going out show a sharp pitch bend. The green rays sticking out show the channels that still have a note on in them. So, in this picture, we see six notes down in total. The pitch wheels are all over the place at slightly different values, because in this picture we are creating a chorusing effect by doubling each note as two MIDI notes slightly bent out of tune with each other. When we run our fingers up and down the screen, exceeding the whole-tone limit (what the bend happens to be set to), a single note will hop around from channel to channel, leaving each pitch wheel in the position it was in when its note was turned off. We cycle channels clockwise looking for the least-used channels (hopefully one with zero notes active) and pick a channel for our new note that way. As we play fretlessly, all of the channels get left in unusual positions as play goes on. If I intentionally played everything about a quartertone high, most of these pitch wheels would be sticking out with a red triangle about 25% of the way towards the larger radius, to make all notes a quartertone sharp. It would match exactly what my internal audio engine is producing.

The idea that a note starts with bend zero and is only bent by the user later is a piano-ism that injects an unwarranted music theory constraint into a protocol that's supposed to be setting the pitches that I asked for.

Miscellaneous

Because the MIDI protocol at this level is considered as a fretless protocol, any fretting rules are outside the scope of this layer of the API. However, because it's fretless you can simply round off pitches in the controller to get back to strict 12ET pitches, or more subtly handle rules for:

Portamento

Because we are not constrained to moving between discrete pitches, and have complete pitch freedom, we want to dispense with the concept of portamento. Portamento is a discrete-key concept, and it produces wrong pitches. If you start from MIDI note 0 and increase towards MIDI note 127 at a rate of 1 MIDI note per second, you will fall on an integer pitch value once every second (on pitch and on time) and the transition between notes will be smooth. More importantly, the pitch will represent where your finger actually *is*. It won't lag behind the way portamento does, providing an arbitrary ramp between discrete notes that is not a perfectly smooth bend overall. This is really only doable in practice on a continuous-surface instrument.

This is what I mean when I say that "Portamento is wrong". It's a discrete key concept that is not useful on string instruments. String instruments track where your finger actually *is*. In fact, at the controller there are three things to track with respect to pitch:

  • Where your finger actually is located (ie: pitch 33.001)
  • Where your pitch is tuning to (ie: pitch 33.0)
  • Where your pitch actually is, usually somewhere in between these two (ie: pitch 33.0005)
You need all of these pieces of information in the controller, because you send the last item, but use the first two items to render the user interface.
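A minimal sketch of tracking those three values per finger (the names and the snap behavior are hypothetical; real snapping rules can be fancier):

/* Hypothetical per-finger pitch tracking in the controller. */
typedef struct {
    float fingerFnote;   /* where the finger actually is, e.g. 33.001 */
    float targetFnote;   /* where the pitch is tuning to,  e.g. 33.0  */
    float sentFnote;     /* what actually gets sent, in between       */
} FingerPitch;

/* Called per control tick; snapAmount in [0,1] blends toward the target. */
static void updateFingerPitch(FingerPitch* fp, float rawFnote, float snapAmount)
{
    fp->fingerFnote = rawFnote;
    fp->targetFnote = (float)(int)(rawFnote + 0.5f);   /* nearest 12ET note */
    /* the sent pitch sits between the raw finger pitch and the snap target */
    fp->sentFnote = fp->fingerFnote
                  + (fp->targetFnote - fp->fingerFnote) * snapAmount;
    /* sentFnote is what gets converted to (note, bend) and sent;
       fingerFnote and targetFnote drive the user interface rendering */
}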

Legato

The concept of when to play the attack part of a note is separate from the rules having to do with whether notes are turned on and off for mono/poly play. So legato only refers to whether the attack portion of a note is played. In our API, we have a notion of polyphony groups, which are a sort of pseudo-channel. The first note down in a polyphony group will definitely play the attack phase, and further notes down in the same polyphony group will do a legato continuation of the note. All notes in a poly group behave as a solo mode. Note that every note still goes into its own MIDI channel; poly groups only specify how these notes get grouped if the note-tie is a recognized message.

Polyphony

Polyphony rules include the standard solo mode (mono) that would be expected, and the full polyphony mode (poly) that is also expected. But in the spirit of doing away with enumerating special cases (which MIDI does a lot of), every note can be put into one of 16 polyphony groups. If every note is put into a different polyphony group, then we get "full polyphony". If they are all placed into the same group (ie: group 0), then it becomes solo mode.

If notes are grouped according to guitar strings then we have something in between. When multiple notes are chorded within the same poly group, they act like isolated solo modes. But since each string is its own group, then chording and hammer on/hammer off effects just happen as a side effect. By default, the legato rule is to have attack on first note and legato on others. This rule can be toggled off to attack on every note (thus legato and poly are not identical concepts), or a controller can have it set for every note (ie: via velocity, finger area, or some gesture to signify that the note should have an explicit "pick attack".) There is NO WAY to make an even moderately realistic string instrument rendition of MIDI notes without these poly and legato concepts. They are fundamental to functioning correctly.



For example:


//Put down one finger into poly group 0

beginOn finger0
express finger0 11 127
express finger0 42 127
...
endOn finger0 33 vol polyGroup0


//Put down another finger into poly group 0

beginOn finger1
express finger1 11 127
express finger1 22 127
...
endOn finger1 35 vol polyGroup0


//Put down another finger into a different poly group 1

beginOn finger2
express finger2 11 127
express finger2 33 127
endOn finger2 35 vol polyGroup1
...
move finger0 pitch vol
...
express finger0 43 127
...
move finger0 pitch vol
...

off finger1
off finger0
off finger2

We end up with chording and solo-mode-like tricks happening simultaneously.  finger0 and finger1 are in the same poly group.  So the original pitch 33 has a finger with pitch 35 buried over it.  So when finger 1 goes down, finger 0 is silent.  When finger 1 comes up, finger 0 turns back on again - without a pick attack(!).  This is like solo mode.  But after finger 1 went down, finger 2 went down.  Finger 2 stays on until the end.  So we have soloing and chording at the same time.  This always happens on guitar, where one string will be hammered on while other strings ring through.  Since Geo Synthesizer and AlephOne are two-handed instruments, it is a very common thing to do.  One hand will play riffs with trills and hammer-ons, while the other holds down or creates chords on other strings.

The beauty of doing it this way is that the synth is completely unaware of modes of any kind.  They simply don't exist.  The synth is polyphonic and stays in that mode.  But we explicitly turn notes on and off and explicitly tie notes together to play a note without a pick attack, from the controller.  This is a crucial element of string instrument gesturing.

Note that a slightly different way to render this whole sequence would be to assume a very large pitch bend that can cover any possible range, as in the example of every channel simply playing MIDI note 64 with a pitch bend of +/-60 or perhaps +/-64.  Instead of playing new notes, we can simply bend 33 to 35 because they are in the same poly group.  This is a suggestion made by the MIDI Manufacturers Association.  It is simpler to implement if you don't want to add note-ties, and my objections about the lower pitch resolution required to do it that way might not be as big a problem as I have always assumed.  The question arises as to what the coarsest acceptable pitch resolution is, given that we need to be able to represent exact pitch intervals without beating, and be able to double voices to get chorusing effects (is it within 1 cent for any given interval made by two notes?).  For note 64 as the center with +/-64 semitones, the entire set of available pitches in that system is, I think (I haven't checked it):


c0 * 2^( (128*n/16384)/12)


For pitch bend n from 0 to 16383, where c0 is the pitch of MIDI note 0, with 16 independent voices per instrument that can do this.
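Here is a quick numerical check of that formula: with note 64 as the center and +/-64 semitones of bend, bend 0 lands on the pitch of MIDI note 0, bend 8192 lands on note 64, the top of the range sits just under note 128, and each bend step is 1200/1536 = 0.78125 cents. (The value of c0 below assumes standard A440 tuning, which is an assumption on my part.)

#include <math.h>
#include <stdio.h>

int main(void)
{
    double c0 = 8.1757989156;   /* assumed pitch of MIDI note 0 in Hz (A440 tuning) */
    /* pitch(n) = c0 * 2^((128*n/16384)/12) = c0 * 2^(n/1536) */
    double atZero   = c0 * pow(2.0, 0.0 / 1536.0);       /* MIDI note 0          */
    double atCenter = c0 * pow(2.0, 8192.0 / 1536.0);    /* MIDI note 64         */
    double atTop    = c0 * pow(2.0, 16383.0 / 1536.0);   /* just under note 128  */
    double centsPerStep = 1200.0 / 1536.0;               /* 0.78125 cents        */
    printf("n=0: %.3f Hz, n=8192: %.3f Hz, n=16383: %.3f Hz\n",
           atZero, atCenter, atTop);
    printf("resolution: %.5f cents per bend step\n", centsPerStep);
    return 0;
}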

Unison

This should not even be an issue, except that when starting with a piano keyboard as the model for what an instrument is, there is a mistaken assumption that notes of the same name (or even just the same pitch) are (or can be) unique. It is completely expected to have multiple versions of the same exact note at multiple locations. So there should not be situations where we remember a note by its note number and look it up to do something with it later. We use the (note, channel) pair, and can treat *that* as unique only if every note can be placed into its own channel. But when mashing this fretless protocol down into one MIDI channel, we run into the problem again. We deal with it by sending a note off for a MIDI note before retriggering it a second time. Ie:

#what we want to send
ch1 on 35 127
ch2 on 35 127
ch3 on 35 127

ch1 on 35 0
ch2 on 35 0
ch3 on 35 0

#what we have to send because it's all mashed down to 1 channel
ch1 on 35 127
ch1 on 35 0
ch1 on 35 127
ch1 on 35 0
ch1 on 35 127
ch1 on 35 0

Note that when we overfill MIDI channels with more than one note, we have multiple problems to deal with. The first problem is the obvious one of having one pitch wheel per channel, so that notes are not independent. The second problem is less obvious: we have to render messages in a different ORDER to get correct behavior. Supporting fewer channels than polyphony creates a lot of complications like this in the DSPCompiler code.

CC Transfer

One of the problems with abusing channels to get pitch bend independence is that if you move a note from one channel to another, you also need to move all the per-channel values for other expression parameters. So, just as the pitch bend is remembered and altered if it must be, the same must be done for channel pressure and for per-channel expression values (of which there may be dozens, unfortunately!). So if a dozen CC values are known to differ between channel 5 and channel 6, then when there is a note-tie from channel 5 to channel 6, all dozen differing CC values need to be sent. The Fretless protocol is *not* doing this yet. It is something that should be added, though right now only one fixed CC is used in practice (CC 11). A sketch of the transfer follows below.
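A hypothetical sketch of that transfer (again, the Fretless code does not do this yet):

/* Hypothetical CC transfer on note-tie: re-send any per-channel CC value
   that differs between the source and destination channels. */
#define NUM_CCS 128

static int ccState[16][NUM_CCS];   /* last value sent, per channel per CC */

static void transferCCs(void (*midiPutch)(int), int fromCh, int toCh)
{
    for (int cc = 0; cc < NUM_CCS; cc++) {
        if (ccState[fromCh][cc] != ccState[toCh][cc]) {
            midiPutch(0xB0 + toCh);              /* CC message on destination */
            midiPutch(cc);
            midiPutch(ccState[fromCh][cc]);
            ccState[toCh][cc] = ccState[fromCh][cc];
        }
    }
}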

iOS Specifics

One of the things that is specific to iOS that must be fixed about MIDI is that setup is a horrible experience, just like on hardware devices. The kinds of terrible VCR-programming-like experiences offered by hardware devices are not tolerated by iOS users at all. Specifically:

  • The user has to know far too much about how MIDI works to get it setup
  • Correct pitch handling per finger is not an out of the box experience (what almost all of this document is about), which means at least: a bend width, channel span, channel number setting, and something to disable pitch bending for instruments that just can't handle it right. Most users don't have the faintest clue as to why they must span channels to get correct pitch handling, because it's a workaround for a deeply non-intuitive limitation. (4 unnecessary controls at least)
  • There is no real reason to stick a bunch of MIDI devices on a small number of channels, as VirtualMIDI can add multiple devices as easily as adding multiple channels. When talking to external hardware, it makes sense to span a small number of channels to make room for more instruments. But what really matters on iOS (the 90% case) is the issues that pertain to what can happen all inside of one iOS device.
  • The inconsistency of things like channel pressure handling (ie: is it a continuation of velocity change?) means that you can't ship an instrument configured just one way. You need a switch to turn it off in case it misbehaves against some synth (at least 1 more unnecessary control).
  • Modes are set in the synth for things that should be controller-driven. My example here is the legato and polyphony issues. The controller has plenty of CPU power to handle these things, and it should be defined there rather than in the synth. The use of modes prevents per-note expression types that arise on expressive controllers.
  • Capability negotiation should be one of the very first things to happen in a MIDI session. We should never be sending device and vendor specific identifiers unless the idea is that everything is just so buggy that we must code to specific devices. Otherwise, a protocol for negotiating what the sides expect of each other should have been one of the first things to go into the standard. If you can negotiate a small set of required things between controller and synth, then you know that things should work between them and they can keep things simple internally. You need to know if something you send, or expect to receive, is not going to work.
  • Once there is capability negotiation, there should be a way to know that you can query for the knobs and sliders of the current patch. On iOS, you are not dealing with a hardware box plugged into a controller where you can physically touch *both* of them at the same time. You end up with a controller in the foreground, and that foreground controller needs to present the background synth's controls. The background synth is unreachable in real time. The names of knobs and sliders are going to be patch-specific, and the knobs themselves will come and go depending on the patch and synth chosen. So the common MIDI wisdom of picking names from pre-defined lists won't work. In current MIDI, there is a chronic pattern of enumerating many known cases rather than coming up with a few general mechanisms that combine well. That made sense at the time, but controllers are no longer dumb controllers. They are fully functional computers that should be taking on most of the complexity, while synthesizers are more specialized signal processors that don't really have their own interfaces anymore.
  • These patches might be loaded from the controller (ie: pick a file from foreground app) and pushed into the synthesizer. They will be defined versus some standard, possibly something based on Pure Data (Pd, libpd, Max/MSP) or CSound, or something iOS specific (related to AudioBus?). In any case, a capability to get and push patches into the synth should be an obvious and consistent part of the standard. Currently, developers end up trying to create and hardcode all possible patches for users, rather than letting the user community build most of that.
  • On iOS, it's clear that the synthesizers will start to get "cored out" so that the present situation doesn't continue as it is now. Ironically, every app defines its internal synth (usually not the greatest synth engine) and a controller (usually a keyboard) in spite of having a MIDI implementation, in addition to having an in-app-record facility (a DAW too!?). This represents an enormous waste of effort. It requires too much user interface code to be written, requires too much synthesis knowledge, and causes controllers to limit themselves to the most common forms (ie: keyboards). When MIDI is actually doing its job correctly, controllers will stick to controlling stuff, synths will stick to synthesizing stuff, and DAWs will stick to recording stuff (via AudioBus?). All of these things will evolve independently, and much faster as a result.
  • This is very similar to how OpenGLES2.0 "cored out" the graphics APIs to stop enumerating all possibilities and introduced lower level primitives and a shading language. The MIDI primitives need to be much more strictly defined and allowances made to use these primitives to build unseen possibilities. This is the opposite of the current situation, which leaves many things up to the implementer; which causes setup headaches due to extremely basic things not being consistent among devices.
A Capabilities "Shell"

This is a hypothetical notion that I don't have code for at the moment.

A major part of the problem with MIDI is its combination of complexity, fragility, and ambiguity.  The ambiguity of it is sometimes touted as a way to get flexibility, where there is only the guarantee of reproducing gestures coming out of the hardware, but no guarantee that the same signals will produce a reasonable result against different synthesizers.  Because there isn't a common mechanism to negotiate (or simply dictate to the synth) a "language" during a MIDI session, there is no guarantee that the bytes being sent from controller to synth are interpreted as intended.  The current way around this is to put in device-specific hacks, involving setup by the user, or proprietary messages defined between specific controller/synth pairs.  So, now I am going to define a hypothetical mechanism by which we can deduce the minimum language that will be understood by the other side, without user setup in most cases.  The other issue is to be able to factor out common sub-languages into standard extensions to make it difficult to need actual proprietary messaging.

So suppose that all new MIDI devices can listen for just one type of SysEx message used for this purpose.  Devices that do not respond to these SysEx messages are taken to be old MIDI devices.  Take this example:

c -> s: i can read [negotiation]
c -> s: i can send [negotiation,noteOnOff,bend[2..64],noteTie]
s -> c: i can read [negotiation,noteOnOff,bend[2..60]]
s -> c: i can send [negotiation]

In this case, the controller should note that there is no point in sending note ties, and should try to use a bend width of 60 to create legato phrasing.  We have no idea what vendor, synth, or hardware is on the other side, etc.  That's irrelevant.  What we need to know is what we can *send* and expect to be understood, and what we can allow to be *sent* to us.  It amounts to specifying a legal state machine for the traffic between the two devices.  How about this:

c -> s: i can send [negotiation, OSC-pitchcontrol,...]
..
s -> c: i can read [negotiation, OSC-pitchcontrol, ...]
..
c -> s: expect OSC

If we have a negotiation protocol, then we can use it to abandon MIDI completely after the negotiation phase.  This is a way out of the backwards compatibility mess.  Switching protocols like this isn't a problem, because we *negotiated* it; it is guaranteed to happen only if both sides agree.  The command to expect OSC is "proprietary", but it is known to be handled because OSC-pitchcontrol says that it will be understood.  What is most important is that each side can implement the minimum protocol required, with the overhead of a negotiation shell, rather than having to implement the whole MIDI spec.
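As a sketch of the mechanics (entirely hypothetical, since I don't have code for this): each side keeps a set of capability names, and the controller only emits message types whose capability is in the intersection of what it can send and what the other side says it can read.

#include <string.h>

/* Hypothetical capability negotiation: intersect what we can send
   with what the other side says it can read. */
#define MAX_CAPS 16

typedef struct {
    const char* names[MAX_CAPS];
    int count;
} CapSet;

static int capSetHas(const CapSet* s, const char* name)
{
    for (int i = 0; i < s->count; i++)
        if (strcmp(s->names[i], name) == 0) return 1;
    return 0;
}

/* usable = (what we can send) intersected with (what they can read) */
static CapSet negotiate(const CapSet* weSend, const CapSet* theyRead)
{
    CapSet usable = { {0}, 0 };
    for (int i = 0; i < weSend->count; i++)
        if (capSetHas(theyRead, weSend->names[i]))
            usable.names[usable.count++] = weSend->names[i];
    return usable;
}

/* e.g. if "noteTie" is not in the usable set, fall back to wide bends;
   if "OSC-pitchcontrol" is, send "expect OSC" and switch protocols. */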

Some standard capability name spaces that we could expect would be ones that expose the knobs and sliders of the synthesizer back to the controller; giving them names, default values, and keeping them in sync. Under iOS, it is very important to be able to proxy a synth parameter in the controller because the synth is sitting inaccessible in the background.

OSC

What I describe sounds a lot like OSC in a lot of ways. The main issue is how to at least be pseudo-compatible with MIDI, and how to take advantage of the fact that a background MIDI pipe exists on iOS but not a background OSC pipe (nor many synths that are even pseudo-compatible with OSC). If trying to use MIDI this way becomes overly complicated, or attempts to fix it are too incompatible to matter, then something will emerge from the iOS world. If it isn't OSC, then it will probably be something similar being smuggled over AudioBus. iOS spawning a pseudo-compatible-with-MIDI standard sounds like a bad thing; but realistically, there is no point in being compatible when the required scenarios can't be made to work. The only options would be to completely forget about any kind of MIDI compatibility, to make OSC finally take hold, or to make something entirely new happen.