python bridge round-trip variable latency

Posted by mhisted on 16 Dec, 2013 03:10 PM

Hi Chris,

In the course of using the Python bridge for state-system processing, I've needed to know how long a round trip takes: I set a variable in the XML, respond to it via a callback in the Python bridge by setting another variable, and the XML acts on that.

I constructed a self-contained test for this, which I'm attaching. You should be able to run it easily (it's based on your bridge example).
I find a round trip takes ~300 ms for the client bridge and either 100 or 200 ms for the server bridge, so the worst cases are 300 ms and 200 ms.
That includes about 75 ms for "conduit.send_data()" on both the client and server bridges.

Is that what you expect?
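
For reference, the Python side of the test is essentially just an echo. A simplified sketch (the conduit name and variable names are placeholders for what's in the attachment):

    # Echo side of the round-trip test: the XML sets 'bridge_input' and
    # measures how long it takes for 'bridge_output' to come back.
    from mworks.conduit import IPCClientConduit

    conduit = IPCClientConduit('python_bridge_plugin_conduit')
    conduit.initialize()

    def echo(event):
        # Reply by setting 'bridge_output', which the state system waits on.
        conduit.send_data(conduit.reverse_codec['bridge_output'], event.data)

    conduit.register_callback_for_name('bridge_input', echo)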

The example illustrates one more small nit I found: in string variable assignments, quotes are treated as part of the string, whereas everywhere else they are just syntax. For example, with

    <variable tag="messageVar" scope="global" logging="when_changed" default_value='"a"' type="string"/>

if the variable is read in the Python bridge, the value is a three-character string whose first and last characters are the quote characters.
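
To make that concrete, a callback registered on the variable above sees something like this (a sketch, reusing the conduit from the previous snippet):

    def on_message_var(event):
        # With default_value='"a"', the data arrives as the three-character
        # string '"a"' -- the quotes are included.
        print(repr(event.data))   # -> '"a"'
        print(len(event.data))    # -> 3

    conduit.register_callback_for_name('messageVar', on_message_var)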

Thanks,
Mark

  1. Posted by mhisted on 16 Dec, 2013 03:19 PM

    Whoops, here is the code.

  2. Posted by Christopher Stawarz (Support Staff) on 18 Dec, 2013 06:42 PM

    Hi Mark,

    > I find a round trip takes ~300 ms for the client bridge and either 100 or 200 ms for the server bridge, so the worst cases are 300 ms and 200 ms.

    This took a little while to figure out. These delays reflect the "work cycle" of several MWorks threads that handle the sending, receiving, and processing of events. I'll spare you the hairy details, but the upshot is that by changing two scheduling intervals in the MWorks source code, I can reduce the round-trip time to ~20ms for both the client and server bridges. (I think the 20ms delay, in turn, is fixed by the work cycle of some other thread(s), but I haven't determined that conclusively.)

    Do you have a sense of what an acceptable round-trip time would be for your purposes?

    > in string variable assignments, quotes are treated as part of the string, whereas everywhere else they are just syntax

    This relates to this discussion, so I'll answer in that thread.

    Cheers,
    Chris

  3. Posted by mhisted on 18 Dec, 2013 08:56 PM

    A 25 ms worst-case round trip time should be fine.

    I plan to use a "wait" action for a fixed 25 ms before acting on the output from Python, so I'd be interested in the worst case. I could add another state each time I need to wait for this, but if you expect a non-constant latency, perhaps a new action that waits for a variable to take a value would be useful.

    By the way - Dave said a while ago that the server bridge had lower latency. Does that apply only to passing events from MWorks to Python (and not the reverse), or is that no longer true?

    Thanks,
    Mark

  4. Posted by Christopher Stawarz (Support Staff) on 12 Feb, 2014 09:24 PM

    Hi Mark,

    > Any progress on the round-trip bridge times?

    No, I haven't been working on this lately. The code changes I mentioned previously are trivial, but I've been hesitant to commit them without some more thought and testing. I'll try to get to that soon.

    Chris

  5. Posted by mhisted on 12 Feb, 2014 10:54 PM

    Thanks, I am making some changes soon that would benefit from this. Mark

  6. Posted by mhisted on 18 Feb, 2014 03:37 PM

    Hi Chris,

    We'd be willing to test a build with these changes if that helps you. I could use it on our human psychophysics rig and then, after a few days, move it into more daily use on our training rigs. If you have other suggestions, let me know.

    Mark

    cc:ing Bram, who would also use these changes.

  7. Posted by Christopher Stawarz (Support Staff) on 20 Feb, 2014 07:58 PM

    Hi Mark,

    After experimenting with this some more, I think I'm comfortable with pushing my changes into the nightly build. Do you want them in the task definition file build, too?

    > I plan to use a "wait" action for a fixed 25 ms before acting on the output from Python, so I'd be interested in the worst case. I could add another state each time I need to wait for this, but if you expect a non-constant latency, perhaps a new action that waits for a variable to take a value would be useful.

    After tweaking the timing a bit more, I now get consistent round-trip times of ~20 ms with the client bridge and ~10 ms with the server bridge. However, I don't think I can ensure a hard upper limit on the round-trip time, so waiting for explicit confirmation that the output from Python has been received would be the more robust approach.

    > Dave said a while ago that the server bridge had lower latency. Does that apply only to passing events from MWorks to Python (and not the reverse), or is that no longer true?

    It's still true, and the latency is lower in both directions. The reason is that events sent over the client-side bridge have to first pass from server to client before going from the client to the Python process, whereas events going over the server-side bridge go directly from the server to Python.

    Chris

  8. Posted by mhisted on 20 Feb, 2014 11:33 PM

    Hi Chris,
    Comments below.

    > After experimenting with this some more, I think I'm comfortable with pushing my changes into the nightly build. Do you want them in the task definition file build, too?

    Great!! I'll test them in the nightly. I don't need them in the task definition build for now.

    > I plan to use a "wait" action for a fixed 25 ms before acting on the output from Python, so I'd be interested in the worst case. I could add another state each time I need to wait for this, but if you expect a non-constant latency, perhaps a new action that waits for a variable to take a value would be useful.

    > After tweaking the timing a bit more, I now get consistent round-trip times of ~20 ms with the client bridge and ~10 ms with the server bridge. However, I don't think I can ensure a hard upper limit on the round-trip time, so waiting for explicit confirmation that the output from Python has been received would be the more robust approach.

    Great. Can I request a wait_for_condition action? I.e.:

        <action type="wait_for_true" condition="python_output_received==1" error_timeout_ms="100" error_message="Python semaphore not high within 100 ms"/>

    (If the condition is not true within 100 ms, the state system stops with a console error.)

    We can implement this now in the state system, but it requires creating a separate new state for each wait, so this action would be a huge improvement.

    In practice, I'm likely to use a fixed 30 ms delay action plus an assert for each wait, instead of a new state, so this action would just make for faster, more precise round trips.

    > Dave said a while ago that the server bridge had lower latency. Does that apply only to passing events from MWorks to Python (and not the reverse), or is that no longer true?

    > It's still true, and the latency is lower in both directions. The reason is that events sent over the client-side bridge have to first pass from server to client before going from the client to the Python process, whereas events going over the server-side bridge go directly from the server to Python.

    Good, and it's nice to know the difference quantitatively: ~20 ms and ~10 ms round trips.

    Thanks again, this is great.
    Mark

  9. Posted by Verhoef, Bram-Ernst on 26 Feb, 2014 01:30 AM

    This comment was split into a new discussion: MWorks on new Mac Pro

    Hi Chris,

    I am a postdoc working together with Mark Histed on a mouse project. I'm thinking about buying a new Mac Pro for my experiment, but I'm not sure whether MWorks runs properly on it. Do you know of any users that have employed MWorks successfully on the new Mac Pro?

    Thanks a lot,
    Bram

  10. Posted by mhisted on 26 Feb, 2014 08:02 PM

    Hi Chris,

    > After experimenting with this some more, I think I'm comfortable with pushing my changes into the nightly build. Do you want them in the task definition file build, too?

    After a few days of testing, the reduced-latency changes appear to cause no visible problems.

    Do you mind doing a TDF build with the reduced-latency changes? Thank you,
    Mark

  11. Posted by Christopher Stawarz (Support Staff) on 27 Feb, 2014 03:41 PM

    Hi Mark,

    > After a few days of testing, the reduced-latency changes appear to cause no visible problems. Do you mind doing a TDF build with the reduced-latency changes?

    Thanks very much for your testing. I'll merge these changes into the TDF branch and do a new build.

    > Can I request a wait_for_condition action?

    Sure, no problem. That should be straightforward to implement, so I'll try to get it done soon and roll it into the TDF build, too.

    Cheers,
    Chris

  12. Posted by mhisted on 27 Feb, 2014 03:49 PM

    Before you spend time on this: we are now seeing bugs with the low-latency
    build. I will give you a call this afternoon if that's OK. -M

  13. Posted by Christopher Stawarz (Support Staff) on 27 Feb, 2014 04:26 PM

    > Before you spend time on this: we are now seeing bugs with the low-latency build. I will give you a call this afternoon if that's OK.

    Sure, that's fine.

    Chris

  14. Posted by mhisted on 03 Mar, 2014 08:35 PM

    Hi Chris,

    I played around with the client and was able to put together a test case that shows the behavior on my laptop, not just on the rig machine.

    1. Load the attached experiment and launch the matching Python script from the client bridge. You should get "Done setting up.." and then a bunch of variable/event state-change messages.
    2. Terminate and reload the Python script, repeating until it loads but you no longer see any event messages (no callback being called). Once that happens, clicking the green arrow in the client will start the server, but the client will crash.
    3. If you can't get the bridge into the bad state, try starting and stopping the experiment a few times (you may need to use MWServer->Action->Stop to stop it).

    This happens without reloading the experiment about two-thirds of the time on our rig machines.

    Also - try uncommenting all the variables in the attached XML; you get a Boost error that seems to arise from having too many variables.

    There may be other issues we're experiencing, but I think this is the main one; if you fix it, I'll test again.

    best
    Mark

  15. Posted by mhisted on 03 Mar, 2014 08:39 PM

    Hi Chris,

    I wanted to clarify our limits. We typically have up to 220 variables per experiment. I can see that possibly going as high as 300, but I believe lists/arrays will limit that growth somewhat. In Python, I register a callback for each variable that saves its current value for use by other Python code.

    Mark

  16. Posted by Christopher Stawarz (Support Staff) on 03 Mar, 2014 09:19 PM

    Hi Mark,

    Thanks for the test case. Your suspicions were correct: I've confirmed, using both your test and my own, that the issue is a big, ugly deadlock. (Note that the reduction in event-handling latency exposed the issue but didn't cause it.) I haven't untangled things yet, but I'll let you know when I do.

    > Also - try uncommenting all the variables in the attached XML; you get a Boost error that seems to arise from having too many variables.

    Yeah, this is an old problem. My past testing indicated that you run into this issue once you get past ~500 experiment-defined variables. There's no reason that would have changed.

    Chris

  17. Posted by mhisted on 03 Mar, 2014 09:54 PM

    Hi Chris,

    Thanks, keep me posted.

    Mark

  18. Posted by Christopher Stawarz (Support Staff) on 07 Mar, 2014 08:36 PM

    Hi Mark,

    I resolved the deadlock. The fix will be in tonight's nightly build.

    Unfortunately, now when I run your test case, I get a bunch of these messages when registering callbacks in the Python script:

    ERROR: Send on outgoing queue timed out; event will not be sent
    

    Basically, MWClient is filling the interprocess message queue faster than the Python-side conduit can empty it. Because the queue is of fixed size, once it's full, outgoing events are simply dropped.

    You can avoid this problem by sleeping after each callback registration. (In my tests, I used time.sleep(0.1), although a shorter sleep interval may work, too.) However, this is kludgy. I think the real fix is to replace boost::interprocess::message_queue with a different interprocess transport (probably ZeroMQ-based). Getting away from a fixed-length, fixed-message-size queue would eliminate this problem as well as the "too many variables" issue.
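
    Concretely, the workaround looks something like this (a sketch; 'conduit', 'variable_names', and 'save_current_value' stand in for whatever your script actually uses):

        import time

        # Pause after each registration so the fixed-size interprocess queue
        # has time to drain before the next burst of events.
        for name in variable_names:
            conduit.register_callback_for_name(name, save_current_value)
            time.sleep(0.1)   # a shorter interval may work, too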

    That said, I do think the kludgy version will allow you to proceed with further testing. Should I push the deadlock fix into the TDF build and get to work on a wait_for_condition action? Or would you prefer to wait for the "real" fix?

    Chris

  19. Posted by mhisted on 07 Mar, 2014 09:40 PM

    Chris,

    Just to summarize our conversation today: I will test the new nightly with the delays. In the next few months, we were thinking about the following, in order:

    1. Fix the issue with IPC queue overflow by going to a new IPC mechanism
    2. Add a wait_for_condition action
    3. I'd test/roll out the TDF stuff at this point
    4. Add a conduit function to return the current state of a variable, avoiding the need to use Python callbacks on all vars and maintain their state
    5. Long run: think about Python script actions/an embedded interpreter, or MWorks-Server-as-a-Python-library

    thanks,
    Mark

  20. Posted by mhisted on 13 Mar, 2014 11:44 PM

    Hi Chris,

    With the registration delays in place, it seems we are not seeing any problems with the new low-latency build after a few days of testing.
    I'll let you know if anything else comes up.

    Mark

  21. Posted by Christopher Stawarz (Support Staff) on 28 Apr, 2014 07:03 PM

    [From Mark]

    Hi Chris,

    I wanted to summarize where I'm at with MWorks.

    First, the nightlies with new conduit delays work fine. Thanks!

    I have switched to using ZeroMQ (instead of a raw socket) for some Python-to-Python communication in our experiment code. If you haven't implemented this yet, I can give you a few pointers on issues I've run into. (Summary: use PAIR sockets; use ZeroMQ 4.0.x; set the receive and send timeouts and the high-water mark; turn LINGER off. I didn't use an extra heartbeat socket.)
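
    For example, the socket setup I ended up with looks roughly like this (a sketch; the endpoint and the specific timeout/high-water-mark values are just illustrations):

        import zmq

        context = zmq.Context()
        socket = context.socket(zmq.PAIR)
        socket.setsockopt(zmq.SNDTIMEO, 1000)  # ms; don't block forever on send...
        socket.setsockopt(zmq.RCVTIMEO, 1000)  # ms; ...or on receive
        socket.setsockopt(zmq.SNDHWM, 1000)    # bound the number of queued messages
        socket.setsockopt(zmq.RCVHWM, 1000)
        socket.setsockopt(zmq.LINGER, 0)       # don't hang on close with pending messages
        socket.connect('tcp://localhost:5555')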

    As for my todo list, it now looks like this:

    1. Add a wait_for_condition action
    2. Add the capability to edit LISTs in the variables window
    3. I'd test/roll out the TDF stuff at this point

    (2 and 3 can be swapped)

    Lower priority:

    • Destructor-type state called on state-system stop
    • Fix the issue with IPC queue overflow by going to a new IPC mechanism
    • Add a conduit function to return the current state of a variable, avoiding the need to use Python callbacks on all vars and maintain their state
    • Long run: think about Python script actions/an embedded interpreter, or MWorks-Server-as-a-Python-library

    One more comment on the "long run" point: either of those models works for me. Either way, I think it would be best if MWorks and Python did not run as independent threads, so that a call into Python code would, by default, block until it returned. That would save new users from having to think about threading, and advanced users could still do multithreading from Python via the multiprocessing module. (Right now, to make blocking calls, I call into Python by setting a variable and using a Python callback that sets a 'finished' variable when it's done; MWorks waits on the 'finished' value. This is an OK solution, but it should probably be easier by default. See the sketch below.)
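
    Concretely, the Python side of that pattern looks something like this (a sketch; the variable names and 'do_the_work' are illustrative):

        def on_request(event):
            # The XML set 'python_request'; do the work, publish the result,
            # then raise the 'finished' flag the state system is waiting on.
            result = do_the_work(event.data)
            conduit.send_data(conduit.reverse_codec['python_result'], result)
            conduit.send_data(conduit.reverse_codec['python_finished'], 1)

        conduit.register_callback_for_name('python_request', on_request)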

    thanks again!
    Mark

  22. Posted by Christopher Stawarz (Support Staff) on 29 Apr, 2014 08:17 PM

    Hi Mark,

    The wait_for_condition action is done and will be in tonight's nightly build. Example usage:

    <action type="wait_for_condition" tag="Wait until foo is 2" condition="foo == 2" timeout="100ms" timeout_message="Wait timed out (foo = $foo)" stop_on_timeout="YES"></action>
    

    This will wait until either foo == 2 evaluates to true or 100 ms elapses. If the wait times out, an error will be generated using the given message (which can include $varname substitutions, as report and assert do), and the experiment will stop.

    The "condition" and "timeout" parameters are required. If "timeout_message" is omitted, a default message is generated. If "stop_on_timeout" is omitted, it defaults to true; set it to NO or 0 if you don't want the experiment to stop on timeout.

    When you get a chance, please try this out and let me know how it works for you.

    Thanks,
    Chris

  23. Posted by mhisted on 29 Apr, 2014 08:25 PM

    Great! I will test this in the next few days.
    -M

  24. Posted by mhisted on 20 May, 2014 11:31 PM

    I'm now using this and it appears to work well.
    Thanks!
    Mark

  25. Christopher Stawarz closed this discussion on 14 Apr, 2015 05:20 PM.

  26. Christopher Stawarz closed this discussion on 23 Jun, 2015 03:11 PM.
