Scientific Linux Forum.org




> Firewire camera app issue, hardware & software - dependent bug
Crystal Cowboy
 Posted: Jul 6 2018, 06:18 PM

I have encountered a problem that is puzzling me. It involves multiple components, so I don't know where the blame might lie. The application interfaces to a Firewire camera using libdc1394, but there is more to it than that.


System A:
Mobo: Intel DH77DF (H77 Express chipset)
CPU: i3-3220T (2 cores, 4 threads)
4 GB RAM
Using onboard graphics
I have 2 of these systems, so it doesn't seem to be a hardware component failure. The problem is consistent on both.

System B:
Mobo: Intel DQ77MK (Q77 Express chipset)
CPU: i7-3770 (4 cores, 8 threads)
8 GB RAM
Radeon PCIe graphics card or onboard graphics, makes no difference.

Both systems have an IEEE 1394 Firewire port on the mobo.

--------------------
The application is in Java, and utilizes a Firewire(400) camera through C++ code calling libdc1394, wrapped with JNI.

The camera may be either a Sony model or one from The Imaging Source; specs are similar and behavior is the same with both.

The Java application, other than fetching images from the camera and displaying them, has a number of other features, including RMI and a few sockets.

The application worked fine under SL6, and had been going just fine for over 10 years on various versions of Linux.

--------------------
After upgrading to SL7 (currently 3.10.0-862.6.3.el7.x86_64), the application runs OK on system B but has problems on system A.

The symptoms are as follows:
After finding the camera and configuring it, isochronous image capture is set up. This includes a call to dc1394_capture_setup(handle, NUM_BUFFERS, DC1394_CAPTURE_FLAGS_DEFAULT) and one to dc1394_video_set_transmission(handle, DC1394_ON).
NUM_BUFFERS is the number of images in the ring buffer. I have been using the value 4 successfully for quite a while.

After setup, image fetching begins. Images are fetched with dc1394_capture_dequeue(handle, DC1394_CAPTURE_POLICY_WAIT, &frame) and returned to the ring buffer with dc1394_capture_enqueue(handle, frame).

With NUM_BUFFERS at 4, frame fetching hangs on frame 5: the call never returns.
If NUM_BUFFERS is increased to 10, the failure comes on frame 11. And so on.
This appears to indicate that frames are not being returned for use in the buffer.
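
To spell out that reasoning: a toy model of the ring buffer stalls on frame NUM_BUFFERS + 1 exactly when releases never take effect. The sketch below is not libdc1394 code and says nothing about how the library or kernel is actually implemented. The slot states, struct ring, and first_stalled_frame() are hypothetical names made up for illustration, and the "camera" is assumed to fill every free slot before each fetch.

```c
#include <assert.h>

/* Hypothetical model: a slot is free for the camera to fill, filled and
   waiting for the application, or held by the application after a dequeue. */
enum slot_state { SLOT_FREE, SLOT_READY, SLOT_HELD };

#define MAX_SLOTS 64

struct ring {
    enum slot_state slot[MAX_SLOTS];
    int num_slots;
};

static void ring_init(struct ring *r, int num_slots)
{
    assert(num_slots > 0 && num_slots <= MAX_SLOTS);
    r->num_slots = num_slots;
    for (int i = 0; i < num_slots; i++)
        r->slot[i] = SLOT_FREE;
}

/* Assumption: the camera delivers a frame into every free slot. */
static void ring_fill(struct ring *r)
{
    for (int i = 0; i < r->num_slots; i++)
        if (r->slot[i] == SLOT_FREE)
            r->slot[i] = SLOT_READY;
}

/* Returns the index of a filled slot and marks it held, or -1 if none is
   available (where a waiting dequeue would block). */
static int ring_dequeue(struct ring *r)
{
    for (int i = 0; i < r->num_slots; i++) {
        if (r->slot[i] == SLOT_READY) {
            r->slot[i] = SLOT_HELD;
            return i;
        }
    }
    return -1;
}

/* Releases a held slot. With enqueue_works == 0 the release is silently
   lost, which models the suspected fault. */
static void ring_enqueue(struct ring *r, int idx, int enqueue_works)
{
    if (enqueue_works)
        r->slot[idx] = SLOT_FREE;
}

/* Returns the 1-based number of the first frame whose dequeue would block,
   or 0 if all frames_wanted frames were delivered. */
int first_stalled_frame(int num_slots, int frames_wanted, int enqueue_works)
{
    struct ring r;
    ring_init(&r, num_slots);
    for (int frame = 1; frame <= frames_wanted; frame++) {
        ring_fill(&r);
        int idx = ring_dequeue(&r);
        if (idx < 0)
            return frame;
        ring_enqueue(&r, idx, enqueue_works);
    }
    return 0;
}
```

In this model first_stalled_frame(4, 100, 0) returns 5 and first_stalled_frame(10, 100, 0) returns 11, matching the observed pattern, while first_stalled_frame(4, 100, 1) returns 0 (no stall). It does not explain why larger values of NUM_BUFFERS sometimes work.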

If NUM_BUFFERS is increased to ~17-22 the program starts to work properly some of the time, being more successful as the number increases.

If NUM_BUFFERS is set to 24 it seems to work fairly reliably, until another feature of the program is turned on, which is to open a communication socket. Then it is back to square one: set NUM_BUFFERS to 24, it stops at frame 25. Set NUM_BUFFERS to 48, it stops at frame 48.

I tried changing to different kernel versions. This makes no difference.
Tried: kernel-devel-3.10.0-862.el7.x86_64, kernel-devel-3.10.0-862.3.3.el7.x86_64, kernel-devel-3.10.0-862.3.2.el7.x86_64

Tried updating the BIOS on system A from v0108 to v0111. This makes no difference.

The standard EPEL package on SL7 is libdc1394-2.2.2-3.el7.x86_64.
I tried removing this and compiling libdc1394-2.2.5 from source. This made no difference.

I wrote a minimal C test program to set up the camera and fetch (but not display) frames. This works reproducibly and never shows any problems.

I tried doubling the RAM on system A to 8 GB. No difference.

Rolled back Oracle Java-Netbeans from jdk-8u171-nb-8_2-linux-x64 to jdk-8u151-nb-8_2-linux-x64. No difference.

Switched desktop from Mate to Gnome3 or XFCE. No difference.

Tried to influence timing of events in program startup by introducing sleep commands before and after the camera setup and frame-fetching. No difference.

------
So where does this leave me?

The manipulation of the value of NUM_BUFFERS, and the opening of a socket after camera setup are the only factors that seem to make a difference on system A under SL7.

So the problem doesn't seem to be in libdc1394 itself.

What does SL7 do differently than SL6 that might be relevant?
What does SL7 do differently on system A vs. system B?
What does the dependence on opening of the socket tell me? Is there some kernel resource that is not being respected?



burakkucat
 Posted: Jul 6 2018, 11:21 PM

That is certainly puzzling.

As a temporary measure, you could try using the latest kernel-ml package set that is available from the ELRepo Project. (As of the date of this posting kernel-ml is built from the linux-4.17.4 sources.)

At present I can't think of anything else to try.

--------------------
100% Linux and, previously, Unix. Co-founder of the ELRepo Project.
Crystal Cowboy
 Posted: Jul 12 2018, 08:01 PM

Thanks for the suggestion. I installed kernel-ml, booted into the new kernel, ...
and the bug was still present as described.

Then I did a fresh install of SL6.10 and verified that the bug does not show up in that version.
Crystal Cowboy
 Posted: Jul 14 2018, 09:14 PM

Other stuff I have since tried:

In the BIOS, turning off cores or turning off hyperthreading doesn't seem to make a difference.

I tried a fresh install of SL7.5; it made no difference.

I swapped disks. Moving the disk from system A (which experiences the problem) to system B resulted in perfect operation (once I sorted out the ethernet address). This would indicate that the problem goes with the hardware, although I have not done the next step, which is to swap CPUs to tell if it is the CPU or the motherboard.