Is there a Linux system command that can be called to change the arbitration scheme set for PCIe devices?

I don't think that arbitration is the issue here, and adjusting its settings requires board support as well as kernel modification. The VC (Virtual Channel) extended capability interface is handled in part in the Linux kernel here: http://lxr.free-electrons.com/source/drivers/pci/vc.c

I've written drivers for custom PCIe boards on Linux, and the algorithm for routing traffic between boards has not shown itself to be an issue in the past, unless you have a very unusual use case: extremely long transfers with near-real-time latency requirements (in which case you shouldn't be using PCIe).

What can directly impact this type of performance, and is far more readily addressed, is the topology of the bus itself, although the impact is usually barely measurable.

On the machine, run the lspci command as:

lspci -tv

This will show you a tree view of the PCIe interfaces and the route they take to the CPU(s). With most processors you will have some slots that connect directly to the CPU and others that go through a bridge chip (see the Intel X99 chipset, for example).

These bridges introduce latency and the possibility of lower throughput. The CPU-direct slots are specifically intended for high-performance devices such as video cards. To your initial point, deep in the processor microcode there may be optimizations which further degrade the bridged links. To dig deeper into the performance and routing of the PCIe slots, continue into sysfs.
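One quick way to see the routing without reading the tree by eye: the sysfs path of a device spells out every bridge between it and the root port. The address below is a placeholder; use the ones lspci reports for your boards.

# Resolve the device's sysfs symlink; each intermediate
# 0000:xx:xx.x component in the resulting path is a bridge
# the device sits behind
readlink -f /sys/bus/pci/devices/0000:01:00.0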

Under /sys/bus/pci/slots/ will be a list of the (physical) PCI slots in your system. Each slot directory contains a virtual file, address, which associates the bus address with the physical slot.
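For example, this one-liner prints every physical slot number alongside the bus address it maps to (note that not every platform populates this directory):

# Each slot directory name is the physical slot number; its
# 'address' file holds the bus address of the device in it
grep -H . /sys/bus/pci/slots/*/address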

Under /sys/bus/pci/devices is a list of all devices (this is where lspci gets its info).

Going through each of the devices you can see all of the information the kernel exposes about them: the driver associated with them, the CPU associated with the device (on a multi-CPU system), among other things.
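As a sketch, assuming a device at 0000:01:00.0 (substitute your own address), the most relevant attributes for this kind of comparison can be read directly:

cd /sys/bus/pci/devices/0000:01:00.0

# Negotiated vs. maximum link speed and width
cat current_link_speed current_link_width max_link_speed max_link_width

# CPU/NUMA locality of the device (matters on multi-socket systems)
cat local_cpulist numa_node

# Driver bound to the device
readlink driver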

Edit - I didn't mention some obvious things that I assume you have ruled out, but just in case:
1. Do the different slots both have at least as many lanes as the boards?
2. Is there a spec discrepancy - e.g. the board is PCIe 3.0, one slot is 3.0 and the other 2.0? (Both of these are easy to verify - see the lspci check after this list.)
3. Have you discussed this concern with the board vendor and/or the driver developer beyond them acknowledging it? They may be aware of some random errata regarding it.
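For points 1 and 2, lspci reports both what the link is capable of (LnkCap) and what it actually negotiated (LnkSta); the device address below is a placeholder:

# If LnkSta shows a lower speed or narrower width than LnkCap,
# the board trained down in that slot
sudo lspci -vv -s 01:00.0 | grep -E 'LnkCap:|LnkSta:'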

If you provide specific details I can provide specific advice.

Beyond looking at the topology (is the faster device on a direct CPU path while the other is not?), and not knowing the chipset/CPU you are using, I can only offer general advice. Three areas I would start looking at are:

Interrupt Latency: If the interrupt for a board is associated with a CPU/core that is handling other devices with a high interrupt rate, you will take a performance hit. Is there other kernel-context heavy lifting going on on that core? Watch /proc/interrupts to see what other kernel modules are using that CPU for their interrupt handling and the count/rate at which they occur. Try adjusting the CPU affinity for that device in /proc/irq/<irq number>/smp_affinity. smp_affinity is a hex mask; if you had 8 cores and didn't specify anything, it would be set to FF (eight 1s, all cores allowed). If you set it to, e.g., 0x02 (bit 1), that forces the second core (CPU1) to handle the IRQ. Unless you know you are addressing a specific issue, forcing these changes can easily make things worse.
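A concrete sketch (the IRQ number 42 is a placeholder; pull the real one for your device out of /proc/interrupts):

# Watch per-CPU interrupt counts update live
watch -n1 cat /proc/interrupts

# Show the current affinity mask for IRQ 42
cat /proc/irq/42/smp_affinity

# Pin IRQ 42 to the second core (mask 0x02 = CPU1); requires root,
# and irqbalance may override it if left running
echo 02 | sudo tee /proc/irq/42/smp_affinity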

Interrupt support: Take a look and see whether one of the devices is using MSI-X or MSI interrupts while the other is using a standard (electrical) interrupt. Sometimes bridges don't support a board's MSI implementation (MSI stands for Message Signaled Interrupts - rather than an electrical interrupt, it's just a packet that gets sent over the bus itself). If a device typically uses multiple interrupts but has to operate with only a single one because of this, it can be hard to detect unless you are looking for it directly, and it can cause performance issues.
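You can see which mechanism each device actually negotiated in its capability list (device address is again a placeholder):

# 'MSI: Enable+' or 'MSI-X: Enable+' means message-signaled
# interrupts are active; 'Enable-' plus a routed 'Interrupt: pin'
# line means the device fell back to legacy INTx
sudo lspci -vv -s 01:00.0 | grep -E 'MSI|Interrupt:'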

Characterize performance: There are many tools in the kernel to collect performance data. The one thing they all have in common is that they are poorly documented and generally unsupported. With that said, I would look at using ftrace to characterize the DMA transfers from each board, and the IRQ latency for each. You can get statistical information as well as specific details on outlier events. You can start looking into that here: http://elinux.org/Ftrace
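As a minimal sketch, the stock IRQ trace events give you per-interrupt entry/exit timestamps. The tracefs mount point varies by kernel (/sys/kernel/tracing on recent ones, /sys/kernel/debug/tracing on older), and the IRQ number 42 is again a placeholder:

cd /sys/kernel/tracing

# Enable entry/exit events for all hard-IRQ handlers
echo 1 > events/irq/irq_handler_entry/enable
echo 1 > events/irq/irq_handler_exit/enable

# Stream events live, filtered to one IRQ
grep --line-buffered 'irq=42' trace_pipe

# Turn the events back off when done
echo 0 > events/irq/enable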

In general I strongly discourage mucking about in very low-level settings without as complete an understanding as possible of what you are trying to correct (not the symptoms, but the underlying root cause). 99% of the time you will end up turning 'knobs' for the sake of it, and without understanding why, or what the original problem is, how can you evaluate the effectiveness of a given setting (both immediately and in terms of long-term stability)?

I heavily use ftrace for general kernel debugging and highly recommend it. If you want things abstracted a bit, there are wrappers around ftrace that claim to make it easier to use (trace-cmd, KernelShark, etc.), but I found the additional abstraction just muddies the water. If you are on a Red Hat system you can look into SystemTap - not the same thing, but it can provide similar data (and it's well supported).
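If you do want to try the wrapper route, a trace-cmd session roughly equivalent to the ftrace sketch above would look something like this (the workload, here a 10-second sleep, is a placeholder for whatever exercises your boards):

# Record hard-IRQ entry/exit events while a workload runs, then
# print a per-event report with timestamps
sudo trace-cmd record -e irq_handler_entry -e irq_handler_exit sleep 10
trace-cmd report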