
How to speed up simulations by parallel execution

A way to speed up your simulations is to run them in parallel, taking advantage of all the processors and the available memory of your machine. This can be done by using the Message Passing Interface (MPI) together with the distributed simulator class provided by NS-3.

To make use of MPI, the network topology needs to be partitioned properly, as the potential speedup cannot exceed the number of topology partitions. Note, however, that NS-3 can divide a simulation across logical processors only at point-to-point links. Currently, only the applications running on a node are executed on a separate logical processor, while the whole network topology is created in each parallel execution. Finally, MPI requires the exchange of messages among the logical processors, which imposes a communication overhead during execution.

Designing a parallel simulation scenario

In order to run simulation scenarios using MPI, all you need to do is partition your network topology appropriately. That is, to maximize the benefits of parallelization, you need to distribute the workload equally across the logical processors.

The full topology will always be created in each parallel execution (on each “rank” in MPI terms), regardless of the individual node system IDs. Only the applications are specific to a rank. For example, consider node 1 on logical processor (LP) 1 and node 2 on LP 2, with a traffic generator on node 1. Both node 1 and node 2 will be created on both LP 1 and LP 2; however, the traffic generator will only be installed on LP 1. While this is not optimal for memory efficiency, it does simplify routing, since all current routing implementations in ns-3 will work with distributed simulation.
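
The following is a minimal sketch of this idea, not a complete scenario: it assumes MpiInterface::Enable() has already been called and the usual ns-3/ndnSIM headers are included, as in the full example later in this section. The prefix name is only illustrative.

// Minimal sketch: pin nodes to ranks and install apps per rank
uint32_t systemId = MpiInterface::GetSystemId(); // rank of this logical processor

// Both nodes are created on every rank; the system ID passed to the
// constructor determines which rank simulates the node's events
Ptr<Node> node1 = CreateObject<Node>(0); // owned by rank 0
Ptr<Node> node2 = CreateObject<Node>(1); // owned by rank 1

// Install the application only on the rank that owns its node
if (systemId == 0) {
  ndn::AppHelper consumerHelper("ns3::ndn::ConsumerCbr");
  consumerHelper.SetPrefix("/prefix");
  consumerHelper.Install(node1);
}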

For more information, you can take a look at the NS-3 MPI documentation.

Compiling and running ndnSIM with MPI support

  • Install MPI

    On Ubuntu:

    sudo apt-get install openmpi-bin openmpi-common openmpi-doc libopenmpi-dev
    

    On Fedora:

    sudo yum install openmpi openmpi-devel
    

    On OS X with Homebrew:

    brew install open-mpi
    
  • Compile ndnSIM with MPI support

    You can compile ndnSIM with MPI support by adding the --enable-mpi parameter to ./waf configure, along with any other parameters of your preference. For example, to configure in optimized mode with examples and MPI support:

    cd <ns-3-folder>
    ./waf configure -d optimized --enable-examples --enable-mpi
    
  • Run ndnSIM with MPI support

    To run a simulation scenario using MPI, you need to type:

    mpirun -np <number_of_processors> ./waf --run=<scenario_name>
    

Simple parallel scenario using MPI

This scenario simulates a network topology consisting of two nodes in parallel. Each node is assigned to a dedicated logical processor.

The default parallel synchronization strategy implemented in the DistributedSimulatorImpl class is based on a globally synchronized algorithm that uses an MPI collective operation to synchronize simulation time across all LPs. A second synchronization strategy, based on local communication and null messages, is implemented in the NullMessageSimulatorImpl class. For the null message strategy the global all-to-all gather is not required; LPs only need to communicate with LPs that share point-to-point links. The algorithm to use is controlled by setting the ns-3 global value SimulatorImplementationType.

The strategy can be selected according to the value of nullmsg. If nullmsg is true, the local communication (null message) strategy is selected; if nullmsg is false, the globally synchronized strategy is selected. This parameter can be passed either as a command-line argument or by directly modifying the simulation scenario.
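
For example, assuming the scenario below is available as ndn-simple-mpi, the strategy can be switched from the command line (ns-3's CommandLine accepts boolean values such as 0/1):

# use the null message (local communication) strategy
mpirun -np 2 ./waf --run="ndn-simple-mpi --nullmsg=1"

# use the default globally synchronized strategy
mpirun -np 2 ./waf --run=ndn-simple-mpi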

The best algorithm to use depends on the communication and event scheduling pattern of the application. In general, null message synchronization algorithms will scale better, because local communication scales better than the global all-to-all gather required by DistributedSimulatorImpl. There are two known cases where the global synchronization performs better. The first is when most LPs have point-to-point links with most other LPs, in other words the LPs are nearly fully connected; in this case the null message algorithm will generate more message-passing traffic than the all-to-all gather. The second is when there are long periods of simulation time when no events are occurring: the all-to-all gather algorithm can quickly determine the next event time globally, whereas the nearest-neighbor behavior of the null message algorithm requires more communication to propagate that knowledge, since each LP is only aware of its neighbors' next event times.

The following code is all that is necessary to run this simple parallel scenario:

#include "ns3/core-module.h"
#include "ns3/network-module.h"
#include "ns3/point-to-point-module.h"
#include "ns3/ndnSIM-module.h"
#include "ns3/mpi-interface.h"

#ifdef NS3_MPI
#include <mpi.h>
#else
#error "ndn-simple-mpi scenario can be compiled only if NS3_MPI is enabled"
#endif

namespace ns3 {

int
main(int argc, char* argv[])
{
  // setting default parameters for PointToPoint links and channels
  Config::SetDefault("ns3::PointToPointNetDevice::DataRate", StringValue("1Gbps"));
  Config::SetDefault("ns3::PointToPointChannel::Delay", StringValue("1ms"));
  Config::SetDefault("ns3::DropTailQueue::MaxPackets", StringValue("10"));

  bool nullmsg = false;

  // Read optional command-line parameters (e.g., enable visualizer with ./waf --run=<> --visualize)
  CommandLine cmd;
  cmd.AddValue("nullmsg", "Enable the use of null-message synchronization", nullmsg);
  cmd.Parse(argc, argv);

  // Distributed simulation setup; by default use granted time window algorithm.
  if (nullmsg) {
    GlobalValue::Bind("SimulatorImplementationType",
                      StringValue("ns3::NullMessageSimulatorImpl"));
  }
  else {
    GlobalValue::Bind("SimulatorImplementationType",
                      StringValue("ns3::DistributedSimulatorImpl"));
  }

  // Enable parallel simulator with the command line arguments
  MpiInterface::Enable(&argc, &argv);

  uint32_t systemId = MpiInterface::GetSystemId();
  uint32_t systemCount = MpiInterface::GetSize();

  if (systemCount != 2) {
    std::cout << "Simulation will run on a single processor only" << std::endl
              << "To run using MPI, run" << std::endl
              << "  mpirun -np 2 ./waf --run=ndn-simple-mpi" << std::endl;
  }

  // Creating nodes

  // consumer node is associated with system id 0
  Ptr<Node> node1 = CreateObject<Node>(0);

  // producer node is associated with system id 1 (or 0 when running on single CPU)
  Ptr<Node> node2 = CreateObject<Node>(systemCount == 2 ? 1 : 0);

  // Connecting nodes using a link
  PointToPointHelper p2p;
  p2p.Install(node1, node2);

  // Install NDN stack on all nodes
  ndn::StackHelper ndnHelper;
  ndnHelper.InstallAll();

  ndn::FibHelper::AddRoute(node1, "/prefix/1", node2, 1);
  ndn::FibHelper::AddRoute(node2, "/prefix/2", node1, 1);

  // Installing applications
  ndn::AppHelper consumerHelper("ns3::ndn::ConsumerCbr");
  consumerHelper.SetAttribute("Frequency", StringValue("100")); // 100 interests a second

  ndn::AppHelper producerHelper("ns3::ndn::Producer");
  producerHelper.SetAttribute("PayloadSize", StringValue("1024"));

  // Run consumer application on the first processor only (if running on 2 CPUs)
  if (systemCount != 2 || systemId == 0) {
    consumerHelper.SetPrefix("/prefix/1"); // request /prefix/1/*
    consumerHelper.Install(node1);

    producerHelper.SetPrefix("/prefix/2"); // serve /prefix/2/*
    producerHelper.Install(node1);

    ndn::L3RateTracer::Install(node1, "node1.txt", Seconds(0.5));
  }

  // Run consumer application on the second processor only (if running on 2 CPUs)
  if (systemCount != 2 || systemId == 1) {
    consumerHelper.SetPrefix("/prefix/2"); // request /prefix/2/*
    consumerHelper.Install(node2);

    producerHelper.SetPrefix("/prefix/1"); // serve /prefix/1/*
    producerHelper.Install(node2);

    ndn::L3RateTracer::Install(node2, "node2.txt", Seconds(0.5));
  }

  Simulator::Stop(Seconds(400.0));

  Simulator::Run();
  Simulator::Destroy();

  MpiInterface::Disable();
  return 0;
}

} // namespace ns3


int
main(int argc, char* argv[])
{
  return ns3::main(argc, argv);
}

If this code is placed into scratch/ndn-simple-mpi.cpp or NS-3 is compiled with examples enabled, you can compare runtime on one and two CPUs using the following commands:

# 1 CPU
mpirun -np 1 ./waf --run=ndn-simple-mpi

# 2 CPUs
mpirun -np 2 ./waf --run=ndn-simple-mpi
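
One way to collect measurements like those reported below is the shell's time utility (a usage sketch; the exact output format varies by shell):

# measure wall-clock (real), user, and system time of the 2-CPU run
time mpirun -np 2 ./waf --run=ndn-simple-mpi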

The following table summarizes 9 executions on OS X 10.10 with a 2.3 GHz Intel Core i7: on a single CPU, on two CPUs with global synchronization, and on two CPUs with null message synchronization:

# of CPUs     Real time, s    User time, s    System time, s
1             20.9 ± 0.14     20.6 ± 0.13     0.2 ± 0.01
2 (global)    11.1 ± 0.13     21.9 ± 0.24     0.2 ± 0.02
2 (nullmsg)   11.4 ± 0.12     22.4 ± 0.21     0.2 ± 0.02

Note that MPI will not always result in simulation speedup and can actually degrade performance. If that happens, it means that either the network is not properly partitioned or the simulation cannot take advantage of the partitioning (e.g., the simulation time is dominated by the application on one node).