SOCIS 2012: Design ideas to improve Gerbil scalability

Disclaimer: The design concepts in this document should be understood as a first sketch; they are by no means final. There may be errors or deficiencies in the design integrity and pseudocode, which will be gradually resolved during implementation. The purpose of this document is to establish common ground for discussion and gradual refinement of the design.

Synchronizing threads on shared data

Most of the big data structures that are currently owned by one of the widgets need to be modified in the background in order to keep the GUI responsive. The GUI thread should only read this data for drawing purposes. To support this, the data structures have to be encapsulated in a wrapper class providing a read-write lock that allows multiple readers but only a single writer in the critical section. The readers would read-lock and unlock those structures regularly when refreshing the GUI. A background thread would allocate a new data structure, read-lock the original one, copy any usable data (e.g. the intersection with the old ROI), unlock, calculate the rest of the new data, write-lock the original data structure, swap the internal plain pointer, unlock, and deallocate the original data. Regarding the technology used for locks, I would prefer boost over Qt, as its synchronization support is more carefully designed. Another possibility is to use mutexes from TBB, which are somewhat similar to the boost ones.

#include <boost/thread/shared_mutex.hpp>
#include <boost/thread/locks.hpp>

template<class T>
class SharedData {
public:
	typedef boost::shared_mutex Guard;
	typedef boost::unique_lock<boost::shared_mutex> Write;
	typedef boost::shared_lock<boost::shared_mutex> Read;
	Guard lock;

	SharedData() : mData(NULL) {}
	SharedData(T *data) : mData(data) {}
	virtual ~SharedData() { delete mData; }
	T *swap(T *newData); // set new data and return old data
	T &operator*() { return *mData; }
	T *operator->() { return mData; }
protected:
	T *mData;
private:
	SharedData(const SharedData<T> &other); // non-copyable
	SharedData<T> &operator=(const SharedData<T> &other); // non-assignable
};
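
The swap() member is the only non-obvious operation, so here is a minimal sketch of how it could be implemented; assuming the caller holds the write lock, it merely exchanges the internal pointer and hands ownership of the old data back to the caller:

template<class T>
T *SharedData<T>::swap(T *newData) {
	T *oldData = mData; // caller is responsible for deleting the old data
	mData = newData;
	return oldData;
}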

To pass such a structure between threads, it should be stored in the form of a pointer. Although even a plain pointer would be manageable in Gerbil, it is more appropriate to use some sort of reference-counting smart pointer, so we do not have to be concerned about which thread or object owns the data structure. Regarding the technology used for the smart pointer, I would prefer QSharedPointer, because it is more compatible with the rest of the Qt framework (e.g. sending the pointer through a Qt::QueuedConnection signal, should such a thing ever be required).

typedef boost::shared_ptr<SharedData<multi_img> > multi_img_ptr;

class SomeWidget {
public:
	int GetBandCount();
private:
	multi_img_ptr image;
};

int SomeWidget::GetBandCount() {
	SharedData<multi_img>::Read lock(image->lock); // RAII-style locking
	return (*image)->size(); // double dereference to reach multi_img through both wrappers
	// automatic unlock when the lock goes out of scope
}

Leveraging existing mechanisms for memory management

Let us imagine the ideal hardware/software environment for Gerbil and think about the implications such an environment would have on memory management. First, from the hardware perspective: there would obviously be large physical host memory (4-16 GB) and a graphics card with large dedicated device memory (1-3 GB). Additionally, there would be a solid-state disk in the system (even the cheapest 64 GB SSD would be enough). From the software perspective, there would be a 64-bit OS, which also implies that the graphics driver would be 64-bit as well. As a last requirement, the system page file should be very large and assigned to the solid-state disk.

Now, if Gerbil were compiled into a 64-bit binary, there would be no need for application-level memory management whatsoever, because everything would be automatically and transparently handled by the low-level paging of the OS and the graphics driver. Since a 64-bit application can address several TB of data (depending on the particular OS) and modern graphics drivers manage their dedicated memory with a paging mechanism (see here and here), there would be a very effective memory cascade. More specifically, the graphics driver would offload pages into host memory and the OS would offload pages into the page file (which would reside on a fast SSD). It is important to stress how incredibly effective this would be - there is almost no extra overhead (because paging is low level), it is fine-grained (pages are significantly smaller than most of the Gerbil data structures), it is adaptable (page allocators employ strategies like least-recently-used), and most importantly it has all the information needed to make the best possible decisions (it has global information from all the processes, and it can calculate statistics about the past behaviour of processes and extrapolate this behaviour to infer future demands). There is no way of doing something like that in a single user-space process.

The situation is very different, however, if Gerbil is compiled into a 32-bit binary. Suddenly, memory addresses become a very scarce commodity, because the virtual address space is just 2-3 GB large (depending on the particular OS). In such a situation, all unnecessary data structures have to be offloaded out of the address space and mapped back only on demand. There are two possibilities for doing so (and they could be combined). Either the data would be offloaded into graphics device memory and the rest would be handled by the memory cascade as described above (because the graphics driver runs in 64-bit mode), or, if all the physical frames (device+host+pagefile) are used, the data would be serialized into a binary archive on the SSD (or a conventional HDD in the worst case).

Freeing 32-bit virtual address space

To support offloading, memory-intensive data structures would be encapsulated in a wrapper class providing fetch and offload operations. The offload operation would be able to make informed decisions based on OpenCV gpu::DeviceInfo (CUDA availability, free/used device memory) and on whether Gerbil runs in a 32-bit or 64-bit address space. More specifically, the offload operation would either do nothing (in 64-bit mode) or decide whether to offload into device memory or to serialize into a binary archive on the hard drive. It is important to note that offloading to device memory is directly available only for multi-dimensional arrays of primitive data types; therefore tree-like data structures would first have to be serialized into an unsigned char array. The fetch operation would simply do nothing (if the data is already available) or would download the data either from device memory or from the hard drive (with a possible deserialization step in both cases).

Regarding the required technologies, it is enough to use the already existing dependencies of Gerbil. After reading the OpenCV docs, it seems there is no need to add osgCompute to the project - gpu::GpuMat and gpu::CudaMem from OpenCV provide almost exactly the same functionality as osgCompute::Memory. Additionally, the OpenCV gpu namespace provides a wide range of matrix operations (matrix-wise, element-wise, reduce, etc.) which could be useful for background calculation (whereas in the case of osgCompute, we would have to write our own CUDA kernels). As for serialization, there is very powerful support for it in the boost library (including pointer tracking for tree-like structures), and serialization into binary archives is also quite fast. If the serialized archive is streamed into a file, it is also necessary to generate and store an appropriate filename, so it is possible to load the archive later. For this purpose, either microsecond timestamps or randomly generated numbers could be used (both of which are also available in boost).
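
A minimal sketch of how such an archive could be written, assuming the timestamp-based naming scheme (the function name and filename prefix are illustrative):

#include <fstream>
#include <sstream>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>

// Serialize a structure into a binary archive on disk and return the
// generated filename; T must provide a boost::serialization function.
template<class T>
std::string offloadToFile(const T &data) {
	std::ostringstream name;
	name << "gerbil_offload_" << boost::posix_time::to_iso_string(
		boost::posix_time::microsec_clock::universal_time()) << ".bin";
	std::ofstream file(name.str().c_str(), std::ios::binary);
	boost::archive::binary_oarchive archive(file);
	archive << data;
	return name.str();
}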

template<class T>
class OffloadableSharedData : public SharedData<T> {
public:
	// ctors, dtor
	void fetch(); // mostly called automatically through dereference operators
	void offload(); // called explicitly
	T &operator*() { fetch(); return *this->mData; } // this-> reaches the dependent base member
	T *operator->() { fetch(); return this->mData; }
protected:
	std::string filename; // !empty() means data is currently offloaded to a file
	std::vector<gpu::GpuMat> devmem; // !empty() means data is currently offloaded to the device
	boost::mutex iomtx; // serializes calls to fetch/offload between threads
};

To be able to offload a data structure, it has to be serializable (i.e. implement the corresponding boost function) and additionally should be able to convert itself into/from a vector of gpu::GpuMat. The data structure could decide for itself what primitive type and how many matrices are needed for its storage. This would allow low overhead for structures that are already in the preferred format, as they could just pass themselves along with almost no conversion. On the other hand, more complex structures could serialize themselves into a single byte matrix.

class multi_img {
public:
	void store(std::vector<gpu::GpuMat> &storage);
	void restore(std::vector<gpu::GpuMat> &storage);

private:
	friend class boost::serialization::access;
	template<typename Archive>
	void serialize(Archive &ar, const unsigned int version) {
		ar & bands;
	}
};
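
A sketch of what the store/restore pair could look like, assuming the bands of multi_img are held in a std::vector of cv::Mat-compatible matrices (as the serialize function above suggests):

// Upload each band into its own GpuMat; one device matrix per band.
void multi_img::store(std::vector<gpu::GpuMat> &storage) {
	storage.resize(bands.size());
	for (size_t i = 0; i < bands.size(); ++i)
		storage[i].upload(bands[i]); // host -> device copy
}

// Download the bands back into host memory.
void multi_img::restore(std::vector<gpu::GpuMat> &storage) {
	bands.resize(storage.size());
	for (size_t i = 0; i < storage.size(); ++i)
		storage[i].download(bands[i]); // device -> host copy
}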

Because Gerbil is primarily targeted at 64-bit, the whole offloading concept is of low priority and could be skipped entirely in the final implementation. Still, I consider it interesting to mention how the desperate situation on 32-bit could be improved. In case it is implemented, it should be used only for data structures that are used infrequently (e.g. multi_img instances) but not for data used for rendering (e.g. hash tables in the viewport). Also, such data would mostly be required in the background thread, where the overhead of fetch/offload would not block the GUI. If use from the GUI thread were still desirable, fetch/offload could regularly call QCoreApplication::processEvents() to keep the GUI at least partially responsive. Anything more complex would have such a big architectural impact that it could not be justified against the fact that all this effort is relevant only for the 32-bit address space.

Preparing tasks for background calculation

In order to provide a flexible and generic way of specifying tasks for background calculation, there has to be a mechanism to communicate input and output values between the GUI and the worker thread. A naive approach would be to use the Qt::QueuedConnection signal-slot mechanism and leave all the hard work to Qt; however, we would then have almost no control over the queue and argument copying, which would prevent cancellation and could lead to various inefficiencies. Thus, it is best to declare our own data structures:

struct Rectangle {
	Rectangle();
	#ifdef GERBIL_GUI
	Rectangle(const QRect &rect);
	#endif
	Rectangle(int x, int y, int w, int h);
	int x; int y; int w; int h;
};

// Note: for the signals below to work in GUI builds, this class would
// have to derive from QObject and carry the Q_OBJECT macro.
class BackgroundTask {
public:
	BackgroundTask();
	virtual ~BackgroundTask();
	virtual void run() = 0;
	virtual void cancel() = 0;
	void update(int percent); // emit progress(percent) signal
	void done(); // signal condition variable, emit progress(100) signal, emit finished() signal
	void wait(); // wait on the condition variable
	Rectangle roi() { return mRoi; }
#ifdef GERBIL_GUI
signals:
	void progress(int percent);
	void finished();
#endif
protected:
	boost::condition_variable mFuture;
	Rectangle mRoi; // either target ROI for this calculation or empty for non-ROI calculations
};
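
The done()/wait() pair deserves a closer look, because a bare condition variable needs an accompanying mutex and completion flag to be used safely. A minimal sketch, assuming two additional members (mMutex and mDone) that are not shown in the class above:

// Mark the task as completed and wake any thread blocked in wait().
void BackgroundTask::done() {
	{
		boost::lock_guard<boost::mutex> guard(mMutex); // assumed member
		mDone = true; // assumed member, initialized to false in the ctor
	}
	mFuture.notify_all();
#ifdef GERBIL_GUI
	emit progress(100);
	emit finished();
#endif
}

// Block the calling thread until done() has been called.
void BackgroundTask::wait() {
	boost::unique_lock<boost::mutex> guard(mMutex);
	while (!mDone)
		mFuture.wait(guard); // releases the mutex while sleeping
}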

class SomeSerialTask : public BackgroundTask {
public:
	SomeSerialTask(multi_img_ptr image) : mImage(image) {}
	~SomeSerialTask();
	void run() { SomeExistingGerbilFunction(mImage, this); }
	void cancel() { /* no-op */ }
private:
	multi_img_ptr mImage;
};

class SomeTbbTask : public BackgroundTask {
public:
	SomeTbbTask(Rectangle &roi, multi_img_ptr image) : mImage(image) { mRoi = roi; }
	~SomeTbbTask();
	void run() {
		SharedData<multi_img>::Read rlock(mImage->lock);
		multi_img *newImage = new multi_img(/* size, etc */);
		SomeTbbFunctor algo(mImage, newImage);
		tbb::parallel_for(/* data range */, algo, tbb::auto_partitioner(), mStopper);
		rlock.unlock();
		if (mStopper.is_group_execution_cancelled()) {
			mStopper.reset();
			delete newImage; // discard the unfinished result
			return;
		} else {
			SharedData<multi_img>::Write wlock(mImage->lock);
			multi_img *oldImage = mImage->swap(newImage);
			wlock.unlock();
			delete oldImage; // deallocate outside the critical section
		}
	}
	void cancel() { mStopper.cancel_group_execution(); }
protected:
	tbb::task_group_context mStopper;
	class SomeTbbFunctor {
	public:
		SomeTbbFunctor(multi_img_ptr image, multi_img *newImage) : mImage(image), mNewImage(newImage) {}
		void operator()(/* data range */) const; // copy overlapping parts, calculate new parts
	private:
		multi_img_ptr mImage;
		multi_img *mNewImage;
	};
private:
	multi_img_ptr mImage;
};

class SomeCudaTask : public BackgroundTask {
public:
	SomeCudaTask(Rectangle &roi, multi_img_ptr image) : mImage(image) { mRoi = roi; }
	~SomeCudaTask();
	void run() { /* usage of OpenCV gpu namespace */ }
	void cancel() { /* no-op */ }
private:
	multi_img_ptr mImage;
};

Notice how the sketched classes allow incremental improvement of the Gerbil code. First of all, before implementing any parallel algorithms, it would be enough to just prepare wrappers for existing Gerbil functions. These wrappers would basically just carry the arguments between threads. So, for example, if some function of the multi_img class should be computed in the background, it is enough to write just a few lines of code for a BackgroundTask inheritor, while the original function can be kept without any changes. Only later would parallel versions of the function be re-implemented as further BackgroundTask inheritors, be they based on TBB or CUDA.

Also notice that the GUI thread has two options for getting notified about a finished task. It can call wait() on the dispatched task, which will block until the task is completed, or it can connect the finished() signal to some custom slot via a cross-thread Qt::QueuedConnection, which allows non-blocking asynchronous notification. While the second option is the only correct one for keeping the GUI responsive, the first option could be helpful during the incremental refactoring of Gerbil (i.e. it would be possible to put a calculation into the background and test it before being bothered by bugs caused by an asynchronous GUI).
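
A short sketch of the non-blocking variant; it assumes BackgroundTask derives from QObject in GUI builds, and the slot name is illustrative:

BackgroundTaskPtr task(new SomeTbbTask(roi, image));
// Queued connection: the slot runs later in the GUI thread's event loop.
QObject::connect(task.get(), SIGNAL(finished()),
	&window, SLOT(onRoiRecalculated()), Qt::QueuedConnection);
BackgroundTaskQueue::instance.Push(task);
// Blocking alternative, useful during incremental refactoring:
// task->wait();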

Preparing infrastructure for background calculation

The infrastructure itself is actually quite difficult to describe in words (well, any parallelized code is), so I decided to describe it with a series of commented code snippets.

typedef boost::shared_ptr<BackgroundTask> BackgroundTaskPtr;

class BackgroundTaskQueue {
public:
	void Halt() {
		Lock lock(mGuard);
		mHalted = true;
		mTaskQueue.clear(); // Flush the queue, so there is nothing else to pop.
		if (mCurrentTask) {
			mCancelled = true;
			mCurrentTask->cancel();
		}
		lock.unlock(); // Unlock before notifying, so the woken thread does not immediately block on the mutex.
		mFuture.notify_all(); // In case the thread is sleeping.
	}

	bool Pop() {
		Lock lock(mGuard);
		while (mTaskQueue.empty()) {
			if (mHalted) {
				return false; // Thread will terminate.
			} else {
				mFuture.wait(lock); // Yields lock until signalled.
			}
		}
		mCurrentTask = mTaskQueue.front(); // Fetch the task.
		mTaskQueue.pop_front();
		mCancelled = false;
		return true;
	}

	void Push(BackgroundTaskPtr &task) {
		Lock lock(mGuard);
		mTaskQueue.push_back(task);
		lock.unlock(); // Unlock before notifying, so the woken thread does not immediately block on the mutex.
		mFuture.notify_all();
	}

	void CancelTasks(const Rectangle &roi) {
		Lock lock(mGuard);
		std::deque<BackgroundTaskPtr>::iterator it = mTaskQueue.begin();
		while (it != mTaskQueue.end()) {
			if ((*it)->roi() == roi) { // assumes Rectangle provides operator==
				it = mTaskQueue.erase(it); // erase returns the next valid iterator
			} else {
				++it;
			}
		}
		if (mCurrentTask && mCurrentTask->roi() == roi) {
			mCancelled = true;
			mCurrentTask->cancel();
		}
	}

	// Background thread's main().
	void operator()() {
		while (true) {
			if (!Pop()) {
				break; // Thread termination.
			}
			mCurrentTask->run();
			{
				Lock lock(mGuard);
				if (!mCancelled) {
					mCurrentTask->done();
				}
				mCurrentTask.reset(); // Release the finished task.
			}
		}
	}

	static BackgroundTaskQueue instance; // eager singleton instance

private:
	BackgroundTaskQueue() : mHalted(false), mCancelled(false) {}
	BackgroundTaskQueue(const BackgroundTaskQueue &other); // singleton - do not implement
	BackgroundTaskQueue &operator=(const BackgroundTaskQueue &other); // singleton - do not implement

	typedef boost::mutex Guard;
	typedef boost::unique_lock<Guard> Lock;
	bool mHalted; // Used for background thread termination.
	bool mCancelled; // Discard results of current task.
	boost::condition_variable mFuture; // Wakes sleeping thread.
	Guard mGuard; // Serializes thread access to the queue.
	BackgroundTaskPtr mCurrentTask; // Currently calculated task.
	std::deque<BackgroundTaskPtr> mTaskQueue;
};

int main(int argc, char **argv) {
	QApplication app(argc, argv);
	// boost::ref avoids copying the non-copyable singleton into the thread.
	boost::thread background(boost::ref(BackgroundTaskQueue::instance));
	multi_img *image = new multi_img(argv[1]); // image filename from the command line
	ViewerWindow window(image);
	window.show();
	app.exec();
	BackgroundTaskQueue::instance.Halt();
	background.join();
	return 0;
}

Improving efficiency of recalculations

Once all the infrastructure above is implemented, the functions for copying and calculating data should first be made more dynamic (with regard to ROI changes) and then parallelized. For multi-dimensional data composed of plain primitive values (e.g. multi_img, cv::Mat, QImage), this is pretty obvious - parallelization is completely natural here, and efficient dynamicity could be achieved by copying the intersection with the old ROI and calculating new values only for the rest of the area. There are, however, data structures for which it is not possible to recalculate new values - for example, labeling must be remembered globally between ROI changes, because it is created by user actions. For such data structures, it will be sufficient to just edit the ROI descriptor when the ROI is changed, because their data is produced between ROI changes by user interaction with the program.
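
As a sketch of the intersection-based reuse, using cv::Rect (whose operator& computes the intersection) over per-band cv::Mat vectors; the variable names are illustrative:

// Copy the part of the old ROI that overlaps the new one; the rest of
// the new ROI has to be computed from scratch.
cv::Rect common = oldRoi & newRoi; // intersection in global image coordinates
if (common.area() > 0) {
	cv::Rect src(common.x - oldRoi.x, common.y - oldRoi.y,
		common.width, common.height); // intersection in old-image coordinates
	cv::Rect dst(common.x - newRoi.x, common.y - newRoi.y,
		common.width, common.height); // intersection in new-image coordinates
	for (size_t b = 0; b < oldBands.size(); ++b) {
		cv::Mat target = newBands[b](dst); // view into the new band
		oldBands[b](src).copyTo(target); // per-band copy of the overlap
	}
}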

As for more complex data structures, there are only the map-like hash tables used for pre-processing the OpenGL rendering in Viewport. Parallelization could be achieved by changing QHash to one of the parallel map-like structures available in the TBB library (tbb::concurrent_hash_map seems best for the job). Dynamicity could be improved by updating histograms only with the non-intersecting areas of the old and new ROI (however, this is beneficial only when those areas together are smaller than the complete area of the new ROI).
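
To illustrate, a minimal sketch of concurrent insertion with tbb::concurrent_hash_map; the key and bin types are placeholders for whatever Viewport actually hashes:

#include <tbb/concurrent_hash_map.h>

typedef tbb::concurrent_hash_map<std::string, BinType> BinMap; // BinType is illustrative

void insertPixel(BinMap &bins, const std::string &key, float value) {
	BinMap::accessor entry; // holds a write lock on the bucket entry
	bins.insert(entry, key); // creates a default BinType if key is new
	entry->second.add(value); // hypothetical BinType::add
	// the lock is released when the accessor goes out of scope
}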

However, to support these ROI-based updates together with possible cancellation (i.e. the user changes the ROI again while the previous change is still not calculated for all data), all those data structures need to have an internal ROI descriptor, so it is always clear what part of the global image coordinate system a particular data structure contains. It is important to note, however, that for some operations (e.g. PCA), the intersection with the old ROI is irrelevant, because they must be recomputed completely for the new ROI.

To further improve GUI responsiveness, the OpenGL rendering could also be moved to the background. While the OpenGL state machine does not support parallelization (i.e. rendering would still be single-threaded), QGLWidget allows handling the paint event in a background thread (see here).
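
A rough sketch of how that hand-off could look (Qt 4.8 documents this pattern; the class and member names are illustrative, and the GUI thread has to release the context via doneCurrent() before the worker takes it):

class GLRenderThread : public QThread {
public:
	GLRenderThread(QGLWidget *canvas) : mCanvas(canvas) {}
protected:
	void run() {
		mCanvas->makeCurrent(); // take over the GL context in this thread
		// ... issue the expensive GL draw calls here ...
		mCanvas->swapBuffers();
		mCanvas->doneCurrent(); // give the context back
	}
private:
	QGLWidget *mCanvas;
};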

Removing redundancies from Gerbil code

As described above, when running 64-bit Gerbil on modern hardware and software, there is actually no problem with memory shortage, and all the data redundancies and different representations of the same data could actually be beneficial to the application, speeding up the planned dynamicity. I would say that if we decide to reuse already calculated data, we should definitely rely on the low-level paging mechanism and not try to implement some customized memory management on top of it (as it would be slower and less sophisticated - consider boost serialization versus paging). It is still a good question, however, whether it would be better to just deallocate the background data (i.e. data not used for rendering) and calculate it again from scratch when needed. While I cannot support my claims with fine-grained measurements from a profiler, I think (based on observing the CPU ticks and memory space consumed by the qgerbil process in the task manager) that loading from the page file could be significantly faster than recalculating the data on a symmetric multiprocessor (regardless of whether it is dual-core, quad-core or octa-core). It would probably be a whole different story if such data could be recalculated on the GPU, but then Gerbil performance would be completely dependent on CUDA availability. So I still find the best architectural trade-off in letting paging take care of memory management and reusing the data for recalculations (which should be executed on the GPU if possible).

Having said all the above, there are still some places in Gerbil where data redundancy actually causes slower calculation. First, all the QGLWidgets should be reimplemented in such a way that they work directly with textures on the graphics card. Currently, they work with QPixmaps (which reside in host memory), and every paint event copies such a QPixmap to a texture in device memory. Also, the hashing mechanism in Viewport could be improved so that hash keys are stored just once (currently each hash key exists in two copies - the first in shuffleIdx and the second implicitly contained in BinSet::bins). Lastly, there are some places in the source code where big data structures (QImage, cv::Mat) are passed to functions by value - while those structures internally support copy-on-write, so passing by value should be cheap, I would rather check all such occurrences in more detail to be sure there is no unwanted copying.
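
For the texture issue, QGLWidget already offers helpers that keep a texture on the card, so the per-paint upload could be avoided along these lines (a sketch; the variable names are illustrative):

// Inside a QGLWidget's paintGL(): bindTexture() caches the texture for
// a given QImage and returns the cached id on subsequent calls, so the
// pixel data is not re-uploaded on every repaint.
GLuint tex = bindTexture(renderedImage, GL_TEXTURE_2D);
drawTexture(QRectF(-1.0, -1.0, 2.0, 2.0), tex); // fill the viewport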

Implementation extent

The infrastructure for background tasks (SharedData, BackgroundTask, BackgroundTaskQueue) will be put into new source files in the Gerbil source tree. It is expected that most of the Gerbil widget classes will be quite heavily influenced by the refactoring (see the implementation schedule below for details). The multi_img class will also have to be improved to reuse data from the old ROI. Some of the data structures currently declared in the widget classes might have to be moved to dedicated files (so they can be shared more cleanly between compilation units) and improved to reuse the ROI as well. Regarding BackgroundTask inheritors, they must be implemented in the Gerbil source tree (not Vole), because only Gerbil will be aware of the infrastructure for background tasks. Similarly to multi_img_ext, the reasonable approach for Vole classes seems to be to create dedicated files for the parallelized functions in the Gerbil tree (e.g. multi_img_parallel.cpp). Parallel versions of functions that currently live in the widget classes might be implemented directly in those widget classes, next to their serial counterparts.

While there is definitely much functionality in Vole and Gerbil which could benefit from parallelization and asynchronous calculation in the background, my project will focus on the calculations involved in ROI changes. More specifically, I will not add such support for optional Gerbil features, or for features that are explicitly invoked by the user in between ROI changes. Although there might be some exceptions to this rule (e.g. background rendering in OpenGL widgets, or some essential core Gerbil functionality with an immediate impact on GUI performance in most usage scenarios), I am afraid there must be a clear line delimiting the project scope to avoid missing deadlines. However, my vision is to provide infrastructure and examples generic enough to be used even after my SOCIS involvement by other Gerbil developers to improve performance of the non-core functionality.

Implementation schedule

August 1 - August 14 Preparing development environment, compiling and installing dependencies, reading required documentation, studying current Gerbil source code, discussing design ideas with mentor.

August 15 - August 21 Vacation.

August 22 - August 31 Initial coding phase. Implementing infrastructure for parallel asynchronous calculations (SharedData, BackgroundTask, BackgroundTaskQueue). After this step, some new files will appear in the Gerbil source tree, but there will be no changes to the original source code.

September 1 - September 15 Porting Gerbil to the new infrastructure. This will involve improving data structures (e.g. multi_img) to leverage the old ROI during recalculations, preparing serial versions of background tasks and making the GUI asynchronous (i.e. delegation of intensive tasks to the background, implementation of slots reacting to task completion, refactoring of GUI code to ensure proper synchronization and to avoid race conditions, deadlocks, etc.). There might be some unexpected surprises during this coding phase, so while the majority of the code should be ported as described above, it might not be compilable or properly debugged yet.

September 15 - September 30 Resolving any problems from the previous period, testing and debugging. Additional refactoring to put OpenGL rendering into a dedicated background thread. At the end of September, the GUI thread should be free of any calculation-intensive tasks (in accordance with the implementation extent described in the previous chapter) and Gerbil shall be able to dynamically and asynchronously recalculate the ROI in the background (however, all the background computation will still be single-threaded). During this period, I will have to resolve some administrative matters at my university to begin the new academic year and also defend an already finished team software project, so considering travel times etc. I might be unavailable for 3-6 days.

October 1 - October 15 Implementing parallel versions of some background tasks (using TBB, OpenCV, or both). Implementation priority for individual tasks will be determined according to profiling and intuition. I am not going to promise that all the tasks will be parallelized, but the goal is clearly to improve Gerbil performance as much as possible in the short time frame I am given.

October 16 - October 28 Writing documentation. Polishing. Time reserve for dealing with unexpected problems during development. If time permits, I might try to add experimental support for the offloading concept described above.