I'm working with a large dataset of approximately 1 billion distances. I am able to read the .dist file with a cutoff of 0.06 and a names file; the read took 1.5 hours.
However, the subsequent cluster command has run for 28,000 minutes on a Linux (CentOS v5) cluster node with 128 GB of RAM, using 85 GB. About 85 GB of virtual memory was also in use, but there was no indication, nor any reason to believe, that disk swapping was occurring. It's difficult to predict how long the analysis should take, but under 400 hours seems reasonable at 100% CPU usage (2.4 GHz). Please correct me if I'm wrong.
Looking around this site, we noticed this:
If you are analyzing large data sets (e.g. from pyrosequencing) on a 64-bit system and your computer has more than 2 GB of RAM, you can make the build use 64-bit pointers by opening the makefile and changing the line that reads:
CC_OPTIONS = -O3
to:
CC_OPTIONS = -O3 -mtune=native -march=native -m64
Because our compiler does not accept "native" as an architecture name, we tried this change to the makefile instead:
CC_OPTIONS = -O3 -mtune=core2 -march=core2 -m64
LNK_OPTIONS = -O3 -mtune=core2 -march=core2 -m64
After rebuilding the software, I ran the .dist file again. Once more the read took about 1.5 hours, and the clustering is still going, and going, using 85 GB of RAM.
Can someone suggest: (a) a better way to run this clustering (the client really wants to analyze the dataset as a whole if at all possible), or (b) a better change to the makefile that could help?
Are you positive that you are using cutoff=0.06? How many distances are <=0.06?
--Pat Schloss 19:24, 5 September 2009 (EDT)
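The question about how many distances fall at or below the cutoff can be answered with a quick pass over the file. This is a minimal sketch, assuming a column-formatted distance file (three whitespace-separated fields per line: two sequence names and a distance); the function name is my own:

```python
# Count pairwise distances at or below a cutoff in a column-format
# .dist file (fields: seqA seqB distance, whitespace-separated).
def count_below_cutoff(path, cutoff=0.06):
    kept = 0
    with open(path) as handle:
        for line in handle:
            parts = line.split()
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            if float(parts[2]) <= cutoff:
                kept += 1
    return kept
```

If the count comes out far larger than expected, that is a hint the cutoff was not applied when the file was written.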
I just wanted to close out this discussion. I was certain about the cutoff and the number of distances reported.
BUT: It's now clear that spaces and hyphens in the groups file each cause different problems. Spaces in group names increase the number of groups (e.g. "ThisGroupIsOneGroup" is one group, while "This Group Is Five Groups" becomes five). This is what caused the large dataset to blow up in size. Group naming has to be done carefully, and there may be more disallowed characters that I'm not aware of.
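The space/hyphen problem described above can be caught before a run with a quick check. This is a minimal sketch, assuming the two-column groups-file layout (sequence name, then group name); the function name and messages are my own. Since fields are split on whitespace, a space inside a group name shows up as extra columns, which is exactly what the check flags:

```python
# Flag lines in a groups file whose group names would be misparsed:
# extra whitespace (name splits into several groups) or a hyphen.
def check_groups(lines):
    problems = []
    for number, line in enumerate(lines, start=1):
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        if len(parts) != 2:
            problems.append((number, "whitespace splits the group name"))
        elif "-" in parts[1]:
            problems.append((number, "hyphen in group name"))
    return problems
```

Running this over the groups file before clustering would have caught the inflated group count early.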
Thanks for your help. I've had no problem using multiple processors on my 64-bit systems.