parallel.cluster.Hadoop
Hadoop cluster for mapreducer, mapreduce, and tall arrays
A parallel.cluster.Hadoop object provides access to a cluster for configuring mapreducer, mapreduce, and tall arrays.
A parallel.cluster.Hadoop object has the following properties.
AdditionalPaths
AttachedFiles
AutoAttachFiles
ClusterMatlabRoot
HadoopConfigurationFile
HadoopInstallFolder
HadoopProperties
LicenseNumber
RequiresOnlineLicensing
SparkInstallFolder
SparkProperties
When you offload computations to workers, any files that the client needs for computations must also be available on workers. By default, the client attempts to detect and attach these files. To turn off automatic detection, set the AutoAttachFiles property to false. If the software cannot find all the files, or if sending files from client to worker is slow, use one of these options.
If the files are in a folder that is not accessible on the workers, set the AttachedFiles property. The cluster copies each file you specify from the client to the workers.
If the files are in a folder that is accessible on the workers, you can set the AdditionalPaths property instead. Use the AdditionalPaths property to add paths to the MATLAB® search path for each worker and avoid copying files unnecessarily from the client to the workers. Both approaches are sketched below.
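For example, this sketch sets all three properties on a Hadoop cluster object; the file name and folder path are hypothetical and must match your own setup.

cluster = parallel.cluster.Hadoop;
cluster.AutoAttachFiles = false;                    % turn off automatic file detection
cluster.AttachedFiles = {'myHelper.m'};             % copied from the client to each worker
cluster.AdditionalPaths = {'/shared/matlab/utils'}; % folder already visible to the workers
mapreducer(cluster);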
HadoopProperties allows you to override configuration properties for Hadoop. See the list of properties in the Hadoop® documentation.
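For example, assuming HadoopProperties accepts name-value assignments in the same way as the SparkProperties examples below, you could route jobs to a specific YARN queue (the queue name here is hypothetical):

cluster = parallel.cluster.Hadoop;
cluster.HadoopProperties('mapreduce.job.queuename') = 'analytics'; % hypothetical queue name
mapreducer(cluster);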
By default, the SparkInstallFolder property is set to the SPARK_HOME environment variable. Spark is required for tall array evaluation on Hadoop (but not for mapreduce). For a correctly configured cluster, you only need to set the installation folder.
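If SPARK_HOME is not set on the client, you can specify the installation folder directly. The path in this sketch is hypothetical.

cluster = parallel.cluster.Hadoop;
cluster.SparkInstallFolder = '/opt/spark'; % hypothetical Spark install location
mapreducer(cluster);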
SparkProperties allows you to override configuration properties for Spark. See the list of properties in the Spark® documentation.
For further help, type:
help parallel.cluster.Hadoop
Spark-enabled Hadoop clusters place limits on how much memory is available. You must adjust these limits to support your workflow.
The amount of data gathered to the client is limited by the Spark properties:
spark.driver.memory
spark.executor.memory
The data gathered from a single Spark task must fit within the memory set by these properties. A single Spark task processes one block of data from HDFS, which is 128 MB by default. If you gather a tall array containing most of the original data, you must ensure that these properties are set large enough to hold it.
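As a sketch of the kind of operation these limits govern (the HDFS path and variable name are hypothetical):

cluster = parallel.cluster.Hadoop;
mapreducer(cluster);
ds = datastore('hdfs:///data/airlines/*.csv'); % hypothetical dataset on HDFS
t = tall(ds);
delays = gather(t.ArrDelay); % gathering most of the original data must fit within the limits above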
If these properties are set too small, you see an error like the following.
Error using tall/gather (line 50)
Out of memory; unable to gather a partition of size 300m from Spark.
Adjust the values of the Spark properties spark.driver.memory and
spark.executor.memory to fit this partition.
Adjust the properties either in the default settings of the cluster or directly in MATLAB. To adjust the properties in MATLAB, add name-value pairs to the SparkProperties property of the cluster. For example:
cluster = parallel.cluster.Hadoop;
cluster.SparkProperties('spark.driver.memory') = '2048m';
cluster.SparkProperties('spark.executor.memory') = '2048m';
mapreducer(cluster);
The amount of working memory for a MATLAB Worker is limited by the Spark property:
spark.yarn.executor.memoryOverhead
By default, this is set to 2.5 GB. You typically need to increase this limit if you use arrayfun, cellfun, or custom datastores to generate large amounts of data in one go. Increase it also if you encounter lost or crashed Spark executor processes.
You can adjust these properties either in the default settings of the cluster or directly in MATLAB. To adjust the properties in MATLAB, add name-value pairs to the SparkProperties property of the cluster. For example:
cluster = parallel.cluster.Hadoop;
cluster.SparkProperties('spark.yarn.executor.memoryOverhead') = '4096m';
mapreducer(cluster);
Introduced in R2014b
See Also
parallel.Cluster | parallel.Pool