In Hadoop, partitioning data allows huge volumes of data to be processed in parallel, so the entire dataset takes a minimum amount of time to process. Apache Spark decides partitioning based on several factors. Factors that decide the default partitioning:
- On Hadoop, partitions are based on HDFS input splits.
- filter or map functions don't change the number of partitions.
- The number of CPU cores in the cluster, when running in non-Hadoop mode.
Repartitioning increases the number of partitions; it rebalances partitions after a filter and increases parallelism.
You can set the number of partitions in Spark at the time of creating an RDD, as follows:
val users = sc.textFile("hdfs://at-r3p11:8020/project/users.csv", 1)
where the 2nd argument is nothing but the (minimum) number of partitions.
By default, if the path is not on HDFS, Spark creates partitions based on the number of cores; if it is an HDFS path, Spark creates partitions based on the input splits (the HDFS block size by default).
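As a rough sketch of the HDFS case (assuming the common 128 MB default block size; the real Hadoop split computation also accounts for minimum split size and the last partial block), the partition count can be estimated as:

```python
import math

def estimate_partitions(file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    """Rough estimate: one partition per HDFS input split (block)."""
    return max(1, math.ceil(file_size_bytes / block_size_bytes))

# A 1 GB file with 128 MB blocks yields 8 input splits, hence 8 partitions.
print(estimate_partitions(1024 * 1024 * 1024))  # 8
```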
To know the number of partitions, just enter users.partitions.size in spark-shell.
Spark can only run 1 concurrent task for every partition of an RDD, up to the number of cores in your cluster. So if you have a cluster with 50 cores, you want your RDDs to have at least 50 partitions (and probably 2-3x that).
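That rule of thumb can be written down directly (the 2-3x multiplier is the heuristic from the text, not a Spark setting):

```python
def recommended_partitions(total_cores, factor=3):
    """At least one partition per core; 2-3x that is a common heuristic."""
    return total_cores * factor

# For a 50-core cluster, aim for somewhere between 50 and 150 partitions.
print(recommended_partitions(50, factor=1))  # 50
print(recommended_partitions(50))            # 150
```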
As far as choosing a "good" number of partitions, you generally want at least as many as the number of executors, for parallelism. You can get this computed default by calling sc.defaultParallelism.
Also, the number of partitions determines how many files get generated by actions that save RDDs to files.
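For example, saving an RDD with an action such as saveAsTextFile writes one part-file per partition, using Hadoop-style part-NNNNN names; a small sketch of the resulting file names:

```python
def output_file_names(num_partitions):
    """One output file per partition, Hadoop-style part-NNNNN names."""
    return [f"part-{i:05d}" for i in range(num_partitions)]

print(output_file_names(3))  # ['part-00000', 'part-00001', 'part-00002']
```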
The maximum size of a partition is ultimately limited by the available memory of an executor.
In the first RDD transformation, e.g. reading from a file using sc.textFile(path, partition), the partition parameter will be applied to all further transformations and actions on this RDD.
When using textFile with compressed files (file.txt.gz or similar), Spark disables splitting, which makes for an RDD with only 1 partition (as reads against gzipped files cannot be parallelized). In this case, to change the number of partitions you should do repartitioning.
Some operations, e.g. map, don't preserve partitioning, since map can change the key the data was partitioned by. Operations like map and filter still apply their function partition by partition, without moving data between partitions.
rdd = sc.textFile('demo.gz')
rdd = rdd.repartition(100)
With these lines, you end up with rdd having exactly 100 partitions of roughly equal size.
repartition uses a shuffle to split the data to match the new partitioning; the data is distributed on a round-robin basis.
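The round-robin redistribution can be sketched in plain Python (a simulation of the shuffle's effect, not the PySpark API):

```python
def round_robin(rows, num_partitions):
    """Deal rows across partitions one at a time, like dealing cards."""
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions

parts = round_robin(range(10), 4)
print([len(p) for p in parts])  # [3, 3, 2, 2] -- roughly equal sizes
```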
Note: if the built-in partitioning scheme doesn't work for you, you can write your own custom partitioner.
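A custom partitioner is just a mapping from key to partition id (in Scala you would extend org.apache.spark.Partitioner; in PySpark you can pass a partition function to partitionBy). A minimal plain-Python sketch of the idea, with a hypothetical routing rule:

```python
def region_partitioner(key):
    """Hypothetical rule: route 'EU'-prefixed keys to partition 0, the rest to 1."""
    return 0 if key.startswith("EU") else 1

records = [("EU-DE", 1), ("US-CA", 2), ("EU-FR", 3)]
placement = {key: region_partitioner(key) for key, _ in records}
print(placement)  # {'EU-DE': 0, 'US-CA': 1, 'EU-FR': 0}
```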
The coalesce transformation is also used to change the number of partitions. It can trigger RDD shuffling depending on the shuffle flag, which is disabled by default (i.e. false) and used explicitly here for demo purposes. With shuffle disabled, coalesce can only reduce the number of partitions; if you ask for more, the number of partitions remains the same as the number of partitions in the source RDD.
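A sketch of what coalesce with shuffle = false does: it merges existing partitions locally instead of shuffling individual rows, so asking for more partitions than the RDD already has changes nothing (a plain-Python simulation with simplified grouping; the real coalesce also considers data locality):

```python
def coalesce(partitions, n):
    """Merge whole parent partitions into n groups, without a shuffle.
    If n >= the current partition count, partitioning is left unchanged."""
    if n >= len(partitions):
        return partitions
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)  # simplified grouping of parent partitions
    return merged

parts = [[1, 2], [3], [4, 5], [6]]
print(len(coalesce(parts, 2)))  # 2
print(len(coalesce(parts, 8)))  # 4 -- cannot grow without a shuffle
```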