Spark


Apache Spark:

It is a fast cluster computing system – a ready-to-use, scalable cluster framework.
It ships as a set of modules that you pick as per requirement – core/batch processing, streaming, SQL, graph processing, etc.

How to install (useful links):
http://spark.apache.org/docs/latest/
http://nishutayaltech.blogspot.in/2015/04/how-to-run-apache-spark-on-windows7-in.html
OR
1. You can run it without installing anything as well – you just need the Spark jars on your classpath.
2. You can add a Maven dependency:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>….</version>
</dependency>
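
The other modules mentioned above follow the same pattern – for example (illustrative coordinates; use the Scala suffix and Spark version that match your setup), Spark SQL and Spark Streaming:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>….</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>….</version>
</dependency>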

Example in Java:
Aim – Process a file and find the length of its longest line.
Code:
Dependencies:
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;

Creating RDD:
String testFile = "C://disk//filename";
SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]");

JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> testData = sc.textFile(testFile).cache();

// Process the data and get the maximum line length into variable num (two ways shown below)
System.out.println(num);

With inline functions:  

int num = testData.map(new Function<String, Integer>() {
    public Integer call(String s) {
        return s.length();
    }
}).reduce(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) {
        if (a > b)
            return a;
        else
            return b;
    }
});
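
On Java 8 or later, the same map/reduce pair can be written more compactly with lambdas – a quick sketch of the same logic:

int num = testData.map(s -> s.length())
                  .reduce((a, b) -> Math.max(a, b));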

With named classes instead of inline functions:
class GetLength implements Function<String, Integer> {
    public Integer call(String s) {
        return s.length();
    }
}

class GetMax implements Function2<Integer, Integer, Integer> {
    public Integer call(Integer a, Integer b) {
        if (a > b)
            return a;
        else
            return b;
    }
}

int num = testData.map(new GetLength()).reduce(new GetMax());

Worked example:

Input file content:

This is my first line.
This is next.
This is last.
This is one more extra.

Intermediate output:

With 1 thread (local[1]) – it reads the file sequentially, mapping each line to its length and reducing as it goes:
This is my first line. => 22
This is next. => 13
Reduce it to max(22, 13) => 22
This is last. => 13
Reduce it to max(22, 13) => 22
This is one more extra. => 23
Reduce it to max(22, 23) => 23

With 2 threads (local[2]) – it splits the file into two partitions and maps lines to lengths in parallel, reducing within each partition first:

Partition with the second half:
This is last. => 13
This is one more extra. => 23
Reduce it to max(13, 23) => 23

Partition with the first half:
This is my first line. => 22
This is next. => 13
Reduce it to max(22, 13) => 22

Finally, reduce across partitions: max(22, 23) => 23
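
As an aside, the number of partitions does not have to follow the thread count – you can also request it when reading the file. A small sketch (the second argument to textFile is the minimum number of partitions; 4 is just an illustrative value):

// Ask Spark to split the input into at least 4 partitions
JavaRDD<String> testData = sc.textFile(testFile, 4).cache();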

Final output:
23
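
Putting the pieces together, a minimal self-contained version of the whole program could look like this (a sketch – LongestLineApp is just an illustrative class name, the file path is a placeholder, and the Spark core dependency above is assumed to be on the classpath):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;

public class LongestLineApp {
    public static void main(String[] args) {
        // Placeholder input path – replace with a real file on your machine
        String testFile = "C://disk//filename";

        // local[2] = run locally with 2 worker threads
        SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> testData = sc.textFile(testFile).cache();

        // Map each line to its length, then reduce to the maximum
        int num = testData.map(new Function<String, Integer>() {
            public Integer call(String s) {
                return s.length();
            }
        }).reduce(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer a, Integer b) {
                return a > b ? a : b;
            }
        });

        System.out.println(num);
        sc.stop();
    }
}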