Concurrency Models: Lambda Approach
2022, Mar 12
The Lambda approach pairs a batch layer built on MapReduce with a speed layer that maintains real-time views, merging the two when answering queries.
Map Reduce
MapReduce emerged as the first widely adopted model for using a cluster of machines to solve big-data problems at scale.
Map Reduce Model
- Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines.
- Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function. The number of partitions (R) and the partitioning function are specified by the user (see the sketch after this list).
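Concretely, the user supplies the Map and Reduce functions and, optionally, the partitioning function. Here is a minimal Python sketch for a word-count job; the function names and signatures are illustrative assumptions, not the actual MapReduce library API:

```python
# User-supplied functions for a word-count job. These names and
# signatures are illustrative, not the actual MapReduce library API.

def map_fn(doc_name, contents):
    """Map: emit an intermediate (word, 1) pair for every word in a document."""
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum every count emitted for the same word."""
    yield word, sum(counts)

def partition_fn(key, R):
    """Decide which of the R reduce tasks receives this intermediate key."""
    return hash(key) % R
```

Hashing the key modulo R is the typical default; a custom partitioning function is useful when related keys must end up in the same output partition.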
Map Reduce Flow
- The MapReduce library in the user program splits the input files into M pieces, typically 16 MB per piece. It then starts up many copies of the program on a cluster of machines.
- One of the copies of the program is special, the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or reduce task.
- A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
- Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
- When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together.
- The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user’s Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.
- When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point the MapReduce call in the user program returns back to the user code. (A single-process sketch of the whole flow follows this list.)
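The flow above can be imitated in a single process to make the data movement concrete. This is only a sketch: M, R, and all helper names are assumptions, and real MapReduce runs the map and reduce tasks on separate worker machines with the master coordinating them.

```python
from collections import defaultdict
from itertools import groupby

def run_mapreduce(documents, map_fn, reduce_fn, M=4, R=2):
    """Single-process imitation of the flow above."""
    # 1. Split the input into M pieces (per-document here; ~16 MB splits in practice).
    splits = [documents[i::M] for i in range(M)]

    # 2-4. Each "map task" runs the user Map function, buffers the intermediate
    # pairs, and partitions them into R regions (hash(key) mod R), like a map
    # worker writing to its local disk.
    map_outputs = []
    for split in splits:
        regions = defaultdict(list)
        for name, contents in split:
            for key, value in map_fn(name, contents):
                regions[hash(key) % R].append((key, value))
        map_outputs.append(regions)

    # 5-6. Each "reduce task" reads its region from every map task (the shuffle),
    # sorts by key, groups, and calls the user Reduce function on each group.
    results = {}
    for r in range(R):
        intermediate = sorted(
            (pair for regions in map_outputs for pair in regions[r]),
            key=lambda kv: kv[0],
        )
        for key, group in groupby(intermediate, key=lambda kv: kv[0]):
            for out_key, out_value in reduce_fn(key, [v for _, v in group]):
                results[out_key] = out_value

    # 7. All map and reduce tasks are done; control returns to the caller.
    return results

docs = [("a.txt", "the quick brown fox"), ("b.txt", "the lazy dog")]
print(run_mapreduce(
    docs,
    map_fn=lambda name, text: ((word, 1) for word in text.split()),
    reduce_fn=lambda word, counts: [(word, sum(counts))],
))
```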
Lambda Architecture
While MapReduce brought many advantages for solving data problems, the need soon arose to get answers more quickly, ideally in real time. The big problem was always chunked into batches, and each batch view then had to be indexed at a serving layer so it could be queried quickly (a serving-layer sketch follows the diagram below).
graph TB
subgraph m[ ]
d[Raw data] -.-> m1
d -.-> m2
m1[Mapper] -.-> rp[Reducer]
m2[Mapper] -.-> rp
rp -.-> bv([Batch View])
bv -.-> db[(Database)]
wb[Web Server] -.-> |Query| db
db -.-> |Results| wb
wb -.-> us[fa:fa-user-o User]
end
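To make the serving side of the diagram concrete, here is a minimal sketch of a batch view being indexed in a database and then read on each query by the web server. SQLite stands in for whatever serving store is actually used, and all names here are illustrative assumptions.

```python
import sqlite3

def load_batch_view(conn, batch_view):
    """Index the precomputed batch view in the serving database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS batch_view (key TEXT PRIMARY KEY, value INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO batch_view VALUES (?, ?)", batch_view.items()
    )
    conn.commit()

def handle_query(conn, key):
    """What the web server does on a user query: read the precomputed view."""
    row = conn.execute(
        "SELECT value FROM batch_view WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else 0

conn = sqlite3.connect(":memory:")
load_batch_view(conn, {"the": 2, "fox": 1})  # e.g. the output of a batch run
print(handle_query(conn, "the"))             # -> 2
```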
To overcome this drawback, a speed layer was added: it picks up new data as it arrives, processes it, and curates a fast real-time view that is then merged with the batch view at query time (a query-time merge is sketched after the diagram below).
graph TB
subgraph m[ ]
d[Incoming Data] -.-> d1
d -.-> d2
d1[Raw Data] -.-> |Map Reduce| bv([Batch View])
d2[New Data] -.-> |Speed Layer| sp([Speed Layer])
bv -.-> db[(Database)]
wb[Web Server] -.-> |Query| db
db -.-> |Results| wb
wb -.-> us[fa:fa-user-o User]
sp -.-> rl(Realtime View)
rl -.-> |results| wb
end