Sunday, 23 November 2025

Kafka The Definitive Guide Notes - Part 4

Connect - kafka to/from other systems
OOB - best if code is not in our control
APIs, runtime 
conn plugins - libraries executed by workers - move data
REST API - config n manage connectors
conn - tasks(parallel, resources)
worker process-conn data obj - tasks - src/sink
converters - data stored in diff formats
prodn - sep servers from brokers
standalone, distri
bootstrap.servers, group.id (connect cluster), key/value converter
conv - schema
mirror maker 2 - plugin 
rebalance - conn del
src - 1 topic, sink - multi topics
conn - github/custom
conn - how many tasks, split data btn tasks, worker -> config, task
tasks - in & out of kafka
worker - context - task init
worker - container proc, execs conns, tasks, handles http reqs, stores conn config, starts conns n tasks, auto commits offsets, retries
data model, schema
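
The note "conn - how many tasks, split data btn tasks" can be sketched roughly as below. This is plain Python, not the Connect API; the function name task_configs and the table names are hypothetical, loosely modelled on a connector that hands out one config per task.

```python
def task_configs(tables, max_tasks):
    """Split a connector's work items (hypothetical table names) across at most max_tasks tasks."""
    num_tasks = min(max_tasks, len(tables))
    if num_tasks == 0:
        return []
    groups = [[] for _ in range(num_tasks)]
    for i, table in enumerate(tables):
        groups[i % num_tasks].append(table)
    # One config dict per task; a worker would pass each dict to one task.
    return [{"tables": ",".join(group)} for group in groups]


configs = task_configs(["orders", "users", "payments"], max_tasks=2)
```

With 3 tables and max 2 tasks, one task gets two tables and the other gets one.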
------------------------
cross cluster data mirror - mm2
regional & central, redundancy, cloud migration
cross DC comm 
1 cluster/DC
better to consume from remote than produce to remote
--
hub & spoke-multi local+ 1 central
active-active- 2+DCs share data, sync issues
active-passive-disaster - copy
failover - loss/dups
start after failover - begin/end, auto offset
---
stretch cluster - entire DC fail
mirror btn 2 DCs
docker image
Uber uReplicator
--

Kafka The Definitive Guide Notes - Part 3

reqs - standard header - req type, version, correl id, client id
port - broker listens - acceptor thread - creates conn - processor thread - handles: reqs into req Q, resps from resp Q to clients
must be sent to the broker holding the partn's leader replica
where to send - req type - metadata - list of topics, pns, replicas, leader replica
all brokers - metadata cache
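
A minimal sketch of the routing step above: the client caches the metadata response and looks up the leader before sending. The dict layout and the name broker_for are assumptions for illustration, not the wire format.

```python
# Hypothetical cached metadata: topic -> partition -> leader broker id and replicas.
metadata = {
    "orders": {
        0: {"leader": 1, "replicas": [1, 2, 3]},
        1: {"leader": 2, "replicas": [2, 3, 1]},
    }
}


def broker_for(metadata, topic, partition):
    """Return the broker id a produce/fetch request for this partition should go to."""
    try:
        return metadata[topic][partition]["leader"]
    except KeyError:
        # Unknown topic or partition: a real client would refresh its metadata and retry.
        raise LookupError(f"no metadata for {topic}-{partition}") from None


target = broker_for(metadata, "orders", 1)
```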
--
PR
validations - privileges to send, acks - all - enough in sync replicas?
req buffer  - purgatory
leader - responds after replicas
msgs - local disk, linux - FS cache
--
FR
offset, P, T, limit - min, max
zerocopy - dir file to n/w
consumer can see only msgs replicated to all in sync replicas
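
Rough sketch of the visibility rule for fetches: the leader serves records only up to the smallest log end offset among in sync replicas (the high watermark). Offsets, broker ids and function names here are made up for illustration.

```python
def high_watermark(log_end_offsets, isr):
    """Smallest log end offset among the in-sync replicas."""
    return min(log_end_offsets[broker] for broker in isr)


def fetch(log, fetch_offset, hw, max_records):
    """Return records in [fetch_offset, hw), capped at max_records."""
    end = min(hw, fetch_offset + max_records)
    return log[fetch_offset:end]


leo = {1: 10, 2: 8, 3: 5}  # broker id -> next offset to be written
hw = high_watermark(leo, isr=[1, 2])  # broker 3 is assumed to be out of sync
leader_log = [f"m{i}" for i in range(10)]
visible = fetch(leader_log, fetch_offset=6, hw=hw, max_records=100)
```

Here the leader holds m0..m9 but the consumer gets only m6 and m7, since broker 2 has not yet replicated past offset 8.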
--
storage unit - partn replica
partn - cant be split btn multi brokers or disks
log.dirs - partn storage
--
T create - partn alloc
round robin, inc offset from leader, alternating racks
new partn goes to the dir with the fewest partns
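
A simplified sketch of the round robin placement above: leaders go round robin from a start broker, and each follower sits at an increasing offset from its leader. The real broker logic has more detail; rack awareness is left out here, though a rack-alternating broker list could be passed in as brokers.

```python
def assign_replicas(num_partitions, replication_factor, brokers, start=0):
    """Replica j of partition p goes to brokers[(start + p + j) % n]; j = 0 is the leader."""
    n = len(brokers)
    if replication_factor > n:
        raise ValueError("replication factor cannot exceed broker count")
    return {
        p: [brokers[(start + p + j) % n] for j in range(replication_factor)]
        for p in range(num_partitions)
    }


layout = assign_replicas(4, 3, brokers=[0, 1, 2, 3, 4, 5], start=4)
```

With start=4, partition 0 gets leader 4 and followers 5 and 0; partition 1 gets leader 5 and followers 0 and 1.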
--
partn - split to segs-retention size/duration
current/active seg never deleted
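
Time based retention over segments, as a sketch. The tuple layout (base offset, last modified ms) is an assumption for illustration; the point is that the last (active) segment is never a candidate.

```python
def expired_segments(segments, now_ms, retention_ms):
    """segments: list of (base_offset, last_modified_ms), oldest first; the last one is active."""
    closed = segments[:-1]  # the active segment is never eligible for deletion
    return [base for base, modified in closed if now_ms - modified > retention_ms]


segs = [(0, 1000), (100, 5000), (200, 9000)]
to_delete = expired_segments(segs, now_ms=10000, retention_ms=4000)
```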
--
format - seg: file - msg, offset, k, v, size, checksum, timestamp etc
DumpLogSegments tool
--
index: partn, offset:seg mapping, position
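
Sketch of the offset lookup: find the segment by base offset, then use that segment's index to get a byte position to start reading from. The index is assumed sparse here (not every offset has an entry); names and numbers are illustrative.

```python
import bisect


def find_segment(base_offsets, offset):
    """base_offsets: sorted first offsets of each segment. Return the base of the segment holding offset."""
    i = bisect.bisect_right(base_offsets, offset) - 1
    if i < 0:
        raise LookupError("offset is below the first retained segment")
    return base_offsets[i]


def find_position(index, offset):
    """index: sorted (offset, byte_position) entries of one segment. Return where to start scanning."""
    offsets = [entry_offset for entry_offset, _ in index]
    i = bisect.bisect_right(offsets, offset) - 1
    return index[i][1] if i >= 0 else 0


segment = find_segment([0, 100, 200], 150)
position = find_position([(100, 0), (140, 4096), (180, 8192)], 150)
```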
--
compaction--store only latest value
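
Compaction in one function, as a sketch: keep only the most recent record per key. Records are assumed to be (offset, key, value) tuples in offset order.

```python
def compact(records):
    """Keep only the last record seen for each key, returned in offset order."""
    latest = {}
    for offset, key, value in records:
        latest[key] = (offset, key, value)
    return sorted(latest.values())


compacted = compact([(0, "a", "1"), (1, "b", "2"), (2, "a", "3")])
```

Key "a" keeps only its value from offset 2; the older one at offset 0 is dropped.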
--
reliable data delivery-order within partn, commit - all in sync replicas, commited wont be lost as long as at least one in sync is alive, consumer can read only committed
follower - active z sessions, fetch latest from leader
--
replcn factor-3
unclean leader elec
min in syncs
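
How min in sync replicas interacts with acks, sketched as a check the leader could make. The function name is hypothetical; only acks=all is gated on the ISR size in this sketch.

```python
def can_accept(acks, isr_size, min_insync_replicas):
    """Would a produce request be accepted? Only acks='all' depends on the ISR size."""
    if acks == "all":
        return isr_size >= min_insync_replicas
    return True


# Replication factor 3, min in sync replicas 2, one broker down: writes still accepted.
ok_with_two = can_accept("all", isr_size=2, min_insync_replicas=2)
# Two brokers down: acks=all producers are rejected until a replica catches up.
ok_with_one = can_accept("all", isr_size=1, min_insync_replicas=2)
```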
--
commit configs, freq
offset reset - earliest/latest
auto, explicit 
--
exactly once - unique key (idempotent writes)
--
pipeline - kafka -endpoint/mid
timeliness - bulk/stream
reliability, thruput, in mem, coupling
--

Kafka The Definitive Guide Notes - Part 2

Zookeeper - metadata about cluster(brokers, topic)
kraft - replaces Z
Z cluster - ensemble-odd num of servers - 5
dont go beyond 7
common config
initLimit, syncLimit - followers-leader
leader election
controller
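
Why an odd number of servers in the ensemble: a majority must stay up, so an even count adds a server without tolerating any more failures. Quick arithmetic sketch:

```python
def quorum(n):
    """Smallest majority of n servers."""
    return n // 2 + 1


def tolerated_failures(n):
    """How many servers can be lost while a majority remains."""
    return n - quorum(n)


table = {n: tolerated_failures(n) for n in (3, 4, 5, 6, 7)}
```

3 and 4 servers both tolerate 1 failure; 5 and 6 both tolerate 2; 7 tolerates 3.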
---
broker.id - unique in cluster
port - 9092 
zookeeper.connect - broker meta
log.dirs
partitions ~ brokers
h/w, rack
---
producer record - topic, value - optional: partition, key
serializer
partitioner
topic
partition
batch
broker
--
prodr - bootstrap.servers, key.serial, val.serial
fire & forget, sync send-p.send(pr).get(), async send - p.send(pr, new CBF())
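
The three send styles, sketched with Python's own Future as a stand-in. This is not the Kafka client: send() below is a hypothetical fake that pretends the returned offset is the record length. It only shows ignore vs block vs callback.

```python
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)


def send(record):
    """Hypothetical stand-in for a producer send: returns a Future of a fake offset."""
    return executor.submit(lambda: len(record))


# 1. Fire and forget: ignore the returned future.
send("a")

# 2. Sync send: block until the result (or the exception) is available.
offset = send("abc").result()

# 3. Async send: register a callback and carry on.
results = []


def on_done(future):
    error = future.exception()
    results.append(error if error else future.result())


send("hello").add_done_callback(on_done)
executor.shutdown(wait=True)  # wait for pending sends before exiting
```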
--
acks - 0, 1, all - how many replicas must receive b4 prodr can consider it successful
retries
batch size
client.id
--
custom serial, avro
--
PR - t, k, v
msg - (k, v)--null k : round robin, same key -same partition
num of partitions consistent - mapping consistent
custom partitioning
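
Sketch of the key to partition mapping described above. zlib.crc32 is only a stand-in hash (an assumption, not the hash the Kafka client uses); the shape is what matters: null key goes round robin, a non-null key goes to hash mod number of partitions.

```python
import itertools
import zlib

_round_robin = itertools.count()


def choose_partition(key, num_partitions):
    """Null key: rotate over partitions. Otherwise: stable hash of the key mod partition count."""
    if key is None:
        return next(_round_robin) % num_partitions
    return zlib.crc32(key.encode("utf-8")) % num_partitions


p1 = choose_partition("user-1", 6)
p2 = choose_partition("user-1", 6)  # same key, same partition count -> same partition
```

Since the partition count is part of the formula, adding partitions later can send an existing key to a different partition.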
--
consumer group - receives all msgs of the topic
each consumer - a subset, from the partns it owns
rebalance - ownership
group coord (a broker) - consumers send heartbeats to it
polls n commits consumed recs
JoinGroup req to GC
1st cons to join - Group Leader - list of cons from GC- assigns subset of partns to cons
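
What the group leader does with the member list, as a sketch. This is a simple round robin style split; the function name and consumer ids are made up, and real assignors have more rules.

```python
def assign(partitions, consumers):
    """Spread partitions over consumers round robin; returns consumer -> list of partitions."""
    members = sorted(consumers)
    assignment = {member: [] for member in members}
    if not members:
        return assignment
    for i, partition in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(partition)
    return assignment


before = assign(range(4), ["c1", "c2"])
after = assign(range(4), ["c1"])  # c2 left the group: rebalance moves its partitions to c1
```

More consumers than partitions leaves the extra consumers idle with an empty list.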
--
cons - boot, deserial, group id
c.subscribe(t)
loop - c.poll
1 cons/thread
--
update current position in partn - commit
reads latest committed offset
auto commit, sync, async
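
Position vs committed offset, sketched with a toy reader (class and names are hypothetical, not the consumer API). Whoever takes over a partition resumes from the last committed offset, so anything polled but not committed gets read again.

```python
class PartitionReader:
    """Toy consumer for one partition: tracks the read position and the committed offset."""

    def __init__(self, log, committed=0):
        self.log = log
        self.committed = committed
        self.position = committed  # always start from the last committed offset

    def poll(self, max_records):
        batch = self.log[self.position:self.position + max_records]
        self.position += len(batch)
        return batch

    def commit(self):
        self.committed = self.position


log = ["m0", "m1", "m2", "m3", "m4"]
first = PartitionReader(log)
first.poll(2)
first.commit()  # committed offset is now 2
first.poll(2)  # reads m2, m3 but "crashes" before committing

second = PartitionReader(log, committed=first.committed)
replayed = second.poll(2)
```

The second reader sees m2 and m3 again, which is the duplicate case after a rebalance.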
--
rebalance code hooks - clean up
--
broker start - reg id in Z - ephemeral node
conn lost - auto node removed, watch list of brokers notified, id still exists in other DS
--
controller - 1st ephe node - broker fnct+ partn leader election-Z watch
leader - serves reqs, follower - replicates
new broker - controller checks for replicas - notifies
--
replication - distri, partnd, replicated commit log -availability,durability
partn - scale, parallel, order
out of sync followers cant become leaders
preferred leader-t creation with balanced load
--


Kafka The Definitive Guide Notes - Part 1

Data -  logs, metrics, activity, messages- notifications etc

DBs, systems - store data,KV stores, search indexes, caches

How to move - flow of data - publisher/subscriber

messaging systems-ActiveMQ, RabbitMQ, MQSeries - pub & sub

big data - hadoop - real time, store, process periodically, continuous low latency processing, data warehousing
log aggregations,
ETL/transformation tools - not system - stream centric

continuously evolving and ever growing stream

Linkedin - internal infra - streaming platform - pub & sub to streams of data
store, process
modern distri sys - cluster, scale elastically
storage - guaranteed delivery, replicate persisted data
stream processing - compute derived streams, dynamic datasets - less code

-------------------------------------------------------

pub - classify
broker
sub

kafka - distri commit log/streaming
unit of data - message ~ row/record
optional metadata - key - hash of key mod num of partitions in topic picks the partition
written in batches - compressed
batch - same topic, partition
--
schema - understand msg
json, xml - type handling, compat btn schema versions
Avro - serialzn fmw - hadoop - compact, schema - payload, type, evoln
consistent data format - decouples read n write
--
categorize - topics ~ table/folder
partition ~ single commit log
append only - order guaranteed within partition
redundancy, scalability - diff servers - horiz
stream ~ topic
stream procsg - kafka streams, apache samza, storm
--
kafka clients - producer, consumer
client APIs - Connect for integration, Streams for procsg
custom partition based on biz rules
offset - metadata - int - unique within partition
cons group - topic - cons:partition ownership
--
broker - 1 server
cluster of brokers - controller-admin, partitions to brokers, monitor
partitions - replications in multiple brokers - leader and followers
--
retention - period/topic size
expire - delete
--
multi clusters - segregate data, isolation-security, multiple DCs (disaster recovery)
replication - within cluster
mirror maker - between clusters
--
multi prods - aggr
multi cons - group
disk based retention 
scalable - huge data, without going offline
high perf
data ecosys - any i/p, o/p
--