Monday, 8 June 2026

OpenSearch Cheat Sheet

Index=Table/collection of docs.

Document=One JSON record inside an index.

{ "id": "IMP123", "event": "UE107348", "name": "Service A", "customer": [...] }

Field=A property inside a document.

event, name, customer.customerName, customer.billAmount

Mapping=Schema of the index. Defines field types:

Type | Used for
text | Full-text search
keyword | Exact match, sorting, grouping
double, integer, long | Numeric sort/range/aggregation
date | Date filter/sort
nested | Array of objects where each object must stay logically separate

text vs keyword

text is analyzed/tokenized.

Good for search:

"Gold Customer" -> "gold", "customer"

keyword is exact.

Good for:

  • exact match
  • grouping
  • sorting
  • aggregations

"customerName": { "type": "text", "fields": { "keyword": { "type": "keyword" } } }

Use:

customerName -> search text
customerName.keyword -> exact match/group/sort
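
To make the text vs keyword difference concrete, here is a minimal Python sketch. It only imitates the behaviour in memory: `analyze` is a rough stand-in for a real analyzer (lowercase, split on non-alphanumerics), and `match_text` / `term_keyword` are hypothetical helper names, not OpenSearch APIs.

```python
import re

def analyze(text):
    """Rough stand-in for a text analyzer: lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]

def match_text(field_value, query):
    """Like 'match' on a text field: any analyzed query token hits a field token."""
    field_tokens = set(analyze(field_value))
    return any(tok in field_tokens for tok in analyze(query))

def term_keyword(field_value, query):
    """Like 'term' on a keyword field: the stored value must be identical."""
    return field_value == query

name = "Gold Customer"
print(analyze(name))                        # ['gold', 'customer']
print(match_text(name, "gold"))             # True
print(term_keyword(name, "gold"))           # False
print(term_keyword(name, "Gold Customer"))  # True
```

This is why customerName suits search boxes while customerName.keyword suits grouping and sorting: the keyword sub-field keeps the original string untouched.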

Query

Filters/selects documents.

Example:

{ "query": { "term": { "event.keyword": { "value": "UE107348" } } } }

This means:

Find documents where event exactly equals UE107348

term vs match

Query | Use for
term | Exact value match, usually on keyword, numbers, ids
match | Text search on analyzed text fields
range | Numeric/date range
bool | Combine multiple conditions

Example bool:

{ "bool": { "must": [ { "term": { "event.keyword": "UE107348" } }, { "term": { "customer.customerId": "CUST1" } } ] } }
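
A small Python sketch of how a full request body with a bool query could be assembled. The helper names `term` and `build_search` are made up for illustration, and `createdAt` is an assumed top-level date field that does not appear in the example document above.

```python
import json

def term(field, value):
    """Exact-match clause in the same shape as the term example above."""
    return {"term": {field: {"value": value}}}

def build_search(event_id, created_from, size=10):
    """Combine an exact match and a date range under bool/must."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [
                    term("event.keyword", event_id),
                    # 'createdAt' is a hypothetical field used only for this sketch
                    {"range": {"createdAt": {"gte": created_from}}},
                ]
            }
        },
    }

body = build_search("UE107348", "2026-01-01")
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of a GET <index>/_search request.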

Hits

Normal search results.


size

Controls how many normal documents are returned.

If you run a query with size: 10, OpenSearch returns up to 10 matching documents in hits (fewer if fewer documents match).

Good for:

  • search results
  • document pagination

{ "size": 10 }

Means return 10 hits.

size: 0

Means:

Do not return documents, only return aggregations.

Useful when you only need counts/groups/summaries.

from + size

Document pagination.

{ "from": 100, "size": 20 }

Means:

Return documents 101-120

Works well for hits/documents.

Does not directly paginate aggregation buckets.
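
The from/size arithmetic can be sketched in Python as below. `page_params` and `doc_range` are hypothetical helper names. The 10,000 limit mirrors the default of the `index.max_result_window` setting; deeper paging is usually done with search_after or scroll instead.

```python
MAX_RESULT_WINDOW = 10_000  # default value of index.max_result_window

def page_params(page, page_size):
    """Translate a 1-based page number into from/size."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    start = (page - 1) * page_size
    if start + page_size > MAX_RESULT_WINDOW:
        raise ValueError("from + size exceeds the result window")
    return {"from": start, "size": page_size}

def doc_range(params):
    """1-based positions of the first and last document on the page."""
    return params["from"] + 1, params["from"] + params["size"]

params = page_params(6, 20)
print(params)             # {'from': 100, 'size': 20}
print(doc_range(params))  # (101, 120)
```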

Aggregation

Groups/summarizes documents.

Like SQL GROUP BY.

Examples:

Aggregation | Meaning
terms | Group by exact field
max | Max numeric value
min | Min numeric value
value_count | Count values
top_hits | Return sample documents inside a bucket
nested | Enter nested object array
reverse_nested | Go back from nested object to parent document

terms aggregation

Group by field.

{ "terms": { "field": "customer.customerName.keyword", "size": 100 } }

Means:

Group matching data by customer name and return top 100 buckets

Metric aggregation

Calculates value per bucket.

Example:

"customer_bill_amount": { "max": { "field": "customer.billAmount" } }

Means:

For each customer bucket, calculate max bill amount

Used so buckets can be sorted by bill amount.

Bucket sorting

Example:

"order": [ { "customer_bill_amount": "desc" } ]

Means:

Sort customer buckets by bill amount descending
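
What terms + max + order does can be imitated in plain Python. This is only an in-memory sketch of the idea (group, compute a metric per group, sort groups by the metric), not how OpenSearch executes it; `terms_with_max` and the sample rows are invented for illustration.

```python
from collections import defaultdict

def terms_with_max(rows, group_field, metric_field, size):
    """Group rows by group_field, track the max of metric_field, sort buckets desc."""
    buckets = defaultdict(lambda: {"doc_count": 0, "max": None})
    for row in rows:
        bucket = buckets[row[group_field]]
        bucket["doc_count"] += 1
        value = row[metric_field]
        bucket["max"] = value if bucket["max"] is None else max(bucket["max"], value)
    ordered = sorted(buckets.items(), key=lambda kv: kv[1]["max"], reverse=True)
    return [{"key": key, **stats} for key, stats in ordered[:size]]

rows = [
    {"customerName": "Acme", "billAmount": 100.0},
    {"customerName": "Globex", "billAmount": 500.0},
    {"customerName": "Acme", "billAmount": 250.0},
]
result = terms_with_max(rows, "customerName", "billAmount", size=100)
print(result)
# [{'key': 'Globex', 'doc_count': 1, 'max': 500.0}, {'key': 'Acme', 'doc_count': 2, 'max': 250.0}]
```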

nested

Needed when field is an array of objects.

Example:

"customer": [ { "customerId": "C1", "billAmount": 100 }, { "customerId": "C2", "billAmount": 500 } ]

Without nested, OpenSearch can mix values from different array objects incorrectly.

Nested keeps each customer object separate.
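
The mixing problem is easier to see with a small simulation. The Python sketch below imitates how an array of plain objects is effectively stored as one list of values per field, compared with checking each object on its own. `flatten`, `match_flattened` and `match_nested` are hypothetical helpers, not OpenSearch calls.

```python
doc = {
    "customer": [
        {"customerId": "C1", "billAmount": 100},
        {"customerId": "C2", "billAmount": 500},
    ]
}

def flatten(doc, path):
    """Without nested: each field becomes one flat list of values."""
    flat = {}
    for obj in doc[path]:
        for key, value in obj.items():
            flat.setdefault(f"{path}.{key}", []).append(value)
    return flat

def match_flattened(doc, path, conditions):
    """Each condition is checked against the flat lists independently."""
    flat = flatten(doc, path)
    return all(value in flat.get(f"{path}.{key}", []) for key, value in conditions.items())

def match_nested(doc, path, conditions):
    """With nested: all conditions must hold inside the same object."""
    return any(all(obj.get(k) == v for k, v in conditions.items()) for obj in doc[path])

conditions = {"customerId": "C1", "billAmount": 500}
print(flatten(doc, "customer"))
# {'customer.customerId': ['C1', 'C2'], 'customer.billAmount': [100, 500]}
print(match_flattened(doc, "customer", conditions))  # True  (false positive)
print(match_nested(doc, "customer", conditions))     # False (C1 has billAmount 100)
```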

nested query

Search inside nested object.

{ "nested": { "path": "customer", "query": { "term": { "customer.customerId": "C1" } } } }

Means:

Find impact docs where one nested customer has customerId C1

nested aggregation

Aggregate inside nested object.

{ "nested": { "path": "customer" } }

Means:

Go into the customer nested array and aggregate customer rows

top_hits

Returns actual documents/objects inside an aggregation bucket.

Example:

"customer_details": { "top_hits": { "size": 1 } }

Means:

For each customer bucket, return one sample hit to get details

Caution:

  • top_hits inside many buckets can become expensive.
  • Inner result size limits apply.
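
Putting the pieces together: the Python sketch below builds one request body that filters by event, enters the nested customer array, groups by customer name, sorts buckets by max bill amount and returns one sample hit per bucket. The aggregation names `customers` and `by_customer` are arbitrary labels chosen for this example, and the layout assumes customer is mapped as nested.

```python
import json

def build_customer_summary(event_id, max_buckets=100):
    """size 0 + nested -> terms -> (max, top_hits), as described in the notes above."""
    return {
        "size": 0,  # no normal hits, aggregations only
        "query": {"term": {"event.keyword": {"value": event_id}}},
        "aggs": {
            "customers": {
                "nested": {"path": "customer"},
                "aggs": {
                    "by_customer": {
                        "terms": {
                            "field": "customer.customerName.keyword",
                            "size": max_buckets,
                            "order": [{"customer_bill_amount": "desc"}],
                        },
                        "aggs": {
                            "customer_bill_amount": {"max": {"field": "customer.billAmount"}},
                            "customer_details": {"top_hits": {"size": 1}},
                        },
                    }
                },
            }
        },
    }

body = build_customer_summary("UE107348")
print(json.dumps(body, indent=2))
```

The order clause refers to customer_bill_amount by name, so that metric has to be a direct sub-aggregation of the terms aggregation it sorts.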

Scoring

OpenSearch gives _score when doing relevance search.

Useful for:

  • text search
  • fuzzy search
  • best match ranking

-------------
GET _cat/indices?v

GET _cat/indices/abcIndex-*?v

GET abcIndex/_mapping

GET abcIndex/_settings

Index name with a version suffix - the actual index
Index name without a version - alias that points to the versioned index

Thursday, 26 March 2026

Helidon

Eclipse MicroProfile - additional specs on top of Jakarta EE for microservices.

Helidon - one of the MicroProfile implementations.

Lightweight, fast for microservices.

main() method class - Server server = Server.create().start();

During server startup, application.yaml is referred.

backpressure - slow down, pause, close

@Outgoing("toABCChannel")
--map to SubmissionPublisher during app start

channels defined in application.yaml

Sunday, 23 November 2025

Kafka The Definitive Guide Notes - Part 4

Connect - Kafka to other systems
OOB - best if code is not in our control
APIs, runtime 
conn plugins - libraries executed- moves data
REST API - config n manage plugins
conn - tasks(parallel, resources)
worker process-conn data obj - tasks - src/sink
converters - data stored in diff formats
prodn - sep servers from brokers
standalone, distri
boot, group.id(conn cluster), kv converter
conv - schema
mirror maker 2 - plugin 
rebalance - conn del
src - 1 topic, sink - multi topics
conn - github/custom
conn - how many tasks, split data btn tasks, worker -> config, task
tasks - in & out of kafka
worker - context - task init
worker - container proc, execs conns, tasks, handles http reqs, stores conn config, starts conns n tasks, auto commits offsets, retries
data model, schema
------------------------
cross cluster data mirror - mm2
regional & central, redundancy, cloud migration
cross DC comm 
1 cluster/DC
prefer consuming from remote DC over producing to remote DC
--
hub & spoke-multi local+ 1 central
active-active- 2+DCs share data, sync issues
active-passive-disaster - copy
failover - loss/dups
start after failover - begin/end, auto offset
---
stretch cluster - entire DC fail
mirror btn 2 DCs
docker image
Uber uReplicator
--

Kafka The Definitive Guide Notes - Part 3

reqs - standard header - req type, version, correl id, client id
port - broker listens - acceptor thread - creates conn - processor thread - handles: puts reqs in req Q, picks resps from resp Q and sends to clients
should be sent to proper partn which is leader
where to send - req type - metadata - list of topics, pns, replicas, leader replica
all brokers - metadata cache
--
PR
validations - privileges to send, acks - all - enough in sync replicas?
req buffer  - purgatory
leader - responds after replicas
msgs - local disk, linux - FS cache
--
FR
offset, P, T, limit - min, max
zerocopy - dir file to n/w
consumer can see only msgs replicated to all in sync replicas
--
storage unit - partn replica
partn - cant be split btn multi brokers or disks
log.dirs - partn storage
--
T create - partn alloc
round robin, inc offset from leader, alternating racks
add new partn to least amount of partn dir
--
partn - split to segs-retention size/duration
current/active seg never deleted
--
format - seg: file - msg, offset, k, v, size, checksum, timestamp etc
DumpLogSegments tool
--
index: partn, offset:seg mapping, position
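
A rough Python sketch of the offset lookup idea: first pick the segment whose base offset is the largest one not above the target (segment files are named after their first offset), then use the sparse index inside it to find a byte position to start scanning from. The sample offsets and positions are made up, and `find_segment` / `find_position` are illustrative names, not Kafka APIs.

```python
import bisect

def find_segment(base_offsets, offset):
    """base_offsets: sorted first offsets of each segment. Returns the owning segment."""
    i = bisect.bisect_right(base_offsets, offset) - 1
    if i < 0:
        raise ValueError("offset is older than the first retained segment")
    return base_offsets[i]

def find_position(index_entries, offset):
    """index_entries: sorted (offset, byte_position) pairs; the index is sparse,
    so return the closest entry at or before the target and scan forward from there."""
    offsets = [entry_offset for entry_offset, _ in index_entries]
    i = bisect.bisect_right(offsets, offset) - 1
    if i < 0:
        raise ValueError("offset is before the first index entry")
    return index_entries[i][1]

segments = [0, 1000, 2500]
index_of_1000 = [(1000, 0), (1400, 4096), (1800, 8192)]

print(find_segment(segments, 1734))        # 1000
print(find_position(index_of_1000, 1734))  # 4096
```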
--
compaction--store only latest value
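
Compaction in miniature, as a Python sketch: keep only the record with the highest offset for each key. This ignores details such as tombstones and the fact that the active segment is not compacted; `compact` is an illustrative helper.

```python
def compact(log):
    """log: list of (offset, key, value) in offset order. Keep the latest record per key."""
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, key, value)  # later offsets overwrite earlier ones
    return sorted(latest.values())          # surviving records keep their offsets

log = [(0, "k1", "a"), (1, "k2", "b"), (2, "k1", "c")]
print(compact(log))  # [(1, 'k2', 'b'), (2, 'k1', 'c')]
```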
--
reliable data delivery - order within partn, commit - all in sync replicas, committed wont be lost as long as at least one in sync is alive, consumer can read only committed
follower - active z sessions, fetch latest from leader
--
replcn factor-3
unclean leader elec
min in syncs
--
commit configs, freq
offset reset - earliest/latest
auto, explicit 
--
exactly once - UK
--
pipeline - kafka -endpoint/mid
timeline - bulk/ stream
reliability, thruput, in mem, coupling, 
--

Kafka The Definitive Guide Notes - Part 2

Zookeeper - metadata about cluster(brokers, topic)
KRaft - replaces Z
Z cluster - ensemble - odd num of servers, e.g. 5
dont go beyond 7
common config
initLimit, syncLimit - followers-leader
leader election
controller
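
Why the ensemble size is odd: a majority quorum must stay alive, so an even-sized ensemble tolerates no more failures than the odd size just below it. A quick Python check of the arithmetic (helper names are illustrative):

```python
def quorum(n):
    """Smallest majority of n servers."""
    return n // 2 + 1

def tolerated_failures(n):
    """Servers that can fail while a majority is still available."""
    return n - quorum(n)

for n in (3, 4, 5, 7):
    print(n, quorum(n), tolerated_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
# 7 4 3
```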
---
broker.id - unique in cluster
port - 9092 
z.conn - broker meta
log.dirs
partitions ~ brokers
h/w, rack
---
producer record - topic, value - optional: partition, key
serializer
partitioner
topic
partition
batch
broker
--
prodr - bootstrap.servers, key.serial, val.serial
fire & forget, sync send-p.send(pr).get(), async send - p.send(pr, new CBF())
--
acks - 0, 1, all - how many replicas must receive b4 prodr can consider it successful
retries
batch size
client.id
--
custom serial, avro
--
PR - t, k, v
msg - (k, v)--null k : round robin, same key -same partition
num of partitions consistent - mapping consistent
custom partitioning
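
The key to partition mapping described above, sketched in Python. Assumptions: `zlib.crc32` stands in for the hash the real client uses, null keys are spread round robin as in the notes, and `SketchPartitioner` is a made-up class, not the Kafka partitioner API.

```python
import itertools
import zlib

class SketchPartitioner:
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._counter = itertools.count()

    def partition(self, key):
        if key is None:
            # null key: spread records across partitions in turn
            return next(self._counter) % self.num_partitions
        # same key -> same hash -> same partition, while the partition count is unchanged
        return zlib.crc32(key) % self.num_partitions

p6 = SketchPartitioner(6)
p7 = SketchPartitioner(7)
keys = [f"customer-{i}".encode() for i in range(100)]

stable = all(p6.partition(k) == p6.partition(k) for k in keys)
moved = sum(1 for k in keys if p6.partition(k) != p7.partition(k))
print(stable)  # True
print(moved)   # how many keys land elsewhere after adding a partition
```

This is the reason the notes say the mapping stays consistent only while the number of partitions stays consistent.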
--
consumer group - receives all msgs
consumers - part of them from partitions
rebalance - ownership
group coord - heartbeats to broker
polls n commits consumed recs
JoinGroup req to GC
1st cons to join - Group Leader - list of cons from GC- assigns subset of partns to cons
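
A toy version of what the group leader does with the consumer list: hand out partitions so each one has exactly one owner in the group. This Python sketch uses a simple round robin; real assignors differ in detail, and `assign_round_robin` is an illustrative name.

```python
def assign_round_robin(partitions, consumers):
    """Give each partition to exactly one consumer, cycling through the group."""
    assignment = {consumer: [] for consumer in consumers}
    if not consumers:
        return assignment
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

partitions = [0, 1, 2, 3, 4]
print(assign_round_robin(partitions, ["c1", "c2"]))
# {'c1': [0, 2, 4], 'c2': [1, 3]}

# rebalance after c2 leaves the group: c1 takes over everything
print(assign_round_robin(partitions, ["c1"]))
# {'c1': [0, 1, 2, 3, 4]}
```

With more consumers than partitions, the extra consumers end up with an empty list and sit idle.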
--
cons - boot, deserial, group id
c.subscribe(t)
loop - c.poll
1 cons/thread
--
update current position in partn - commit
reads latest committed offset
auto commit, sync, async
--
rebalance code hooks - clean up
--
broker start - reg id in Z - ephemeral node
conn lost - auto node removed, watch list of brokers notified, id still exists in other DS
--
controller - 1st ephe node - broker fnct+ partn leader election-Z watch
leader - serves reqs, follower - replicates
new broker - controller checks for replicas - notifies
--
replication - distri, partnd, replicated commit log -availability,durability
partn - scale, parallel, order
out of sync followers cant become leaders
preferred leader-t creation with balanced load
--


Kafka The Definitive Guide Notes - Part 1

Data -  logs, metrics, activity, messages- notifications etc

DBs, systems - store data,KV stores, search indexes, caches

How to move - flow of data - publisher/subscriber

messaging systems-ActiveMQ, RabbitMQ, MQSeries - pub & sub

big data - hadoop - real time, store, process periodically, continuous low latency processing, data warehousing
log aggregations,
ETL/transformation tools - not system - stream centric

continuously evolving and ever growing stream

LinkedIn - internal infra - streaming platform - pub & sub to streams of data
store, process
modern distri sys - cluster, scale elastically
storage - guaranteed delivery, replicate persisted data
stream processing - compute derived streams, dynamic datasets - less code

-------------------------------------------------------

pub - classify
broker
sub

kafka - distri commit log/streaming
unit of data - message ~ row/record
key - optional metadata - hash of key mod num of partitions in topic picks the partition
written in batches - compressed
batch - same topic, partition
--
schema - understand msg
json, xml - type handling, compat btn schema versions
Avro - serialzn fmw - hadoop - compact, schema - payload, type, evoln
consistent data format - decouples read n write
--
categorize - topics ~ table/folder
partition ~ single commit log
append only - order guaranteed within partition
redundancy, scalability - diff servers - horiz
stream ~ topic
stream procsg - kafka streams, apache samza, storm
--
kafka clients - producer, consumer
client APIs - Connect for integration, Streams for procsg
custom partition based on biz rules
offset - metadata - int - unique within partition
cons group - topic - cons:partition ownership
--
broker - 1 server
cluster of brokers - controller-admin, partitions to brokers, monitor
partitions - replications in multiple brokers - leader and followers
--
retention - period/topic size
expire - delete
--
multi clusters - segregate data, isolation - security, multiple DCs (disaster recovery)
replication - within cluster
mirror maker - between clusters
--
multi prods - aggr
multi cons - group
disk based retention 
scalable - huge data, without going offline
high perf
data ecosys - any i/p, o/p
--



Thursday, 20 October 2022

Chapter 1, Core Java by Cay Horstmann

An Introduction To Java

Java is not just a programming language but an entire platform: libraries plus an execution environment that provides security, portability, and automatic garbage collection.

Brief history:

1991: Java founders at Sun Microsystems want to create a language suitable for consumer devices (cable TV boxes) - so it has to be small and also platform neutral.

1994: Mosaic browser - need for a language like Java. Sun creates the HotJava browser to show off Java and the craze begins.

1996: Java 1.0 - no print

1997: 1.1 - reflection, GUI event model, inner classes

1998: 1.2 - SE, ME(embedded devices), EE

2000-2002: 1.3-1.4 - libraries, performance

2004: 1.5 -> 5 - for-each loops, autoboxing, annotations, enums, static import

2006: 6 - libraries, performance

2009: Oracle buys Sun

2011: 7 - switch string, diamond op

2014: 8 - functional style prog, lambda expressions, streams, interfaces with default methods

2017: 9 - modules

2018: 11 - var (local variable type inference, first shipped in Java 10)

2021: 17-records


Buzz words:

Simple : cleaned up C++, small(embedded devices)

Object oriented : data=objects, interfaces to objects

Distributed: access objects across net(HTTP/FTP)

Robust: compile time and run time checks, memory can't be overwritten and corrupted

Secure: secure over a network

Architecture neutral : code compiled to bytecode that can run anywhere

Portable: No implementation dependency, data type sizes don't vary as they do in C++

Interpreted : executes bytecode - fast

High performance: frequently executed bytecode to machine code - hotspots - Just In Time Compiler

Multithreaded: fast and real time

Dynamic: changing libraries without impacting dependent code, adding code to running programs
