Prep

Monday, 8 June 2026

OpenSearch Cheat Sheet

Index=table/collection of docs.

Document=One JSON record inside an index.

{
  "id": "IMP123",
  "event": "UE107348",
  "name": "Service A",
  "customer": [...]
}

Field=A property inside a document.

event
name
customer.customerName
customer.billAmount

Mapping=Schema of the index. Defines field types:

Type	Used for
text	Full text search
keyword	Exact match, sorting, grouping
double, integer, long	Numeric sort/range/aggregation
date	Date filter/sort
nested	Array of objects where each object must stay logically separate

text vs keyword

text is analyzed/tokenized.

Good for search:

"Gold Customer" -> "gold", "customer"

keyword is exact.

Good for:

exact match
grouping
sorting
aggregations

"customerName": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword"
    }
  }
}

Use:

customerName          -> search text
customerName.keyword  -> exact match/group/sort
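
A rough sketch of the text vs keyword difference in Python. The `analyze` function is a simplified stand-in for an analyzer (lowercase, split into word tokens); real analyzers do more. All function names here are illustrative, not OpenSearch APIs.

```python
import re

def analyze(value):
    """Simplified stand-in for a text analyzer: lowercase, then split into word tokens."""
    return re.findall(r"[a-z0-9]+", value.lower())

def text_match(field_value, query):
    """text field: the query is analyzed too; any shared token counts as a match."""
    indexed = set(analyze(field_value))
    return any(token in indexed for token in analyze(query))

def keyword_term(field_value, query):
    """keyword field: the whole value is a single unanalyzed token, so only exact equality matches."""
    return field_value == query

tokens = analyze("Gold Customer")
print(tokens)
print(text_match("Gold Customer", "gold"))             # searched as text
print(keyword_term("Gold Customer", "gold"))           # searched as keyword
print(keyword_term("Gold Customer", "Gold Customer"))  # exact value
```

This is why customerName is searched with match and customerName.keyword with term.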

Query

Filters/selects documents.

Example:

{
  "query": {
    "term": {
      "event.keyword": {
        "value": "UE107348"
      }
    }
  }
}

This means:

Find documents where event exactly equals UE107348

term vs match

Query	Use for
term	Exact value match, usually on keyword, numbers, ids
match	Text search on analyzed text fields
range	Numeric/date range
bool	Combine multiple conditions

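Example bool (goes inside "query"; if customer is mapped as nested, the customer.customerId clause must be wrapped in a nested query):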

{
  "bool": {
    "must": [
      { "term": { "event.keyword": "UE107348" } },
      { "term": { "customer.customerId": "CUST1" } }
    ]
  }
}

Hits

Normal search results.

size

Controls how many normal documents are returned.

If you run a query with size: 10, OpenSearch returns up to 10 matching documents in hits.

Good for:

search results
document pagination

{
  "size": 10
}

Means return up to 10 hits.

size: 0

Means:

Do not return documents, only return aggregations.

Useful when you only need counts/groups/summaries.

from + size

Document pagination.

{
  "from": 100,
  "size": 20
}

Means:

Return documents 101-120

Works well for hits/documents.

Does not directly paginate aggregation buckets.
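
The from/size arithmetic can be sketched as a small helper. `page_to_from` is a hypothetical name, and pages are assumed to be 1-based.

```python
def page_to_from(page, size):
    """Translate a 1-based page number into the from/size pair of a search body."""
    if page < 1 or size < 0:
        raise ValueError("page must be >= 1 and size must be >= 0")
    return {"from": (page - 1) * size, "size": size}

body = page_to_from(page=6, size=20)
first_doc = body["from"] + 1              # 1-based position of the first returned document
last_doc = body["from"] + body["size"]    # 1-based position of the last returned document
print(body, first_doc, last_doc)
```

Page 6 with 20 per page gives from 100, which is the documents 101-120 case above.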

Aggregation

Groups/summarizes documents.

Like SQL GROUP BY.

Examples:

Aggregation	Meaning
terms	Group by exact field
max	Max numeric value
min	Min numeric value
value_count	Count values
top_hits	Return sample documents inside a bucket
nested	Enter nested object array
reverse_nested	Go back from nested object to parent document

terms aggregation

Group by field.

{
  "terms": {
    "field": "customer.customerName.keyword",
    "size": 100
  }
}

Means:

Group matching data by customer name and return top 100 buckets

Metric aggregation

Calculates value per bucket.

Example:

"customer_bill_amount": {
  "max": {
    "field": "customer.billAmount"
  }
}

Means:

For each customer bucket, calculate max bill amount

Used so buckets can be sorted by bill amount.

Bucket sorting

Example:

"order": [
  { "customer_bill_amount": "desc" }
]

Means:

Sort customer buckets by bill amount descending

nested

Needed when field is an array of objects.

Example:

"customer": [
  {
    "customerId": "C1",
    "billAmount": 100
  },
  {
    "customerId": "C2",
    "billAmount": 500
  }
]

Without nested, OpenSearch can mix values from different array objects incorrectly.

Nested keeps each customer object separate.
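
A small Python simulation of why the mixing happens. It assumes the usual explanation that a non-nested array of objects is indexed as one multi-valued list per sub-field. The function names are made up for illustration.

```python
doc = {
    "customer": [
        {"customerId": "C1", "billAmount": 100},
        {"customerId": "C2", "billAmount": 500},
    ]
}

def flatten(objects):
    """Non-nested mapping: each sub-field becomes one multi-valued list, losing object boundaries."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(key, []).append(value)
    return flat

def match_flattened(objects, customer_id, bill_amount):
    """Each condition is checked against the merged lists, so values from different objects can combine."""
    flat = flatten(objects)
    return customer_id in flat["customerId"] and bill_amount in flat["billAmount"]

def match_nested(objects, customer_id, bill_amount):
    """Nested mapping: both conditions must hold inside the same object."""
    return any(
        o["customerId"] == customer_id and o["billAmount"] == bill_amount
        for o in objects
    )

false_positive = match_flattened(doc["customer"], "C1", 500)
correct = match_nested(doc["customer"], "C1", 500)
print(false_positive, correct)
```

C1 never had a bill amount of 500, yet the flattened check matches because C1 and 500 both exist somewhere in the array.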

nested query

Search inside nested object.

{
  "nested": {
    "path": "customer",
    "query": {
      "term": {
        "customer.customerId": "C1"
      }
    }
  }
}

Means:

Find impact docs where one nested customer has customerId C1

nested aggregation

Aggregate inside nested object.

{
  "nested": {
    "path": "customer"
  }
}

Means:

Go into the customer nested array and aggregate customer rows

top_hits

Returns actual documents/objects inside an aggregation bucket.

Example:

"customer_details": {
  "top_hits": {
    "size": 1
  }
}

Means:

For each customer bucket, return one sample hit to get details

Caution:

top_hits inside many buckets can become expensive.
top_hits size is capped by the index setting index.max_inner_result_window (default 100).
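
Putting the pieces together: a sketch that builds one request body combining size: 0, a nested aggregation, a terms aggregation ordered by a max metric, and top_hits. The aggregation names `customers` and `by_customer` and the helper name are assumptions for illustration. The field names are the ones used in the examples above.

```python
import json

def customers_by_bill_amount(event_id, max_buckets=100):
    """Build a search body: customers of one event, highest bill amount first."""
    return {
        "size": 0,  # no normal hits, aggregations only
        "query": {"term": {"event.keyword": {"value": event_id}}},
        "aggs": {
            "customers": {  # hypothetical name: step into the nested customer array
                "nested": {"path": "customer"},
                "aggs": {
                    "by_customer": {  # hypothetical name: one bucket per customer name
                        "terms": {
                            "field": "customer.customerName.keyword",
                            "size": max_buckets,
                            "order": [{"customer_bill_amount": "desc"}],
                        },
                        "aggs": {
                            # metric the buckets are sorted by
                            "customer_bill_amount": {"max": {"field": "customer.billAmount"}},
                            # one sample nested object per bucket for details
                            "customer_details": {"top_hits": {"size": 1}},
                        },
                    }
                },
            }
        },
    }

body = customers_by_bill_amount("UE107348")
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after GET abcIndex/_search in Dev Tools.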

Scoring

OpenSearch gives _score when doing relevance search.

Useful for:

text search
fuzzy search
best match ranking

-------------
GET _cat/indices?v

GET _cat/indices/abcIndex-*?v

GET abcIndex/_mapping

GET abcIndex/_settings

index name with a version suffix - the actual index
index name without a version - alias that points to the versioned index

Thursday, 26 March 2026

Helidon

Eclipse MicroProfile - additional specs on top of Jakarta EE for microservices.

Helidon - one of the MicroProfile implementations.

Lightweight, fast for microservices.

main() method class - Server server = Server.create().start();

During server startup, application.yaml is read.

backpressure - slow down, pause, close

@Outgoing("toABCChannel")

--map to SubmissionPublisher during app start

channels defined in app yaml

Sunday, 23 November 2025

Kafka The Definitive Guide Notes - Part 4

Connect - kafka to other systems
OOB - best if code is not in our control
APIs, runtime
conn plugins - libraries executed- moves data
REST API - config n manage plugins
conn - tasks(parallel, resources)
worker process-conn data obj - tasks - src/sink
convertors - data stored in diff formats
prodn - sep servers from brokers
standalone, distri
boot, group.id(conn cluster), kv converter
conv - schema
mirror maker 2 - plugin
rebalance - conn del
src - 1 topic, sink - multi topics
conn - github/custom
conn - how many tasks, split data btn tasks, worker -> config, task
tasks - in & out of kafka
worker - context - task init
worker - container proc, execs conns, tasks, handles http reqs, store conn config, start conn n tasks, auto comit offset, retries
data model, schema
------------------------
cross cluster data mirror - mm2
regional & central, redundancy, cloud mign
cross DC comm
1 cluster/DC
consume from remote rather than produce to remote
--
hub & spoke-multi local+ 1 central
active-active- 2+DCs share data, sync issues
active-passive-disaster - copy
failover - loss/dups
start after failover - begin/end, auto offset
---
stretch cluster - entire DC fail
mirror btn 2 dcs
docker image
Uber uReplicator
--

Kafka The Definitive Guide Notes - Part 3

reqs - standard header - req type, version, correl id, client id
port - broker listens - acceptor thread - creates conn - processor thread - handles: puts reqs in req Q, sends resps from resp Q to clients
should be sent to proper partn which is leader
where to send - req type - metadata - list of topics, pns, replicas, leader replica
all brokers - metadata cache
--
PR
validations - privileges to send, acks -all- enough in sync replicas?
req buffer - purgatory
leader - responds after replicas
msgs - local disk, linux - FS cache
--
FR
offset, P, T, limit - min, max
zerocopy - dir file to n/w
consumer can see only msgs replicated to all in sync replicas
--
storage unit - partn replica
partn - cant be split btn multi brokers or disks
log.dirs - partn storage
--
T create - partn alloc
round robin, inc offset from leader, alternating racks
add new partn to least amount of partn dir
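
A simplified sketch of the allocation idea: leaders go round robin across brokers, and each follower sits at an increasing offset from its leader. Rack alternation is left out, the start broker is passed in rather than chosen at random, and the real broker logic has more detail. `assign_replicas` is a hypothetical name.

```python
def assign_replicas(num_brokers, num_partitions, replication_factor, start_broker=0):
    """Return {partition: [leader, follower1, ...]} using round-robin leaders."""
    if replication_factor > num_brokers:
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignment = {}
    for partition in range(num_partitions):
        leader = (start_broker + partition) % num_brokers
        # follower i is placed i brokers after the leader, wrapping around
        assignment[partition] = [
            (leader + i) % num_brokers for i in range(replication_factor)
        ]
    return assignment

plan = assign_replicas(num_brokers=6, num_partitions=10, replication_factor=3, start_broker=4)
for partition, replicas in plan.items():
    print(partition, replicas)
```

No two replicas of a partition land on the same broker, and leaders are spread evenly.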
--
partn - split to segs-retention size/duration
current/active seg never deleted
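
A minimal sketch of time-based retention over segments, assuming each segment is described by a base offset and a last-modified time. The dict layout and function name are made up for illustration. The point is that the active (last) segment is skipped even when it is old.

```python
def segments_to_delete(segments, now_ms, retention_ms):
    """segments: list of dicts ordered oldest to newest; the last one is the active segment."""
    closed = segments[:-1]  # the active segment is never eligible for deletion
    return [
        s["base_offset"]
        for s in closed
        if now_ms - s["last_modified_ms"] > retention_ms
    ]

segments = [
    {"base_offset": 0, "last_modified_ms": 1_000},
    {"base_offset": 100, "last_modified_ms": 2_000},
    {"base_offset": 200, "last_modified_ms": 3_000},  # active segment of an idle partition
]
expired = segments_to_delete(segments, now_ms=100_000, retention_ms=10_000)
print(expired)
```

All three segments are older than the retention period, but only the two closed ones are returned.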
--
format - seg: file - msg, offset, k, v, size, checksum, timestamp etc
DumpLogSegments tool
--
index: partn, offset:seg mapping, position
--
compaction--store only latest value
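
Compaction in a few lines of Python, as a sketch only. It keeps the latest record per key and preserves offset order. Tombstones (null values) and the rule that the active segment is not compacted are ignored here.

```python
def compact(log):
    """log: list of (offset, key, value). Keep only the latest record for each key."""
    latest_offset = {}
    for offset, key, _value in log:
        latest_offset[key] = offset  # later records overwrite earlier ones
    return [record for record in log if latest_offset[record[1]] == record[0]]

log = [
    (0, "k1", "a"),
    (1, "k2", "b"),
    (2, "k1", "c"),
    (3, "k3", "d"),
    (4, "k2", "e"),
]
compacted = compact(log)
print(compacted)
```

Offsets are not renumbered after compaction, so gaps in the offset sequence are expected.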
--
reliable data delivery-order within partn, commit - all in sync replicas, committed wont be lost as long as at least one in sync is alive, consumer can read only committed
follower - active z sessions, fetch latest from leader
--
replcn factor-3
unclean leader elec
min in syncs
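
How acks and min in-sync replicas interact, sketched as a decision function. This is a simplification of broker behaviour and the returned strings are only descriptive. The parameter names mirror the acks and min.insync.replicas settings.

```python
def produce_outcome(acks, in_sync_replicas, min_insync_replicas):
    """Describe what happens to a produce request under the given settings."""
    if acks == "0":
        return "sent: producer does not wait for any broker acknowledgement"
    if acks == "1":
        return "acknowledged: once the leader has written the record"
    if acks == "all":
        # the minimum in-sync check only applies when acks=all
        if in_sync_replicas < min_insync_replicas:
            return "rejected: not enough in-sync replicas"
        return "acknowledged: once all in-sync replicas have the record"
    raise ValueError("acks must be '0', '1' or 'all'")

# replication factor 3, min in-sync 2, but only the leader is in sync
print(produce_outcome("all", in_sync_replicas=1, min_insync_replicas=2))
print(produce_outcome("1", in_sync_replicas=1, min_insync_replicas=2))
print(produce_outcome("all", in_sync_replicas=3, min_insync_replicas=2))
```

With replication factor 3 and min in-sync 2, one broker can be down and acks=all writes still succeed.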
--
commit configs, freq
offset reset - earliest/latest
auto, explicit
--
exactly once - UK
--
pipeline - kafka -endpoint/mid
timeline - bulk/ stream
reliability, thruput, in mem, coupling,
--

Kafka The Definitive Guide Notes - Part 2

Zookeeper - metadata about cluster(brokers, topic)
kraft - replaces Z
Z cluster - ensemble-odd num of servers - 5
dont go beyond 7
common config
initLimit, syncLimit - followers-leader
leader election
controller
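
Why the ensemble has an odd number of servers: a majority (quorum) must stay up, so adding one server to an odd ensemble adds no extra fault tolerance. A quick arithmetic sketch with illustrative function names:

```python
def quorum(ensemble_size):
    """Smallest majority of the ensemble."""
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size):
    """Servers that can fail while a majority is still available."""
    return ensemble_size - quorum(ensemble_size)

table = {n: (quorum(n), tolerated_failures(n)) for n in (3, 4, 5, 7)}
for size, (majority, failures) in table.items():
    print(size, majority, failures)
```

An ensemble of 4 tolerates one failure, the same as 3, while 5 tolerates two.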
---
broker.id - unique in cluster
port - 9092
z.conn - broker meta
log.dirs
partitions ~ brokers
h/w, rack
---
producer record - topic, value - optional: partition, key
serializer
partitioner
topic
partition
batch
broker
--
prodr - bootstrap.servers, key.serial, val.serial
fire & forget, sync send-p.send(pr).get(), async send - p.send(pr, new CBF())
--
acks - 0,1, all - how many replicas must receive b4 prodr can consider it successful
retries
batch size
client.id
--
custom serial, avro
--
PR - t, k, v
msg - (k, v)--null k : round robin, same key -same partition
num of partitions consistent - mapping consistent
custom partitioning
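
Key to partition mapping, sketched in Python. The idea is hash(key) mod number of partitions. `zlib.crc32` is used here only as a dependency-free stand-in for the hash, so the partition numbers will not match what a real Kafka client picks. `choose_partition` is a hypothetical name.

```python
import zlib

def choose_partition(key, num_partitions):
    """Stand-in for keyed partitioning: hash the key bytes, then mod by the partition count."""
    return zlib.crc32(key) % num_partitions

keys = [f"customer-{i}".encode() for i in range(100)]

# same key, same partition count: always the same partition
first = choose_partition(b"customer-42", 6)
second = choose_partition(b"customer-42", 6)

# changing the partition count moves some keys to different partitions
moved = [k for k in keys if choose_partition(k, 6) != choose_partition(k, 8)]
print(first == second, len(moved))
```

This is why the mapping is only consistent while the number of partitions stays the same.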
--
consumer group - receives all msgs
consumers - part of them from partitions
rebalance - ownership
group coord - heartbeats to broker
polls n commits consumed recs
JoinGroup req to GC
1st cons to join - Group Leader - list of cons from GC- assigns subset of partns to cons
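
What the group leader does with the consumer list can be sketched as a round-robin assignment. Real assignors (range, round robin, sticky) differ in detail. The function name and the consumer ids are made up.

```python
def assign_partitions(consumers, partitions):
    """Spread partitions over consumers round-robin; returns {consumer: [partitions]}."""
    ordered = sorted(consumers)
    if not ordered:
        raise ValueError("at least one consumer is required")
    assignment = {consumer: [] for consumer in ordered}
    for index, partition in enumerate(sorted(partitions)):
        assignment[ordered[index % len(ordered)]].append(partition)
    return assignment

before = assign_partitions(["c1", "c2"], range(4))
after = assign_partitions(["c1", "c2", "c3"], range(4))  # c3 joins: rebalance
idle = assign_partitions(["c1", "c2", "c3"], range(2))   # more consumers than partitions
print(before)
print(after)
print(idle)
```

Each partition is owned by exactly one consumer in the group, and a consumer beyond the partition count gets nothing.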
--
cons - boot, deserial, group id
c.subscribe(t)
loop - c.poll
1 cons/thread
--
update current position in partn - commit
reads latest committed offset
auto commit, sync, async
--
rebalance code hooks - clean up
--
broker start - reg id Z - ephemeral node
conn lost - auto node removed, watch list of brokers notified, id still exists in other DS
--
controller - 1st ephe node - broker fnct+ partn leader election-Z watch
leader - serves reqs, follower - replicates
new broker - controller checks for replicas - notifies
--
replication - distri, partnd, replicated commit log -availability,durability
partn - scale, parallel, order
out of sync followers cant become leaders
preferred leader-t creation with balanced load
--

Kafka The Definitive Guide Notes - Part 1

Data - logs, metrics, activity, messages- notifications etc

DBs, systems - store data,KV stores, search indexes, caches

How to move - flow of data - publisher/subscriber

messaging systems-ActiveMQ, RabbitMQ, MQSeries - pub & sub

big data - hadoop - real time, store, process periodically, continuous low latency processing, data warehousing
log aggregations,
ETL/transformation tools-not system - stream centric

continuously evolving and ever growing stream

LinkedIn - internal infra - streaming platform - pub & sub to streams of data
store, process
modern distri sys - cluster, scale elastically
storage - guaranteed delivery, replicate persisted data
stream processing - compute derived streams, dynamic datasets - less code

-------------------------------------------------------

pub - classify
broker
sub

kafka - distri commit log/streaming
unit of data - message ~ row/record
optional metadata - key - hash of key mod num of partitions in topic picks the partition
written in batches - compressed
batch - same topic, partition
--
schema - understand msg
json, xml - type handling, compat btn schema versions
Avro - serialzn fmw - hadoop - compact, schema - payload, type, evoln
consistent data format - decouples read n write
--
categorize - topics ~ table/folder
partition ~ single commit log
append only - order guaranteed within partition
redundancy, scalability - diff servers - horiz
stream ~ topic
stream procsg - kafka streams, apache samza, storm
--
kafka clients - producer, consumer
client APIs - Connect for integration, Streams for procsg
custom partition based on biz rules
offset - metadata - int - unique within partition
cons group - topic - cons:partition ownership
--
broker - 1 server
cluster of brokers - controller-admin, partitions to brokers, monitor
partitions - replications in multiple brokers - leader and followers
--
retention - period/topic size
expire - delete
--
multi clusters - segregate data, isolation-security, multiple DCs(disaster recovery)
replication - within cluster
mirror maker - between clusters
--
multi prods - aggr
multi cons - group
disk based retention
scalable - huge data, without going offline
high perf
data ecosys - any i/p, o/p
--

Thursday, 20 October 2022

Chapter 1 , Core Java by Cay Horstmann

An Introduction To Java

Java is not just a programming language, but an entire platform of libraries, execution environment - security, portability, automatic garbage collection.

Brief history:

1991 : Java founders at Sun Microsystems want to create a language suitable for consumer devices(cable) - so it has to be small and also platform neutral.

1994 : Mosaic browser - need for a language like java. Sun creates HotJava browser to show off java and the craze begins.

1996: Java 1.0 - no printing support

1997: 1.1 - reflection, GUI event model, inner classes

1998: 1.2 - SE, ME(embedded devices), EE

2000-2002: 1.3-1.4 - libraries, performance

2004: 1.5-> 5 - for-each loops, autoboxing, annotations, enums, static import

2006: 6 - libraries, performance

2009: Oracle buys Sun

2011: 7 - switch string, diamond op

2014: 8 - functional style prog, lambda expressions, streams, interfaces with default methods

2017: 9 - modules

2018: 11 - var (local variable type inference, first shipped in Java 10, also 2018)

2021: 17-records

Buzz words:

Simple : cleaned up C++, small(embedded devices)

Object oriented : data=objects, interfaces to objects

Distributed: access objects across net(HTTP/FTP)

Robust: compile time and run time checks, memory cant be overwritten and corrupted

Secure: secure over a network

Architecture neutral : code compiled to bytecode that can run anywhere

Portable: No implementation dependency, data type sizes don't vary, unlike in C++

Interpreted : executes bytecode - fast

High performance: frequently executed bytecode to machine code - hotspots - Just In Time Compiler

Multithreaded: fast and real time

Dynamic: changing libraries without impacting dependent code, adding code to running programs