
Commit afed545

backfill oct
1 parent 3862912 commit afed545

File tree

13 files changed: +314 -0 lines changed


posts/011025.md

Lines changed: 7 additions & 0 deletions

---
title: 'b10 h100s'
tags: 'journal'
date: 'Oct 1, 2025'
---

had the opportunity to play around with b10 inference. you can tweak the min and max replicas and the concurrency targets, and h100s cost $0.133 per minute, which adds up to ~$191.5 per day per replica. if you spin up 20 of these it's ~$3,830 a day. is that expensive or not? i'm not sure. but that is vc money being put to good use for sure, especially given the use case.
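
a quick sanity check of that math (the only input is the per-minute rate; everything else is multiplication):

```py
rate_per_minute = 0.133              # $/min for one h100 replica
per_day = rate_per_minute * 60 * 24  # ≈ 191.52 $/day per replica
twenty_replicas = per_day * 20       # ≈ 3,830 $/day for the fleet
print(f"${per_day:.2f}/day per replica, ${twenty_replicas:,.0f}/day for 20")
```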

posts/021025.md

Lines changed: 41 additions & 0 deletions

---
title: 'guided decoding'
tags: 'journal, llm'
date: 'Oct 2, 2025'
---

got to finetune llama 8b and 70b and gemma models for the task.

i'm actually doing finetuning rather than just writing prompts now. it's really fun.

also looked into [guided decoding](https://guideddecoding.github.io/), which is how you can guarantee llms output valid structured data.

traditional decoding samples from the full vocab:

`p(token_t | context)` -> softmax over the entire vocab

this means the model can generate anything. sometimes you get valid json, sometimes not.

the solution: guided decoding masks invalid tokens at each step:

`p(token_t | context, constraints)` -> softmax over valid_tokens

how it works: a finite state automaton (FSA) - basically a lookup table that says, in state x, these are the valid tokens. (a toy sketch follows these steps.)

1. compile constraints -> FSA (one time, cached). your json schema becomes a state machine
2. during generation:
   - check the current FSA state
   - look up which tokens are valid
   - mask everything else
   - sample from the valid tokens only
3. after each token:
   - update the fsa state
   - repeat
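
a toy sketch of that loop, with a made-up 5-token vocab and a hand-written FSA table (real backends compile this table from your schema; this is not any particular library's implementation):

```py
import numpy as np

# toy setup: the fsa only accepts {"name":"bob"}
vocab = ['{', '"name"', ':', '"bob"', '}']
valid_tokens = {               # fsa state -> token ids allowed in that state
    0: {0},                    # must open with {
    1: {1},                    # then the key
    2: {2},                    # then :
    3: {3},                    # then the value
    4: {4},                    # then close with }
}
transitions = {(s, t): s + 1 for s, toks in valid_tokens.items() for t in toks}
DONE = 5

def constrained_step(logits, state):
    # mask everything the fsa doesn't allow in this state, then sample
    masked = np.full_like(logits, -np.inf)
    allowed = list(valid_tokens[state])
    masked[allowed] = logits[allowed]
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    token_id = int(np.random.choice(len(vocab), p=probs))
    return token_id, transitions[(state, token_id)]

state, out = 0, []
while state != DONE:
    logits = np.random.randn(len(vocab))   # stand-in for the model's logits
    token_id, state = constrained_step(logits, state)
    out.append(vocab[token_id])
print("".join(out))                        # always {"name":"bob"}
```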

vLLM supports three backends for this:

- outlines: good for regex
- lm-format-enforcer: character level
- xgrammar: optimized for nested structures

the con? the overhead: 5-15% slower generation.
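
for reference, this is roughly how you'd ask a vllm openai-compatible server for guided json. the `guided_json` extra-body field is vllm's structured-output extension; treat the exact field name and the model id here as assumptions to check against your version:

```py
from openai import OpenAI

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # local vllm server
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",    # whatever the server is serving
    messages=[{"role": "user", "content": "make up a user"}],
    extra_body={"guided_json": schema},          # constrain decoding to the schema
)
print(resp.choices[0].message.content)           # parses as the schema every time
```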

posts/031025.md

Lines changed: 9 additions & 0 deletions

---
title: 'model evals'
tags: 'journal'
date: 'Oct 3, 2025'
---

the models were finetuned today. with an h100, it only took a few hours for the 7b, and less for gemma 4b. the 70b on the other hand is a beefy one, and i ran it without fsdp or deepspeed, which would've sped up finetuning considerably but is tough to set up.
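
for context, a minimal sketch of the knob i skipped (hf trainer's fsdp flag; the arg names are from memory and the rest of the training setup is omitted, so treat this as an assumption rather than my actual config):

```py
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    bf16=True,
    fsdp="full_shard auto_wrap",   # shard params/grads/optimizer state across gpus
)
```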

with claude code, making plots and performance reports is so easy. adhoc scripts are all one-shotted.

posts/041025.md

Lines changed: 11 additions & 0 deletions

---
title: 'poetry, pyenv, direnv'
tags: 'journal'
date: 'Oct 4, 2025'
---

made prs for different parts of my code. it wasn't until i started at oe that i realized the importance of compartmentalizing parts of my code so it's easier to review.

also set up a poetry development environment with pyenv, and direnv for auto-activation. python envs are still a headache. i now just use uv. i believe uv solves everything.

went to pick up the pottery we made at mud studio. they actually turned out so well. i would love to do it again with T.

posts/051025.md

Lines changed: 9 additions & 0 deletions

---
title: 'aus -> miami'
tags: 'journal'
date: 'Oct 5, 2025'
---

flew southwest to miami. the flight was 3 hours. i sketched out all the components i needed to put everything together. i've been flying a lot lately. one of the biggest perks of working here is the amount of exposure to new things. just 2 months ago i had never been to texas or florida. now i live in austin, and i might even relocate to miami. things are moving fast, and i'm not sure i'm even keeping up with who i'm becoming.

R and i got to east miami and we went for sushi soon after. it was at a mall beside our hotel. the food was food truck quality, but it was a nice hangout with him. the hotel is nice. it has a balcony where you can look out over other fancy high rises with large balconies of their own. miami reminds me of malaysia a lot.

posts/061025.md

Lines changed: 13 additions & 0 deletions

---
title: 'miami office'
tags: 'journal'
date: 'Oct 6, 2025'
---

i got room service. an acai bowl and some sausages and a smoothie.

had a chat with e about the plan and updates. the goal was to speed things up and try to fit the 70b on one h100 instead of two. tried out online dynamic quantization with vllm (fp8). it was actually slower than before and also still required the same amount of memory. pointless.
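
for reference, turning that on is a one-liner in vllm (the model id is a placeholder; `quantization="fp8"` is the online dynamic quantization path as i understand it):

```py
from vllm import LLM, SamplingParams

# weights get quantized to fp8 at load time; no pre-quantized checkpoint needed
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",   # placeholder
    quantization="fp8",
    tensor_parallel_size=1,                      # the goal: a single h100
)
out = llm.generate(["hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```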

next explored qlora training with bitsandbytes. kept facing 0.0 grad_norm issues; tried to debug it the entire day and couldn't figure out why. a problem for another day to solve.
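
not the actual training script, but a generic sketch of the qlora setup i'd start from (the model id is a placeholder), plus the first sanity check for a 0.0 grad_norm:

```py
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",            # placeholder
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)   # re-enables input grads, upcasts norms
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# if this prints 0 trainable params, grad_norm will be 0.0 no matter what
model.print_trainable_parameters()
```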

at night all of us went to the rooftop and everyone shared their past failures and projects. i felt lucky to be here at a stage where things are more stable and growth is skyrocketing, but i also wished i had been part of the early experience, just building fun experimental features and making mistakes. i'm at a point where i feel like i'm the new guy that doesn't fit in anywhere, and it's hard to bond when i have little in common with everyone. but i'm sure i will find my place eventually. insecurity and overthinking stem from thinking too much about myself. i just need to relax and be myself.

posts/071025.md

Lines changed: 7 additions & 0 deletions

---
title: 'asking for help'
tags: 'journal'
date: 'Oct 7, 2025'
---

finding it hard to reach out to people and ask for help when everyone is busy with their own projects. also finding it hard to communicate and chat with people around me because i'm not that familiar with the language and culture yet. i can feel my brain constantly going: you should participate in this convo, but what should i say? should i say it now? if i say it, would it be weird? i'm already new, do i want this to be their impression of me? my overthinking muscle goes hyperactive and i just end up wanting to stick my head in the ground. i am an awkward boy. i'm still learning to accept that fact, and also to be more confident in taking little stabs at conversing and making jokes. it is an art. i'm practicing imitation learning 24/7.

posts/081025.md

Lines changed: 125 additions & 0 deletions

---
title: 'partial'
tags: 'python'
date: 'Oct 8, 2025'
---

understanding partial() with async/futures

## the basic idea

partial "freezes" args in a fn so you don't have to pass them every time

```py
from functools import partial

def add(a, b, c):
    return a + b + c

add_5_and_10 = partial(add, 5, 10)
add_5_and_10(3)  # returns 18 (same as add(5, 10, 3))
```

## the problem: fetching from multiple APIs

imagine you need to fetch user data from 3 different API endpoints at the same time

here's the messy way:

```py
import asyncio
from functools import partial
from concurrent.futures import ThreadPoolExecutor

def fetch_data(user_id, api_endpoint, timeout=30, retry=3, api_key="secret"):
    return f"Data from {api_endpoint} for user {user_id}"

async def get_user_data_messy(user_id):
    executor = ThreadPoolExecutor()
    loop = asyncio.get_running_loop()  # inside a coroutine, grab the running loop

    # repetition: every call spells out the same args
    future1 = loop.run_in_executor(
        executor,
        lambda: fetch_data(user_id, "profile", 30, 3, "secret")
    )
    future2 = loop.run_in_executor(
        executor,
        lambda: fetch_data(user_id, "orders", 30, 3, "secret")
    )
    future3 = loop.run_in_executor(
        executor,
        lambda: fetch_data(user_id, "reviews", 30, 3, "secret")
    )

    results = await asyncio.gather(future1, future2, future3)
    return results
```

the clean way with partial:

```py
async def get_user_data_clean(user_id):
    executor = ThreadPoolExecutor()
    loop = asyncio.get_running_loop()

    # lock in the common args once
    fetcher = partial(
        fetch_data,
        user_id=user_id,
        timeout=30,
        retry=3,
        api_key="secret"
    )

    endpoints = ["profile", "orders", "reviews"]

    futures = [
        loop.run_in_executor(executor, partial(fetcher, api_endpoint=ep))
        for ep in endpoints
    ]

    results = await asyncio.gather(*futures)
    return results
```

## why the double partial

```py
loop.run_in_executor(executor, partial(fetcher, api_endpoint=ep))
```

here's what's actually happening:

```py
# first partial: lock in the common stuff
fetcher = partial(fetch_data, user_id=user_id, timeout=30, retry=3, api_key="secret")

# second partial: add the specific endpoint
profile_fetcher = partial(fetcher, api_endpoint="profile")

# now profile_fetcher() is a zero-argument callable
# calling it is the same as: fetch_data(user_id, "profile", 30, 3, "secret")
```

## seeing it run

```py
import time

def fetch_data(user_id, api_endpoint, timeout=30, retry=3, api_key="secret"):
    time.sleep(1)  # pretend this is an API call
    return f"Data from {api_endpoint} for user {user_id}"

async def main():
    start = time.time()
    results = await get_user_data_clean(12345)
    print(f"completed in {time.time() - start:.2f}s")
    print(results)
    # completed in 1.01s (all 3 APIs ran at the same time)
    # ['Data from profile for user 12345',
    #  'Data from orders for user 12345',
    #  'Data from reviews for user 12345']

asyncio.run(main())
```

posts/091025.md

Lines changed: 7 additions & 0 deletions

---
title: 'mia -> sf'
tags: 'journal'
date: 'Oct 9, 2025'
---

flight in the early evening. i went walking around the hotel and got an acai bowl for breakfast. i've found i like eating these, though the sugar content is worrying. i checked out, left my bag, and went to work out of the capital one cafe. then went to a slop bowl restaurant for lunch. it started drizzling lightly, then it came all at once. just like malaysia. i decided to leave for the airport early, and the ride turned out to be an hour long instead of 20 minutes. i had a premonition perhaps. upon arrival and entering security, i fell victim to an incredibly annoying flaw of the miami airport – the checkpoints are segmented by concourse, which means if you went through security for concourse G, you can't enter concourse H. this was my first time having my bags and body checked twice before i finally got to the right gate. luckily i still had time to get food for my 6 hour and 30 min journey to sf. arriving at W's house after the grueling flight, words came pouring out of my mouth. i trauma dumped for an hour or two. all the emotions and feelings pent up inside while i worked and worked.

posts/101025.md

Lines changed: 48 additions & 0 deletions

---
title: 'performance client'
tags: 'rust, python, sf'
date: 'Oct 10, 2025'
---

learned about batch calls using the b10 [performance client](https://github.com/basetenlabs/truss/tree/main/baseten-performance-client) today

the problem it solves: even with async, you're bottlenecked by

- python's GIL (no true parallelism)
- no smart batching
- no request hedging (p99 latency kills you)

what is request hedging?

imagine you send requests, and 99% of them come back in 100ms, but 1% take 5s due to the network or a slow replica

request hedging is: after X ms, send a duplicate request; whichever finishes first wins, and the slow one gets cancelled

it's like calling an uber, a waymo, and a lyft, and whichever arrives first, you get on; the rest you cancel. (wouldn't that be a great app)

the catch: it costs extra requests. you can cap this at a budget with b10
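
a toy illustration of the idea in plain asyncio (not how the client implements it; `call_model` and `payload` in the usage line are stand-ins for whatever request you're hedging):

```py
import asyncio

async def hedged(make_request, hedge_delay=0.5):
    """fire a backup request after hedge_delay; first to finish wins, the loser is cancelled"""
    first = asyncio.create_task(make_request())
    try:
        # if the first attempt beats the hedge delay, we're done
        return await asyncio.wait_for(asyncio.shield(first), timeout=hedge_delay)
    except asyncio.TimeoutError:
        second = asyncio.create_task(make_request())
        done, pending = await asyncio.wait({first, second}, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()
        return done.pop().result()

# usage: result = await hedged(lambda: call_model(payload), hedge_delay=0.5)
```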

```py
from baseten_performance_client import PerformanceClient

client = PerformanceClient(
    base_url="https://api.openai.com",
    api_key="your-key"
)

texts = ["doc " + str(i) for i in range(100000)]

response = client.embed(
    input=texts,
    model="text-embedding-3-small",
    batch_size=128,                  # pack by count
    max_chars_per_request=50000,     # or by chars (whichever limit hits first)
    max_concurrent_requests=256,
    hedge_delay=0.5                  # send a duplicate after 0.5s
)
```

---

went to the ferry building and tried lunette, the cambodian restaurant. the pork noodle soup was decent, esp for the $28 price tag. i had high expectations after watching that yt video, but i cannot trust youtubers. went to the main library and picked up two books from the bookstore, then worked out of there for a few hours. walked to ikea to get some meatballs. then worked out of saluhall, a modern food hall with tasteful decor and lights. i sat there getting more work done before i rushed to pick up the chicken rice i ordered from Gai and Rice and to catch my waymo home.
