Cirilla is an open-source learning project that implements various LLMs. Its main focus is showing how to build, train, run inference on, and deploy an LLM from scratch using PyTorch and a budget-friendly GPU (RTX 4060 Ti 16 GiB, ~$500).
Fig.1 Ciri Gwent card by Bogna Gawrońska
Cirilla Fiona Elen Riannon, known as Ciri, is one of the central characters in The Witcher saga by Andrzej Sapkowski and its adaptations. She is the princess of Cintra, granddaughter of Queen Calanthe, and the sole heir to a powerful lineage marked by the mysterious Elder Blood.
Ciri is defined by her destiny, adaptability, and potential. Unlike kings who wield authority by birthright, her strength comes from surviving chaos, learning from mentors like Geralt and Yennefer, and unlocking extraordinary powers.
Her unique abilities make her one of the most pivotal figures in the saga. Known as the Lady of Space and Time, the Lion Cub of Cintra, and the Child of the Elder Blood, she can manipulate space and time, travel between worlds, and influence the course of events in ways few can.
On a high level: imagine a toddler with a huge amount of knowledge but still possessing a toddler-like way of reasoning and understanding.
On a lower level: an LLM is a neural network trained on large amounts of text to recognize patterns, generate human-like responses, and predict the most likely next word in a given context. While it can process and recall information efficiently, it lacks true understanding, reasoning, or consciousness, relying on statistical correlations rather than genuine comprehension. Projects such as DeepSeek (most notably) work on improving LLM reasoning by enhancing the ability to use context and simulate human-like reasoning.
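To make "predict the most likely next word" concrete, here is a minimal sketch that uses plain bigram counts instead of a neural network. It is only an illustration of the statistical idea: a real LLM works on tokens and learns its probabilities with a transformer, and the names `train_bigram` and `predict_next` are hypothetical, not part of Cirilla.

```python
from collections import Counter, defaultdict


def train_bigram(text: str) -> dict[str, Counter]:
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    counts: dict[str, Counter] = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts: dict[str, Counter], word: str):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


corpus = (
    "the witcher hunts monsters . "
    "the witcher hunts monsters . "
    "the witcher rides roach ."
)
model = train_bigram(corpus)
print(predict_next(model, "witcher"))  # hunts
```

The model answers "hunts" only because that word followed "witcher" most often in the corpus, which is the same kind of statistical correlation described above, just at a much smaller scale.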
Documentation
OllamaCurate
class cirilla.synth_data.OllamaCurate(model, system_prompt, response_template)
General class for creating synthetic datasets with ollama
| Argument | Type | Description |
model | str | name of ollama model e.g. "llama3.1:8b" |
system_prompt | str | system prompt for the chosen ollama model, used by its functions: __call__, dynamic_hierarchical_summary and single_pass_summary |
response_template | pydantic.BaseModel | template for structured responses, used by its functions: __call__, dynamic_hierarchical_summary, single_pass_summary and multi_turn |
__call__(paths, save_to, seed, checkpoint, skip) → None
Create synthetic instructions (question-answer pairs)
to see how to turn the instructions into a .jsonl file see the rm_duplicate_instructs function
| Argument | Type | Description |
paths | list[Path] | paths to .txt files containing texts to create question answer pairs |
save_to | Path | folder to save generated question answer pairs to |
seed | int | seed for the ollama model |
checkpoint | int | save the generated question answer pairs to the save_to folder after checkpoint iterations |
skip | bool | if True, skip the already existing question answer pairs, else save as existing_name_{i} where i is the number of already existing files of the same name |
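The naming rule behind skip can be sketched as follows. This is an assumption-based illustration of the behaviour described in the table, not the library's code: `resolve_output_path` is a hypothetical helper, and the sketch assumes one .json file per input text.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_output_path(save_to: Path, stem: str, skip: bool) -> Optional[Path]:
    """Hypothetical helper: pick the file name for a newly generated pair."""
    target = save_to / f"{stem}.json"
    if not target.exists():
        return target  # nothing saved for this text yet
    if skip:
        return None  # skip=True: leave the existing pair untouched
    # skip=False: count the files already saved for this text
    # (stem.json, stem_1.json, ...) and use that count as the suffix.
    existing = 0
    for path in save_to.iterdir():
        suffix = path.stem[len(stem):]
        is_numbered = suffix.startswith("_") and suffix[1:].isdigit()
        if path.suffix == ".json" and path.stem.startswith(stem) and (suffix == "" or is_numbered):
            existing += 1
    return save_to / f"{stem}_{existing}.json"


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    first = resolve_output_path(folder, "Deadly Plot", skip=False)
    first.write_text("{}")
    second = resolve_output_path(folder, "Deadly Plot", skip=False)
    skipped = resolve_output_path(folder, "Deadly Plot", skip=True)
    first_name, second_name = first.name, second.name

print(first_name, second_name, skipped)
```

With skip=False, repeated runs over the same texts therefore accumulate several candidate pairs per text, which is what rm_duplicate_instructs later cleans up.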
Example:
import os
import random
from pydantic import BaseModel, Field
from cirilla.synth_data import OllamaCurate

class Response(BaseModel):
    question: str = Field(description="What question is appropriate to this text?")
    answer: str = Field(description="Answer to the question")
sys_prompt = """
You are an expert dataset annotator for instruction-tuning large language models.
Your task is to create high-quality question-answer pairs from provided texts
for training instruct models.
Guidelines:
- Keep the question relevant and informative for learners.
- Avoid using markdown or any unnecessary formatting.
- You can ask to elaborate based on a keyword or phrase in the text.
- You can ask about the plot if the text is a story.
- Do not use overly formal language.
- Use only the information provided in the text.
- If the text states that any part of it is from Netflix, or mentions that a section is from Netflix,
ignore that part and do not include it in the question or answer.
- If user specifies already created question and answer pair, find a different question and answer
pair that is different from the one provided. If this is impossible use different words than the ones
provided.
- Return the output strictly as a JSON with two fields: "question" and "answer".
"""
folder = "./witcher_fandom"
paths = os.listdir(folder)
paths = [os.path.join(folder, p) for p in paths]
print(f"{len(paths)} paths found")
for _ in range(3):
    dual = OllamaCurate(
        model="qwen3:8b",
        system_prompt=sys_prompt,
        response_template=Response,
    )
    dual(
        paths,
        save_to="./witcher_synthetic_instruct/qwen3:8b",
        seed=random.randint(0, 1000),
        checkpoint=10,  # save after every 10 iterations
        skip=False,
    )
Example created output file: ./witcher_synthetic_instruct/qwen3:8b/Deadly Plot.json
{
"question": "What location must the player meet Dijkstra at for the A Deadly Plot quest?",
"answer": "Passiflora."
}
The output file's contents depend on whatever was present in the corresponding text file:
./witcher_fandom/Deadly Plot.txt
single_pass_summary(paths, save_to, seed, num_predict, use_response_template) → None
For provided .txt files create a simple summary of them
To turn summaries into a dataset, see gather_summaries and summaries_to_instruct
| Argument | Type | Description |
paths | list[Path] | paths to .txt files containing texts to summarize |
save_to | Path | folder to save generated summaries to |
seed | int | seed for the ollama model |
num_predict | int | maximum number of tokens the ollama model can generate |
use_response_template | bool | whether to use the response template that is set when initializing the OllamaCurate class |
Example:
import os
import random
from pydantic import BaseModel, Field
class Response(BaseModel):
    summary: str = Field(
        description="Final summary of the entire text. Pure summary only, "
                    "no introduction or reasoning."
    )
paths = os.listdir("./witcher_fandom")
paths = [os.path.join("./witcher_fandom", p) for p in paths]
print(f"{len(paths)} paths found")
for m in ["granite3.1-moe:3b"]:
    model = OllamaCurate(model=m, system_prompt="", response_template=Response)
    model.single_pass_summary(
        paths,
        save_to=f"./witcher_synth_summaries/{m}",
        seed=random.randint(0, 1000),
        num_predict=4096,
        use_response_template=True,
    )
dynamic_hierarchical_summary(paths, save_to, chunk_lines, seed, num_predict, max_words_summary, use_response_template) → None
For the provided .txt files, create a hierarchical summary: the text is divided into chunks, a summary is generated for each chunk, and a final summary is then generated from the chunk summaries
| Argument | Type | Description |
paths | list[Path] | paths to .txt files containing texts to summarize |
save_to | Path | folder to save generated summaries to |
chunk_lines | int | the number of lines by which to chunk the text e.g. for a text of 1000 lines and chunk_lines=200 the text will be divided into 5 chunks of 200 lines each, all the chunks will get their own summary and based on them a final summary will be generated |
seed | int | seed for the ollama model |
num_predict | int | maximum number of tokens the ollama model can generate |
max_words_summary | int | maximum number of words ollama should target for the summary |
use_response_template | bool | whether to use the response template that is set when initializing the OllamaCurate class |
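The chunk-then-combine flow described for chunk_lines can be sketched without any model at all. In this minimal sketch, `chunk_by_lines`, `hierarchical_summary`, and the stand-in summarizer `first_line` are hypothetical names; in the real class the summarizer would be a call to the ollama model.

```python
from typing import Callable


def chunk_by_lines(text: str, chunk_lines: int) -> list[str]:
    """Split the text into consecutive chunks of at most `chunk_lines` lines."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + chunk_lines]) for i in range(0, len(lines), chunk_lines)]


def hierarchical_summary(text: str, chunk_lines: int, summarize: Callable[[str], str]) -> str:
    """Summarize every chunk, then summarize the concatenated chunk summaries."""
    partial_summaries = [summarize(chunk) for chunk in chunk_by_lines(text, chunk_lines)]
    return summarize("\n".join(partial_summaries))


def first_line(text: str) -> str:
    """Stand-in summarizer (an assumption): keeps only the first line."""
    return text.splitlines()[0] if text else ""


text = "\n".join(f"line {i}" for i in range(1000))
chunks = chunk_by_lines(text, 200)
print(len(chunks), len(chunks[0].splitlines()))  # 5 200
final = hierarchical_summary(text, 200, first_line)
print(final)  # line 0
```

The 1000-line text with chunk_lines=200 yields five chunks of 200 lines, matching the example in the table.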
Example:
import os
import random
from pydantic import BaseModel, Field

class Response(BaseModel):
    summary: str = Field(
        description="Summary of the text, without the thinking process and without any "
                    "introduction. Provide only pure summary, be expressive but stick to the "
                    "maximum number of words that were provided."
    )
paths = os.listdir('./witcher_texts')
paths = [os.path.join('./witcher_texts', p) for p in paths]
print(f"{len(paths)} paths found")
for m in ['llama3.2:3b', 'llama3.1:8b', 'qwen3:8b']:
    model = OllamaCurate(model=m,
                         system_prompt="",
                         response_template=Response)
    model.dynamic_hierarchical_summary(paths,
                                       save_to=f'./synth_sumarries/witcher_texts/{m}',
                                       chunk_lines=100,
                                       seed=random.randint(0, 1000),
                                       num_predict=2048,
                                       max_words_summary=500,
                                       use_response_template=True)
multi_turn(paths, save_to, bar, n_turns_range, seed, prob_chance_new_context) → None
Create synthetic multi turn instructions (question-answer pairs)
| Argument | Type | Description |
paths | list[Path] | paths to .txt files containing texts to create multi turn question answer pairs |
save_to | Path | folder to save generated multi turn question answer pairs to |
bar | tqdm.tqdm | tqdm bar to track progress |
n_turns_range | tuple[int,int] | range of number of turns to generate, e.g. (2, 4) will generate question answer pairs with 2 to 4 turns (the actual number of turns will be randomly chosen from this range and consistent for all the files) |
seed | int | seed for the ollama model |
prob_chance_new_context | float | Probability of starting a new context for the question answer pairs e.g. with a given chance we can continue the multi turn conversation with a new context from another randomly chosen .txt file |
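How n_turns_range and prob_chance_new_context interact can be illustrated with a small planning sketch. It only decides which context file each turn would draw from. The function `plan_conversation` is hypothetical, and treating both ends of the range as inclusive is an assumption.

```python
import random


def plan_conversation(paths: list[str], n_turns: int,
                      prob_chance_new_context: float, rng: random.Random) -> list[str]:
    """Hypothetical sketch: choose the context file used for each turn."""
    current = rng.choice(paths)
    contexts = [current]
    for _ in range(n_turns - 1):
        # With the given probability, continue the conversation with another
        # randomly chosen file (which may happen to be the same one again).
        if rng.random() < prob_chance_new_context:
            current = rng.choice(paths)
        contexts.append(current)
    return contexts


paths = ["Ciri.txt", "Geralt.txt", "Yennefer.txt"]
rng = random.Random(0)
n_turns = rng.randint(2, 5)  # n_turns_range=(2, 5), assumed inclusive on both ends
plan = plan_conversation(paths, n_turns, 0.3, rng)
print(n_turns, plan)

# With probability 0.0 the context never changes.
steady = plan_conversation(paths, 4, 0.0, random.Random(1))
```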
Example:
import os
import random
from pydantic import BaseModel, Field
from tqdm import tqdm

class Response(BaseModel):
    question: str = Field(description="What question is appropriate to this text?")
    answer: str = Field(description="Answer to the question")

paths = os.listdir('./witcher_fandom')
paths = [os.path.join('./witcher_fandom', p) for p in paths]
bar = tqdm(total=3*3*len(paths))  # 3 models x 3 repetitions x number of files
for model in ['qwen3:8b', 'phi4', 'llama3.1:8b']:
    for _ in range(3):
        ol = OllamaCurate(model, "", Response)
        ol.multi_turn(paths,
                      save_to=f'./synth_multi_round/{model}',
                      bar=bar,
                      n_turns_range=(2, 5),
                      seed=random.randint(0, 1000),
                      prob_chance_new_context=0.3)
rm_duplicate_instructs
func cirilla.synth_data.rm_duplicate_instructs(main_dir, save_to) → None
remove duplicate synthetic instructions
| Argument | Type | Description |
main_dir | Path | path to the directory containing the synthetic instructions |
save_to | Path | path to save the cleaned instructions to |
Example:
main_dir = './witcher_synthetic_instruct'
save_to = './witcher_synthetic_instruct.jsonl'
rm_duplicate_instructs(main_dir, save_to)
Example input file: ./witcher_synthetic_instruct/qwen3:8b/Deadly Plot.json
{
"question": "What is the objective of the A Deadly Plot quest in The Witcher 3: Wild Hunt?",
"answer": "The objective of the A Deadly Plot quest is to help Dijkstra ..."
}
Example input file: ./witcher_synthetic_instruct/qwen3:8b/Deadly Plot_1.json
{
"question": "What is the main objective of the A Deadly Plot quest in The Witcher 3: Wild Hunt?",
"answer": "The objective of the A Deadly Plot quest is to help Dijkstra ..."
}
Since the second instruction is nearly identical, it will be deleted in the final
.jsonl file.
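Near-duplicate filtering of this kind can be sketched with the standard library. The similarity measure (difflib's `SequenceMatcher` on the question text) and the 0.9 threshold are assumptions made for illustration; the source does not state how rm_duplicate_instructs measures similarity.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(pairs: list[dict], threshold: float = 0.9) -> list[dict]:
    """Keep a pair only if its question is not too similar to an already kept one."""
    kept: list[dict] = []
    for pair in pairs:
        question = pair["question"].lower()
        is_duplicate = any(
            SequenceMatcher(None, question, other["question"].lower()).ratio() >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(pair)
    return kept


pairs = [
    {"question": "What is the objective of the A Deadly Plot quest in The Witcher 3: Wild Hunt?",
     "answer": "The objective of the A Deadly Plot quest is to help Dijkstra ..."},
    {"question": "What is the main objective of the A Deadly Plot quest in The Witcher 3: Wild Hunt?",
     "answer": "The objective of the A Deadly Plot quest is to help Dijkstra ..."},
    {"question": "What location must the player meet Dijkstra at for the A Deadly Plot quest?",
     "answer": "Passiflora."},
]
cleaned = drop_near_duplicates(pairs)
print(len(cleaned))  # 2
```

The second question differs from the first by a single word, so it is dropped, while the third question about the meeting location is kept.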
Example created output file (in the actual .jsonl file each record is stored on a single line): ./witcher_synthetic_instruct.jsonl
{"subject": "Silver sword", "text": [
{"role": "user", "content": "What material are silver swords made from in The Witcher?"},
{"role": "assistant", "content": "Meteoric iron coated with silver and inscribed with magical runes"}
], "data type": "conv", "model": "llama3.1:8b"}
...
gather_summaries
func cirilla.synth_data.gather_summaries(in_path, out_path) → None
turn summaries into a training dataset
| Argument | Type | Description |
in_path | Path | path to the directory containing summaries in .txt files, they can be nested |
out_path | Path | path to save the instructions to |
Example:
in_path = './witcher_synth_summaries'
out_path = './witcher_summaries_gathered.jsonl'
gather_summaries(in_path, out_path)
Example input file: ./witcher_synth_summaries/llama3.1:8b/Alchemist.txt
In the Witcher lore, Alchemists are individuals who practice alchemy ...
Example created output file (in the actual .jsonl file each record is stored on a single line): ./witcher_summaries_gathered.jsonl
{
"subject": "sorcerers",
"text":"Mages, or sorcerers/sorceresses, ...",
"data type": "plain text", "source": "fandom", "model": "llama3.2:3b"
}
...
summaries_to_instruct
func cirilla.synth_data.summaries_to_instruct(in_path, out_path) → None
turn summaries into simple instructions. The user questions are chosen as one from a list
| Argument | Type | Description |
in_path | Path | path to the directory containing summaries in .txt files, they can be nested |
out_path | Path | path to save the instructions to |
Example:
in_path = './witcher_synth_summaries'
out_path = './witcher_summaries_gathered_instr.jsonl'
summaries_to_instruct(in_path, out_path)
Example input file: ./witcher_synth_summaries/llama3.1:8b/Alchemist.txt
In the Witcher lore, Alchemists are individuals who practice alchemy ...
Example created output file (in the actual .jsonl file each record is stored on a single line): ./witcher_summaries_gathered_instr.jsonl
{
"subject": "Ravik",
"text": [
{"role": "user", "content": "What is notable about Ravik?"},
{"role": "assistant", "content": "Ravik, also known as Ravvy, was a friend ..."}
],
"data type": "plain text", "source": "fandom", "model": "llama3.2:3b"
}
...
As you may notice the user's question is asking about Ravik, which is the name of the file
./.../Ravik.txt
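The mapping from a summary file to an instruction record can be sketched like this. The question templates and the helper `summary_to_record` are hypothetical (the source only says the question is chosen from a list), and the extra fields such as "data type" and "source" are left out for brevity.

```python
import json
import random
from pathlib import Path

# Hypothetical question templates; the real list is not shown in the docs.
QUESTION_TEMPLATES = [
    "What is notable about {subject}?",
    "Tell me about {subject}.",
    "Who or what is {subject}?",
]


def summary_to_record(path: str, summary: str, model: str, rng: random.Random) -> dict:
    """Build one instruction record; the subject is the file name without extension."""
    subject = Path(path).stem
    question = rng.choice(QUESTION_TEMPLATES).format(subject=subject)
    return {
        "subject": subject,
        "text": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": summary},
        ],
        "model": model,
    }


record = summary_to_record(
    "./witcher_synth_summaries/llama3.2:3b/Ravik.txt",
    "Ravik, also known as Ravvy, was a friend ...",
    "llama3.2:3b",
    random.Random(0),
)
line = json.dumps(record)  # one record per line, as in a .jsonl file
print(line)
```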
get_synth_reasoning_dataset
func cirilla.synth_data.get_synth_reasoning_dataset(out_path, n_samples, specs) → None
create synthetic reasoning dataset with reasoning_gym
| Argument | Type | Description |
out_path | Path | path to save the synthetic reasoning dataset to |
n_samples | int | How many samples of the synthetic reasoning dataset to create (each sample contains 100 data points) |
specs | list[reasoning_gym.composite.DataSpec] | specs for creating the synthetic reasoning dataset |
Example:
out_path = './reason_gym_synth.jsonl'
n_samples = 400  # will contain 40,000 data points
get_synth_reasoning_dataset(out_path, n_samples)
Example output file: ./reason_gym_synth.jsonl (in the actual .jsonl file each record is stored on a single line)
{"subject": "simple_equations", "text": [
{"role": "user", "text": "Solve for g: 65*g = 3185"},
{"role": "assistant", "text": "49"}],
"data type": "conv"}
{"subject": "needle_haystack", "text": [
{"role": "user", "text": "Ismaeel worships lions. Yann embraces travel blogging. Malakhy
scorns historical documentaries. Kabeer cherishes sketching. \nWho scorns historical
documentaries? Reply only with a name."},
{"role": "assistant", "text": "Malakhy"}],
"data type": "conv"}
...
vllm_multi_turn
func cirilla.synth_data.vllm_multi_turn(paths, save_to, batch_size, system_prompt, n_turns, template, model, prob_chance_new_context) → None
create synthetic multi turn conversations about given topics with vllm
to see how to turn the instructions into a .jsonl file see the multi_turn_gather function
| Argument | Type | Description |
paths | list[Path] | list of paths to the contexts for the conversations |
save_to | Path | path to save the synthetic multi turn dataset to |
batch_size | int | batch size for the conversations |
system_prompt | str | system prompt for the conversations |
n_turns | int | number of turns for the conversations, it usually means the maximum number of turns, since some of the turns may fail and will thus be empty |
template | pydantic.BaseModel | template for the conversations |
model | str | model from Huggingface to use for the conversations |
prob_chance_new_context | float | probability of a new context being added into the conversations |
Example:
import os

paths_ = ['./summaries/granite3.1-moe:3b',
          './summaries/llama3.1:8b',
          './summaries/llama3.2:3b',
          './summaries/qwen3:8b']
# one list of summary files per source folder
paths = [[os.path.join(p, f) for f in os.listdir(p)] for p in paths_]
for model in ["unsloth/granite-3.2-2b-instruct-unsloth-bnb-4bit"]:
    for i, sub_paths in enumerate(paths):
        for _ in range(1):
            vllm_multi_turn(sub_paths,
                            save_to=f'./synth_multi_round/{model.split("/")[1]}/{paths_[i].split("/")[-1]}',
                            model=model)
Where the summaries may come from OllamaCurate.single_pass_summary
Example created output file ./synth_multi_round/granite-3.2-2b-.../granite3.1-moe:3b/A Book of Tales.json:
[
{
"question": "Which specific adventure or mystery introduced in The Book of Tales ... ?",
"answer": "The given text does not specify a single adventure where the ...",
"context": "A Book of Tales"
},
{
"question": "Which three new playable races – Gnomes, ... ?",
"answer": "Gnomes are native to the mountainous region of Mount Carbon in Kovir ...",
"context": "A Book of Tales"
},
{
"question": "What central role does Radko hold in The Witcher ... ?",
"answer": "Radko, a soldier under the Bloody Baron's command ...",
"context": "Radko"
}
]
multi_turn_gather
func cirilla.synth_data.multi_turn_gather(input_path, save_to) → None
gather multi turn conversations into a .jsonl file
| Argument | Type | Description |
input_path | Path | path to a folder with .json files containing multi turn conversations, they can be nested |
save_to | Path | path to save the gathered conversations to |
Example:
inp = './synth_multi_round'
outp = './multi_round.jsonl'
multi_turn_gather(inp, outp)
Example created output file ./multi_round.jsonl (in the actual .jsonl file each record is stored on a single line):
{
"subject": "A Little Sacrifice", "text":
[
{"role": "user", "content": "What does Sh'eenaz do to help resolve the ... ?"},
{"role": "assistant", "content": "Sh'eenaz agrees to give up her tail for ..."},
{"role": "user", "content": "What does Sh'eenaz do to ... ?"},
{"role": "assistant", "content": "Sh'eenaz ..."},
{"role": "user", "content": "What is the nature of Geralt's ... ?"},
{"role": "assistant", "content": "Geralt ..."}
],
"data type": "conv", "source": "fandom",
"metadata": {"contexts": ["A Little Sacrifice", "A Little Sacrifice", "A Little Sacrifice"]}
}
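The conversion from a list of question/answer/context turns (the vllm_multi_turn output format) into one conversation record can be sketched as below. `turns_to_record` is a hypothetical helper, and taking the subject from the first turn's context is an assumption. The "source" field from the example output is omitted.

```python
import json


def turns_to_record(turns: list[dict]) -> dict:
    """Flatten question/answer turns into alternating user/assistant messages."""
    messages = []
    for turn in turns:
        messages.append({"role": "user", "content": turn["question"]})
        messages.append({"role": "assistant", "content": turn["answer"]})
    return {
        "subject": turns[0]["context"],  # assumption: subject = first context
        "text": messages,
        "data type": "conv",
        "metadata": {"contexts": [turn["context"] for turn in turns]},
    }


turns = [
    {"question": "Which specific adventure ... ?", "answer": "The given text ...",
     "context": "A Book of Tales"},
    {"question": "What central role does Radko hold ... ?", "answer": "Radko, a soldier ...",
     "context": "Radko"},
]
record = turns_to_record(turns)
line = json.dumps(record)  # written as a single line of the .jsonl file
print(line)
```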
get_activation
func cirilla.LLM_pieces.get_activation(path) → HF kernel
get an optimized kernel from Huggingface kernel hub
those kernels mostly work only on cuda
| Argument | Type | Description |
path | Path | path to a Huggingface kernel e.g. "kernels-community/activation" |
Example:
import torch
from cirilla.LLM_pieces import get_activation

dim = 256
activation = get_activation('Motif-Technologies/activation')
rmsnorm = activation.layers.RMSNorm(dim=dim).cuda()
print(rmsnorm(torch.randn(1, dim, device='cuda')).shape)
# torch.Size([1, 256])
BertAttention
class cirilla.LLM_pieces.BertAttention(args, rope, score_mod)
BERT attention with grouped query
| Argument | Type | Description |
args | .BertAttentionArgs | arguments for the BertAttention class |
rope | cirilla.LLM_pieces.Rope.Rope | rotary positional embeddings class |
score_mod | Callable | a function to modify the attention scores (optional) |
Signature:
RMSNorm → Wqkv → RoPE → FlexAttention
Example:
from cirilla.Cirilla_model import benchmark_model_part
att = BertAttention(BertAttentionArgs(), RoPE(128, 512)).cuda().to(torch.bfloat16)
x = torch.randn(4, 512, 128*16, device='cuda', dtype=torch.bfloat16)
benchmark_model_part(att, x, "BertAttention")
# [BertAttention]
# Forward time: 1.85 ms
# Backward time: 4.03 ms
# Forward memory: 56.12 MB
# Backward memory:-36.12 MB
out = att(x)
RoPE
class cirilla.LLM_pieces.RoPE(head_dim, seq_len, device, theta, dtype)
rotary positional embeddings
| Argument | Type | Description |
head_dim | int | size of the head dimension |
seq_len | int | (maximum) sequence length |
device | Union[torch.device, str] | device to use |
theta | float | theta for sin and cos |
dtype | torch.dtype | dtype to use |
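The idea behind rotary positional embeddings can be shown on a single head vector in plain Python. This is a minimal sketch, not Cirilla's implementation: the interleaved pairing of dimensions and the default theta of 10000 are assumptions, and `rope_rotate` is a hypothetical name.

```python
import math


def rope_rotate(vec: list[float], position: int, theta: float = 10000.0) -> list[float]:
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    head_dim = len(vec)
    assert head_dim % 2 == 0, "head_dim must be even"
    out = []
    for i in range(0, head_dim, 2):
        freq = theta ** (-i / head_dim)  # lower frequency for later pairs
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = vec[i], vec[i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
# The attention score depends only on the relative offset (here 2 in both cases).
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 3))
score_b = dot(rope_rotate(q, 12), rope_rotate(k, 10))
print(abs(score_a - score_b) < 1e-9)  # True
```

Because each pair is only rotated, vector norms are preserved, and the query-key dot product encodes the distance between positions rather than their absolute values.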
Example:
rope = RoPE(128, 512)
xq = torch.randn(2, 512, 4, 128, device='cuda', dtype=torch.bfloat16) # (b, seq_len, head, head_dim)
xk = torch.randn(2, 512, 4, 128, device='cuda', dtype=torch.bfloat16)
xq_out, xk_out = rope.apply_rotary_embeddings(xq, xk)
print(xq.shape, xq_out.shape, xq_out.dtype, xq_out.device)
print(xk.shape, xk_out.shape, xk_out.dtype, xk_out.device)
# torch.Size([2, 512, 4, 128]) torch.Size([2, 512, 4, 128]) torch.bfloat16 cuda:0
# torch.Size([2, 512, 4, 128]) torch.Size([2, 512, 4, 128]) torch.bfloat16 cuda:0
SlidingWindowAttention
class cirilla.LLM_pieces.SlidingWindowAttention(args, rope, mask, score_mod)
sliding window attention with grouped query
| Argument | Type | Description |
args | .AttentionArgs | arguments for the SlidingWindowAttention class |
rope | cirilla.LLM_pieces.Rope.Rope | rotary positional embeddings class |
mask | Union[BlockMask, .create_dynamic_block_mask] | attention mask or a function to create it |
score_mod | Callable | a function to modify the attention scores, e.g. Soft-capping |
Signature:
RMSNorm → Wqkv → RoPE → FlexAttention
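The two ingredients that distinguish this layer, a sliding-window causal mask and tanh soft-capping of the scores, can be sketched in plain Python. The helper names `in_window` and `tanh_softcap`, the window size, and the exact boundary convention of the window are assumptions. The real layer expresses both through FlexAttention's mask and score_mod arguments.

```python
import math


def in_window(q_idx: int, kv_idx: int, window: int) -> bool:
    """A query may attend to itself and to the `window - 1` tokens before it."""
    return kv_idx <= q_idx and q_idx - kv_idx < window


def tanh_softcap(score: float, cap: float = 20.0) -> float:
    """Squash a raw attention score smoothly into the range [-cap, cap]."""
    return cap * math.tanh(score / cap)


seq_len = 6
mask = [[in_window(q, k, window=3) for k in range(seq_len)] for q in range(seq_len)]
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

print(tanh_softcap(1.0), tanh_softcap(1000.0))
```

Small scores pass through almost unchanged, while very large scores saturate at the cap, which keeps the softmax inputs bounded.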
Example:
import torch
from cirilla.LLM_pieces import create_dynamic_block_mask, create_static_block_mask
from attn_gym.mods import generate_tanh_softcap
SOFT_CAP = 20
x = torch.rand((1,2048,128*16), device='cuda', dtype=torch.bfloat16) # (b, seq, head_dim*h)
""" static mask """
static_mask = create_static_block_mask(sliding_window_causal, 2048, 2048)
softcap = generate_tanh_softcap(SOFT_CAP, approx=False)
# or you can do: softcap = None
rope = RoPE(128, 2048)
attention_layer = SlidingWindowAttention(AttentionArgs(),
rope,
mask=static_mask,
score_mod=softcap).to('cuda', dtype=torch.bfloat16)
out = attention_layer(x)
print(out.shape) # torch.Size([1, 2048, 2048])
""" dynamic mask - won't trigger recompilation """
dynamic_args = AttentionArgs(static_mask=False)
attention_layer = SlidingWindowAttention(dynamic_args,
mask=create_dynamic_block_mask,
rope=rope,
score_mod=softcap).to('cuda', dtype=torch.bfloat16)
out = attention_layer(x)
print(out.shape) # torch.Size([1, 2048, 2048])
x = torch.rand((1,512,128*16), device='cuda', dtype=torch.bfloat16) # (b, seq, head_dim*h)
out = attention_layer(x)
print(out.shape) # torch.Size([1, 512, 2048])
x = torch.rand((1,256,128*16), device='cuda', dtype=torch.bfloat16) # (b, seq, head_dim*h)
out = attention_layer(x)
print(out.shape) # torch.Size([1, 256, 2048])
x = torch.rand((1,2048,128*16), device='cuda', dtype=torch.bfloat16) # (b, seq, head_dim*h)
out = attention_layer(x)
print(out.shape) # torch.Size([1, 2048, 2048])
print(create_dynamic_block_mask.cache_info()) # how many times the mask template was reused
# CacheInfo(hits=1, misses=3, maxsize=32, currsize=3)
SwiGLU
class cirilla.LLM_pieces.SwiGLU(args)
Standalone SwiGLU feed-forward block used as the expert FFN inside SMoE. Can also be used independently as a feed-forward layer.
Signature:
(W1a(x) ⊙ W1b(x)) → SiLU → W2
| Argument | Type | Description |
args | .SwiGLUArgs | dim (input/output dim), d_ff (hidden dim, must be even), drop (dropout, default 0.1) |
Example:
from cirilla.LLM_pieces import SwiGLU
from cirilla.LLM_pieces.SMoE import SwiGLUArgs
import torch
ffn = SwiGLU(SwiGLUArgs(dim=256, d_ff=512)).cuda().to(torch.bfloat16)
x = torch.randn(4, 128, 256, device='cuda', dtype=torch.bfloat16)
print(ffn(x).shape) # torch.Size([4, 128, 256])
SMoE
class cirilla.LLM_pieces.SMoE(args, experts)
pytorch implementation of SMoE
| Argument | Type | Description |
args | .SMoEArgs | arguments for the SMoE class |
experts | list[Expert] | list of experts to use |
Signature:
RMSNorm → Gating → Experts
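The Gating → Experts step can be sketched for a single token in plain Python. This is a simplified illustration under stated assumptions: the gate logits are given directly instead of being computed by a learned layer, the softmax is taken over the selected top-k logits only, and the "experts" are toy scaling functions rather than SwiGLU blocks.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k experts with the highest logits and normalize their weights."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    weights = softmax([gate_logits[i] for i in top])
    return list(zip(top, weights))


def smoe_token(x: list[float], gate_logits: list[float], experts: list, k: int) -> list[float]:
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(gate_logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Toy experts: expert i multiplies its input by (i + 1).
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
routes = top_k_route([0.1, 2.0, -1.0, 2.0], k=2)
result = smoe_token([1.0, -2.0], [0.1, 2.0, -1.0, 2.0], experts, k=2)
print(routes, result)
```

Only two of the four experts run for this token, which is what makes the layer sparse: compute grows with k, not with the total number of experts.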
Example:
import torch
moe = SMoE(
    SMoEArgs(num_experts=4, k=2),
    [SwiGLU(SwiGLUArgs()) for _ in range(4)]
).to("cuda")  # HF kernels only work on CUDA
x = torch.randn(4, 1024, 128,
                device='cuda', requires_grad=True)  # (b, seq_len, dim); requires grad for SMoE
out = moe(x)
Dynamic_erf
class cirilla.LLM_pieces.Dynamic_erf(normalized_shape, alpha_init_value, shift_init_value)
Normalization-free alternative to RMSNorm from the Derf paper. Uses a learnable erf activation as the normalizer: weight · erf(α·x + shift) + bias. Use as a drop-in replacement by passing layer_norm='Derf' in any model Args dataclass.
| Argument | Type | Description |
normalized_shape | int | size of the last dimension to normalize |
alpha_init_value | float | initial value for the learnable scaling parameter α (default 0.5) |
shift_init_value | float | initial value for the learnable shift (default 0.0) |
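The formula weight · erf(α·x + shift) + bias can be checked on a plain list of numbers. This sketch uses fixed values where the real module has learnable parameters, and treating weight and bias as per-feature vectors is an assumption. `dynamic_erf` is a hypothetical function name.

```python
import math


def dynamic_erf(x: list[float], weight: list[float], bias: list[float],
                alpha: float = 0.5, shift: float = 0.0) -> list[float]:
    """Element-wise weight * erf(alpha * x + shift) + bias."""
    return [w * math.erf(alpha * xi + shift) + b for xi, w, b in zip(x, weight, bias)]


x = [-100.0, -1.0, 0.0, 1.0, 100.0]
ones, zeros = [1.0] * len(x), [0.0] * len(x)
y = dynamic_erf(x, ones, zeros)
print(y)
```

Since erf saturates at -1 and 1, extreme activations are squashed into a bounded range without computing any statistics over the feature dimension, which is why it can stand in for a normalization layer.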
Example:
from cirilla.LLM_pieces import Dynamic_erf
import torch
norm = Dynamic_erf(256).cuda().to(torch.bfloat16)
x = torch.randn(4, 128, 256, device='cuda', dtype=torch.bfloat16)
print(norm(x).shape) # torch.Size([4, 128, 256])
# Enable in any model via Args:
model = Cirilla(Args(layer_norm='Derf'))
DynamicTanh
class cirilla.LLM_pieces.DynamicTanh(normalized_shape, alpha_init_value)
Normalization-free alternative to RMSNorm: Dynamic Tanh (DyT), introduced in the "Transformers without Normalization" paper that the Derf paper builds on. Uses a learnable tanh activation: weight · tanh(α·x) + bias. Enable with layer_norm='DyT' in any model Args dataclass.
| Argument | Type | Description |
normalized_shape | int | size of the last dimension to normalize |
alpha_init_value | float | initial value for the learnable scaling parameter α (default 0.5) |
Example:
from cirilla.LLM_pieces import DynamicTanh
import torch
norm = DynamicTanh(256).cuda().to(torch.bfloat16)
x = torch.randn(4, 128, 256, device='cuda', dtype=torch.bfloat16)
print(norm(x).shape) # torch.Size([4, 128, 256])
# Enable in any model via Args:
model = Cirilla(Args(layer_norm='DyT'))
CirillaBERT
class cirilla.Cirilla_model.CirillaBERT(args)
implementation of ModernBERT with Cirilla LLM pieces
| Argument | Type | Description |
args | .BertArgs | arguments for the CirillaBERT class |
Signature:
Input Embeddings → N×(BertAttention → MoE) → Mean Pooling / tokens / classes / (RMSNorm → Wout)
Example:
model = CirillaBERT(BertArgs(output_what='classify'))
targs = TrainingArgs(hf_repo_id='AnthonyPa57/HF-torch-demo-R', local_checkpoint_folder='./test_model_bert')
trainer = CirillaTrainer(model, targs)
tokenizer = CirillaTokenizer(hub_url='AnthonyPa57/HF-torch-demo2')
dl = JSONLDataset(['./example_bert.jsonl', './example_bert.jsonl'],
shuffle_path=True, tokenizer=tokenizer, max_len=model.args.context_window)
from types import MethodType
from torch import nn

def new_training_step(self, data):
    out = self.model.pred(data[0], data[1])  # tokens, mask
    loss = self.criterion(out, data[2])
    return loss

# replace the default training step with the classification one
trainer.training_step = MethodType(new_training_step, trainer)
trainer.criterion = nn.CrossEntropyLoss()
trainer.train(dl, dl)
pred(x, attention_mask) → torch.Tensor
forward pass of the model, can return mean pooling / tokens / classes / llm output
| Argument | Type | Description |
x | torch.Tensor | tensor of shape (b, seq_len) |
attention_mask | Optional[torch.Tensor] | attention mask, used for mean pooling |
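Mean pooling with an attention mask averages only the token vectors that are not padding. A minimal sketch on plain lists follows; `masked_mean_pool` is a hypothetical name, and the model itself operates on batched tensors.

```python
def masked_mean_pool(token_vectors: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Average the vectors whose mask entry is 1; padded positions are ignored."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / max(count, 1) for t in total]  # avoid division by zero


tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last vector belongs to padding
pooled = masked_mean_pool(tokens, [1, 1, 0])
print(pooled)  # [2.0, 3.0]
```

Without the mask, the padding vector would dominate the average, so sequences of different lengths would not be comparable.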
Example:
vocab_size = 60_000
context_window = 2048
x = torch.randint(0, vocab_size,
(4, context_window),
dtype=torch.long)
mask = torch.zeros_like(x)
mask[x != 0] = 1
model = CirillaBERT(BertArgs(output_what='classify', n_classes=2))
print(model.pred(x, mask).shape) # torch.Size([4, 2])
pull_model_from_hub(hf_repo_id, **kwargs) → None
pull model from huggingface, the pulled model has to be compatible with CirillaBERT
| Argument | Type | Description |
hf_repo_id | str | huggingface repo id to pull model from |
Example:
model = CirillaBERT(BertArgs())
model.pull_model_from_hub('AnthonyPa57/HF-torch-demo-R')
JSONLDataset
class cirilla.Cirilla_model.JSONLDataset(path, shuffle_path, device, tokenizer, max_len, pad_token, eos_token, sos_token, user_token, suffix_tokens, prefix_tokens, random_spelling_mistake_prob, random_missing_char_prob)
basic dataset for training Cirilla models with .jsonl files
| Argument | Type | Description |
path | Union[Path, tuple[Path]] | path to a folder with .jsonl file(s) that contain training data |
shuffle_path | bool | whether to shuffle the order of the .jsonl files (in place) |
device | torch.device | what device to transfer data to |
tokenizer | CirillaTokenizer | tokenizer to use |
max_len | int | maximum length of the sequence |
pad_token | str | padding token <pad> |
eos_token | str | end of sequence token <eos> |
sos_token | str | start of sequence token <sos> |
user_token | str | user token <|user|> |
suffix_tokens | list[str] | tokens to append to the end of the sequence |
prefix_tokens | list[str] | tokens to prepend to the beginning of the sequence |
random_spelling_mistake_prob | float | probability of adding a random spelling mistake |
random_missing_char_prob | float | probability of removing a random character |
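The two noise parameters can be illustrated with a small character-level sketch. Both helpers are hypothetical and only approximate the idea: replacing letters with random lowercase letters is an assumption, since the source does not specify how a spelling mistake is generated.

```python
import random
import string


def random_missing_char(text: str, prob: float, rng: random.Random) -> str:
    """Drop each character independently with probability `prob`."""
    return "".join(ch for ch in text if rng.random() >= prob)


def random_spelling_mistake(text: str, prob: float, rng: random.Random) -> str:
    """Replace each letter with a random lowercase letter with probability `prob`."""
    return "".join(
        rng.choice(string.ascii_lowercase) if ch.isalpha() and rng.random() < prob else ch
        for ch in text
    )


rng = random.Random(0)
sentence = "hello world, I am a sentence"
noisy = random_spelling_mistake(random_missing_char(sentence, 0.01, rng), 0.02, rng)
print(noisy)
```

Keeping both probabilities low, as in the library's example values of 0.01 and 0.02, leaves most samples untouched while still exposing the model to occasional typos.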
Example:
import time
import numpy as np
tokenizer = CirillaTokenizer(hub_url='AnthonyPa57/HF-torch-demo2')
dl = JSONLDataset(['./example.jsonl', './example.jsonl'],
shuffle_path=True, tokenizer=tokenizer, max_len=32)
print(len(dl))
times = []
dl = DataLoader(dl, batch_size=2)
for _ in range(2):
times.append(time.time())
for i in dl:
print('-'*50)
print(i[0], i[1], sep='\n')
times.append(time.time())
print(np.mean(np.diff(times)))
dataset = JSONLDataset(['./example.jsonl', './example.jsonl'],
random_missing_char_prob=0.01, random_spelling_mistake_prob=0.02)
for _ in range(4):
print(dataset._apply_random_spelling_mistake('hello world, I am a sentence'))
# hwllo world, I am a sentence
# hello worls, I am a sentence
# helo world, I am a sentemce
# helloworld, I am a sentence
Cirilla
class cirilla.Cirilla_model.Cirilla(args)
implementation of modern transformer model with Cirilla LLM pieces
| Argument | Type | Description |
args | .Args | arguments for the Cirilla class |
Signature:
Input Embeddings → N×(SlidingWindowAttention → MoE) → RMSNorm → Wout
Example:
model = Cirilla(Args())
targs = TrainingArgs(hf_repo_id='AnthonyPa57/HF-torch-demo-R',
local_checkpoint_folder='./test_model')
trainer = CirillaTrainer(model, targs)
tokenizer = CirillaTokenizer(hub_url='AnthonyPa57/HF-torch-demo2')
dl = JSONLDataset(['./example.jsonl', './example.jsonl'],
shuffle_path=True, tokenizer=tokenizer,
max_len=model.args.context_window)
trainer.train(dl, dl)
pred(x) → torch.Tensor
forward pass of the model, can return llm output of shape (b, seq_len, vocab_size)
| Argument | Type | Description |
x | torch.Tensor | tensor of shape (b, seq_len) |
Example:
vocab_size = 60_000
context_window = 2048
x = torch.randint(0, vocab_size,
(4, context_window),
dtype=torch.long)
model = Cirilla(Args())
print(model.pred(x).shape) # torch.Size([4, context_window, vocab_size])
pull_model_from_hub(hf_repo_id, **kwargs) → None
pull model from huggingface, the pulled model has to be compatible with Cirilla
| Argument | Type | Description |
hf_repo_id | str | huggingface repo id to pull model from |
Example:
model = Cirilla(Args())
model.pull_model_from_hub('AnthonyPa57/HF-torch-demo-R')
benchmark_model_part
func cirilla.Cirilla_model.benchmark_model_part(model, x, label) → None
benchmark a part of the model
| Argument | Type | Description |
model | Callable | model or a piece of the model to benchmark |
x | torch.Tensor | input to the model or piece of the model |
label | str | label for the benchmark |
Example:
att = BertAttention(BertAttentionArgs(), RoPE(128, 512)).cuda().to(torch.bfloat16)
x = torch.randn(4, 512, 128*16, device='cuda', dtype=torch.bfloat16)
benchmark_model_part(att, x, "BertAttention")
# [BertAttention]
# Forward time: 1.85 ms
# Backward time: 4.03 ms
# Forward memory: 56.12 MB
# Backward memory:-36.12 MB
CirillaTokenizer
class cirilla.Cirilla_model.CirillaTokenizer(path, hub_url)
tokenizer for the Cirilla models, it retains the functionality of Huggingface tokenizers
| Argument | Type | Description |
path | Path | local path to a tokenizer file |
hub_url | str | huggingface repo id to pull tokenizer from |
Example:
tokenizer = CirillaTokenizer(hub_url='AnthonyPa57/HF-torch-demo2')
chat = [
{'role': 'system', 'content': 'What is the capital of France?'},
{'role': 'user', 'content': 'What is the capital of France?'},
]
print(tokenizer.decode(tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=False)))
print(tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=False))
train(dataset, special_tokens, save_to_path, **kwargs) → self.tokenizer
train the tokenizer on an iterator dataset
| Argument | Type | Description |
dataset | Union[Iterator[str], Iterator[Iterator[str]]] | dataset to train on |
special_tokens | dict[str, str] | special tokens to add to the tokenizer |
save_to_path | Path | local path to save tokenizer to |
**kwargs | Any | **kwargs for tokenizer training |
Example:
dl = JSONLDataset('./example.jsonl', shuffle_path=True)
tokenizer = CirillaTokenizer()
tokenizer.train(dl, special_tokens=SPECIAL_TOKENS, min_frequency=2)
tokenizer.push_to_hub('AnthonyPa57/HF-torch-demo2')
print(tokenizer.decode(tokenizer.encode('hello world')))
print(tokenizer.encode('What is the capital of France?'))
print(tokenizer.decode(tokenizer.encode('What is the capital of France?')))
CirillaTrainer
class cirilla.Cirilla_model.CirillaTrainer(model, training_args)
trainer for the Cirilla models
| Argument | Type | Description |
model | torch.nn.Module | model to train |
training_args | .TrainingArgs | arguments for the trainer |
Example:
model = Cirilla(Args())
targs = TrainingArgs(hf_repo_id='AnthonyPa57/HF-torch-demo-R', local_checkpoint_folder='./test_model')
trainer = CirillaTrainer(model, targs)
tokenizer = CirillaTokenizer(hub_url='AnthonyPa57/HF-torch-demo2')
dl = JSONLDataset(['./example.jsonl', './example.jsonl'], shuffle_path=True,
tokenizer=tokenizer, max_len=model.args.context_window)
trainer.train(dl, dl)
train(dataset, valid_dataset) → None
train the model on a dataset
| Argument | Type | Description |
dataset | JSONLDataset | dataset to train on |
valid_dataset | JSONLDataset | dataset to validate on |
Example:
model = CirillaBERT(BertArgs(output_what='classify'))
targs = TrainingArgs(hf_repo_id='AnthonyPa57/HF-torch-demo-R',
local_checkpoint_folder='./test_model_bert')
trainer = CirillaTrainer(model, targs)
tokenizer = CirillaTokenizer(hub_url='AnthonyPa57/HF-torch-demo2')
dl = JSONLDataset(['./example_bert.jsonl', './example_bert.jsonl'],
shuffle_path=True, tokenizer=tokenizer, max_len=model.args.context_window)
from types import MethodType
def new_training_step(self, data):
out = self.model.pred(data[0], data[1]) # tokens, mask
loss = self.criterion(out, data[2])
return loss
trainer.training_step = MethodType(new_training_step, trainer)
trainer.criterion = nn.CrossEntropyLoss()
trainer.train(dl, dl)
benchmark() → None
benchmark the model on randomly generated data with a batch size of 4
Example:
model = Cirilla(Args())
trainer = CirillaTrainer(model, TrainingArgs())
trainer.benchmark()
# average time for epoch: 0.8215 (seconds)
CirillaMTP
class cirilla.Cirilla_model.CirillaMTP(args)
Multi-Token Prediction model. Extends Cirilla with n_token_heads additional lightweight single-layer decoder heads; the k-th extra head predicts the token k further positions ahead, and all heads run in parallel. This can improve convergence speed and sample efficiency at the cost of little extra compute.
| Argument | Type | Description |
args | .MTPArgs | arguments for CirillaMTP; extends DecoderArgs with n_token_heads (number of extra prediction heads, default 4) |
Signature:
Input Embeddings → N×(SlidingWindowAttention → MoE) → RMSNorm → Wout
+ n×(Decoder → RMSNorm → Wout) [one head per future token]
Example:
from cirilla.Cirilla_model import (
CirillaMTP, MTPArgs, CirillaTrainer, TrainingArgs,
CirillaTokenizer, JSONLDataset, mtp_training_step, mtp_inference_step
)
from types import MethodType
from functools import partial
model = CirillaMTP(MTPArgs(n_layers=4, dim=128, d_ff=256,
n_heads=8, context_window=128,
n_token_heads=4))
targs = TrainingArgs(n_epoch=1000)
trainer = CirillaTrainer(model, targs)
tokenizer = CirillaTokenizer(hub_url='AnthonyPa57/HF-torch-demo2')
pad_id = tokenizer.tokenizer.pad_token_id
dl = JSONLDataset(['./example.jsonl'], tokenizer=tokenizer,
max_len=model.args.context_window)
trainer.training_step = MethodType(partial(mtp_training_step, pad_id=pad_id), trainer)
trainer.inference_step = MethodType(partial(mtp_inference_step, pad_id=pad_id), trainer)
trainer.criterion = None
trainer.train(dl, dl)
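To make the "one head per future token" idea concrete, here is a minimal pure-Python sketch of how per-head targets could be laid out. It assumes the main head predicts the token 1 position ahead and each extra head predicts one position further, with positions that run past the end of the sequence filled with pad_id so they can be masked out of the loss. build_mtp_targets is a hypothetical helper, not part of Cirilla, and the library's actual target layout may differ.

```python
def build_mtp_targets(tokens, n_token_heads, pad_id):
    """Return one target list per head (main head first).

    Head h (0-based, h=0 is the main head) is trained to predict the token
    h+1 positions ahead; positions with no such token are filled with pad_id.
    """
    targets = []
    for h in range(n_token_heads + 1):
        offset = h + 1
        # Shift left by `offset` and pad on the right to keep the length fixed.
        shifted = tokens[offset:] + [pad_id] * min(offset, len(tokens))
        targets.append(shifted)
    return targets


tokens = [10, 11, 12, 13, 14]
for head, tgt in enumerate(build_mtp_targets(tokens, n_token_heads=2, pad_id=0)):
    print(head, tgt)
```

Every target list keeps the input length, so a loss that ignores pad_id can be computed per head and then combined.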
CirillaTRM
class cirilla.Cirilla_model.CirillaTRM(network, args)
Tiny Recursive Model. Wraps any nn.Module in an adaptive computation loop: the same network refines an answer vector y_hat and latent state z across multiple steps. A learnable halt gate decides when to stop early at inference time: easier inputs get fewer steps, harder inputs get more.
| Argument | Type | Description |
network | nn.Module | any network to use as the recursive core (e.g. MLPMixer1D, Encoder) |
args | .TRMArgs | arguments: n_total_refinements (default 4), n_latent_refinements (default 2), vocab_size, dim |
Example:
from cirilla.Cirilla_model import (
CirillaTRM, TRMArgs, MLPMixer1D, MixerArgs,
CirillaTrainer, TrainingArgs, CirillaTokenizer, JSONLDataset,
trm_training_step, trm_inference_step
)
from ema_pytorch import EMA
from types import MethodType
from functools import partial
mixer = MLPMixer1D(MixerArgs())
model = CirillaTRM(mixer, TRMArgs())
ema_model = EMA(model, beta=0.999,
update_model_with_ema_every=1_000,
forward_method_names=('predict',))
targs = TrainingArgs(n_epoch=1000)
trainer = CirillaTrainer(model, targs)
tokenizer = CirillaTokenizer(hub_url='AnthonyPa57/HF-torch-demo2')
dl = JSONLDataset(['./example.jsonl'], tokenizer=tokenizer,
max_len=mixer.args.context_window)
trainer.training_step = MethodType(
partial(trm_training_step, max_recurrent_step=16,
halt_weight=0.5, halt_thresh=0.5, ema_model=ema_model), trainer)
trainer.inference_step = MethodType(
partial(trm_inference_step, max_recurrent_step=16, halt_thresh=0.5), trainer)
trainer.criterion = None
trainer.train(dl, dl)
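The adaptive halting loop described for CirillaTRM can be illustrated with a small, framework-free sketch. It assumes the halt gate outputs a logit that is passed through a sigmoid and compared with halt_thresh; refine_until_halt and toy_step are hypothetical names standing in for the real network, so this shows the control flow only, not Cirilla's implementation.

```python
import math


def refine_until_halt(step_fn, y, z, max_recurrent_step=16, halt_thresh=0.5):
    """Refine (y, z) until the halt probability exceeds halt_thresh.

    step_fn(y, z) must return (new_y, new_z, halt_logit).
    Returns the final answer and the number of steps actually used.
    """
    steps_used = 0
    for _ in range(max_recurrent_step):
        y, z, halt_logit = step_fn(y, z)
        steps_used += 1
        halt_prob = 1.0 / (1.0 + math.exp(-halt_logit))  # sigmoid
        if halt_prob > halt_thresh:
            break
    return y, steps_used


def toy_step(y, z):
    # Toy stand-in for the network: move y halfway towards the target 1.0
    # and raise the halt logit as the remaining error shrinks.
    new_y = y + 0.5 * (1.0 - y)
    error = abs(1.0 - new_y)
    halt_logit = 2.0 - 40.0 * error
    return new_y, z, halt_logit


answer, steps = refine_until_halt(toy_step, y=0.0, z=None)
print(answer, steps)
```

max_recurrent_step caps the loop, so an input whose halt probability never crosses the threshold still terminates.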
CirillaVision
class cirilla.Cirilla_model.CirillaVision(args)
Vision-Language model. A Swin Transformer encodes image patches into token embeddings which are prepended to the text token sequence and passed into a Cirilla decoder — similar architecture to PaliGemma.
| Argument | Type | Description |
args | .VisionArgs | extends DecoderArgs with Swin encoder params: img_size (default 224), patch_size (default 4), in_channels (default 3), swin_embed_dim (default 96), swin_depths, swin_num_heads, swin_window_size |
Signature:
Image → Swin Transformer → image tokens → concat(image tokens, text tokens) → N×(SlidingWindowAttention → MoE) → RMSNorm → Wout
HybridCirilla
class cirilla.Cirilla_model.HybridCirilla(args)
Transformer–Mamba hybrid model. Replaces the standard transformer decoder with HybridDecoder blocks that interleave Mamba SSM layers with transformer attention layers (similar to IBM Granite 4.0). Requires the optional mamba dependency.
| Argument | Type | Description |
args | .HybridArgs | extends HybridDecoderArgs; key extra arg: layer_pattern — a string like '5M' meaning 5 transformer layers then 1 Mamba layer, repeated |
Example:
# requires: uv add Cirilla[mamba]
from cirilla.Cirilla_model import HybridCirilla, HybridArgs
model = HybridCirilla(HybridArgs(n_layers=5, layer_pattern='5M'))
print(model.n_params / 1e6, "M")
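As an illustration of how a layer_pattern string could be read, the sketch below expands '5M' into a flat list of layer kinds. The grammar is an assumption based only on the description above (an integer N means N transformer layers, 'M' means one Mamba layer); expand_layer_pattern and its n_repeats parameter are hypothetical, and the parser HybridCirilla actually uses may behave differently.

```python
import re


def expand_layer_pattern(pattern, n_repeats=1):
    """Expand a pattern such as '5M' into a flat list of layer kinds.

    Assumed grammar (hypothetical): an integer N means N transformer
    layers ('T'); the letter 'M' means one Mamba layer ('M').
    """
    layers = []
    for count, letter in re.findall(r"(\d+)|([Mm])", pattern):
        if count:
            layers.extend(["T"] * int(count))
        else:
            layers.append("M")
    # Repeat the whole unit to build a deeper stack.
    return layers * n_repeats


print(expand_layer_pattern("5M"))
print(len(expand_layer_pattern("5M", n_repeats=2)))
```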
get_optims
func cirilla.Cirilla_model.get_optims(model, use_muon_optim, optim, lr, weight_decay) → tuple[Optimizer, ...]
Configure optimizers for a Cirilla model. Optionally combines a standard optimizer (default AdamW) for embedding and output layers with the Muon optimizer for hidden layers, which can improve convergence on transformer hidden weights.
| Argument | Type | Description |
model | nn.Module | model to configure optimizers for |
use_muon_optim | bool | pair a Muon optimizer for hidden layers alongside the base optimizer |
optim | torch.optim.Optimizer | base optimizer class (default AdamW) |
lr | float | learning rate |
weight_decay | float | weight decay |
Example:
from cirilla.Cirilla_model import Cirilla, Args, get_optims
model = Cirilla(Args())
optims = get_optims(model, use_muon_optim=True, lr=3e-4, weight_decay=0.1)
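The split between the two optimizers can be pictured with a small sketch. It follows the common convention that Muon handles 2-D hidden weight matrices while embeddings, the output head, and 1-D parameters (norm gains, biases) stay with the base optimizer. split_params_for_muon and the parameter names 'emb.' and 'output.' are hypothetical; get_optims may select parameters differently.

```python
def split_params_for_muon(named_shapes, base_prefixes=("emb.", "output.")):
    """Split parameter names into (muon_names, base_names).

    named_shapes: iterable of (name, shape) pairs, e.g. built from
    ((n, tuple(p.shape)) for n, p in model.named_parameters()).
    base_prefixes: hypothetical name prefixes of embedding / output layers.
    """
    muon, base = [], []
    for name, shape in named_shapes:
        is_matrix = len(shape) == 2
        is_embed_or_head = name.startswith(base_prefixes)
        if is_matrix and not is_embed_or_head:
            muon.append(name)   # hidden 2-D weights
        else:
            base.append(name)   # embeddings, output head, gains, biases
    return muon, base


params = [
    ("emb.weight", (100, 16)),
    ("layers.0.attn.wq.weight", (16, 16)),
    ("layers.0.norm.weight", (16,)),
    ("output.weight", (100, 16)),
]
muon, base = split_params_for_muon(params)
print(muon)
print(base)
```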
load_balancing_loss
func cirilla.Cirilla_model.load_balancing_loss(expert_weights) → torch.Tensor
Auxiliary load-balancing loss for MoE models. Encourages uniform expert utilisation by penalising skewed routing distributions. Add a small multiple (e.g. 0.01) of this to the main training loss. Requires output_moe_weights=True in the model args.
| Argument | Type | Description |
expert_weights | torch.Tensor | per-expert routing weights returned by the model when output_moe_weights=True |
Example:
from cirilla.Cirilla_model import Cirilla, Args, load_balancing_loss
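import torch
import torch.nn as nn
model = Cirilla(Args(output_moe_weights=True))
criterion = nn.CrossEntropyLoss()
# x (input token ids) and targets (target token ids) are assumed to come from your dataloader
out, moe_weights = model.pred(x)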
lb_losses = [
load_balancing_loss(w,
num_experts=model.args.num_experts,
top_k=model.args.k)
for w in moe_weights
]
lb_loss = torch.stack(lb_losses).mean()
loss = criterion(out.view(-1, out.shape[-1]),
targets.view(-1))
loss = loss + 0.01 * lb_loss
loss.backward()
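For intuition about what a load-balancing term measures, here is a pure-Python sketch of one common formulation (Switch Transformer style): num_experts * sum_i(f_i * P_i), where f_i is the fraction of top-k assignments sent to expert i and P_i is the mean router probability of expert i. This is an assumption chosen for illustration; switch_style_balance_loss is a hypothetical name and Cirilla's load_balancing_loss may use a different formula.

```python
def switch_style_balance_loss(router_probs, top_k=1):
    """router_probs: list of per-token probability lists (each sums to 1).

    Returns num_experts * sum_i(f_i * P_i). Evenly balanced assignments
    give 1.0; collapsing onto one expert pushes the value towards
    num_experts.
    """
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    counts = [0] * n_experts
    mean_prob = [0.0] * n_experts
    for probs in router_probs:
        # Indices of the top_k experts for this token.
        top = sorted(range(n_experts), key=lambda i: probs[i], reverse=True)[:top_k]
        for i in top:
            counts[i] += 1
        for i, p in enumerate(probs):
            mean_prob[i] += p / n_tokens
    frac = [c / (n_tokens * top_k) for c in counts]
    return n_experts * sum(f * p for f, p in zip(frac, mean_prob))


balanced = switch_style_balance_loss([[0.9, 0.1], [0.1, 0.9]])
collapsed = switch_style_balance_loss([[0.9, 0.1], [0.9, 0.1]])
print(balanced, collapsed)
```

The collapsed case scores higher than the balanced one, which is the signal the small auxiliary multiple adds to the main loss.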
Encoder
class cirilla.Cirilla_model.Encoder(args)
Non-causal (bidirectional) encoder block: N layers of BertAttention interleaved with SMoE. Used as the recursive core in CirillaTRM and as a standalone encoder for classification / embedding tasks.
Signature:
N×(BertAttention → SMoE)
| Argument | Type | Description |
args | .EncoderArgs | dim, d_ff, n_layers, n_heads, n_kv_heads, context_window, num_experts, k, layer_norm, dtype_str, use_sparse, fp8_recipe |
Example:
from cirilla.Cirilla_model import Encoder, EncoderArgs
import torch
enc = Encoder(EncoderArgs(dim=256, n_layers=4, context_window=512)).cuda()
# the encoder is fed embeddings of shape (batch, seq_len, dim) here, not token ids
emb = torch.randn(2, 512, 256, device='cuda', dtype=torch.bfloat16)
out = enc.pred(emb)
print(out.shape) # torch.Size([2, 512, 256])
Decoder
class cirilla.Cirilla_model.Decoder(args)
Causal decoder block: N layers of SlidingWindowAttention interleaved with SMoE. The building block inside Cirilla, CirillaMTP, and the token heads of MTP.
Signature:
N×(SlidingWindowAttention → SMoE)
| Argument | Type | Description |
args | .DecoderArgs | dim, d_ff, n_layers, n_heads, n_kv_heads, context_window, window_size, static_mask, soft_cap, num_experts, k, layer_norm, dtype_str, use_sparse, fp8_recipe |
Example:
from cirilla.Cirilla_model import Decoder, DecoderArgs
import torch
dec = Decoder(DecoderArgs(dim=256, n_layers=4, context_window=512,
torch_compile=False)).cuda()
x = torch.randn(2, 512, 256, device='cuda', dtype=torch.bfloat16)
out = dec.pred(x)
print(out.shape) # torch.Size([2, 512, 256])
MLPMixer1D
class cirilla.Cirilla_model.MLPMixer1D(args)
1-D MLP-Mixer: alternating token-mixing (Conv1d across sequence) and channel-mixing (Linear across features) blocks with pre-norm residual connections. Commonly used as the recursive core network inside CirillaTRM.
| Argument | Type | Description |
args | .MixerArgs | dim (default 256), depth (number of mixer layers, default 8), context_window (default 512), expansion_factor (token-mix hidden ratio, default 4), expansion_factor_token (channel-mix hidden ratio, default 0.5), dropout |
Example:
from cirilla.Cirilla_model import MLPMixer1D, MixerArgs, CirillaTRM, TRMArgs
import torch
mixer = MLPMixer1D(MixerArgs(dim=256, depth=6, context_window=512))
model = CirillaTRM(mixer, TRMArgs(dim=256))
print(model.n_params / 1e6, "M")
JSONDynamicDatset
class cirilla.Cirilla_model.JSONDynamicDatset(paths, shuffle_path, device, tokenizer, max_len, ...)
Variant of JSONLDataset that yields variable-length tensors (no padding). Use together with DynamicCollator as the DataLoader collate function to pad each batch to its longest sequence rather than padding to a global max_len.
Accepts the same arguments as JSONLDataset.
Example:
from cirilla.Cirilla_model import (
JSONDynamicDatset, DynamicCollator, CirillaTokenizer
)
from torch.utils.data import DataLoader
tokenizer = CirillaTokenizer(hub_url='AnthonyPa57/HF-torch-demo2')
pad_id = tokenizer.tokenizer.pad_token_id
ds = JSONDynamicDatset(['./example.jsonl'], tokenizer=tokenizer, max_len=512)
dl = DataLoader(ds, batch_size=8, collate_fn=DynamicCollator(pad_id))
DynamicCollator
class cirilla.Cirilla_model.DynamicCollator(pad_token_id)
Collate function for use with JSONDynamicDatset. Pads each batch to the length of its longest sequence using pad_token_id, rather than padding all sequences to a global maximum.
| Argument | Type | Description |
pad_token_id | int | token id to use for padding (e.g. tokenizer.tokenizer.pad_token_id) |
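The per-batch padding idea can be shown without PyTorch. The sketch below right-pads plain Python lists to the longest sequence in the batch and also returns an attention-style mask. pad_to_longest is a hypothetical helper; right-padding and the mask output are assumptions, and DynamicCollator itself operates on tensors.

```python
def pad_to_longest(batch, pad_token_id):
    """Right-pad every sequence in the batch to the longest one.

    Returns (padded, mask) where mask is 1 for real tokens and 0 for padding.
    """
    longest = max(len(seq) for seq in batch)
    padded = [seq + [pad_token_id] * (longest - len(seq)) for seq in batch]
    mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return padded, mask


padded, mask = pad_to_longest([[5, 6, 7], [8], [9, 10]], pad_token_id=0)
print(padded)
print(mask)
```

With a global max_len of 512 the same batch would carry 512 positions per sequence; here it carries only 3.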
mtp_training_step
func cirilla.Cirilla_model.mtp_training_step(self, data, pad_id) → float
Training step for CirillaMTP. Computes cross-entropy loss across all n_token_heads prediction heads. Bind to CirillaTrainer via MethodType.
| Argument | Type | Description |
self | CirillaTrainer | the trainer instance (bound via MethodType) |
data | tuple[Tensor, Tensor] | batch of (input_tokens, target_tokens) from the dataloader |
pad_id | int | padding token id used to mask loss on padding positions |
Example:
from types import MethodType
from functools import partial
from cirilla.Cirilla_model import mtp_training_step, mtp_inference_step
trainer.training_step = MethodType(partial(mtp_training_step, pad_id=pad_id), trainer)
trainer.inference_step = MethodType(partial(mtp_inference_step, pad_id=pad_id), trainer)
trainer.criterion = None
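The MethodType plus partial binding pattern used above is plain Python and can be tried in isolation. In this sketch ToyTrainer and masked_step are hypothetical stand-ins for CirillaTrainer and a custom step function: partial fixes the extra keyword argument, and MethodType supplies the trainer instance as self.

```python
from functools import partial
from types import MethodType


class ToyTrainer:
    """Hypothetical stand-in for CirillaTrainer with a default step."""

    def training_step(self, data):
        return sum(data)


def masked_step(self, data, pad_id):
    # 'self' is the trainer instance once the function is bound.
    return sum(tok for tok in data if tok != pad_id)


trainer = ToyTrainer()
default_result = trainer.training_step([0, 3, 4])

# partial fixes pad_id; MethodType passes the instance as the first argument.
trainer.training_step = MethodType(partial(masked_step, pad_id=3), trainer)
overridden_result = trainer.training_step([0, 3, 4])
print(default_result, overridden_result)
```

The override lives on this one instance only; a freshly created ToyTrainer still uses the class's default training_step.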
mtp_inference_step
func cirilla.Cirilla_model.mtp_inference_step(self, data, pad_id) → float
Validation/inference step for CirillaMTP. Same signature as mtp_training_step — computes loss using only the first token head (the main next-token prediction head). Bind to CirillaTrainer via MethodType.
| Argument | Type | Description |
self | CirillaTrainer | the trainer instance |
data | tuple[Tensor, Tensor] | batch of (input_tokens, target_tokens) |
pad_id | int | padding token id |
trm_training_step
func cirilla.Cirilla_model.trm_training_step(self, data, max_recurrent_step, halt_weight, halt_thresh, ema_model) → float
Training step for CirillaTRM. Unrolls the recursive refinement loop, computes classification cross-entropy loss, and adds a weighted halt loss that encourages the model to stop as early as possible.
| Argument | Type | Description |
self | CirillaTrainer | the trainer instance |
data | tuple[Tensor, ...] | batch from the dataloader |
max_recurrent_step | int | maximum number of recursive refinement steps (default 16) |
halt_weight | float | weight applied to the halt loss (default 0.5) |
halt_thresh | float | threshold above which the model considers itself done (default 0.5) |
ema_model | EMA | EMA wrapper around the model used to generate soft training targets |
Example:
from types import MethodType
from functools import partial
from cirilla.Cirilla_model import trm_training_step, trm_inference_step
trainer.training_step = MethodType(
partial(trm_training_step, max_recurrent_step=16,
halt_weight=0.5, halt_thresh=0.5, ema_model=ema_model), trainer)
trainer.inference_step = MethodType(
partial(trm_inference_step, max_recurrent_step=16, halt_thresh=0.5), trainer)
trainer.criterion = None
trm_inference_step
func cirilla.Cirilla_model.trm_inference_step(self, data, max_recurrent_step, halt_thresh) → float
Validation/inference step for CirillaTRM. Runs the adaptive halt loop and reports accuracy. Bind to CirillaTrainer via MethodType.
| Argument | Type | Description |
self | CirillaTrainer | the trainer instance |
data | tuple[Tensor, ...] | batch from the dataloader |
max_recurrent_step | int | maximum refinement steps (default 16) |
halt_thresh | float | halt threshold (default 0.5) |
bert_training_step
func cirilla.Cirilla_model.bert_training_step(self, data) → float
Default training step for CirillaBERT classification tasks. Calls model.pred(tokens, mask) and computes cross-entropy against the provided labels. Bind to CirillaTrainer via MethodType to override the default LM training step.
Example:
from types import MethodType
from cirilla.Cirilla_model import bert_training_step, bert_inference_step
import torch.nn as nn
trainer.training_step = MethodType(bert_training_step, trainer)
trainer.inference_step = MethodType(bert_inference_step, trainer)
trainer.criterion = nn.CrossEntropyLoss()
bert_inference_step
func cirilla.Cirilla_model.bert_inference_step(self, data) → float
Validation/inference counterpart of bert_training_step for CirillaBERT classification tasks. Bind to CirillaTrainer via MethodType.
scrape_fandom
func fandom_scraper.scrape_fandom(in_path, out_path, instruct_path, n_workers, wiki, lang) → None
scrape a given fandom wiki. Uses a Hugging Face SpanMarker model (Named Entity Recognition) to find new pages to scrape
fandom_scraper is a standalone package
| Argument | Type | Description |
in_path | Path | path to a folder with .json files containing lists of keywords to search for first (so-called seeds) |
out_path | Path | path to save the scraped texts to |
instruct_path | Path | path to save the scraped instructions to |
n_workers | int | how many async workers to use to fetch the fandom pages |
wiki | str | what fandom wiki to scrape |
lang | str | what language to use for the fandom |
Example:
in_path = "./witcher_json"
out_path = "./async_fandom"
instruct_path = "./async_fandom_instruct"
scrape_fandom(in_path,
out_path,
instruct_path,
n_workers = 50,
wiki = "Witcher",
lang = "en"
)
Example input file: ./witcher_json/witcher_1.json
[
"Geralt of Rivia", "Triss Merigold", "Vesemir", "Leo", "Lambert",
"Eskel", "Alvin", "Shani", "Zoltan Chivay", "Dandelion (Jaskier)",
"King Foltest", "Adda the White", ...
]
Example created output file: ./async_fandom/Geralt of Rivia.txt
Geralt of Rivia
Sub-Pages:
Main
Biography
Geralt was born as the son of the sorceress Visenna and presumably, the warrior Korin. Shortly after his
birth, his mother left him with the School of the Wolf at the stronghold of Kaer Morhen. There, Geralt
was made and trained to become a Witcher.
As a child, he ...
Example created output file: ./async_fandom_instruct/Geralt of Rivia.json
{
"Who is Geralt of Rivia?": "Geralt of Rivia is a witcher and the protagonist from The Witcher ...",
"What is Geralt's nickname?": "Geralt has a number of nicknames, with the more well known ones ...",
"Who trained Geralt to become a Witcher?": "Geralt was trained by his mentor Vesemir, who he ...",
"What is Geralt's profession?": "Geralt is a monster slayer for hire. He travels the world on ...",
"What is the Trial of The Grasses?": "The Trial of The Grasses was a painful process that ..."
}
instructions_into_conv
func fandom_scraper.instructions_into_conv(input_path, out_path) → None
convert scraped fandom instructions into a .jsonl file
| Argument | Type | Description |
input_path | Path | path to a folder with scraped instructions |
out_path | Path | path to save the .jsonl file to |
Example:
instructions_into_conv(input_path="./async_fandom_instruct",
out_path="./async_fandom_instruct_gathered.jsonl")
Example input file: ./async_fandom_instruct/Geralt of Rivia.json
{
"Who is Geralt of Rivia?": "Geralt of Rivia is a witcher and the protagonist from The Witcher ...",
"What is Geralt's nickname?": "Geralt has a number of nicknames, with the more well known ones ...",
"Who trained Geralt to become a Witcher?": "Geralt was trained by his mentor Vesemir, who he ...",
"What is Geralt's profession?": "Geralt is a monster slayer for hire. He travels the world on ...",
"What is the Trial of The Grasses?": "The Trial of The Grasses was a painful process that ..."
}
Example created output file (pretty-printed here for readability; in the actual .jsonl file each record occupies a single line): ./async_fandom_instruct_gathered.jsonl
{"subject": "Geralt of Rivia", "text": [
{"role": "user", "content": "Who is Geralt of Rivia?"},
{"role": "assistant", "content": "Geralt of Rivia is a witcher and the protagonist from ..."}
], "data type": "conv", "source": "fandom"}
{"subject": "Geralt of Rivia", "text": [
{"role": "user", "content": "What is Geralt's nickname?"},
{"role": "assistant", "content": "Geralt has a number of nicknames, with the more well known ones ..."}
], "data type": "conv", "source": "fandom"}
...
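The mapping from one instruction file to conversation records can be sketched with the standard library alone. The record layout mirrors the example output above; qa_to_conv_lines is a hypothetical helper, and taking the subject from the input file name is an assumption inferred from the example, so the real instructions_into_conv may differ in details.

```python
import json


def qa_to_conv_lines(subject, qa_pairs, source="fandom"):
    """Turn a {question: answer} dict into JSONL lines, one record per pair."""
    lines = []
    for question, answer in qa_pairs.items():
        record = {
            "subject": subject,
            "text": [
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ],
            "data type": "conv",
            "source": source,
        }
        # json.dumps without indent keeps each record on a single line.
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


qa = {"Who is Geralt of Rivia?": "Geralt of Rivia is a witcher ..."}
lines = qa_to_conv_lines("Geralt of Rivia", qa)
print(lines[0])
```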
Cirilla Vibe
Terminal chat UI. After installing Cirilla, launch a full TUI to chat with any Cirilla model pulled from HuggingFace Hub — no extra setup required.
uv run python -m cirilla.cli
On launch, choose a featured model or type any HuggingFace repo id (e.g. AnthonyPa57/Cirilla-0.3B-4E-grpo). Adjust generation settings live during chat with /set key=value:
| Command | Description |
/set temperature=0.8 | sampling temperature |
/set top_p=0.9 | nucleus sampling probability |
/set top_k=50 | top-k sampling (None to disable) |
/set n_beams=5 | number of beams / parallel samples |
/set kv_cache=true | toggle KV-cache generation |
/set auto_clear=true | clear history and KV-cache before each new message |
/clear | manually clear history and reset KV-cache |
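A /set key=value command like the ones in the table could be parsed roughly as follows. This is a hypothetical sketch (parse_set_command is not part of Cirilla) that assumes values are coerced to bool, None, int, or float where possible and otherwise kept as strings; the actual CLI may validate keys and values differently.

```python
def parse_set_command(line):
    """Parse '/set key=value' into (key, value); return None for other input."""
    parts = line.strip().split(maxsplit=1)
    if len(parts) != 2 or parts[0] != "/set" or "=" not in parts[1]:
        return None
    key, raw = (s.strip() for s in parts[1].split("=", 1))
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return key, lowered == "true"
    if lowered == "none":
        return key, None
    # Try the narrowest numeric type first, then fall back to the raw string.
    for cast in (int, float):
        try:
            return key, cast(raw)
        except ValueError:
            pass
    return key, raw


for cmd in ("/set temperature=0.8", "/set top_k=None", "/set kv_cache=true", "/clear"):
    print(cmd, "->", parse_set_command(cmd))
```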
Featured models:
Cirilla-0.3B-4E — base pretrained model
Cirilla-0.3B-4E-grpo — GRPO fine-tuned
Cirilla-0.3B-4E-grpo-icl — GRPO + in-context learning fine-tuned