General class for creating synthetic datasets with Ollama.
| Argument | Type | Description |
| --- | --- | --- |
| model | `str` | name of the Ollama model, e.g. `"llama3.1:8b"` |
| system_prompt | `str` | system prompt for the chosen Ollama model; used by `__call__`, `dynamic_hierarchical_summary` and `single_pass_summary` |
| response_template | `pydantic.BaseModel` | template for structured responses; used by `__call__`, `dynamic_hierarchical_summary`, `single_pass_summary` and `multi_turn` |
__call__(paths, save_to, seed, checkpoint, skip) → None
Create synthetic instructions (question-answer pairs).
To see how to turn the instructions into a .jsonl file,
see the rm_duplicate_instructs function.
| Argument | Type | Description |
| --- | --- | --- |
| paths | `list[Path]` | paths to .txt files containing the texts to create question-answer pairs from |
| save_to | `Path` | folder to save the generated question-answer pairs to |
| seed | `int` | seed for the Ollama model |
| checkpoint | `int` | save the generated question-answer pairs to the `save_to` folder after every `checkpoint` iterations |
| skip | `bool` | if `True`, skip question-answer pairs that already exist; otherwise save as `existing_name_{i}`, where `i` is the number of already existing files with the same name |
Example:

```python
import os
import random

from pydantic import BaseModel, Field

class Response(BaseModel):
    question: str = Field(description="What question is appropriate to this text?")
    answer: str = Field(description="Answer to the question")

sys_prompt = """
You are an expert dataset annotator for instruction-tuning large language models.
Your task is to create high-quality question-answer pairs from provided texts
for training instruct models.

Guidelines:
- Keep the question relevant and informative for learners.
- Avoid using markdown or any unnecessary formatting.
- You can ask to elaborate based on a keyword or phrase in the text.
- You can ask about the plot if the text is a story.
- Do not use overly formal language.
- Use only the information provided in the text.
- If the text states that any part of it is from Netflix, or mentions that a section
  is from Netflix, ignore that part and do not include it in the question or answer.
- If the user specifies an already created question-answer pair, find a question-answer
  pair that is different from the one provided. If this is impossible, use different
  words than the ones provided.
- Return the output strictly as JSON with two fields: "question" and "answer".
"""

folder = "./witcher_fandom"
paths = os.listdir(folder)
paths = [os.path.join(folder, p) for p in paths]
print(f"{len(paths)} paths found")

for _ in range(3):
    dual = OllamaCurate(
        model="qwen3:8b",
        system_prompt=sys_prompt,
        response_template=Response,
    )
    dual(
        paths,
        save_to="./witcher_synthetic_instruct/qwen3:8b",
        seed=random.randint(0, 1000),
        checkpoint=10,
        skip=False,
    )
```
Example of a created output file, `./witcher_synthetic_instruct/qwen3:8b/Deadly Plot.json`:

```json
{
    "question": "What location must the player meet Dijkstra at for the A Deadly Plot quest?",
    "answer": "Passiflora."
}
```

The output file's contents depend on what was present in the corresponding text file, `./witcher_fandom/Deadly Plot.txt`.
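The `rm_duplicate_instructs` function is the supported way to deduplicate the pairs and build the .jsonl file. Purely as an illustration of the mechanical part, here is a minimal sketch of collecting the generated files by hand; it assumes only the output layout shown above (one JSON file per text with `"question"` and `"answer"` fields), and the `instruct_dataset.jsonl` name is hypothetical:

```python
import json
from pathlib import Path

# Illustrative sketch only; use rm_duplicate_instructs for the supported
# deduplication + .jsonl conversion. Assumes one JSON file per text with
# "question" and "answer" fields, as in the example output above.
src = Path("./witcher_synthetic_instruct/qwen3:8b")
with open("instruct_dataset.jsonl", "w", encoding="utf-8") as out:
    for file in sorted(src.glob("*.json")):
        pair = json.loads(file.read_text(encoding="utf-8"))
        out.write(json.dumps(pair, ensure_ascii=False) + "\n")
```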
single_pass_summary(paths, save_to, seed, num_predict, use_response_template) → None
Create a simple single-pass summary for each of the provided .txt files.
To see how to turn the summaries into a dataset, see gather_summaries and summaries_to_instruct.
| Argument | Type | Description |
| --- | --- | --- |
| paths | `list[Path]` | paths to .txt files containing the texts to summarize |
| save_to | `Path` | folder to save the generated summaries to |
| seed | `int` | seed for the Ollama model |
| num_predict | `int` | maximum number of tokens the Ollama model can generate |
| use_response_template | `bool` | whether to use the response template that was set when initializing the OllamaCurate class |
Example:

```python
import os
import random

from pydantic import BaseModel, Field

class Response(BaseModel):
    summary: str = Field(
        description="Final summary of the entire text. Pure summary only, "
                    "no introduction or reasoning."
    )

paths = os.listdir("./witcher_fandom")
paths = [os.path.join("./witcher_fandom", p) for p in paths]
print(f"{len(paths)} paths found")

for m in ["granite3.1-moe:3b"]:
    model = OllamaCurate(model=m, system_prompt="", response_template=Response)
    model.single_pass_summary(
        paths,
        save_to=f"./witcher_synth_summaries/{m}",
        seed=random.randint(0, 1000),
        num_predict=4096,
        use_response_template=True,
    )
```
dynamic_hierarchical_summary(paths, save_to, chunk_lines, seed, num_predict, max_words_summary, use_response_template) → None
Create a hierarchical summary for each of the provided .txt files: the text is divided into chunks, a summary is generated for each chunk, and a final summary is then generated from the chunk summaries.
| Argument | Type | Description |
| --- | --- | --- |
| paths | `list[Path]` | paths to .txt files containing the texts to summarize |
| save_to | `Path` | folder to save the generated summaries to |
| chunk_lines | `int` | the number of lines by which to chunk the text, e.g. for a text of 1000 lines and `chunk_lines=200` the text is divided into 5 chunks of 200 lines each; every chunk gets its own summary, and a final summary is generated from them |
| seed | `int` | seed for the Ollama model |
| num_predict | `int` | maximum number of tokens the Ollama model can generate |
| max_words_summary | `int` | maximum number of words Ollama should target for the summary |
| use_response_template | `bool` | whether to use the response template that was set when initializing the OllamaCurate class |
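To make the `chunk_lines` arithmetic concrete, here is a minimal standalone sketch of line-based chunking; it illustrates the semantics described in the table above, not the class's internal implementation:

```python
# Illustration of the chunk_lines semantics, not OllamaCurate internals:
# a 1000-line text with chunk_lines=200 yields 5 chunks of 200 lines each.
def chunk_by_lines(text: str, chunk_lines: int) -> list[str]:
    lines = text.splitlines()
    return [
        "\n".join(lines[i:i + chunk_lines])
        for i in range(0, len(lines), chunk_lines)
    ]
```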
Example:

```python
import os
import random

from pydantic import BaseModel, Field

class Response(BaseModel):
    summary: str = Field(
        description="Summary of the text, without the thinking process and without "
                    "any introduction. Provide only the pure summary, be expressive "
                    "but stick to the maximum number of words that were provided."
    )

paths = os.listdir("./witcher_texts")
paths = [os.path.join("./witcher_texts", p) for p in paths]
print(f"{len(paths)} paths found")

for m in ["llama3.2:3b", "llama3.1:8b", "qwen3:8b"]:
    model = OllamaCurate(
        model=m,
        system_prompt="",
        response_template=Response,
    )
    model.dynamic_hierarchical_summary(
        paths,
        save_to=f"./synth_summaries/witcher_texts/{m}",
        chunk_lines=100,
        seed=random.randint(0, 1000),
        num_predict=2048,
        max_words_summary=500,
        use_response_template=True,
    )
```
multi_turn(paths, save_to, bar, n_turns_range, seed, prob_chance_new_context) → None
Create synthetic multi-turn instructions (question-answer pairs).
| Argument | Type | Description |
| --- | --- | --- |
| paths | `list[Path]` | paths to .txt files containing the texts to create multi-turn question-answer pairs from |
| save_to | `Path` | folder to save the generated multi-turn question-answer pairs to |
| bar | `tqdm.tqdm` | tqdm bar to track progress |
| n_turns_range | `tuple[int, int]` | range for the number of turns to generate, e.g. `(2, 4)` generates question-answer pairs with 2 to 4 turns (the actual number of turns is chosen randomly from this range and is consistent across all the files) |
| seed | `int` | seed for the Ollama model |
| prob_chance_new_context | `float` | probability of starting a new context for the question-answer pairs: with this chance the multi-turn conversation continues with a new context from another randomly chosen .txt file, as shown in the sketch below |
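As an illustration of what drawing a new context with `prob_chance_new_context` means, here is a hypothetical sketch of the sampling step (`maybe_new_context` is an invented helper, not part of OllamaCurate):

```python
import random

# Hypothetical sketch of the prob_chance_new_context semantics, not
# OllamaCurate's actual implementation: before a turn, the conversation
# switches to a randomly chosen new .txt context with this probability.
def maybe_new_context(current_path, paths, prob_chance_new_context):
    if random.random() < prob_chance_new_context:
        return random.choice(paths)  # continue with a new context
    return current_path              # keep the current context
```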
Example:

```python
import os
import random

from pydantic import BaseModel, Field
from tqdm import tqdm

class Response(BaseModel):
    question: str = Field(description="What question is appropriate to this text?")
    answer: str = Field(description="Answer to the question")

paths = os.listdir("./witcher_fandom")
paths = [os.path.join("./witcher_fandom", p) for p in paths]

# 3 models x 3 runs per model x all files
bar = tqdm(total=3 * 3 * len(paths))
for model in ["qwen3:8b", "phi4", "llama3.1:8b"]:
    for _ in range(3):
        ol = OllamaCurate(model, "", Response)
        ol.multi_turn(
            paths,
            save_to=f"./synth_multi_round/{model}",
            bar=bar,
            n_turns_range=(2, 5),
            seed=random.randint(0, 1000),
            prob_chance_new_context=0.3,
        )
```