Adding steps to a process using AI - and where it fails
AI can generate workflow steps, but it misses 50-70% of what makes a process useful. Form fields, automations, and context-specific logic still need human design.
Summary
AI step generation at Tallyfy - the real story from GitHub issues, performance logs, and production failures. What the demos do not show you.
- 50-70 percent of template value requires manual work - AI creates steps but misses form fields and automations entirely
- 25 seconds to generate, 40 seconds to continue - sequential API calls create painful wait times in production
- 80 percent field accuracy, 70 percent automation relevance - those are our actual success metrics when AI tries to help
- Format bugs break trust - AI outputs markdown but systems expect HTML, creating visible rendering failures. See how we handle AI in templates
AI can generate workflow steps. That sentence is technically true and wildly misleading at the same time.
The reality is more nuanced. AI generates step titles and descriptions reasonably well. It misses almost everything else that makes a process actually useful. Form fields, automations, conditional logic - the components that transform a checklist into an intelligent workflow - still require human design.
This is our experience building and iterating on AI step suggestions at Tallyfy. The GitHub issues, the performance problems, and the fundamental gap between what AI can do and what users expect it to do.
When AI step generation fails completely
We have seen AI step generation fail in three distinct ways. Total silence. Partial completion. And content that looks wrong.
From GitHub issue #8765, documenting complete failures:
“No response when creating a template using ‘Use an AI-generated template’”
Just nothing. The user clicks the button. The spinner spins. Nothing happens. The AI call times out, the error handling fails, and the user stares at a screen that offers no explanation.
Same issue, different symptom:
“Unable to generate a step description via AI - Generating a step description using the AI feature fails.”
The broader template generation might work, but individual step enhancement does not. You have a step called “Review contract” and you ask AI to write a description. Failure. No description. No error message that explains what went wrong.
And the systematic version:
“AI is not generating suggested steps for newly created templates”
This one hit after a deployment. AI worked on existing templates. AI failed on new templates. The difference was a database flag we forgot to initialize - a classic edge case that testing did not catch because testers were working with existing templates.

These failures are fixable bugs. We fixed them. But they illustrate something important about AI features: the failure modes are different from traditional software.
Traditional software fails predictably. A button breaks. An API returns an error. The failure has a clear cause and effect.
AI features fail ambiguously. Did the AI not understand the input? Did the model timeout? Did the prompt engineering miss an edge case? Did the context window overflow? The debugging is harder because the failure mode is often “the AI just did not do what we expected.”
The 50-70 percent gap
This is the number that should define how you think about AI workflow generation. From our Cloudflare Workers documentation:
“Currently, AI template creation only generates steps. Users must manually add: 1) Form fields for data collection 2) Automations for conditional logic”
Steps are maybe 30-50 percent of a useful template. The rest is everything the AI cannot generate.
The same documentation quantified the impact:
“This manual work negates 50-70% of AI automation benefits.”
Think about what a good template actually contains:
Steps - the sequence of activities. AI handles this reasonably well.
Form fields - the data collected at each step. Employee name. Order number. Approval decision. Shipping address. The AI does not know what data your process needs.
Automations - the rules that make workflows intelligent. If order value exceeds threshold, route to manager. If customer location is international, add compliance step. If approval is rejected, loop back to revision. These are business logic that requires domain knowledge.
Assignments - who does each step. AI can guess at roles but does not know your organization structure.
Deadlines - how long each step should take. AI can generate generic timeframes but does not know your SLAs.
The AI generates the skeleton. You still need to add everything that makes the skeleton move.
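To make that gap concrete, here is a minimal sketch of what a template holds - a simplified model with illustrative names, not Tallyfy's actual schema. The AI can draft the first property. Everything else is human work.

```typescript
// Illustrative only - not Tallyfy's actual schema.

interface Step {
  title: string;        // AI drafts these reasonably well
  description: string;  // AI drafts these reasonably well
}

interface FormField {
  label: string;                                  // e.g. "Order number", "Approval decision"
  type: "text" | "number" | "date" | "dropdown";
  required: boolean;
}

interface Automation {
  name: string;       // "Automation 1" tells you nothing six months later
  condition: string;  // e.g. "order_value > 10000"
  action: string;     // e.g. "route to manager for approval"
}

interface Template {
  steps: Step[];                        // the skeleton AI can generate
  // Everything below is the 50-70 percent that still needs human design:
  fields: FormField[];                  // the data your process actually collects
  automations: Automation[];            // business logic that needs domain knowledge
  assignments: Record<string, string>;  // step title -> role or person
  deadlines: Record<string, number>;    // step title -> days allowed, per your SLAs
}
```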

Performance that breaks the experience
Demo speeds are not production speeds. This was a hard lesson.
From GitHub issue #16357, documenting real performance:
“Template creation through AI-generated templates and document upload is slow because steps are created one by one via API rather than in bulk.”
The architectural problem: we made individual API calls for each step instead of batching. Ten steps meant ten API calls. Twenty steps meant twenty calls. Sequential, not parallel.
The actual numbers from production:
“Generate stage: Takes 25 seconds (AI-generated) or 23 seconds (document upload)”
Twenty-five seconds for initial generation. Not terrible, but not instant either. The user is watching a spinner, wondering if anything is happening.
Then the continuation phase:
“Continue stage: Takes almost 40 seconds (AI-generated) or 27 seconds (document upload)”
Another forty seconds if you want to refine or extend. Over a minute in total for something the demos make look like a three-second operation.
The perception problem is worse than the raw numbers suggest. Users have been trained by consumer apps to expect instant responses. A twenty-five second wait feels like something is broken. We added progress indicators, step-by-step feedback, anything to make the wait feel productive rather than dead.
But the fundamental problem was architecture. Sequential API calls are slow. We eventually moved to batch creation, but the initial version shipped with the slow path because we optimized for correctness first.
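The difference, as a rough sketch with hypothetical endpoint paths rather than the real Tallyfy API:

```typescript
// Hypothetical endpoints - a sketch of the architectural change, not the real API.
type Step = { title: string; description: string };

// Slow path: ten steps mean ten sequential round trips.
async function createStepsOneByOne(templateId: string, steps: Step[]): Promise<void> {
  for (const step of steps) {
    await fetch(`/api/templates/${templateId}/steps`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(step),
    });
  }
}

// Fast path: one round trip no matter how many steps the AI generated.
async function createStepsInBulk(templateId: string, steps: Step[]): Promise<void> {
  await fetch(`/api/templates/${templateId}/steps/bulk`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ steps }),
  });
}
```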
When the format is wrong
Even when AI generates content successfully, it can look broken. From GitHub issue #16640:
“TE > Step > Description > Generate creates markdown content instead of HTML, causing rendering issues after launch.”
The AI outputs markdown. Our description renderer expected HTML. The result was visible formatting characters instead of formatted text.
Users would see something like:
**Important:** Review all *contract terms* before proceeding to [next step](#).
Instead of:
Important: Review all contract terms before proceeding to next step.
The asterisks and brackets remained visible. The content was technically correct. The presentation was obviously broken.
This is a symptom of the broader AI integration challenge. AI outputs text. Your system expects structured data. The translation between them has edge cases. Markdown versus HTML. Newlines versus paragraph breaks. Unicode characters that render correctly in training data but break in production systems.
We fixed this specific bug by normalizing output formats. But the category of bug - AI output format does not match system expectations - keeps appearing in different forms.
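A minimal sketch of that normalization, assuming the open-source marked library for conversion - the actual fix may differ, but the idea is the same: convert AI text before it ever reaches the HTML renderer.

```typescript
import { marked } from "marked";

// Crude heuristic: markdown syntax present, no HTML tags.
function looksLikeMarkdown(text: string): boolean {
  const hasHtmlTag = /<[a-z][^>]*>/i.test(text);
  const hasMarkdown = /(\*\*|__|\[[^\]]+\]\([^)]*\)|^#{1,6}\s)/m.test(text);
  return hasMarkdown && !hasHtmlTag;
}

// Convert markdown output to HTML so asterisks and brackets never reach the UI.
function normalizeDescription(aiOutput: string): string {
  return looksLikeMarkdown(aiOutput) ? (marked.parse(aiOutput) as string) : aiOutput;
}
```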
The prompt engineering underneath
What does the AI actually get told to do? From GitHub issue #15017, the step description prompt:
“You are an expert at describing how to do something within business processes. Assuming that someone has no prior knowledge…”
This prompt optimizes for completeness. Generate descriptions that assume the reader knows nothing. That produces helpful content for new employees but verbose content for experienced workers who just need a quick reference.
The process ideation prompt from issue #14835 shows how we try to guide creative generation:
“You must brainstorm the departments or teams that are common in that company at least 8 times”
Numeric constraints in prompts. Tell the AI to generate at least eight options. Without this, the AI tends toward minimal output - one or two ideas instead of a useful range.
Same issue, different constraint:
“Ensure every process_idea is for a repeatable process - not a one-off task or one off project”
We had to explicitly exclude one-off tasks. Without this instruction, the AI would suggest things like “Plan office move” or “Launch product” - activities that happen once, not repeatable processes.
These prompts reveal how much engineering goes into getting useful AI output. The model has capabilities. Extracting those capabilities for specific use cases requires careful instruction design.
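As an illustration, here is a hedged sketch of how constraints like these end up in a prompt. The constraint strings echo the quotes above; the structure and the JSON output instruction are assumptions, not the production prompt.

```typescript
// Illustrative prompt builder - echoes the quoted constraints, not the production prompt.
function buildIdeationPrompt(companyDescription: string, minIdeas = 8): string {
  return [
    "You are an expert at identifying repeatable business processes.",
    `Company description: ${companyDescription}`,
    `You must brainstorm the departments or teams that are common in that company at least ${minIdeas} times.`,
    "Ensure every process_idea is for a repeatable process - not a one-off task or one-off project.",
    "Return a JSON array of objects with 'department' and 'process_idea' keys.",
  ].join("\n");
}
```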

Why users accept bad defaults
One of the more subtle problems from GitHub issue #15179:
“Users accept default automation names rather than creating custom ones, resulting in virtually useless identification labels.”
The AI generates something. The user accepts it unchanged. The result is generic and unhelpful.
Automation names like “Automation 1” or “Send notification” tell you nothing about what the automation actually does. Six months later, nobody remembers. But in the moment of creation, the default seemed fine.
This is a user experience problem, not an AI problem. But AI makes it worse because AI generates reasonable-looking defaults. A human creating an automation from scratch might pause to think about naming. A human editing an AI suggestion often just clicks accept.
The fix is partially design - make users think about naming - and partially AI - generate more specific default names. But the underlying issue is that AI suggestions reduce friction, and reduced friction means less thoughtful decision-making.
The accuracy we actually achieve
When AI does generate fields and automations (in our more advanced configurations), what accuracy do we see?
From the same Cloudflare documentation:
“Field Generation Accuracy: 80%+ of generated fields are relevant and usable. Automation Relevance: 70%+ of generated automations match workflow intent.”
Eighty percent field accuracy sounds good. It means one in five fields is wrong or unnecessary. In a template with twenty fields, that is four fields you need to remove or modify.
Seventy percent automation relevance is worse. Three out of ten automations miss the mark. Given that automations control workflow logic, a wrong automation can break process execution.
These numbers assume good input. Vague process descriptions produce much worse results. The metrics come from reasonably detailed source documents.
The implication: AI assistance requires human review. Always. The 20-30 percent error rate is too high to trust AI output blindly.
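One way to bake that requirement in - a sketch with hypothetical names, not our implementation - is to treat everything the AI produces as a draft until a human approves it.

```typescript
// Hypothetical names - the point is that AI output never applies itself.
type ReviewStatus = "ai_draft" | "approved" | "rejected";

interface Generated<T> {
  value: T;
  status: ReviewStatus;
}

// Everything the AI produces starts as a draft; a human flips the status.
function asDraft<T>(value: T): Generated<T> {
  return { value, status: "ai_draft" };
}

// At ~80% field accuracy and ~70% automation relevance, expect reviewers to
// reject roughly 1 in 5 generated fields and 3 in 10 generated automations.
```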
What we left out
Several capabilities we considered but did not build:
Automatic field type inference. AI could potentially look at a field name and determine the appropriate type - date, number, text, dropdown. We decided the risk of wrong inferences was too high. Wrong field types break data collection.
Cross-template learning. Train on a company’s existing templates to generate new templates that match their style. Privacy concerns killed this. We would need to use customer data for training, which creates data handling complications.
Real-time step suggestions. As users build templates, suggest next steps based on common patterns. We prototyped this and found it distracting. The suggestions interrupted template building flow more than they helped.
Automation generation from step descriptions. Infer if-then rules from natural language descriptions. The accuracy was too low. Wrong automations cause process failures in production.
The theme across all these: we kept hitting accuracy thresholds that made automatic generation risky. Human oversight became the design principle because the alternative was shipping features that would fail in production.
The honest assessment
AI step generation is useful. That statement needs qualification.
It is useful when you have good source documents. Upload a detailed SOP, get a reasonable step structure. Upload a vague description, get garbage.
It is useful as a starting point. The AI draft gives you something to edit rather than a blank page. Editing is easier than creating.
It is not useful as a replacement for human template design. The 50-70 percent gap is real. Form fields, automations, assignments, deadlines - these require human judgment about your specific business.
It is not useful when you need reliability. The failure modes are too unpredictable. Twenty-five second wait times break user experience. Format mismatches break content display. Timeout failures break the entire flow.
The BYO AI integration lets you connect your own AI providers, which helps with reliability - you can use providers you already trust and monitor. But it does not solve the fundamental accuracy limitations.
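What that boundary can look like, as a hedged sketch - hypothetical interface and endpoint shape, not the actual Tallyfy integration:

```typescript
type Step = { title: string; description: string };

// The workflow engine depends only on this narrow interface;
// you point it at whichever provider you already trust and monitor.
interface AiProvider {
  generateSteps(processDescription: string): Promise<Step[]>;
  generateStepDescription(stepTitle: string): Promise<string>;
}

// Example: wrap a chat-completion style HTTP endpoint you run yourself.
class HttpProvider implements AiProvider {
  constructor(private endpoint: string, private apiKey: string) {}

  private async complete(prompt: string): Promise<string> {
    const res = await fetch(this.endpoint, {
      method: "POST",
      headers: { "Content-Type": "application/json", Authorization: `Bearer ${this.apiKey}` },
      body: JSON.stringify({ prompt }),
    });
    const body = (await res.json()) as { text: string }; // response shape depends on your provider
    return body.text;
  }

  async generateSteps(processDescription: string): Promise<Step[]> {
    const raw = await this.complete(`List the workflow steps for: ${processDescription}`);
    // Deliberately simple parsing - real output still needs validation and human review.
    return raw.split("\n").filter(Boolean).map((title) => ({ title: title.trim(), description: "" }));
  }

  async generateStepDescription(stepTitle: string): Promise<string> {
    return this.complete(`Describe how to complete this step: ${stepTitle}`);
  }
}
```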
Where this goes next
Better models help. GPT-4 is more accurate than GPT-3.5. Future models will presumably be better still. The 80 percent field accuracy might become 90 percent. The 70 percent automation relevance might become 85 percent.
But I doubt we will see 99 percent accuracy anytime soon. The problem is not model capability - it is information availability. The AI does not know your business. It cannot know that your compliance team requires three-day review windows, or that your enterprise customers need approval chains that smaller customers do not.
That context lives in human heads. Some of it can be captured in prompts. Some of it requires human review of AI output.
The architecture we have settled on: AI generates drafts, humans review and approve. That loop is not going away. What changes over time is how complete the drafts are and how much human modification they require.
For now, expect to spend 50-70 percent of template building time on things the AI cannot do. That is still faster than building from scratch. It is not the autonomous workflow generation that marketing language implies.
The demos look magical. Production looks like work.
About the Author
Amit is the CEO of Tallyfy. He is a workflow expert and specializes in process automation and the next generation of business process management in the post-flowchart age. He has decades of consulting experience in task and workflow automation, continuous improvement (all the flavors) and AI-driven workflows for small and large companies. Amit did a Computer Science degree at the University of Bath and moved from the UK to St. Louis, MO in 2014. He loves watching American robins and their nesting behaviors!