The Human Element: Building Domain-Specific Intelligence for Enterprise HR

For years, HR teams have been buried in tactical requests, from policy questions to compliance tracking, limiting their ability to drive meaningful workforce initiatives.
HR professionals can spend nearly 60% of their time on administrative tasks, according to Deloitte—but AI is changing that. At Wisq, we saw AI as an opportunity to give these teams time back to innovate and truly build a better future of work for everyone.
AI agents are quickly becoming essential across enterprises. They’re already proving highly effective in customer service, sales, legal, and other business functions.
With the rise of AI teammates and increasing pressure on HR teams to do more with fewer resources, we expect this year to be a transformative one for HR. There will be a shift from AI that helps you get work done to AI that actually does work for you.
At Wisq, we believe this shift will enable HR teams to scale without increasing headcount. HR teams can finally spend less time answering repetitive questions and more time shaping workplace culture, workforce planning, and employee development. And they can do this through an AI agent purpose-built for them, not a generic wrapper built atop an LLM.
Introducing Harper, the AI HR Generalist
We recently announced Harper, the world's first AI HR Generalist specifically designed to help HR teams automate repetitive, manual tasks and focus on strategic impact.
Built to think and work like an HR generalist, Harper combines advanced language understanding with deep HR domain expertise to handle your day-to-day administrative workload.
When we set out to build Harper, we knew we wanted to go beyond a simple question-and-answer chatbot or a standard AI intranet, and that doing so would demand far more of us technologically.
Harper was designed with a few principles in mind:
- Harper had to meet the quality demands of mission-critical applications: We needed a platform that would adapt to each customer’s bespoke processes, grounding its results in authoritative domain knowledge.
- Harper had to understand bespoke documents and policies: HR teams need technology that understands their specific policies and processes, which are typically captured in documentation. We knew it would be important not only to ingest custom documents but also to interpret their structural nuances, beyond treating them as mere pages of text. And to make Wisq more valuable for customers, we needed Harper to enforce relevant policies rather than just display related content.
- Harper had to stay in the flow of work and keep humans in the loop: We needed Harper to integrate seamlessly into existing workflows, collaborating with employees, HR teams, and the HRIS to accomplish tasks and deliverables.
We wanted to build Harper to integrate directly into HR teams without a hitch so that, from Day 1, Harper can:
- Answer employee questions about policies, benefits, and procedures
- Generate key HR artifacts and correspondence
- Track and manage compliance requirements
- Handle tasks that require personalization and context, such as compiling an employee’s work history and other personal data
- Provide first-level support and triage for common HR requests and interact with a user like HR would, including asking questions and providing feedback
- Escalate complex issues to your human HR team when needed
What sets Harper apart is how it handles requests. Instead of providing simple answers, Harper engages in natural conversations, understands context, and can follow complex HR processes just like a fully integrated team member would.
Building Harper to Maximize Flexibility and Accuracy
HR technology requires both innovative problem-solving and strict factual accuracy, so we built Harper with a dual-minded approach.
Creativity in Generative AI isn’t inherently bad. You need creativity to “connect the dots” and provide solutions to previously unseen problems, but HR teams don’t want AI to be “creative” regarding facts and references.
We built Wisq to follow policies strictly for certain documents while allowing flexibility and creativity for others; it can differentiate between when strict accuracy is required and when creative interpretation is acceptable.
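As a rough illustration of this kind of routing, consider the sketch below. The document types, the strictness flag, and the prompt wording are hypothetical stand-ins for illustration, not Wisq’s actual configuration: requests grounded in strict documents get conservative decoding and a refusal instruction, while creative tasks get more latitude.

```python
from dataclasses import dataclass

# Hypothetical routing between "strict" and "creative" handling; the document
# types, flag, and prompt wording are our illustration, not Wisq's config.

@dataclass
class DocumentPolicy:
    strict: bool  # True: answers must be grounded verbatim in the source

POLICIES = {
    "benefits_policy": DocumentPolicy(strict=True),   # facts only
    "career_coaching": DocumentPolicy(strict=False),  # educated guesses OK
}

def generation_settings(doc_type: str) -> dict:
    """Choose decoding settings based on how strict the source document is."""
    # Unknown document types default to strict handling.
    policy = POLICIES.get(doc_type, DocumentPolicy(strict=True))
    if policy.strict:
        return {
            "temperature": 0.0,
            "system": ("Answer only from the provided policy text; if it does "
                       "not answer the question, say so and stop."),
        }
    return {
        "temperature": 0.7,
        "system": "You may generalize and offer constructive suggestions.",
    }
```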
[Figure: the LLM’s native response mashes up two tangentially related pieces of information, leading to an erroneous claim that’s not in the original Gallup article.]
An excerpt from the original article says, "The exact system you use becomes less important when managers know how to have regular and constructive conversations with employees about how to improve performance... If performance feedback only occurs a few times a year, it's unlikely to be meaningful. In contrast, when formal progress reviews are accompanied by frequent, honest feedback—and the review is consistent with what you've heard all year—they can be affirming, motivating and, at the very least, much less awkward."
Gallup has found that when managers provide weekly (vs. annual) feedback, team members are:
- 5.2x more likely to strongly agree that they receive meaningful feedback
- 3.2x more likely to strongly agree they are motivated to do outstanding work
- 2.7x more likely to be engaged at work
To mitigate hallucinations, we used the following strategies:
- Implement a Self-Validation Step: Ask the LLM to review its own answers for consistency. While this helps align the answer with the instructions, it is not sufficient for thoroughly fact-checking the LLM’s own bias (see below).
- Ground Answers in a Fact Database: Use a curated fact database specific to the HR domain and customer to ensure all answers are aligned with authoritative knowledge, e.g., by leveraging Retrieval-Augmented Generation (RAG).
- Know When Not to Help: For inquiries that require a human in the process, such as those involving specific regulations or company policies, suppress the AI response and redirect the query to a human.
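To make these three strategies concrete, here is a minimal sketch of how they might fit together in one pipeline. It assumes the OpenAI Python SDK; the model name, escalation keywords, and prompt wording are illustrative assumptions, not our production implementation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ESCALATE = "Let me connect you with the HR team for this one."

def answer_hr_question(question: str, facts: list[str]) -> str:
    """Sketch of the three mitigations: refusal, grounding, self-validation."""
    # Know when not to help: route sensitive topics straight to a human.
    # (Keyword gating here is a stand-in for a real topic classifier.)
    if any(t in question.lower() for t in ("visa", "termination", "disability")):
        return ESCALATE

    # Ground the answer: the model may only use retrieved, curated facts.
    context = "\n".join(f"- {f}" for f in facts)
    draft = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Answer using ONLY these facts:\n" + context},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content

    # Self-validation: a second pass checks the draft against the same facts.
    verdict = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Is every claim in the user's message backed by these "
                        "facts? Reply with exactly YES or NO.\n" + context},
            {"role": "user", "content": draft},
        ],
    ).choices[0].message.content

    return draft if verdict.strip().upper().startswith("YES") else ESCALATE
```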
Why HR Technology Needs More than Standard RAG
As we set out to build Harper, we realized that standard Retrieval-Augmented Generation (RAG) wasn't enough for HR's complex needs.
A standard “RAG stack” may be adequate for a simple question-and-answer interface, but it falls short when it comes to complex, specialized HR data.
While LLMs incorporate vast, encyclopedic knowledge, they are limited by a small working memory known as the context length. Even with advances in longer contexts (e.g., 200,000 tokens, or over 1 million in some models), an LLM’s ability to use its context effectively diminishes as input size increases. This limitation is further compounded by uneven attention to different positions within the input. In other words, simply overwhelming the LLM with large amounts of data is ineffective, not only because the data may exceed the context length, but also because the model struggles to make use of large inputs.
RAG enhances an LLM’s ability to access and leverage vast or private datasets. Unlike traditional search engines, RAG employs newer techniques such as embeddings and vector databases to pinpoint the most relevant pieces of information. This sifted content is then provided to the LLM as context, for example, to generate a response to a specific question.
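At its core, the retrieval step can be surprisingly small. The sketch below shows the basic mechanic using the OpenAI embeddings endpoint and cosine similarity over a few hard-coded chunks; those chunks and the sample policies are our illustrative stand-in for a real vector database.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Turn texts into unit-length vectors so a dot product = cosine similarity."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

# A stand-in for a vector database: three pre-embedded policy chunks.
chunks = [
    "Full-time employees accrue 1.5 days of PTO per month.",
    "Dress code is business casual on client-facing days.",
    "PTO requests longer than 5 days require manager approval.",
]
chunk_vecs = embed(chunks)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the top-k most relevant chunks for the question."""
    scores = chunk_vecs @ embed([question])[0]
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

# The retrieved chunks are then placed in the LLM's prompt as context.
print(retrieve("How much vacation do I earn each month?"))
```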
Many turnkey RAG platforms are available, including OpenAI’s Assistants API. Additionally, it is relatively simple to build a basic RAG-enabled chatbot for question answering. While these generic solutions can handle short Q&A tasks adequately, they fall short in several key areas:
- Documents in RAG systems are typically processed as plain text and divided into manageable chunks for embedding. However, generic chunking methods often fail to preserve the document's inherent structure and relationships. Consider a comprehensive employee performance review: when broken into chunks, critical metadata like the employee's name and reporting relationship might become disconnected from the detailed feedback at the end of the document, making it difficult to assess relevance.
- RAG's retrieval algorithm returns the top-K most relevant chunks for a question, hoping that enough of those passages contain the answer. However, this approach treats each chunk as an isolated piece of information, disregarding whether the retrieved set sufficiently covers the context required to answer the question. While this method may suffice for short answers, it struggles with queries that require enumeration or aggregated information.
- For example, if you ask an off-the-shelf RAG system to summarize employee growth areas based on a performance review, it may overlook key information in the review. And if it pulls excerpts from policy documents based on a topic (e.g., “PTO”), it might give inaccurate answers because the retrieval omits rules on eligibility.
- Importantly, effective knowledge retrieval in enterprise settings often requires an understanding of business logic and basic planning capabilities—both of which are beyond the scope of generic RAG solutions. Consider a seemingly simple question such as, "How many vacation days can I take this year?" Answering this requires:
1. Retrieving the employee’s profile,
2. Identifying the applicable vacation policy based on the employee’s location, tenure, and other attributes, and
3. Interpreting the policy to provide an accurate response (a sketch of this flow appears after this list).
- The response might not align with your policies. In scenarios like coaching and learning assistance, it can be acceptable for RAG to offer an educated guess when an exact answer is unavailable. Conversely, for some inquiries, company policy mandates that the system refrain from answering when a precise answer cannot be provided.
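Here is a minimal sketch of the vacation-days example above. The profile store, policy table, and matching rules are hypothetical stand-ins for real HRIS and policy lookups, but they show the business-logic planning that generic RAG lacks.

```python
# Hypothetical profile store and policy table; real deployments would pull
# these from the HRIS and policy documents.
EMPLOYEES = {"e-1001": {"location": "CA", "tenure_years": 4}}

# Policies ordered most-specific first: (location, minimum tenure, days).
VACATION_POLICIES = [
    ("CA", 5, 20),
    ("CA", 0, 15),
    ("ANY", 0, 12),
]

def vacation_days(employee_id: str) -> int:
    # Step 1: retrieve the employee's profile.
    profile = EMPLOYEES[employee_id]
    # Step 2: identify the applicable policy by location and tenure.
    for location, min_tenure, days in VACATION_POLICIES:
        if location in (profile["location"], "ANY") \
                and profile["tenure_years"] >= min_tenure:
            # Step 3: interpret the policy to produce the answer.
            return days
    raise LookupError("No policy matched; escalate to a human.")

print(vacation_days("e-1001"))  # -> 15 (California, under 5 years of tenure)
```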
Example: RAG retrieval failure
The OpenAI assistant failed to retrieve most of the goals, and no amount of prodding in the prompt could fix this, because the retrieval step did not provide chunks covering all the goals. In contrast, when the user includes the entire file in the prompt, the same LLM successfully lists all eight goals. However, that approach does not scale, which is why we needed a smarter solution.
[Figure: the assistant’s incomplete goal list compared with the full-context response listing all eight goals]
Knowledge management is essential for mission-critical agentic applications. To overcome the limitations of standard RAG in handling complex HR data, we use the following:
- Extract and index structured content based on domain knowledge and business logic.
- Avoid one-size-fits-all indexing that treats all files the same by designing the ingestion, indexing, and retrieval pipeline to match what’s in the documents.
- Invest in proper classification and metadata to enable multi-step reasoning in knowledge retrieval.
- Design guardrails based on the content of the documents to ensure alignment with company policies.
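As a sketch of what structure-aware ingestion can look like, the snippet below routes each document type to its own ingestor and carries document-level metadata onto every chunk. The document types, field names, and guardrail flag are our illustrative assumptions, not Wisq’s actual pipeline.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def ingest_performance_review(doc: dict) -> list[Chunk]:
    """Carry document-level metadata onto every chunk so that, e.g., a feedback
    paragraph deep in the review stays linked to the employee it describes."""
    shared = {"doc_type": "performance_review",
              "employee": doc["employee"],
              "manager": doc["manager"]}
    return [Chunk(text=section, metadata={**shared, "section": name})
            for name, section in doc["sections"].items()]

def ingest_policy(doc: dict) -> list[Chunk]:
    """Policy rules get a guardrail flag so retrieval can enforce, not just quote."""
    return [Chunk(text=rule,
                  metadata={"doc_type": "policy", "topic": doc["topic"],
                            "enforce": True})
            for rule in doc["rules"]]

# Routing by document type replaces one-size-fits-all chunking.
INGESTORS = {"performance_review": ingest_performance_review,
             "policy": ingest_policy}
```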
Building for HR: Minimizing Latency without Sacrificing Accuracy
While advances in reasoning models and test-time computation are promising, they may not suit every HR use case. Widely used LLMs (e.g., GPT-4o, Claude 3.5) function as “System 1 thinkers,” relying on a decoder-only transformer architecture that predicts tokens based on the prior input sequence. This limits their ability to handle complex reasoning and long conversations.
Chain of Thought (CoT) is a widely adopted method that prompts LLMs to explicitly articulate intermediate reasoning steps, resulting in more accurate and interpretable responses. CoT effectively leverages the autoregressive nature of LLMs, which generate tokens one by one or a few at a time, conditioned on prior context. In other words, an LLM literally makes things up as it goes; CoT coaxes the LLM to think out loud before providing the final answer.
In newer reasoning-focused models (e.g., OpenAI’s o1 models), CoT is sometimes combined with other methods like beam search, Monte Carlo Tree Search, or reinforcement learning, in a process often referred to as test-time computation. These models simulate “System 2 thinking” by exploring and evaluating multiple solutions during inference, leading to much improved results out of the box.
As a team, we had to consider if we should go with the newest, most advanced reasoning models. Here are the downsides:
- Cost: They can be expensive, as you’re charged a higher base rate that includes intermediate CoT tokens that never appear in the final answer.
- Speed: They are slow. The extra reasoning tokens add to the user’s wait time, and because the thought process is opaque, the test-time-compute latency can be unpredictable. Therefore, these models are not ideal for interactive applications.
- Focus: They may not add value in your domain. The current generation of reasoning models is primarily trained for STEM tasks (e.g., coding and math) because these problems have objective, verifiable answers and abundant human and synthetic data.
Here’s an example to illustrate these downsides:
[Figure: example illustrating these downsides]
We decided it makes more sense to reserve these advanced reasoning models for complicated STEM tasks where cost and latency are not priorities, such as offline research, coding assistance, or technical analysis. For real-time, interactive tasks, enhancing “System 1” models with a task-specific reasoning prompt strikes a better balance between speed and quality. We can avoid verbose CoT prompts like “Let’s think step by step” and instead use a structured “thought template” aligned with the task’s requirements.
In the example below, generic CoT prompting produces seven times more tokens than necessary, increasing latency without improving accuracy. Our solution is a concise, custom reasoning prompt that minimizes latency without sacrificing accuracy.
Imagine we’re asking an LLM to take a SHRM certification test. The example below shows a sample question; the correct answer is B, and it requires minimal reasoning to get there.
[Figure: sample SHRM question]
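For a question like that, a thought template might look like the sketch below. The prompt wording is our illustration rather than our production prompt; the point is that the template caps the reasoning at exactly what the task needs.

```python
# Generic CoT: open-ended, so the model narrates far more than it needs to.
GENERIC_COT = "Let's think step by step."

# Task-specific thought template: bounds reasoning at what the question needs.
THOUGHT_TEMPLATE = (
    "Answer in exactly this format:\n"
    "Key consideration: <one sentence>\n"
    "Answer: <letter>"
)

def build_messages(question: str, structured: bool = True) -> list[dict]:
    """Assemble the chat prompt; the structured path bounds the output length."""
    system = THOUGHT_TEMPLATE if structured else GENERIC_COT
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```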
In the end, it’s clear that newer reasoning-focused models imitate System 2 thinking by talking out loud, a linguistic method that doesn’t always serve our customers as well as a thought template might. Expensive reasoning models aren’t always necessary, and simpler models can handle some questions at a lower cost.
At Wisq, our AI agent is tailored to follow structured conversational steps rather than using generic, lengthy reasoning processes. This approach makes the responses faster, more cost-effective, and better suited to HR tasks.
Our Vision for Harper
We see a world where HR teams work alongside fully integrated digital teammates to deliver better experiences for all employees.
Get in touch with the Wisq team to learn more about adding Harper to your organization.
References
System 1 vs. System 2 thinking
Speculative decoding
Reasoning models
- https://openai.com/o1/
- https://arcprize.org/blog/oai-o3-pub-breakthrough
- https://arxiv.org/abs/2501.12948
- https://arxiv.org/abs/2502.07374
- https://arxiv.org/abs/2501.19393v2
Grounding
Long context
- https://arxiv.org/abs/2307.03172
- https://mail.gregkamradt.com/posts/pressure-testing-gpt-4-claude-2-1-long-context
Chatbot w/ RAG

Harper is an AI teammate, powered by Wisq’s Agentic AI Platform, designed specifically for enterprise HR.