SLM_EVENTS(3CPC) CPU Performance Counters Library Functions
NAME
slm_events - processor model specific performance counter events
DESCRIPTION
This manual page describes events specific to the following Intel CPU
models and is derived from Intel's perfmon data. For more information,
please consult the Intel Software Developer's Manual or Intel's perfmon
website.
CPU models described by this document:
+o Family 0x6, Model 0x4c
+o Family 0x6, Model 0x4d
+o Family 0x6, Model 0x37
The following events are supported:
br_inst_retired.all_branches ALL_BRANCHES counts the number of any branch instructions
retired. Branch prediction predicts the branch target and
enables the processor to begin executing instructions long
before the branch true execution path is known. All branches
utilize the branch prediction unit (BPU) for prediction. This
unit predicts the target address not only based on the EIP of
the branch but also based on the execution path through which
execution reached this EIP. The BPU can efficiently predict the
following branch types: conditional branches, direct calls and
jumps, indirect calls and jumps, returns.
br_inst_retired.jcc JCC counts the number of conditional branch (JCC) instructions
retired. Branch prediction predicts the branch target and
enables the processor to begin executing instructions long
before the branch true execution path is known. All branches
utilize the branch prediction unit (BPU) for prediction. This
unit predicts the target address not only based on the EIP of
the branch but also based on the execution path through which
execution reached this EIP. The BPU can efficiently predict the
following branch types: conditional branches, direct calls and
jumps, indirect calls and jumps, returns.
br_inst_retired.taken_jcc TAKEN_JCC counts the number of taken conditional branch (JCC)
instructions retired. Branch prediction predicts the branch
target and enables the processor to begin executing
instructions long before the branch true execution path is
known. All branches utilize the branch prediction unit (BPU)
for prediction. This unit predicts the target address not only
based on the EIP of the branch but also based on the execution
path through which execution reached this EIP. The BPU can
efficiently predict the following branch types: conditional
branches, direct calls and jumps, indirect calls and jumps,
returns.
br_inst_retired.call CALL counts the number of near CALL branch instructions
retired. Branch prediction predicts the branch target and
enables the processor to begin executing instructions long
before the branch true execution path is known. All branches
utilize the branch prediction unit (BPU) for prediction. This
unit predicts the target address not only based on the EIP of
the branch but also based on the execution path through which
execution reached this EIP. The BPU can efficiently predict the
following branch types: conditional branches, direct calls and
jumps, indirect calls and jumps, returns.
br_inst_retired.rel_call REL_CALL counts the number of near relative CALL branch
instructions retired. Branch prediction predicts the branch
target and enables the processor to begin executing
instructions long before the branch true execution path is
known. All branches utilize the branch prediction unit (BPU)
for prediction. This unit predicts the target address not only
based on the EIP of the branch but also based on the execution
path through which execution reached this EIP. The BPU can
efficiently predict the following branch types: conditional
branches, direct calls and jumps, indirect calls and jumps,
returns.
br_inst_retired.ind_call IND_CALL counts the number of near indirect CALL branch
instructions retired. Branch prediction predicts the branch
target and enables the processor to begin executing
instructions long before the branch true execution path is
known. All branches utilize the branch prediction unit (BPU)
for prediction. This unit predicts the target address not only
based on the EIP of the branch but also based on the execution
path through which execution reached this EIP. The BPU can
efficiently predict the following branch types: conditional
branches, direct calls and jumps, indirect calls and jumps,
returns.
br_inst_retired.return RETURN counts the number of near RET branch instructions
retired. Branch prediction predicts the branch target and
enables the processor to begin executing instructions long
before the branch true execution path is known. All branches
utilize the branch prediction unit (BPU) for prediction. This
unit predicts the target address not only based on the EIP of
the branch but also based on the execution path through which
execution reached this EIP. The BPU can efficiently predict the
following branch types: conditional branches, direct calls and
jumps, indirect calls and jumps, returns.
br_inst_retired.non_return_ind NON_RETURN_IND counts the number of near indirect JMP and near
indirect CALL branch instructions retired. Branch prediction
predicts the branch target and enables the processor to begin
executing instructions long before the branch true execution
path is known. All branches utilize the branch prediction unit
(BPU) for prediction. This unit predicts the target address not
only based on the EIP of the branch but also based on the
execution path through which execution reached this EIP. The
BPU can efficiently predict the following branch types:
conditional branches, direct calls and jumps, indirect calls
and jumps, returns.
br_inst_retired.far_branch FAR counts the number of far branch instructions retired.
Branch prediction predicts the branch target and enables the
processor to begin executing instructions long before the
branch true execution path is known. All branches utilize the
branch prediction unit (BPU) for prediction. This unit predicts
the target address not only based on the EIP of the branch but
also based on the execution path through which execution
reached this EIP. The BPU can efficiently predict the following
branch types: conditional branches, direct calls and jumps,
indirect calls and jumps, returns.
br_misp_retired.all_branches ALL_BRANCHES counts the number of any mispredicted branch
instructions retired. This umask is an architecturally defined
event. This event counts the number of retired branch
instructions that were mispredicted by the processor,
categorized by type. A branch misprediction occurs when the
processor predicts that the branch would be taken, but it is
not, or vice-versa. When the misprediction is discovered, all
the instructions executed in the wrong (speculative) path must
be discarded, and the processor must start fetching from the
correct path.
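Together with br_inst_retired.all_branches, this event yields a simple derived
metric. A minimal sketch, assuming counts sampled over the same interval via
cpc(3CPC); the function name and sample values are illustrative only:

```python
def mispredict_rate(br_misp_retired_all, br_inst_retired_all):
    """Fraction of retired branch instructions that were mispredicted.

    Both arguments are raw counts of br_misp_retired.all_branches and
    br_inst_retired.all_branches taken over the same interval.
    """
    if br_inst_retired_all == 0:
        return 0.0
    return br_misp_retired_all / br_inst_retired_all

# A rate of more than a few percent on branch-heavy code suggests the
# speculative-path flushes described above are costing real cycles.
```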
br_misp_retired.jcc JCC counts the number of mispredicted conditional branches
(JCC) instructions retired. This event counts the number of
retired branch instructions that were mispredicted by the
processor, categorized by type. A branch misprediction occurs
when the processor predicts that the branch would be taken, but
it is not, or vice-versa. When the misprediction is
discovered, all the instructions executed in the wrong
(speculative) path must be discarded, and the processor must
start fetching from the correct path.
br_misp_retired.taken_jcc TAKEN_JCC counts the number of mispredicted taken conditional
branch (JCC) instructions retired. This event counts the
number of retired branch instructions that were mispredicted by
the processor, categorized by type. A branch misprediction
occurs when the processor predicts that the branch would be
taken, but it is not, or vice-versa. When the misprediction is
discovered, all the instructions executed in the wrong
(speculative) path must be discarded, and the processor must
start fetching from the correct path.
br_misp_retired.ind_call IND_CALL counts the number of mispredicted near indirect CALL
branch instructions retired. This event counts the number of
retired branch instructions that were mispredicted by the
processor, categorized by type. A branch misprediction occurs
when the processor predicts that the branch would be taken, but
it is not, or vice-versa. When the misprediction is
discovered, all the instructions executed in the wrong
(speculative) path must be discarded, and the processor must
start fetching from the correct path.
br_misp_retired.return RETURN counts the number of mispredicted near RET branch
instructions retired. This event counts the number of retired
branch instructions that were mispredicted by the processor,
categorized by type. A branch misprediction occurs when the
processor predicts that the branch would be taken, but it is
not, or vice-versa. When the misprediction is discovered, all
the instructions executed in the wrong (speculative) path must
be discarded, and the processor must start fetching from the
correct path.
br_misp_retired.non_return_ind NON_RETURN_IND counts the number of mispredicted near indirect
JMP and near indirect CALL branch instructions retired. This
event counts the number of retired branch instructions that
were mispredicted by the processor, categorized by type. A
branch misprediction occurs when the processor predicts that
the branch would be taken, but it is not, or vice-versa. When
the misprediction is discovered, all the instructions executed
in the wrong (speculative) path must be discarded, and the
processor must start fetching from the correct path.
uops_retired.ms This event counts the number of micro-ops retired that were
supplied from MSROM.
uops_retired.all This event counts the number of micro-ops retired. The
processor decodes complex macro instructions into a sequence of
simpler micro-ops. Most instructions are composed of one or two
micro-ops. Some instructions are decoded into longer sequences
such as repeat instructions, floating point transcendental
instructions, and assists. In some cases micro-op sequences are
fused or whole instructions are fused into one micro-op. See
other UOPS_RETIRED events for differentiating retired fused and
non-fused micro-ops.
machine_clears.smc This event counts the number of times that a program writes to
a code section. Self-modifying code causes a severe penalty in
all Intel architecture processors.
machine_clears.memory_ordering This event counts the number of times that pipeline was cleared
due to memory ordering issues.
machine_clears.fp_assist This event counts the number of times that pipeline stalled due
to FP operations needing assists.
machine_clears.all Machine clears happen when something happens in the machine
that causes the hardware to need to take special care to get
the right answer. When such a condition is signaled on an
instruction, the front end of the machine is notified that it
must restart, so no more instructions will be decoded from the
current path. All instructions "older" than this one will be
allowed to finish. This instruction and all "younger"
instructions must be cleared, since they must not be allowed to
complete. Essentially, the hardware waits until the
problematic instruction is the oldest instruction in the
machine. This means all older instructions are retired, and
all pending stores (from older instructions) are completed.
Then the new path of instructions from the front end are
allowed to start into the machine. There are many conditions
that might cause a machine clear (including the receipt of an
interrupt, or a trap or a fault). All those conditions
(including but not limited to MACHINE_CLEARS.MEMORY_ORDERING,
MACHINE_CLEARS.SMC, and MACHINE_CLEARS.FP_ASSIST) are captured
in the ANY event. In addition, some conditions can be
specifically counted (i.e. SMC, MEMORY_ORDERING, FP_ASSIST).
However, the sum of SMC, MEMORY_ORDERING, and FP_ASSIST machine
clears will not necessarily equal the number of ANY.
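Because ALL is a superset of the individually counted causes, the difference
between ALL and their sum is the count of clears from other sources. A sketch
(the function name is illustrative, not part of the event interface):

```python
def unattributed_machine_clears(all_clears, smc, memory_ordering, fp_assist):
    """Machine clears not attributed to the three individually counted causes.

    machine_clears.all also captures clears from interrupts, traps, and
    faults, so SMC + MEMORY_ORDERING + FP_ASSIST need not sum to ALL.
    """
    return all_clears - (smc + memory_ordering + fp_assist)
```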
no_alloc_cycles.rob_full Counts the number of cycles when no uops are allocated and the
ROB is full (less than 2 entries available).
no_alloc_cycles.mispredicts Counts the number of cycles when no uops are allocated and the
alloc pipe is stalled waiting for a mispredicted jump to
retire. After the misprediction is detected, the front end
will restart immediately, but the allocate pipe stalls until
the mispredicted jump retires.
no_alloc_cycles.rat_stall Counts the number of cycles when no uops are allocated and a
RAT stall is asserted.
no_alloc_cycles.not_delivered The NO_ALLOC_CYCLES.NOT_DELIVERED event is used to measure
front-end inefficiencies, i.e. when front-end of the machine is
not delivering micro-ops to the back-end and the back-end is
not stalled. This event can be used to identify if the machine
is truly front-end bound. When this event occurs, it is an
indication that the front-end of the machine is operating at
less than its theoretical peak performance. Background: We can
think of the processor pipeline as being divided into 2 broader
parts: Front-end and Back-end. Front-end is responsible for
fetching the instruction, decoding into micro-ops (uops) in
machine understandable format and putting them into a micro-op
queue to be consumed by back end. The back-end then takes these
micro-ops, allocates the required resources. When all
resources are ready, micro-ops are executed. If the back-end is
not ready to accept micro-ops from the front-end, then we do
not want to count these as front-end bottlenecks. However,
whenever we have bottlenecks in the back-end, we will have
allocation unit stalls that eventually force the front-end to
wait until the back-end is ready to receive more UOPS. This
event counts the cycles only when the back-end is requesting
more uops and the front-end is not able to provide them. Some
examples of conditions that cause front-end inefficiencies
are: ICache misses, ITLB misses, and decoder restrictions that
limit the front-end bandwidth.
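Dividing this event by unhalted core cycles estimates how front-end bound the
machine is. A sketch under the assumption that both events are sampled over
the same interval (names and values are illustrative):

```python
def frontend_bound_fraction(not_delivered_cycles, unhalted_core_cycles):
    """Fraction of unhalted core cycles in which the back-end wanted uops
    but the front-end could not deliver them:
    no_alloc_cycles.not_delivered / cpu_clk_unhalted.core_p.
    """
    if unhalted_core_cycles == 0:
        return 0.0
    return not_delivered_cycles / unhalted_core_cycles
```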
no_alloc_cycles.all The NO_ALLOC_CYCLES.ALL event counts the number of cycles when
the front-end does not provide any instructions to be allocated
for any reason. This event indicates the cycles where an
allocation stall occurs, and no UOPS are allocated in that
cycle.
rs_full_stall.mec Counts the number of cycles the allocation pipeline is stalled
and is waiting for a free MEC reservation station entry. The
cycles should be appropriately counted in case of the cracked
ops e.g. In case of a cracked load-op, the load portion is sent
to M.
rs_full_stall.all Counts the number of cycles the Alloc pipeline is stalled when
any one of the RSs (IEC, FPC and MEC) is full. This event is a
superset of all the individual RS stall event counts.
inst_retired.any_p This event counts the number of instructions that retire
execution. For instructions that consist of multiple micro-ops,
this event counts the retirement of the last micro-op of the
instruction. The counter continues counting during hardware
interrupts, traps, and inside interrupt handlers.
cycles_div_busy.all Cycles the divider is busy. This event counts the cycles when
the divide unit is unable to accept a new divide UOP because it
is busy processing a previously dispatched UOP. The cycles will
be counted irrespective of whether or not another divide UOP is
waiting to enter the divide unit (from the RS). This event
might count cycles while a divide is in progress even if the RS
is empty. The divide instruction is one of the longest latency
instructions in the machine. Hence, it has a special event
associated with it to help determine if divides are delaying
the retirement of instructions.
cpu_clk_unhalted.core_p This event counts the number of core cycles while the core is
not in a halt state. The core enters the halt state when it is
running the HLT instruction. In mobile systems the core
frequency may change from time to time. For this reason this
event may have a changing ratio with regards to time.
cpu_clk_unhalted.ref This event counts the number of reference cycles that the core
is not in a halt state. The core enters the halt state when it
is running the HLT instruction. In mobile systems the core
frequency may change from time to time. This event is not affected by
core frequency changes but counts as if the core is running at
the maximum frequency all the time.
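The ratio of these two cycle counts converts the nominal maximum frequency
into an average effective frequency. A sketch (function name and the 2.4 GHz
maximum are hypothetical examples, not values from this document):

```python
def effective_frequency_ghz(core_cycles, ref_cycles, max_freq_ghz):
    """Average effective core frequency over the sampling interval.

    cpu_clk_unhalted.core_p ticks at the actual (possibly varying) core
    frequency, while cpu_clk_unhalted.ref counts as if the core always
    ran at its maximum frequency, so their ratio scales max_freq_ghz.
    """
    return (core_cycles / ref_cycles) * max_freq_ghz
```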
l2_reject_xq.all This event counts the number of demand and prefetch
transactions that the L2 XQ rejects due to a full or near full
condition which likely indicates back pressure from the IDI
link. The XQ may reject transactions from the L2Q (non-
cacheable requests), BBS (L2 misses) and WOB (L2 write-back
victims).
core_reject_l2q.all Counts the number of (demand and L1 prefetchers) core requests
rejected by the L2Q due to a full or nearly full condition
which likely indicates back pressure from L2Q. It also counts
requests that would have gone directly to the XQ, but are
rejected due to a full or nearly full condition, indicating
back pressure from the IDI link. The L2Q may also reject
transactions from a core to ensure fairness between cores, or
to delay a core's dirty eviction when the address conflicts
with incoming external snoops. (Note that L2 prefetcher requests
that are dropped are not counted by this event.)
longest_lat_cache.reference This event counts requests originating from the core that
reference a cache line in the L2 cache.
longest_lat_cache.miss This event counts requests originating from the core that
miss in the L2 cache.
icache.accesses This event counts all instruction fetches, not including most
uncacheable fetches.
icache.hit This event counts all instruction fetches from the instruction
cache.
icache.misses This event counts all instruction fetches that miss the
Instruction cache or produce memory requests. This includes
uncacheable fetches. An instruction fetch miss is counted only
once and not once for every cycle it is outstanding.
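These three icache events support a rough miss-ratio estimate. A sketch; note
the asymmetry stated above means this is an approximation, and the function
name and sample counts are illustrative:

```python
def icache_miss_ratio(icache_misses, icache_accesses):
    """Approximate instruction-cache miss ratio.

    icache.misses includes uncacheable fetches while icache.accesses
    excludes most of them, so treat this as an estimate rather than an
    exact hit/miss breakdown.
    """
    if icache_accesses == 0:
        return 0.0
    return icache_misses / icache_accesses
```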
fetch_stall.itlb_fill_pending_cycles Counts cycles that fetch is stalled due to an outstanding ITLB
miss. That is, the decoder queue is able to accept bytes, but
the fetch unit is unable to provide bytes due to an ITLB miss.
Note: this event is not the same as page walk cycles to
retrieve an instruction translation.
fetch_stall.icache_fill_pending_cycles Counts cycles that fetch is stalled due to an outstanding
ICache miss. That is, the decoder queue is able to accept
bytes, but the fetch unit is unable to provide bytes due to an
ICache miss. Note: this event is not the same as the total
number of cycles spent retrieving instruction cache lines from
the memory hierarchy.
fetch_stall.all Counts cycles that fetch is stalled due to any reason. That is,
the decoder queue is able to accept bytes, but the fetch unit
is unable to provide bytes. This will include cycles due to an
ITLB miss, ICache miss and other events.
baclears.all The BACLEARS event counts the number of times the front end is
resteered, mainly when the Branch Prediction Unit cannot
provide a correct prediction and this is corrected by the
Branch Address Calculator at the front end. The BACLEARS.ANY
event counts the number of baclears for any type of branch.
baclears.return The BACLEARS event counts the number of times the front end is
resteered, mainly when the Branch Prediction Unit cannot
provide a correct prediction and this is corrected by the
Branch Address Calculator at the front end. The
BACLEARS.RETURN event counts the number of RETURN baclears.
baclears.cond The BACLEARS event counts the number of times the front end is
resteered, mainly when the Branch Prediction Unit cannot
provide a correct prediction and this is corrected by the
Branch Address Calculator at the front end. The BACLEARS.COND
event counts the number of JCC (Jump on Conditional Code)
baclears.
ms_decoded.ms_entry Counts the number of times the MSROM starts a flow of UOPS. It
does not count every time a UOP is read from the microcode ROM.
The most common case that this counts is when a micro-coded
instruction is encountered by the front end of the machine.
Other cases include when an instruction encounters a fault,
trap, or microcode assist of any sort. The event will count
MSROM startups for UOPS that are speculative, and subsequently
cleared by branch mispredict or machine clear. Background:
UOPS are produced by two mechanisms. Either they are generated
by hardware that decodes instructions into UOPS, or they are
delivered by a ROM (called the MSROM) that holds UOPS
associated with a specific instruction. MSROM UOPS might also
be delivered in response to some condition such as a fault or
other exceptional condition. This event is an excellent
mechanism for detecting instructions that require the use of
MSROM instructions.
decode_restriction.predecode_wrong Counts the number of times a decode restriction reduced the
decode throughput due to wrong instruction length prediction.
rehabq.ld_block_st_forward This event counts the number of retired loads that were
prohibited from receiving forwarded data from the store because
of address mismatch.
rehabq.ld_block_std_notready This event counts the cases where a forward was technically
possible, but did not occur because the store data was not
available at the right time.
rehabq.st_splits This event counts the number of retired stores that experienced
cache line boundary splits.
rehabq.ld_splits This event counts the number of retired loads that experienced
cache line boundary splits.
rehabq.lock This event counts the number of retired memory operations with
lock semantics. These are either implicit locked instructions
such as the XCHG instruction or instructions with an explicit
LOCK prefix (0xF0).
rehabq.sta_full This event counts the number of retired stores that are delayed
because there is not a store address buffer available.
rehabq.any_ld This event counts the number of load uops reissued from Rehabq.
rehabq.any_st This event counts the number of store uops reissued from
Rehabq.
mem_uops_retired.l1_miss_loads This event counts the number of load ops retired that miss in
L1 Data cache. Note that prefetch misses will not be counted.
mem_uops_retired.l2_hit_loads This event counts the number of load ops retired that hit in
the L2.
mem_uops_retired.l2_miss_loads This event counts the number of load ops retired that miss in
the L2.
mem_uops_retired.dtlb_miss_loads This event counts the number of load ops retired that had a
DTLB miss.
mem_uops_retired.utlb_miss This event counts the number of load ops retired that had a
UTLB miss.
mem_uops_retired.hitm This event counts the number of load ops retired that got data
from the other core or from the other module.
mem_uops_retired.all_loads This event counts the number of load ops retired.
mem_uops_retired.all_stores This event counts the number of store ops retired.
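The miss events above combine with all_loads into per-level miss rates. A
minimal sketch for the L1 case (names and values illustrative):

```python
def l1_load_miss_rate(l1_miss_loads, all_loads):
    """Fraction of retired load ops that missed the L1 data cache:
    mem_uops_retired.l1_miss_loads / mem_uops_retired.all_loads.
    Prefetch misses are not included in the numerator.
    """
    if all_loads == 0:
        return 0.0
    return l1_miss_loads / all_loads
```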
page_walks.d_side_walks This event counts when a data (D) page walk is completed or
started. Since a page walk implies a TLB miss, the number of
TLB misses can be counted by counting the number of pagewalks.
page_walks.d_side_cycles This event counts every cycle when a D-side (walks due to a
load) page walk is in progress. Page walk duration divided by
number of page walks is the average duration of page-walks.
page_walks.i_side_walks This event counts when an instruction (I) page walk is
completed or started. Since a page walk implies a TLB miss,
the number of TLB misses can be counted by counting the number
of pagewalks.
page_walks.i_side_cycles This event counts every cycle when an I-side (walks due to an
instruction fetch) page walk is in progress. Page walk duration
divided by number of page walks is the average duration of
page-walks.
page_walks.walks This event counts when a data (D) page walk or an instruction
(I) page walk is completed or started. Since a page walk
implies a TLB miss, the number of TLB misses can be counted by
counting the number of pagewalks.
page_walks.cycles This event counts every cycle when a data (D) page walk or
instruction (I) page walk is in progress. Since a pagewalk
implies a TLB miss, the approximate cost of a TLB miss can be
determined from this event.
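The division the page_walks descriptions call for can be sketched directly
(function name and sample counts are illustrative only):

```python
def avg_page_walk_cycles(walk_cycles, walks):
    """Average page-walk duration: page_walks.cycles / page_walks.walks.

    Since a page walk implies a TLB miss, this approximates the cost of
    a TLB miss in cycles.
    """
    if walks == 0:
        return 0.0
    return walk_cycles / walks
```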
br_inst_retired.all_taken_branches ALL_TAKEN_BRANCHES counts the number of all taken branch
instructions retired. Branch prediction predicts the branch
target and enables the processor to begin executing
instructions long before the branch true execution path is
known. All branches utilize the branch prediction unit (BPU)
for prediction. This unit predicts the target address not only
based on the EIP of the branch but also based on the execution
path through which execution reached this EIP. The BPU can
efficiently predict the following branch types: conditional
branches, direct calls and jumps, indirect calls and jumps,
returns.
SEE ALSO
cpc(3CPC)
https://download.01.org/perfmon/index/
illumos June 18, 2018 illumos