GLM_EVENTS(3CPC)  CPU Performance Counters Library Functions  GLM_EVENTS(3CPC)

NAME


glm_events - processor model specific performance counter events

DESCRIPTION


This manual page describes events specific to the following Intel CPU
models and is derived from Intel's perfmon data. For more information,
please consult the Intel Software Developer's Manual or Intel's perfmon
website.

CPU models described by this document:

+o Family 0x6, Model 0x5f

+o Family 0x6, Model 0x5c

The following events are supported; the EXAMPLES section below shows
how these events can be counted with cpc(3CPC):

ld_blocks.data_unknown
Counts loads blocked from using a store forward because the
store data was not available at the right time. The forward
might occur subsequently when the data becomes available.

ld_blocks.store_forward
Counts loads blocked from using a store forward because of an
address/size mismatch. Only one of the loads blocked from each
store is counted.
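
The program below is a minimal sketch of a pattern that can
raise this event: a narrow store followed by a wider,
overlapping load defeats store-to-load forwarding. It is
illustrative only; the union is volatile so that an optimizing
compiler does not combine the accesses.

    int
    main(void)
    {
        volatile union { char c[8]; long long l; } u;
        long long x = 0;
        int i;

        for (i = 0; i < 1000000; i++) {
            u.c[0] = (char)i;    /* 1-byte store */
            x = u.l;             /* 8-byte overlapping load */
        }
        return ((int)x & 1);
    }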

ld_blocks.4k_alias
Counts loads that block because their address modulo 4K matches
a pending store.

ld_blocks.utlb_miss
Counts loads blocked because they are unable to find their
physical address in the micro TLB (UTLB).

ld_blocks.all_block
Counts retired loads that were blocked for any reason.

page_walks.d_side_cycles
Counts every core cycle when a data-side page walk (a walk due
to a data memory operation) is in progress.

page_walks.i_side_cycles
Counts every core cycle when an instruction-side page walk (a
walk due to an instruction fetch) is in progress.

page_walks.cycles
Counts every core cycle a page-walk is in progress due to
either a data memory operation or an instruction fetch.

uops_issued.any
Counts uops issued by the front end and allocated into the back
end of the machine. This event counts uops that retire as well
as uops that were speculatively executed but did not retire.
The sorts of speculative uops that might be counted include,
but are not limited to, uops issued in the shadow of a
mispredicted branch, uops that are inserted during an assist
(such as for a denormal floating point result), and (previously
allocated) uops that might be canceled during a machine clear.

misalign_mem_ref.load_page_split
Counts when a memory load uop that spans a page boundary (a
split) is retired.

misalign_mem_ref.store_page_split
Counts when a memory store uop that spans a page boundary (a
split) is retired.

longest_lat_cache.miss
Counts memory requests originating from the core that miss in
the L2 cache.

longest_lat_cache.reference
Counts memory requests originating from the core that reference
a cache line in the L2 cache.
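
Taken over the same interval, this event and
longest_lat_cache.miss yield a simple derived metric; the
arithmetic follows directly from the two definitions, and the
sample numbers are purely illustrative:

    L2 miss ratio = longest_lat_cache.miss /
                    longest_lat_cache.reference

For example, 2,000 misses against 50,000 references gives a
miss ratio of 0.04, i.e. 4% of the core's L2 references missed.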

l2_reject_xq.all
Counts the number of demand and prefetch transactions that the
L2 XQ rejects due to a full or near full condition which likely
indicates back pressure from the intra-die interconnect (IDI)
fabric. The XQ may reject transactions from the L2Q (non-
cacheable requests), L2 misses and L2 write-back victims.

core_reject_l2q.all
Counts the number of demand and L1 prefetcher requests rejected
by the L2Q due to a full or nearly full condition which likely
indicates back pressure from the L2Q. It also counts requests that
would have gone directly to the XQ, but are rejected due to a
full or nearly full condition, indicating back pressure from
the IDI link. The L2Q may also reject transactions from a core
to ensure fairness between cores, or to delay a core's dirty
eviction when the address conflicts with incoming external
snoops.

cpu_clk_unhalted.core_p
Counts core cycles when the core is not halted. This event
uses a (_P)rogrammable general purpose performance counter.

cpu_clk_unhalted.ref
Counts reference cycles when the core is not halted. This
event uses a programmable general purpose performance counter.
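
Sampled over the same interval, cpu_clk_unhalted.core_p and
cpu_clk_unhalted.ref together give a rough view of dynamic
frequency scaling; the interpretation below is a commonly used
derived metric, not something defined by this page:

    average unhalted core frequency
        ~= (core_p / ref) * reference frequency

A ratio above 1 suggests time spent in a turbo state; a ratio
below 1 suggests throttling.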

dl1.dirty_eviction
Counts when a modified (dirty) cache line is evicted from the
data L1 cache and needs to be written back to memory. No count
will occur if the evicted line is clean, and hence does not
require a writeback.

icache.hit
Counts requests to the Instruction Cache (ICache) for one or
more bytes of a cache line when that line is present in the
ICache (a hit). The event strives to count on a cache line
basis, so that multiple accesses which hit in a single cache
line count as one ICACHE.HIT. Specifically, the event counts
when straight line code crosses the cache line boundary, or
when a branch target is to a new line, and that cache line is
in the ICache. This event counts differently than it does on
Intel processors based on the Silvermont microarchitecture.

icache.misses
Counts requests to the Instruction Cache (ICache) for one or
more bytes of a cache line when that line is not present in the
ICache (a miss). The event strives to count on a cache line
basis, so that multiple accesses which miss in a single cache
line count as one ICACHE.MISS. Specifically, the event counts
when straight line code crosses the cache line boundary, or
when a branch target is to a new line, and that cache line is
not in the ICache. This event counts differently than it does
on Intel processors based on the Silvermont microarchitecture.

icache.accesses
Counts requests to the Instruction Cache (ICache) for one or
more bytes of a cache line. The event strives to count on a
cache line basis, so that multiple fetches to a single cache
line count as one ICACHE.ACCESS. Specifically, the event
counts when accesses from straight line code cross the cache
line boundary, or when a branch target is to a new line. This
event counts differently than it does on Intel processors
based on the Silvermont microarchitecture.
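
As with the longest_lat_cache events above, the miss and
access counts can be combined into a simple derived ratio
(illustrative arithmetic, not something defined by this page):

    ICache miss rate = icache.misses / icache.accesses

Since both events strive to count per cache line, this is a
per-line miss rate rather than a per-fetch one.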

itlb.miss
Counts the number of times the machine was unable to find a
translation in the Instruction Translation Lookaside Buffer
(ITLB) for a linear address of an instruction fetch. It counts
when new translations are filled into the ITLB. The event is
speculative in nature, but will not count translations (page
walks) that are begun but not finished, or translations that
are finished but not filled into the ITLB.

fetch_stall.all
Counts cycles that fetch is stalled for any reason. That is,
the decoder queue is able to accept bytes, but the fetch unit
is unable to provide bytes. This includes cycles due to an
ITLB miss, an ICache miss, and other events.

fetch_stall.itlb_fill_pending_cycles
Counts cycles that fetch is stalled due to an outstanding ITLB
miss. That is, the decoder queue is able to accept bytes, but
the fetch unit is unable to provide bytes due to an ITLB miss.
Note: this event is not the same as page walk cycles to
retrieve an instruction translation.

fetch_stall.icache_fill_pending_cycles
Counts cycles that fetch is stalled due to an outstanding
ICache miss. That is, the decoder queue is able to accept
bytes, but the fetch unit is unable to provide bytes due to an
ICache miss. Note: this event is not the same as the total
number of cycles spent retrieving instruction cache lines from
the memory hierarchy.

uops_not_delivered.any
This event is used to measure front-end inefficiencies, i.e.
cycles in which the front-end of the machine is not delivering
uops to the back-end and the back-end is not stalled. This
event can be used to identify if the machine is truly front-end
bound. When this event occurs, it is an indication that the
front-end of the machine is operating at less than its
theoretical peak performance. Background: we can think of the
processor pipeline as being divided into 2 broader parts:
front-end and back-end. The front-end is responsible for
fetching instructions, decoding them into uops in a machine-
understandable format, and putting them into a uop queue to be
consumed by the back-end. The back-end then takes these uops
and allocates the required resources. When all resources are
ready, uops are executed. If the back-end is not ready to
accept uops from the front-end, then we do not want to count
these as front-end bottlenecks. However, whenever we have
bottlenecks in the back-end, we will have allocation unit
stalls that eventually force the front-end to wait until the
back-end is ready to receive more uops. This event counts only
when the back-end is requesting more uops and the front-end is
not able to provide them. When 3 uops are requested and no
uops are delivered, the event counts 3. When 3 are requested,
and only 1 is delivered, the event counts 2. When only 2 are
delivered, the event counts 1. Alternatively stated, the event
will not count if 3 uops are delivered, or if the back-end is
stalled and not requesting any uops at all. Counts indicate
missed opportunities for the front-end to deliver a uop to the
back-end. Some examples of conditions that cause front-end
inefficiencies are: ICache misses, ITLB misses, and decoder
restrictions that limit the front-end bandwidth. Known issues:
some uops require multiple allocation slots. These uops will
not be charged as a front-end 'not delivered' opportunity, and
will be regarded as a back-end problem. For example, the INC
instruction has one uop that requires 2 issue slots. A stream
of INC instructions will not count as UOPS_NOT_DELIVERED, even
though only one instruction can be issued per clock. The low
uop issue rate for a stream of INC instructions is considered
to be a back-end issue.
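
Given the 3 allocation slots per cycle implied by the
description above, a standard top-down style calculation (a
derived metric, not defined by this page) is:

    front-end bound ~= uops_not_delivered.any /
                       (3 * cpu_clk_unhalted.core_p)

A value near 0 means the back-end almost always received the
uops it requested; a value near 1 means most requested issue
slots went unfilled.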

inst_retired.any_p
Counts the number of instructions that retire execution. For
instructions that consist of multiple uops, this event counts
the retirement of the last uop of the instruction. The event
continues counting during hardware interrupts, traps, and
inside interrupt handlers. This is an architectural
performance event. This event uses a (_P)rogrammable general
purpose performance counter. *This event is Precise Event
capable: The EventingRIP field in the PEBS record is precise
to the address of the instruction which caused the event.
Note: Because PEBS records can be collected only on IA32_PMC0,
only one event can use the PEBS facility at a time.

uops_retired.any
Counts uops which retired.

uops_retired.ms
Counts uops retired that are from the complex flows issued by
the micro-sequencer (MS). Counts both the uops from a micro-
coded instruction, and the uops that might be generated from a
micro-coded assist.

uops_retired.fpdiv
Counts the number of floating point divide uops retired.

uops_retired.idiv
Counts the number of integer divide uops retired.

machine_clears.all
Counts machine clears for any reason.

machine_clears.smc
Counts the number of times that the processor detects that a
program is writing to a code section and has to perform a
machine clear because of that modification. Self-modifying
code (SMC) causes a severe penalty in all Intel(R) architecture
processors.

machine_clears.memory_ordering
Counts machine clears due to memory ordering issues. This
occurs when a snoop request happens and the machine is
uncertain if memory ordering will be preserved as another core
is in the process of modifying the data.

machine_clears.fp_assist
Counts machine clears due to floating point (FP) operations
needing assists. For instance, if the result was a floating
point denormal, the hardware clears the pipeline and reissues
uops to produce the correct IEEE compliant denormal result.

machine_clears.disambiguation
Counts machine clears due to memory disambiguation. Memory
disambiguation happens when a load which has been issued
conflicts with a previous unretired store in the pipeline whose
address was not known at issue time, but is later resolved to
be the same as the load address.

br_inst_retired.all_branches
Counts branch instructions retired for all branch types. This
is an architectural performance event.

br_inst_retired.jcc
Counts Jcc (Jump on Conditional Code/Jump if Condition is Met)
branch instructions retired, including both when the branch was
taken and when it was not taken.

br_inst_retired.all_taken_branches
Counts the number of taken branch instructions retired.

br_inst_retired.far_branch
Counts far branch instructions retired. This includes far
jump, far call and return, and interrupt call and return.

br_inst_retired.non_return_ind
Counts near indirect call or near indirect jmp branch
instructions retired.

br_inst_retired.return
Counts near return branch instructions retired.

br_inst_retired.call
Counts near CALL branch instructions retired.

br_inst_retired.ind_call
Counts near indirect CALL branch instructions retired.

br_inst_retired.rel_call
Counts near relative CALL branch instructions retired.

br_inst_retired.taken_jcc
Counts Jcc (Jump on Conditional Code/Jump if Condition is Met)
branch instructions retired that were taken; it does not count
Jcc branch instructions that were not taken.

br_misp_retired.all_branches
Counts mispredicted branch instructions retired, including all
branch types.

br_misp_retired.jcc
Counts mispredicted Jcc (Jump on Conditional Code/Jump if
Condition is Met) branch instructions retired, including both
when the branch was supposed to be taken and when it was not
supposed to be taken (but the processor predicted the opposite
condition).

br_misp_retired.non_return_ind
Counts mispredicted branch instructions retired that were near
indirect call or near indirect jmp, where the target address
taken was not what the processor predicted.

br_misp_retired.return
Counts mispredicted near RET branch instructions retired, where
the return address taken was not what the processor predicted.

br_misp_retired.ind_call
Counts mispredicted near indirect CALL branch instructions
retired, where the target address taken was not what the
processor predicted.

br_misp_retired.taken_jcc
Counts mispredicted Jcc (Jump on Conditional Code/Jump if
Condition is Met) branch instructions retired that were
supposed to be taken but that the processor predicted would
not be taken.

issue_slots_not_consumed.any
Counts the number of issue slots per core cycle that were not
consumed by the backend due to either a full resource in the
backend (RESOURCE_FULL) or due to the processor recovering from
some event (RECOVERY).
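
As with uops_not_delivered.any above, a standard top-down
style calculation (again a derived metric, not defined by this
page) expresses these lost slots as a fraction of total issue
capacity, assuming 3 issue slots per cycle:

    back-end bound ~= issue_slots_not_consumed.any /
                      (3 * cpu_clk_unhalted.core_p)

The resource_full and recovery subevents below break this
total down by cause.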

issue_slots_not_consumed.resource_full
Counts the number of issue slots per core cycle that were not
consumed because of a full resource in the backend. These
resources include, but are not limited to, the Re-order Buffer
(ROB), reservation stations (RS), load/store buffers, physical
registers, or any other needed machine resource that is
currently unavailable. Note that uops must be available for
consumption in order for this event to fire. If a uop is not
available (Instruction Queue is empty), this event will not
count.

issue_slots_not_consumed.recovery
Counts the number of issue slots per core cycle that were not
consumed by the backend because allocation is stalled waiting
for a mispredicted jump to retire or other branch-like
conditions (e.g. the event is relevant during certain microcode
flows). Counts all issue slots blocked while within this
window including slots where uops were not available in the
Instruction Queue.

hw_interrupts.received
Counts hardware interrupts received by the processor.

hw_interrupts.masked
Counts the number of core cycles during which interrupts are
masked (disabled). Increments by 1 each core cycle that
EFLAGS.IF is 0, regardless of whether interrupts are pending or
not.

hw_interrupts.pending_and_masked
Counts core cycles during which there are pending interrupts,
but interrupts are masked (EFLAGS.IF = 0).

cycles_div_busy.all
Counts core cycles when either divide unit is busy.

cycles_div_busy.idiv
Counts core cycles the integer divide unit is busy.

cycles_div_busy.fpdiv
Counts core cycles the floating point divide unit is busy.

mem_uops_retired.dtlb_miss_loads
Counts load uops retired that caused a DTLB miss.

mem_uops_retired.dtlb_miss_stores
Counts store uops retired that caused a DTLB miss.

mem_uops_retired.dtlb_miss
Counts uops retired that had a DTLB miss on load, store or
either. Note that when two distinct memory operations to the
same page miss the DTLB, only one of them will be recorded as a
DTLB miss.

mem_uops_retired.lock_loads
Counts locked memory uops retired. This includes regular locks
and bus locks. (To specifically count bus locks only, see the
Offcore response event.) A locked access is one with a lock
prefix, or an exchange to memory. See the SDM for a complete
description of which memory load accesses are locks.

mem_uops_retired.split_loads
Counts load uops retired where the data requested spans a 64
byte cache line boundary.

mem_uops_retired.split_stores
Counts store uops retired where the data requested spans a 64
byte cache line boundary.

mem_uops_retired.split
Counts memory uops retired where the data requested spans a 64
byte cache line boundary.

mem_uops_retired.all_loads
Counts the number of load uops retired.

mem_uops_retired.all_stores
Counts the number of store uops retired.

mem_uops_retired.all
Counts the number of memory uops retired that are loads,
stores, or both.

mem_load_uops_retired.l1_hit
Counts load uops retired that hit the L1 data cache.

mem_load_uops_retired.l2_hit
Counts load uops retired that hit in the L2 cache.

mem_load_uops_retired.l1_miss
Counts load uops retired that miss the L1 data cache.

mem_load_uops_retired.l2_miss
Counts load uops retired that miss in the L2 cache.

mem_load_uops_retired.hitm
Counts load uops retired where the cache line containing the
data was in the modified state of another core's or module's
cache (HITM). More specifically, this means that when the load
address was checked by other caching agents (typically another
processor) in the system, one of those caching agents indicated
that they had a dirty copy of the data. Loads that obtain a
HITM response incur greater latency than is typical for a
load. In addition, since HITM indicates that some other
processor had this data in its cache, it implies that the data
was shared between processors, or potentially was a lock or
semaphore value. This event is useful for locating sharing,
false sharing, and contended locks; the EXAMPLES section below
includes a sketch of code that provokes this behavior.

mem_load_uops_retired.wcb_hit
Counts memory load uops retired where the data is retrieved
from the WCB (or fill buffer), indicating that the load found
its data while that data was in the process of being brought
into the L1 cache. Typically a load will receive this
indication when some other load or prefetch missed the L1 cache
and was in the process of retrieving the cache line containing
the data, but that process had not yet finished (and written
the data back to the cache). For example, consider loads X and
Y, both referencing the same cache line that is not in the L1
cache. If load X misses the cache first, it obtains a WCB (or
fill buffer) entry and begins the process of requesting the
data. When load Y requests the data, it will either hit the
WCB or the L1 cache, depending on exactly when request Y
occurs.

mem_load_uops_retired.dram_hit
Counts memory load uops retired where the data is retrieved
from DRAM. The event is counted at retirement, so speculative
loads are ignored. A memory load can hit (or miss) the L1
cache, hit (or miss) the L2 cache, hit DRAM, hit in the WCB or
receive a HITM response.

baclears.all
Counts the number of times a BACLEAR is signaled for any
reason, including, but not limited to, indirect branches/calls,
Jcc (Jump on Conditional Code/Jump if Condition is Met)
branches, unconditional branches/calls, and returns.

baclears.return
Counts BACLEARS on return instructions.

baclears.cond
Counts BACLEARS on Jcc (Jump on Conditional Code/Jump if
Condition is Met) branches.

ms_decoded.ms_entry
Counts the number of times the Microcode Sequencer (MS) starts
a flow of uops from the MSROM. It does not count every time a
uop is read from the MSROM. The most common case that this
counts is when a micro-coded instruction is encountered by the
front end of the machine. Other cases include when an
instruction encounters a fault, trap, or microcode assist of
any sort that initiates a flow of uops. The event will count
MS startups for uops that are speculative, and subsequently
cleared by a branch mispredict or a machine clear.

decode_restriction.predecode_wrong
Counts the number of times the prediction (from the predecode
cache) for instruction length is incorrect.
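
EXAMPLES


The following program is a minimal sketch of counting one of
these events for the current LWP with the cpc(3CPC)
interfaces; the event name is passed to cpc_set_add_request()
exactly as listed above, and workload() is a hypothetical
stand-in for the code being measured. Compile with -lcpc.

    #include <libcpc.h>
    #include <inttypes.h>
    #include <stdio.h>

    /* Hypothetical stand-in for the code being measured. */
    static void
    workload(void)
    {
        volatile int i;

        for (i = 0; i < 1000000; i++)
            ;
    }

    int
    main(void)
    {
        cpc_t *cpc;
        cpc_set_t *set;
        cpc_buf_t *buf;
        uint64_t val;
        int idx;

        if ((cpc = cpc_open(CPC_VER_CURRENT)) == NULL ||
            (set = cpc_set_create(cpc)) == NULL) {
            perror("cpc setup");
            return (1);
        }

        /* Count user-mode L2 misses on this LWP. */
        if ((idx = cpc_set_add_request(cpc, set,
            "longest_lat_cache.miss", 0, CPC_COUNT_USER,
            0, NULL)) == -1 ||
            (buf = cpc_buf_create(cpc, set)) == NULL ||
            cpc_bind_curlwp(cpc, set, 0) == -1) {
            perror("cpc request");
            return (1);
        }

        workload();

        if (cpc_set_sample(cpc, set, buf) == -1) {
            perror("cpc_set_sample");
            return (1);
        }
        (void) cpc_buf_get(cpc, buf, idx, &val);
        (void) printf("longest_lat_cache.miss: %llu\n",
            (unsigned long long)val);

        (void) cpc_unbind(cpc, set);
        (void) cpc_close(cpc);
        return (0);
    }

The second program is a hypothetical workload, independent of
this interface, that tends to inflate
mem_load_uops_retired.hitm when its two threads run on
different cores: both fields share a single 64-byte cache
line, so each thread's loads frequently find the line dirty in
the other core's cache (false sharing). Compile with
-lpthread.

    #include <pthread.h>
    #include <stdio.h>

    /* Two hot fields deliberately packed into one cache line. */
    static struct {
        long a;
        long b;
    } shared;

    static void *
    bump(void *arg)
    {
        volatile long *p = arg;
        long i;

        for (i = 0; i < 100000000L; i++)
            (*p)++;    /* load-increment-store on the shared line */
        return (NULL);
    }

    int
    main(void)
    {
        pthread_t t1, t2;

        (void) pthread_create(&t1, NULL, bump, &shared.a);
        (void) pthread_create(&t2, NULL, bump, &shared.b);
        (void) pthread_join(t1, NULL);
        (void) pthread_join(t2, NULL);
        (void) printf("%ld %ld\n", shared.a, shared.b);
        return (0);
    }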

SEE ALSO


cpc(3CPC)

https://download.01.org/perfmon/index/

illumos June 18, 2018 illumos
