HSW_EVENTS(3CPC) CPU Performance Counters Library Functions
NAME
hsw_events - processor model specific performance counter events
DESCRIPTION
This manual page describes events specific to the following Intel CPU
models and is derived from Intel's perfmon data. For more information,
please consult the Intel Software Developer's Manual or Intel's perfmon
website.
CPU models described by this document:
+o Family 0x6, Model 0x46
+o Family 0x6, Model 0x45
+o Family 0x6, Model 0x3c
The following events are supported:
ld_blocks.store_forward This event counts loads that followed a store to the same
address, where the data could not be forwarded inside the
pipeline from the store to the load. The most common reason
why store forwarding would be blocked is when a load's address
range overlaps with a preceding smaller uncompleted store. The
penalty for blocked store forwarding is that the load must wait
for the store to write its value to the cache before it can be
issued.
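The overlapping-smaller-store pattern described above can be reproduced with a short C fragment (illustrative only; the function name is hypothetical):

```c
#include <stdint.h>

/*
 * Hypothetical illustration: a 1-byte store immediately followed by a
 * 4-byte load over the same address.  The load's range overlaps a
 * smaller, not-yet-completed store, so the store cannot be forwarded
 * and ld_blocks.store_forward increments while the load waits for the
 * store to reach the cache.
 */
uint32_t
narrow_store_wide_load(uint32_t *p)
{
	*(uint8_t *)p = 0xff;	/* narrow 1-byte store */
	return (*p);		/* wider 4-byte load over the same bytes */
}
```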
ld_blocks.no_sr The number of times that split load operations are temporarily
blocked because all resources for handling the split accesses
are in use.
misalign_mem_ref.loads Speculative cache-line split load uops dispatched to L1D.
misalign_mem_ref.stores Speculative cache-line split store-address uops dispatched to
L1D.
ld_blocks_partial.address_alias Aliasing occurs when a load is issued after a store and their
memory addresses are offset by 4K. This event counts the
number of loads that aliased with a preceding store, resulting
in an extended address check in the pipeline which can have a
performance impact.
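A sketch of an access pattern that triggers this 4K aliasing (helper name and sizes are hypothetical):

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical sketch: stores and loads whose addresses differ by
 * exactly 4096 bytes.  Because the low 12 address bits match, each
 * load falsely aliases the store issued just before it, and
 * ld_blocks_partial.address_alias counts the extended address checks.
 */
long
aliased_copy(void)
{
	enum { N = 1024 };
	char *buf = malloc(2 * 4096);
	long sum = 0;
	int i;

	memset(buf, 1, 2 * 4096);
	for (i = 0; i < N; i++) {
		buf[i] = 2;		/* store to address X */
		sum += buf[i + 4096];	/* load from X + 4K: false alias */
	}
	free(buf);
	return (sum);
}
```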
dtlb_load_misses.miss_causes_a_walk Misses in all TLB levels that cause a page walk of any page
size.
dtlb_load_misses.walk_completed_4k Completed page walks due to demand load misses that caused 4K
page walks in any TLB levels.
dtlb_load_misses.walk_completed_2m_4m Completed page walks due to demand load misses that caused
2M/4M page walks in any TLB levels.
dtlb_load_misses.walk_completed_1g Load miss in all TLB levels causes a page walk that completes.
(1G)
dtlb_load_misses.walk_completed Completed page walks in any TLB of any page size due to demand
load misses.
dtlb_load_misses.walk_duration This event counts cycles when the page miss handler (PMH) is
servicing page walks caused by DTLB load misses.
dtlb_load_misses.stlb_hit_4k This event counts load operations from a 4K page that miss the
first DTLB level but hit the second and do not cause page
walks.
dtlb_load_misses.stlb_hit_2m This event counts load operations from a 2M page that miss the
first DTLB level but hit the second and do not cause page
walks.
dtlb_load_misses.stlb_hit Number of cache load STLB hits. No page walk.
dtlb_load_misses.pde_cache_miss DTLB demand load misses with low part of linear-to-physical
address translation missed.
int_misc.recovery_cycles This event counts the number of cycles spent waiting for a
recovery after an event such as a processor nuke, JEClear,
assist, hle/rtm abort etc.
int_misc.recovery_cycles_any Core cycles the allocator was stalled due to recovery from
earlier clear event for any thread running on the physical core
(e.g. misprediction or memory nuke).
uops_issued.any This event counts the number of uops issued by the Front-end of
the pipeline to the Back-end. This event is counted at the
allocation stage and will count both retired and non-retired
uops.
uops_issued.stall_cycles Cycles when Resource Allocation Table (RAT) does not issue Uops
to Reservation Station (RS) for the thread.
uops_issued.core_stall_cycles Cycles when Resource Allocation Table (RAT) does not issue Uops
to Reservation Station (RS) for all threads.
uops_issued.flags_merge Number of flags-merge uops allocated. Such uops add delay.
uops_issued.slow_lea Number of slow LEA or similar uops allocated. Such a uop has 3
sources (for example, 2 sources plus an immediate) regardless of
whether it is the result of an LEA instruction or not.
uops_issued.single_mul Number of multiply packed/scalar single precision uops
allocated.
arith.divider_uops Any uop executed by the Divider. (This includes all divide
uops, sqrt, ...)
l2_rqsts.demand_data_rd_miss Demand data read requests that missed L2, no rejects.
The following errata may apply to this: HSD78, HSM80
l2_rqsts.rfo_miss Counts the number of store RFO requests that miss the L2 cache.
l2_rqsts.code_rd_miss Number of instruction fetches that missed the L2 cache.
l2_rqsts.all_demand_miss Demand requests that miss L2 cache.
The following errata may apply to this: HSD78, HSM80
l2_rqsts.l2_pf_miss Counts all L2 HW prefetcher requests that missed L2.
l2_rqsts.miss All requests that missed L2.
The following errata may apply to this: HSD78, HSM80
l2_rqsts.demand_data_rd_hit Counts the number of demand Data Read requests, initiated by
load instructions, that hit the L2 cache.
The following errata may apply to this: HSD78, HSM80
l2_rqsts.rfo_hit Counts the number of store RFO requests that hit the L2 cache.
l2_rqsts.code_rd_hit Number of instruction fetches that hit the L2 cache.
l2_rqsts.l2_pf_hit Counts all L2 HW prefetcher requests that hit L2.
l2_rqsts.all_demand_data_rd Counts any demand and L1 HW prefetch data load requests to L2.
The following errata may apply to this: HSD78, HSM80
l2_rqsts.all_rfo Counts all L2 store RFO requests.
l2_rqsts.all_code_rd Counts all L2 code requests.
l2_rqsts.all_demand_references Demand requests to L2 cache.
The following errata may apply to this: HSD78, HSM80
l2_rqsts.all_pf Counts all L2 HW prefetcher requests.
l2_rqsts.references All requests to L2 cache.
The following errata may apply to this: HSD78, HSM80
l2_demand_rqsts.wb_hit Not rejected writebacks that hit L2 cache.
longest_lat_cache.miss This event counts each cache miss condition for references to
the last level cache.
longest_lat_cache.reference This event counts requests originating from the core that
reference a cache line in the last level cache.
cpu_clk_unhalted.thread_p Counts the number of thread cycles while the thread is not in a
halt state. The thread enters the halt state when it is running
the HLT instruction. The core frequency may change from time to
time due to power or thermal throttling.
cpu_clk_unhalted.thread_p_any Core cycles when at least one thread on the physical core is
not in halt state.
cpu_clk_thread_unhalted.ref_xclk Increments at the frequency of XCLK (100 MHz) when not halted.
cpu_clk_thread_unhalted.ref_xclk_any Reference cycles when at least one thread on the physical core
is unhalted (counts at 100 MHz rate).
cpu_clk_unhalted.ref_xclk Reference cycles when the thread is unhalted. (counts at 100
MHz rate)
cpu_clk_unhalted.ref_xclk_any Reference cycles when at least one thread on the physical core
is unhalted (counts at 100 MHz rate).
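Because cpu_clk_unhalted.thread_p advances at the (varying) core frequency while the ref_xclk counters tick at a fixed 100 MHz, the ratio of the two gives the average unhalted core frequency over a sampling interval. A minimal sketch (hypothetical helper name):

```c
#include <stdint.h>

/*
 * Average core frequency over a sampling interval:
 * elapsed unhalted time = ref_xclk / 100 MHz, so
 * average frequency (MHz) = thread_p / ref_xclk * 100.
 */
double
avg_core_freq_mhz(uint64_t thread_p, uint64_t ref_xclk)
{
	if (ref_xclk == 0)
		return (0.0);
	return ((double)thread_p / (double)ref_xclk * 100.0);
}
```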
cpu_clk_thread_unhalted.one_thread_active Count XClk pulses when this thread is unhalted and the other
thread is halted.
cpu_clk_unhalted.one_thread_active Count XClk pulses when this thread is unhalted and the other
thread is halted.
l1d_pend_miss.pending Increments the number of outstanding L1D misses every cycle.
Set Cmask = 1 and Edge =1 to count occurrences.
l1d_pend_miss.pending_cycles Cycles with L1D load Misses outstanding.
l1d_pend_miss.pending_cycles_any Cycles with L1D load Misses outstanding from any thread on
physical core.
l1d_pend_miss.request_fb_full Number of times a request needed a Fill Buffer (FB) entry but
no entry was available for it; that is, FB unavailability was
the dominant reason for blocking the request. A request
includes cacheable and uncacheable demand accesses (load,
store, or SW prefetch); HW prefetches are excluded.
l1d_pend_miss.fb_full Cycles a demand request was blocked due to Fill Buffer
unavailability.
dtlb_store_misses.miss_causes_a_walk Miss in all TLB levels causes a page walk of any page size
(4K/2M/4M/1G).
dtlb_store_misses.walk_completed_4k Completed page walks due to store misses in one or more TLB
levels of 4K page structure.
dtlb_store_misses.walk_completed_2m_4m Completed page walks due to store misses in one or more TLB
levels of 2M/4M page structure.
dtlb_store_misses.walk_completed_1g Store misses in all DTLB levels that cause completed page
walks. (1G)
dtlb_store_misses.walk_completed Completed page walks due to store miss in any TLB levels of any
page size (4K/2M/4M/1G).
dtlb_store_misses.walk_duration This event counts cycles when the page miss handler (PMH) is
servicing page walks caused by DTLB store misses.
dtlb_store_misses.stlb_hit_4k This event counts store operations from a 4K page that miss the
first DTLB level but hit the second and do not cause page
walks.
dtlb_store_misses.stlb_hit_2m This event counts store operations from a 2M page that miss the
first DTLB level but hit the second and do not cause page
walks.
dtlb_store_misses.stlb_hit Store operations that miss the first TLB level but hit the
second and do not cause page walks.
dtlb_store_misses.pde_cache_miss DTLB store misses with low part of linear-to-physical address
translation missed.
load_hit_pre.sw_pf Non-SW-prefetch load dispatches that hit fill buffer allocated
for S/W prefetch.
load_hit_pre.hw_pf Non-SW-prefetch load dispatches that hit fill buffer allocated
for H/W prefetch.
ept.walk_cycles Cycle count for an Extended Page table walk.
l1d.replacement This event counts when new data lines are brought into the L1
Data cache, which cause other lines to be evicted from the
cache.
tx_mem.abort_conflict Number of times a transactional abort was signaled due to a
data conflict on a transactionally accessed address.
tx_mem.abort_capacity_write Number of times a transactional abort was signaled due to a
data capacity limitation for transactional writes.
tx_mem.abort_hle_store_to_elided_lock Number of times a HLE transactional region aborted due to a non
XRELEASE prefixed instruction writing to an elided lock in the
elision buffer.
tx_mem.abort_hle_elision_buffer_not_empty Number of times an HLE transactional execution aborted due to
NoAllocatedElisionBuffer being non-zero.
tx_mem.abort_hle_elision_buffer_mismatch Number of times an HLE transactional execution aborted due to
XRELEASE lock not satisfying the address and value requirements
in the elision buffer.
tx_mem.abort_hle_elision_buffer_unsupported_alignment Number of times an HLE transactional execution aborted due to
an unsupported read alignment from the elision buffer.
tx_mem.hle_elision_buffer_full Number of times HLE lock could not be elided due to
ElisionBufferAvailable being zero.
move_elimination.int_eliminated Number of integer move elimination candidate uops that were
eliminated.
move_elimination.simd_eliminated Number of SIMD move elimination candidate uops that were
eliminated.
move_elimination.int_not_eliminated Number of integer move elimination candidate uops that were not
eliminated.
move_elimination.simd_not_eliminated Number of SIMD move elimination candidate uops that were not
eliminated.
cpl_cycles.ring0 Unhalted core cycles when the thread is in ring 0.
cpl_cycles.ring0_trans Number of intervals between processor halts while thread is in
ring 0.
cpl_cycles.ring123 Unhalted core cycles when the thread is not in ring 0.
tx_exec.misc1 Counts the number of times a class of instructions that may
cause a transactional abort was executed. Since this is the
count of execution, it may not always cause a transactional
abort.
tx_exec.misc2 Counts the number of times a class of instructions (e.g.,
vzeroupper) that may cause a transactional abort was executed
inside a transactional region.
tx_exec.misc3 Counts the number of times an instruction execution caused the
transactional nest count supported to be exceeded.
tx_exec.misc4 Counts the number of times a XBEGIN instruction was executed
inside an HLE transactional region.
tx_exec.misc5 Counts the number of times an HLE XACQUIRE instruction was
executed inside an RTM transactional region.
rs_events.empty_cycles This event counts cycles when the Reservation Station (RS) is
empty for the thread. The RS is a structure that buffers
allocated micro-ops from the Front-end. If there are many
cycles when the RS is empty, it may represent an underflow of
instructions delivered from the Front-end.
rs_events.empty_end Counts end of periods where the Reservation Station (RS) was
empty. Could be useful to precisely locate Frontend Latency
Bound issues.
offcore_requests_outstanding.demand_data_rd Offcore outstanding demand data read transactions in SQ to
uncore. Set Cmask=1 to count cycles.
The following errata may apply to this: HSD78, HSD62, HSD61,
HSM63, HSM80
offcore_requests_outstanding.cycles_with_demand_data_rd Cycles when offcore outstanding Demand Data Read transactions
are present in SuperQueue (SQ), queue to uncore.
The following errata may apply to this: HSD78, HSD62, HSD61,
HSM63, HSM80
offcore_requests_outstanding.demand_data_rd_ge_6 Cycles with at least 6 offcore outstanding Demand Data Read
transactions in uncore queue.
The following errata may apply to this: HSD78, HSD62, HSD61,
HSM63, HSM80
offcore_requests_outstanding.demand_code_rd Offcore outstanding Demand code Read transactions in SQ to
uncore. Set Cmask=1 to count cycles.
The following errata may apply to this: HSD62, HSD61, HSM63
offcore_requests_outstanding.demand_rfo Offcore outstanding RFO store transactions in SQ to uncore. Set
Cmask=1 to count cycles.
The following errata may apply to this: HSD62, HSD61, HSM63
offcore_requests_outstanding.cycles_with_demand_rfo Offcore outstanding demand RFO transactions present in the
SuperQueue (SQ), queue to uncore, every cycle.
The following errata may apply to this: HSD62, HSD61, HSM63
offcore_requests_outstanding.all_data_rd Offcore outstanding cacheable data read transactions in SQ to
uncore. Set Cmask=1 to count cycles.
The following errata may apply to this: HSD62, HSD61, HSM63
offcore_requests_outstanding.cycles_with_data_rd Cycles when offcore outstanding cacheable Core Data Read
transactions are present in SuperQueue (SQ), queue to uncore.
The following errata may apply to this: HSD62, HSD61, HSM63
lock_cycles.split_lock_uc_lock_duration Cycles in which the L1D and L2 are locked, due to a UC lock or
split lock.
lock_cycles.cache_lock_duration Cycles in which the L1D is locked.
idq.empty Counts cycles the IDQ is empty.
The following errata may apply to this: HSD135
idq.mite_uops Increment each cycle # of uops delivered to IDQ from MITE path.
Set Cmask = 1 to count cycles.
idq.mite_cycles Cycles when uops are being delivered to Instruction Decode
Queue (IDQ) from MITE path.
idq.dsb_uops Increment each cycle # of uops delivered to IDQ from DSB path.
Set Cmask = 1 to count cycles.
idq.dsb_cycles Cycles when uops are being delivered to Instruction Decode
Queue (IDQ) from Decode Stream Buffer (DSB) path.
idq.ms_dsb_uops Increment each cycle # of uops delivered to IDQ when MS_busy by
DSB. Set Cmask = 1 to count cycles. Add Edge=1 to count # of
deliveries.
idq.ms_dsb_cycles Cycles when uops initiated by Decode Stream Buffer (DSB) are
being delivered to Instruction Decode Queue (IDQ) while the
Microcode Sequencer (MS) is busy.
idq.ms_dsb_occur Deliveries to Instruction Decode Queue (IDQ) initiated by
Decode Stream Buffer (DSB) while the Microcode Sequencer (MS)
is busy.
idq.all_dsb_cycles_4_uops Counts cycles in which the DSB delivers four uops. Set Cmask =
4.
idq.all_dsb_cycles_any_uops Counts cycles in which the DSB delivers at least one uop. Set
Cmask = 1.
idq.ms_mite_uops Increment each cycle # of uops delivered to IDQ when MS_busy by
MITE. Set Cmask = 1 to count cycles.
idq.all_mite_cycles_4_uops Counts cycles in which MITE delivers four uops. Set Cmask = 4.
idq.all_mite_cycles_any_uops Counts cycles in which MITE delivers at least one uop. Set
Cmask = 1.
idq.ms_uops This event counts uops delivered by the Front-end with the
assistance of the microcode sequencer. Microcode assists are
used for complex instructions or scenarios that can't be
handled by the standard decoder. Using other instructions, if
possible, will usually improve performance.
idq.ms_cycles This event counts cycles during which the microcode sequencer
assisted the Front-end in delivering uops. Microcode assists
are used for complex instructions or scenarios that can't be
handled by the standard decoder. Using other instructions, if
possible, will usually improve performance.
idq.ms_switches Number of switches from DSB (Decode Stream Buffer) or MITE
(legacy decode pipeline) to the Microcode Sequencer.
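The three delivery paths above (DSB, MITE, MS) can be combined into a simple decode-path mix metric; a low DSB share together with a high idq.ms_switches count usually indicates decode-bandwidth problems. A sketch using hypothetical raw counter reads:

```c
#include <stdint.h>

/*
 * Share of IDQ uops delivered by the Decode Stream Buffer, computed
 * from raw reads of idq.dsb_uops, idq.mite_uops and idq.ms_uops
 * (helper name and input values are hypothetical).
 */
double
dsb_coverage(uint64_t dsb_uops, uint64_t mite_uops, uint64_t ms_uops)
{
	uint64_t total = dsb_uops + mite_uops + ms_uops;

	if (total == 0)
		return (0.0);
	return ((double)dsb_uops / (double)total);
}
```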
idq.mite_all_uops Number of uops delivered to IDQ from any path.
icache.hit Number of Instruction Cache, Streaming Buffer and Victim Cache
reads, both cacheable and noncacheable, including UC fetches.
icache.misses This event counts Instruction Cache (ICACHE) misses.
icache.ifetch_stall Cycles where a code fetch is stalled due to L1 instruction-
cache miss.
icache.ifdata_stall Cycles where a code fetch is stalled due to L1 instruction-
cache miss.
itlb_misses.miss_causes_a_walk Misses in ITLB that cause a page walk of any page size.
itlb_misses.walk_completed_4k Completed page walks due to misses in ITLB 4K page entries.
itlb_misses.walk_completed_2m_4m Completed page walks due to misses in ITLB 2M/4M page entries.
itlb_misses.walk_completed_1g Instruction fetch misses in all TLB levels that cause a page
walk that completes. (1G)
itlb_misses.walk_completed Completed page walks in ITLB of any page size.
itlb_misses.walk_duration This event counts cycles when the page miss handler (PMH) is
servicing page walks caused by ITLB misses.
itlb_misses.stlb_hit_4k ITLB misses that hit STLB (4K).
itlb_misses.stlb_hit_2m ITLB misses that hit STLB (2M).
itlb_misses.stlb_hit ITLB misses that hit STLB. No page walk.
ild_stall.lcp This event counts cycles where the decoder is stalled on an
instruction with a length changing prefix (LCP).
ild_stall.iq_full Stall cycles due to the IQ being full.
br_inst_exec.nontaken_conditional Not taken macro-conditional branches.
br_inst_exec.taken_conditional Taken speculative and retired macro-conditional branches.
br_inst_exec.taken_direct_jump Taken speculative and retired macro-unconditional branch
instructions excluding calls and indirects.
br_inst_exec.taken_indirect_jump_non_call_ret Taken speculative and retired indirect branches excluding calls
and returns.
br_inst_exec.taken_indirect_near_return Taken speculative and retired indirect branches with return
mnemonic.
br_inst_exec.taken_direct_near_call Taken speculative and retired direct near calls.
br_inst_exec.taken_indirect_near_call Taken speculative and retired indirect calls.
br_inst_exec.all_conditional Speculative and retired macro-conditional branches.
br_inst_exec.all_direct_jmp Speculative and retired macro-unconditional branches excluding
calls and indirects.
br_inst_exec.all_indirect_jump_non_call_ret Speculative and retired indirect branches excluding calls and
returns.
br_inst_exec.all_indirect_near_return Speculative and retired indirect return branches.
br_inst_exec.all_direct_near_call Speculative and retired direct near calls.
br_inst_exec.all_branches Counts all near executed branches (not necessarily retired).
br_misp_exec.nontaken_conditional Not taken speculative and retired mispredicted macro
conditional branches.
br_misp_exec.taken_conditional Taken speculative and retired mispredicted macro conditional
branches.
br_misp_exec.taken_indirect_jump_non_call_ret Taken speculative and retired mispredicted indirect branches
excluding calls and returns.
br_misp_exec.taken_return_near Taken speculative and retired mispredicted indirect branches
with return mnemonic.
br_misp_exec.taken_indirect_near_call Taken speculative and retired mispredicted indirect calls.
br_misp_exec.all_conditional Speculative and retired mispredicted macro conditional
branches.
br_misp_exec.all_indirect_jump_non_call_ret Mispredicted indirect branches excluding calls and returns.
br_misp_exec.all_branches Counts all near executed branches (not necessarily retired).
idq_uops_not_delivered.core This event counts the number of undelivered (unallocated) uops
from the Front-end to the Resource Allocation Table (RAT) while
the Back-end of the processor is not stalled. The Front-end can
allocate up to 4 uops per cycle so this event can increment 0-4
times per cycle depending on the number of unallocated uops.
This event is counted on a per-core basis.
The following errata may apply to this: HSD135
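Since the Front-end can fill up to 4 allocation slots per cycle, this counter feeds the standard top-down "front-end bound" estimate. A minimal sketch (hypothetical helper; inputs are raw reads of idq_uops_not_delivered.core and cpu_clk_unhalted.thread_p):

```c
#include <stdint.h>

/*
 * Top-down "front-end bound" estimate: of the 4 allocation slots
 * available each cycle, the fraction the Front-end failed to fill
 * while the Back-end could have accepted uops.
 */
double
frontend_bound(uint64_t uops_not_delivered, uint64_t unhalted_cycles)
{
	if (unhalted_cycles == 0)
		return (0.0);
	return ((double)uops_not_delivered /
	    (4.0 * (double)unhalted_cycles));
}
```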
idq_uops_not_delivered.cycles_0_uops_deliv.core This event counts the number of cycles during which the Front-end
allocated exactly zero uops to the Resource Allocation Table
(RAT) while the Back-end of the processor is not stalled. This
event is counted on a per-core basis.
The following errata may apply to this: HSD135
idq_uops_not_delivered.cycles_le_1_uop_deliv.core Cycles per thread when 3 or more uops are not delivered to the
Resource Allocation Table (RAT) while the Back-end of the
machine is not stalled; that is, at most one uop was delivered.
The following errata may apply to this: HSD135
idq_uops_not_delivered.cycles_le_2_uop_deliv.core Cycles with less than 2 uops delivered by the front end.
The following errata may apply to this: HSD135
idq_uops_not_delivered.cycles_le_3_uop_deliv.core Cycles with less than 3 uops delivered by the front end.
The following errata may apply to this: HSD135
idq_uops_not_delivered.cycles_fe_was_ok Counts cycles in which the Front-end delivered 4 uops or the
Resource Allocation Table (RAT) was stalling the Front-end.
The following errata may apply to this: HSD135
uops_executed_port.port_0 Cycles in which a uop is dispatched on port 0 in this thread.
uops_executed_port.port_0_core Cycles per core when uops are executed on port 0.
uops_dispatched_port.port_0 Cycles per thread when uops are executed on port 0.
uops_executed_port.port_1 Cycles in which a uop is dispatched on port 1 in this thread.
uops_executed_port.port_1_core Cycles per core when uops are executed on port 1.
uops_dispatched_port.port_1 Cycles per thread when uops are executed on port 1.
uops_executed_port.port_2 Cycles in which a uop is dispatched on port 2 in this thread.
uops_executed_port.port_2_core Cycles per core when uops are dispatched to port 2.
uops_dispatched_port.port_2 Cycles per thread when uops are executed on port 2.
uops_executed_port.port_3 Cycles in which a uop is dispatched on port 3 in this thread.
uops_executed_port.port_3_core Cycles per core when uops are dispatched to port 3.
uops_dispatched_port.port_3 Cycles per thread when uops are executed on port 3.
uops_executed_port.port_4 Cycles in which a uop is dispatched on port 4 in this thread.
uops_executed_port.port_4_core Cycles per core when uops are executed on port 4.
uops_dispatched_port.port_4 Cycles per thread when uops are executed on port 4.
uops_executed_port.port_5 Cycles in which a uop is dispatched on port 5 in this thread.
uops_executed_port.port_5_core Cycles per core when uops are executed on port 5.
uops_dispatched_port.port_5 Cycles per thread when uops are executed on port 5.
uops_executed_port.port_6 Cycles in which a uop is dispatched on port 6 in this thread.
uops_executed_port.port_6_core Cycles per core when uops are executed on port 6.
uops_dispatched_port.port_6 Cycles per thread when uops are executed on port 6.
uops_executed_port.port_7 Cycles in which a uop is dispatched on port 7 in this thread.
uops_executed_port.port_7_core Cycles per core when uops are dispatched to port 7.
uops_dispatched_port.port_7 Cycles per thread when uops are executed on port 7.
resource_stalls.any Cycles allocation is stalled due to resource related reason.
The following errata may apply to this: HSD135
resource_stalls.rs Cycles stalled due to no eligible RS entry available.
resource_stalls.sb This event counts cycles during which no instructions were
allocated because no Store Buffers (SB) were available.
resource_stalls.rob Cycles stalled due to re-order buffer full.
cycle_activity.cycles_l2_pending Cycles with pending L2 miss loads. Set Cmask=2 to count cycles.
The following errata may apply to this: HSD78, HSM63, HSM80
cycle_activity.cycles_ldm_pending Cycles with pending memory loads. Set Cmask=2 to count cycles.
cycle_activity.cycles_no_execute This event counts cycles during which no instructions were
executed in the execution stage of the pipeline.
cycle_activity.stalls_l2_pending Number of loads missed L2.
The following errata may apply to this: HSM63, HSM80
cycle_activity.stalls_ldm_pending This event counts cycles during which no instructions were
executed in the execution stage of the pipeline and there were
memory instructions pending (waiting for data).
cycle_activity.cycles_l1d_pending Cycles with pending L1 data cache miss loads. Set Cmask=8 to
count cycles.
cycle_activity.stalls_l1d_pending Execution stalls due to L1 data cache miss loads. Set
Cmask=0CH.
lsd.uops Number of uops delivered by the LSD.
lsd.cycles_active Cycles in which uops are delivered by the LSD and do not come
from the decoder.
lsd.cycles_4_uops Cycles in which 4 uops are delivered by the LSD and do not come
from the decoder.
dsb2mite_switches.penalty_cycles Decode Stream Buffer (DSB)-to-MITE switch true penalty cycles.
itlb.itlb_flush Counts the number of ITLB flushes, includes 4k/2M/4M pages.
offcore_requests.demand_data_rd Demand data read requests sent to uncore.
The following errata may apply to this: HSD78, HSM80
offcore_requests.demand_code_rd Demand code read requests sent to uncore.
offcore_requests.demand_rfo Demand RFO read requests sent to uncore, including regular
RFOs, locks, ItoM.
offcore_requests.all_data_rd Data read requests sent to uncore (demand and prefetch).
uops_executed.stall_cycles Counts number of cycles no uops were dispatched to be executed
on this thread.
The following errata may apply to this: HSD144, HSD30, HSM31
uops_executed.cycles_ge_1_uop_exec This event counts the cycles where at least one uop was
executed. It is counted per thread.
The following errata may apply to this: HSD144, HSD30, HSM31
uops_executed.cycles_ge_2_uops_exec This event counts the cycles where at least two uops were
executed. It is counted per thread.
The following errata may apply to this: HSD144, HSD30, HSM31
uops_executed.cycles_ge_3_uops_exec This event counts the cycles where at least three uops were
executed. It is counted per thread.
The following errata may apply to this: HSD144, HSD30, HSM31
uops_executed.cycles_ge_4_uops_exec Cycles where at least 4 uops were executed per-thread.
The following errata may apply to this: HSD144, HSD30, HSM31
uops_executed.core Counts total number of uops to be executed per-core each cycle.
The following errata may apply to this: HSD30, HSM31
uops_executed.core_cycles_ge_1 Cycles at least 1 micro-op is executed from any thread on
physical core.
The following errata may apply to this: HSD30, HSM31
uops_executed.core_cycles_ge_2 Cycles at least 2 micro-ops are executed from any thread on
physical core.
The following errata may apply to this: HSD30, HSM31
uops_executed.core_cycles_ge_3 Cycles at least 3 micro-ops are executed from any thread on
physical core.
The following errata may apply to this: HSD30, HSM31
uops_executed.core_cycles_ge_4 Cycles at least 4 micro-ops are executed from any thread on
physical core.
The following errata may apply to this: HSD30, HSM31
The following errata may apply to this: HSD30, HSM31
uops_executed.core_cycles_none Cycles with no micro-ops executed from any thread on physical
core.
The following errata may apply to this: HSD30, HSM31
offcore_requests_buffer.sq_full Offcore requests buffer cannot take more entries for this
thread core.
page_walker_loads.dtlb_l1 Number of DTLB page walker loads that hit in the L1+FB.
page_walker_loads.dtlb_l2 Number of DTLB page walker loads that hit in the L2.
page_walker_loads.dtlb_l3 Number of DTLB page walker loads that hit in the L3.
The following errata may apply to this: HSD25
page_walker_loads.dtlb_memory Number of DTLB page walker loads from memory.
The following errata may apply to this: HSD25
page_walker_loads.itlb_l1 Number of ITLB page walker loads that hit in the L1+FB.
page_walker_loads.itlb_l2 Number of ITLB page walker loads that hit in the L2.
page_walker_loads.itlb_l3 Number of ITLB page walker loads that hit in the L3.
The following errata may apply to this: HSD25
page_walker_loads.itlb_memory Number of ITLB page walker loads from memory.
The following errata may apply to this: HSD25
page_walker_loads.ept_dtlb_l1 Counts the number of Extended Page Table walks from the DTLB
that hit in the L1 and FB.
page_walker_loads.ept_dtlb_l2 Counts the number of Extended Page Table walks from the DTLB
that hit in the L2.
page_walker_loads.ept_dtlb_l3 Counts the number of Extended Page Table walks from the DTLB
that hit in the L3.
page_walker_loads.ept_dtlb_memory Counts the number of Extended Page Table walks from the DTLB
that hit in memory.
page_walker_loads.ept_itlb_l1 Counts the number of Extended Page Table walks from the ITLB
that hit in the L1 and FB.
page_walker_loads.ept_itlb_l2 Counts the number of Extended Page Table walks from the ITLB
that hit in the L2.
page_walker_loads.ept_itlb_l3 Counts the number of Extended Page Table walks from the ITLB
that hit in the L3.
page_walker_loads.ept_itlb_memory Counts the number of Extended Page Table walks from the ITLB
that hit in memory.
tlb_flush.dtlb_thread DTLB flush attempts of the thread-specific entries.
tlb_flush.stlb_any Count number of STLB flush attempts.
inst_retired.any_p Number of instructions at retirement.
The following errata may apply to this: HSD11, HSD140
inst_retired.prec_dist Precise instruction retired event with HW to reduce effect of
PEBS shadow in IP distribution.
The following errata may apply to this: HSD140
inst_retired.x87 This is a non-precise version (that is, it does not use PEBS)
of the event that counts X87 FP operations retired. For X87 FP
operations that have no exceptions, counting also includes
flows that have several X87 uops, or flows that use X87 uops in
exception handling.
other_assists.avx_to_sse Number of transitions from AVX-256 to legacy SSE when penalty
applicable.
The following errata may apply to this: HSD56, HSM57
other_assists.sse_to_avx Number of transitions from SSE to AVX-256 when penalty
applicable.
The following errata may apply to this: HSD56, HSM57
other_assists.any_wb_assist Number of microcode assists invoked by HW upon uop writeback.
uops_retired.all Counts the number of micro-ops retired. Use Cmask=1 and invert
to count active cycles or stalled cycles.
uops_retired.stall_cycles Cycles without actually retired uops.
uops_retired.total_cycles Cycles with less than 10 actually retired uops.
uops_retired.core_stall_cycles Cycles without actually retired uops.
uops_retired.retire_slots This event counts the number of retirement slots used each
cycle. There are potentially 4 slots that can be used each
cycle - meaning, 4 uops or 4 instructions could retire each
cycle.
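With 4 retirement slots available per cycle, this counter yields the top-down "retiring" fraction, the counterpart of the front-end bound estimate. A minimal sketch (hypothetical helper; inputs are raw reads of uops_retired.retire_slots and cpu_clk_unhalted.thread_p):

```c
#include <stdint.h>

/*
 * Top-down "retiring" estimate: fraction of the 4 retirement slots
 * per cycle that were actually used.
 */
double
retiring_fraction(uint64_t retire_slots, uint64_t unhalted_cycles)
{
	if (unhalted_cycles == 0)
		return (0.0);
	return ((double)retire_slots / (4.0 * (double)unhalted_cycles));
}
```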
machine_clears.cycles Cycles during which there was a nuke. Accounts for both
thread-specific and all-thread nukes.
machine_clears.memory_ordering This event counts the number of memory ordering machine clears
detected. Memory ordering machine clears can result from memory
address aliasing or snoops from another hardware thread or core
to data inflight in the pipeline. Machine clears can have a
significant performance impact if they are happening
frequently.
machine_clears.smc This event is incremented when self-modifying code (SMC) is
detected, which causes a machine clear. Machine clears can
have a significant performance impact if they are happening
frequently.
machine_clears.maskmov This event counts the number of executed Intel AVX masked load
operations that refer to an illegal address range with the mask
bits set to 0.
br_inst_retired.all_branches Branch instructions at retirement.
br_inst_retired.conditional Counts the number of conditional branch instructions retired.
br_inst_retired.near_call Direct and indirect near call instructions retired.
br_inst_retired.near_call_r3 Direct and indirect macro near call instructions retired
(captured in ring 3).
br_inst_retired.all_branches_pebs All (macro) branch instructions retired.
br_inst_retired.near_return Counts the number of near return instructions retired.
br_inst_retired.not_taken Counts the number of not taken branch instructions retired.
br_inst_retired.near_taken Number of near taken branches retired.
br_inst_retired.far_branch Number of far branches retired.
br_misp_retired.all_branches Mispredicted branch instructions at retirement.
br_misp_retired.conditional Mispredicted conditional branch instructions retired.
br_misp_retired.all_branches_pebs This event counts all mispredicted branch instructions retired.
This is a precise event.
br_misp_retired.near_taken Number of near branch instructions retired that were taken but
mispredicted.
avx_insts.all Counts all AVX instructions. Note that a whole rep string
counts AVX_INST.ALL only once.
hle_retired.start Number of times an HLE execution started.
hle_retired.commit Number of times an HLE execution successfully committed.
hle_retired.aborted Number of times an HLE execution aborted due to any reason
(multiple categories may count as one).
hle_retired.aborted_misc1 Number of times an HLE execution aborted due to various memory
events (e.g., read/write capacity and conflicts).
hle_retired.aborted_misc2 Number of times an HLE execution aborted due to uncommon
conditions.
hle_retired.aborted_misc3 Number of times an HLE execution aborted due to HLE-unfriendly
instructions.
hle_retired.aborted_misc4 Number of times an HLE execution aborted due to incompatible
memory type.
The following errata may apply to this: HSD65
hle_retired.aborted_misc5 Number of times an HLE execution aborted due to none of the
previous 4 categories (e.g. interrupts).
rtm_retired.start Number of times an RTM execution started.
rtm_retired.commit Number of times an RTM execution successfully committed.
rtm_retired.aborted Number of times an RTM execution aborted due to any reason
(multiple categories may count as one).
rtm_retired.aborted_misc1 Number of times an RTM execution aborted due to various memory
events (e.g. read/write capacity and conflicts).
rtm_retired.aborted_misc2 Number of times an RTM execution aborted due to various memory
events (e.g., read/write capacity and conflicts).
rtm_retired.aborted_misc3 Number of times an RTM execution aborted due to HLE-unfriendly
instructions.
rtm_retired.aborted_misc4 Number of times an RTM execution aborted due to incompatible
memory type.
The following errata may apply to this: HSD65
rtm_retired.aborted_misc5 Number of times an RTM execution aborted due to none of the
previous 4 categories (e.g. interrupt).
fp_assist.x87_output Number of X87 FP assists due to output values.
fp_assist.x87_input Number of X87 FP assists due to input values.
fp_assist.simd_output Number of SIMD FP assists due to output values.
fp_assist.simd_input Number of SIMD FP assists due to input values.
fp_assist.any Cycles with any input/output SSE* or FP assists.
rob_misc_events.lbr_inserts Counts cases where hardware saves a new LBR record.
mem_uops_retired.stlb_miss_loads Retired load uops that miss the STLB.
The following errata may apply to this: HSD29, HSM30
mem_uops_retired.stlb_miss_stores Retired store uops that miss the STLB.
The following errata may apply to this: HSD29, HSM30
mem_uops_retired.lock_loads Retired load uops with locked access.
The following errata may apply to this: HSD76, HSD29, HSM30
mem_uops_retired.split_loads Retired load uops that split across a cacheline boundary.
The following errata may apply to this: HSD29, HSM30
mem_uops_retired.split_stores Retired store uops that split across a cacheline boundary.
The following errata may apply to this: HSD29, HSM30
mem_uops_retired.all_loads All retired load uops.
The following errata may apply to this: HSD29, HSM30
mem_uops_retired.all_stores All retired store uops.
The following errata may apply to this: HSD29, HSM30
mem_load_uops_retired.l1_hit Retired load uops with L1 cache hits as data sources.
The following errata may apply to this: HSD29, HSM30
mem_load_uops_retired.l2_hit Retired load uops with L2 cache hits as data sources.
The following errata may apply to this: HSD76, HSD29, HSM30
mem_load_uops_retired.l3_hit Retired load uops with L3 cache hits as data sources.
The following errata may apply to this: HSD74, HSD29, HSD25,
HSM26, HSM30
mem_load_uops_retired.l1_miss Retired load uops that missed the L1 cache.
The following errata may apply to this: HSM30
mem_load_uops_retired.l2_miss Retired load uops that missed L2. Unknown data
sources are excluded.
The following errata may apply to this: HSD29, HSM30
mem_load_uops_retired.l3_miss Retired load uops that missed L3. Unknown data
sources are excluded.
The following errata may apply to this: HSD74, HSD29, HSD25,
HSM26, HSM30
mem_load_uops_retired.hit_lfb Retired load uops whose data source was a fill
buffer (FB) hit: the load missed L1 but hit a fill buffer already allocated
by a preceding miss to the same cache line whose data was not yet ready.
The following errata may apply to this: HSM30
mem_load_uops_l3_hit_retired.xsnp_miss Retired load uops whose data sources
were L3 hits for which a cross-core snoop missed in an on-package core cache.
The following errata may apply to this: HSD29, HSD25, HSM26,
HSM30
mem_load_uops_l3_hit_retired.xsnp_hit Retired load uops whose data sources
were L3 hits for which a cross-core snoop hit in an on-package core cache.
The following errata may apply to this: HSD29, HSD25, HSM26,
HSM30
mem_load_uops_l3_hit_retired.xsnp_hitm Retired load uops whose data sources
were HitM responses from the shared L3.
The following errata may apply to this: HSD29, HSD25, HSM26,
HSM30
mem_load_uops_l3_hit_retired.xsnp_none Retired load uops whose data sources
were L3 hits that required no snoops.
The following errata may apply to this: HSD74, HSD29, HSD25,
HSM26, HSM30
mem_load_uops_l3_miss_retired.local_dram This event counts retired load uops where the data came from
local DRAM. This does not include hardware prefetches.
The following errata may apply to this: HSD74, HSD29, HSD25,
HSM30
baclears.any Number of front end re-steers due to BPU misprediction.
l2_trans.demand_data_rd Demand data read requests that access L2 cache.
l2_trans.rfo RFO requests that access L2 cache.
l2_trans.code_rd L2 cache accesses when fetching instructions.
l2_trans.all_pf Any MLC or L3 HW prefetch accessing L2, including rejects.
l2_trans.l1d_wb L1D writebacks that access L2 cache.
l2_trans.l2_fill L2 fill requests that access L2 cache.
l2_trans.l2_wb L2 writebacks that access L2 cache.
l2_trans.all_requests Transactions accessing L2 pipe.
l2_lines_in.i L2 cache lines in I state filling L2.
l2_lines_in.s L2 cache lines in S state filling L2.
l2_lines_in.e L2 cache lines in E state filling L2.
l2_lines_in.all This event counts the number of L2 cache lines brought into the
L2 cache. Lines are filled into the L2 cache when there was an
L2 miss.
l2_lines_out.demand_clean Clean L2 cache lines evicted by demand.
l2_lines_out.demand_dirty Dirty L2 cache lines evicted by demand.
sq_misc.split_lock Split locks in the SQ (super queue).
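EXAMPLE
The events above are programmed through the cpc(3CPC) interface. The
following sketch counts one of them, br_inst_retired.all_branches, for
the calling LWP using libcpc on illumos (compile with -lcpc). The
choice of event, the CPC_COUNT_USER flag, and the error handling are
illustrative assumptions; consult cpc(3CPC) and libcpc(3LIB) for the
authoritative interface.

```c
/*
 * Sketch: count br_inst_retired.all_branches for the calling LWP.
 * illumos-specific; requires libcpc and a Haswell-class CPU that
 * supports the event.
 */
#include <stdio.h>
#include <libcpc.h>

int
main(void)
{
	cpc_t *cpc;
	cpc_set_t *set;
	cpc_buf_t *buf;
	uint64_t val;
	int idx;

	if ((cpc = cpc_open(CPC_VER_CURRENT)) == NULL) {
		perror("cpc_open");
		return (1);
	}
	if ((set = cpc_set_create(cpc)) == NULL) {
		perror("cpc_set_create");
		return (1);
	}
	/* Request the retired-branches event, user mode only. */
	idx = cpc_set_add_request(cpc, set, "br_inst_retired.all_branches",
	    0, CPC_COUNT_USER, 0, NULL);
	if (idx == -1) {
		perror("cpc_set_add_request");
		return (1);
	}
	if ((buf = cpc_buf_create(cpc, set)) == NULL) {
		perror("cpc_buf_create");
		return (1);
	}
	/* Bind the set to this LWP and start counting. */
	if (cpc_bind_curlwp(cpc, set, 0) == -1) {
		perror("cpc_bind_curlwp");
		return (1);
	}

	/* ... workload under measurement ... */

	if (cpc_set_sample(cpc, set, buf) == -1) {
		perror("cpc_set_sample");
		return (1);
	}
	(void) cpc_buf_get(cpc, buf, idx, &val);
	(void) printf("retired branches: %llu\n",
	    (unsigned long long)val);

	(void) cpc_unbind(cpc, set);
	(void) cpc_close(cpc);
	return (0);
}
```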
SEE ALSO
cpc(3CPC), https://download.01.org/perfmon/index/

illumos                          June 18, 2018                          illumos